
Relationship between ASCII, Unicode, GBK, UTF-8

2022-07-01 01:29:00 【z_ kakaya】


1. ASCII code

ASCII (American Standard Code for Information Interchange) is a character encoding based on the Latin alphabet, used mainly to represent modern English and some other Western European languages. It is the most widely used single-byte encoding today and corresponds to the international standard ISO/IEC 646.

ASCII uses 7-bit or 8-bit binary combinations to represent 128 or 256 possible characters. Standard ASCII, also called basic ASCII, uses 7 bits (the remaining high bit is 0) to represent all uppercase and lowercase letters, the digits 0 to 9, punctuation marks, and the special control characters used in American English.

Specifically:

0 to 31 and 127 (33 in total) are control characters or communication-specific characters (the rest are printable characters). Control characters include LF (line feed), CR (carriage return), FF (form feed), DEL (delete), BS (backspace), and BEL (bell). Communication-specific characters include SOH (start of heading), EOT (end of transmission), and ACK (acknowledge). The ASCII values 8, 9, 10, and 13 correspond to backspace, tab, line feed, and carriage return respectively. These characters have no graphic representation of their own, but they affect how text is displayed, depending on the application.

32 to 126 (95 in total) are printable characters (32 is the space), of which 48 to 57 are the ten Arabic digits 0 to 9.

65 to 90 are the 26 uppercase letters and 97 to 122 are the 26 lowercase letters. The rest are punctuation marks, operator symbols, and so on.

The upper 128 codes (128 to 255) are called extended ASCII. Many x86-based systems support extended (or "high") ASCII. Extended ASCII uses the 8th bit of each character to define an additional 128 special symbols, letters from other languages, and graphic symbols.
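
The ranges above can be checked with a short Python sketch. The function name `classify_ascii` and its category labels are made up for this illustration; only `ord` and the numeric ranges come from the text.

```python
def classify_ascii(code: int) -> str:
    """Classify a value in 0..255 using the ranges described above.

    The category names are illustrative labels, not part of any standard.
    """
    if not 0 <= code <= 255:
        raise ValueError("expected a value in 0..255")
    if code <= 31 or code == 127:
        return "control"
    if code == 32:
        return "space"
    if 48 <= code <= 57:
        return "digit"
    if 65 <= code <= 90:
        return "uppercase"
    if 97 <= code <= 122:
        return "lowercase"
    if code <= 126:
        return "punctuation/symbol"
    return "extended"  # 128..255, outside standard 7-bit ASCII


# Count the control characters and the printable characters in 0..127.
control_count = sum(classify_ascii(c) == "control" for c in range(128))
printable_count = 128 - control_count

for ch in ["\n", " ", "7", "A", "z", "+"]:
    print(repr(ch), ord(ch), classify_ascii(ord(ch)))
print("control:", control_count, "printable:", printable_count)
```

The two counts match the 33 and 95 quoted in the text.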

The following is the ASCII table:

[Image: ASCII.png]

[Image: ascii2.png]

2. GBK encoding

ASCII does not support Chinese, so Chinese computer users needed an encoding that does.
A set of rules was therefore defined: a byte with a value of 127 or less means the same as in ASCII, but two consecutive bytes greater than 127 together represent one Chinese character. The first byte is called the high byte (0xA1 to 0xF7) and the second the low byte (0xA1 to 0xFE). This yields roughly 7,000 characters, most of them simplified Chinese characters. This rule is called GB2312.
Because there are so many Chinese characters, some still could not be represented, so the rule was relaxed: the low byte no longer has to be greater than 127. As long as the first byte is greater than 127, it marks the start of a Chinese character, whatever the value of the byte that follows. This extended scheme is called the GBK standard. It includes everything in GB2312 and brings the total to more than 20,000 Chinese characters (including traditional characters) and symbols.
China also has 56 ethnic groups, so the rules were extended once more to add several thousand characters from minority scripts. This further extension is called GB18030. Chinese programmers thought well of this series of standards and refer to them collectively as "DBCS" (Double Byte Character Set).
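
These rules can be observed with Python's built-in `gb2312` and `gbk` codecs. This is a minimal sketch. The helper `is_gb2312_pair` and the sample characters 中 and 國 are chosen for this illustration and do not come from the article.

```python
def is_gb2312_pair(data: bytes) -> bool:
    """True if data is two bytes inside the GB2312 ranges quoted above."""
    return (
        len(data) == 2
        and 0xA1 <= data[0] <= 0xF7   # high byte
        and 0xA1 <= data[1] <= 0xFE   # low byte
    )


# ASCII characters keep their single-byte value.
ascii_bytes = "A".encode("gbk")

# A common simplified character: two bytes, both greater than 127.
hanzi_gb2312 = "中".encode("gb2312")
hanzi_gbk = "中".encode("gbk")        # GBK keeps the GB2312 assignment

# A traditional character: not encodable in GB2312, but present in GBK.
try:
    "國".encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False
traditional_gbk = "國".encode("gbk")

print(ascii_bytes, hanzi_gb2312.hex(" "), is_gb2312_pair(hanzi_gb2312))
print("國 in GB2312:", in_gb2312, "| GBK bytes:", traditional_gbk.hex(" "))
```

Mixed text such as "A中" therefore encodes to three bytes: one for the ASCII letter and two for the Chinese character.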

3. Unicode character set

Because each country defined its own encoding standard, no system could read another's encoding and text could not be exchanged reliably. Standards bodies, the International Organization for Standardization (ISO) together with the Unicode Consortium, therefore set out to define a single scheme covering the characters of all countries. This scheme is called Unicode. Note that Unicode is not a new encoding rule but a character set: it assigns every character a unique ID, formally called a code point. Unicode can be thought of as a worldwide character dictionary.
The early specification required every character to use two bytes, that is, 16 bits. Characters in the ASCII table keep their values and are simply widened to 16 bits, while the characters of all other countries are uniformly re-encoded. Since an ASCII character can in fact be represented in a single byte, this scheme wastes bandwidth in transmission and disk space in storage.
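
A minimal Python sketch of the two ideas above: a code point is just a number, and a fixed two-byte representation pads ASCII characters with a zero byte. Python's `utf-16-be` codec is used here as a stand-in for the two-byte scheme, which is an assumption of this example. The two are the same only for characters whose code point fits in 16 bits.

```python
samples = ["A", "中"]

# ord() returns the Unicode code point; chr() goes the other way.
code_points = {ch: ord(ch) for ch in samples}

# Two bytes per character, big-endian (stand-in for the fixed-width scheme).
fixed_width = {ch: ch.encode("utf-16-be") for ch in samples}

for ch in samples:
    print(f"{ch!r}: U+{code_points[ch]:04X} -> {fixed_width[ch].hex(' ')}")
```

For "A" the first of the two bytes is 00, which is the wasted byte the paragraph describes.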

4. UTF-8 encoding

Because the two-byte representation of Unicode wastes network bandwidth and disk space, a set of encoding rules was defined on top of Unicode to solve the problem. These rules convert a code point into a byte sequence (encoding goes from code point to bytes, decoding goes from bytes back to code point). This encoding is UTF-8, which uses 1 to 4 bytes per character for transmission and storage.

Encoding rules: convert using the following templates.

Unicode range (hex)            |     UTF-8 encoding (binary)
------------------------------------------------------------------------
0000 0000-0000 007F            |     0xxxxxxx
0000 0080-0000 07FF            |     110xxxxx 10xxxxxx
0000 0800-0000 FFFF            |     1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF            |     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
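
The table can be turned into code almost line by line. The function `utf8_encode` below is a hand-written sketch for illustration, checked against Python's built-in `str.encode("utf-8")`. Unlike the built-in codec it does not reject surrogate code points (U+D800 to U+DFFF).

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by filling the x bits of the matching template."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range")
    if code_point <= 0x7F:                       # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:                      # 110xxxxx 10xxxxxx
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point <= 0xFFFF:                     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b11100000 | (code_point >> 12),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    return bytes([                               # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        0b11110000 | (code_point >> 18),
        0b10000000 | ((code_point >> 12) & 0b111111),
        0b10000000 | ((code_point >> 6) & 0b111111),
        0b10000000 | (code_point & 0b111111),
    ])


# One sample character for each row of the table.
for ch in ["A", "é", "中", "😀"]:
    encoded = utf8_encode(ord(ch))
    print(f"U+{ord(ch):04X} -> {encoded.hex(' ')}",
          "(matches built-in)" if encoded == ch.encode("utf-8") else "(MISMATCH)")
```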

5. Converting between UTF-8 and Unicode

UTF-8 marks the start of each character with the high-order bits of its first byte. A one-byte character starts with "0". In a two-byte character, the first byte starts with "110" and the second with "10". In a three-byte character, the first byte starts with "1110" and the other two start with "10". In a four-byte character, the first byte starts with "11110" and the other three start with "10".
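
As a sketch of how a decoder can use these prefixes, the hypothetical helpers below read the lead byte to find the length of each character and split a byte string accordingly. They do not validate the continuation bytes, so this is an illustration rather than a full decoder.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes the character starting with first_byte occupies."""
    if first_byte >> 7 == 0b0:
        return 1
    if first_byte >> 5 == 0b110:
        return 2
    if first_byte >> 4 == 0b1110:
        return 3
    if first_byte >> 3 == 0b11110:
        return 4
    # A byte starting with "10" is a continuation byte, not a lead byte.
    raise ValueError(f"not a valid lead byte: {first_byte:#04x}")


def split_utf8(data: bytes) -> list:
    """Split UTF-8 data into one bytes object per character."""
    chars = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chars.append(data[i:i + n])
        i += n
    return chars


pieces = split_utf8("a中😀".encode("utf-8"))
print([p.hex(" ") for p in pieces])
```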

Take the Chinese character "智" (wisdom). Its UTF-8 encoding is "\xe6\x99\xba", which in binary is "11100110 10011001 10111010". This character takes 3 bytes in UTF-8, so the matching template is "0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx".

11100110   10011001     10111010
1110xxxx   10xxxxxx     10xxxxxx
    0110     011001       111010

Joining the extracted bits gives 0110011001111010, which is 667A in hexadecimal. Following the rule, the Unicode code point of "智" is therefore U+667A.

In the same way, given the Unicode code point of a Chinese character, you can work out its UTF-8 encoding.
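
The bit extraction above can be reproduced in Python. The function `utf8_decode_3byte` is a sketch written for this example that handles only the three-byte template. The built-in `bytes.decode("utf-8")` is used to confirm the result.

```python
def utf8_decode_3byte(data: bytes) -> int:
    """Recover the code point from a 1110xxxx 10xxxxxx 10xxxxxx sequence."""
    if len(data) != 3:
        raise ValueError("expected exactly three bytes")
    b0, b1, b2 = data
    if b0 >> 4 != 0b1110 or b1 >> 6 != 0b10 or b2 >> 6 != 0b10:
        raise ValueError("bytes do not match the three-byte template")
    # Keep only the x bits: 4 from the first byte, 6 from each of the others.
    return ((b0 & 0b1111) << 12) | ((b1 & 0b111111) << 6) | (b2 & 0b111111)


code_point = utf8_decode_3byte(b"\xe6\x99\xba")
print(f"{code_point:016b} -> U+{code_point:04X} -> {chr(code_point)}")
```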

6. Converting between Unicode and GBK

Take the Chinese character "路" (road). Its GBK encoding is "\xc2\xb7", which in binary is "1100 0010 1011 0111". Its position in the Unicode character set is "\u8def" (written as a Python Unicode string). In GBK, the high bit of each of the two bytes serves to distinguish Chinese characters from ASCII. Removing that high "1" from each byte of "1100 0010 1011 0111" gives "0100 0010 0011 0111", that is, 4237 in hexadecimal. This value is the character's code in the GB2312 national standard, not its Unicode code point. Unlike UTF-8, GBK bytes have no bit-level relationship to the Unicode code point (U+8DEF here), so converting between GBK and Unicode requires a lookup table.
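
In Python the table lookup is done by the built-in `gbk` codec. The sketch below uses the character 路 (U+8DEF, GBK bytes C2 B7) and also strips the high bit of each GBK byte to show where the value 4237 comes from. The variable names are chosen for this example.

```python
ch = "\u8def"                                    # the character 路

# Unicode -> GBK: the codec looks the character up in its mapping table.
gbk_bytes = ch.encode("gbk")

# Clearing the high bit of each byte yields 0x4237, which is not U+8DEF.
stripped = bytes(b & 0x7F for b in gbk_bytes)

# GBK -> Unicode, then on to UTF-8: decode first, then re-encode.
decoded = gbk_bytes.decode("gbk")
utf8_bytes = decoded.encode("utf-8")

print("GBK:     ", gbk_bytes.hex(" "))
print("stripped:", stripped.hex(" "))
print(f"Unicode:  U+{ord(decoded):04X}")
print("UTF-8:   ", utf8_bytes.hex(" "))
```

To convert text from GBK to UTF-8, decode the bytes to a Unicode string first and then encode that string, as the last two steps show.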