
Relationship between Unicode and UTF-8

2022-07-27 00:01:00 【51CTO】

ASCII code

Inside a computer, all information is ultimately stored as binary values. Each binary digit (bit) has two states, 0 and 1, so eight bits can be combined into 256 states. Such a group of eight bits is called a byte. In other words, one byte can represent 256 different states, and if each state corresponds to a symbol, that gives 256 symbols, from 00000000 to 11111111.

In the 1960s, the United States developed a character encoding that standardized the relationship between English characters and binary bits. This is known as ASCII, and it is still in use today.

ASCII specifies encodings for 128 characters. For example, the space character SPACE is 32 (binary 00100000), and the capital letter A is 65 (binary 01000001). These 128 symbols (including 33 non-printable control characters, codes 0 to 31 and 127) occupy only the lower 7 bits of a byte; the highest bit is always 0.
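The two values quoted above can be checked with a short Python sketch. It relies only on the built-ins `ord` and `format`; the variable names are illustrative.

```python
# Look up the ASCII code of a character and show it as an 8-bit binary string.
space_code = ord(" ")
a_code = ord("A")

space_bits = format(space_code, "08b")
a_bits = format(a_code, "08b")

# Every ASCII code is below 128, so the highest bit of its byte is 0.
all_high_bits_zero = all(format(code, "08b")[0] == "0" for code in range(128))

print(space_code, space_bits)
print(a_code, a_bits)
print(all_high_bits_zero)
```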

Non-ASCII encodings

For English, 128 symbol codes are enough, but for other languages they are not. In French, for example, some letters carry accent marks, which ASCII cannot represent. Some European countries therefore decided to use the spare highest bit of the byte to encode new symbols. For example, the code for é in the French encoding is 130 (binary 10000010). With this extension, the encodings used in these European countries can represent up to 256 symbols.

This created a new problem, however. Different countries have different letters, so even though they all use 256-symbol encodings, the same code stands for different letters. For example, 130 represents é in the French encoding, the letter Gimel (ג) in the Hebrew encoding, and yet another symbol in the Russian encoding. In all of these encodings, though, codes 0 to 127 represent the same symbols; only codes 128 to 255 differ.
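As a rough illustration, Python ships codecs for several legacy 8-bit code pages. The text does not name specific code pages, so the sketch below assumes the DOS code pages `cp437` (Western European), `cp862` (Hebrew), and `cp866` (Russian) as stand-ins.

```python
# Decode the same byte value (130) under three legacy 8-bit code pages.
# Assumption: cp437, cp862 and cp866 stand in for the "French", "Hebrew"
# and "Russian" encodings mentioned in the text.
raw = bytes([130])

decoded = {name: raw.decode(name) for name in ("cp437", "cp862", "cp866")}
for name, char in decoded.items():
    print(name, f"U+{ord(char):04X}", char)

# Bytes 0 to 127 decode identically under all three code pages.
lower_half = bytes(range(128))
same_lower_half = (
    lower_half.decode("cp437")
    == lower_half.decode("cp862")
    == lower_half.decode("cp866")
)
print(same_lower_half)
```

Only the upper half of the byte range depends on the code page, which is why plain English text survives a wrong guess while accented or non-Latin text does not.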

The scripts of Asian countries use far more symbols; there are about 100,000 Chinese characters alone. One byte can represent only 256 symbols, which is certainly not enough, so more than one byte must be used per symbol. For example, the common encoding for Simplified Chinese is GB2312, which uses two bytes per Chinese character, so in theory it can represent at most 256 x 256 = 65536 symbols.
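A minimal sketch of the two-byte scheme, using Python's built-in `gb2312` codec. The sample character 严 (U+4E25) is chosen here only for illustration.

```python
# Encode one Chinese character with GB2312 and count the bytes.
char = "\u4e25"  # the character 严, picked as an example
gb_bytes = char.encode("gb2312")
print(len(gb_bytes), gb_bytes.hex())

# Theoretical upper bound for any two-byte code.
max_two_byte_symbols = 256 * 256
print(max_two_byte_symbols)
```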

Chinese encoding deserves a separate discussion and is not covered in this note. The only point to make here is that, although they also use multiple bytes per symbol, the GB family of Chinese encodings is unrelated to the Unicode and UTF-8 discussed below.

Unicode

As the previous section showed, there are many encodings in the world, and the same binary number can be interpreted as different symbols. To open a text file, you therefore have to know how it was encoded; read it with the wrong encoding and you get garbage. Why does e-mail so often appear garbled? Because the sender and the receiver use different encodings.
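The garbling described above is easy to reproduce. The sketch below assumes a sender who writes UTF-8 and a receiver who wrongly assumes Latin-1; both encodings are picked only as an example pair.

```python
# Write text with one encoding and read it back with another.
original = "\u00e9"                   # the letter é
stored = original.encode("utf-8")     # sender writes UTF-8 bytes
misread = stored.decode("latin-1")    # receiver wrongly assumes Latin-1
correct = stored.decode("utf-8")      # receiver uses the right encoding

print(repr(misread), repr(correct))
```

The bytes on disk are identical in both cases; only the interpretation changes.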

Now imagine an encoding that includes every symbol in the world, with each symbol assigned a unique code. The confusion would disappear. This is Unicode: as its name suggests, it is a single code for all symbols.

Unicode is, of course, a very large set; at its current size it can hold more than one million symbols. Each symbol has a different code: for example, U+0639 is the Arabic letter Ain, U+0041 is the English capital letter A, and U+4E25 is the Chinese character 严 (yan). For the full symbol tables, consult unicode.org or a dedicated Chinese character table.
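The three code points mentioned above can be looked up with Python's standard `unicodedata` module. This is a small sketch, not a substitute for the official tables.

```python
import sys
import unicodedata

# Map code points to characters and to their official Unicode names.
names = {}
for code_point in (0x0639, 0x0041, 0x4E25):
    char = chr(code_point)
    names[code_point] = unicodedata.name(char)
    print(f"U+{code_point:04X}", char, names[code_point])

# The code space runs from U+0000 to U+10FFFF, a little over one million slots.
code_space_size = sys.maxunicode + 1
print(code_space_size)
```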

The problem with Unicode

Note that Unicode is only a character set: it specifies the binary code of each symbol, but it does not specify how that code should be stored.

For example, the Unicode code of the Chinese character 严 (yan) is hexadecimal 4E25, which is 15 bits long in binary (100111000100101). In other words, representing this symbol takes at least 2 bytes. Symbols with larger codes may need 3 bytes, 4 bytes, or even more.
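The bit count can be verified directly. The sketch below uses only Python built-ins and `math.ceil`; the variable names are illustrative.

```python
import math

code_point = 0x4E25

bits = format(code_point, "b")           # binary digits without leading zeros
bit_count = code_point.bit_length()      # number of significant bits
min_bytes = math.ceil(bit_count / 8)     # whole bytes needed to hold them

print(bits, bit_count, min_bytes)
```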

This raises two serious problems. First, how can Unicode be told apart from ASCII? How does a computer know that three bytes represent one symbol rather than three separate symbols? Second, we already know that one byte is enough for an English letter. If Unicode required every symbol to be stored in three or four bytes, every English letter would have to be padded with two or three leading zero bytes. That is a huge waste of storage: text files would become two to three times larger, which is unacceptable.
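The storage cost of a fixed-width scheme can be seen in a small sketch. It assumes the four-byte case and uses Python's `utf-32-be` codec as a convenient example of such a scheme; the sample text is arbitrary.

```python
text = "hello"

one_byte_each = text.encode("ascii")        # one byte per letter
four_bytes_each = text.encode("utf-32-be")  # four bytes per letter, no byte-order mark

print(len(one_byte_each), len(four_bytes_each))
# The first letter is stored as three zero bytes followed by the ASCII value of 'h'.
print(four_bytes_each[:4].hex())
```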

The results were: 1) several storage formats for Unicode appeared, that is, many different binary formats that can all be used to represent Unicode; and 2) Unicode could not gain wide adoption for a long time, until the Internet arrived.

UTF-8

The spread of the Internet created strong demand for a unified encoding. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 (two or four bytes per character) and UTF-32 (four bytes per character), but these are rarely used on the Internet. To repeat, the relationship is this: UTF-8 is one of the ways of implementing Unicode.
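To make the relationship concrete, the sketch below stores one code point, U+4E25, in the three formats named above. The big-endian codec variants are used only so that no byte-order mark is added to the output.

```python
char = "\u4e25"  # the character 严

# One code point, three different byte sequences.
encodings = {
    name: char.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")
}
for name, data in encodings.items():
    print(name, len(data), data.hex(" "))
```

All three byte sequences decode back to the same character; only the storage format differs.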

The most notable feature of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes to represent a symbol, and the number of bytes depends on the symbol.
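A quick way to see the variable length is to encode a few characters and count the bytes. The four sample characters below are chosen for illustration, one for each possible length.

```python
# One sample per length: A, é, 严, and an emoji (U+1F600).
samples = ["A", "\u00e9", "\u4e25", "\U0001F600"]

lengths = {f"U+{ord(ch):04X}": len(ch.encode("utf-8")) for ch in samples}
for label, size in lengths.items():
    print(label, size)
```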

The encoding rules of UTF-8 are very simple; there are only two:

1) For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code. For English letters, therefore, the UTF-8 encoding is identical to the ASCII code.

2) For a symbol of n bytes (n > 1), the first n bits of the first byte are all set to 1, bit n + 1 is set to 0, and the first two bits of each following byte are set to 10. All remaining bits, those not mentioned above, are filled with the symbol's Unicode code.
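The two rules can be applied by hand, as in the following sketch. The function name `utf8_encode` is hypothetical, and the range boundaries (0x80, 0x800, 0x10000) are not stated in the text above; they follow from how many payload bits each length leaves free (7, 11, 16, and 21 bits).

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by applying the two rules by hand."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")

    # Rule 1: a single-byte symbol is a 0 bit followed by 7 payload bits.
    if code_point < 0x80:
        return bytes([code_point])

    # Rule 2: pick the byte count n from the number of payload bits needed.
    if code_point < 0x800:
        n = 2
    elif code_point < 0x10000:
        n = 3
    else:
        n = 4

    # Following bytes: 10xxxxxx, filled six bits at a time from the low end.
    tail = []
    for _ in range(n - 1):
        tail.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6

    # First byte: n one-bits, then a zero bit, then the remaining high bits.
    prefix = (0xFF << (8 - n)) & 0xFF
    first = prefix | code_point
    return bytes([first] + tail[::-1])


encoded = utf8_encode(0x4E25)
print(encoded.hex(" "))
```

Comparing the result with Python's own `str.encode("utf-8")` for a handful of code points is a reasonable sanity check on a hand-written encoder like this one.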

Copyright notice
This article was created by [51CTO]. When reposting, please include a link to the original. Thank you.
https://yzsam.com/2022/200/202207180830294586.html