Relationship between Unicode and UTF-8
2022-07-27 00:01:00 【51CTO】
ASCII code
Inside a computer, all information is ultimately stored as binary values. Each binary digit (bit) has two states, 0 and 1, so eight bits can be combined into 256 states. This group of eight bits is called a byte. In other words, one byte can represent 256 different states, each corresponding to one symbol, for 256 symbols in total, from 00000000 to 11111111.
In the 1960s, the United States developed a character encoding that standardized the relationship between English characters and binary bits. This is known as ASCII, and it is still in use today.
ASCII defines codes for 128 characters. For example, the space character SPACE is 32 (binary 00100000) and the capital letter A is 65 (binary 01000001). These 128 symbols (including 32 non-printable control symbols) occupy only the last 7 bits of a byte; the first bit is always set to 0.
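The two examples above can be checked with a short Python sketch. The helper name ascii_bits is hypothetical and only serves the illustration; ord and format are Python built-ins.

```python
# Show the ASCII code of a character as an 8-bit binary string.
# ascii_bits is a hypothetical helper written for this illustration.
def ascii_bits(ch: str) -> str:
    code = ord(ch)
    if code > 127:
        raise ValueError("not an ASCII character")
    # "08b" pads to 8 binary digits, so the leading bit is visibly 0.
    return format(code, "08b")


print(ord(" "), ascii_bits(" "))  # 32 00100000
print(ord("A"), ascii_bits("A"))  # 65 01000001
print(2 ** 8)                     # 256 states fit in one byte
```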
Non-ASCII encodings
128 symbols are enough to encode English, but not other languages. In French, for example, letters can carry accent marks, which ASCII cannot represent. So some European countries decided to use the highest bit of the byte, which ASCII leaves idle, to encode new symbols. For example, é in the French encoding is 130 (binary 10000010). In this way, the encodings used in these European countries can represent at most 256 symbols.
However, this created a new problem. Different countries have different letters, so even though they all use 256-symbol encodings, the same code stands for different letters. For example, 130 represents é in the French encoding, the letter Gimel (ג) in the Hebrew encoding, and yet another symbol in the Russian encoding. In all of these encodings, though, the symbols from 0 to 127 are the same; only the range from 128 to 255 differs.
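A minimal sketch of this ambiguity: the same byte value 130 is decoded with three legacy DOS code pages that ship with Python. Treating cp437, cp862 and cp866 as stand-ins for the "French", "Hebrew" and "Russian" encodings is an assumption; the text does not name specific code pages.

```python
# Decode the single byte 130 under three legacy 8-bit code pages.
# Assumption: cp437, cp862 and cp866 stand in for the French, Hebrew
# and Russian encodings, which the article does not name.
CODE_PAGES = ("cp437", "cp862", "cp866")

raw = bytes([130])
decoded = {name: raw.decode(name) for name in CODE_PAGES}
for name, symbol in decoded.items():
    print(name, symbol)

# Bytes in the range 0 to 127 decode identically everywhere.
shared = {name: bytes([65]).decode(name) for name in CODE_PAGES}
print(shared)
```

The byte never changes; only the table used to interpret it does, which is why a file must be opened with the encoding it was written in.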
The scripts of Asian countries use even more symbols; there are as many as 100,000 Chinese characters. One byte can represent only 256 symbols, which is certainly not enough, so more than one byte must be used for each symbol. For example, the common encoding for simplified Chinese is GB2312, which uses two bytes per Chinese character, so in theory it can represent at most 256 x 256 = 65536 symbols.
Chinese encodings deserve a separate discussion, which this note does not cover. The only point to make here is that, although they all use multiple bytes to represent one symbol, the GB family of Chinese character encodings is unrelated to the Unicode and UTF-8 discussed below.
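As a small illustration of the two-byte scheme, the sketch below encodes one Chinese character with Python's built-in gb2312 codec. The sample character 严 is chosen only for this example.

```python
# Encode one Chinese character with the gb2312 codec that ships with Python.
char = "\u4e25"  # the character 严, used here only as a sample
gb = char.encode("gb2312")

print(len(gb))    # number of bytes used for one character
print(gb.hex())   # the two bytes in hexadecimal
print(256 * 256)  # theoretical upper bound for a two-byte code
```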
Unicode
As the previous section explained, there are many encodings in the world, and the same binary number can be interpreted as different symbols. So to open a text file you have to know its encoding; reading it with the wrong one produces garbage. Why do e-mails often arrive garbled? Because the sender and the receiver use different encodings.
Imagine a single encoding that includes every symbol in the world, with each symbol assigned a unique code. The garbling would then disappear. This is Unicode; as its name suggests, it is an encoding for all symbols.
Unicode is, of course, a very large set, currently with room for more than one million symbols. Each symbol has its own code: for example, U+0639 is the Arabic letter Ain, U+0041 is the English capital letter A, and U+4E25 is the Chinese character 严. For the full symbol tables, consult unicode.org or a dedicated table of Chinese characters.
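The three codes mentioned above can be looked up with Python's standard unicodedata module. This is only a sketch; describe is a hypothetical helper name.

```python
import unicodedata


# describe is a hypothetical helper: it formats a character's code point
# in U+XXXX notation and appends its official Unicode name.
def describe(ch: str) -> str:
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for ch in ("\u0639", "A", "\u4e25"):
    print(describe(ch))
```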
The problem with Unicode
Note that Unicode is only a symbol set. It specifies the binary code of each symbol, but not how that code should be stored.
For example, the Unicode code of the Chinese character 严 is the hexadecimal number 4E25, which is 15 bits long in binary (100111000100101). In other words, representing this symbol requires at least 2 bytes. Symbols with larger code values may need 3 or 4 bytes, or even more.
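The arithmetic for the code 4E25 can be checked in a few lines of Python. The variable names are illustrative only.

```python
# Bit length of the code point U+4E25 and the minimum whole bytes it needs.
code_point = 0x4E25

bits = format(code_point, "b")                      # binary digits, no padding
min_bytes = (code_point.bit_length() + 7) // 8      # round up to whole bytes

print(bits)       # 100111000100101
print(len(bits))  # 15
print(min_bytes)  # 2
```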
This raises two serious problems. The first: how can Unicode be distinguished from ASCII? How does a computer know that three bytes represent one symbol rather than three separate symbols? The second: we already know that one byte is enough for an English letter. If Unicode uniformly required three or four bytes per symbol, every English letter would have to be preceded by two or three zero bytes. That is a huge waste of storage, and text files would become two or three times larger, which is unacceptable.
The result was: 1) several storage formats for Unicode appeared, that is, there are many different binary formats that can represent Unicode; 2) Unicode could not be widely adopted for a long time, until the rise of the Internet.
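To make the storage cost concrete, here is a sketch of a naive fixed-width scheme that stores every code as four bytes. The function fixed_width_encode is hypothetical and is not an actual Unicode storage format; it only shows where the zero bytes come from.

```python
# A hypothetical fixed-width scheme: every code is stored in `width` bytes,
# most significant byte first. Not a real Unicode format, just an illustration.
def fixed_width_encode(text: str, width: int = 4) -> bytes:
    return b"".join(ord(ch).to_bytes(width, "big") for ch in text)


text = "Hello"
fixed = fixed_width_encode(text)

print(len(text.encode("ascii")))  # 5 bytes as ASCII
print(len(fixed))                 # 20 bytes at four bytes per symbol
print(fixed[:4])                  # b'\x00\x00\x00H': three zero bytes of padding
```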
UTF-8
The spread of the Internet created a strong demand for a unified encoding. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 (characters are represented by two or four bytes) and UTF-32 (characters are represented by four bytes), but they are rarely used on the Internet. To repeat, the relationship is that UTF-8 is one of the ways to implement Unicode.
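The byte counts of the three implementations can be compared with Python's built-in codecs. This is a sketch: encoded_sizes is a hypothetical helper, and the little-endian codec variants are used only so that no byte order mark is added to the count.

```python
# Count the bytes one character occupies in UTF-8, UTF-16 and UTF-32.
# The "-le" codec variants are chosen so that no byte order mark is prepended.
def encoded_sizes(ch: str) -> dict:
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


# U+1F600 is an extra sample from outside the 16-bit range, not from the article.
for ch in ("A", "\u4e25", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```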
The most notable feature of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes to represent a symbol, and the number of bytes varies with the symbol.
The encoding rules of UTF-8 are very simple. There are only two:
1) For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code. So for English letters, the UTF-8 encoding is identical to the ASCII code.
2) For an n-byte symbol (n > 1), the first n bits of the first byte are all set to 1 and bit n + 1 is set to 0; the first two bits of every following byte are set to 10. All the remaining bits hold the symbol's Unicode code.
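The two rules can be turned into a small hand-written encoder and compared with Python's built-in codec. This is a sketch under stated assumptions: utf8_encode is a hypothetical name, the range boundaries 0x80, 0x800 and 0x10000 are the standard UTF-8 limits rather than something stated above, and surrogate code points are not rejected as a strict encoder would.

```python
# A hand-written UTF-8 encoder following the two rules above.
# Assumptions: the boundaries 0x80 / 0x800 / 0x10000 are the standard
# UTF-8 ranges; surrogates (U+D800 to U+DFFF) are not rejected here.
def utf8_encode(code_point: int) -> bytes:
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point < 0x80:
        # Rule 1: a single byte whose first bit is 0.
        return bytes([code_point])
    if code_point < 0x800:
        n = 2
    elif code_point < 0x10000:
        n = 3
    elif code_point <= 0x10FFFF:
        n = 4
    else:
        raise ValueError("code point out of Unicode range")

    # Rule 2, following bytes: "10" plus six payload bits each,
    # filled from the least significant bits upward.
    tail = []
    for _ in range(n - 1):
        tail.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6

    # Rule 2, first byte: n one-bits, then a zero, then the leftover bits.
    prefix = (0xFF << (8 - n)) & 0xFF
    return bytes([prefix | code_point] + tail[::-1])


print(utf8_encode(0x41).hex())    # 41
print(utf8_encode(0x4E25).hex())  # e4b8a5
```

For 严 (U+4E25) the 15 payload bits are spread over the pattern 1110xxxx 10xxxxxx 10xxxxxx, which gives the three bytes E4 B8 A5.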