当前位置:网站首页>A detailed explanation of ASCII code, Unicode and UTF-8
A detailed explanation of ASCII code, Unicode and UTF-8
2022-07-05 13:12:00 【UPythonFish】
List of articles
One ASCII code
Inside the computer , All information is ultimately a binary value . Every binary bit (bit) Yes 0 and 1 Two kinds of state , So eight binary bits can be combined 256 States , This is called a byte (byte). in other words , A byte can be used to represent 256 Different states , Each state corresponds to a symbol , Namely 256 Symbols , from 00000000 To 11111111.
Last century 60 years , The United States has developed a set of character codes , The relationship between English characters and binary bits , Made a unified regulation . This is known as ASCII code , It has been used up to now .
ASCII The code is specified 128 Character encoding , Such as the blank space SPACE yes 32( Binary system 00100000), Capital letters A yes 65( Binary system 01000001). this 128 Symbols ( Include 32 A control symbol that can't be printed ), It only takes up the end of one byte 7 position , The first one is uniformly defined as 0.
0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
---|---|---|---|---|---|---|---|
The unified stipulation is 0 | 0 or 1 | 0 or 1 | 0 or 1 | 0 or 1 | 0 or 1 | 0 or 1 | 0 or 1 |
Two Not ASCII code
If only English , use 128 A symbol code is enough , But for other languages ,128 A symbol is not enough .
such as , In French , There are phonetic symbols above the letters , It won't work ASCII Code said .
therefore , Some European countries decided to , Use the highest bit of the byte to program the new symbol .
such as , In French é The code of is 130( Binary system 10000010).
thus , The coding system used by these European countries , Can mean at most 256 Symbols .
however , There are new problems .
Different countries have different letters , therefore , Even if they all use 256 The encoding of symbols , The letters are different .
such as ,130 In French coding, it stands for é, In Hebrew code, it stands for the letters Gimel (ג), In Russian code, it will represent another symbol .
But anyway , Of all these coding methods ,0–127 The symbols are the same , It's just that 128–255 This part of .
As for the words of Asian countries , More symbols are used , There are as many Chinese characters as 10 All around .
A byte can only represent 256 Symbols , It must not be enough , You have to use more than one byte to express a symbol .
such as , The common encoding method of simplified Chinese is GB2312, Use two bytes to represent a Chinese character , So in theory, it can at most express 256 x 256 = 65536 Symbols
3、 ... and Unicode
There are many ways of coding in the world , The same binary number can be interpreted as different symbols .
therefore , To open a text file , You have to know how it's encoded , Otherwise, read it in the wrong way , There will be chaos .
This is why files are often garbled , Because the encoding method used for encoding and decoding is different
As you can imagine , If there's a code , Include all the symbols in the world . Each symbol is given a unique code , Then the confusion will disappear .
This is it. Unicode, It's like its name means , It's a code for all the symbols .
Unicode All characters in the world are assigned a unique numeric number , This number ranges from 0x000000 To 0x10FFFF ( Hexadecimal ), Yes 110 More than ten thousand , Each character has a unique Unicode Number , This number is usually written as 16 Base number , Add in front U+. for example :
U+9A6C Representing Chinese character horse ,
U+4E25 It means Chinese characters are strict
U+0639 For Arabic letters Ain,
U+0041 A capital letter for English A,
Unicode It's like a table , The connection between characters and numbers is established
3.1 Unicode The problem is
Unicode It's just a set of symbols , It only specifies the binary code of the symbol , It doesn't specify how the binary code should be stored .
such as , The Chinese characters are strict Unicode It's a hexadecimal number 4E25, Conversion to binary is enough 15 position (100111000100101), in other words , The representation of this symbol requires at least 2 Bytes . Other larger symbols , You may need to 3 Bytes or 4 Bytes , Even more .
There are two serious problems :
** Question 1 :** How to distinguish Unicode and ASCII ? How do computers know that three bytes represent a symbol , Instead of three symbols ?
** Question two :** Only one byte is enough for English letters , If Unicode Uniform rules , Each symbol is represented by three or four bytes , Then every letter must be preceded by two or three bytes 0, It's a huge waste of storage , The size of the text file will therefore be two or three times larger
3.2 The result is
** Results a :** There is Unicode A variety of storage methods , That is to say, there are many different binary formats , It can be used to express Unicode, There are mainly UTF-8,UTF-16,UTF-32.
Result two : Unicode Can't promote... For a long time , Until the advent of the Internet
Four UTF-8
The popularity of the Internet , A unified coding method is strongly demanded .
UTF-8 It's the most widely used on the Internet Unicode How to implement .
Other implementations include UTF-16( Characters are represented by two or four bytes ) and UTF-32( Characters are represented in four bytes ), But not on the Internet .
Be careful :UTF-8 yes Unicode One of the ways to realize .
###4.1 UTF-8 characteristic
UTF-8 The biggest one , It's a variable length encoding . It can be used 1~4 Bytes represent a symbol , The length of the bytes varies according to the symbol .
###4.2 UTF-8 The coding rules of
UTF-8 There are two coding rules :
** Rule one :** For single byte symbols , The first bit of the byte is set to 0, Back 7 Bit by bit Unicode code . So for English letters ,UTF-8 Coding and ASCII The code is the same .
** Rule 2 :** about n Symbol of byte (n > 1), Before the first byte n All places are set as 1, The first n + 1 Set as 0, The first two bits of the next byte are all set to 10. The remaining bits not mentioned , All for this symbol Unicode code .
The following table summarizes the coding rules , Letter x The bits that can be encoded .
Serial number range ( Number corresponds to decimal number ) | Binary format |
---|---|
Hexadecimal range :(0x00—0x7f) Decimal range :(0—127) | 0XXXXXXX |
Hexadecimal range :(0x80—0x7ff) Decimal range :(128—2047) | 110XXXXX 10XXXXXX |
Hexadecimal range :(0x800—0xffff) Decimal range :(2048—65535) | 1110XXXX 10XXXXXX 10XXXXXX |
Hexadecimal range :(0x10000—0x10ffff) Decimal range :(65536 above ) | 11110XXX 10XXXXXX 10XXXXXX 10XXXXXX |
5、 ... and Unicode and UTF-8 Conversion between
According to the above table , Reading UTF-8 Coding is very simple . If the first bit of a byte is 0, Then this byte is just a character ; If the first one is 1, How many in a row 1, It means how many bytes the current character occupies .
below , Take the strictness of Chinese characters as an example , Demonstrate how to achieve UTF-8 code .
Yan's Unicode yes 4E25(100111000100101), According to the table , You can find 4E25 In the third line (0000 0800 - 0000 FFFF), So strict UTF-8 Encoding takes three bytes , The format is 1110xxxx 10xxxxxx 10xxxxxx. then , Start with the strict last binary bit , Fill in... In the format from the back to the front x, The extra complement 0. So you get , Yan's UTF-8 Encoding is 11100100 10111000 10100101, Convert it to hexadecimal E4B8A5
边栏推荐
- mysql econnreset_ Nodejs socket error handling error: read econnreset
- Shu tianmeng map × Weiyan technology - Dream map database circle of friends + 1
- The Research Report "2022 RPA supplier strength matrix analysis of China's banking industry" was officially launched
- 同事半个月都没搞懂selenium,我半个小时就给他整明白!顺手秀了一波爬淘宝的操作[通俗易懂]
- Le rapport de recherche sur l'analyse matricielle de la Force des fournisseurs de RPA dans le secteur bancaire chinois en 2022 a été officiellement lancé.
- Detailed explanation of navigation component of openharmony application development
- 程序员成长第八篇:做好测试工作
- Introduction to sap ui5 dynamicpage control
- 946. 验证栈序列
- 155. Minimum stack
猜你喜欢
Detailed explanation of navigation component of openharmony application development
Write macro with word
“百度杯”CTF比赛 九月场,Web:SQL
MySQL giant pit: update updates should be judged with caution by affecting the number of rows!!!
LeetCode20.有效的括号
Introduction to sap ui5 flexiblecolumnlayout control
使用 jMeter 对 SAP Spartacus 进行并发性能测试
量价虽降,商业银行结构性存款为何受上市公司所偏爱?
Shu tianmeng map × Weiyan technology - Dream map database circle of friends + 1
C# 对象存储
随机推荐
Reflection and imagination on the notation like tool
CF:A. The Third Three Number Problem【关于我是位运算垃圾这个事情】
#从源头解决# 自定义头文件在VS上出现“无法打开源文件“XX.h“的问题
Actual combat simulation │ JWT login authentication
精彩速递|腾讯云数据库6月刊
Shu tianmeng map × Weiyan technology - Dream map database circle of friends + 1
逆波兰表达式
Leetcode20. Valid parentheses
SAP UI5 ObjectPageLayout 控件使用方法分享
PyCharm安装第三方库图解
“百度杯”CTF比赛 九月场,Web:SQL
go map
Difference between avc1 and H264
RHCSA10
LB10S-ASEMI整流桥LB10S
It's too convenient. You can complete the code release and approval by nailing it!
go 指针
[Nacos cloud native] the first step of reading the source code is to start Nacos locally
Introduction to sap ui5 dynamicpage control
Realize the addition of all numbers between 1 and number