A detailed explanation of ASCII code, Unicode and UTF-8
2022-07-05 13:12:00 【UPythonFish】
1 ASCII
Inside a computer, all information is ultimately stored as binary values. Each binary digit (bit) has two states, 0 and 1, so eight bits can be combined into 256 states. This group of eight bits is called a byte. In other words, one byte can represent 256 different states, each corresponding to one symbol, from 00000000 to 11111111.
In the 1960s, the United States developed a character encoding that standardized the relationship between English characters and binary bits. This is known as ASCII, and it is still in use today.
ASCII specifies the encoding of 128 characters. For example, the space character SPACE is 32 (binary 00100000) and the capital letter A is 65 (binary 01000001). These 128 symbols (including 33 non-printable control characters: codes 0 to 31 plus DEL at 127) occupy only the last 7 bits of a byte; the first bit is always 0.
| 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
|---|---|---|---|---|---|---|---|
| Always 0 | 0 or 1 | 0 or 1 | 0 or 1 | 0 or 1 | 0 or 1 | 0 or 1 | 0 or 1 |
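As a quick check of the values above, here is a minimal Python sketch that looks up the code and 8-bit pattern of a few ASCII characters. The variable names are illustrative only and do not come from the original article.

```python
# Look up the ASCII code and the 8-bit pattern of a few characters.
samples = [" ", "A", "b"]
table = {ch: (ord(ch), format(ord(ch), "08b")) for ch in samples}

for ch, (code, bits) in table.items():
    print(repr(ch), code, bits)

# Every ASCII code (0 to 127) fits in 7 bits, so the leading bit of the byte is 0.
all_leading_zero = all(format(code, "08b")[0] == "0" for code in range(128))
print(all_leading_zero)
```

The pattern printed for `b` is the same byte used as the example row in the table above.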
2 Non-ASCII encodings
If only English were involved, 128 symbols would be enough, but for other languages 128 symbols are not enough.
For example, French has letters with diacritical marks above them, which ASCII cannot represent.
So some European countries decided to use the otherwise idle highest bit of the byte to encode new symbols.
For example, in an encoding used for French, é is assigned the code 130 (binary 10000010).
In this way, the encodings used by these European countries can represent at most 256 symbols.
However, this created a new problem.
Different countries have different letters, so even though they all use 256-symbol encodings, the same code stands for different letters.
For example, 130 stands for é in the French encoding, for the letter Gimel (ג) in the Hebrew encoding, and for yet another symbol in the Russian encoding.
In all of these encodings, however, the symbols for 0–127 are the same; only the range 128–255 differs.
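The clash can be reproduced with Python's built-in legacy codecs. The sketch below assumes that the French, Hebrew and Russian encodings mentioned above correspond to the DOS code pages 437, 862 and 866. The article does not name them, so treat that mapping as an illustration rather than a statement of what the author had in mind.

```python
import unicodedata

# Assumption: cp437, cp862 and cp866 stand in for the French, Hebrew and
# Russian encodings that the text refers to.
code_pages = ["cp437", "cp862", "cp866"]

# Decode the same high byte (130) and the same ASCII byte (65) in each code page.
high = {cp: bytes([130]).decode(cp) for cp in code_pages}
low = {cp: bytes([65]).decode(cp) for cp in code_pages}

for cp in code_pages:
    # Print the official Unicode character names to keep the output ASCII-only.
    print(cp, unicodedata.name(high[cp]), "|", unicodedata.name(low[cp]))
```

Byte 130 comes out as a different letter in each code page, while byte 65 is the capital letter A in all three.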
As for the scripts of Asian countries, even more symbols are needed; there are as many as about 100,000 Chinese characters.
One byte can represent only 256 symbols, which is certainly not enough, so more than one byte must be used to represent a symbol.
For example, the common encoding for Simplified Chinese is GB2312, which uses two bytes per Chinese character, so in theory it can represent at most 256 x 256 = 65536 symbols.
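A short Python sketch of the two-byte claim, using the standard `gb2312` codec. The character 严 is chosen here only as a sample; 65536 is the theoretical ceiling for two bytes, and GB2312 itself assigns only a small part of that space.

```python
# Encode one Chinese character with GB2312 and count the resulting bytes.
encoded = "严".encode("gb2312")
print(encoded.hex(), len(encoded))

# Upper bound on the number of distinct values that two bytes can hold.
two_byte_limit = 256 * 256
print(two_byte_limit)
```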
3 Unicode
Many encodings exist in the world, and the same binary number can be interpreted as different symbols.
Therefore, to open a text file you have to know how it was encoded; if it is read with the wrong encoding, the result is garbage.
This is why files so often appear garbled: the encoding used to decode the file differs from the one used to encode it.
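The effect is easy to reproduce. The following sketch encodes a character with UTF-8 and then decodes the bytes with Latin-1; this pair of encodings is chosen purely for illustration.

```python
original = "é"
raw = original.encode("utf-8")      # two bytes: 0xC3 0xA9

# Wrong decoder: Latin-1 turns each byte into a separate character.
garbled = raw.decode("latin-1")

# Reversing the wrong step and decoding correctly recovers the text.
recovered = garbled.encode("latin-1").decode("utf-8")

print(raw.hex(), len(original), len(garbled), recovered == original)
```

One character went in and two came out, which is the typical signature of a mismatched decoder.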
Now imagine a single encoding that includes every symbol in the world. If each symbol were given a unique code, the garbling would disappear.
This is Unicode. As its name suggests, it is one encoding for all symbols.
Unicode assigns every character in the world a unique number. The numbers range from 0x000000 to 0x10FFFF (hexadecimal), which gives more than 1.1 million values. Each character's Unicode number is usually written in hexadecimal with the prefix U+. For example:
U+9A6C represents the Chinese character 马 (horse),
U+4E25 represents the Chinese character 严 (strict),
U+0639 represents the Arabic letter Ain (ع),
U+0041 represents the English capital letter A.
Unicode is like a table that maps characters to numbers.

3.1 The problem with Unicode
Unicode is only a character set: it specifies the binary number of each symbol, but not how that number should be stored.
For example, the Unicode number of the Chinese character 严 is hexadecimal 4E25, which takes 15 bits in binary (100111000100101), so representing this symbol requires at least 2 bytes. Symbols with larger numbers may need 3 or 4 bytes.
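The bit count can be verified with `int.bit_length`. The helper `min_bytes` below is a hypothetical name introduced for this sketch; it only rounds the bit count up to whole bytes and says nothing about any particular storage format.

```python
def min_bytes(code_point: int) -> int:
    """Smallest whole number of bytes that can hold the code point's bits."""
    bits = max(code_point.bit_length(), 1)   # treat 0 as needing one bit
    return (bits + 7) // 8

yan = 0x4E25
print(bin(yan)[2:], yan.bit_length(), min_bytes(yan))

# The largest Unicode number, 0x10FFFF, needs 21 bits.
print((0x10FFFF).bit_length(), min_bytes(0x10FFFF))
```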
This raises two serious problems:
**Question 1:** How can Unicode be distinguished from ASCII? How does the computer know that three bytes represent one symbol rather than three separate symbols?
**Question 2:** One byte is enough for an English letter. If Unicode required every symbol to be stored in three or four bytes, every English letter would have to be preceded by two or three zero bytes. That is a huge waste of storage: a text file would grow to three or four times its original size.
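The waste described in Question 2 can be measured. The sketch below uses Python's `utf-32-be` codec as a stand-in for a fixed four-bytes-per-symbol scheme; that choice is an assumption made for illustration, since the text does not name a specific format at this point.

```python
text = "hello"

one_byte_each = text.encode("ascii")
# utf-32-be stores every character in exactly four bytes and adds no byte order mark.
four_bytes_each = text.encode("utf-32-be")

print(len(one_byte_each), len(four_bytes_each))

# The first character: three zero bytes followed by the ASCII code of "h".
print(four_bytes_each[:4].hex())
```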
3.2 The results
**Result 1:** Several storage formats for Unicode emerged. In other words, there are multiple binary formats that can be used to represent Unicode, mainly UTF-8, UTF-16 and UTF-32.
**Result 2:** Unicode could not be widely adopted for a long time, until the rise of the Internet.
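To see that one Unicode number really does have several binary forms, the following sketch stores the same character in the three formats named in Result 1. The big-endian `-be` codec variants are used only so that Python does not prepend a byte order mark; that is a choice made for this illustration.

```python
ch = "严"   # Unicode number U+4E25
encodings = ["utf-8", "utf-16-be", "utf-32-be"]

# Same character, three different byte sequences.
stored = {name: ch.encode(name).hex() for name in encodings}

for name, hex_bytes in stored.items():
    print(name, hex_bytes)
```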
4 UTF-8
The spread of the Internet created a strong demand for a unified encoding.
UTF-8 is the most widely used implementation of Unicode on the Internet.
Other implementations include UTF-16 (two or four bytes per character) and UTF-32 (four bytes per character), but they are rarely used on the Internet.
Note: UTF-8 is one of the ways of implementing Unicode.
### 4.1 Characteristics of UTF-8
The most notable feature of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes to represent a symbol, and the number of bytes depends on the symbol.
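A minimal sketch of the variable length, with one sample code point picked for each of the four possible lengths (the samples are illustrative, not from the article):

```python
# One sample per length: A, é, 严 and an emoji (U+1F600).
samples = [0x41, 0xE9, 0x4E25, 0x1F600]

lengths = [len(chr(cp).encode("utf-8")) for cp in samples]

for cp, n in zip(samples, lengths):
    print(f"U+{cp:04X}", n)
```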
### 4.2 Encoding rules of UTF-8
UTF-8 has two encoding rules:
**Rule 1:** For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode number. For English letters, therefore, the UTF-8 encoding is identical to ASCII.
**Rule 2:** For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and bit n + 1 is set to 0; the first two bits of every following byte are set to 10. All remaining bits hold the symbol's Unicode number.
The following table summarizes the encoding rules; the letter x marks the bits available for encoding.
| Unicode number range (hexadecimal) | Unicode number range (decimal) | UTF-8 binary format |
|---|---|---|
| 0x00–0x7F | 0–127 | 0xxxxxxx |
| 0x80–0x7FF | 128–2047 | 110xxxxx 10xxxxxx |
| 0x800–0xFFFF | 2048–65535 | 1110xxxx 10xxxxxx 10xxxxxx |
| 0x10000–0x10FFFF | 65536–1114111 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
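The table translates almost line for line into code. The function below is a hand-written sketch for illustration, not how any library implements it; `utf8_encode` is a hypothetical name, and unlike Python's own codec it does not reject the surrogate range 0xD800 to 0xDFFF.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode number by following the table row by row."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of range")
    if code_point <= 0x7F:        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b11100000 | (code_point >> 12),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    return bytes([                # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        0b11110000 | (code_point >> 18),
        0b10000000 | ((code_point >> 12) & 0b111111),
        0b10000000 | ((code_point >> 6) & 0b111111),
        0b10000000 | (code_point & 0b111111),
    ])

print(utf8_encode(0x41).hex(), utf8_encode(0x4E25).hex())
```

Each branch keeps the fixed marker bits of its row and uses shifts and masks to drop the Unicode number into the x positions, six bits per continuation byte.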
5 Converting between Unicode and UTF-8
With the table above, reading UTF-8 is straightforward. If the first bit of a byte is 0, the byte is a complete character on its own. If the first bit is 1, the number of consecutive leading 1s tells you how many bytes the current character occupies.
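That reading rule can be sketched as follows. Both function names are hypothetical, and the sketch trusts its input: it does not check that the following bytes really start with 10.

```python
def utf8_char_length(first_byte: int) -> int:
    """Return how many bytes the character starting with first_byte occupies."""
    if (first_byte & 0b10000000) == 0:
        return 1                      # leading 0: a single-byte character
    ones = 0
    mask = 0b10000000
    while first_byte & mask:          # count the consecutive leading 1 bits
        ones += 1
        mask >>= 1
    if ones == 1 or ones > 4:
        raise ValueError("not a valid first byte of a UTF-8 character")
    return ones

def split_characters(data: bytes) -> list:
    """Cut a UTF-8 byte string into one chunk of bytes per character."""
    chunks = []
    i = 0
    while i < len(data):
        n = utf8_char_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks

data = "A严é".encode("utf-8")
chunks = split_characters(data)
print([chunk.hex() for chunk in chunks])
```

A byte with exactly one leading 1 (the 10xxxxxx pattern) is a continuation byte, which is why it is rejected as a starting byte.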
Below, the Chinese character 严 is used as an example to show how UTF-8 encoding works.
The Unicode number of 严 is 4E25 (100111000100101). According to the table, 4E25 falls in the range of the third row (0x800 to 0xFFFF), so the UTF-8 encoding of 严 takes three bytes, in the format 1110xxxx 10xxxxxx 10xxxxxx. Then, starting from the last binary bit of 严, fill in the x positions of the format from back to front, and pad the leftover positions with 0. The result is that the UTF-8 encoding of 严 is 11100100 10111000 10100101, which is E4B8A5 in hexadecimal.
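The back-to-front filling can be mirrored literally with strings. `fill_template` is a hypothetical helper written for this walkthrough; it works on text patterns rather than bit operations so that each step matches the description above.

```python
def fill_template(code_point: int, template: str) -> str:
    """Fill the x positions of a UTF-8 format from right to left."""
    bits = list(bin(code_point)[2:])          # most significant bit first
    out = []
    for ch in reversed(template):
        if ch == "x":
            # Take the next bit from the back; pad with 0 once the bits run out.
            out.append(bits.pop() if bits else "0")
        else:
            out.append(ch)                    # marker bits and spaces stay as they are
    if bits:
        raise ValueError("code point does not fit into this format")
    return "".join(reversed(out))

filled = fill_template(0x4E25, "1110xxxx 10xxxxxx 10xxxxxx")
print(filled)

# Convert each 8-bit group to a byte and compare with Python's own encoder.
encoded = bytes(int(group, 2) for group in filled.split())
print(encoded.hex(), encoded == "严".encode("utf-8"))
```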