Relationship between ASCII, Unicode, GBK, UTF-8
2022-07-01 01:29:00 【z_ kakaya】
1. ASCII code
ASCII (American Standard Code for Information Interchange) is a character encoding based on the Latin alphabet, used mainly to represent modern English and other Western European languages. It is the most widely used single-byte encoding system today and is equivalent to the international standard ISO/IEC 646.
ASCII uses 7-bit or 8-bit binary numbers to represent 128 or 256 possible characters. Standard ASCII, also called basic ASCII, uses 7 bits (the remaining high bit of the byte is 0) to represent all uppercase and lowercase letters, the digits 0 to 9, punctuation marks, and the special control characters used in American English.
Among these codes:
0~31 and 127 (33 in total) are control characters or communication-specific characters; the rest are printable. Control characters include LF (line feed), CR (carriage return), FF (form feed), DEL (delete), BS (backspace), and BEL (bell); communication-specific characters include SOH (start of heading), EOT (end of transmission), and ACK (acknowledge). The ASCII values 8, 9, 10, and 13 are backspace, tab, line feed, and carriage return respectively. These characters have no graphic representation of their own, but they affect how text is displayed, depending on the application.
32~126 (95 in total) are printable characters (32 is the space), of which 48~57 are the ten Arabic digits 0 to 9.
65~90 are the 26 uppercase letters and 97~122 are the 26 lowercase letters; the rest are punctuation marks, arithmetic symbols, and so on.
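The ranges listed above can be checked with a short Python sketch. The function name `classify_ascii` and its category labels are illustrative choices, not part of any standard.

```python
def classify_ascii(code: int) -> str:
    """Classify a 7-bit ASCII code using the ranges listed above."""
    if not 0 <= code <= 127:
        raise ValueError("not a standard ASCII code")
    if code <= 31 or code == 127:
        return "control"
    if code == 32:
        return "space"
    if 48 <= code <= 57:
        return "digit"
    if 65 <= code <= 90:
        return "uppercase"
    if 97 <= code <= 122:
        return "lowercase"
    return "punctuation"  # punctuation marks, arithmetic symbols, etc.


# Count how many codes fall into each category.
counts = {}
for code in range(128):
    kind = classify_ascii(code)
    counts[kind] = counts.get(kind, 0) + 1

print(counts)
print(classify_ascii(ord("A")), classify_ascii(ord("7")), classify_ascii(10))
```

The control count should come to 33 and all other categories together to 95, matching the totals in the list.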
The codes from 128 to 255 are called extended ASCII. Many x86-based systems support extended (or “high”) ASCII. Extended ASCII uses the eighth bit of each byte to define an additional 128 special symbols, foreign-language letters, and graphic symbols.
The ASCII table is shown below:
[Figure: ASCII table, two images (ASCII.png, ascii2.png); not available in this copy]
2. GBK encoding
ASCII does not support Chinese, so Chinese computer users needed an encoding that does.
The rule that was defined is as follows: a byte no greater than 127 means the same character as in ASCII, but two consecutive bytes greater than 127 together represent one Chinese character. The first byte is called the high byte (0xA1-0xF7) and the second the low byte (0xA1-0xFE). These combinations cover about 7,000 simplified Chinese characters. This rule is called GB2312.
However, Chinese has so many characters that some still could not be represented, so the rule was relaxed: the low byte no longer has to be greater than 127. As long as the first byte is greater than 127, it marks the start of a Chinese character, whatever the following byte is. This extended scheme is called the GBK standard. It contains everything in GB2312 and extends the repertoire to roughly 20,000 Chinese characters (including traditional characters) and symbols.
China also has 56 ethnic groups, so the rules were extended once more to add several thousand characters used by ethnic minorities; the result is called GB18030. Chinese programmers commonly refer to this family of standards as "DBCS" (Double Byte Character Set), although GB18030 also uses four-byte sequences.
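The lead-byte rule described above can be sketched in Python. The helper `split_gbk` is a hypothetical name, and it assumes well-formed GBK input (it does not validate the second byte). The character U+9F8D is an assumed example of a traditional character that GBK covers and GB2312 does not.

```python
def split_gbk(data: bytes) -> list:
    """Split GBK-encoded bytes into per-character chunks.

    A byte <= 127 is a single ASCII character; a byte > 127 starts a
    two-byte character, whatever the following byte is.
    """
    chunks = []
    i = 0
    while i < len(data):
        width = 2 if data[i] > 0x7F else 1
        chunks.append(data[i:i + width])
        i += width
    return chunks


encoded = "a中b".encode("gbk")
chunks = split_gbk(encoded)
print(chunks)

# Assumed example: a traditional character outside GB2312 but inside GBK.
rare = "\u9f8d"
try:
    rare.encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False
print(in_gb2312, len(rare.encode("gbk")))
```

The ASCII letters stay one byte each, while the Chinese character occupies two bytes whose first byte is greater than 127.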
3. Unicode character set
Every country defined its own encoding standard, so text encoded under one standard could not be read under another, and data could not be exchanged reliably. ISO (the International Organization for Standardization) therefore set out to define a single scheme covering the characters of all countries; this scheme is known as Unicode (today it is maintained by the Unicode Consortium and kept in step with ISO/IEC 10646). Note that Unicode is not an encoding rule but a character set: it assigns every character a unique ID, formally called a code point. Unicode can be thought of as a worldwide dictionary of characters.
The original scheme stipulated that every character uses two bytes, that is, 16 bits. Characters from the ASCII table keep their values and are simply widened to 16 bits, and the characters of all other countries are assigned new codes. Since an ASCII character really needs only one byte, this fixed-width scheme wastes bandwidth in transmission and disk space in storage. (Unicode has since grown beyond 16 bits; code points now run up to U+10FFFF.)
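The overhead of the fixed two-byte scheme is easy to see in Python. This sketch uses the `utf-16-be` codec as a stand-in for the two-byte scheme, which is an assumption that holds only for characters that fit in 16 bits.

```python
text = "A"
code_point = ord(text)                 # the character's Unicode code point
two_byte = text.encode("utf-16-be")    # one 16-bit unit, big-endian, no byte-order mark
one_byte = text.encode("ascii")        # the same character as a single ASCII byte

print(hex(code_point), two_byte, one_byte)

# For pure ASCII text, the two-byte form doubles the size.
sample = "hello"
ascii_size = len(sample.encode("ascii"))
wide_size = len(sample.encode("utf-16-be"))
print(ascii_size, wide_size)
```

The ASCII value is unchanged; it is simply padded with a leading zero byte.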
4. UTF-8 encoding
To avoid wasting network bandwidth and disk space, a set of encoding rules was defined on top of Unicode: rules that convert a code point into a byte sequence. (Encoding and decoding are loosely analogous to encryption and decryption, in that one converts to a byte form and the other converts back, although no secrecy is involved.) This encoding is UTF-8, which uses 1 to 4 bytes per character to transmit and store data.
Encoding rules: convert using the following template
Unicode code point range (hexadecimal) | UTF-8 byte pattern (binary)
------------------------------------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
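The template can be applied by hand with bit operations. The following is a minimal sketch, not a replacement for Python's built-in codec: the name `utf8_encode` is illustrative, and it does not reject the surrogate range U+D800-U+DFFF, which a strict encoder would.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by filling the x bits of the matching template row."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range")
    if code_point <= 0x7F:            # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:           # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:          # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),   # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Compare against the built-in codec for one code point from each row.
for cp in (0x41, 0xE9, 0x667A, 0x1F600):
    print(f"U+{cp:04X}", utf8_encode(cp).hex(), utf8_encode(cp) == chr(cp).encode("utf-8"))
```

Each sample code point falls in a different row of the table, so the output lengths run from 1 to 4 bytes.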
5. Converting between UTF-8 and Unicode
UTF-8 marks the start of each character with the high-order bits of its first byte. A one-byte character has a first byte starting with “0”. For a two-byte character, the first byte starts with “110” and the second with “10”. For a three-byte character, the first byte starts with “1110” and the two following bytes with “10”. For a four-byte character, the first byte starts with “11110” and the three following bytes with “10”.
Take the Chinese character “智”: its UTF-8 encoding is “\xe6\x99\xba”, which in binary is “111001101001100110111010”. This character takes 3 bytes in UTF-8, so the matching template is “0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx”.
11100110 10011001 10111010
1110xxxx 10xxxxxx 10xxxxxx
    0110   011001   111010
Removing the template prefix bits and joining what remains gives 0110011001111010, which is hexadecimal 667A. The Unicode code point of “智” is therefore U+667A.
In the same way, the UTF-8 encoding of a character can be derived from its Unicode code point.
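The reverse direction, from UTF-8 bytes back to a code point, can be sketched as follows. The function name `utf8_decode_char` is illustrative; it handles one complete sequence and, as a simplification, does not reject overlong encodings the way a strict decoder would.

```python
def utf8_decode_char(data: bytes) -> int:
    """Return the code point encoded by one complete UTF-8 sequence."""
    first = data[0]
    # The leading bits of the first byte give the sequence length.
    if first >> 7 == 0b0:
        length, value = 1, first
    elif first >> 5 == 0b110:
        length, value = 2, first & 0x1F
    elif first >> 4 == 0b1110:
        length, value = 3, first & 0x0F
    elif first >> 3 == 0b11110:
        length, value = 4, first & 0x07
    else:
        raise ValueError("invalid leading byte")
    if len(data) != length:
        raise ValueError("wrong number of bytes")
    # Each following byte must start with 10 and contributes six payload bits.
    for byte in data[1:]:
        if byte >> 6 != 0b10:
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (byte & 0x3F)
    return value


code_point = utf8_decode_char(b"\xe6\x99\xba")
print(f"U+{code_point:04X}")  # U+667A
```

The bytes used here are the three-byte example worked through above.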
6. Converting between Unicode and GBK
Take the Chinese character “路”: its GBK encoding is “\xc2\xb7”, which in binary is “1100 0010 1011 0111”. Its position in the Unicode character set is “\u8def” (as written in a Python Unicode string), that is, code point U+8DEF. The high bit of each GBK byte is set only to distinguish Chinese characters from ASCII. Clearing that “1” in both bytes of “1100 0010 1011 0111” gives “0100 0010 0011 0111”, or hexadecimal 4237, which is the character's code in the underlying GB2312 national standard. Note that 4237 is not the Unicode code point: GBK codes and Unicode code points have no arithmetic relationship, so converting between them requires a lookup in a mapping table.
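These values can be checked with Python's built-in `gbk` codec, which performs the table lookup internally. A brief sketch; the variable names are illustrative.

```python
gbk_bytes = b"\xc2\xb7"

# Decoding looks the byte pair up in the GBK-to-Unicode mapping table.
char = gbk_bytes.decode("gbk")
print(f"U+{ord(char):04X}")

# Clearing the high bit of each byte gives a different number, not the code point.
stripped = bytes(b & 0x7F for b in gbk_bytes)
print(stripped.hex())

# Encoding performs the reverse lookup and restores the original bytes.
print(char.encode("gbk") == gbk_bytes)
```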
7. Relationship among UTF-8, Unicode, and GBK
utf-8 --------decode-----> Unicode type <--------decode-------- gbk
utf-8 <-------encode------ Unicode type --------encode-------> gbk
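In Python 3 the “Unicode type” in the diagram corresponds to `str`, and the byte encodings to `bytes`. Converting between UTF-8 and GBK therefore always passes through `str`, as in this minimal sketch (the helper name `transcode` is hypothetical):

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes from one encoding to another by way of a Unicode string."""
    text = data.decode(source)   # bytes -> str (Unicode)
    return text.encode(target)   # str (Unicode) -> bytes


utf8_bytes = b"\xe6\x99\xba"
gbk_bytes = transcode(utf8_bytes, "utf-8", "gbk")
round_trip = transcode(gbk_bytes, "gbk", "utf-8")

print(len(utf8_bytes), len(gbk_bytes))
print(round_trip == utf8_bytes)
```

The same character takes three bytes in UTF-8 and two in GBK, and the round trip restores the original bytes.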
Reprinted from :CSDN-longwen_zhi