Relationship between ASCII, Unicode, GBK, UTF-8
2022-07-01 01:29:00 【z_ kakaya】
1. ASCII code
ASCII (American Standard Code for Information Interchange) is a character encoding based on the Latin alphabet, used mainly to represent modern English and other Western European languages. It is the most widely used single-byte encoding system today and is equivalent to the international standard ISO/IEC 646.
ASCII codes use 7-bit or 8-bit binary combinations to represent 128 or 256 possible characters. Standard ASCII, also called basic ASCII, uses 7 bits (with the remaining high bit set to 0) to represent all uppercase and lowercase letters, the digits 0 to 9, punctuation marks, and the special control characters used in American English.
Within this range:
0~31 and 127 (33 in total) are control characters or communication-specific characters (all the rest are printable). Control characters include LF (line feed), CR (carriage return), FF (form feed), DEL (delete), BS (backspace), and BEL (bell); communication-specific characters include SOH (start of heading), EOT (end of transmission), and ACK (acknowledge). The ASCII values 8, 9, 10, and 13 are backspace, tab, line feed, and carriage return respectively. These characters have no graphic representation of their own, but they affect how text is displayed, depending on the application.
32~126 (95 in total) are printable characters (32 is the space), of which 48~57 are the ten Arabic digits 0 to 9.
65~90 are the 26 uppercase letters and 97~122 are the 26 lowercase letters; the rest are punctuation marks, operator symbols, and so on.
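The ranges above can be checked with a short Python sketch. The helper name classify_ascii is hypothetical and only illustrates the grouping described in the text.

```python
def classify_ascii(code: int) -> str:
    """Return the group a 7-bit ASCII code belongs to (hypothetical helper)."""
    if not 0 <= code <= 127:
        raise ValueError("not a 7-bit ASCII code")
    if code <= 31 or code == 127:
        return "control"
    if code == 32:
        return "space"
    if 48 <= code <= 57:
        return "digit"
    if 65 <= code <= 90:
        return "uppercase"
    if 97 <= code <= 122:
        return "lowercase"
    return "punctuation or symbol"


# Count the control codes and the printable codes (space included).
controls = [c for c in range(128) if classify_ascii(c) == "control"]
printable = [c for c in range(128) if classify_ascii(c) != "control"]
print(len(controls), len(printable))  # 33 95

# ord() and chr() convert between a character and its code.
print(ord("A"), chr(97))  # 65 a

# Backspace, tab, line feed, carriage return.
print([ord(c) for c in "\b\t\n\r"])  # [8, 9, 10, 13]
```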
The 128 codes after these are called extended ASCII. Many x86-based systems support extended (or "high") ASCII. Extended ASCII uses the 8th bit of each byte to define an additional 128 special symbols, letters from other languages, and graphic symbols.
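One practical consequence is that the upper 128 values depend on which extension is in use. The sketch below is only an illustration: the two code pages, latin-1 and cp437, are chosen here as examples and are not named in the original text.

```python
# One ASCII byte (0x41) followed by one "high" byte (0xE9).
raw = bytes([0x41, 0xE9])

as_latin1 = raw.decode("latin-1")  # ISO-8859-1, a Western European extension
as_cp437 = raw.decode("cp437")     # the original IBM PC code page

# The ASCII half is identical under both code pages.
print(as_latin1[0] == as_cp437[0])  # True

# The high half is not: the same byte maps to different characters.
print(as_latin1[1] == as_cp437[1])  # False
print(f"U+{ord(as_latin1[1]):04X}")  # U+00E9
```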
The ASCII table follows:
[Figure: ASCII table (ASCII.png, ascii2.png)]
2. GBK encoding
Because ASCII does not support Chinese, Chinese computer users needed an encoding that does.
A set of rules was therefore defined: a byte with a value below 128 means the same as in ASCII, but two consecutive bytes greater than 127 together represent one Chinese character. The first byte is called the high byte (0xA1 to 0xF7) and the second the low byte (0xA1 to 0xFE). This scheme can encode about 7,000 simplified Chinese characters and symbols. It is called GB2312.
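The byte ranges of the GB2312 rule can be observed with Python's built-in gb2312 codec. This is a minimal sketch; the helper looks_like_gb2312_pair is a hypothetical name that only restates the ranges given above.

```python
def looks_like_gb2312_pair(hi: int, lo: int) -> bool:
    """Check a byte pair against the GB2312 ranges quoted in the text."""
    return 0xA1 <= hi <= 0xF7 and 0xA1 <= lo <= 0xFE


# One Chinese character followed by one ASCII letter.
encoded = "中a".encode("gb2312")
print(encoded.hex())  # d6d061

# The first two bytes form one Chinese character...
print(looks_like_gb2312_pair(encoded[0], encoded[1]))  # True

# ...and the third byte is below 128, so it is plain ASCII.
print(encoded[2] < 0x80, chr(encoded[2]))  # True a
```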
However, Chinese has so many characters that some could still not be represented, so the rules were revised: the low byte no longer has to be greater than 127. As long as the first byte is greater than 127, it always marks the start of a Chinese character, whether or not the byte that follows falls in the extended range. This extended scheme is called the GBK standard. It includes everything in GB2312 and extends coverage to roughly 21,000 Chinese characters (including traditional characters), plus additional symbols.
China also has 56 ethnic groups, so the rules were extended once more with several thousand characters from minority scripts; the result is called GB18030. Chinese programmers refer to this family of standards collectively as "DBCS" (Double Byte Character Set), although GB18030 also uses four-byte sequences.
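How coverage grows from GB2312 to GBK to GB18030 can be sketched with Python's built-in codecs. The two sample characters (a traditional Chinese character and a Tibetan letter) and the helper try_encode are choices made here for illustration, not taken from the original text.

```python
def try_encode(text: str, codec: str):
    """Return the encoded bytes, or None if the codec cannot represent text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None


traditional = "\u9f8d"  # 龍, the traditional form of 龙
tibetan = "\u0f40"      # Tibetan letter KA, a minority-script character

# GB2312 covers simplified characters only, so the traditional form fails.
print(try_encode(traditional, "gb2312"))  # None

# GBK adds it, still as a two-byte sequence.
print(len(try_encode(traditional, "gbk")))  # 2

# GBK cannot represent the Tibetan letter, but GB18030 can, using four bytes.
print(try_encode(tibetan, "gbk"))  # None
print(len(try_encode(tibetan, "gb18030")))  # 4
```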
3. Unicode character set
Because there are many countries in the world and each defined its own encoding standard, no system could understand another's encoding and text could not be exchanged reliably. The International Organization for Standardization (ISO) therefore set out to define a single scheme that would solve the encoding problem for all countries; this scheme is called Unicode (the Unicode Standard itself is maintained by the Unicode Consortium and kept in step with ISO/IEC 10646). Note that Unicode is not an encoding rule; it is a character set that assigns every character a unique ID, formally called a code point. Unicode can be thought of as a worldwide dictionary of character codes.
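In Python, the code point of a character can be looked up with the built-in ord() and turned back into a character with chr(). A small sketch, using sample characters chosen here for illustration:

```python
# Print each character next to its code point in the usual U+XXXX notation.
for ch in ["A", "智", "路"]:
    print(ch, f"U+{ord(ch):04X}")
# A U+0041
# 智 U+667A
# 路 U+8DEF

# chr() goes the other way: from code point to character.
print(chr(0x667A))  # 智
```

A code point is only a number in the dictionary; it says nothing yet about how many bytes are used to store it.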
The original design required every character to use two bytes, i.e. 16 bits (the scheme known as UCS-2). Characters from the ASCII table keep their values and are simply widened to 16 bits, while the characters of other countries are all assigned new codes. Since an ASCII character actually needs only one byte, this scheme wastes bandwidth when data is transmitted and disk space when it is stored.
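The overhead is easy to measure. The sketch below uses Python's utf-16-be codec as a stand-in for the fixed two-byte scheme; that substitution is an assumption made here, and it only holds for characters that fit in 16 bits, such as the ASCII text in this example.

```python
text = "Hello, world"

fixed_width = text.encode("utf-16-be")  # two bytes per character here
compact = text.encode("utf-8")          # one byte per ASCII character

print(len(text), len(fixed_width), len(compact))  # 12 24 12

# Every ASCII character is padded with a zero byte in the two-byte form.
print(fixed_width[:4])  # b'\x00H\x00e'
```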
4. UTF-8 encoding
Because transmitting and storing Unicode code points at a fixed width wastes network bandwidth and disk space, a set of encoding rules was defined on top of Unicode, that is, rules for converting code points into byte sequences. (Encoding and decoding can loosely be pictured as a reversible transformation like encryption and decryption, although nothing is kept secret.) This encoding rule is UTF-8, which uses 1 to 4 bytes per character to transmit and store data.
Encoding rules: convert using the following template.
Unicode code point range (hexadecimal) | UTF-8 encoding (binary)
---------------------------------------------------------------------------
0000 0000-0000 007F                    | 0xxxxxxx
0000 0080-0000 07FF                    | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF                    | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF                    | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
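The template can be turned into code almost line by line. The function below is a minimal sketch written for this explanation (the name utf8_encode is hypothetical); in practice Python's own str.encode("utf-8") does the same job and is used here only to cross-check the result.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by filling in the template's x bits."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if code_point <= 0x7F:        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


print(utf8_encode(0x667A))  # b'\xe6\x99\xba'

# Cross-check one code point from each row against the built-in encoder.
for cp in (0x41, 0xE9, 0x667A, 0x1F600):
    print(utf8_encode(cp) == chr(cp).encode("utf-8"))  # True each time
```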
5. Converting between UTF-8 and Unicode
UTF-8 marks where each character starts through the high bits of its bytes. A character stored in one byte has a first byte whose high bit is "0". A character stored in two bytes has a first byte starting with "110" and a second byte starting with "10". A character stored in three bytes has a first byte starting with "1110" and two following bytes starting with "10". A character stored in four bytes has a first byte starting with "11110" and three following bytes starting with "10".
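Because the first byte announces the length, a byte stream can be split into characters without decoding it. The two helpers below are a hypothetical sketch of that idea; they assume the input is already valid UTF-8 and do not validate the continuation bytes.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Read the length of a UTF-8 sequence from its first byte."""
    if first_byte >> 7 == 0b0:
        return 1
    if first_byte >> 5 == 0b110:
        return 2
    if first_byte >> 4 == 0b1110:
        return 3
    if first_byte >> 3 == 0b11110:
        return 4
    raise ValueError("continuation byte or invalid lead byte")


def split_utf8(data: bytes) -> list:
    """Split valid UTF-8 bytes into one bytes object per character."""
    chunks = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


sample = "a\u00e9\u667a\U0001F600".encode("utf-8")
print([len(chunk) for chunk in split_utf8(sample)])  # [1, 2, 3, 4]
```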
Take the Chinese character "智" as an example. Its UTF-8 encoding is "\xe6\x99\xba", which in binary is "111001101001100110111010". Common Chinese characters occupy 3 bytes in UTF-8, so the matching template is "0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx".
11100110 10011001 10111010
1110xxxx 10xxxxxx 10xxxxxx
    0110   011001   111010
Concatenating the bits in the x positions gives 0110011001111010, which is 667A in hexadecimal, so the Unicode position of "智" is U+667A.
Conversely, given the Unicode code point of a Chinese character, the same template yields its UTF-8 encoding.
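The same bit extraction can be written as a few masks and shifts. This is a sketch for the three-byte case only, with a hypothetical function name; Python's bytes.decode("utf-8") handles all cases in real code.

```python
def decode_three_byte(seq: bytes) -> int:
    """Recover the code point from a three-byte UTF-8 sequence."""
    b0, b1, b2 = seq  # raises ValueError unless seq has exactly 3 bytes
    if b0 >> 4 != 0b1110 or b1 >> 6 != 0b10 or b2 >> 6 != 0b10:
        raise ValueError("does not match 1110xxxx 10xxxxxx 10xxxxxx")
    # Keep 4 bits of the first byte and 6 bits of each following byte.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


cp = decode_three_byte(b"\xe6\x99\xba")
print(f"{cp:016b}")   # 0110011001111010
print(f"U+{cp:04X}")  # U+667A
print(chr(cp))        # 智
```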
6. Converting between Unicode and GBK
Take the Chinese character "路" as an example. Its GBK encoding is "\xc2\xb7", which in binary is "1100 0010 1011 0111". Its position in the Unicode character set is U+8DEF, written "\u8def" in Python's Unicode type. The high bit of each of the two GBK bytes only serves to distinguish Chinese from ASCII; clearing it turns "1100 0010 1011 0111" into "0100 0010 0011 0111", i.e. 0x4237. Note that 0x4237 is the character's GB2312 national-standard code, not its Unicode code point. GBK and Unicode are not related by arithmetic, so converting between them requires a mapping table, which is what a codec consults when it decodes or encodes.
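The numbers in this example can be reproduced with Python's built-in gbk codec, which carries the mapping table internally. A minimal sketch:

```python
ch = "\u8def"  # 路

# Unicode -> GBK goes through the codec's mapping table.
gbk_bytes = ch.encode("gbk")
print(gbk_bytes)  # b'\xc2\xb7'

# Clearing the high bit of each byte gives the GB2312 national-standard code.
stripped = bytes(b & 0x7F for b in gbk_bytes)
print(stripped.hex())  # 4237

# That value is unrelated to the Unicode code point.
print(f"{ord(ch):04x}")  # 8def

# GBK -> Unicode is again a table lookup performed by decode().
print(gbk_bytes.decode("gbk") == ch)  # True
```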
7. Relationship between UTF-8, Unicode, and GBK
UTF-8 bytes ---decode---> Unicode type <---decode--- GBK bytes
UTF-8 bytes <---encode--- Unicode type ---encode---> GBK bytes
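The diagram translates directly into Python 3, where str plays the role of the Unicode type and bytes holds the encoded form. The helper name transcode below is hypothetical; it only chains the decode and encode steps shown above.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes between encodings by passing through the Unicode type."""
    text = data.decode(source)   # bytes -> str (Unicode)
    return text.encode(target)   # str (Unicode) -> bytes


gbk_bytes = b"\xc2\xb7"  # GBK encoding of 路
utf8_bytes = transcode(gbk_bytes, "gbk", "utf-8")
print(utf8_bytes)  # b'\xe8\xb7\xaf'

# The round trip restores the original bytes.
print(transcode(utf8_bytes, "utf-8", "gbk") == gbk_bytes)  # True

# Decoding with the wrong codec does not fail here, it silently yields
# a different character (U+00B7, the middle dot).
wrong = gbk_bytes.decode("utf-8")
print(f"U+{ord(wrong):04X}")  # U+00B7
```

There is no direct arrow between UTF-8 and GBK in the diagram, which is why the conversion always passes through the Unicode type in the middle.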
Reprinted from: CSDN-longwen_zhi