
Why do we use UTF-8 encoding?

2022-07-07 14:54:00 nsnsttn


Computers can only work with binary numbers, so the letters, digits, Chinese characters, symbols, emoji and so on that we type and see must all be converted into binary in some way for storage, and converted back into the corresponding characters when we use them.

This is how ASCII came about. It supports 128 characters: English letters, digits, punctuation marks, and control characters.

Each character has a corresponding code point, a number from 0 to 127. The collection of all supported characters together with their code points is called a character set, and this one is the ASCII character set.

 Going from the character A to the binary 01000001 is called encoding.
 Going from the binary 01000001 back to the character A is called decoding.
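
A minimal Python sketch of both directions, using only built-ins (Python is chosen here just for illustration):

```python
# Character -> code point -> binary (encoding), and back again (decoding).
char = "A"
code_point = ord(char)             # 65
bits = format(code_point, "08b")   # "01000001"
encoded = char.encode("ascii")     # b"A": a single byte with value 65
decoded = encoded.decode("ascii")  # "A"
print(code_point, bits, encoded, decoded)
```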

However, ASCII supports only English letters and common symbols; the scripts and symbols of other languages are not covered. Other countries and regions therefore developed their own encoding standards: mainland China has GB2312, Hong Kong, Macao and Taiwan use Big5, and GBK later extended GB2312 with traditional Chinese characters and other CJK ideographs. As a result, the same article could be written with one encoding and viewed with another, producing garbled text. The Unicode character set was created to solve this problem by bringing all characters together.

Unicode currently contains more than 140,000 characters. Note that a character set is only a collection of characters and their code points; it does not say that a character is stored in the computer as its code point. It is the character encoding that actually defines the mapping from characters to the bytes stored in the computer.
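
The difference between a code point and its stored bytes can be seen in a short sketch. It assumes the character 中 as an example (not taken from the article) and uses Python's standard `gb2312`, `big5`, and `utf-8` codecs:

```python
char = "\u4e2d"  # the Chinese character 中 (assumed example)

# One code point in the Unicode character set...
code_point = ord(char)
print(hex(code_point))  # 0x4e2d

# ...but different bytes under different encodings.
stored = {name: char.encode(name) for name in ("gb2312", "big5", "utf-8")}
for name, data in stored.items():
    print(name, data.hex())

# Bytes written with one legacy encoding and read with another come back garbled.
misread = stored["gb2312"].decode("big5", errors="replace")
print(misread == char)  # False
```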


Of course, the simplest way to encode is to store each character's code point directly in binary. That is what ASCII and UTF-32 do.

ASCII has only 128 characters, each fitting in one byte, so using the code point directly is very convenient. The Unicode character set, however, holds well over a hundred thousand characters.

For example, take the following character and its code point in Unicode:

[Figure: the example character, with its code point in decimal and in binary]

The binary form has 17 bits. We cannot simply write the binary digits of one character directly after those of another, because it would be impossible to tell where each character starts and where it ends.
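
To make this concrete, the sketch below uses U+1F600 as an assumed stand-in (it is not necessarily the character in the figure); its code point also takes 17 bits.

```python
char = "\U0001F600"  # assumed example character: the grinning face emoji
code_point = ord(char)
binary = format(code_point, "b")

print(code_point)               # 128512
print(binary)                   # 11111011000000000
print(code_point.bit_length())  # 17
```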

UTF-32 therefore stores every character in 32 bits, a fixed four bytes, padding the unused high-order bits with zeros.
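
A small sketch of this fixed-width layout, assuming the big-endian variant `utf-32-be` (which writes no byte order mark) and two example characters chosen for illustration:

```python
# Every character becomes exactly 4 bytes, zero-padded on the high side.
encoded_a = "A".encode("utf-32-be")
encoded_emoji = "\U0001F600".encode("utf-32-be")  # assumed example emoji

print(len(encoded_a), encoded_a.hex())          # 4 00000041
print(len(encoded_emoji), encoded_emoji.hex())  # 4 0001f600
```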


32 bits are enough to hold every character in the Unicode character set, and the fixed length makes it easy for the computer to find the boundary of each character (every 4 bytes). But this is a problem for English users: they were used to the ASCII character set, where each character takes only one byte, whereas with UTF-32 each character takes 4 bytes, so file sizes grow to 4 times what they were.

The same goes for Chinese users: in GBK a Chinese character takes only 2 bytes, but now it takes 4, so file sizes double. This is clearly unacceptable.
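
The size increase is easy to check. This sketch assumes two short sample strings and compares their byte lengths under the older encodings and under UTF-32:

```python
english = "hello"
chinese = "\u4e2d\u6587"  # 中文 (assumed sample text)

ascii_size = len(english.encode("ascii"))
gbk_size = len(chinese.encode("gbk"))
utf32_english = len(english.encode("utf-32-be"))
utf32_chinese = len(chinese.encode("utf-32-be"))

print(ascii_size, utf32_english)  # 5 20 -> 4 times larger
print(gbk_size, utf32_chinese)    # 4 8  -> 2 times larger
```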

This led to the UTF-8 encoding. UTF-8 is a variable-length encoding for Unicode: depending on the character, it uses 1 to 4 bytes of storage.


The specific rules are :


 Simply put, look at how the first byte starts:
0: the character is this one byte
110: the character takes two bytes
1110: the character takes three bytes
11110: the character takes four bytes
 Every following byte of a multi-byte character starts with 10.
 Range calculation (largest code point for each length):
1 byte: 2^(8-1) - 1 = 127
2 bytes: 2^(16-5) - 1 = 2047
3 bytes: 2^(24-8) - 1 = 65535
4 bytes: 2^(32-11) - 1 = 2097151, but Unicode itself stops at 1114111 (U+10FFFF)
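
The rules above can be sketched as a hand-written encoder. The function name `utf8_encode` is hypothetical and the code is only an illustration; real programs should use the built-in `str.encode("utf-8")`, which the sketch is compared against.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by hand (sketch; surrogates are not rejected)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if code_point <= 0x7F:      # 7 bits:  0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:     # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point <= 0xFFFF:    # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (code_point >> 18),
                  0b10000000 | ((code_point >> 12) & 0b111111),
                  0b10000000 | ((code_point >> 6) & 0b111111),
                  0b10000000 | (code_point & 0b111111)])


# Compare with Python's own encoder for characters of each length.
for char in ("A", "\u00e9", "\u4e2d", "\U0001F600"):
    manual = utf8_encode(ord(char))
    print(len(manual), manual.hex(), manual == char.encode("utf-8"))
```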

Advantages:

 Compatible with ASCII
 Variable length, which saves space
 Good extensibility, with room for more characters in the future
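
ASCII compatibility means that text containing only ASCII characters has identical bytes in both encodings. A quick check, assuming a sample string:

```python
text = "Hello, UTF-8!"  # assumed sample containing only ASCII characters
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

print(ascii_bytes == utf8_bytes)    # True
print(ascii_bytes.decode("utf-8"))  # Hello, UTF-8!
```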

Disadvantages:

 Common Chinese characters need 3 bytes each, compared with 2 bytes in GBK
 Working out how many bytes a character occupies is less convenient
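
The second point can be illustrated with a sketch that finds character boundaries by inspecting each lead byte. The helper names `sequence_length` and `split_characters` are hypothetical, and the code assumes its input is already valid UTF-8.

```python
def sequence_length(first_byte: int) -> int:
    """Number of bytes in the UTF-8 sequence that starts with first_byte."""
    if first_byte >> 7 == 0b0:
        return 1
    if first_byte >> 5 == 0b110:
        return 2
    if first_byte >> 4 == 0b1110:
        return 3
    if first_byte >> 3 == 0b11110:
        return 4
    raise ValueError("not a valid UTF-8 lead byte")


def split_characters(data: bytes) -> list:
    """Split valid UTF-8 bytes into one chunk per character."""
    chunks = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


text = "A\u4e2d\U0001F600"  # assumed sample: a letter, a Chinese character, an emoji
data = text.encode("utf-8")
lengths = [len(chunk) for chunk in split_characters(data)]
print(len(text), len(data))  # 3 8
print(lengths)               # [1, 3, 4]
```

With a fixed-width encoding such as UTF-32 the same question is answered by dividing the byte count by 4, which is the convenience UTF-8 gives up.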

Additions are welcome ~

