Why do we use UTF-8 encoding?
2022-07-07 14:54:00 【nsnsttn】
Computers can only work with binary numbers, so the letters, digits, Chinese characters, symbols, emoji and everything else we type and see must be converted to binary in some way for storage, and converted back into the corresponding characters when we use them.
ASCII came first.
It supports:
0-9
a-z
A-Z
!"#$%^&*()
and so on, 128 characters in total.
Each character has a corresponding code point, a number from 0 to 127. The collection of all supported characters together with their code points is called a character set, and this is how the ASCII character set came about.

Going from the character A to the binary value 01000001 is called encoding.
Conversely, going from the binary value 01000001 back to the character A is called decoding.
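As a minimal sketch of that round trip in Python (the helper names `char_to_bits` and `bits_to_char` are made up for this illustration and assume a single ASCII character):

```python
def char_to_bits(ch: str) -> str:
    """Encode one ASCII character and show its byte as 8 binary digits."""
    byte_value = ch.encode("ascii")[0]   # character -> stored byte
    return format(byte_value, "08b")     # byte -> "01000001"-style text


def bits_to_char(bits: str) -> str:
    """Decode 8 binary digits back into the ASCII character they stand for."""
    byte_value = int(bits, 2)            # binary text -> integer
    return bytes([byte_value]).decode("ascii")


print(char_to_bits("A"))         # 01000001
print(bits_to_char("01000001"))  # A
```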
However, ASCII supports only common symbols and the English alphabet; the scripts and newer symbols of other countries are not covered. Other regions therefore developed their own encoding standards: mainland China had GB2312, while Hong Kong, Macau and Taiwan used Big5, and GBK later extended GB2312 to cover simplified and traditional Chinese along with the Chinese characters used in Japanese and Korean. As a result, the same document could be written with one encoding and viewed with another, which produced garbled text. To solve this problem and unify all characters,
the Unicode character set appeared. At the time of writing it contains more than 140,000 characters. Note that a character set is only a collection of characters and their code points; it does not mean that characters are stored in the computer as those code points. It is the character encoding that actually defines the mapping from characters to what is stored.
![](/img/2e/df8cbcfef455a369f8f394968fd6e9.jpg)
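The difference between a character set and a character encoding can be seen by taking one code point and encoding it several ways. The sketch below uses the character 中 as an arbitrary example (it is not taken from the original article) and Python's built-in codecs:

```python
ch = "中"
code_point = ord(ch)  # the code point assigned by the Unicode character set

# One code point, three different byte sequences depending on the encoding.
# The "-be" codecs are big-endian variants that do not add a byte order mark.
encodings = {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}

print(f"code point: U+{code_point:04X}")
for name, data in encodings.items():
    print(f"{name:10} {data.hex(' ')}")
```

The code point stays the same in every case; only the stored bytes change.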
Of course, the simplest encoding is to store each character's code point directly in binary, and that is what ASCII and UTF-32 do.
ASCII has only 128 characters, each fitting in one byte, so encoding the code point directly is very convenient. The Unicode character set, however, holds well over a hundred thousand characters.
For example, the character 💩 (U+1F4A9) has this code point in Unicode:

Decimal:
128169
Binary:
11111010010101001
The binary form has 17 bits. You cannot simply write these bits right after the bits of the previous character, because there would be no way to tell where each character starts and ends.
UTF-32 therefore stores every character in 32 bits, that is four bytes, padding the unused high-order bits with zeros.
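A short Python sketch of this padding, using the code point 128169 from the example above. The choice of the big-endian codec `utf-32-be` is an assumption made here so that the bytes appear in the same order as the binary digits and no byte order mark is added:

```python
code_point = 128169                        # example code point from the text

bits = format(code_point, "b")             # significant bits only
padded = format(code_point, "032b")        # zero-filled on the left to 32 bits
utf32 = chr(code_point).encode("utf-32-be")

print(len(bits), bits)     # 17 11111010010101001
print(padded)              # 00000000000000011111010010101001
print(utf32.hex(" "))      # 00 01 f4 a9
```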

32 bits are enough to hold every character in the Unicode character set, and the fixed length lets the computer find the boundary of each character (every 4 bytes). The cost falls on English users: with the ASCII character set each character took one byte, but with UTF-32 each character takes 4 bytes, so files grow to 4 times their size.
For Chinese users, a Chinese character takes only 2 bytes in GBK but 4 bytes in UTF-32, so files double in size. This is clearly unacceptable.
UTF-8 was created to use space more efficiently. It is a variable-length encoding of Unicode that stores a character in 1 to 4 bytes, depending on the character.
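The size differences described above can be checked with Python's built-in codecs. This is only a sketch: the sample strings and the helper `encoded_sizes` are invented for the illustration.

```python
def encoded_sizes(text, encodings):
    """Return the number of bytes `text` occupies in each named encoding."""
    return {name: len(text.encode(name)) for name in encodings}


# "utf-32-be" is used so that Python does not prepend a byte order mark.
english = encoded_sizes("hello", ["ascii", "utf-32-be", "utf-8"])
chinese = encoded_sizes("你好", ["gbk", "utf-32-be", "utf-8"])

# UTF-8 byte count per character, from a plain letter up to an emoji.
utf8_lengths = [len(ch.encode("utf-8")) for ch in "Aé中💩"]

print(english)       # {'ascii': 5, 'utf-32-be': 20, 'utf-8': 5}
print(chinese)       # {'gbk': 4, 'utf-32-be': 8, 'utf-8': 6}
print(utf8_lengths)  # [1, 2, 3, 4]
```

English text costs nothing extra in UTF-8, while Chinese text sits between GBK and UTF-32.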

The specific rules are:
![UTF-8 encoding rules](/img/e7/5c84bd40054832b6ac7102ac159a70.jpg)
Put simply, look at how the first byte of a character begins:
0: the character is one byte long
110: the character is two bytes long
1110: the character is three bytes long
11110: the character is four bytes long
Every byte after the first begins with 10.
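These prefix rules are enough to cut a UTF-8 byte stream into characters. The following is a sketch only: the function names are hypothetical, and it checks prefixes but not the other validity rules a real decoder enforces (such as rejecting overlong forms).

```python
def sequence_length(first_byte: int) -> int:
    """Number of bytes in the character that starts with `first_byte`."""
    if first_byte >> 7 == 0b0:
        return 1
    if first_byte >> 5 == 0b110:
        return 2
    if first_byte >> 4 == 0b1110:
        return 3
    if first_byte >> 3 == 0b11110:
        return 4
    raise ValueError(f"{first_byte:08b} is not a valid leading byte")


def split_characters(data: bytes) -> list:
    """Split UTF-8 bytes into one bytes object per character."""
    chunks = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chunk = data[i:i + n]
        # Every byte after the first must begin with the bits 10.
        if len(chunk) != n or any(b >> 6 != 0b10 for b in chunk[1:]):
            raise ValueError("truncated or malformed sequence")
        chunks.append(chunk)
        i += n
    return chunks


parts = split_characters("A中💩".encode("utf-8"))
print([p.hex(" ") for p in parts])  # ['41', 'e4 b8 ad', 'f0 9f 92 a9']
```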
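Range calculation (largest code point for each length, from the bits left over after the prefixes):
1 byte: 2^(8-1) - 1 = 127
2 bytes: 2^(16-5) - 1 = 2047
3 bytes: 2^(24-8) - 1 = 65535
4 bytes: 2^(32-11) - 1 = 2097151, although Unicode itself stops at 1114111 (U+10FFFF)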
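The same arithmetic can be written out and used to build a small encoder. This is a hand-rolled sketch for illustration, not how any particular library implements UTF-8: `max_code_point` and `utf8_encode` are made-up names, and surrogate code points are not rejected. Note that four bytes leave 21 payload bits, so the arithmetic limit is 2097151, while Unicode caps code points at 1114111 (U+10FFFF).

```python
def max_code_point(n_bytes: int) -> int:
    """Largest value that fits in the payload bits of an n-byte sequence."""
    if n_bytes == 1:
        marker_bits = 1                                   # leading "0"
    else:
        # leading prefix (110, 1110, 11110) plus "10" on each following byte
        marker_bits = (n_bytes + 1) + 2 * (n_bytes - 1)
    return 2 ** (8 * n_bytes - marker_bits) - 1


def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by hand, picking the length from the ranges."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if code_point <= max_code_point(1):
        return bytes([code_point])
    n = next(k for k in (2, 3, 4) if code_point <= max_code_point(k))
    tail = []
    for _ in range(n - 1):
        tail.append(0b10000000 | (code_point & 0b111111))  # 10xxxxxx
        code_point >>= 6
    lead_prefix = (0xFF << (8 - n)) & 0xFF                  # 110..., 1110..., 11110...
    return bytes([lead_prefix | code_point] + tail[::-1])


print([max_code_point(n) for n in (1, 2, 3, 4)])  # [127, 2047, 65535, 2097151]
print(utf8_encode(128169).hex(" "))               # f0 9f 92 a9
```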
Advantages:
Compatible with ASCII
Variable length, which saves space
Extensible, with room for characters added in the future
Disadvantages:
A Chinese character needs 3 bytes, compared with 2 bytes in GBK
Working out character lengths is inconvenient, because the byte length of each character has to be read from its leading byte
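The last disadvantage can be illustrated by counting characters in raw bytes. With a fixed-width encoding the count is a division; with UTF-8 every byte has to be inspected. A sketch, with an invented sample string and a hypothetical helper `count_characters`:

```python
def count_characters(data: bytes) -> int:
    """Count characters in UTF-8 data by skipping bytes that begin with 10."""
    return sum(1 for b in data if b >> 6 != 0b10)


text = "UTF-8 编码"
utf8_data = text.encode("utf-8")
utf32_data = text.encode("utf-32-be")   # big-endian, no byte order mark

print(len(utf8_data), count_characters(utf8_data))  # 12 8
print(len(utf32_data), len(utf32_data) // 4)        # 32 8
```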
Additions are welcome.