Why do we use UTF-8 encoding?
2022-07-07 14:54:00 【nsnsttn】
Computers can only work with binary numbers, so the letters, digits, Chinese characters, symbols, emoji and so on that we use and see all have to be converted into binary in some way for storage, and converted back into the corresponding letters, digits, Chinese characters, symbols and emoji when we use them.
That is how ASCII came about.
It supports:
0-9
a-z
A-Z
!"#$%^&*()
and so on, 128 characters in total.
Each character has a corresponding code point, a number between 0 and 127. The collection of all supported characters together with their code points is called a character set, and this one is the ASCII character set.
Going from the character A to the binary number 01000001 is called encoding.
Conversely,
going from the binary number 01000001 to the character A is called decoding.
However, ASCII supports only common symbols and English letters, not the scripts and newer symbols used in other countries, so other regions began to develop their own encoding standards. Mainland China has GB2312, Hong Kong, Macao and Taiwan have Big5, and GBK later extended GB2312 to cover Simplified and Traditional Chinese together with the ideographs used in Japanese and Korean. As a result, the same article could be written with one encoding and viewed with another, and the text came out garbled. Something was needed to solve this problem and unify all characters.
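The confusion caused by writing with one encoding and reading with another can be reproduced with a small sketch. The sample text and the choice of GBK for writing and Big5 for reading are assumptions made only for illustration.

```python
text = "中文"                         # sample text, chosen only for illustration
gbk_bytes = text.encode("gbk")        # the author saves the article as GBK

# A reader whose software assumes Big5 interprets the same bytes differently.
# errors="replace" keeps the demo from raising if a byte pair is invalid in Big5.
wrong = gbk_bytes.decode("big5", errors="replace")

# Reading with the encoding that was used for writing recovers the text.
right = gbk_bytes.decode("gbk")

print(repr(wrong), repr(right))
```

The bytes on disk never change; only the table used to interpret them does, which is why the reader sees different characters.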
The Unicode character set was created for exactly that, and it currently contains more than 140,000 characters. Note that a character set is only a collection of characters and their code points; it does not say that characters are stored in the computer as those code points. It is the character encoding that actually defines the mapping from characters to the bytes stored in the computer.
Of course, the simplest encoding is to store each character's code point directly in binary, which is what ASCII and UTF-32 do.
ASCII has only 128 characters, each fitting in one byte, so using the code point directly as the encoding is very convenient. The Unicode character set, however, holds well over a hundred thousand characters.
Take the emoji 💩 (U+1F4A9) as an example. In Unicode its code point is
Decimal:
128169
Binary:
11111010010101001
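These numbers can be checked with a short sketch. The character is written as the escape `\U0001F4A9`, which is simply the code point 128169 in hexadecimal.

```python
char = "\U0001F4A9"                  # the character whose code point is 128169
code_point = ord(char)               # code point as a decimal integer
binary = format(code_point, "b")     # the same number in binary, no padding

print(code_point, hex(code_point), binary, len(binary))
```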
The binary form has 17 bits, and these bits cannot simply be placed right after the bits of another character, because then there would be no way to tell where each character starts and ends.
UTF-32 therefore stores every character in 32 bits, a fixed length of four bytes, padding the unused high bits with zeros.
32 bits are enough to hold every character in the Unicode character set, and the fixed length also tells the computer exactly where each character ends (every 4 bytes). But this hurts English users: they used to use the ASCII character set, where each character takes only one byte, and with UTF-32 each character must take 4 bytes, so file sizes grow to 4 times the original.
For Chinese users, a Chinese character takes only 2 bytes in GBK but 4 bytes in UTF-32, so file sizes double. This is clearly unacceptable, and a more space-efficient encoding was needed.
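The size growth described above can be measured with a small sketch. The sample strings are arbitrary, and `utf-32-be` is used instead of `utf-32` because Python's plain `utf-32` codec prepends a 4-byte byte order mark that would distort the comparison.

```python
english = "hello"
chinese = "中文"

ascii_size = len(english.encode("ascii"))          # 1 byte per character
utf32_english = len(english.encode("utf-32-be"))   # 4 bytes per character
gbk_size = len(chinese.encode("gbk"))              # 2 bytes per character
utf32_chinese = len(chinese.encode("utf-32-be"))   # 4 bytes per character

# Zero padding of the high bits: the code point of "A" fills only the last byte.
padded = "A".encode("utf-32-be").hex()

print(ascii_size, utf32_english, gbk_size, utf32_chinese, padded)
```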
That is how UTF-8 was born. UTF-8 is a variable-length encoding for Unicode that stores each character in 1 to 4 bytes, depending on the character.
The specific rules are as follows.
Simply put, look at the leading bits of the first byte:
starts with 0: the character occupies one byte
starts with 110: the character occupies two bytes
starts with 1110: the character occupies three bytes
starts with 11110: the character occupies four bytes
Every following (continuation) byte starts with 10.
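The prefix rules can be sketched as a hand-written encoder plus a helper that reads the length from the first byte. `utf8_encode` and `sequence_length` are hypothetical names for this illustration, and the sketch skips the validation a real encoder performs (surrogates and code points above U+10FFFF are not rejected). The result is compared against Python's built-in `str.encode("utf-8")`.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by hand, following the prefix rules."""
    if code_point < 0x80:                      # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                   # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


def sequence_length(first_byte: int) -> int:
    """Read a character's byte length from the leading bits of its first byte."""
    if first_byte >> 7 == 0b0:
        return 1
    if first_byte >> 5 == 0b110:
        return 2
    if first_byte >> 4 == 0b1110:
        return 3
    if first_byte >> 3 == 0b11110:
        return 4
    raise ValueError("not a valid UTF-8 leading byte")


for ch in ["A", "é", "中", "\U0001F4A9"]:
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")       # agrees with the built-in encoder
    print(f"U+{ord(ch):04X}", [format(b, "08b") for b in encoded],
          sequence_length(encoded[0]))
```

Because every continuation byte starts with 10 and no leading byte does, a decoder that lands in the middle of a character can skip forward to the next leading byte and resynchronize.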
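Range calculation (the largest code point each length can hold, counting only the bits left after the prefixes):
1 byte: 2^(8-1) - 1 = 127
2 bytes: 2^(16-5) - 1 = 2047
3 bytes: 2^(24-8) - 1 = 65535
4 bytes: 2^(32-11) - 1 = 2097151, although Unicode itself stops at code point 1114111 (U+10FFFF)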
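The arithmetic can be checked with a short sketch that subtracts the prefix bits from the total bits and then confirms the boundaries against Python's built-in encoder. The names `payload_bits`, `max_by_length` and `UNICODE_MAX` are illustrative.

```python
# Payload bits = total bits minus prefix bits
# (0 | 110 + 10 | 1110 + 10 + 10 | 11110 + 10 + 10 + 10).
payload_bits = {1: 8 - 1, 2: 16 - 5, 3: 24 - 8, 4: 32 - 11}
max_by_length = {n: 2 ** bits - 1 for n, bits in payload_bits.items()}
print(max_by_length)

UNICODE_MAX = 0x10FFFF  # highest code point Unicode defines

# The largest value of each length still fits; one more needs an extra byte.
for n in (1, 2, 3):
    top = max_by_length[n]
    assert len(chr(top).encode("utf-8")) == n
    assert len(chr(top + 1).encode("utf-8")) == n + 1

# Four bytes could hold more than Unicode uses, so the real limit is UNICODE_MAX.
assert UNICODE_MAX < max_by_length[4]
assert len(chr(UNICODE_MAX).encode("utf-8")) == 4
```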
Advantages:
Compatible with ASCII
Variable length, which saves space
Good extensibility, with room for more characters in the future
Disadvantages:
Chinese characters need 3 bytes, more than the 2 bytes GBK uses
Working out the byte length of a character is less convenient than with a fixed-length encoding
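The trade-offs listed above can be observed directly with a small sketch. The sample strings are arbitrary choices for illustration.

```python
english = "hello"
chinese = "中文"

# Advantage: ASCII text is byte-for-byte identical in UTF-8.
ascii_compatible = english.encode("ascii") == english.encode("utf-8")

# Disadvantage: these Chinese characters take 3 bytes each in UTF-8, 2 in GBK.
utf8_bytes = len(chinese.encode("utf-8"))
gbk_bytes = len(chinese.encode("gbk"))

# Disadvantage: character count and byte count differ, so the position of the
# n-th character cannot be found by simple multiplication.
mixed = "a中\U0001F4A9"
char_count = len(mixed)
byte_count = len(mixed.encode("utf-8"))

print(ascii_compatible, utf8_bytes, gbk_bytes, char_count, byte_count)
```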
Feel free to add more ~