Relationship between Unicode and UTF-8
2022-07-27 00:01:00 【51CTO】
ASCII code
Inside a computer, all information is ultimately stored as binary values. Each binary digit (bit) has two states, 0 and 1, so eight bits can be combined into 256 states. This group of eight bits is called a byte. In other words, one byte can represent 256 different states, each corresponding to one symbol, for 256 symbols in total, from 00000000 to 11111111.
In the 1960s, the United States developed a character encoding that standardized the relationship between English characters and binary values. It is known as ASCII and is still in use today.
ASCII defines codes for 128 characters. For example, the space character SPACE is 32 (binary 00100000) and the capital letter A is 65 (binary 01000001). These 128 symbols (including 32 non-printable control characters) use only the low 7 bits of a byte; the leading bit is always 0.
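The two codes quoted above can be checked with a short Python sketch. It relies only on the built-in `ord` and `format`; the choice of characters simply mirrors the examples in the text.

```python
# Print the ASCII code of the two example characters
# in decimal and as a full 8-bit binary string.
for ch in [" ", "A"]:
    code = ord(ch)               # numeric code of the character
    bits = format(code, "08b")   # zero-padded to one whole byte
    print(repr(ch), code, bits)

# Every ASCII code is below 128, so the leading bit of the byte is 0.
leading_bits = {format(c, "08b")[0] for c in range(128)}
print(leading_bits)
```

The set printed at the end contains a single element, which is the "first bit is always 0" property described above.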
Non-ASCII encodings
128 symbols are enough to encode English, but not other languages. In French, for example, letters can carry accent marks, which ASCII cannot represent. So some European countries decided to use the spare highest bit of the byte to encode new symbols. For example, é in French was assigned the code 130 (binary 10000010). In this way, the encodings used by these European countries could represent up to 256 symbols.
This created a new problem, however. Different countries use different letters, so even though they all used 256-symbol encodings, the same code stood for different letters. For example, 130 represents é in the French encoding, the letter Gimel (ג) in the Hebrew encoding, and yet another symbol in the Russian encoding. In all of these encodings, though, codes 0-127 represent the same symbols; only codes 128-255 differ.
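A rough way to see this ambiguity is to decode the single byte 130 under several legacy code pages. The article does not name the encodings it has in mind; the sketch below assumes the DOS code pages `cp437`, `cp862` and `cp866` (all available as Python codecs) as stand-ins for the "French", "Hebrew" and "Russian" encodings.

```python
raw = bytes([130])  # the single byte 10000010

# Assumed stand-ins for the encodings mentioned in the text;
# the article itself does not say which code pages it means.
decoded = {codec: raw.decode(codec) for codec in ("cp437", "cp862", "cp866")}
for codec, ch in decoded.items():
    print(codec, ch)

# Bytes 0-127 decode to the same text under all three code pages.
low = bytes(range(128))
same_low_half = low.decode("cp437") == low.decode("cp862") == low.decode("cp866")
print(same_low_half)
```

The same byte yields three different letters, while the lower half of the byte range is shared, which is the situation described above.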
The scripts of Asian countries use even more symbols; there are roughly 100,000 Chinese characters alone. One byte can represent only 256 symbols, which is clearly not enough, so a symbol has to be expressed with more than one byte. For example, the common encoding for simplified Chinese is GB2312, which uses two bytes per Chinese character, so in theory it can represent at most 256 x 256 = 65536 symbols.
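As a small illustration, Python's built-in `gb2312` codec can show the two-byte representation. The character chosen here (yan, U+4E25) is only an example.

```python
ch = "严"  # example Chinese character (yan, U+4E25)
encoded = ch.encode("gb2312")

# GB2312 stores this character in two bytes.
print(len(encoded), encoded.hex())

# Two bytes allow at most 256 * 256 distinct values.
upper_bound = 256 * 256
print(upper_bound)
```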
Chinese encodings deserve a separate discussion and are not covered in this note. The only point to make here is that, although they also use multiple bytes per symbol, the GB family of Chinese encodings is unrelated to the Unicode and UTF-8 discussed below.
Unicode
As the previous section explained, there are many encodings in the world, and the same binary value can be interpreted as different symbols. To open a text file, therefore, you have to know how it is encoded; if you read it with the wrong encoding, you get garbled text. Why is e-mail so often garbled? Because the sender and the receiver use different encodings.
Now imagine a single encoding that includes every symbol in the world and gives each one a unique code. The confusion would disappear. That is Unicode: as its name suggests, it is one code for all symbols.
Unicode is of course a very large set; at its present size it can hold more than one million symbols. Each symbol has its own code: U+0639 is the Arabic letter Ain, U+0041 is the English capital letter A, and U+4E25 is the Chinese character 严 (yan). The full symbol tables can be looked up at unicode.org, or in a dedicated table of Chinese characters.
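The three code points quoted above can be looked up directly in Python, whose `str` type is indexed by Unicode code point. This is only a quick check of the numbers in the text.

```python
# Map each example character to its Unicode code point in U+XXXX notation.
labels = {ch: f"U+{ord(ch):04X}" for ch in ["ع", "A", "严"]}
for ch, label in labels.items():
    print(label, ch)

# chr() goes the other way, from code point to character.
print(chr(0x4E25))
```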
The problem with Unicode
Note that Unicode is only a set of symbols. It specifies the binary code of each symbol, but not how that binary code should be stored.
For example, the Unicode code of the Chinese character 严 (yan) is the hexadecimal number 4E25, which takes 15 bits in binary (100111000100101). In other words, representing this symbol requires at least 2 bytes. Symbols with larger codes may need 3 or 4 bytes, or even more.
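The arithmetic can be verified with a few lines of Python. The variable names are illustrative only.

```python
code_point = 0x4E25

binary = bin(code_point)[2:]        # binary digits without the "0b" prefix
bits = code_point.bit_length()      # number of significant bits
min_bytes = (bits + 7) // 8         # whole bytes needed to hold those bits

print(binary)
print(bits, min_bytes)
```

Fifteen bits do not fit in one byte, so two bytes is the minimum for this code point.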
This raises two serious problems. First, how can Unicode be told apart from ASCII? How does a computer know that three bytes represent one symbol rather than three separate symbols? Second, we already know that one byte is enough for an English letter. If Unicode required every symbol to be stored in three or four bytes, every English letter would have to be preceded by two or three bytes of zeros. That is a huge waste of storage: text files would become two or three times larger, which is unacceptable.
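The second problem can be made concrete with a rough sketch. It assumes a fixed-width scheme of four bytes per symbol and uses Python's `utf-32-be` codec as a stand-in for it; the sample string is arbitrary.

```python
text = "Hello, world"

single = text.encode("ascii")      # one byte per symbol
fixed = text.encode("utf-32-be")   # four bytes per symbol, no byte-order mark

print(len(single), len(fixed))

# Each English letter is padded with three leading zero bytes.
padded_a = "A".encode("utf-32-be")
print(padded_a)
```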
This had two results: 1) several storage schemes for Unicode appeared, that is, there are many different binary formats that can represent Unicode; 2) for a long time Unicode could not be widely adopted, until the rise of the Internet.
UTF-8
The spread of the Internet created a strong demand for a single, unified encoding. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 (characters are represented by two or four bytes) and UTF-32 (characters are represented by four bytes), but they are hardly used on the Internet. To repeat, the relationship is this: UTF-8 is one of the implementations of Unicode.
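The byte counts of the three implementations can be compared with Python's built-in codecs. The big-endian variants are used here so that no byte-order mark is added; the emoji (U+1F600) is an extra example that does not come from the article, included only to show a four-byte UTF-16 case.

```python
samples = ["A", "严", "\U0001F600"]  # the last one is an added example
codecs_to_try = ("utf-8", "utf-16-be", "utf-32-be")

# sizes[(character, codec)] = number of bytes in the encoded form
sizes = {
    (ch, codec): len(ch.encode(codec))
    for ch in samples
    for codec in codecs_to_try
}
for (ch, codec), size in sizes.items():
    print(f"U+{ord(ch):04X}", codec, size)
```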
The defining feature of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes to represent a symbol, and the number of bytes varies with the symbol.
The encoding rules of UTF-8 are very simple. There are only two:
1) For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code. For English letters, therefore, the UTF-8 encoding is identical to the ASCII code.
2) For an n-byte symbol (n > 1), the first n bits of the first byte are all set to 1 and bit n + 1 is set to 0; the first two bits of every following byte are set to 10. All the remaining bits are filled with the symbol's Unicode code.
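The two rules above can be sketched as a small encoder. This is a minimal illustration, not a production implementation: the function name `utf8_encode` is hypothetical, the range boundaries (0x80, 0x800, 0x10000) come from the UTF-8 definition rather than from this article, and surrogate code points are not rejected as a real encoder would do.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point following the two rules above."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point < 0x80:
        # Rule 1: one byte, leading bit 0, the other 7 bits hold the code.
        return bytes([code_point])

    # Rule 2: pick the byte count n from the size of the code point.
    if code_point < 0x800:
        n = 2
    elif code_point < 0x10000:
        n = 3
    elif code_point < 0x110000:
        n = 4
    else:
        raise ValueError("code point out of Unicode range")

    # Following bytes: 10xxxxxx, filled starting from the lowest 6 bits.
    tail = []
    for _ in range(n - 1):
        tail.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6

    # First byte: n ones, then a zero, then the remaining high bits.
    prefix = (0xFF << (8 - n)) & 0xFF
    first = prefix | code_point
    return bytes([first] + tail[::-1])


print(utf8_encode(0x41).hex())    # single byte, same as ASCII
print(utf8_encode(0x4E25).hex())  # three bytes for the character yan
```

For U+4E25 the sketch produces the bytes e4 b8 a5, which matches what Python's own `"严".encode("utf-8")` returns.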