
While queuing to pick up a parcel, I used the time to sort out the common character encodings

2022-06-11 17:44:00 【Poplar branch】

Tackling four encodings: ANSI, ASCII, Unicode, and UTF-8

Foreword

Mention character sets, and most of us think back to the days of learning to program, reading someone else's code, or doing web development: you run the page in the browser and, wait, why is it all garbled?

When in doubt, ask Baidu.

A quick search tells us the character sets do not match. Fix it quickly, problem solved, and back to studying.

But I wonder whether anyone else is like me: when I see garbled text I know to switch the encoding to UTF-8, I know the ASCII code of the lowercase a is 97, and I know the ASCII code of the uppercase A is 65. But ask me what UTF-8 actually is, or how these encodings differ, and I really cannot say.

Okay, enough chatter. We are serious people, so let's solve today's problem: how to tell these encodings apart systematically.

The difference between bytes and characters

1. Bytes

Bytes?! I know this one: ByteDance.

Honestly, I dream of landing a job at a big tech company too.

Back to the topic.

In a computer, the byte is the unit used to measure storage capacity. It is the capital "B" at the end of common storage units such as MB and GB. The smallest unit of information a computer stores is the bit: a single '0' or '1' is one bit. The relationship between the two is that eight bits make one byte:
1 Byte = 8 bits

2. Characters

Characters are the letters, digits, and symbols used in a computer. For example, "1, 2, 3, A, B, C, ~ ! # ￥ % * ( ) +" are all characters.
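To make the byte/character distinction concrete, here is a minimal Python sketch. The sample string and the variable names are my own choices for illustration, and UTF-8 (covered later in this post) is used only as one example of an encoding.

```python
# One character is not always one byte: the byte count depends on the encoding.
text = "A中"                        # two characters: a Latin letter and a Chinese character

char_count = len(text)              # number of characters
utf8_bytes = text.encode("utf-8")   # the same text as raw bytes
byte_count = len(utf8_bytes)        # number of bytes

BITS_PER_BYTE = 8
print(char_count, "characters")
print(byte_count, "bytes =", byte_count * BITS_PER_BYTE, "bits")
print(list(utf8_bytes))             # every byte is a number between 0 and 255
```

The same two characters would occupy a different number of bytes under another encoding, which is exactly why the encoding has to be known before bytes can be read as text.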

ASCII code

ASCII is probably the simplest and most innocent of the four encodings, and the one that science and engineering students meet most often. Freshman C homework almost certainly included turning a lowercase a into an uppercase A.

ASCII stands for American Standard Code for Information Interchange.

In ASCII, every English letter, upper or lower case, takes up one byte. ASCII itself cannot represent Chinese characters; in the Chinese extension GB2312 described in the ANSI section, one Chinese character takes up two bytes.

ASCII can represent 128 characters.

What I remember best about ASCII is the ASCII table.
A few values are worth memorizing: the ASCII code of 0 is 48, of A is 65, and of a is 97. The others, such as 1, b, and B, follow by adding 1 each time. Beyond that there is nothing else about ASCII you really need to memorize (just kidding).
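The values above, and the classic lowercase-to-uppercase homework, can be checked with a short Python sketch. `to_upper_ascii` is a hypothetical helper written for this post, not a standard function; in real code you would simply call `str.upper()`.

```python
# ord() returns the code of a character; chr() goes the other way.
zero_code = ord("0")
upper_a_code = ord("A")
lower_a_code = ord("a")

def to_upper_ascii(ch: str) -> str:
    """Uppercase a single ASCII letter by arithmetic on its code."""
    if "a" <= ch <= "z":
        # Lowercase and uppercase letters sit a fixed distance apart in the table.
        return chr(ord(ch) - (ord("a") - ord("A")))
    return ch  # anything that is not a lowercase ASCII letter is left unchanged

print(zero_code, upper_a_code, lower_a_code)
print(to_upper_ascii("a"), to_upper_ascii("b"), to_upper_ascii("1"))
```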

[Image: ASCII code table]

I have put the table here so you do not need to look it up on Baidu. You are probably sneaking a look at CSDN during a meeting anyway, so there is no need to switch over to Baidu.

ANSI code

ANSI encodings are extensions of ASCII, created because the 128 characters of ASCII are not enough to meet our needs.

An ANSI encoding uses one byte in the range 0x00~0x7F (decimal 0 to 127) for each English character, and multi-byte values in the range 0x80~0xFFFF for the characters of other languages. In other words, only the first 128 codes (0-127) of an ANSI encoding are identical to ASCII; the codes after that represent the characters of one particular national language.

"ANSI" is actually an umbrella term for many encodings. China created GB2312 to encode Chinese, Japan encodes Japanese with Shift_JIS, and South Korea encodes Korean with EUC-KR; every country has its own standard. Because of the constraints of the time, the ANSI encodings of different languages are not compatible with one another, so text that mixes several languages ends up garbled.
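The garbling is easy to reproduce. The sketch below is only an illustration under my own assumptions: it uses Python's built-in `gb2312` and `shift_jis` codecs and a sample string I picked. It writes Chinese text as GB2312 bytes and then reads those bytes back as if they were Shift_JIS.

```python
text = "中文"

gb_bytes = text.encode("gb2312")       # two bytes per Chinese character
round_trip = gb_bytes.decode("gb2312") # the right codec gives the original back

# Reading the same bytes with another country's encoding gives different characters.
# errors="replace" avoids an exception if a byte sequence is invalid in Shift_JIS.
garbled = gb_bytes.decode("shift_jis", errors="replace")

# The ASCII range (0-127) is shared, so plain English survives either way.
ascii_part = "abc".encode("gb2312")

print(gb_bytes.hex(" "))
print(round_trip)
print(garbled)
print(ascii_part)
```

The bytes themselves never change; only the table used to interpret them does, and that is where the garbled text comes from.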

Unicode code

Unicode was born to resolve the conflicts between the ANSI encodings of different countries: if every symbol in the world is given one unique code, the confusion disappears.

The Unicode standard keeps evolving. Strictly speaking, Unicode assigns each character a number (a code point); the representation most people have in mind is the 16-bit one (UCS-2/UTF-16), which uses two bytes for most characters and 4 bytes for very rare ones. Modern operating systems and most programming languages support Unicode directly.

The problem is that an English letter, which used to need only one byte, now needs two bytes in this Unicode representation (the rule is to pad the letter's ASCII code with a leading zero byte), and that wastes space. So is there an encoding that eliminates garbled text without the waste? This is where our lovely UTF-8 comes in.
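The padding described here can be seen with Python's `utf-16-be` codec, which I am using as a stand-in for the "two bytes per character" form (big-endian, no byte-order mark). The rare character U+20000 is just an example I chose.

```python
letter = "A".encode("utf-16-be")          # ASCII code 0x41 padded with a leading zero byte
hanzi = "中".encode("utf-16-be")           # a common Chinese character: two bytes
rare = "\U00020000".encode("utf-16-be")   # a rare CJK character: four bytes

print(letter.hex(" "), len(letter))
print(hanzi.hex(" "), len(hanzi))
print(rare.hex(" "), len(rare))

# An English-only text doubles in size compared with ASCII.
english = "hello"
print(len(english.encode("ascii")), "bytes in ASCII")
print(len(english.encode("utf-16-be")), "bytes in UTF-16")
```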

UTF-8 code

UTF-8 is a variable-length encoding: it uses 1 to 4 bytes per symbol, and the number of bytes depends on the symbol.

When a character falls within the ASCII range, it is encoded as a single byte that is identical to its ASCII encoding. In this sense UTF-8 can also be regarded as an extension of ASCII.

What's more interesting:
In the two-byte Unicode form, a common Chinese character takes 2 bytes, while in UTF-8 it takes 3 bytes. Converting from Unicode to UTF-8 is not a direct byte-for-byte copy; it follows an algorithm with specific rules.
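As a rough sketch of those rules, the three-byte case packs the 16 bits of the code point into the pattern 1110xxxx 10xxxxxx 10xxxxxx. `encode_3byte_utf8` below is a hypothetical helper written for this post; it covers only that one case and does not reject surrogate code points, so it is not a full encoder.

```python
def encode_3byte_utf8(ch: str) -> bytes:
    """Encode one character from U+0800..U+FFFF by hand (illustration only)."""
    cp = ord(ch)
    if not 0x0800 <= cp <= 0xFFFF:
        raise ValueError("this sketch only handles the three-byte range")
    return bytes([
        0xE0 | (cp >> 12),          # 1110xxxx: top 4 bits of the code point
        0x80 | ((cp >> 6) & 0x3F),  # 10xxxxxx: middle 6 bits
        0x80 | (cp & 0x3F),         # 10xxxxxx: low 6 bits
    ])

manual = encode_3byte_utf8("中")
builtin = "中".encode("utf-8")
print(manual.hex(" "), builtin.hex(" "))

# Byte length per character: ASCII letter, accented letter, Chinese character, rare CJK character.
lengths = [len(ch.encode("utf-8")) for ch in ["A", "\u00e9", "中", "\U00020000"]]
print(lengths)
```

The leading bits of the first byte tell a decoder how many bytes belong to the character, which is what lets the length vary from symbol to symbol.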

A short summary

In computer memory, text is generally handled as Unicode; when it needs to be saved to disk or transmitted, it is converted to UTF-8.

When you edit a file in Notepad, the UTF-8 bytes read from the file are converted to Unicode characters in memory; when you save after editing, the Unicode characters are converted back to UTF-8 and written to the file. This is a clever approach: the character set is unified, the garbled-text problem is solved, and space is saved.
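The same read, edit, and save cycle can be sketched in Python, where `str` plays the role of the in-memory Unicode text. The file name `note.txt`, the temporary directory, and the sample text are assumptions made for this example; this is not how Notepad itself is implemented.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "note.txt")

# Create a small UTF-8 file to stand in for an existing document.
with open(path, "w", encoding="utf-8") as f:
    f.write("Hi 中文")

with open(path, "rb") as f:
    raw = f.read()                 # what is on disk: UTF-8 bytes

with open(path, encoding="utf-8") as f:
    text = f.read()                # what is in memory: Unicode characters

edited = text + "!"                # edit the in-memory characters

with open(path, "w", encoding="utf-8") as f:
    f.write(edited)                # saving converts the characters back to UTF-8

size = os.path.getsize(path)
print(len(raw), "bytes on disk before editing")
print(len(text), "characters in memory")
print(size, "bytes on disk after editing")
```

Passing `encoding="utf-8"` explicitly matters: without it, Python falls back to a platform-dependent default, which is one common source of garbled text.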
