The Story of 【锟斤拷】: Chinese Character Encoding and Common Character Sets

2022-07-06 16:36:00 【Ruo Miaoshen】

Table of contents

(1) Encoding
(2) Why garbled text appears
(3) Avoiding garbled text when reading and writing files
(4) Extended discussion: Oracle character sets
(4) Extended discussion: FTP character encoding

Several of my earlier articles mentioned garbled Chinese characters. It really is a problem that has plagued us for a long time.
Whatever the development language, encoding problems can show up in disk files, databases, and network transmission.

(1) Encoding

Computers have no notion of text. They only recognize binary 0s and 1s (which we usually write in hexadecimal, such as 0xFF, for convenience).
So displaying any text, even English letters, requires an encoding, and it was humans who invented these encodings.
To keep the chatter short:
PS: The content and pictures come from Baidu and other websites (links are given where I could find them).

1.1 ASCII code

ASCII = American Standard Code for Information Interchange

A single byte represents one character. The highest bit is 0, and the combinations of the other 7 bits represent the English letters and symbols, for example:

Maximum: 0111 1111, 0x7F
HEX: 41 42 43 44 2C 31 32 33 34 = "ABCD,1234"

For English, 128 codes are enough to represent all letters and symbols, but they are not enough for other languages.

1.2 Extended ASCII

Put the highest bit to use as well. In French, for example, é is encoded as 130 (binary 10000010).
This way, the encodings used by these European countries can represent at most 256 symbols.

Maximum: 1111 1111, 0xFF

But different countries have different letters, so even though they all use 256-symbol encodings, the letters those codes stand for differ. For example:

Byte 130 stands for é in the French encoding,
for the letter Gimel (ג) in the Hebrew encoding,
and for yet another symbol in the Russian encoding.

In all of these encodings, however, the symbols for 0–127 are the same. Only the 128–255 range differs.
PS: To know which character a given code represents, we must know which character set the text uses.
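
To make the "same byte, different letter" point concrete, here is a minimal sketch that decodes byte 130 (0x82) under three single-byte code pages. The article does not say which encodings it has in mind. The charset names IBM437, IBM862 and IBM866 (DOS code pages for Western European, Hebrew and Russian text) are my assumption, chosen because they match the description, and the class and method names are purely illustrative.

```java
import java.nio.charset.Charset;

class Byte130Demo {
    // Decode the single byte 0x82 (decimal 130) with the named charset.
    static String decode130(String charsetName) {
        byte[] one = { (byte) 0x82 };
        return new String(one, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Assumed code pages: IBM437 (Western), IBM862 (Hebrew), IBM866 (Russian).
        for (String name : new String[] { "IBM437", "IBM862", "IBM866" }) {
            String s = decode130(name);
            System.out.printf("%s -> %s (U+%04X)%n", name, s, (int) s.charAt(0));
        }
    }
}
```

The byte never changes. Only the character set used to interpret it does, which is why the character set has to be known before the text can be read.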

1.3 Multi-byte encoding of Chinese characters (and other scripts)

Because there are both national standards (GB) and an international standard (Unicode), Chinese is relatively complicated.

GB2312: the national standard for Simplified Chinese character encoding (GB 2312-80), in effect since May 1, 1981. GB2312 encodes Chinese characters in two bytes and contains 7445 graphic characters, 6763 of which are Chinese characters.
BIG5: the Traditional Chinese standard character set used in Taiwan. It uses two-byte encoding, contains 13053 Chinese characters, and was introduced in 1984.
GBK: a Chinese character encoding specification issued in December 1995 as an extension of GB2312. It encodes Chinese characters in two bytes. The GBK character set contains 21003 Chinese characters, including all the Chinese, Japanese and Korean characters of the national standard GB13000-1 and all the Chinese characters in BIG5.
GB18030: the national standard for Chinese character encoding issued on March 17, 2000, as an extension of GBK. It covers Chinese, Japanese, Korean and the scripts of China's ethnic minorities, and contains 27484 Chinese characters. GB18030 encodes characters in one, two or four bytes, and is compatible with GBK and GB2312.
Unicode: the international standard character set. It assigns a unique code point to every character of the world's languages, to support cross-language and cross-platform text exchange. Its fixed-width encoding form (UTF-32, also known as UCS-4) uses four bytes per character.
UTF-8 and UTF-16: Unicode transformation formats with variable-length encoding, more space-efficient than the four-byte form. UTF-16 comes in big-endian and little-endian byte orders. PS: a Chinese character in UTF-8 usually takes three bytes.

The national standard encodings (character sets) evolved as follows, each one containing the two-byte codes of the one before it:
GB2312 ⊂ GBK ⊂ GB18030
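
As a small illustration of the byte counts and byte orders described above, the sketch below encodes the single character 中 (U+4E2D, written as an escape so the source file's own encoding does not matter) in several character sets and prints the bytes in hexadecimal. The class and helper names are made up for this example.

```java
import java.nio.charset.Charset;

class EncodedLengthDemo {
    // Encode the text with the named charset and return the bytes as "XX XX ..." hex.
    static String encodeHex(String text, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String zhong = "\u4e2d"; // the character U+4E2D
        String[] names = { "GBK", "GB18030", "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-32BE" };
        for (String name : names) {
            System.out.printf("%-8s: %s%n", name, encodeHex(zhong, name));
        }
    }
}
```

The two UTF-16 lines contain the same two bytes in opposite order, which is the big-endian versus little-endian difference mentioned above.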

1.4 Coding examples and tests

For instance, take four Chinese characters (see the reference website):

【我〇䶵𬌗】

我: a common character, encoded in every character set.
〇: not included in the early GB2312.
䶵: a Japanese kanji? (see the reference link) Not in GBK. Four bytes in GB18030, three bytes in UTF-8.
𬌗: the occlusal surface of a tooth (see the reference link). Not in GBK. Four bytes in GB18030, four bytes in UTF-8.

[Image: table of how the four characters are encoded in each character set]
Test it in Java with the following code:

    // Save and compile this file as UTF-8 (javac -encoding UTF-8) so the literal survives.
    import java.nio.charset.Charset;

    public class EncodingTest {
        public static void main(String[] args) {
            String aTestStr = "中文我〇䶵𬌗abc";
            // Encode the same string with each character set, dump the bytes,
            // then decode them back to see which characters survive.
            for (String name : new String[] {"UTF-8", "GB18030", "GBK", "GB2312"}) {
                Charset cs = Charset.forName(name);
                System.out.printf("%-7s：", name);
                byte[] encoded = aTestStr.getBytes(cs);
                for (byte b : encoded)
                    System.out.printf("%#02x,", b);
                System.out.println("\n" + new String(encoded, cs) + "\n");
            }
        }
    }

The output is as follows, consistent with the table above
(look closely, and ignore the extra characters around the four test characters):

UTF-8  ：0xe4,0xb8,0xad,0xe6,0x96,0x87,0xe6,0x88,0x91,0xe3,0x80,0x87,0xe4,0xb6,0xb5,0xf0,0xac,0x8c,0x97,0x61,0x62,0x63,
中文我〇䶵𬌗abc

GB18030：0xd6,0xd0,0xce,0xc4,0xce,0xd2,0xa9,0x96,0x82,0x35,0x87,0x38,0x99,0x31,0xd2,0x39,0x61,0x62,0x63,
中文我〇䶵𬌗abc

GBK    ：0xd6,0xd0,0xce,0xc4,0xce,0xd2,0xa9,0x96,0x3f,0x3f,0x61,0x62,0x63,
中文我〇??abc

GB2312 ：0xd6,0xd0,0xce,0xc4,0xce,0xd2,0x3f,0x3f,0x3f,0x61,0x62,0x63,
中文我???abc

(2) Why garbled text appears

2.1 Characters outside the encoding's range

As in the example above, GBK and GB2312 produce garbled output: the rare characters come out as "?" question marks.

If the bytes of a string hold codes that do not exist in the character set being used,
the display shows jumbled symbols and strange characters that nobody can read. We generally call this garbled text.

PS: An earlier article, 《Python: the 'GBK' encoder cannot encode character '\uXXYY' when writing to a text file》, was a case of exceeding the encoding's range.
That article is not quite right and I am too lazy to change it. As the test above shows, specifying GBK in Java does not make the problem go away either. You need GB18030!

2.2 The UTF-8 BOM

On Windows, some UTF-8 files start with three extra bytes, the Byte Order Mark (BOM). UTF-8 does not actually need a byte-order marker, so the BOM's only function here is to signal that the file is UTF-8.

This is not a universally accepted convention (look up BOM for the details). Linux, for example, does not recognize the BOM.
If the reader does not handle the BOM, a little garbled text shows up at the very beginning of the content.

The best approach is not to use a BOM. An example of Chinese UTF-8 data (with BOM) follows:

EF BB BF 41 42 43 31 32 33 2C E4 B8 AD E6 96 87 E6 B1 89 E5 AD 97
“ABC123,中文汉字”
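
When a file with a BOM does have to be read, one option is to skip the three bytes EF BB BF before decoding. The sketch below does this for the sample data above. The class, constant and method names are hypothetical, and it only handles the UTF-8 BOM.

```java
import java.nio.charset.StandardCharsets;

class BomDemo {
    // The sample bytes from the article: a UTF-8 BOM followed by the encoded text.
    static final byte[] SAMPLE = {
        (byte) 0xEF, (byte) 0xBB, (byte) 0xBF,
        0x41, 0x42, 0x43, 0x31, 0x32, 0x33, 0x2C,
        (byte) 0xE4, (byte) 0xB8, (byte) 0xAD, (byte) 0xE6, (byte) 0x96, (byte) 0x87,
        (byte) 0xE6, (byte) 0xB1, (byte) 0x89, (byte) 0xE5, (byte) 0xAD, (byte) 0x97
    };

    // Decode UTF-8 bytes, skipping a leading EF BB BF if it is present.
    static String decodeUtf8SkippingBom(byte[] data) {
        boolean hasBom = data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
        int offset = hasBom ? 3 : 0;
        return new String(data, offset, data.length - offset, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String naive = new String(SAMPLE, StandardCharsets.UTF_8);
        // Without skipping, the BOM survives as the invisible character U+FEFF.
        System.out.println("naive starts with U+FEFF: " + (naive.charAt(0) == '\uFEFF'));
        System.out.println(decodeUtf8SkippingBom(SAMPLE));
    }
}
```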

2.3 No Chinese support

For example, the operating system does not support Chinese, or no Chinese fonts are installed.
Even if the content is encoded correctly, the system does not know what GB18030 is and has no font that can display it, so the text can only show up as garbage.

Strictly speaking, this is not garbled text: the encoding is right, but the characters cannot be displayed (usually they show as boxes).

2.4 Using the wrong encoding

Bingo!

Compared with the less common causes above, using the wrong encoding in code is the main cause of garbled text.
"Using the wrong encoding" means reading bytes written in one encoding as if they were in another.
The most common cases: reading UTF-8 with a GB-family encoding, or reading GB-family text as UTF-8.

PS: An earlier article, 《Character encoding and Chinese display after upgrading to HBase 2》, was a wrong-encoding case,
though I did not pick the wrong one on purpose: String.getBytes was called without a character-set argument, so it used the system default character set.
The Windows and Linux defaults differ, while the later Java steps (reading the data back out of HBase) treated the unspecified text as UTF-8.
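
A minimal sketch of this failure mode: the same UTF-8 bytes are decoded once with the right character set and once with GBK. The sample text is 中文 (U+4E2D U+6587), written with escapes so the example does not depend on the source file encoding. The class and method names are illustrative only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class WrongCharsetDemo {
    // Decode bytes with an explicitly named charset instead of the platform default.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587";
        // Always pass the charset: getBytes() with no argument uses the platform default.
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        System.out.println("read as UTF-8: " + decode(utf8, "UTF-8")); // the original text
        System.out.println("read as GBK  : " + decode(utf8, "GBK"));   // garbled text
    }
}
```

The bytes themselves are fine in both cases. Only the interpretation is wrong, which is why this kind of garbling can still be fixed by reading the data again with the right character set.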

Someone (I don't know who) compiled the table below. It is worth a look whenever you run into garbled text.
[Image: table of common garbled-text patterns]

2.5 The original bytes are already wrong

If, as mentioned above, text is read with the wrong encoding and then written to a new text file, the new file itself is encoded incorrectly.
Its original bytes are already wrong, so no matter how you read it afterwards, it displays incorrectly.

【锟斤拷】 in particular is an unrecoverable error of this kind.
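
How 【锟斤拷】 typically comes about can be sketched as follows (this is the commonly described route, and the class and method names are my own): GBK bytes are misread as UTF-8, each invalid sequence is replaced by U+FFFD, the result is saved as UTF-8 (EF BF BD per replacement character), and those bytes are finally read as GBK again. The input is 中 (U+4E2D), whose GBK bytes are D6 D0.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class KunJinKaoDemo {
    static final Charset GBK = Charset.forName("GBK");

    // Run GBK text through a wrong UTF-8 read, a UTF-8 save, and a GBK re-read.
    static String corrupt(String original) {
        byte[] gbkBytes = original.getBytes(GBK);
        // Step 1: misread as UTF-8. Invalid sequences become U+FFFD and the original bytes are lost.
        String misread = new String(gbkBytes, StandardCharsets.UTF_8);
        // Step 2: save the damaged text as UTF-8. Each U+FFFD becomes EF BF BD.
        byte[] saved = misread.getBytes(StandardCharsets.UTF_8);
        // Step 3: read the saved bytes as GBK. EF BF, BD EF, BF BD form three Chinese characters.
        return new String(saved, GBK);
    }

    public static void main(String[] args) {
        System.out.println(corrupt("\u4e2d"));
    }
}
```

After step 1 every damaged character is the same U+FFFD, so nothing in the file says which bytes were there originally. That is what makes the error unrecoverable.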

(3) Avoiding garbled text when reading and writing files

3.1 Mind the default encoding

Java: UTF-8 by default since JDK 18. Earlier versions follow the operating system's default.
Linux: UTF-8 by default.
Windows (Simplified Chinese): GBK by default, that is, code page 936 (GB18030 would cover a larger range, sigh).
Even on Windows, IntelliJ IDEA runs unit tests with UTF-8 by default (so the tests behave differently from the real runtime?).
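
To see which default a particular JVM actually uses, a quick check like the following can help. It is only a diagnostic sketch with made-up class and method names. `Charset.defaultCharset()` and the `file.encoding` system property are standard, but the values they report depend on the JDK version, the operating system and the launch options.

```java
import java.nio.charset.Charset;

class DefaultCharsetDemo {
    // Name of the charset used when no charset is passed to getBytes(), new String(bytes), etc.
    static String defaultName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("Charset.defaultCharset(): " + defaultName());
        System.out.println("file.encoding property  : " + System.getProperty("file.encoding"));
    }
}
```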

3.2 Specify the encoding

When opening or writing a text file, specify an encoding, and make sure it is the correct one.
With the correct encoding no conversion is needed. If you find yourself converting, the encoding was wrong to begin with.
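
A minimal sketch of "specify the encoding on both sides": the text is written to a temporary file with an explicit character set and read back with the same one, so the platform default never gets involved. The class and method names and the temp-file prefix are illustrative.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitCharsetFileDemo {
    // Write text with the named charset, read it back with the same charset.
    static String roundTrip(String text, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        try {
            Path tmp = Files.createTempFile("encoding-demo", ".txt");
            try {
                Files.write(tmp, text.getBytes(cs));          // explicit charset on write
                return new String(Files.readAllBytes(tmp), cs); // same charset on read
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587abc"; // two Chinese characters followed by ASCII
        System.out.println(roundTrip(text, "GB18030").equals(text));
        System.out.println(roundTrip(text, "UTF-8").equals(text));
    }
}
```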

3.3 Don't rely too much on automatic detection

Two cases:

The content is too short, and its bytes are valid in both encodings.
The file is too large and has only English at the beginning. Do you have to read all 10 GB of text to determine the encoding?

The second case is easy to understand.
As for the first case, suppose the content has only a few Chinese characters, such as 【跃跃】 and 【珊珊】 encoded in UTF-8:

 跃 (UTF8) = E8 B7 83
 珊 (UTF8) = E7 8F 8A

So the lovely reduplicated words are:

 跃跃 (UTF8) = E8 B7 83,E8 B7 83
 珊珊 (UTF8) = E7 8F 8A,E7 8F 8A

If we read the bytes in groups of 2:

 跃跃 (UTF8) = E8 B7 83,E8 B7 83 = E8 B7,83 E8,B7 83 = three valid characters (GB18030)
 珊珊 (UTF8) = E7 8F 8A,E7 8F 8A = E7 8F,8A E7,8F 8A = three valid characters (GB18030)

Those are not common characters, but can we conclude that the GB18030 reading is wrong?

Maybe 跃跃 and 珊珊 look too normal.
Take another example: between 【趃珋】 and 【Zan mi】, which one is right?
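
The ambiguity can be checked mechanically. The sketch below uses strict decoders (which report errors instead of substituting replacement characters) to test whether the six bytes of 跃跃 are valid in a given character set. It shows that both UTF-8 and GB18030 accept them, so validity alone cannot decide. The class, constant and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

class AmbiguityDemo {
    // The UTF-8 bytes of the two-character word from the article: E8 B7 83, E8 B7 83.
    static final byte[] YUE_YUE = {
        (byte) 0xE8, (byte) 0xB7, (byte) 0x83, (byte) 0xE8, (byte) 0xB7, (byte) 0x83
    };

    // True if the bytes decode without any malformed or unmappable sequence.
    static boolean isValid(byte[] bytes, String charsetName) {
        try {
            Charset.forName(charsetName).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("valid UTF-8  : " + isValid(YUE_YUE, "UTF-8"));
        System.out.println("valid GB18030: " + isValid(YUE_YUE, "GB18030"));
    }
}
```

A detector would have to fall back on statistics (how common the resulting characters are), and with only a handful of bytes there is not enough data for that.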

(4) Extended discussion: Oracle character sets

Note that Oracle behaves roughly like this:

Even an Oracle server with an English character set such as ISO8859P1 can store Chinese.
All it takes is for the Oracle client and server character set settings to match.
There is one exception: when the server is AL32UTF8, the client can be set to ZHS16GBK (Simplified Chinese) or ZHT16BIG5 (Traditional Chinese).
When the server is AL32UTF8, the client should not be set to AL32UTF8.
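
Why a single-byte "English" character set can carry Chinese at all is easiest to see at the byte level. The sketch below is not Oracle code and makes no claim about Oracle's internals. It only illustrates the general principle in Java, using hypothetical helper names: ISO-8859-1 maps every byte value 0x00–0xFF to a character, so GBK bytes can be passed through it unchanged and reinterpreted as GBK at the other end.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin1CarrierDemo {
    // Reinterpret the real encoded bytes as ISO-8859-1 characters (one char per byte).
    static String carry(String text, String realCharset) {
        return new String(text.getBytes(Charset.forName(realCharset)), StandardCharsets.ISO_8859_1);
    }

    // Undo carry(): turn the ISO-8859-1 characters back into bytes and decode them properly.
    static String recover(String carried, String realCharset) {
        return new String(carried.getBytes(StandardCharsets.ISO_8859_1), Charset.forName(realCharset));
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // two Chinese characters
        String carried = carry(original, "GBK");
        System.out.println("chars while carried: " + carried.length()); // one char per GBK byte
        System.out.println("recovered intact   : " + recover(carried, "GBK").equals(original));
    }
}
```

The trick only works while both ends agree on the real encoding, which matches the point above that client and server settings must be consistent.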

That is the principle, but watch out for the pitfalls:

Different client tools handle character sets differently.
Java 8 does not use the Oracle client character set (huh???).

For example, these are the client tools I have tried (this does not apply to every version!):

TOAD: ignores the NLS_LANG environment variable. The character set cannot be configured, and only ZHS16GBK is supported.
PL/SQL: follows the NLS_LANG environment variable. The character set cannot be configured in the tool, but data imported into AL32UTF8 is actually ZHS16GBK-encoded, and queries display both AL32UTF8 and ZHS16GBK Chinese correctly.
Navicat: imports and queries according to the configured character set, but Chinese characters often go wrong on import.

In Java, with Alibaba's Druid, I managed to get it working. See the problems I ran into earlier.

This one: 《Storing Chinese in an Oracle database whose character set is WE8ISO8859P1, and reading, writing and displaying it from Java》
And this one: 《Storing Chinese in an Oracle database whose character set is WE8ISO8859P1, and displaying Chinese in client programs》

(4) Extended discussion: FTP character encoding

When we write our own FTP code it is easy to run into garbled text, while mature FTP tools generally do not have the problem.
That is because they check carefully: they query in detail which commands the FTP server supports, including the encoding used for commands.

Simply put, if the FTP server uses UTF-8, no conversion is needed.
If the server is not UTF-8, we have to forcibly convert our GBK bytes into the server's encoding (8859-1 and the like).

Commands that involve names (list, put, get) usually carry a directory or file name, so call the conversion wherever a name is passed.
Why convert GBK to 8859-1 rather than UTF-8 to 8859-1? Because if the server were UTF-8, it would already support Chinese!!!
Well, it is a matter of logic...

Part of the code is as follows:

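	// Convert a name received from the server into a displayable Java string.
	// `ftp` is the FTP client field defined elsewhere in this class.
	public String FromServerEncodingString(String aOriString) throws Exception {
		// A UTF-8 server already carries Chinese correctly, so no conversion is needed.
		if (ftp.getControlEncoding().equalsIgnoreCase("UTF-8")) return aOriString.trim();
		// Otherwise recover the raw bytes in the server's encoding and reinterpret them as GBK.
		else return new String(aOriString.getBytes(ftp.getControlEncoding()), "GBK").trim();
	}

	// Convert a local name into the form the server expects.
	public String ToServerEncodingString(String aOriString) throws Exception {
		if (ftp.getControlEncoding().equalsIgnoreCase("UTF-8")) return aOriString.trim();
		// Wrap our GBK bytes in the server's encoding (for example 8859-1) so they pass through unchanged.
		else return new String(aOriString.getBytes("GBK"), ftp.getControlEncoding());
	}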