The story of 【锟斤拷】: Chinese character encoding and common character sets
2022-07-06 16:36:00 【Ruo Miaoshen】
Several earlier articles mentioned garbled Chinese characters. It is a problem that has plagued us for a long time.
Whatever the development language, encoding problems can show up in disk files, databases, and network transmission.
(One) Encoding
Computers know nothing about encodings; they only recognize binary 0s and 1s (for convenience we usually write them in hexadecimal, such as 0xFF).
So displaying any text, even English letters, requires an encoding. Encodings are a human invention.
To keep it short:
PS: The content and pictures come from Baidu and other websites (links are given where they could be found).
1.1 ASCII
ASCII = American Standard Code for Information Interchange.
A single byte represents one character. The highest bit is 0, and the combinations of the remaining 7 bits represent the English letters and symbols, for example:
Maximum: 0111 1111, 0x7F
HEX: 41 42 43 44 2C 31 32 33 34 = ABCD,1234
For English, 128 codes are enough to represent all letters and symbols, but they are not enough for other languages.
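As a small illustration of the byte values above, the following sketch encodes the sample string as ASCII and checks that no byte has its highest bit set. The class and method names (AsciiDemo, toHex, isPureAscii) are made up for this example.

```java
import java.nio.charset.StandardCharsets;

final class AsciiDemo {
    /** Formats bytes as space-separated upper-case hex, e.g. "41 42". */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    /** True when every byte has its highest bit clear, i.e. lies in 0x00..0x7F. */
    static boolean isPureAscii(byte[] bytes) {
        for (byte b : bytes) {
            if (b < 0) return false; // a set high bit makes a Java byte negative
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] bytes = "ABCD,1234".getBytes(StandardCharsets.US_ASCII);
        System.out.println(toHex(bytes));       // 41 42 43 44 2C 31 32 33 34
        System.out.println(isPureAscii(bytes)); // true
    }
}
```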
1.2 Extended ASCII
These encodings use the highest bit as well. For example, in a French encoding é is 130 (binary 10000010).
The encodings used in these European countries can therefore represent at most 256 symbols.
Maximum: 1111 1111, 0xFF
But different countries have different letters, so even though they all use 256-symbol encodings, the letters they represent differ. For example:
Byte 130 stands for é in the French encoding,
for the letter Gimel (ג) in the Hebrew encoding,
and for yet another symbol in the Russian encoding.
In all of these encodings, however, the symbols for 0–127 are the same; only the 128–255 part differs.
PS: To know which character a given code stands for, we must know which character set the text uses.
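To make the point about byte 130 concrete, here is a minimal sketch that decodes the same byte under three single-byte code pages. The article does not say which code pages it has in mind; IBM437, IBM862 and IBM866 are assumed here as stand-ins for the French, Hebrew and Russian encodings, and the class name CodePageDemo is invented for the example.

```java
import java.nio.charset.Charset;

final class CodePageDemo {
    /** Decodes one byte value (0..255) under the named single-byte charset. */
    static String decodeByte(int value, String charsetName) {
        return new String(new byte[] {(byte) value}, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Assumed stand-ins: IBM437 (Western), IBM862 (Hebrew), IBM866 (Cyrillic).
        for (String name : new String[] {"IBM437", "IBM862", "IBM866"}) {
            // 65 is in the shared 0-127 range; 130 is in the range that differs.
            System.out.println(name + ": 65 -> " + decodeByte(65, name)
                    + ", 130 -> " + decodeByte(130, name));
        }
    }
}
```

Byte 65 decodes to A in all three, while byte 130 decodes to a different letter in each.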
1.3 Multi-byte encodings for Chinese characters (and other scripts)
Because we have both national standards (GB) and the international standard (Unicode), Chinese is relatively complicated.
- GB2312: the national standard for simplified Chinese character encoding, issued in 1980 and in force since May 1, 1981. GB2312 encodes Chinese characters in two bytes and contains 7445 graphic characters, of which 6763 are Chinese characters.
- BIG5: the Traditional Chinese character set standard used in Taiwan. It uses two-byte encoding, contains 13053 Chinese characters, and came into use in 1984.
- GBK: the Chinese character encoding specification issued in December 1995, an extension of GB2312 that also encodes Chinese characters in two bytes. The GBK character set contains 21003 Chinese characters, including all the Chinese, Japanese and Korean characters in the national standard GB13000-1 and all the Chinese characters in BIG5.
- GB18030: the national standard for Chinese character encoding issued on March 17, 2000, an extension of GBK. It covers Chinese, Japanese, Korean and the scripts of China's minority languages, and contains 27484 Chinese characters. GB18030 encodes characters in one, two or four bytes, and is compatible with GBK and GB2312.
- Unicode: the international standard character set. It assigns a unique code point to every character of the world's languages, to support text exchange across languages and platforms. Unicode itself only defines code points; the fixed-width form UTF-32 (UCS-4) uses four bytes for each character.
- UTF-8 and UTF-16: Unicode transformation formats. Both are variable-length encodings and save space compared with four bytes per character. UTF-16 comes in big-endian and little-endian byte orders. PS: a Chinese character in UTF-8 usually takes three bytes.
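The byte counts in the list above can be checked with a short sketch like the one below. It is only an illustration: the class EncodedLengthDemo and its helper methods are invented names, and the sample characters are chosen for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodedLengthDemo {
    /** Number of bytes the text occupies once encoded in the given charset. */
    static int byteLength(String text, Charset cs) {
        return text.getBytes(cs).length;
    }

    /** Space-separated upper-case hex dump of the encoded bytes. */
    static String hex(String text, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(cs)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String rare = new String(Character.toChars(0x2C317)); // 𬌗, outside the BMP
        Charset gb18030 = Charset.forName("GB18030");
        for (String s : new String[] {"A", "中", rare}) {
            System.out.println(s
                    + " UTF-8=" + byteLength(s, StandardCharsets.UTF_8)
                    + " UTF-16BE=" + byteLength(s, StandardCharsets.UTF_16BE)
                    + " GB18030=" + byteLength(s, gb18030));
        }
        // Same character, opposite byte order.
        System.out.println(hex("中", StandardCharsets.UTF_16BE)); // 4E 2D
        System.out.println(hex("中", StandardCharsets.UTF_16LE)); // 2D 4E
    }
}
```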
Our national standard encodings (character sets) evolved like this:
1.4 Encoding examples and tests
As an example, take four Chinese characters (see the reference website):
【我〇䶵𬌗】
- 我: a common character, present in every character set.
- 〇: not included in the early GB2312.
- 䶵: a Japanese kanji? (see the reference link). Not in GBK; four bytes in GB18030; three bytes in UTF-8.
- 𬌗: the occlusal surface of a tooth (see the reference link). Not in GBK; four bytes in GB18030; four bytes in UTF-8.
Test it in Java with the following code:
import java.nio.charset.Charset;

public class EncodingTest {
    public static void main(String[] args) {
        String aTestStr = "中文我〇䶵𬌗abc";
        // Encode the same string with each charset, dump the bytes, then decode them back.
        for (String name : new String[] {"UTF-8", "GB18030", "GBK", "GB2312"}) {
            Charset cs = Charset.forName(name);
            byte[] encoded = aTestStr.getBytes(cs);
            System.out.printf("%-7s:", name);
            for (byte b : encoded)
                System.out.printf("%#02x,", b);
            // Characters the charset cannot represent are encoded as '?' (0x3f).
            System.out.println("\n" + new String(encoded, cs) + "\n");
        }
    }
}
The output is as follows, consistent with the table above:
(Well, for a closer look it helps to ignore the unrelated characters...)
UTF-8  :0xe4,0xb8,0xad,0xe6,0x96,0x87,0xe6,0x88,0x91,0xe3,0x80,0x87,0xe4,0xb6,0xb5,0xf0,0xac,0x8c,0x97,0x61,0x62,0x63,
中文我〇䶵𬌗abc
GB18030:0xd6,0xd0,0xce,0xc4,0xce,0xd2,0xa9,0x96,0x82,0x35,0x87,0x38,0x99,0x31,0xd2,0x39,0x61,0x62,0x63,
中文我〇䶵𬌗abc
GBK    :0xd6,0xd0,0xce,0xc4,0xce,0xd2,0xa9,0x96,0x3f,0x3f,0x61,0x62,0x63,
中文我〇??abc
GB2312 :0xd6,0xd0,0xce,0xc4,0xce,0xd2,0x3f,0x3f,0x3f,0x61,0x62,0x63,
中文我???abc
(Two) Why garbled text appears
2.1 Characters outside the encoding's range
As in the example above, GBK and GB2312 produced garbled output: the rare characters came out as ? question marks.
If a string's bytes hold content that does not exist in the character set being used,
the display shows confusing symbols and strange characters you cannot make sense of, which we generally call garbled text.
PS: Encountered before: 《Python: the 'GBK' encoder cannot encode character '\uXXYY' when writing to a text file》 was exactly this encoding-range problem.
That article is not entirely correct and I am too lazy to fix it. As the test above shows, specifying GBK in Java is not lenient enough; you have to use GB18030!
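One way to catch out-of-range characters before they silently turn into question marks is to ask the charset's encoder first. The sketch below uses the standard CharsetEncoder.canEncode call; the wrapper class EncodableCheck is an invented name for this example.

```java
import java.nio.charset.Charset;

final class EncodableCheck {
    /** True when every character of the text can be represented in the named charset. */
    static boolean canEncode(String text, String charsetName) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String rare = new String(Character.toChars(0x2C317)); // 𬌗
        for (String s : new String[] {"我", "〇", "䶵", rare}) {
            System.out.println(s
                    + " GB2312=" + canEncode(s, "GB2312")
                    + " GBK=" + canEncode(s, "GBK")
                    + " GB18030=" + canEncode(s, "GB18030"));
        }
    }
}
```

Checking up front lets the program fail loudly or switch to GB18030 or UTF-8 instead of writing 0x3f bytes.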
2.2 The BOM in UTF-8
On Windows, some UTF-8 files start with 3 bytes (EF BB BF) that form a Byte Order Mark (BOM). UTF-8 has no byte-order issue and needs no such mark, so its only function is to signal that the file is UTF-8.
This is not a universally accepted convention (look up BOM for the details); Linux, for example, does not recognize the BOM.
If the reader does not handle the BOM, a little garbage appears at the very start of the text.
The best approach is not to use a BOM. An example of Chinese UTF-8 data (with BOM):
EF BB BF 41 42 43 31 32 33 2C E4 B8 AD E6 96 87 E6 B1 89 E5 AD 97
“ABC123,中文汉字”
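When BOM-prefixed files cannot be avoided, the reader can skip the marker itself. The following is a minimal sketch, assuming the whole file content is already in a byte array; Utf8BomDemo and its methods are hypothetical names. Java's own UTF-8 decoder does not drop the marker, so without this step the string starts with U+FEFF.

```java
import java.nio.charset.StandardCharsets;

final class Utf8BomDemo {
    /** True when the data starts with the UTF-8 BOM bytes EF BB BF. */
    static boolean hasBom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    /** Decodes UTF-8 bytes, skipping a leading BOM when present. */
    static String decodeSkippingBom(byte[] data) {
        int offset = hasBom(data) ? 3 : 0;
        return new String(data, offset, data.length - offset, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The sample bytes from the article, BOM included.
        int[] raw = {0xEF, 0xBB, 0xBF, 0x41, 0x42, 0x43, 0x31, 0x32, 0x33, 0x2C,
                     0xE4, 0xB8, 0xAD, 0xE6, 0x96, 0x87, 0xE6, 0xB1, 0x89, 0xE5, 0xAD, 0x97};
        byte[] data = new byte[raw.length];
        for (int i = 0; i < raw.length; i++) data[i] = (byte) raw[i];

        String naive = new String(data, StandardCharsets.UTF_8);
        System.out.println((int) naive.charAt(0)); // 65279, i.e. U+FEFF left in front
        System.out.println(decodeSkippingBom(data));
    }
}
```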
2.3 No Chinese support
For example, the operating system lacks support, or no Chinese fonts are installed.
Even if the content is encoded correctly, a system that does not know GB18030 and has no font able to display it can only show garbage.
Strictly speaking this is not garbled text: the encoding is right, it just cannot be displayed (usually it shows up as boxes).
2.4 Using the wrong encoding
Bingo!
Compared with the less common causes above, using the wrong encoding in your program is the main cause of garbled text.
"Using the wrong encoding" means reading bytes written in one encoding as if they were in another.
The most common cases: reading UTF-8 with a GB-family encoding, and reading GB-family data with UTF-8.
PS: Encountered before: 《Upgrading to HBase2: character encoding and Chinese display》 was a case of the wrong encoding,
though not one I chose on purpose: String.getBytes was called without a charset argument, so the system default charset was used.
The Windows and Linux defaults differ, and Java's later operations (reading the data back from HBase) treated text with no specified encoding as UTF-8.
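The mismatch described above can be reproduced in a few lines. This is only a sketch: WrongCharsetDemo and transcode are invented names, and the point is that both the encoding and the decoding step name their charset explicitly instead of relying on the platform default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class WrongCharsetDemo {
    /** Encodes the text with one charset and decodes the bytes with another. */
    static String transcode(String text, Charset writeAs, Charset readAs) {
        byte[] bytes = text.getBytes(writeAs); // explicit charset, never the no-argument getBytes()
        return new String(bytes, readAs);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        System.out.println("platform default: " + Charset.defaultCharset());
        // Matching charsets round-trip cleanly.
        System.out.println(transcode("中文", StandardCharsets.UTF_8, StandardCharsets.UTF_8));
        // Mismatched charsets give the two classic kinds of garbled text.
        System.out.println(transcode("中文", StandardCharsets.UTF_8, gbk));
        System.out.println(transcode("中文", gbk, StandardCharsets.UTF_8));
    }
}
```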
Someone (I don't know who) compiled the table below; it is worth a look when you run into garbled text.
2.5 The original bytes are wrong
If, as described above, content is read with the wrong encoding and then written into a new text file, the new file's bytes are wrong.
Once the original bytes of the text are wrong, it displays incorrectly no matter how you read it afterwards.
【锟斤拷】 in particular is this kind of unrecoverable error.
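The usual explanation for 【锟斤拷】 is a three-step accident: GBK bytes are read as UTF-8, the invalid sequences become the replacement character U+FFFD, the text is saved again as UTF-8 (EF BF BD for each U+FFFD), and the result is finally opened as GBK. A minimal sketch of that chain, with the invented class name KunJinKaoDemo:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class KunJinKaoDemo {
    /** Simulates: GBK bytes misread as UTF-8, saved again as UTF-8, then opened as GBK. */
    static String corrupt(String text) {
        Charset gbk = Charset.forName("GBK");
        byte[] original = text.getBytes(gbk);
        // Bytes that are not valid UTF-8 are replaced by U+FFFD; the originals are lost here.
        String misread = new String(original, StandardCharsets.UTF_8);
        // Each U+FFFD is written out as the three bytes EF BF BD.
        byte[] saved = misread.getBytes(StandardCharsets.UTF_8);
        // Read in pairs as GBK: EF BF, BD EF, BF BD.
        return new String(saved, gbk);
    }

    public static void main(String[] args) {
        System.out.println(corrupt("中文"));
    }
}
```

Because every damaged byte collapses into the same U+FFFD, nothing in the saved file remembers what the original bytes were, which is why this error cannot be undone.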
(Three) Avoiding garbled text when reading and writing files
3.1 Mind the default encoding
- Java: since JDK 18 the default charset is UTF-8; older versions follow the platform default.
- Linux defaults to UTF-8.
- Windows (Simplified Chinese) defaults to code page 936, which is GBK rather than GB18030 (and GBK's range is smaller, sigh).
- Even on Windows, IntelliJ IDEA runs unit tests with UTF-8 by default (why should testing differ from the real runtime?).
3.2 Specify the encoding
- When opening or writing a text file, specify an encoding, and make sure it is the correct one.
- With the correct encoding no conversion is needed; if a conversion seems necessary, the encoding was wrong somewhere.
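A minimal sketch of the advice above, writing and reading a text file with the charset named on both sides. The class ExplicitCharsetFileDemo and the temporary file are assumptions made for this example only.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

final class ExplicitCharsetFileDemo {
    /** Writes the text using exactly the given charset. */
    static void write(Path file, String text, Charset cs) throws IOException {
        Files.write(file, text.getBytes(cs));
    }

    /** Reads the whole file, decoding with exactly the given charset. */
    static String read(Path file, Charset cs) throws IOException {
        return new String(Files.readAllBytes(file), cs);
    }

    public static void main(String[] args) throws IOException {
        Charset gb18030 = Charset.forName("GB18030");
        Path file = Files.createTempFile("charset-demo", ".txt");
        try {
            write(file, "中文我〇䶵𬌗abc", gb18030);
            System.out.println(read(file, gb18030)); // same charset on both sides
        } finally {
            Files.delete(file);
        }
    }
}
```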
3.3 Don't rely too much on automatic detection
Two cases:
- The content is too short, and its bytes are valid in both encodings.
- The file is too large and starts with English only. Do you need to read through 10 GB of text to determine the encoding?
The second case is easy to understand.
As for the first case, suppose the content has only a few Chinese characters, such as 【跃跃】 and 【珊珊】 encoded in UTF-8:
跃 (UTF8) = E8 B7 83
珊 (UTF8) = E7 8F 8A
So the cute reduplicated words are:
跃跃 (UTF8) = E8 B7 83,E8 B7 83
珊珊 (UTF8) = E7 8F 8A,E7 8F 8A
If we regroup the bytes in pairs:
跃跃 (UTF8) = E8 B7 83,E8 B7 83 = E8 B7,83 E8,B7 83 = three valid two-byte characters (GB18030)
珊珊 (UTF8) = E7 8F 8A,E7 8F 8A = E7 8F,8A E7,8F 8A = three valid two-byte characters (GB18030)
The results are not common characters, but can we conclude that GB18030 is the wrong choice?
Maybe 跃跃 and 珊珊 look too normal.
Let's take another example: 【趃珋】 and 【Zan mi】, which one is actually right?
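The ambiguity can be shown with strict decoders that report errors instead of substituting characters. In the sketch below (AmbiguousBytesDemo and strictDecode are invented names), the six bytes of 跃跃 pass validation under both UTF-8 and GB18030, so validity alone cannot decide which encoding was meant.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

final class AmbiguousBytesDemo {
    /** Strictly decodes the bytes; returns null if they are not valid in the charset. */
    static String strictDecode(byte[] bytes, String charsetName) {
        try {
            return Charset.forName(charsetName).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // E8 B7 83 E8 B7 83: the UTF-8 bytes of the two-character word from the text.
        byte[] bytes = {(byte) 0xE8, (byte) 0xB7, (byte) 0x83, (byte) 0xE8, (byte) 0xB7, (byte) 0x83};
        System.out.println("UTF-8  : " + strictDecode(bytes, "UTF-8"));
        System.out.println("GB18030: " + strictDecode(bytes, "GB18030"));
    }
}
```

Real detectors add statistics about character frequency on top of this check, which is exactly why they need enough text to work with.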
(Four) Extended discussion: Oracle character sets
Note that Oracle behaves roughly like this:
- Even an Oracle server with a Western character set such as WE8ISO8859P1 can store Chinese.
- All that is needed is for the client and server character set settings to match.
- With one exception: when the server is AL32UTF8, the client can be set to Simplified Chinese ZHS16GBK or Traditional Chinese ZHT16BIG5.
- When the server is AL32UTF8, the client should not be set to AL32UTF8.
That is the principle, but watch out for pitfalls:
- Different client tools handle character sets differently.
- Java 8 does not use the Oracle client character set (what???).
For example, these are the client tools I have tried (this does not cover all versions!):
- TOAD: does not follow the NLS_LANG environment variable, offers no character set setting, and only supports ZHS16GBK.
- PL/SQL: follows the NLS_LANG environment variable and offers no character set setting, but data imported into AL32UTF8 is actually ZHS16GBK-encoded; queries display both AL32UTF8 and ZHS16GBK Chinese correctly.
- Navicat: imports and queries according to the configured character set, but Chinese characters often go wrong on import.
For Java with Alibaba's Druid I barely managed to get it working; see the problems I ran into before.
This one: 《Oracle database with character set WE8ISO8859P1: storing Chinese, and reading, writing and displaying it in Java》
And: 《Oracle database with character set WE8ISO8859P1: storing Chinese, and displaying it in client programs》
(Five) Extended discussion: FTP character encoding
When we develop FTP programs ourselves it is easy to run into garbled text, while mature FTP tools generally do not.
That is because they check carefully: they query in detail which commands the FTP server supports, including the encoding used for commands.
- Simply put, if the FTP server uses UTF-8, no conversion is needed.
- If the server is not UTF-8, we need to forcibly convert our GBK bytes into the server's encoding (8859-1 and the like).
This applies to every name-related command (list, put, get), that is, wherever a directory or file name appears.
Why GBK to 8859-1 rather than UTF-8 to 8859-1? Because if it were UTF-8, Chinese would already be supported!!!
Hmm, it's a matter of logic...
Part of the code is as follows:
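// ftp is the FTP client field of the enclosing class (not shown in this excerpt).
// Converts a name received from the server into a readable Java string.
public String FromServerEncodingString(String aOriString) throws Exception {
    // A UTF-8 server already handles Chinese, so no conversion is needed.
    if (ftp.getControlEncoding().equalsIgnoreCase("UTF-8")) return aOriString.trim();
    // Otherwise recover the raw bytes in the server's encoding and decode them as GBK.
    else return new String(aOriString.getBytes(ftp.getControlEncoding()), "GBK").trim();
}
// Converts a local name into the form the server expects.
public String ToServerEncodingString(String aOriString) throws Exception {
    if (ftp.getControlEncoding().equalsIgnoreCase("UTF-8")) return aOriString.trim();
    // Otherwise encode as GBK and reinterpret those bytes in the server's encoding.
    else return new String(aOriString.getBytes("GBK"), ftp.getControlEncoding());
}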
Fortunately SFTP, which runs over the SSH protocol, seems to use UTF-8 uniformly.
I have to complain: what a loose protocol FTP is!!!
That's all for now; we'll come back to this when new situations turn up.