What do programmers mean by "锟斤拷 in the left hand, 烫烫烫 in the right"?

2022-08-03 01:32:00 InfoQ

In fact, 锟斤拷, 烫烫烫, 屯屯屯, and 锘锘锘 are all by-products of character set conversion: certain special-purpose bytes show up as these characters when they are decoded in a different character set.
Next, let's take a look at their origins one by one.

锟斤拷

锟斤拷 comes from conversion between the GBK and Unicode character sets. Unicode is one of the most widely used character sets today. When text in an older encoding is converted to Unicode, there will be some characters that cannot be mapped to Unicode (Unicode was not designed to be fully compatible with every legacy encoding, and there is no need for it to be), so Unicode officially provides a placeholder for these unrepresentable characters: U+FFFD (REPLACEMENT CHARACTER). The UTF-8 encoding of U+FFFD is 0xEF 0xBF 0xBD. If it is repeated, for example 0xEF 0xBF 0xBD 0xEF 0xBF 0xBD, and those bytes are then decoded in the GBK/CP936/GB2312/GB18030 character set, where one Chinese character takes two bytes, the result is 锟斤拷: 锟 (0xEFBF), 斤 (0xBDEF), 拷 (0xBFBD).
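#!/usr/bin/env python3
"""
File Name: kunjinkao.py
Author: SangYu
Email: [email protected]
Created Time: Tuesday, August 02, 2022 21:53:16
Description: show how U+FFFD becomes 锟斤拷 when its UTF-8 bytes are decoded as GBK
"""

if __name__ == "__main__":
    # UTF-8 bytes of U+FFFD (iterating over bytes yields ints in Python 3)
    replacement = '\uFFFD'.encode('utf-8')
    print(' '.join('%x' % b for b in replacement))
    # Two copies of those bytes, decoded as GBK
    print((replacement * 2).decode('gbk'))
    # GBK bytes of the resulting string
    print(' '.join('%x' % b for b in '锟斤拷'.encode('gbk')))
Execution result:
ef bf bd
锟斤拷
ef bf bd ef bf bd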
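The script above starts from U+FFFD itself. As a minimal sketch of how the placeholder gets there in the first place, the following assumes the usual three-step accident: bytes are wrongly decoded as UTF-8, the damaged text is saved as UTF-8, and the file is later read as GBK. The helper name `mangle` is hypothetical and used only for illustration.

```python
def mangle(raw: bytes) -> str:
    """Hypothetical helper: replay the round trip that produces 锟斤拷."""
    # Step 1: decode as UTF-8; undecodable input is replaced by U+FFFD.
    text = raw.decode("utf-8", errors="replace")
    # Step 2: save the damaged text as UTF-8; every U+FFFD becomes EF BF BD.
    saved = text.encode("utf-8")
    # Step 3: read those bytes back as GBK, two bytes per character.
    return saved.decode("gbk", errors="replace")


if __name__ == "__main__":
    # 0xFF never occurs in valid UTF-8, so these two bytes give two U+FFFD.
    print(mangle(b"\xff\xff"))  # 锟斤拷
```

Plain ASCII passes through all three steps unchanged, which is why the damage often goes unnoticed until non-ASCII text is involved.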

烫烫烫 and 屯屯屯

On Windows, to help debug memory problems, Microsoft's VC compiler fills all uninitialized stack memory with 0xCC and all uninitialized heap memory with 0xCD when the user builds in Debug mode. When such memory is displayed as GBK text, each pair of 0xCC bytes appears as 烫 and each pair of 0xCD bytes appears as 屯. As before, we can check the GBK encoding of these characters:
[hex(b) for b in '烫烫烫'.encode('gbk')]
[hex(b) for b in '屯屯屯'.encode('gbk')]
Test result:
['0xcc', '0xcc', '0xcc', '0xcc', '0xcc', '0xcc']
['0xcd', '0xcd', '0xcd', '0xcd', '0xcd', '0xcd']
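The check above goes from characters to bytes. The reverse direction, which is what a debugger or console actually does, can be sketched as follows. This only simulates the fill pattern in Python; the constant and function names are hypothetical, and the fill values are the ones given in the text.

```python
STACK_FILL = 0xCC  # Debug-mode fill for uninitialized stack memory (per the text)
HEAP_FILL = 0xCD   # Debug-mode fill for uninitialized heap memory (per the text)


def show_uninitialized(fill: int, size: int = 8) -> str:
    """Hypothetical helper: decode a buffer of one repeated fill byte as GBK."""
    # size should be even, because GBK uses two bytes per Chinese character.
    buffer = bytes([fill]) * size
    return buffer.decode("gbk")


if __name__ == "__main__":
    print(show_uninitialized(STACK_FILL))  # 烫烫烫烫
    print(show_uninitialized(HEAP_FILL))   # 屯屯屯屯
```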

锘锘锘

UTF encoding schemes have a standard mark for identifying the encoding, such as FF FE in UTF-16. When we write a web page in Notepad, which here uses GBK, converting the file adds a Unicode reserved character, U+FEFF, at the start. Encoded as UTF-8, this header is 0xEF 0xBB 0xBF. When those bytes are repeated and decoded as GBK, they become the following characters:
  • 锘 EFBB
  • 匡 BFEF
  • 豢 BBBF
[hex(b) for b in '锘匡豢'.encode('gbk')]
Test result:
['0xef', '0xbb', '0xbf', '0xef', '0xbb', '0xbf']
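The same relationship can be checked starting from U+FEFF, as a small sketch using only the standard library. `codecs.BOM_UTF16_LE` is the FF FE mark mentioned above; the variable names are chosen for illustration.

```python
import codecs

# U+FEFF encoded as UTF-8 is the three-byte header described above.
utf8_bom = "\ufeff".encode("utf-8")
print(utf8_bom.hex(" "))             # ef bb bf
print(codecs.BOM_UTF16_LE.hex(" "))  # ff fe

# Two headers in a row give six bytes, which GBK reads as three characters.
garbled = (utf8_bom * 2).decode("gbk")
print(garbled)                       # 锘匡豢
```

A single header is only three bytes, so in a real file the third byte pairs with whatever byte follows it, and the second character varies from file to file.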