How to handle \ufeff

2022-06-21 08:59:00 【Break through】

A stray \ufeff appears when reading a file. How can it be fixed?

Problem:
Language: Python
Tool: PyCharm
While reading an existing, non-empty CSV file, the printed result contains \ufeff.

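import csv

# Test: read the data from a CSV file.
# encoding='utf-8' is spelled out here (an assumption about the original setup);
# plain UTF-8 decoding keeps a leading BOM, so it shows up as \ufeff.
with open('userinfo.csv', 'r', encoding='utf-8', newline='') as file:
    table = csv.reader(file)
    for row in table:
        print(row)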

Output:

['\ufeffZhang Ming', '123456', 'Login successful']
['Wang two pock marks', '123456', 'The username does not exist']
['Zhang Ming', '111111', 'Wrong password']
['', '', 'The username does not exist']

Process finished with exit code 0

Solution: change the encoding used to open the file from UTF-8 to UTF-8-sig.

import csv

# Test: read the data from a CSV file.
# utf-8-sig skips a leading BOM if one is present.
with open('userinfo.csv', 'r', encoding='utf-8-sig', newline='') as file:
    table = csv.reader(file)
    for row in table:
        print(row)

Output:

['Zhang Ming', '123456', 'Login successful']
['Wang two pock marks', '123456', 'The username does not exist']
['Zhang Ming', '111111', 'Wrong password']
['', '', 'The username does not exist']

Process finished with exit code 0
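
To reproduce the behavior without an existing userinfo.csv, the sketch below writes a small temporary CSV that starts with the UTF-8 BOM and reads it back with both encodings. The helper name read_rows and the file contents are made up for illustration; they are not part of the original test.

```python
import csv
import os
import tempfile


def read_rows(path, encoding):
    """Read every row of a CSV file using the given text encoding."""
    with open(path, 'r', encoding=encoding, newline='') as f:
        return list(csv.reader(f))


# Build a small CSV that starts with the UTF-8 BOM (bytes EF BB BF).
# The row contents are hypothetical sample data.
fd, path = tempfile.mkstemp(suffix='.csv')
with os.fdopen(fd, 'wb') as f:
    f.write(b'\xef\xbb\xbf' + 'Zhang Ming,123456,Login successful\r\n'.encode('utf-8'))

rows_plain = read_rows(path, 'utf-8')      # the BOM survives as '\ufeff'
rows_sig = read_rows(path, 'utf-8-sig')    # the BOM is skipped
os.remove(path)

print(rows_plain[0])  # first field starts with '\ufeff'
print(rows_sig[0])    # first field is clean
```

Only the first field of the first row is affected, because the BOM sits at the very start of the file.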

The following is taken from the Internet.

The difference between the utf-8 and utf-8-sig encodings:

As UTF-8 is an 8-bit encoding no BOM is required and any U+FEFF character in the decoded Unicode string (even if it’s the first character) is treated as a ZERO WIDTH NO-BREAK SPACE.

UTF-8 uses the byte as its encoding unit, so its byte order is the same on every system. It has no byte-order problem and does not really need a BOM ("byte order mark"). UTF-8 with BOM, which Python calls utf-8-sig, does deal with the BOM: the codec writes one when encoding and skips a leading one, if present, when decoding.
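
The difference between the two codecs can be seen directly on bytes, without any file. This is a minimal sketch using only the built-in bytes.decode and str.encode; the sample data is made up.

```python
data = b'\xef\xbb\xbfabc'   # UTF-8 BOM followed by "abc"

plain = data.decode('utf-8')       # keeps the BOM as the character U+FEFF
sig = data.decode('utf-8-sig')     # drops one leading BOM

print(repr(plain))   # '\ufeffabc'
print(repr(sig))     # 'abc'

# When encoding, only utf-8-sig prepends the BOM bytes.
print('abc'.encode('utf-8'))       # b'abc'
print('abc'.encode('utf-8-sig'))   # b'\xef\xbb\xbfabc'

# Input without a BOM decodes identically under both codecs,
# so utf-8-sig is safe to use when it is unknown whether a BOM is present.
no_bom = b'abc'.decode('utf-8-sig')
print(no_bom)        # abc
```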

Some background on \ufeff (from Wikipedia):

The byte order mark (BOM) is the name of the Unicode character at code point U+FEFF. When a string of UCS/Unicode characters is encoded as UTF-16 or UTF-32, this character indicates the byte order. It is also often used as a signature marking a file as encoded in UTF-8, UTF-16, or UTF-32.

When U+FEFF appears at the start of a byte stream, it identifies the stream's byte order: most significant byte first or least significant byte first. When it appears in the middle of a stream, it means "zero width no-break space", which to the user behaves like an invisible space. Since Unicode 3.2, U+FEFF is meant to appear only at the start of a byte stream and only to mark byte order, just as its name, byte order mark, suggests. The other use is deprecated, and U+2060 is used for the zero width no-break space instead.
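
The two code points can be inspected with Python's standard unicodedata module. The sketch below also shows a practical consequence: U+FEFF does not count as whitespace, so a bare str.strip() leaves it in place and it has to be removed explicitly. The variable names are illustrative only.

```python
import unicodedata

bom = '\ufeff'
word_joiner = '\u2060'

print(unicodedata.name(bom))          # ZERO WIDTH NO-BREAK SPACE
print(unicodedata.name(word_joiner))  # WORD JOINER
print(unicodedata.category(bom))      # Cf (a format character)

# U+FEFF is invisible but still counts as one character.
print(len('a' + bom + 'b'))           # 3

# strip() with no argument removes whitespace only, so the BOM stays.
text = '\ufeffabc'
print(text.strip() == text)           # True

# Naming the character explicitly removes it.
cleaned = text.lstrip('\ufeff')
print(cleaned)                        # abc
```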

In UTF-16, the byte order mark is placed as the first character of a file or text stream to indicate the endianness (byte order) of every 16-bit code unit in that file or stream.

If the 16-bit units are big-endian, the byte order mark appears as the byte 0xFE followed by 0xFF (0x denotes hexadecimal).
If the 16-bit units are little-endian, the byte sequence is 0xFF followed by 0xFE.

In Unicode, the value U+FFFE is guaranteed never to be assigned as a character. The bytes 0xFF, 0xFE can therefore only be read as U+FEFF in little-endian order, because in big-endian order they would be U+FFFE, which cannot occur in valid text.
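
The byte sequences described above are available as constants in Python's standard codecs module. The following sketch checks them against the encoded form of U+FEFF and shows what happens when the little-endian BOM is read with the wrong byte order.

```python
import codecs
import unicodedata

print(codecs.BOM_UTF16_BE)   # b'\xfe\xff'
print(codecs.BOM_UTF16_LE)   # b'\xff\xfe'

# Encoding U+FEFF with an explicit byte order yields exactly those bytes.
print('\ufeff'.encode('utf-16-be') == codecs.BOM_UTF16_BE)   # True
print('\ufeff'.encode('utf-16-le') == codecs.BOM_UTF16_LE)   # True

# The generic 'utf-16' codec writes a BOM first, in the platform's byte order.
encoded = 'A'.encode('utf-16')
print(encoded[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE))  # True

# Reading the little-endian BOM as big-endian gives U+FFFE,
# which has category Cn (not assigned to any character).
swapped = codecs.BOM_UTF16_LE.decode('utf-16-be')
print(hex(ord(swapped)))               # 0xfffe
print(unicodedata.category(swapped))   # Cn
```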

UTF-8 has no byte order issue. A byte order mark in UTF-8 only signals that the file is UTF-8; it says nothing about byte order.[1] Many Windows programs (including Notepad) add a byte order mark to UTF-8 files. On Unix-like systems, however, which rely heavily on text files for file formats and interprocess communication, this practice is discouraged, because it interferes with the correct handling of important markers such as the shebang at the start of an interpreter script. It also affects programming languages that do not recognize it: older versions of gcc, for example, report an unrecognized character at the start of the source file. In PHP, if output buffering is not enabled, the BOM causes page content to start being sent to the browser (that is, the HTTP headers have already been committed), so the PHP script can no longer set HTTP headers. In UTF-8 the byte order mark is the byte sequence EF BB BF; text editors and web browsers that are not prepared to handle UTF-8 display it as ï»¿ in an ISO-8859-1 environment.
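
A quick way to check whether raw bytes start with the UTF-8 signature is to compare against codecs.BOM_UTF8. This is a minimal sketch; the helper name has_utf8_bom is hypothetical, and the last lines show why the mark appears as ï»¿ when the bytes are misread as ISO-8859-1.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the bytes begin with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


print(codecs.BOM_UTF8)                      # b'\xef\xbb\xbf'
print(has_utf8_bom(b'\xef\xbb\xbfhello'))   # True
print(has_utf8_bom(b'hello'))               # False

# Decoding the three BOM bytes as ISO-8859-1 maps each byte to one character:
# 0xEF -> ï, 0xBB -> », 0xBF -> ¿
mojibake = codecs.BOM_UTF8.decode('iso-8859-1')
print(len(mojibake))                        # 3
```

To inspect a real file this way, it would have to be opened in binary mode ('rb') so that no decoding happens before the check.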

Byte order marks can also be used with UTF-32, although that encoding is rarely used for transmission; the rules are the same as for UTF-16. For the IANA-registered character sets UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE, a byte order mark must not be used: because the names of these character sets already fix the byte order, a U+FEFF at the start of the document is interpreted as a (deprecated) zero width no-break space. For the registered character sets UTF-16 and UTF-32, a leading U+FEFF indicates the byte order.
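
Python's codecs follow the same split between the explicit and the generic character sets. In the sketch below (sample bytes made up for illustration), the explicit utf-16-be codec keeps a leading U+FEFF as an ordinary character, while the generic utf-16 codec consumes it as a byte order mark.

```python
import codecs

data = b'\xfe\xff\x00A'   # big-endian BOM followed by "A"

explicit = data.decode('utf-16-be')   # byte order fixed by name: U+FEFF is kept
generic = data.decode('utf-16')       # BOM selects the byte order and is dropped

print(repr(explicit))   # '\ufeffA'
print(repr(generic))    # 'A'

# The same rule applies to UTF-32: the generic codec writes a 4-byte BOM,
# the explicit ones do not.
utf32 = 'A'.encode('utf-32')
print(utf32[:4] in (codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE))   # True
print('A'.encode('utf-32-be'))                                   # b'\x00\x00\x00A'
```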

Copyright notice
This article was written by [Break through]. When reposting, please include a link to the original. Thank you.
https://yzsam.com/2022/02/202202221450579267.html
