
How to handle \ufeff

2022-06-21 08:59:00 Break through

\ufeff shows up when reading a file: how to solve it

Problem:

Language used: Python

Tool used: PyCharm

The problem appeared while reading a file: when an existing (non-empty) CSV file is read, \ufeff shows up in the printed result.

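import csv

# Test: read data from a CSV file.
# No encoding is passed, so open() falls back to the platform default;
# when that default is UTF-8, a leading BOM is kept as '\ufeff'.
with open('userinfo.csv', 'r', newline='') as file:
    table = csv.reader(file)
    for row in table:
        print(row)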

Printed result:

['\ufeffZhang Ming', '123456', 'Login successful']
['Wang two pock marks', '123456', 'The username does not exist']
['Zhang Ming', '111111', 'Wrong password']
['', '', 'The username does not exist']

Process finished with exit code 0

Solution: change the encoding accordingly, from UTF-8 to UTF-8-sig.

import csv

# Test: read data from a CSV file.
# 'UTF-8-sig' skips a leading BOM if the file has one.
with open('userinfo.csv', 'r', encoding='UTF-8-sig', newline='') as file:
    table = csv.reader(file)
    for row in table:
        print(row)

Printed result:

['Zhang Ming', '123456', 'Login successful']
['Wang two pock marks', '123456', 'The username does not exist']
['Zhang Ming', '111111', 'Wrong password']
['', '', 'The username does not exist']

Process finished with exit code 0
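
To reproduce this without an existing userinfo.csv, here is a minimal sketch that writes a small CSV with a BOM into a temporary directory and reads it back with both encodings. The file name demo.csv and the helper read_rows are made up for this illustration and are not part of the original test.

```python
import csv
import os
import tempfile


def read_rows(path, encoding):
    """Read all rows of a CSV file using the given text encoding."""
    with open(path, 'r', encoding=encoding, newline='') as f:
        return list(csv.reader(f))


# Hypothetical sample file: writing with 'utf-8-sig' puts the BOM
# bytes EF BB BF at the start of the file.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'demo.csv')
with open(path, 'w', encoding='utf-8-sig', newline='') as f:
    csv.writer(f).writerow(['Zhang Ming', '123456'])

rows_plain = read_rows(path, 'utf-8')      # BOM stays in the first field
rows_sig = read_rows(path, 'utf-8-sig')    # BOM is skipped

print(rows_plain)
print(rows_sig)
```

Only the first field of the first row is affected, which matches the output shown above.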

The following is from the Internet


The difference between the utf-8 and utf-8-sig encodings:

As UTF-8 is an 8-bit encoding no BOM is required and any U+FEFF character in the decoded Unicode string (even if it’s the first character) is treated as a ZERO WIDTH NO-BREAK SPACE.

UTF-8 uses the byte as its code unit, so its byte order is the same on every system. There is no byte-order problem, and it does not really need a BOM ("Byte Order Mark"). UTF-8 with BOM, that is utf-8-sig, is different: it writes a BOM when encoding and skips a leading BOM, if one is present, when decoding.
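
The difference can be checked directly on byte strings. This is a small sketch using only the standard codecs; the sample text 'abc' is arbitrary.

```python
import codecs

text = 'abc'

encoded_plain = text.encode('utf-8')       # no BOM is written
encoded_sig = text.encode('utf-8-sig')     # BOM bytes, then the data

# Decoding: 'utf-8' keeps a leading BOM as U+FEFF, 'utf-8-sig' drops it.
decoded_plain = encoded_sig.decode('utf-8')
decoded_sig = encoded_sig.decode('utf-8-sig')

# 'utf-8-sig' also accepts input that has no BOM at all.
decoded_no_bom = encoded_plain.decode('utf-8-sig')

print(encoded_sig.startswith(codecs.BOM_UTF8))
print(repr(decoded_plain), repr(decoded_sig), repr(decoded_no_bom))
```

Because utf-8-sig tolerates files without a BOM, it is a reasonable choice for reading when it is unknown whether the file has one.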

Some background on \ufeff (from Wikipedia):

The byte order mark (BOM) is the name of the Unicode character at code point U+FEFF. When a string of UCS/Unicode characters is encoded as UTF-16 or UTF-32, this character is used to indicate its byte order. It is also often used as a marker that a file is encoded in UTF-8, UTF-16 or UTF-32.

If the character U+FEFF appears at the beginning of a byte stream, it identifies the byte order of that stream, big-endian or little-endian. If it appears in the middle of a byte stream, it means "zero width no-break space", which the user does not see as a visible character. Since Unicode 3.2, U+FEFF should appear only at the beginning of a byte stream and be used only to identify byte order, as its name (byte order mark) suggests; other uses are deprecated. U+2060 is used instead to express a zero width no-break space.
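
The names of the two characters can be looked up with Python's unicodedata module. The sketch below also shows one way to drop a stray BOM from a string that has already been decoded with plain UTF-8; the sample field value is made up for illustration.

```python
import unicodedata

bom = '\ufeff'
word_joiner = '\u2060'

name_bom = unicodedata.name(bom)
name_wj = unicodedata.name(word_joiner)

# U+FEFF is not whitespace for str.strip(), so it has to be
# named explicitly to remove it from the left edge.
field = '\ufeffZhang Ming'          # illustrative value
unchanged = field.strip()
cleaned = field.lstrip('\ufeff')

print(name_bom)
print(name_wj)
print(repr(unchanged), repr(cleaned))
```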

In UTF-16, the byte order mark is placed as the first character of a file or character stream to indicate the endianness (byte order) of all 16-bit code units in that file or stream.

  • If the 16-bit units are big-endian, the byte order mark appears in the byte sequence as 0xFE followed by 0xFF (the 0x prefix indicates hexadecimal).

  • If the 16-bit units are little-endian, the byte sequence is 0xFF followed by 0xFE.

In Unicode, the value U+FFFE is guaranteed never to be assigned as a character. This means the bytes 0xFF 0xFE can only be interpreted as a little-endian U+FEFF, because they cannot be a big-endian U+FFFE.
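
The two byte sequences are easy to verify in Python. This is a minimal sketch: it encodes the BOM character in both byte orders, then lets the generic 'utf-16' codec pick the byte order from the BOM. The letter 'A' is arbitrary sample data.

```python
import codecs

# The BOM character encoded in each byte order.
bom_be = '\ufeff'.encode('utf-16-be')   # big-endian
bom_le = '\ufeff'.encode('utf-16-le')   # little-endian

# The generic 'utf-16' codec reads the BOM, chooses the matching
# byte order, and removes the BOM from the decoded text.
from_be = (codecs.BOM_UTF16_BE + 'A'.encode('utf-16-be')).decode('utf-16')
from_le = (codecs.BOM_UTF16_LE + 'A'.encode('utf-16-le')).decode('utf-16')

print(bom_be.hex(), bom_le.hex())
print(from_be, from_le)
```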

UTF-8 has no byte-order issue. A UTF-8 encoded byte order mark only signals that the file is UTF-8; it says nothing about byte order.[1] Many Windows programs (including Notepad) add a byte order mark to UTF-8 files. On Unix-like systems, however, which make heavy use of text files for file formats and inter-process communication, this practice is not recommended, because it interferes with the correct handling of important codes such as the shebang at the start of an interpreted script. It also affects programming languages that do not recognize it: gcc, for example, may report unrecognized characters at the start of a source file. In PHP, if output buffering is not enabled, the BOM causes page content to start being sent to the browser (that is, the headers have already been sent), so the PHP script can no longer set HTTP headers.

In UTF-8 the byte order mark is the byte sequence EF BB BF. Text editors and web browsers that are not prepared to handle UTF-8 and treat the text as ISO-8859-1 display it as ï»¿.
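
As a rough illustration of the points above, the sketch below shows what the three BOM bytes look like when misread as ISO-8859-1, and one way to remove them from raw bytes before handing a file to a tool that does not expect a BOM. The helper strip_utf8_bom and the sample script content are hypothetical.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical script content saved by an editor that adds a BOM.
raw = codecs.BOM_UTF8 + b'#!/bin/sh\n'

# The three BOM bytes, misinterpreted as ISO-8859-1 text.
mojibake = raw[:3].decode('iso-8859-1')

# With the BOM gone, the shebang is at the very start again.
cleaned = strip_utf8_bom(raw)

print(mojibake)
print(cleaned)
```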

Although the byte order mark can also be used with UTF-32, that encoding is rarely used for transmission; the rules are the same as for UTF-16. For the IANA-registered character sets UTF-16BE, UTF-16LE, UTF-32BE and UTF-32LE, a byte order mark must not be used. A U+FEFF at the start of such a document is interpreted as a (deprecated) "zero width no-break space", because the names of these character sets already determine their byte order. For the registered character sets UTF-16 and UTF-32, an initial U+FEFF is used to indicate the byte order.
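
Python's codecs follow the same distinction, which the following sketch shows under the assumption that the data starts with U+FEFF: the explicit-endian codec keeps the character, while the generic one consumes it as a BOM. The sample letter 'A' is arbitrary.

```python
# U+FEFF followed by 'A', encoded as little-endian UTF-16.
data_le = '\ufeffA'.encode('utf-16-le')

kept = data_le.decode('utf-16-le')   # explicit byte order: U+FEFF is kept
dropped = data_le.decode('utf-16')   # generic codec: BOM is consumed

# The same character in UTF-32 takes four bytes per byte order.
bom32_le = '\ufeff'.encode('utf-32-le')
bom32_be = '\ufeff'.encode('utf-32-be')

print(repr(kept), repr(dropped))
print(bom32_le.hex(), bom32_be.hex())
```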

 


Copyright notice
This article was written by [Break through]. Please include the link to the original when reposting. Thank you.
https://yzsam.com/2022/02/202202221450579267.html