Foreword
Converting GBK-encoded bytes to a string is very simple in JS; just call TextDecoder directly:
    const gbkBuf = new Uint8Array([196, 227, 186, 195, 49, 50, 51])
    new TextDecoder('gbk').decode(gbkBuf) // "你好123"
The reverse direction, converting a string to GBK bytes, is not so simple: TextEncoder does not accept a character set and can only convert strings to UTF-8 encoded binary data.
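To make the limitation concrete, here is a small sketch showing what TextEncoder produces for the two Chinese characters from the opening example. The variable names are illustrative only.

```javascript
// TextEncoder always produces UTF-8; there is no encoding label to switch it to GBK.
const encoder = new TextEncoder()
const utf8Bytes = encoder.encode('你好')

console.log(encoder.encoding)        // "utf-8"
console.log(Array.from(utf8Bytes))   // [228, 189, 160, 229, 165, 189]
```

The result is six UTF-8 bytes, not the four GBK bytes [196, 227, 186, 195] we are after.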
So the vast majority of existing solutions rely on third-party encoding libraries such as iconv. Because these libraries bundle a large amount of character set data, they are very large. Even the slimmed-down iconv-lite is several hundred kB, which is far from ideal in the browser. We hope to solve the problem in just a few hundred bytes!
Traversal
GBK has only a little over 20,000 characters, so the easiest approach is brute-force enumeration. With the help of TextDecoder, we can obtain the JS character corresponding to each GBK code, and the subsequent encoding process is nothing more than a table lookup.
In fact, the code ranges of GBK follow a regular pattern:
https://en.wikipedia.org/wiki/GBK_(character_encoding)#Encoding
So we only need to traverse those predefined ranges. Even if this costs a dozen extra lines of code, the performance gain is worth it.
    // [first byte begin, first byte end, second byte begin, second byte end]
    const ranges = [
      [0xA1, 0xA9, 0xA1, 0xFE],
      [0xB0, 0xF7, 0xA1, 0xFE],
      [0x81, 0xA0, 0x40, 0xFE],
      [0xAA, 0xFE, 0x40, 0xA0],
      [0xA8, 0xA9, 0x40, 0xA0],
      [0xAA, 0xAF, 0xA1, 0xFE],
      [0xF8, 0xFE, 0xA1, 0xFE],
      [0xA1, 0xA7, 0x40, 0xA0],
    ]
    const codes = new Uint16Array(23940)
    let i = 0
    for (const [b1Begin, b1End, b2Begin, b2End] of ranges) {
      for (let b2 = b2Begin; b2 <= b2End; b2++) {
        if (b2 !== 0x7F) {
          for (let b1 = b1Begin; b1 <= b1End; b1++) {
            // first byte in the low 8 bits, second byte in the high 8 bits
            codes[i++] = b2 << 8 | b1
          }
        }
      }
    }
    const str = new TextDecoder('gbk').decode(codes)
    // code table
    const table = new Uint16Array(65536)
    for (let i = 0; i < str.length; i++) {
      table[str.charCodeAt(i)] = codes[i]
    }
Calling TextDecoder once for every GBK code would be very inefficient. Therefore, we store all GBK codes in the codes array above and call TextDecoder only once at the end to convert them in a single batch.
This initialization takes only 1 ms to 2 ms, so the overhead is very low.
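The batch conversion relies on the decoded string lining up one-to-one with the codes array. A cautious sketch might check that rather than assume it, and time the initialization while it is at it. The helper name buildGbkTable and the returned fields are hypothetical, and iterating by code point is an added precaution that is not part of the code above. Timings will vary by machine and runtime.

```javascript
// Hypothetical helper: builds the lookup table, then reports whether the
// batch decode produced exactly one character per GBK code and how long it took.
function buildGbkTable() {
  const start = performance.now()
  const ranges = [
    [0xA1, 0xA9, 0xA1, 0xFE],
    [0xB0, 0xF7, 0xA1, 0xFE],
    [0x81, 0xA0, 0x40, 0xFE],
    [0xAA, 0xFE, 0x40, 0xA0],
    [0xA8, 0xA9, 0x40, 0xA0],
    [0xAA, 0xAF, 0xA1, 0xFE],
    [0xF8, 0xFE, 0xA1, 0xFE],
    [0xA1, 0xA7, 0x40, 0xA0],
  ]
  const codes = new Uint16Array(23940)
  let n = 0
  for (const [b1Begin, b1End, b2Begin, b2End] of ranges) {
    for (let b2 = b2Begin; b2 <= b2End; b2++) {
      if (b2 === 0x7F) continue
      for (let b1 = b1Begin; b1 <= b1End; b1++) {
        codes[n++] = b2 << 8 | b1
      }
    }
  }
  const str = new TextDecoder('gbk').decode(codes)
  const table = new Uint16Array(65536)
  let count = 0
  // Iterate by code point, so a character outside the BMP (two UTF-16 units)
  // still counts as one decoded code. Such characters do not fit in a
  // 65536-entry table and are skipped.
  for (const ch of str) {
    if (count < codes.length && ch.length === 1) {
      table[ch.charCodeAt(0)] = codes[count]
    }
    count++
  }
  return {
    table,
    aligned: count === codes.length,
    elapsedMs: performance.now() - start,
  }
}

const { table, aligned, elapsedMs } = buildGbkTable()
console.log(aligned, elapsedMs.toFixed(2) + ' ms')
console.log(table['你'.charCodeAt(0)].toString(16))  // "e3c4", i.e. bytes C4 E3
```

If aligned comes back false in a given runtime, the decoder emitted more or fewer characters than there are codes, and the table should not be trusted without further investigation.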
Lookup table
With the mapping table in place, encoding is just a table lookup:
    function stringToGbk(str) {
      const buf = new Uint16Array(str.length)
      for (let i = 0; i < str.length; i++) {
        const code = str.charCodeAt(i)
        buf[i] = table[code]
      }
      // reinterpret the 16-bit codes as bytes (assumes a little-endian platform)
      return new Uint8Array(buf.buffer)
    }
    stringToGbk('你好') // [196, 227, 186, 195]
The output matches the first four bytes of the array shown at the beginning of this article.
However, the code above ignores the ASCII range. If you pass in "你好123", the result will be wrong. Since the ASCII part of GBK is stored as a single byte, the encoding logic needs to be adjusted:
    function stringToGbk(str) {
      const buf = new Uint8Array(str.length * 2)
      let n = 0
      for (let i = 0; i < str.length; i++) {
        const code = str.charCodeAt(i)
        if (code < 0x80) {
          // ASCII: one byte
          buf[n++] = code
        } else {
          // everything else: two bytes, first byte in the low 8 bits
          const gbk = table[code]
          buf[n++] = gbk & 0xFF
          buf[n++] = gbk >> 8
        }
      }
      return buf.subarray(0, n)
    }
    stringToGbk('你好123') // [196, 227, 186, 195, 49, 50, 51]
The output is the same as demonstrated at the beginning of this article.
Uint8Array is used instead of Array for performance reasons. However, the length of a Uint8Array is fixed and cannot be changed after allocation. Therefore, we allocate as if every character in the input string were non-ASCII, which guarantees that the buffer is large enough, and then trim it to the actual length on return. (subarray returns a view on the same memory, so no copy is needed.)
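The difference between trimming with subarray and copying with slice can be seen in a short sketch. The buffer size and byte values here are arbitrary illustrations.

```javascript
const buf = new Uint8Array(8)      // over-allocated, like str.length * 2
buf.set([196, 227, 49])            // pretend only 3 bytes were written
const view = buf.subarray(0, 3)    // a view on the same ArrayBuffer, no copy
const copy = buf.slice(0, 3)       // a new ArrayBuffer with the bytes copied

console.log(view.length, view.buffer === buf.buffer)  // 3 true
console.log(copy.buffer === buf.buffer)               // false
console.log(view.buffer.byteLength)                   // 8
```

The trade-off is that the view keeps the whole over-allocated buffer alive. For short-lived results that is usually fine; if the result is kept around for a long time, slice may be the better choice.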
Refinements
If a character that GBK does not support is passed in, the logic above encodes it as 0x00 bytes, because empty table slots default to 0. Since 0 is itself a valid GBK byte, the result is indistinguishable from real data, which is not ideal.
Therefore, we can fill the table with some other value in advance. When that value turns up during lookup, it can be handled as an error.
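A minimal sketch of that idea follows. The sentinel 0xFFFF, the names UNMAPPED, buildTable and stringToGbkStrict, and the choice to throw a RangeError are all assumptions made for illustration; the final implementation may handle errors differently. 0xFFFF is safe as a sentinel because no byte in the ranges above exceeds 0xFE. The table is filled by code point rather than by UTF-16 unit as a precaution, which also differs from the code above.

```javascript
const UNMAPPED = 0xFFFF  // assumed sentinel: never a real code, since bytes stop at 0xFE

function buildTable() {
  const ranges = [
    [0xA1, 0xA9, 0xA1, 0xFE],
    [0xB0, 0xF7, 0xA1, 0xFE],
    [0x81, 0xA0, 0x40, 0xFE],
    [0xAA, 0xFE, 0x40, 0xA0],
    [0xA8, 0xA9, 0x40, 0xA0],
    [0xAA, 0xAF, 0xA1, 0xFE],
    [0xF8, 0xFE, 0xA1, 0xFE],
    [0xA1, 0xA7, 0x40, 0xA0],
  ]
  const codes = new Uint16Array(23940)
  let n = 0
  for (const [b1Begin, b1End, b2Begin, b2End] of ranges) {
    for (let b2 = b2Begin; b2 <= b2End; b2++) {
      if (b2 === 0x7F) continue
      for (let b1 = b1Begin; b1 <= b1End; b1++) {
        codes[n++] = b2 << 8 | b1
      }
    }
  }
  const str = new TextDecoder('gbk').decode(codes)
  // Every slot starts as "unmapped" instead of 0.
  const table = new Uint16Array(65536).fill(UNMAPPED)
  let i = 0
  for (const ch of str) {
    // Only characters that fit in one UTF-16 unit can be stored in the table.
    if (i < codes.length && ch.length === 1) {
      table[ch.charCodeAt(0)] = codes[i]
    }
    i++
  }
  return table
}

const table = buildTable()

function stringToGbkStrict(str) {
  const buf = new Uint8Array(str.length * 2)
  let n = 0
  for (let i = 0; i < str.length; i++) {
    const code = str.charCodeAt(i)
    if (code < 0x80) {
      buf[n++] = code
      continue
    }
    const gbk = table[code]
    if (gbk === UNMAPPED) {
      const hex = code.toString(16).toUpperCase().padStart(4, '0')
      throw new RangeError(`Cannot encode U+${hex} at index ${i}`)
    }
    buf[n++] = gbk & 0xFF
    buf[n++] = gbk >> 8
  }
  return buf.subarray(0, n)
}

console.log(Array.from(stringToGbkStrict('你好123')))  // [196, 227, 186, 195, 49, 50, 51]
```

Passing a string that contains an emoji such as '😀' throws instead of silently emitting zero bytes. A replacement byte such as '?' would be an equally reasonable policy.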
In addition, according to Wikipedia, Microsoft's Code page 936, which is based on GBK, adds one more code: 0x80, which corresponds to the euro sign €.
Try it; even browsers on non-Windows systems support it:
    const gbkBuf = new Uint8Array([0x80])
    new TextDecoder('gbk').decode(gbkBuf) // "€"
Demo: https://jsbin.com/vuxawul/edit?html,output
Final implementation: https://github.com/EtherDream/str2gbk
With this approach, GBK encoding can be implemented in a few dozen lines of code and a few hundred bytes, and it performs very well.