What is the encoding that starts with &#x?
2022-07-31 09:32:00 【Flying fat meow】
What encoding starts with &#x?
When using cheerio in Node to parse a web page, all the Chinese content in the output turned into what looked like garbled text starting with &#x. I tried every kind of encoding conversion to no avail, and the strange part is that after saving this "garbled" text as a web page, it displayed normally when opened in a browser. What exactly is this?
The reduced sample code is as follows:
const cheerio = require('cheerio');
const $ = cheerio.load('<div id="content">你好</div>');
console.log($('#content').html()); // &#x4F60;&#x597D;
In fact, the seemingly garbled text above has a proper name: HTML entity encoding.
The following is quoted from an answer found on Zhihu.
In HTML, certain characters are reserved, such as the less-than sign "<" and the greater-than sign ">", and browsers treat them as part of tags. To display these reserved characters in HTML, we need to use character entities. The most familiar ones are the non-breaking space "&nbsp;", the less-than sign "&lt;", and the greater-than sign "&gt;". This named format is more semantic and easier to remember, but character entities can in fact be written in other formats as well:
&name; &#dddd; &#xhhhh;
- All three escaping methods are called character references. The first is a character entity reference: the "&" symbol is followed by a predefined entity name.
- The latter two are numeric character references, where the number is the Unicode code point of the target character. The form starting with "&#" is followed by a decimal number, and the form starting with "&#x" is followed by a hexadecimal number.
Starting with HTML4, numeric character references always refer to Unicode code points, regardless of the document encoding. The two characters "你好" are the Unicode characters U+4F60 and U+597D, so their code point values are "4F60" and "597D" in hexadecimal, or "20320" and "22909" in decimal. So typing either of the following in HTML
&#x4F60;&#x597D;
&#20320;&#22909;
will display as "你好".
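The mapping between characters and their numeric character references can be checked with a short script. This is only a minimal sketch; `toNumericReferences` is a hypothetical helper name used for illustration, not part of cheerio or any other library.

```javascript
// Hypothetical helper: encode every character of a string as a numeric
// character reference, in hexadecimal (default) or decimal form.
function toNumericReferences(text, radix = 16) {
  const prefix = radix === 16 ? '&#x' : '&#';
  // Array.from iterates by code point, so characters above U+FFFF stay whole.
  return Array.from(text, (ch) => {
    const codePoint = ch.codePointAt(0);
    return prefix + codePoint.toString(radix).toUpperCase() + ';';
  }).join('');
}

console.log(toNumericReferences('你好'));     // &#x4F60;&#x597D;
console.log(toNumericReferences('你好', 10)); // &#20320;&#22909;
```

Both outputs describe the same two code points, which is why a browser renders either form as the same text.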
Now that we know the cause, how do we solve the problem above?
Method 1: Use the option provided by cheerio
cheerio handles entities by default through its decodeEntities option; we just need to turn that option off.
const cheerio = require('cheerio');
const $ = cheerio.load('<div id="content">你好</div>', { decodeEntities: false });
console.log($('#content').html()); // 你好
Method 2: Decode manually
function decode(str) {
  // Optionally convert literal \uXXXX escapes first (only needed when the
  // returned data contains many \u sequences). Note that unescape is
  // deprecated and also decodes any %XX sequences in the string.
  str = unescape(str.replace(/\\u/g, "%u"));
  // Then decode the numeric character references: "&#x" plus hex digits is
  // hexadecimal, "&#" plus digits is decimal; hex or dec holds the match.
  str = str.replace(/&#(?:x([0-9a-f]+)|(\d+));/gi, function (match, hex, dec) {
    // fromCodePoint also handles code points above U+FFFF
    return String.fromCodePoint(hex ? parseInt(hex, 16) : parseInt(dec, 10));
  });
  return str;
}
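Manual decoding as in Method 2 covers numeric references only. As a rough sketch, the same idea can be extended to a few named entities in a single pass. The function name `decodeReferences` and the small `NAMED` table are assumptions for illustration, and the table covers only a handful of the entities HTML defines.

```javascript
// Hypothetical lookup table: only a small subset of HTML's named entities.
const NAMED = { lt: '<', gt: '>', amp: '&', quot: '"', nbsp: '\u00A0' };

// Decode hexadecimal, decimal, and (a few) named character references.
// Everything is handled in one pass, so "&amp;lt;" becomes "&lt;", not "<".
function decodeReferences(str) {
  const pattern = /&(?:#x([0-9a-f]+)|#(\d+)|([a-z][a-z0-9]*));/gi;
  return str.replace(pattern, (match, hex, dec, name) => {
    if (name !== undefined) {
      // Unknown names are left untouched.
      return Object.prototype.hasOwnProperty.call(NAMED, name) ? NAMED[name] : match;
    }
    const codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    // Values beyond the Unicode range are left untouched instead of throwing.
    return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : match;
  });
}

console.log(decodeReferences('&#x4F60;&#x597D; &lt;b&gt; &#20320;&#22909;')); // 你好 <b> 你好
```

For real pages, a maintained entity-decoding library is likely a safer choice than a hand-written table, since HTML defines far more named entities than the five listed here.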
Reprinted from: What encoding starts with &#x? - Cannon~ - Blog Park