Count characters in UTF-8 string function
2022-07-31 19:31:00 【marshmallow superman】

Preface
In day-to-day development you sometimes need to count the characters in a UTF-8 string. A character may occupy more than one byte, so counting bytes does not meet the requirement.
The main topics involved are bit operations, two's complement, the Unicode character set, and the UTF-8 encoding rules.
What is the difference between Unicode and UTF-8?
ASCII
In a computer, every eight binary digits (bits) form one byte. A bit has two states, 0 and 1, so an eight-bit byte can take 256 (2 to the 8th power) different states. Computers were at first used mainly in the United States, and early on people encoded English letters in 8 bits with the leading bit set to 0. English letters and some commonly used characters were assigned consecutive byte values, numbered up to 127, which was enough to store English text. This scheme is known as ANSI's ASCII encoding.
ASCII can only represent the characters used in English and cannot encode the special characters of other countries and regions. Individual countries therefore put the previously unused leading bit to work and kept numbering up to the last state, 255, to represent new letters and symbols. The characters numbered 128 to 255 are called the "extended character set".
Because each country gave the extended character set its own meaning, the same byte value could stand for different characters in different countries, which easily produced garbled text.
Unicode
Unicode sets out to unify all the characters in the world by assigning each one a unique number. The numbers range from 0x000000 to 0x10FFFF (hexadecimal), a little over 1.1 million values. Every character has a unique Unicode number, normally written in hexadecimal with a U+ prefix. For example, the Unicode number of "马" is U+9A6C.

Unicode only specifies the number of each symbol, not how that number is stored, so storing symbols still requires a separate encoding rule.
UTF-8
The UTF (UCS Transformation Format) standards address how Unicode is transmitted over a network. UTF-8 transmits data in units of 8 bits. Although the unit is always 8 bits, the encoding is variable-length: the number of bytes used depends on the size of the character's Unicode number. Smaller numbers use fewer bytes, and the byte count grows as the number grows.
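UTF-8 encoding rules
1. For a single-byte symbol, the first bit of the byte is 0 and the remaining seven bits are the Unicode number. For English letters, the Unicode number and the ASCII code are therefore the same.
2. For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and bit n+1 is set to 0; the first two bits of every following byte are set to 10. All the remaining bits hold the symbol's Unicode number.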
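The two rules can be read back off the first byte of a sequence. The sketch below is only an illustration and is not part of the original function: the name utf8_sequence_length is hypothetical, and it assumes the modern limit of at most four bytes per character.

```c
#include <assert.h>
#include <stddef.h>

/* Length in bytes of the UTF-8 sequence whose first byte is *s,
 * judged only from the leading bits of that byte.
 * Returns 0 for a continuation byte (10xxxxxx) or for a lead byte
 * that would announce more than four bytes (11111xxx). */
int utf8_sequence_length(const char *s)
{
    assert(s != NULL);
    unsigned char b = (unsigned char)*s;

    if ((b & 0x80) == 0x00) return 1;  /* 0xxxxxxx: rule 1, single byte   */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx: rule 2 with n = 2     */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx: rule 2 with n = 3     */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx: rule 2 with n = 4     */
    return 0;                          /* not a valid first byte          */
}
```

The function inspects only the first byte, so it does not verify that the following bytes really start with 10.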
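The Unicode number of "马" is 0x9A6C, which is 39532 in decimal. It falls in the third range (2048 to 65535), whose format is 1110XXXX 10XXXXXX 10XXXXXX. In binary, 39532 is 1001 1010 0110 1100; filling those 16 bits into the X positions from left to right gives 11101001 10101001 10101100.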
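The same fill-in procedure can be written as code. This is a sketch under assumptions: utf8_encode is a hypothetical helper name, it covers the ranges up to U+10FFFF, it rejects the surrogate range U+D800 to U+DFFF, and it is meant only to reproduce the worked example for "马".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode number as UTF-8 (hypothetical helper).
 * Writes up to 4 bytes into out and returns how many were written,
 * or 0 if cp is a surrogate or lies beyond 0x10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp <= 0x7F) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return 0;                       /* surrogates are not characters */
        }
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110xxx followed by three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* outside the Unicode range */
}
```

Calling utf8_encode(0x9A6C, buf) fills buf with 0xE9 0xA9 0xAC, which are the three bytes 11101001 10101001 10101100 derived by hand.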
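Source code
int strlenUtf8(const char *s)
{
    int i = 0, j = 0;   /* i: byte index, j: character count */
    while (s[i]) {
        /* Count every byte that is not a continuation byte (10xxxxxx). */
        if ((s[i] & 0xc0) != 0x80) {
            j++;
        }
        i++;
    }
    return j;
}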
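A quick way to see the difference from a byte count is to put the result next to strlen. The sketch repeats the function so that it compiles on its own; compare_lengths is a hypothetical helper, and the caller is assumed to pass valid UTF-8.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Same logic as the function above, repeated so this sketch is self-contained. */
int strlenUtf8(const char *s)
{
    int i = 0, j = 0;
    while (s[i]) {
        if ((s[i] & 0xc0) != 0x80) {
            j++;
        }
        i++;
    }
    return j;
}

/* Hypothetical helper: print the byte count next to the character count. */
void compare_lengths(const char *s)
{
    assert(s != NULL);
    printf("bytes=%zu chars=%d\n", strlen(s), strlenUtf8(s));
}
```

For the three bytes of "马", written as "\xE9\xA9\xAC" so the example does not depend on the source file's encoding, strlen reports 3 while strlenUtf8 reports 1.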
Code analysis
This function counts the characters in a UTF-8 string, so the first step is to be clear about the UTF-8 encoding rules and how they differ from the Unicode character set. The second is the bitwise & operation.

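Bit operations
(s[i] & 0xc0) != 0x80
0xC0=0b11000000
0x80=0b10000000
& is the bitwise AND operator
The mask 0xC0 keeps only the top two bits of s[i], so the expression is true for every byte whose top two bits are not 10: single-byte characters (0xxxxxxx) and the first byte of a multi-byte character (11xxxxxx). Of the three bytes 11101001 10101001 10101100, only 11101001 satisfies the expression, so "马" is counted once.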
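The same mask can also step through a string one character at a time. This is a minimal sketch that assumes well-formed input; utf8_next is a hypothetical helper and not part of the original code.

```c
#include <assert.h>
#include <stddef.h>

/* Return a pointer to the start of the next character after s,
 * or s itself when s already points at the terminating NUL. */
const char *utf8_next(const char *s)
{
    assert(s != NULL);
    if (*s == '\0') {
        return s;
    }
    s++;  /* step over the first byte of the current character */
    /* Skip continuation bytes (10xxxxxx). The NUL terminator has
     * top bits 00, so the loop always stops at the end of the string. */
    while (((unsigned char)*s & 0xC0) == 0x80) {
        s++;
    }
    return s;
}
```

Counting how many times utf8_next can be applied before reaching the terminator gives the same result as strlenUtf8 on well-formed input.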
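In memory, integers are stored in two's complement form. Where char is a signed type, the three bytes of the Chinese character "马" therefore read as -23, -87 and -84. The test still works, because the & with 0xC0 depends only on the bit pattern of the byte.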
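The signed values can be checked directly. The sketch assumes a two's complement platform where converting a byte value above 127 to signed char wraps to a negative number, which holds for common compilers but is implementation-defined before C23; bytes_as_signed and top_two_bits are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Store each of the first n bytes of s as a signed 8-bit value. */
void bytes_as_signed(const char *s, size_t n, int out[])
{
    assert(s != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        out[i] = (signed char)s[i];
    }
}

/* Top two bits of a byte value, masked the same way as in strlenUtf8.
 * A negative input is sign-extended to int, but the mask keeps only
 * bits 7 and 6, which are unchanged by the extension. */
int top_two_bits(int b)
{
    return b & 0xC0;
}
```

For the bytes 0xE9 0xA9 0xAC this yields -23, -87 and -84, and masking them gives 0xC0, 0x80 and 0x80, so only the first byte passes the != 0x80 test.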
Summary
Counting the characters in a UTF-8 string mainly draws on bit operations, two's complement, the Unicode character set, and the UTF-8 encoding rules. I hope sharing this code helps us learn and improve together, and I welcome corrections from more experienced readers wherever there are mistakes. peace&love