Count characters in UTF-8 string function
2022-07-31 19:31:00 【marshmallow superman】
Preface
In day-to-day development I needed to count the characters in a UTF-8 string. A single character may occupy several bytes, so counting bytes does not give the right answer.
The main topics involved are bit operations, two's complement, the Unicode character set, and the UTF-8 encoding rules.
What is the difference between Unicode and UTF-8?
ASCII
In a computer, eight binary digits (bits) form one byte. Each bit has two states, 0 and 1, so a byte can take 256 (2 to the 8th power) different states. Computers were at first used mainly in the United States, and early on English letters were encoded in 8-bit bytes with the leading bit set to 0. English letters and some commonly used characters were assigned consecutive byte values, up to 127, which was enough to store English text. This scheme is ASCII, standardized by ANSI.
ASCII covers only the characters needed for English and cannot encode the special characters of other countries and regions. Various countries therefore put the unused leading bit of the byte to work and kept numbering up to the last state, 255, to represent new letters and symbols. The range 128 to 255 is called the "extended character set".
However, each country gave the extended character set its own meaning, so the same byte value could stand for different characters in different countries. This easily produces garbled text.
Unicode
Unicode aims to unify all the characters in the world by assigning each one a unique number. The numbers range from 0x000000 to 0x10FFFF (hexadecimal), a little over 1.1 million values. The number is usually written in hexadecimal with a U+ prefix. For example, the Unicode number of "马" is U+9A6C.
Unicode only specifies the number of each symbol. It does not specify how that number is stored, so storing symbols still requires a corresponding encoding rule.
UTF-8
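The UTF (UCS Transformation Format) standards address how Unicode is stored and transmitted, for example over a network. UTF-8 works in units of 8 bits, but it is a variable-length encoding: the number of bytes used for a character depends on the size of its Unicode number. Small numbers take fewer bytes, and larger numbers take more, from 1 up to 4.
UTF-8 encoding rules
1. For a single-byte symbol, the first bit of the byte is 0 and the remaining seven bits hold the Unicode number. For English letters, the UTF-8 encoding is therefore identical to ASCII.
2. For a symbol of n bytes (n > 1), the first n bits of the first byte are set to 1 and bit n+1 is set to 0. The first two bits of every following byte are set to 10. All remaining bits are filled with the symbol's Unicode number.
The Unicode number of "马" is 0x9A6C, or 39532 in decimal, which falls in the third range (2048 to 65535, the three-byte range). The format for that range is 1110XXXX 10XXXXXX 10XXXXXX. 39532 in binary is 1001 1010 0110 1100; filling those bits into the format gives 11101001 10101001 10101100.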
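The filling-in step above can be sketched in C. This is only an illustration under my own assumptions: the function name utf8_encode and its signature are hypothetical and do not come from any library, and the check function simply replays the "马" example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the original post): encodes one Unicode
 * code point into out[] and returns the number of bytes written (1 to 4),
 * or 0 if cp is a surrogate or lies above 0x10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                         /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                        /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                      /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return 0;                        /* surrogates are not characters */
        }
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                    /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Replays the worked example: U+9A6C becomes E9 A9 AC,
 * i.e. 11101001 10101001 10101100. */
void checkHorseEncoding(void)
{
    unsigned char b[4];
    assert(utf8_encode(0x9A6C, b) == 3);
    assert(b[0] == 0xE9 && b[1] == 0xA9 && b[2] == 0xAC);
}
```

The shifts by 12 and 6 pick out the top 4, middle 6, and low 6 bits of the 16-bit number, which are exactly the X positions in the three-byte format.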
Source code
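/* Count the characters (code points) in a NUL-terminated UTF-8 string. */
int strlenUtf8(const char *s)
{
    int i = 0, j = 0;   /* i: byte index, j: character count */
    while (s[i]) {
        /* Count every byte that is not a continuation byte (10xxxxxx). */
        if ((s[i] & 0xc0) != 0x80) {
            j++;
        }
        i++;
    }
    return j;
}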
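A small usage sketch that contrasts the byte count from strlen with the character count. The function is repeated so the block compiles on its own; the NULL assert, the test strings HORSE and MIXED, and the helper continuationBytes are my own additions for illustration and are not part of the original code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Repeated from the post so this block is self-contained.
 * The NULL assert is an addition for illustration only. */
int strlenUtf8(const char *s)
{
    assert(s != NULL);
    int i = 0, j = 0;
    while (s[i]) {
        if ((s[i] & 0xc0) != 0x80) {
            j++;
        }
        i++;
    }
    return j;
}

/* Illustrative test strings (assumptions). "马" is spelled as explicit bytes
 * so the result does not depend on the source file's encoding. The second
 * literal is split before "b" because \xAC followed directly by b would be
 * read as one longer hex escape. */
const char HORSE[] = "\xE9\xA9\xAC";
const char MIXED[] = "a\xE9\xA9\xAC" "b";

/* Hypothetical helper: number of continuation bytes in s, that is,
 * bytes counted by strlen but not by strlenUtf8. */
size_t continuationBytes(const char *s)
{
    return strlen(s) - (size_t)strlenUtf8(s);
}
```

For MIXED, strlen reports 5 bytes while strlenUtf8 reports 3 characters, and the difference of 2 is the number of 10xxxxxx bytes inside "马". Note that the function assumes well-formed UTF-8; it does not validate its input.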
Code analysis
This function counts the characters in a UTF-8 string, so the first thing to be clear about is the difference between the UTF-8 encoding rules and the Unicode character set. The second is the & bit operation.

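Bit operations
(s[i] & 0xc0) != 0x80
0xC0=0b11000000
0x80=0b10000000
& is the bitwise AND operator.
The expression is true whenever the top two bits of s[i] are not 10, that is, for single-byte (ASCII) characters (0xxxxxxx) and for the first byte of a multi-byte character (11xxxxxx). Of the three bytes 11101001 10101001 10101100, only 11101001 satisfies the expression, so "马" is counted exactly once.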
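The same masking idea extends to reading the length of a sequence from its first byte. The sketch below is an assumption-laden illustration: isContinuation, utf8SeqLen, and checkHorseBytes are hypothetical names, and the length function does not reject overlong or out-of-range lead bytes such as 0xC0 or 0xF5.

```c
#include <assert.h>

/* Hypothetical helper: 1 if b is a continuation byte (10xxxxxx), else 0.
 * This is the negation of the test used in strlenUtf8. */
int isContinuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: sequence length announced by a first byte.
 * Returns 0 for a continuation byte or a pattern that starts no sequence.
 * It only inspects the leading bits and does not validate further. */
int utf8SeqLen(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1;   /* 0xxxxxxx */
    if ((b & 0xE0) == 0xC0) return 2;   /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;   /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;   /* 11110xxx */
    return 0;                           /* 10xxxxxx or 11111xxx */
}

/* Worked check against the bytes of "马" (E9 A9 AC) from the post. */
void checkHorseBytes(void)
{
    assert(utf8SeqLen(0xE9) == 3);      /* 11101001 starts a 3-byte sequence */
    assert(isContinuation(0xA9));       /* 10101001 */
    assert(isContinuation(0xAC));       /* 10101100 */
    assert(!isContinuation(0xE9));
}
```

Counting only the bytes for which isContinuation is false gives the same result as the original function.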
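In memory, signed integers are stored in two's complement form. On platforms where char is signed, the three bytes of the Chinese character "马" therefore read as -23, -87, and -84. The mask test still works, because & operates on the bit pattern rather than on the sign.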
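To make the signed values concrete, here is a minimal sketch. The helpers asSignedByte and isLeadOrAscii are hypothetical names of my own, and the code assumes a two's complement platform, which is what essentially every current C implementation uses.

```c
#include <assert.h>

/* Hypothetical helper: the value of a byte when its 8 bits are read as a
 * two's-complement signed integer (what a signed char would hold). */
int asSignedByte(unsigned char b)
{
    return b >= 0x80 ? (int)b - 256 : (int)b;
}

/* The test from strlenUtf8, applied to an int that may be negative.
 * On a two's complement machine the low 8 bits of -87 are 10101001,
 * so masking with 0xC0 gives 0x80 just as it does for 0xA9. */
int isLeadOrAscii(int byteValue)
{
    return (byteValue & 0xC0) != 0x80;
}

/* Worked check against the bytes of "马" (E9 A9 AC). */
void checkSignedHorse(void)
{
    assert(asSignedByte(0xE9) == -23);
    assert(asSignedByte(0xA9) == -87);
    assert(asSignedByte(0xAC) == -84);
    assert(isLeadOrAscii(-23));     /* first byte: counted */
    assert(!isLeadOrAscii(-87));    /* continuation byte: skipped */
    assert(!isLeadOrAscii(-84));    /* continuation byte: skipped */
}
```

This is why the original function gives the same count whether char is signed or unsigned on the target platform.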
Summary
Counting the characters in a UTF-8 string mainly draws on bit operations, two's complement, the Unicode character set, and the UTF-8 encoding rules. I hope that sharing this code helps us learn and improve together, and I would welcome corrections from more experienced readers wherever I have made mistakes. peace&love