Count characters in UTF-8 string function
2022-07-31 19:31:00 【marshmallow superman】
Preface
In day-to-day development you sometimes need to count the characters in a UTF-8 string. A single character may occupy several bytes, so counting bytes does not give the number of characters.
The main topics involved are bit operations, two's complement, the Unicode character set, and the UTF-8 encoding rules.
What is the difference between Unicode and UTF-8?
ASCII
In a computer, every eight binary digits form one byte. Each binary digit has two states, "0" and "1", so an 8-bit byte can take 256 (2 to the 8th power) different states. Computers were at first used mainly in the United States, and early on people used 8-bit bytes, with the leading bit set to 0, to encode English letters. The English letters and some commonly used characters were assigned consecutive byte values, numbered up to 127, which was enough to store English text. This scheme is the ASCII encoding standardized by ANSI.
ASCII can only represent the characters used in English; the special characters of other countries and regions cannot be encoded. So each country decided to use the previously unused leading bit of the byte and kept numbering up to the last state, 255, to represent new letters and symbols. The range 128~255 is therefore called the "extended character set".
However, each country assigned a different meaning to the "extended character set", so the same byte value could stand for different characters in different countries, which easily produces garbled text.
Unicode
Unicode was created to unify all the characters in the world by assigning each one a unique number. The numbers range from 0x000000 to 0x10FFFF (hexadecimal), a little over 1.1 million values. Each character has a unique Unicode number (its code point), usually written in hexadecimal with the prefix U+. For example, the Unicode number of "马" is U+9A6C.
Unicode only specifies the number assigned to each symbol, not how that number is stored, so storing a symbol still requires an encoding rule.
UTF-8
The UTF (Unicode Transformation Format) standards were created to solve the problem of how to transmit Unicode over a network. UTF-8 works in units of 8 bits. Although each unit is 8 bits, the encoding is variable-length: a character takes between 1 and 4 bytes, depending on the size of its Unicode number. Smaller numbers use fewer bytes, and the byte count grows as the number grows.
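UTF-8 encoding rules
1. For a single-byte symbol, the first bit of the byte is 0 and the remaining seven bits hold the Unicode number. For English letters the Unicode number is therefore the same as the ASCII code.
2. For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and bit n+1 is set to 0; the first two bits of each following byte are always 10. All the remaining bits not mentioned hold the symbol's Unicode number.
The Unicode number of "马" is 0x9A6C, or 39532 as a decimal integer. This falls in the third range (2048 - 65535), the one encoded with three bytes, whose format is 1110XXXX 10XXXXXX 10XXXXXX. 39532 in binary is 1001 1010 0110 1100; filling these 16 bits into the X positions gives 11101001 10101001 10101100.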
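The two rules and the worked example for "马" can be sketched in C as follows. The names utf8_encode and check_encode_ma are hypothetical and not part of the original article, and rejecting surrogates (U+D800 to U+DFFF) is an extra check taken from the UTF-8 specification rather than from the rules listed here. Treat this as a minimal illustration, not a complete encoder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
 * Writes up to 4 bytes into out and returns how many were written,
 * or 0 if cp is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* rule 1: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return 0;                      /* surrogates are not characters */
        }
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx followed by three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Self-check against the worked example: U+9A6C becomes
 * 11101001 10101001 10101100, i.e. 0xE9 0xA9 0xAC. */
void check_encode_ma(void)
{
    unsigned char buf[4];
    assert(utf8_encode(0x9A6C, buf) == 3);
    assert(buf[0] == 0xE9 && buf[1] == 0xA9 && buf[2] == 0xAC);
}
```

Calling check_encode_ma() aborts the program if the encoder disagrees with the hand-computed bytes, which makes it a convenient sanity check when experimenting with other code points.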
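Source code
int strlenUtf8(const char *s)
{
    int i = 0, j = 0;  /* i: byte index, j: character count */
    while (s[i]) {
        /* Count every byte that is not a continuation byte (10xxxxxx). */
        if ((s[i] & 0xc0) != 0x80) {
            j++;
        }
        i++;
    }
    return j;
}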
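To see the difference between counting bytes and counting characters, the sketch below reproduces strlenUtf8 so that it compiles on its own and compares it with the standard strlen. The function compare_counts and the sample string are illustrative assumptions, not part of the original article. The input is assumed to be valid, NUL-terminated UTF-8.

```c
#include <assert.h>
#include <string.h>

/* The function from the article, repeated here so the sketch is self-contained. */
int strlenUtf8(const char *s)
{
    int i = 0, j = 0;
    while (s[i]) {
        if ((s[i] & 0xc0) != 0x80) {
            j++;
        }
        i++;
    }
    return j;
}

/* Hypothetical check: byte count versus character count. */
void compare_counts(void)
{
    /* "马" written as explicit UTF-8 bytes, followed by ASCII "ab".
     * The literal is split so that 'a' and 'b' are not read as hex digits. */
    const char *text = "\xE9\xA9\xAC" "ab";

    assert(strlen(text) == 5);      /* 3 bytes for "马" plus 2 ASCII bytes */
    assert(strlenUtf8(text) == 3);  /* three characters: "马", "a", "b" */
}
```

Note that strlenUtf8 counts code points, not user-perceived characters: a base letter followed by a combining mark would count as two.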
Code analysis
This function counts the characters in a UTF-8 string, so the first thing to be clear about is the difference between the Unicode character set and the UTF-8 encoding rules; the second is the & bit operation.

Bit operations
(s[i] & 0xc0) != 0x80
0xC0=0b11000000
0x80=0b10000000
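& is the bitwise AND operator
The expression is true when the top two bits of s[i] are anything other than 10, that is, when s[i] is not a continuation byte. Single-byte characters (0xxxxxxx) and the lead bytes of multi-byte characters (11xxxxxx) both satisfy it. Of the three bytes 11101001 10101001 10101100, only 11101001 satisfies the expression, so "马" is counted once.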
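The same mask-and-compare idea can classify any byte by its leading bits. The sketch below is an illustration under that assumption; utf8_sequence_length and check_ma_lead_bytes are hypothetical names, and the function only looks at the bit pattern, it does not validate that a whole sequence is well-formed UTF-8.

```c
#include <assert.h>

/* Hypothetical helper: how many bytes the character starting at b occupies,
 * judged only by the leading bits of b.
 * Returns 0 for a continuation byte (10xxxxxx) or a pattern UTF-8 never uses. */
int utf8_sequence_length(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1;  /* 0xxxxxxx: single byte */
    if ((b & 0xC0) == 0x80) return 0;  /* 10xxxxxx: continuation byte */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx: lead of 2 bytes */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx: lead of 3 bytes */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx: lead of 4 bytes */
    return 0;                          /* 11111xxx: not used by UTF-8 */
}

/* Self-check with the three bytes of "马": only the first one starts a character. */
void check_ma_lead_bytes(void)
{
    assert(utf8_sequence_length(0xE9) == 3);  /* 11101001 */
    assert(utf8_sequence_length(0xA9) == 0);  /* 10101001 */
    assert(utf8_sequence_length(0xAC) == 0);  /* 10101100 */
}
```

A byte counts as the start of a character exactly when this function returns a nonzero value for valid input, which matches the condition (s[i] & 0xc0) != 0x80 used in strlenUtf8.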
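In memory, integers are stored in two's complement form. Where char is a signed type, the three bytes of the Chinese character "马" therefore read as -23, -87, and -84. When s[i] is promoted to int for the & operation, sign extension leaves the low eight bits unchanged, so the test gives the same result whether char is signed or unsigned.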
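The values -23, -87, and -84 can be reproduced with a small sketch. The helpers as_signed_byte, starts_character, and check_ma_signed are hypothetical names introduced for illustration, and the code assumes a two's complement platform, which covers practically every current C implementation.

```c
#include <assert.h>

/* Hypothetical helper: read a byte value as a signed 8-bit two's complement
 * number, independent of whether the platform's plain char is signed. */
int as_signed_byte(unsigned char b)
{
    return b >= 0x80 ? (int)b - 256 : (int)b;
}

/* The test from strlenUtf8, applied to the int value a signed char
 * would have after integer promotion. */
int starts_character(int promoted)
{
    return (promoted & 0xc0) != 0x80;
}

/* Self-check with the three bytes of "马" (0xE9 0xA9 0xAC). */
void check_ma_signed(void)
{
    assert(as_signed_byte(0xE9) == -23);
    assert(as_signed_byte(0xA9) == -87);
    assert(as_signed_byte(0xAC) == -84);

    /* Masking with 0xc0 only looks at bits 7 and 6, which sign
     * extension does not change, so negative values still work. */
    assert(starts_character(-23) == 1);
    assert(starts_character(-87) == 0);
    assert(starts_character(-84) == 0);
}
```

This is why strlenUtf8 can take a plain const char * without casting each byte to unsigned char before the mask.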
Summary
The main knowledge points behind counting the characters in a UTF-8 string are bit operations, two's complement, the Unicode character set, and the UTF-8 encoding rules. I hope sharing this code helps us learn and improve together, and I welcome corrections wherever there are mistakes. peace&love