Count characters in UTF-8 string function
2022-07-31 19:31:00 【marshmallow superman】
Preface
In day-to-day development I needed to count the characters in a UTF-8 string. A single character may occupy several bytes, so counting bytes does not meet the requirement.
The main topics involved are bitwise operations, two's complement, the Unicode character set, and the UTF-8 encoding rules.
What is the difference between Unicode and UTF-8?
ASCII
In a computer, every eight binary digits (bits) make up one byte. Each bit has two states, 0 and 1, so a byte can take 256 (2 to the 8th power) different states. Computers were at first used mainly in the United States, and English letters and some commonly used characters were assigned to consecutive byte values whose leading bit is 0, numbered from 0 up to 127. These values are enough to store English text. This scheme is the ASCII encoding standardized by ANSI.
ASCII can represent only the characters used in English; the special characters of other countries and regions cannot be encoded. Each country therefore put the previously unused leading bit to work and continued the numbering up to the last state, 255, for new letters and symbols. The values 128 to 255 are called the "extended character set".
Because each country gave the extended character set a different meaning, the same byte value can stand for different characters in different countries, which easily produces garbled text.
Unicode
Unicode aims to unify all the characters in the world by assigning each one a unique number. The numbers range from 0x000000 to 0x10FFFF (hexadecimal), more than 1.1 million values. Every character has exactly one Unicode number, usually written in hexadecimal with the prefix U+. For example, the Unicode number of "马" is U+9A6C.
Unicode only specifies the number of each symbol; it does not specify how that number is stored. Storing symbols therefore still requires a separate encoding rule.
UTF-8
The UTF (UCS Transformation Format) standards address how Unicode is stored and transmitted over networks. UTF-8 works in units of 8 bits. Although every unit is 8 bits, the encoding is variable length: a character takes 1 to 4 bytes, depending on its Unicode number. Smaller numbers take fewer bytes, and larger numbers take more.
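The relation between a Unicode number and its byte count can be sketched in C as follows. The function name utf8EncodedLength is a hypothetical helper, not part of the original post; the range boundaries are those of the UTF-8 format, and surrogate values (0xD800 to 0xDFFF) are treated as invalid.

```c
#include <stdint.h>

/* Hypothetical helper: number of bytes UTF-8 needs for code point cp,
   or 0 if cp cannot be encoded. */
int utf8EncodedLength(uint32_t cp)
{
    if (cp <= 0x7F)
        return 1;                       /* 0 to 127: same as ASCII */
    if (cp <= 0x7FF)
        return 2;                       /* 128 to 2047 */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* surrogates are not characters */
    if (cp <= 0xFFFF)
        return 3;                       /* 2048 to 65535, includes U+9A6C */
    if (cp <= 0x10FFFF)
        return 4;                       /* 65536 to 0x10FFFF */
    return 0;                           /* beyond the Unicode range */
}
```

For example, utf8EncodedLength(0x9A6C) returns 3, while utf8EncodedLength(0x41) (the letter A) returns 1.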
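UTF-8 encoding rules
1. For a single-byte symbol, the first bit of the byte is 0 and the remaining seven bits hold the Unicode number. For English letters the UTF-8 encoding is therefore identical to ASCII.
2. For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and bit n+1 is set to 0. The first two bits of every following byte are set to 10. All remaining bits are filled with the symbol's Unicode number.
The Unicode number of "马" is 0x9A6C, which is 39532 in decimal. It falls in the third range (2048 - 65535), whose format is 1110XXXX 10XXXXXX 10XXXXXX. In binary, 39532 is 1001 1010 0110 1100; filling these bits into the X positions gives 11101001 10101001 10101100, that is, the bytes 0xE9 0xA9 0xAC.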
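The two rules and the worked example can be checked with a small encoder. This is a minimal sketch rather than a production routine: utf8Encode is a hypothetical name, the caller is assumed to supply a buffer of at least 4 bytes, and surrogate values are rejected.

```c
#include <stdint.h>

/* Hypothetical helper: writes the UTF-8 bytes of code point cp into out
   and returns how many bytes were written (0 if cp is not encodable). */
int utf8Encode(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7F) {                   /* rule 1: 0XXXXXXX */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                  /* 110XXXXX 10XXXXXX */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                       /* surrogates are not characters */
    if (cp <= 0xFFFF) {                 /* 1110XXXX 10XXXXXX 10XXXXXX */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {               /* 11110XXX followed by three 10XXXXXX */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                           /* beyond the Unicode range */
}
```

Calling utf8Encode(0x9A6C, buf) returns 3 and leaves 0xE9, 0xA9, 0xAC in buf, matching the bit pattern derived by hand.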
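Source code
int strlenUtf8(const char *s)
{
    int i = 0, j = 0;   /* i: byte index, j: character count */
    while (s[i]) {
        /* Count every byte that is not a continuation byte (10xxxxxx) */
        if ((s[i] & 0xc0) != 0x80) {
            j++;
        }
        i++;
    }
    return j;
}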
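To see why the byte count is not enough, the sketch below compares the result of strlen with the character count. It repeats the counting rule of strlenUtf8 so that it compiles on its own; extraBytesUtf8 is a hypothetical helper added only for illustration.

```c
#include <string.h>

/* Same counting rule as the function above, repeated so this sketch
   is self-contained. */
int strlenUtf8(const char *s)
{
    int i = 0, j = 0;
    while (s[i]) {
        if ((s[i] & 0xc0) != 0x80) {
            j++;
        }
        i++;
    }
    return j;
}

/* Hypothetical helper: how many bytes of s are continuation bytes,
   i.e. the gap between the byte length and the character count. */
int extraBytesUtf8(const char *s)
{
    return (int)strlen(s) - strlenUtf8(s);
}
```

For the string "\xE9\xA9\xAC" "abc" (the UTF-8 bytes of "马" followed by abc), strlen reports 6 bytes, strlenUtf8 reports 4 characters, and extraBytesUtf8 reports 2. The literal is split in two because a, b, and c would otherwise be read as part of the \xAC hexadecimal escape.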
Code analysis
This function counts the characters in a UTF-8 string, so the first step is to be clear about how the UTF-8 encoding rules relate to the Unicode character set. The second is the bitwise & operation.

Bitwise operations
(s[i] & 0xc0) != 0x80
0xC0=0b11000000
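0x80=0b10000000
& is the bitwise AND operator
The mask 0xC0 keeps only the first two bits of s[i]. The expression is true whenever those two bits are not 10, that is, for single-byte characters (0xxxxxxx) and for the first byte of a multi-byte character (11xxxxxx). Of the three bytes 11101001 10101001 10101100, only 11101001 satisfies the expression, so "马" is counted once.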
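The test can be isolated into a one-line predicate. The name isCharStart is hypothetical; taking the byte as unsigned char is a choice made here so that the sketch does not depend on whether plain char is signed.

```c
/* Hypothetical helper: returns 1 if byte b begins a character
   (top two bits 00, 01, or 11) and 0 if it is a continuation
   byte (top two bits 10). */
int isCharStart(unsigned char b)
{
    return (b & 0xC0) != 0x80;
}
```

Applied to the bytes of "马", isCharStart(0xE9) is 1 while isCharStart(0xA9) and isCharStart(0xAC) are 0. An ASCII byte such as 0x41 also gives 1.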
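In computer memory, integers are stored in two's complement form. On platforms where char is a signed type, the three bytes of the Chinese character "马" therefore read as -23, -87, and -84. The check still works, because the AND with 0xC0 looks only at the top two bits of the byte, and those bits are the same whether the byte is read as signed or unsigned.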
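The signed values can be reproduced without relying on the signedness of plain char. In this sketch asSignedByte is a hypothetical helper that applies the 8-bit two's complement rule by hand (values of 128 and above wrap around by subtracting 256).

```c
/* Hypothetical helper: the value of byte b when read as a signed
   8-bit two's complement number. */
int asSignedByte(unsigned char b)
{
    return b >= 0x80 ? (int)b - 256 : (int)b;
}
```

asSignedByte(0xE9), asSignedByte(0xA9), and asSignedByte(0xAC) give -23, -87, and -84. Assuming the usual two's complement representation of int, (-87 & 0xC0) is still 0x80, which is why the function recognizes continuation bytes even when char is signed.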
Summary
Counting the characters in a UTF-8 string mainly draws on bitwise operations, two's complement, the Unicode character set, and the UTF-8 encoding rules. I hope that sharing this code helps us learn and improve together, and I hope the experts will point out any mistakes. peace&love