How to truncate a string correctly (example: truncating error messages before saving them to a database)
2022-07-06 09:03:00 【Bo · Meng】
1. Preface
At work I sometimes need to save error messages to a database. The length of an error message cannot be predicted in advance, so the column that stores it is given a fixed default length. Occasionally an error message turns out longer than the length defined for that column, and the insert fails.
To solve this, the error message has to be truncated with substring before it is stored.
A string is usually truncated like this:
    int maxLength = 100; // size of the target column (example value)
    String str = "error message";
    String errMsg = str.substring(0, Math.min(str.length(), maxLength));
    xxxDao.insert(errMsg);
But can the truncated string end up garbled? Consider the following:
Suppose the error message stored in the String contains this emoji:
    public class demo {
        public static void main(String[] args) {
            String str = "\uD83D\uDe06";
            System.out.println(str);
        }
    }
Now suppose only a length of 1 may be kept from the error message:
    public class demo {
        public static void main(String[] args) {
            String str = "\uD83D\uDe06";
            System.out.println("str output: " + str);
            System.out.println("str original length: " + str.length());
            System.out.println("str truncated to length 1: " + str.substring(0, 1));
        }
    }
The output shows that the truncated string is no longer readable. The problem can be summarized in one sentence:
How do you truncate a string correctly?
2. Analysis
1. The difference between bytes and characters
- Byte: a unit of measurement for storage capacity. One byte usually equals eight bits.
- Character: a letter, digit, ideograph, or symbol used in a computer, such as A, 1, or the Chinese characters 小明 (Xiao Ming).
- Commonly used character encodings include GBK and UTF-8.
- In GBK, an English letter takes one byte and a Chinese character takes two bytes.
- In UTF-8, an English letter takes one byte and a Chinese character typically takes three bytes.
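The byte counts in the list above can be checked directly in Java. The following is a minimal sketch; the class name `ByteLengthDemo` and the helper `byteLength` are made up for illustration, and GBK is only queried if the running JVM happens to support it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ByteLengthDemo {
    // Number of bytes the text occupies when encoded with the given charset.
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String english = "A";
        String chinese = "\u660E"; // the Chinese character 明

        System.out.println(byteLength(english, StandardCharsets.UTF_8)); // 1
        System.out.println(byteLength(chinese, StandardCharsets.UTF_8)); // 3

        // GBK is not guaranteed to exist in every Java runtime, so check first.
        if (Charset.isSupported("GBK")) {
            Charset gbk = Charset.forName("GBK");
            System.out.println(byteLength(english, gbk));
            System.out.println(byteLength(chinese, gbk));
        }
    }
}
```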
2. char in Java
- In Java, a String is a sequence of char values. A char is a UTF-16 code unit: a fixed-width 16-bit value, that is, 2 bytes (8 bits per byte).
- This means a single char variable can hold a common Chinese character.
- Because char is 16 bits wide, it can take 65536 distinct values, and each value can represent one character.
In practice 65536 values are far from enough, for example for emoji and rare characters. More values are needed, so where one char used to represent one character, two char values are now combined to represent a single additional character, which makes many more characters available.
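To see the difference between the number of char values and the number of actual characters, here is a small sketch. The class and method names are hypothetical; `String.length()` and `String.codePointCount` are standard Java methods.

```java
class CharVsCodePointDemo {
    // Number of UTF-16 code units (char values) in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points, where a valid surrogate pair counts once.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String chinese = "\u660E";       // one char, one character
        String emoji = "\uD83D\uDE06";   // two chars, one character

        System.out.println(charCount(chinese) + " / " + codePointCount(chinese)); // 1 / 1
        System.out.println(charCount(emoji) + " / " + codePointCount(emoji));     // 2 / 1
    }
}
```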
3. Surrogates
- Surrogates are the mechanism UTF-16 uses to represent supplementary characters. In UTF-16, a supplementary character is assigned two 16-bit Unicode code units:
- The first code unit is called the high surrogate or leading code unit.
- The second code unit is called the low surrogate or trailing code unit.
- UTF-16 reserves 2048 code points as surrogates:
- U+D800 to U+DBFF are the high surrogates, 1024 in total.
- U+DC00 to U+DFFF are the low surrogates, also 1024.
- They appear in pairs, so they can represent 1024 × 1024 = 1048576 additional characters.
- If either the high or the low surrogate of a pair is lost, the text becomes garbled.
- A supplementary character consists of a high surrogate followed by a low surrogate.
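The ranges above determine how a supplementary code point is split into its two code units. The sketch below works this out by hand for U+1F606, the emoji written as \uD83D\uDE06 in this article; `SurrogateMath` and `toSurrogates` are illustrative names, and in real code `Character.toChars` already does the same job.

```java
class SurrogateMath {
    // Splits a supplementary code point (U+10000..U+10FFFF) into high and low surrogates.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F606);
        System.out.println(String.format("%04X %04X", (int) pair[0], (int) pair[1])); // D83D DE06
        System.out.println(Character.isHighSurrogate(pair[0])); // true
        System.out.println(Character.isLowSurrogate(pair[1]));  // true
    }
}
```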
4. Truncating a string
Take a string made of the text 表情: ("expression:") followed by an emoji.
Requirement: truncate it to a maximum length of 1, 2, 3, and 4 without producing garbled text.
    public class demo {
        public static void main(String[] args) {
            // 1. str has length 5; the emoji takes up two chars
            String str = "表情:\uD83D\uDe06";
            System.out.println("str output: " + str);
            System.out.println("str original length: " + str.length());
            System.out.println("str truncated to length 1: " + str.substring(0, 1));
            System.out.println("str truncated to length 2: " + str.substring(0, 2));
            System.out.println("str truncated to length 3: " + str.substring(0, 3));
            // 2. This output is garbled because, as noted above,
            //    losing the high or the low surrogate of a pair corrupts the text
            System.out.println("str truncated to length 4: " + str.substring(0, 4));
        }
    }
So when truncating a string, you cannot simply cut at a given length; surrogate pairs have to be taken into account:
    public class demo {
        public static void main(String[] args) {
            String str = "表情:\uD83D\uDe06";
            System.out.println("str output: " + str);
            // Length to truncate to
            int length = 4;
            // Look at the fourth char; length - 1 because indexes start at 0
            char c = str.charAt(length - 1);
            // Check whether it is a high surrogate
            if (Character.isHighSurrogate(c)) {
                // A high surrogate here means the cut would split a two-char character
                // in half and produce garbled text, so drop the high surrogate as well.
                // substring(0, 3) includes the start index and excludes the end index,
                // so only the three chars at indexes 0, 1 and 2 are kept.
                System.out.println(str.substring(0, length - 1));
            } else {
                // Not a high surrogate. It may be a low surrogate, but then its high
                // surrogate is already included, so nothing is split.
                // It may also be an ordinary single-char character, which is always safe.
                System.out.println(str.substring(0, length));
            }
        }
    }
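The check above can be wrapped in a reusable helper that also copes with strings shorter than the limit (the demo calls charAt(length - 1) unconditionally, which would throw for a shorter string). This is a minimal sketch, not a definitive implementation; `SafeTruncate` and `truncate` are hypothetical names.

```java
class SafeTruncate {
    // Returns at most maxChars UTF-16 code units of s without splitting a surrogate pair.
    static String truncate(String s, int maxChars) {
        if (maxChars < 0) {
            throw new IllegalArgumentException("maxChars must not be negative");
        }
        if (s.length() <= maxChars) {
            return s; // nothing to cut
        }
        int end = maxChars;
        // If the last char that would be kept is a high surrogate,
        // its low surrogate would be cut off, so drop the high surrogate too.
        if (end > 0 && Character.isHighSurrogate(s.charAt(end - 1))) {
            end--;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String str = "\u8868\u60C5\uFF1A\uD83D\uDE06"; // three Chinese chars plus one emoji
        System.out.println(truncate(str, 4).length()); // 3
        System.out.println(truncate(str, 5).length()); // 5
    }
}
```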
5. Inserting a String into a database
1. The difference between varchar in MySQL and Oracle
- Assume both databases store data as UTF-8 (in MySQL the character set has to be utf8mb4 for the emoji used in this article to be storable):
- In MySQL, varchar(5) means 5 characters. It can store five English letters or five Chinese characters; any string of 5 characters can be inserted.
- In Oracle, varchar(5) means 5 bytes by default. Bytes, not characters. It can still store five English letters, but only 1 Chinese character, because a Chinese character takes 3 bytes; the 2 bytes left over can hold at most 2 English letters.
- The difference between bytes and characters was explained above; put simply, one character occupies one or more bytes.
- MySQL does not care how many bytes a character occupies, only how many characters are stored.
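The two kinds of limit can be compared on the Java side before anything is sent to the database. The sketch below is only an approximation of what the databases do: it assumes that a character limit counts Unicode code points and that the byte limit applies to the UTF-8 encoding, as described above. `ColumnFitDemo` and its methods are illustrative names.

```java
import java.nio.charset.StandardCharsets;

class ColumnFitDemo {
    // Character semantics (MySQL-style in this article): count code points.
    static boolean fitsCharLimit(String s, int maxChars) {
        return s.codePointCount(0, s.length()) <= maxChars;
    }

    // Byte semantics (Oracle-style in this article): count UTF-8 bytes.
    static boolean fitsByteLimit(String s, int maxBytes) {
        return s.getBytes(StandardCharsets.UTF_8).length <= maxBytes;
    }

    public static void main(String[] args) {
        String mixed = "\u5C0F\u660Eab"; // two Chinese characters plus "ab": 4 characters, 8 bytes
        System.out.println(fitsCharLimit(mixed, 5)); // true
        System.out.println(fitsByteLimit(mixed, 5)); // false

        String small = "\u660Eab"; // one Chinese character plus "ab": 3 characters, 5 bytes
        System.out.println(fitsByteLimit(small, 5)); // true
    }
}
```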
2. Demonstrating the varchar difference
- In MySQL, the table and code are prepared as follows: [screenshot of the table and test code not preserved]
- In Oracle, the table and code stay the same: [screenshot not preserved]
Executing the first statement, userDao.insert_data("Xiao Ming eats ice cream"), produces the following error: [error screenshot not preserved]
The second and third statements can be inserted: [screenshot not preserved]
Executing the last statement, userDao.insert_data("Xiao Ming to eat AB"): [result screenshot not preserved]
3. Storing the error message (MySQL, Oracle)
- In MySQL the limit is the number of characters that can be stored, so it is enough to check whether the cut would split a surrogate pair.
- In Oracle the limit is the number of bytes that can be stored, so first encode the string, then check whether the cut would produce garbled text.
- Some people may wonder why this distinction needs to be made:
A string such as str = "小明的表情" ("Xiao Ming's expression") has a length of 5, but encoded as UTF-8 it takes 15 bytes.
4. Demonstration (inserting the string into the database)
In MySQL, because varchar(5) can store 5 characters:
    import org.junit.jupiter.api.Test;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.boot.test.context.SpringBootTest;

    @SpringBootTest
    class CsdnApplicationTests {
        @Autowired
        UserDao userDao;

        @Test
        public void mysql_test() {
            // str of length 5 ("expression:" plus emoji)
            userDao.insert_data(handleString("表情:\uD83D\uDE06", 5));
            // str of length 6 ("his expression:" plus emoji)
            userDao.insert_data(handleString("他表情:\uD83D\uDE06", 5));
            // A much longer str
            userDao.insert_data(handleString("After Xiao Ming eats ice cream, his expression is: \uD83D\uDE06", 5));
        }

        private static String handleString(String str, int length) {
            if (str.length() > length) {
                // Look at the last char that would be kept
                char c = str.charAt(length - 1);
                // Check whether it is a high surrogate
                if (Character.isHighSurrogate(c)) {
                    // A high surrogate here means the cut would split a two-char
                    // character in half and produce garbled text, so cut one more char
                    str = str.substring(0, length - 1);
                } else {
                    // An ordinary char or a low surrogate. A low surrogate means its
                    // high surrogate is already included, so no special handling is needed
                    str = str.substring(0, length);
                }
            }
            return str;
        }
    }
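The handleString method above measures the limit in Java char values, so an emoji uses up two of the five slots. If the column counts a supplementary character as a single character (an assumption about the MySQL setup, for example a utf8mb4 column), the limit can be applied to code points instead. This is an alternative sketch, not the method used in the test above; `CodePointTruncate` and `truncateCodePoints` are hypothetical names, while `codePointCount` and `offsetByCodePoints` are standard String methods.

```java
class CodePointTruncate {
    // Keeps at most maxCodePoints Unicode code points; a surrogate pair is never split
    // because the cut position is always computed on code point boundaries.
    static String truncateCodePoints(String s, int maxCodePoints) {
        if (maxCodePoints < 0) {
            throw new IllegalArgumentException("maxCodePoints must not be negative");
        }
        if (s.codePointCount(0, s.length()) <= maxCodePoints) {
            return s;
        }
        // Index of the char that follows the first maxCodePoints code points.
        int end = s.offsetByCodePoints(0, maxCodePoints);
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String str = "ab\uD83D\uDE06cd"; // 6 chars, 5 code points
        System.out.println(truncateCodePoints(str, 3).length()); // 4
        System.out.println(truncateCodePoints(str, 5).length()); // 6
    }
}
```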
In Oracle, because varchar(5) can store only one Chinese character:
    import java.nio.charset.StandardCharsets;
    import org.junit.jupiter.api.Test;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.boot.test.context.SpringBootTest;

    @SpringBootTest
    class CsdnApplicationTests {
        @Autowired
        UserDao userDao;

        @Test
        public void oracle_test() {
            // str of length 5 ("expression:" plus emoji)
            userDao.insert_data(handleString("表情:\uD83D\uDE06", 5));
            // str of length 6 ("his expression:" plus emoji)
            userDao.insert_data(handleString("他表情:\uD83D\uDE06", 5));
            // A much longer str
            userDao.insert_data(handleString("After Xiao Ming eats ice cream, his expression is: \uD83D\uDE06", 5));
        }

        private static String handleString(String str, int length) {
            // If str is much longer than length, cut by chars first
            // so the UTF-8 encoding below has less work to do
            if (str.length() > length) {
                // Look at the last char that would be kept
                char c = str.charAt(length - 1);
                // Check whether it is a high surrogate
                if (Character.isHighSurrogate(c)) {
                    // The cut would split a two-char character in half, so cut one more char
                    str = str.substring(0, length - 1);
                } else {
                    // An ordinary char or a low surrogate whose high surrogate is
                    // already included, so no special handling is needed
                    str = str.substring(0, length);
                }
            }
            // Oracle counts bytes, so encode the string first.
            // While the encoded form is still longer than length, keep cutting from the end
            while (StandardCharsets.UTF_8.encode(str).limit() > length) {
                // Last char of the string truncated so far
                char c = str.charAt(str.length() - 1);
                // Check whether it is a low surrogate
                if (Character.isLowSurrogate(c)) {
                    // A low surrogate must be removed together with the high surrogate
                    // before it, otherwise the text becomes garbled, so cut two chars
                    str = str.substring(0, str.length() - 2);
                } else {
                    // An ordinary char or a high surrogate: cut one char
                    str = str.substring(0, str.length() - 1);
                }
            }
            return str;
        }
    }
The inserts succeed without any error.
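As an alternative to re-encoding the string on every loop iteration, the byte budget can be applied in a single pass over the code points. The sketch below assumes the target encoding is UTF-8, as in the article; `Utf8ByteTruncate`, `utf8Length`, and `truncateUtf8` are illustrative names and not part of the original test code.

```java
import java.nio.charset.StandardCharsets;

class Utf8ByteTruncate {
    // Number of bytes UTF-8 needs for one code point.
    // An unpaired surrogate is counted as 3 bytes, which can only overestimate.
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    // Keeps the longest prefix of s whose UTF-8 encoding fits into maxBytes,
    // cutting only on code point boundaries so no surrogate pair is split.
    static String truncateUtf8(String s, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            int size = utf8Length(cp);
            if (bytes + size > maxBytes) {
                break; // the next character no longer fits
            }
            bytes += size;
            i += Character.charCount(cp); // 1 for BMP characters, 2 for surrogate pairs
        }
        return s.substring(0, i);
    }

    public static void main(String[] args) {
        String str = "\u5C0F\u660Eab"; // two Chinese characters plus "ab", 8 bytes in UTF-8
        String cut = truncateUtf8(str, 5);
        System.out.println(cut.length());                                // 1
        System.out.println(cut.getBytes(StandardCharsets.UTF_8).length); // 3
    }
}
```

Walking the code points once avoids encoding the whole string repeatedly, which mainly matters for long error messages.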
3. Summary
- For strings that may contain special characters, do not blindly call str.substring(0, length). Characters made of two char values have to be considered: if the cut falls in the middle of one, the text becomes garbled.
- Understand what high surrogates and low surrogates are.
- Because varchar differs between MySQL and Oracle, truncation for Oracle additionally has to compare the byte length after encoding.