Fast detection of short text repetition rate
2022-06-10 15:11:00 【Alibaba Taobao technology team official blog】

Let's get straight to the point. This article describes a method for quickly detecting the repetition rate of short text. It applies to scenarios such as content publishing and product listing, where it helps reduce low-quality copy padded with repeated phrases, for example: “High pressure washing water gun, Wash the car easily without waiting, All copper 4 branch 6 The sub high-pressure water gun can adjust the spray gun joint suit to water the garden, High pressure washing water gun, Wash the car easily without waiting”
The core difficulty
The hardest part of this problem is identifying the repeated key phrases. Once we have them, we can compute each phrase's share of the total character count and its number of occurrences, and from those derive the repetition rate. So let's start with that step.
Analyze key words and sentences
Let's take the example above. For ease of understanding, the repeated copy is marked by hand here:
“High pressure washing water gun, Wash the car easily without waiting, All copper 4 branch 6 The sub high-pressure water gun can adjust the spray gun joint suit to water the garden, High pressure washing water gun, Wash the car easily without waiting”
Repeated phrases share the same background color in the marked copy. The repetitions are as follows:
“High pressure washing water gun, Wash the car easily without waiting” appears 2 times. This is the most obvious case of phrase stacking, and it is the result we ultimately want the analysis to find
“Water gun” appears 3 times
“spray” appears 3 times
“branch” appears 2 times
“gun” appears 4 times
A human reader can tell at a glance that this copy is substandard and is clearly padded with repeated phrases. So how can code identify this quickly?
The manual analysis above reveals several characteristics:
Single characters, words, and punctuation marks are the most likely to repeat, so they are poor evidence of repetition on their own
Longer repetitions contain shorter ones, so double counting must be avoided; otherwise the computed repetition rate is inflated
▐ Remove special characters
Based on these characteristics, my first decision was to remove special characters. After all, in a real business scenario nobody piles up punctuation marks, since that would be even cruder than stacking words. This step is simple: it is like giving the string a bath, and a single regular expression does the job.
const demoText = 'High pressure washing water gun, Wash the car easily without waiting, All copper 4 branch 6 The sub high-pressure water gun can adjust the spray gun joint suit to water the garden, High pressure washing water gun, Wash the car easily without waiting';
// Character class listing whitespace and the punctuation/symbols to strip
const specialTextReg: RegExp = /[\s·!#¥(——):;“”‘、,|《.》?、【】[\]`~!@#$%^&*()_+<>?:"{},.\/;']/gim;
const cleanText = demoText.replace(specialTextReg, '');
The output is as follows. We will call this string the “parent string”:
“ The high-pressure washing water gun can easily wash the car without waiting for all copper 4 branch 6 Sub high-pressure water gun adjustable spray gun joint set watering irrigation garden high-pressure car washing water gun one spray easy car washing without waiting ”
The text is now simpler, which makes the key phrase analysis that follows easier.
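A hand-maintained character list is easy to leave incomplete. As a hedged alternative (this is not the author's regex), Unicode property escapes can strip whitespace, punctuation, and symbols with one pattern. The helper name `stripSpecialChars` and the sample string are made up for illustration, and the sketch assumes a runtime with ES2018 regular expression support.

```typescript
// Alternative sketch: remove whitespace (\s), punctuation (\p{P}) and
// symbols (\p{S}) via Unicode property escapes, which require the `u` flag.
// `stripSpecialChars` is a hypothetical helper, not part of the article.
function stripSpecialChars(input: string): string {
  return input.replace(/[\s\p{P}\p{S}]/gu, '');
}

// Illustrative sample mixing ASCII and full-width punctuation.
const cleaned = stripSpecialChars('ab,cd。ef! 12 #$%');
console.log(cleaned);
```

Letters and digits are kept, so tokens such as “4” and “6” in the sample title survive the cleaning step.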
▐ Find the key words and sentences
First, we split the string into an array of single characters, preserving the order in which they appear in the original text (each entry below is the English gloss of one character):
['high', 'Pressure', 'wash', 'vehicle', 'water', 'gun', 'One', 'spray', 'light', 'pine', 'wash', 'vehicle', 'No', 'etc.', 'stay', 'whole', 'copper', '4', 'branch', '6', 'branch', 'high', 'Pressure', 'water', 'gun', 'can', 'transfer', 'section', 'spray', 'gun', 'Pick up', 'head', 'set', 'loading', 'Pouring', 'flowers', 'irrigation', 'Irrigation', 'garden', 'high', 'Pressure', 'wash', 'vehicle', 'water', 'gun', 'One', 'spray', 'light', 'pine', 'wash', 'vehicle', 'No', 'etc.', 'stay']
Key phrases have one very important feature: their characters occur consecutively (which may sound like stating the obvious). So how do we analyze continuity? We can put each run of consecutive characters into its own array, so that runs can be told apart, and we end up with a two-dimensional array. Building it follows three basic principles:
Characters that have not appeared before are placed in the first array (the parent array), in order of appearance
Every character is compared against the parent array; if it is already there, a new array is opened and the character is stored in it at the position it occupies in the parent array
If the next character is also a repeat, no new array is needed and the character is appended to the current array, except in the following two cases, which require a new array:
(a) The run of repeats is broken, that is, a character appears that the parent array does not contain (in which case that character is pushed to the parent array)
(b) The sequence number of the repeated character in the parent array is less than or equal to that of the previous repeated character
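The three principles can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the names `RowCell`, `buildRows`, and `parentArray` are hypothetical, columns are 0-based here, and the input is the article's 54-entry character array.

```typescript
// Hypothetical types/names for this sketch.
type RowCell = { char: string; column: number }; // column = 0-based index in the parent array

function buildRows(chars: string[]): { parentArray: string[]; rows: RowCell[][] } {
  const parentArray: string[] = [];
  const columnOf = new Map<string, number>(); // character -> index in parentArray
  const rows: RowCell[][] = [];
  let current: RowCell[] = [];
  let runOpen = false; // false right after a new character broke the run
  let lastColumn = -1; // parent index of the previous repeated character

  for (const char of chars) {
    const column = columnOf.get(char);
    if (column === undefined) {
      // Principle 1: a first occurrence goes to the parent array and breaks the run.
      columnOf.set(char, parentArray.length);
      parentArray.push(char);
      runOpen = false;
      continue;
    }
    // Principles 2 and 3: open a new row after a break, or when the parent
    // index does not increase; otherwise extend the current row.
    if (!runOpen || column <= lastColumn) {
      current = [];
      rows.push(current);
      runOpen = true;
    }
    current.push({ char, column });
    lastColumn = column;
  }
  return { parentArray, rows };
}

// The 54 entries of the article's character array.
const tokens: string[] = [
  'high', 'Pressure', 'wash', 'vehicle', 'water', 'gun', 'One', 'spray', 'light', 'pine',
  'wash', 'vehicle', 'No', 'etc.', 'stay', 'whole', 'copper', '4', 'branch', '6', 'branch',
  'high', 'Pressure', 'water', 'gun', 'can', 'transfer', 'section', 'spray', 'gun',
  'Pick up', 'head', 'set', 'loading', 'Pouring', 'flowers', 'irrigation', 'Irrigation', 'garden',
  'high', 'Pressure', 'wash', 'vehicle', 'water', 'gun', 'One', 'spray', 'light', 'pine',
  'wash', 'vehicle', 'No', 'etc.', 'stay',
];

const { parentArray, rows } = buildRows(tokens);
console.log(parentArray.length, rows.length);
for (const row of rows) {
  // Print each cell as char@sequence, using 1-based sequence numbers.
  console.log(row.map((cell) => `${cell.char}@${cell.column + 1}`).join(' '));
}
```

Keeping a character-to-column map makes each lookup constant time, so the whole pass is linear in the length of the text.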
This is hard to grasp in the abstract, so let's walk through it with diagrams. For ease of understanding, the continuous analysis process is broken into separate steps.
For the characters above, the analysis first reaches the 10th character. All ten characters are new, so they go into the parent array:
Parent array | high | Pressure | wash | vehicle | water | gun | One | spray | light | pine |
Sequence | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
Next, the analysis reaches the 12th character (“Wash the car”). Both characters are already in the parent array, so a new array is opened to store them, each at its parent-array position:
1 | | | wash | vehicle | | | | | | |
Parent array | high | Pressure | wash | vehicle | water | gun | One | spray | light | pine |
Sequence | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
Next, the analysis reaches the 20th character. From the 13th character on, characters that are not in the parent array appear again, so they are pushed to the parent array:
1 | | | wash | vehicle | | | | | | | | | | | | | | |
Parent array | high | Pressure | wash | vehicle | water | gun | One | spray | light | pine | No | etc. | stay | whole | copper | 4 | branch | 6 |
Sequence | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
The 21st character, “branch”, repeats the 17th character of the parent array. Because array 1 was interrupted by the pushes to the parent array, a new array 2 is opened:
2 | | | | | | | | | | | | | | | | | branch | |
1 | | | wash | vehicle | | | | | | | | | | | | | | |
Parent array | high | Pressure | wash | vehicle | water | gun | One | spray | light | pine | No | etc. | stay | whole | copper | 4 | branch | 6 |
Sequence | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
Next, the analysis reaches the 23rd character (“High pressure”). Its characters have sequence numbers 1 and 2 in the parent array, while “branch” from step 4 has sequence number 17. Since 1 <= 17, a new array must be opened. The result is as follows:
3 | high | Pressure | | | | | | | | | | | | | | | | |
2 | | | | | | | | | | | | | | | | | branch | |
1 | | | wash | vehicle | | | | | | | | | | | | | | |
Parent array | high | Pressure | wash | vehicle | water | gun | One | spray | light | pine | No | etc. | stay | whole | copper | 4 | branch | 6 |
Sequence | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
Next, the analysis reaches the 25th character (“Water gun”). These characters are also in the parent array, with sequence numbers 5 and 6. The condition <= 2 is not satisfied, so they are pushed onto array 3. The result is as follows:
3 | high | Pressure | | | water | gun | | | | | | | | | | | | |
2 | | | | | | | | | | | | | | | | | branch | |
1 | | | wash | vehicle | | | | | | | | | | | | | | |
Parent array | high | Pressure | wash | vehicle | water | gun | One | spray | light | pine | No | etc. | stay | whole | copper | 4 | branch | 6 |
Sequence | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
By step 6 the rules are basically clear. Continuing in the same way gives the following result:
7 | | | wash | vehicle | | | | | | | No | etc. | stay | | | | | | | | | | | | | | | | | |
6 | high | Pressure | wash | vehicle | water | gun | One | spray | light | pine | | | | | | | | | | | | | | | | | | | | |
5 | | | | | | gun | | | | | | | | | | | | | | | | | | | | | | | | |
4 | | | | | | | | spray | | | | | | | | | | | | | | | | | | | | | | |
3 | high | Pressure | | | water | gun | | | | | | | | | | | | | | | | | | | | | | | | |
2 | | | | | | | | | | | | | | | | | branch | | | | | | | | | | | | | |
1 | | | wash | vehicle | | | | | | | | | | | | | | | | | | | | | | | | | | |
Parent array | high | Pressure | wash | vehicle | water | gun | One | spray | light | pine | No | etc. | stay | whole | copper | 4 | branch | 6 | can | transfer | section | Pick up | head | set | loading | Pouring | flowers | irrigation | Irrigation | garden |
Sequence | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
From the table above, it is easy to determine how often each phrase occurs and what share of the parent string it takes up. For the program, extracting the repeated phrases from arrays 1-7 takes just two principles:
Characters in adjacent positions within the same array form one repeated phrase
An empty cell or a switch to another array ends the phrase
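The two principles amount to splitting each row wherever its parent-array positions stop being adjacent. A minimal sketch, with the hypothetical names `RowCell` and `extractPhrases`, using arrays 3 and 7 from the walkthrough (1-based sequence numbers) as input:

```typescript
// Hypothetical cell type: one repeated character plus its parent-array sequence number.
type RowCell = { char: string; column: number };

// Split every row into phrases at each gap between parent-array positions.
function extractPhrases(rows: RowCell[][]): string[][] {
  const phrases: string[][] = [];
  for (const row of rows) {
    let phrase: string[] = [];
    let previousColumn = -1;
    for (const cell of row) {
      if (phrase.length > 0 && cell.column !== previousColumn + 1) {
        phrases.push(phrase); // an empty cell in between ends the phrase
        phrase = [];
      }
      phrase.push(cell.char);
      previousColumn = cell.column;
    }
    if (phrase.length > 0) {
      phrases.push(phrase); // switching to the next array also ends the phrase
    }
  }
  return phrases;
}

// Arrays 3 and 7 from the walkthrough.
const exampleRows: RowCell[][] = [
  [
    { char: 'high', column: 1 }, { char: 'Pressure', column: 2 },
    { char: 'water', column: 5 }, { char: 'gun', column: 6 },
  ],
  [
    { char: 'wash', column: 3 }, { char: 'vehicle', column: 4 },
    { char: 'No', column: 11 }, { char: 'etc.', column: 12 }, { char: 'stay', column: 13 },
  ],
];

const extracted = extractPhrases(exampleRows).map((p) => p.join(' '));
console.log(extracted);
```

Array 3 splits into “high Pressure” and “water gun”, and array 7 into “wash vehicle” and “No etc. stay”, which correspond to the key phrases “High pressure”, “Water gun”, “Wash the car”, and “Don't wait for”.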
Applying these two principles gives the following results:
Key Words | Number of repetitions | Repeat rate |
Wash the car | 4 | 14.81% |
branch | 2 | 3.70% |
High pressure | 3 | 11.11% |
Water gun | 3 | 11.11% |
spray | 3 | 5.56% |
gun | 4 | 7.40% |
A spray of high-pressure washing water gun is easy | 2 | 37.03% |
Don't wait for | 2 | 11.11% |
Here, the repetition rate of a key phrase = length of the key phrase / length of the parent string x number of occurrences x 100%
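As a small sketch of this formula, the helpers below count occurrences and compute the rate. The names `countOccurrences` and `repetitionRate` are hypothetical, counting is assumed to be non-overlapping, and the example plugs in a 2-character phrase that occurs 4 times in the 54-character parent string.

```typescript
// Count non-overlapping occurrences of `phrase` inside `source`.
function countOccurrences(source: string, phrase: string): number {
  if (phrase.length === 0) {
    return 0;
  }
  let count = 0;
  let from = source.indexOf(phrase);
  while (from !== -1) {
    count += 1;
    from = source.indexOf(phrase, from + phrase.length);
  }
  return count;
}

// rate = phrase length / parent string length x occurrences (as a fraction of 1)
function repetitionRate(phraseLength: number, parentLength: number, occurrences: number): number {
  return (phraseLength / parentLength) * occurrences;
}

// A 2-character phrase occurring 4 times in a 54-character parent string.
const exampleRate = repetitionRate(2, 54, 4);
console.log(`${(exampleRate * 100).toFixed(2)}%`);
```

This gives 8/54, the 14.81% listed for “Wash the car” in the table.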
Get the repetition rate
We ultimately need a single overall repetition rate, derived from the values above. Simply adding them up gives 96.27%, which is obviously inappropriate. At the same time, the longer a repeated phrase is, the more it should raise the repetition rate: intuitively, it is the repetition of “A spray of high-pressure washing water gun is easy” that feels unacceptable. So we apply weights to reduce the influence of short characters, words, and phrases on the result. Here I use the simplest possible rule, where the weight depends on the length of the repeated phrase:
Character length | The weight |
1 | 0.1 |
2 | 0.4 |
3 | 0.5 |
4 | 0.5 |
>=5 | 1 |
Therefore, the results after adding weights in the above table are as follows :
Key Words | Number of repetitions | Repeat rate |
Wash the car | 4 | 5.92% |
branch | 2 | 0.37% |
High pressure | 3 | 4.44% |
Water gun | 3 | 4.44% |
spray | 3 | 0.55% |
gun | 4 | 0.74% |
A spray of high-pressure washing water gun is easy | 2 | 37.03% |
Don't wait for | 2 | 5.55% |
Repeat rate | 59.07% | |
So we finally get a repetition rate of 59.07%. With a threshold of 30%, the string above is judged to be repetitive.
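The weighted total can be reproduced with a short sketch. The names `weightFor`, `PhraseStat`, and `weightedTotal` are hypothetical, and the character lengths are an assumption inferred from the rates in the table together with the 54-character parent string (for example, 14.81% at 4 occurrences implies a 2-character phrase).

```typescript
// Weight by phrase length, following the weight table above.
function weightFor(phraseLength: number): number {
  if (phraseLength >= 5) return 1;
  if (phraseLength >= 3) return 0.5;
  if (phraseLength === 2) return 0.4;
  return 0.1;
}

type PhraseStat = { phraseLength: number; occurrences: number };

// Sum of (length / parent length x occurrences x weight) over all key phrases.
function weightedTotal(stats: PhraseStat[], parentLength: number): number {
  let sum = 0;
  for (const { phraseLength, occurrences } of stats) {
    sum += (phraseLength / parentLength) * occurrences * weightFor(phraseLength);
  }
  return sum;
}

// Lengths inferred from the table's rates (assumption); occurrences from the table.
const phraseStats: PhraseStat[] = [
  { phraseLength: 2, occurrences: 4 },  // "Wash the car"
  { phraseLength: 1, occurrences: 2 },  // "branch"
  { phraseLength: 2, occurrences: 3 },  // "High pressure"
  { phraseLength: 2, occurrences: 3 },  // "Water gun"
  { phraseLength: 1, occurrences: 3 },  // "spray"
  { phraseLength: 1, occurrences: 4 },  // "gun"
  { phraseLength: 10, occurrences: 2 }, // "A spray of high-pressure washing water gun is easy"
  { phraseLength: 3, occurrences: 2 },  // "Don't wait for"
];

const totalRate = weightedTotal(phraseStats, 54);
const isRepetitive = totalRate > 0.3; // 30% threshold from the article
console.log(`${(totalRate * 100).toFixed(2)}%`, isRepetitive);
```

With these inputs the total is 31.9/54, which formats as 59.07% and exceeds the 30% threshold.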
Special scenario analysis
▐ Keyword sentence analysis problem
In the text above, the longest repetition is “A spray of high-pressure washing water gun can easily wash the car without waiting”. In the analysis above, however, the repeated “Wash the car” forces a switch to a new array, so the phrase is split in two: “A spray of high-pressure washing water gun is easy” and “Don't wait for”. This is not reasonable, so step 7 above needs to be adjusted.
Judging continuity only by array switches and empty cells is therefore not enough. Each character also needs a subscript, namely its index in the parent string, and adjacent key phrases are merged whenever their subscripts are consecutive.
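One way to read the subscript rule is that every maximal run of already-seen characters, adjacent in the parent string, becomes a single key phrase. The sketch below takes that reading as an assumption; `RepeatedRun` and `repeatedRuns` are hypothetical names, `start` is a 0-based index, and the input is the article's 54-entry character array.

```typescript
// Hypothetical result type: where a run starts in the parent string and what it contains.
type RepeatedRun = { start: number; tokens: string[] };

// Collect maximal runs of characters that have all been seen earlier in the text.
function repeatedRuns(chars: string[]): RepeatedRun[] {
  const seen = new Set<string>();
  const runs: RepeatedRun[] = [];
  let current: RepeatedRun = { start: 0, tokens: [] };
  let runOpen = false;

  chars.forEach((char, index) => {
    if (seen.has(char)) {
      if (!runOpen) {
        current = { start: index, tokens: [] };
        runs.push(current);
        runOpen = true;
      }
      current.tokens.push(char); // consecutive subscripts: keep merging
    } else {
      seen.add(char);
      runOpen = false; // a new character interrupts the run
    }
  });
  return runs;
}

// The 54 entries of the article's character array.
const tokens: string[] = [
  'high', 'Pressure', 'wash', 'vehicle', 'water', 'gun', 'One', 'spray', 'light', 'pine',
  'wash', 'vehicle', 'No', 'etc.', 'stay', 'whole', 'copper', '4', 'branch', '6', 'branch',
  'high', 'Pressure', 'water', 'gun', 'can', 'transfer', 'section', 'spray', 'gun',
  'Pick up', 'head', 'set', 'loading', 'Pouring', 'flowers', 'irrigation', 'Irrigation', 'garden',
  'high', 'Pressure', 'wash', 'vehicle', 'water', 'gun', 'One', 'spray', 'light', 'pine',
  'wash', 'vehicle', 'No', 'etc.', 'stay',
];

const mergedRuns = repeatedRuns(tokens);
for (const run of mergedRuns) {
  console.log(run.start, run.tokens.join(' '));
}
```

On the sample this yields four runs of 2, 5, 2, and 15 characters, so the trailing 15-character repetition is no longer split in two.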

The key phrases then become:
Key Words |
Wash the car |
Sub high-pressure water gun |
Spray gun |
A spray of high-pressure washing water gun can easily wash the car without waiting |
Here we find that, apart from phrases 1 and 4, phrases 2 and 3 appear only once. So each phrase is compared against the parent string, and those that occur only once are filtered out. The results are as follows:
Key Words | Number of repetitions | Repeat rate |
Wash the car | 4 | 5.92% |
A spray of high-pressure washing water gun can easily wash the car without waiting | 2 | 51.72% |
Repeat rate | 57.64% | |
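The filtering step can be sketched as below. The names `countOccurrences` and `keepRepeated` are hypothetical, and the data is a stand-in in which each letter plays the role of one character: `wc` mimics “Wash the car”, `hpg` a phrase that occurs only once, and `hpwcgwcn` the long repeated phrase.

```typescript
// Count non-overlapping occurrences of `phrase` inside `source`.
function countOccurrences(source: string, phrase: string): number {
  if (phrase.length === 0) {
    return 0;
  }
  let count = 0;
  let from = source.indexOf(phrase);
  while (from !== -1) {
    count += 1;
    from = source.indexOf(phrase, from + phrase.length);
  }
  return count;
}

// Keep only candidate phrases that occur more than once in the parent string.
function keepRepeated(parentString: string, candidates: string[]): Map<string, number> {
  const kept = new Map<string, number>();
  for (const phrase of candidates) {
    const occurrences = countOccurrences(parentString, phrase);
    if (occurrences > 1) {
      kept.set(phrase, occurrences);
    }
  }
  return kept;
}

// Stand-in parent string: long phrase + filler + long phrase.
const sampleParent = 'hpwcgwcn' + 'xhpgy' + 'hpwcgwcn';
const keptPhrases = keepRepeated(sampleParent, ['wc', 'hpg', 'hpwcgwcn']);
keptPhrases.forEach((occurrences, phrase) => console.log(phrase, occurrences));
```

The once-only candidate is dropped, leaving the short phrase with 4 occurrences and the long phrase with 2, the same shape as the table above.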
▐ Keyword sentences are repeated
But we also find that the key phrase “Wash the car” appears 2 times inside “A spray of high-pressure washing water gun can easily wash the car without waiting”, which itself occurs twice. That accounts for all 4 of its occurrences, so it has been double counted, and these four occurrences should be removed here. (In fact this can be done earlier in the process; it is worth thinking about how.)
Key Words | Number of repetitions | Repeat rate |
Wash the car | 0 | 0% |
A spray of high-pressure washing water gun can easily wash the car without waiting | 2 | 51.72% |
Repeat rate | 51.72% | |
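A minimal sketch of the double-counting correction: for each phrase, subtract the occurrences already covered by longer kept phrases. The names `countOccurrences` and `removeNestedCounts` are hypothetical, the letters are stand-ins for characters, and the sketch only handles one level of nesting, clamping at zero.

```typescript
// Count non-overlapping occurrences of `phrase` inside `source`.
function countOccurrences(source: string, phrase: string): number {
  if (phrase.length === 0) {
    return 0;
  }
  let count = 0;
  let from = source.indexOf(phrase);
  while (from !== -1) {
    count += 1;
    from = source.indexOf(phrase, from + phrase.length);
  }
  return count;
}

// Subtract, for each phrase, the occurrences that sit inside longer phrases.
function removeNestedCounts(counts: Map<string, number>): Map<string, number> {
  const adjusted = new Map<string, number>();
  counts.forEach((occurrences, phrase) => {
    let covered = 0;
    counts.forEach((longerOccurrences, longer) => {
      if (longer.length > phrase.length) {
        covered += countOccurrences(longer, phrase) * longerOccurrences;
      }
    });
    adjusted.set(phrase, Math.max(0, occurrences - covered));
  });
  return adjusted;
}

// Stand-in counts: 'wc' occurs 4 times, all inside the long phrase (2 x 2).
const rawCounts = new Map<string, number>([
  ['wc', 4],
  ['hpwcgwcn', 2],
]);
const adjustedCounts = removeNestedCounts(rawCounts);
adjustedCounts.forEach((occurrences, phrase) => console.log(phrase, occurrences));
```

The short phrase drops to 0 while the long phrase keeps its 2 occurrences, mirroring the table above.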
The final result now matches our initial expectation. With the same 30% threshold, the string is still judged to be repetitive.
Summary
Admittedly, there is still plenty of room to optimize the implementation above, for example in the key phrase analysis. Its accuracy, however, has been checked against real content descriptions. If you have any ideas, feel free to leave a message with corrections.
Team introduction
We are the content front-end team of Big Taobao Technology, mainly responsible for Taobao's content business (live streaming, image and text, short video) and the construction of the content middle platform. This covers Taobao Live, “Stroll around”, “Take photos”, “There are good goods”, and other businesses. Through the platform we also support the content business of other teams in the group, including Ele.me, Hema, Youku, Xianyu, Fliggy, and others, for a total of 24 BUs and 160 business scenarios.
Content is a relatively new battlefield. The front-end team as a whole has done a good deal of exploration and applied work in technical fields such as multimedia, machine learning, players, video editing, and LowCode. You are welcome to leave a message for technical exchange.
author | TIFF
edit | Orange King
