
Fast detection of short text repetition rate

2022-06-10 15:11:00 Alibaba Amoy technology team official website blog


Getting straight to the point: this article describes a method for quickly measuring the repetition rate of short text. It applies to scenarios such as content publishing and product listing, where it helps reduce low-quality, keyword-stuffed copy such as: “High pressure washing water gun, Wash the car easily without waiting, All copper 4 branch 6 The sub high-pressure water gun can adjust the spray gun joint suit to water the garden, High pressure washing water gun, Wash the car easily without waiting”

The core difficulty

The hardest part of this problem is identifying the repeated keywords and phrases. Once they are known, we can compute each one's share of the total characters and its number of occurrences, and from those the repetition rate. So let's start with that step.

Analyze key words and sentences

Let's take the example above. For ease of understanding, the repeated text is marked by hand first:

“High pressure washing water gun, Wash the car easily without waiting, All copper 4 branch 6 The sub high-pressure water gun can adjust the spray gun joint suit to water the garden, High pressure washing water gun, Wash the car easily without waiting”

In the original post, repeated words and phrases are highlighted with matching background colors. The repetitions are as follows:

  1. “High pressure washing water gun, Wash the car easily without waiting” appears 2 times. This is the most obvious case of stuffing, and it is the result we ultimately want the analysis to find

  2. “Water gun” appears 3 times

  3. “spray” appears 3 times

  4. “branch” appears 2 times

  5. “gun” appears 4 times



From the above, a human reader can tell at a glance that this copy is below standard and clearly stuffed with repeated words. So how can code identify this quickly?

The manual analysis above reveals two characteristics:

  1. Single characters, short words, and punctuation marks are quite likely to repeat by chance, so they are poor evidence of repetition

  2. Long repetitions contain shorter repetitions, so double counting must be avoided, otherwise the repetition rate is inflated

  Remove special characters

Based on these characteristics, the first step I settled on is removing special characters. In a real business scenario people rarely pile up punctuation marks, since that is an even cruder trick than stacking words. This step is simple, like giving the string a bath: a single regular expression does it.

const demoText = 'High pressure washing water gun, Wash the car easily without waiting, All copper 4 branch 6 The sub high-pressure water gun can adjust the spray gun joint suit to water the garden, High pressure washing water gun, Wash the car easily without waiting';
// Matches whitespace and common punctuation marks and symbols
const specialTextReg: RegExp = /[\s·!#¥(——):;“”‘、,|《.》?、【】[\]`~!@#$%^&*()_+<>?:"{},.\/;']/g;
// Remove every matched character to obtain the cleaned string
const cleanText = demoText.replace(specialTextReg, '');

The output is shown below. We will call this string the “parent string”:

“The high-pressure washing water gun can easily wash the car without waiting for all copper 4 branch 6 Sub high-pressure water gun adjustable spray gun joint set watering irrigation garden high-pressure car washing water gun one spray easy car washing without waiting”

This simplifies the text and makes the keyword analysis that follows easier.

   Find the key words and sentences

First, we split the string into an array of single characters, preserving the order in which they appear in the original. (The sample is a Chinese product title, so each element below is the English gloss of one Chinese character.)

['high', 'Pressure', 'wash', 'vehicle', 'water', 'gun', 'One', 'spray', 'light', 'pine', 'wash', 'vehicle', 'No', 'etc.', 'stay', 'whole', 'copper', '4', 'branch', '6', 'branch', 'high', 'Pressure', 'water', 'gun', 'can', 'transfer', 'section', 'spray', 'gun', 'Pick up', 'head', 'set', 'loading', 'Pouring', 'flowers', 'irrigation', 'Irrigation', 'garden', 'high', 'Pressure', 'wash', 'vehicle', 'water', 'gun', 'One', 'spray', 'light', 'pine', 'wash', 'vehicle', 'No', 'etc.', 'stay']



Keywords and phrases share one very important feature: they occur as contiguous runs (which sounds like stating the obvious). So how do we analyze contiguity? We can place each run of consecutive repeated characters in its own array, which lets us tell the runs apart. The result is a two-dimensional array, built according to three basic rules:

  1. Characters that have not appeared before are placed in the first array (the parent array), in order of appearance

  2. Every other character is compared against the parent array. If it already exists there, a new array is opened and the character is stored in it at the same position it occupies in the parent array

  3. If the next character is also a repeat, there is normally no need for a new array; it is appended to the current one. A new array must be opened in the following two cases:

    The run of repeats is broken (a character appears that the parent array does not contain; that character is pushed to the parent array)

    The sequence number of the repeated character in the parent array is less than or equal to the sequence number of the previous repeated character
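The three rules above can be sketched in TypeScript as follows. This is a minimal sketch rather than the author's implementation: `buildRows`, `Cell`, and the variable names are hypothetical, positions are 0-based, and the input is the sample written as one English gloss per character.

```typescript
// One repeated character plus its position in the parent array (0-based).
interface Cell {
  token: string;
  parentIndex: number;
}

// Hypothetical helper: builds the parent array and the rows of repeats.
function buildRows(tokens: string[]): { parent: string[]; rows: Cell[][] } {
  const parent: string[] = [];
  const rows: Cell[][] = [];
  let current: Cell[] | null = null; // row being filled; null after a break
  let lastIndex = -1; // parent index of the previous repeated character

  for (const token of tokens) {
    const parentIndex = parent.indexOf(token);
    if (parentIndex === -1) {
      // Rule 1: a new character joins the parent array and breaks the run.
      parent.push(token);
      current = null;
      continue;
    }
    // Rules 2 and 3: open a new row after a break, or when the position
    // in the parent array does not move forward.
    if (current === null || parentIndex <= lastIndex) {
      current = [];
      rows.push(current);
    }
    current.push({ token, parentIndex });
    lastIndex = parentIndex;
  }
  return { parent, rows };
}

// The sample, one English gloss per character (assumed tokenization).
const slogan =
  "high,Pressure,wash,vehicle,water,gun,One,spray,light,pine,wash,vehicle,No,etc.,stay";
const middle =
  "whole,copper,4,branch,6,branch,high,Pressure,water,gun,can,transfer,section," +
  "spray,gun,Pick up,head,set,loading,Pouring,flowers,irrigation,Irrigation,garden";
const sampleTokens = [slogan, middle, slogan].join(",").split(",");

const built = buildRows(sampleTokens);
const rowTexts = built.rows.map((row) => row.map((cell) => cell.token).join(" "));
rowTexts.forEach((text, i) => console.log(`array ${i + 1}: ${text}`));
```

On the sample this yields seven rows and a 30-entry parent array, which is the layout the step-by-step walkthrough arrives at.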



This is hard to follow in the abstract, so let's walk through it with diagrams. For ease of understanding, the continuous analysis is broken into separate steps.

  • For the characters above, the analysis first covers the first 10 characters:

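| Sequence | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Parent array | high | Pressure | wash | vehicle | water | gun | One | spray | light | pine |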




  • Next, the analysis reaches character 12 (“Wash the car”). Both of its characters are already in the parent array, so a new array is opened to store them. The result is as follows:

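| Sequence | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Parent array | high | Pressure | wash | vehicle | water | gun | One | spray | light | pine |
| Array 1 | | | wash | vehicle | | | | | | |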




  • Next, the analysis reaches character 20. From character 13 onward, characters that are not yet in the parent array appear again, so they are pushed to the parent array:

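| Sequence | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Parent array | high | Pressure | wash | vehicle | water | gun | One | spray | light | pine | No | etc. | stay | whole | copper | 4 | branch | 6 |
| Array 1 | | | wash | vehicle | | | | | | | | | | | | | | |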


  • Next, character 21, “branch”, repeats the 17th character of the parent array. Because array 1 has already been terminated by the pushes to the parent array, a new array 2 is opened:

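| Sequence | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Parent array | high | Pressure | wash | vehicle | water | gun | One | spray | light | pine | No | etc. | stay | whole | copper | 4 | branch | 6 |
| Array 1 | | | wash | vehicle | | | | | | | | | | | | | | |
| Array 2 | | | | | | | | | | | | | | | | | branch | |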



  • Next, the analysis reaches character 23 (“High pressure”). Its two characters have sequence numbers 1 and 2 in the parent array, while “branch” from step 4 has sequence number 17. The condition <= 17 is satisfied, so a new array must be opened. The result is as follows:

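| Sequence | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Parent array | high | Pressure | wash | vehicle | water | gun | One | spray | light | pine | No | etc. | stay | whole | copper | 4 | branch | 6 |
| Array 1 | | | wash | vehicle | | | | | | | | | | | | | | |
| Array 2 | | | | | | | | | | | | | | | | | branch | |
| Array 3 | high | Pressure | | | | | | | | | | | | | | | | |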


  • Next, the analysis reaches character 25 (“Water gun”). These characters also exist in the parent array, with sequence numbers 5 and 6. The condition <= 2 is not satisfied, so they can simply be pushed to array 3. The result is as follows:

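| Sequence | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Parent array | high | Pressure | wash | vehicle | water | gun | One | spray | light | pine | No | etc. | stay | whole | copper | 4 | branch | 6 |
| Array 1 | | | wash | vehicle | | | | | | | | | | | | | | |
| Array 2 | | | | | | | | | | | | | | | | | branch | |
| Array 3 | high | Pressure | | | water | gun | | | | | | | | | | | | |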


  • By step 6 the rule is basically clear. Continuing in the same way gives the following result:

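| Sequence | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Parent array | high | Pressure | wash | vehicle | water | gun | One | spray | light | pine | No | etc. | stay | whole | copper | 4 | branch | 6 | can | transfer | section | Pick up | head | set | loading | Pouring | flowers | irrigation | Irrigation | garden |
| Array 1 | | | wash | vehicle | | | | | | | | | | | | | | | | | | | | | | | | | | |
| Array 2 | | | | | | | | | | | | | | | | | branch | | | | | | | | | | | | | |
| Array 3 | high | Pressure | | | water | gun | | | | | | | | | | | | | | | | | | | | | | | | |
| Array 4 | | | | | | | | spray | | | | | | | | | | | | | | | | | | | | | | |
| Array 5 | | | | | | gun | | | | | | | | | | | | | | | | | | | | | | | | |
| Array 6 | high | Pressure | wash | vehicle | water | gun | One | spray | light | pine | | | | | | | | | | | | | | | | | | | | |
| Array 7 | | | wash | vehicle | | | | | | | No | etc. | stay | | | | | | | | | | | | | | | | | |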

From the table above, it is easy to determine how many times each word or phrase occurs and what share of the parent string it takes up. In code, arrays 1-7 just need to be parsed for repeated words and phrases according to two principles:

  1. Consecutive characters in the same array are identified as one repeated word or phrase

  2. An empty cell or a switch to another array breaks the run
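A minimal sketch of these two principles follows. It assumes each of arrays 1-7 is stored as a list of 0-based parent-array positions (copied here from the table above); `splitRuns` and the variable names are hypothetical.

```typescript
// Split one row of parent-array positions into runs of adjacent positions.
function splitRuns(positions: number[]): number[][] {
  const runs: number[][] = [];
  let run: number[] = [];
  let previous = -2;
  for (const position of positions) {
    if (position !== previous + 1 && run.length > 0) {
      runs.push(run); // an empty cell was skipped, so the run ends here
      run = [];
    }
    run.push(position);
    previous = position;
  }
  if (run.length > 0) runs.push(run);
  return runs;
}

// Parent array and arrays 1-7 from the walkthrough (positions are 0-based).
const parentArray = (
  "high,Pressure,wash,vehicle,water,gun,One,spray,light,pine,No,etc.,stay,whole,copper,4,branch,6," +
  "can,transfer,section,Pick up,head,set,loading,Pouring,flowers,irrigation,Irrigation,garden"
).split(",");
const rowPositions: number[][] = [
  [2, 3],
  [16],
  [0, 1, 4, 5],
  [7],
  [5],
  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
  [2, 3, 10, 11, 12],
];

// Each row is handled separately, so switching rows always breaks a run.
const keywords: string[] = [];
for (const positions of rowPositions) {
  for (const run of splitRuns(positions)) {
    const keyword = run.map((p) => parentArray[p]).join(" ");
    if (keywords.indexOf(keyword) === -1) keywords.push(keyword);
  }
}
console.log(keywords);
```

Duplicates such as the second "wash vehicle" are collapsed here; counting how often each keyword occurs is a separate step.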

Applying these two principles gives the following results:

| Keyword | Number of occurrences | Repetition rate |
| --- | --- | --- |
| Wash the car | 4 | 14.81% |
| branch | 2 | 3.70% |
| High pressure | 3 | 11.11% |
| Water gun | 3 | 11.11% |
| spray | 3 | 5.56% |
| gun | 4 | 7.40% |
| A spray of high-pressure washing water gun is easy | 2 | 37.03% |
| Don't wait for | 2 | 11.11% |

Here, the repetition rate of a keyword = keyword length / parent string length × number of occurrences × 100%
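As a quick check of this formula, here is a minimal sketch. The function name `repetitionRate` and its parameters are hypothetical; the inputs come from the table above and the 54 characters of the parent string.

```typescript
// Repetition rate of one keyword, as a percentage of the parent string.
function repetitionRate(
  keywordLength: number,
  parentLength: number,
  occurrences: number
): number {
  return (keywordLength / parentLength) * occurrences * 100;
}

// "Wash the car": 2 characters, 4 occurrences, 54-character parent string.
const washRate = repetitionRate(2, 54, 4);
// "branch": 1 character, 2 occurrences.
const branchRate = repetitionRate(1, 54, 2);
console.log(washRate.toFixed(2) + "%", branchRate.toFixed(2) + "%"); // 14.81% 3.70%
```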

Get the repetition rate

We ultimately need a single overall repetition rate, derived from the values above. Simply adding the rates in the table gives 101.83%, which is obviously unreasonable. It is true that the more repeated words there are, the higher the repetition rate should be, but intuitively what we find unacceptable is the repetition of “A spray of high-pressure washing water gun is easy”. So we use weighting to reduce the influence of short repeated characters, words, and phrases on the result. Here I use the simplest possible rule: the weight depends on the length of the repeated word or phrase, as follows:

| Character length | Weight |
| --- | --- |
| 1 | 0.1 |
| 2 | 0.4 |
| 3 | 0.5 |
| 4 | 0.5 |
| >=5 | 1 |

Applying these weights to the table above gives:

| Keyword | Number of occurrences | Weighted repetition rate |
| --- | --- | --- |
| Wash the car | 4 | 5.92% |
| branch | 2 | 0.37% |
| High pressure | 3 | 4.44% |
| Water gun | 3 | 4.44% |
| spray | 3 | 0.55% |
| gun | 4 | 0.74% |
| A spray of high-pressure washing water gun is easy | 2 | 37.03% |
| Don't wait for | 2 | 5.55% |
| Total repetition rate | | 59.07% |

So the final repetition rate is 59.07%. With a threshold of 30%, the string above is judged to be repetitive.
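The weighting and the final threshold check can be sketched as below. The names `KeywordStat`, `weightFor`, and `weightedRepetitionRate` are hypothetical; the lengths, occurrence counts, weights, and the 30% threshold are taken from the tables and text above.

```typescript
interface KeywordStat {
  keyword: string;
  length: number; // keyword length in characters
  occurrences: number;
}

// Weight by keyword length, following the weight table.
function weightFor(length: number): number {
  if (length <= 1) return 0.1;
  if (length === 2) return 0.4;
  if (length <= 4) return 0.5;
  return 1;
}

// Sum of the weighted per-keyword rates, in percent.
function weightedRepetitionRate(stats: KeywordStat[], parentLength: number): number {
  let total = 0;
  for (const stat of stats) {
    const rate = (stat.length / parentLength) * stat.occurrences * 100;
    total += rate * weightFor(stat.length);
  }
  return total;
}

const stats: KeywordStat[] = [
  { keyword: "Wash the car", length: 2, occurrences: 4 },
  { keyword: "branch", length: 1, occurrences: 2 },
  { keyword: "High pressure", length: 2, occurrences: 3 },
  { keyword: "Water gun", length: 2, occurrences: 3 },
  { keyword: "spray", length: 1, occurrences: 3 },
  { keyword: "gun", length: 1, occurrences: 4 },
  { keyword: "A spray of high-pressure washing water gun is easy", length: 10, occurrences: 2 },
  { keyword: "Don't wait for", length: 3, occurrences: 2 },
];

const totalRate = weightedRepetitionRate(stats, 54);
const threshold = 30; // percent
const isRepeated = totalRate > threshold;
console.log(totalRate.toFixed(2) + "%", isRepeated); // 59.07% true
```

A lookup by length keeps the rule easy to tune; a smoother weighting curve would be a drop-in replacement for `weightFor`.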

Special scenario analysis

  Keyword sentence analysis problem

In the sample, the longest repetition is actually “A spray of high-pressure washing water gun can easily wash the car without waiting”. In the analysis above, however, “Wash the car” occurs twice within that phrase, which forces a switch to a new array, so the phrase is split into two keywords: “A spray of high-pressure washing water gun is easy” and “Don't wait for”. This is not reasonable, so step 7 above needs to be adjusted.

Judging contiguity only by array switches and empty cells is therefore not enough. We also need to record a subscript, namely the sequence number of the character in the parent string, and use it to decide whether adjacent keywords are contiguous and should be merged.
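One way to read this adjustment: if each repeated character remembers its subscript in the parent string, then repeats with adjacent subscripts belong to the same keyword even when the array changes. The sketch below is an interpretation under that assumption, not the author's code; `findRepeatedRuns` and `RepeatedRun` are hypothetical names, and subscripts are 0-based.

```typescript
interface RepeatedRun {
  start: number; // subscript of the first character in the parent string
  tokens: string[];
}

// Collect maximal runs of already-seen characters with adjacent subscripts.
function findRepeatedRuns(tokens: string[]): RepeatedRun[] {
  const seen: string[] = [];
  const runs: RepeatedRun[] = [];
  let run: RepeatedRun | null = null;
  let index = 0;
  for (const token of tokens) {
    if (seen.indexOf(token) !== -1) {
      // A repeat directly after another repeat extends the current run.
      if (run === null) {
        run = { start: index, tokens: [] };
        runs.push(run);
      }
      run.tokens.push(token);
    } else {
      seen.push(token);
      run = null; // a new character ends the current run
    }
    index++;
  }
  return runs;
}

// The sample, one English gloss per character (assumed tokenization).
const slogan =
  "high,Pressure,wash,vehicle,water,gun,One,spray,light,pine,wash,vehicle,No,etc.,stay";
const middle =
  "whole,copper,4,branch,6,branch,high,Pressure,water,gun,can,transfer,section," +
  "spray,gun,Pick up,head,set,loading,Pouring,flowers,irrigation,Irrigation,garden";
const sampleTokens = [slogan, middle, slogan].join(",").split(",");

const mergedRuns = findRepeatedRuns(sampleTokens);
mergedRuns.forEach((r) => console.log(r.start, r.tokens.join(" ")));
```

On the sample this produces four candidates, the last of which is the full 15-character slogan rather than two separate pieces.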


The keywords found after this adjustment are:

  1. Wash the car

  2. Sub high-pressure water gun

  3. Spray gun

  4. A spray of high-pressure washing water gun can easily wash the car without waiting

Here we find that, apart from keywords 1 and 4, keywords 2 and 3 appear only once. So each keyword is compared against the parent string, and those that appear only once are filtered out. The results are as follows:

| Keyword | Number of occurrences | Repetition rate |
| --- | --- | --- |
| Wash the car | 4 | 5.92% |
| A spray of high-pressure washing water gun can easily wash the car without waiting | 2 | 51.72% |
| Total repetition rate | | 57.64% |
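The filtering step can be sketched as counting how often each candidate occurs in the parent string and keeping only those that occur at least twice. `countOccurrences` and the variable names are hypothetical, and the input is again the sample written as one English gloss per character.

```typescript
// Count the positions at which `keyword` occurs as a contiguous run in `tokens`.
function countOccurrences(tokens: string[], keyword: string[]): number {
  let count = 0;
  for (let start = 0; start + keyword.length <= tokens.length; start++) {
    if (keyword.every((token, offset) => tokens[start + offset] === token)) count++;
  }
  return count;
}

// The sample, one English gloss per character (assumed tokenization).
const slogan =
  "high,Pressure,wash,vehicle,water,gun,One,spray,light,pine,wash,vehicle,No,etc.,stay";
const middle =
  "whole,copper,4,branch,6,branch,high,Pressure,water,gun,can,transfer,section," +
  "spray,gun,Pick up,head,set,loading,Pouring,flowers,irrigation,Irrigation,garden";
const sampleTokens = [slogan, middle, slogan].join(",").split(",");

// The four candidate keywords listed above.
const candidates = ["wash,vehicle", "branch,high,Pressure,water,gun", "spray,gun", slogan].map(
  (k) => k.split(",")
);

// Keep only keywords that really occur more than once.
const kept = candidates
  .map((keyword) => ({
    keyword: keyword.join(" "),
    occurrences: countOccurrences(sampleTokens, keyword),
  }))
  .filter((entry) => entry.occurrences >= 2);
console.log(kept);
```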


   Keyword sentences are repeated

But we also find that the keyword “Wash the car” appears 2 times inside “A spray of high-pressure washing water gun can easily wash the car without waiting”. Since that phrase itself occurs twice, this accounts for all 4 occurrences, so they are double counted. These four occurrences should be removed here (in fact this could be handled earlier in the process; it is worth thinking about how).

| Keyword | Number of occurrences | Repetition rate |
| --- | --- | --- |
| Wash the car | 0 | 0% |
| A spray of high-pressure washing water gun can easily wash the car without waiting | 2 | 51.72% |
| Total repetition rate | | 51.72% |

The final result now matches our initial expectation. With the same 30% threshold, the string is still judged to be repetitive.
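One way to remove the double counting is to compare occurrence spans: an occurrence of a short keyword is dropped when it lies entirely inside an occurrence of a longer keyword. The sketch below assumes the occurrences are already known as half-open, 0-based index ranges into the 54-character parent string; `Span` and `countUncovered` are hypothetical names.

```typescript
// Half-open range [start, end) of character positions in the parent string.
interface Span {
  start: number;
  end: number;
}

// Count occurrences of the short keyword that are not inside any long occurrence.
function countUncovered(shortSpans: Span[], longSpans: Span[]): number {
  return shortSpans.filter(
    (s) => !longSpans.some((l) => l.start <= s.start && s.end <= l.end)
  ).length;
}

// The 15-character slogan occurs at the start and at the end of the sample.
const longSpans: Span[] = [
  { start: 0, end: 15 },
  { start: 39, end: 54 },
];
// "Wash the car" (2 characters) occurs four times, always inside the slogan.
const washSpans: Span[] = [
  { start: 2, end: 4 },
  { start: 10, end: 12 },
  { start: 41, end: 43 },
  { start: 49, end: 51 },
];

const remaining = countUncovered(washSpans, longSpans);
console.log(remaining); // 0
```

Processing keywords from longest to shortest and applying this check at each step would handle more than two nesting levels in the same way.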

Summary

Admittedly, there is still plenty of room for optimization in the implementation above, for example in the keyword analysis step, but its accuracy has held up when checked against real content descriptions. If you have any ideas or corrections, feel free to leave a message.

Team introduction

We are the content technology front-end team of Taobao, mainly responsible for Taobao's content business (live streaming, image and text, short video) and for building the content middle platform. Our work covers businesses such as Taobao Live, Stroll around, Take photos, and There are good goods, and through the platform we support the content businesses of other teams in the group, including Ele.me, Hema, Youku, Xianyu, and Fliggy, for a total of 24 BUs and 160 business scenarios.
Content is a relatively new battlefield. The front-end team has done a good deal of exploration and applied work in multimedia, machine learning, players, video editing, LowCode, and other technical areas. You are welcome to leave a message for technical exchange.


author | TIFF

edit | Orange King


Original site

Copyright notice
This article was created by [Alibaba Amoy technology team official website blog]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/161/202206101333250159.html