Similarities and differences between Jaccard and cosine text similarity
2022-07-03 23:47:00 【Necther】
At work, colleagues from other business teams often ask: how similar are two words? How similar are two sentences? How similar are two documents? This article discusses how Jaccard and cosine similarity differ when applied to text, and the scenarios each one suits. Before comparing the two, let's first define Jaccard similarity and cosine similarity.
(If you want to go straight to the conclusion, skip to the summary at the end of the article.)
Jaccard similarity
The definition of Jaccard similarity is very simple: the size of the intersection of the two sentences' word sets divided by the size of their union. For example:
- Sentence 1: AI is our friend and it has been friendly.
- Sentence 2: AI and humans have always been friendly.
To calculate the Jaccard similarity, we first apply lemmatization, a technique commonly used in English NLP, which replaces words that share a root with that root. In the example above, friend and friendly share a root, and have and has share a root. We can then work out the intersection and union of the two sentences' words: five words are shared (AI, friend, and, have, been), three appear only in sentence 1 (is, our, it), and two appear only in sentence 2 (humans, always).
For these two sentences, the Jaccard similarity is 5/(5+3+2)=0.5: the intersection contains 5 words and the union contains 10 words.
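def get_jaccard_sim(str1, str2):
    # Split on whitespace and keep only the unique tokens of each string.
    a = set(str1.split())
    b = set(str2.split())
    # Tokens that occur in both strings.
    c = a.intersection(b)
    # |A ∩ B| / |A ∪ B|, using |A ∪ B| = |A| + |B| - |A ∩ B|.
    return float(len(c)) / (len(a) + len(b) - len(c))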
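To reproduce the 0.5 figure for the two example sentences, the words have to be lemmatized before the sets are built. The sketch below is only an illustration: the `LEMMAS` table and the `lemma_set` and `jaccard` helpers are hypothetical, hand-written stand-ins that cover just these two sentences, not a real lemmatizer.

```python
# Hypothetical lemma table covering only the two example sentences.
LEMMAS = {"has": "have", "friendly": "friend"}

def lemma_set(sentence):
    """Lowercase, drop the full stop, split on whitespace, map each token to its lemma."""
    tokens = sentence.lower().replace(".", "").split()
    return {LEMMAS.get(token, token) for token in tokens}

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)

s1 = lemma_set("AI is our friend and it has been friendly.")
s2 = lemma_set("AI and humans have always been friendly.")

print(sorted(s1 & s2))   # ['ai', 'and', 'been', 'friend', 'have']
print(jaccard(s1, s2))   # 0.5
```

In practice the lemma table would be replaced by a proper lemmatizer from an NLP library.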
Note that after lemmatization sentence 1 contains friend twice. This has no effect on the Jaccard similarity, but it does affect the cosine similarity.
Cosine similarity
Let's first recall the definition of cosine similarity. For two vectors A and B: cos(θ) = (A · B) / (‖A‖ ‖B‖).
In other words, cosine similarity evaluates how similar two vectors are by computing the cosine of the angle between them.
Since cosine similarity is computed on vectors, we must first convert the sentence text into vectors. There are many ways to do this. The simplest is a bag-of-words model weighted by TF (term frequency) or TF-IDF (term frequency-inverse document frequency). Which conversion method is better? In practice, each has its own use case. When we only want a rough estimate of text similarity, TF is enough. When we use text similarity for retrieval (such as computing query relevance in a search engine), TF-IDF works better.
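The difference between the two weightings can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the helper names and the three toy documents are made up for the example, and the formula idf = log(N / df) + 1 is just one of several common IDF variants.

```python
import math
from collections import Counter

def tf_vectors(docs):
    """Return the shared vocabulary and one raw term-count vector per document."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[w] for w in vocab] for c in counts]

def tfidf_vectors(docs):
    """Scale each count by idf = log(N / df) + 1 (an assumed, common variant)."""
    vocab, tf = tf_vectors(docs)
    n_docs = len(docs)
    idf = []
    for j in range(len(vocab)):
        df = sum(1 for row in tf if row[j] > 0)  # documents containing word j
        idf.append(math.log(n_docs / df) + 1.0)
    return vocab, [[row[j] * idf[j] for j in range(len(vocab))] for row in tf]

# Toy documents, invented for this illustration.
docs = ["the cat sat", "the dog sat", "the bird flew"]
vocab, tf = tf_vectors(docs)
_, tfidf = tfidf_vectors(docs)
the, cat = vocab.index("the"), vocab.index("cat")

print(tf[0][the], tf[0][cat])          # 1 1
print(tfidf[0][the] < tfidf[0][cat])   # True
```

With plain TF, "the" and "cat" weigh the same in the first document. With TF-IDF, "the" is down-weighted because it occurs in every document, which is why TF-IDF tends to suit retrieval.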
Of course, we can also use word2vec or custom word vectors to convert sentences into vectors. Here is a brief comparison of TF-IDF and word embeddings:
1. TF/TF-IDF computes a single number for each word, whereas a word embedding represents each word as a vector.
2. TF/TF-IDF tends to perform better in text classification tasks, whereas word embeddings are better suited to capturing semantic information from context (this probably follows from how word embeddings are computed).
To see how cosine similarity is calculated, let's use the same example:
- Sentence 1: AI is our friend and it has been friendly.
- Sentence 2: AI and humans have always been friendly.
The calculation of cosine similarity is divided into the following steps:
Step 1
Use the bag-of-words method to compute the term frequency of each word. After lemmatization, friend occurs twice in sentence 1 and every other word in either sentence occurs once.
Step 2
The problem with raw term frequency is that words in longer sentences tend to have higher counts. To remove the influence of sentence length, we can normalize the vectors (for example with the L2 norm): sum the squares of the word frequencies, then take the square root. With the L2 norm, the value for sentence 1 is 3.3166 and the value for sentence 2 is 2.6458. Dividing each word's term frequency by its sentence's norm gives 0.603 for friend and 0.302 for every other word in sentence 1, and 0.378 for every word in sentence 2.
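Steps 1 and 2 can be checked with a short sketch. It assumes the two sentences have already been lemmatized by hand into the token lists below; `l2_norm` and `normalize` are illustrative helper names.

```python
import math
from collections import Counter

# Hand-lemmatized tokens of the two example sentences (an assumption of this sketch).
s1 = ["ai", "is", "our", "friend", "and", "it", "have", "been", "friend"]
s2 = ["ai", "and", "humans", "have", "always", "been", "friend"]

def l2_norm(counts):
    """Square root of the sum of squared term frequencies."""
    return math.sqrt(sum(v * v for v in counts.values()))

def normalize(counts):
    """Divide every term frequency by the sentence's L2 norm."""
    norm = l2_norm(counts)
    return {word: v / norm for word, v in counts.items()}

tf1, tf2 = Counter(s1), Counter(s2)          # step 1: term frequencies
print(round(l2_norm(tf1), 4), round(l2_norm(tf2), 4))     # 3.3166 2.6458

unit1, unit2 = normalize(tf1), normalize(tf2)  # step 2: L2 normalization
print(round(unit1["friend"], 3), round(unit1["ai"], 3))   # 0.603 0.302
print(round(unit2["friend"], 3))                          # 0.378
```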
Step 3
In the previous step we normalized each sentence vector to length 1, so the cosine similarity is simply the dot product of the two vectors, taken over the five shared words: Cosine Similarity = (0.302 × 0.378) + (0.603 × 0.378) + (0.302 × 0.378) + (0.302 × 0.378) + (0.302 × 0.378) = 0.684
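The dot product in step 3 can be written out as a small sketch, again assuming hand-lemmatized input; `unit_vector` and `dot` are illustrative names, and vectors are stored as dictionaries so that missing words count as zero.

```python
import math
from collections import Counter

def unit_vector(tokens):
    """Term-frequency vector scaled to length 1, stored as {word: weight}."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {word: v / norm for word, v in counts.items()}

def dot(u, v):
    """Dot product; only words present in both vectors contribute."""
    return sum(u[word] * v[word] for word in u.keys() & v.keys())

# Hand-lemmatized versions of the two example sentences (an assumption of this sketch).
s1 = "ai is our friend and it have been friend".split()
s2 = "ai and humans have always been friend".split()

print(round(dot(unit_vector(s1), unit_vector(s2)), 3))  # 0.684
```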
So the cosine similarity of the two sentences is 0.684, whereas their Jaccard similarity is 0.5. Python code for computing cosine similarity is as follows:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def get_vectors(*strs):
    # Build bag-of-words term-frequency vectors, one row per input string.
    text = list(strs)
    # No positional argument here: the first parameter of CountVectorizer
    # is `input`, not the corpus.
    vectorizer = CountVectorizer()
    return vectorizer.fit_transform(text).toarray()

def get_cosine_sim(*strs):
    # Pairwise cosine-similarity matrix of all input strings.
    vectors = get_vectors(*strs)
    return cosine_similarity(vectors)
To sum up, what are the differences between Jaccard and cosine similarity? They are as follows:
- Jaccard similarity uses set operations, so each sentence is represented only by its unique words, whereas cosine similarity works on vectors whose size is determined by the dimension of the word-vector representation and whose entries record how often each word occurs.
- What does this mean in practice? Suppose the word friend is repeated many times in sentence 1: the cosine similarity changes, but the Jaccard similarity does not. As a quick calculation, if friend occurs 50 times in sentence 1, the cosine similarity drops to about 0.4, while the Jaccard similarity stays at 0.5.
- Given this, which scenarios suit Jaccard similarity? If the text in a business scenario contains many repeated words and the repetition has little to do with the task at hand, Jaccard similarity is sufficient, because repetition has no effect on it. If the repetition matters a great deal for the task, use cosine similarity.
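The effect of repetition described above can be checked numerically. The sketch below assumes hand-lemmatized token lists and interprets "repeated 50 times" as friend occurring 50 times in total in sentence 1; `cosine` and `jaccard` are illustrative helper names.

```python
import math
from collections import Counter

def cosine(tokens1, tokens2):
    """Cosine similarity of the two term-frequency vectors."""
    c1, c2 = Counter(tokens1), Counter(tokens2)
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)

def jaccard(tokens1, tokens2):
    """Jaccard similarity of the two sets of unique tokens."""
    a, b = set(tokens1), set(tokens2)
    return len(a & b) / len(a | b)

s1 = "ai is our friend and it have been friend".split()
s2 = "ai and humans have always been friend".split()
s1_repeated = s1 + ["friend"] * 48   # assumption: "friend" now occurs 50 times in total

print(round(cosine(s1, s2), 3), jaccard(s1, s2))                    # 0.684 0.5
print(round(cosine(s1_repeated, s2), 1), jaccard(s1_repeated, s2))  # 0.4 0.5
```

Only the counts change when a word is repeated, not the set of unique words, so the Jaccard value is untouched while the cosine value moves.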
Finally, here are two concrete application scenarios to think about. In each case, consider only Jaccard and cosine similarity:
1. For the product search bar of JD.com or Tmall, which similarity works best?
2. For comparing the similarity of speech-transcribed text, which is better?
This article is translated, with slight changes, from: Overview of Text Similarity Metrics in Python.