Jaccard vs. cosine similarity for text: similarities and differences

2022-07-03 23:47:00 【Necther】

At work, colleagues from other teams often ask: how similar are two words? Two sentences? Two documents? In this article we discuss how Jaccard and cosine similarity differ when applied to text, and the scenarios each one suits. Before comparing the two, let's first define Jaccard similarity and cosine similarity.

(If you only want the conclusion, skip to the summary at the end of the article.)

Jaccard similarity

The definition of Jaccard similarity is simple: the size of the intersection of the two sentences' word sets divided by the size of their union. For example:

Sentence 1: AI is our friend and it has been friendly.
Sentence 2: AI and humans have always been friendly.

To compute Jaccard similarity, we first apply lemmatization, a common technique in English NLP, which replaces words that share a root with that root. In the example above, friend and friendly share a root, as do have and has. We can then draw the intersection and union of the two sentences' words, as shown in the figure:

For these two sentences, the Jaccard similarity is 5/(5+3+2) = 0.5: the intersection contains 5 words and the union contains 10.

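def get_jaccard_sim(str1, str2):
    # Split on whitespace and keep only the unique words (no lemmatization is applied here)
    a = set(str1.split())
    b = set(str2.split())
    c = a.intersection(b)
    # |A ∩ B| / |A ∪ B|, where |A ∪ B| = |A| + |B| - |A ∩ B|
    return float(len(c)) / (len(a) + len(b) - len(c))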
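The function above splits on whitespace and does not lemmatize, so on the raw sentences it will not reproduce the 0.5 computed by hand. As a minimal sketch, the version below adds a tiny hand-written lemma map (`LEMMAS`, a hypothetical stand-in for a real lemmatizer that covers only this example) and strips the trailing period:

```python
# Hypothetical lemma map covering only the two example sentences;
# a real pipeline would call a lemmatizer from an NLP library instead.
LEMMAS = {"friendly": "friend", "has": "have"}

def lemmatized_words(sentence):
    """Lowercase, drop the trailing period, split on whitespace, map words to lemmas."""
    words = sentence.lower().rstrip(".").split()
    return [LEMMAS.get(w, w) for w in words]

def jaccard_lemmatized(s1, s2):
    """Jaccard similarity of the two sentences' sets of lemmatized words."""
    a = set(lemmatized_words(s1))
    b = set(lemmatized_words(s2))
    return len(a & b) / len(a | b)

s1 = "AI is our friend and it has been friendly."
s2 = "AI and humans have always been friendly."
print(jaccard_lemmatized(s1, s2))  # 5 shared words out of 10 distinct words
```

Because both inputs are turned into sets, the second occurrence of friend in sentence 1 is discarded before anything is counted.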

Note that sentence 1 contains friend twice. This does not affect the Jaccard similarity, but it does affect the cosine similarity. Let's first recall the definition of cosine similarity.

Cosine similarity evaluates how similar two vectors are by computing the cosine of the angle between them.
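For reference, the standard definition for two n-dimensional vectors can be written as follows (the symbols A, B and n are chosen here for illustration):

```latex
\[
\cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^{2}} \; \sqrt{\sum_{i=1}^{n} B_i^{2}}}
\]
```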

Since cosine similarity is computed on vectors, we must first convert the sentence text into vectors. There are many ways to do this. The simplest is a bag-of-words model weighted by TF (term frequency) or TF-IDF (term frequency-inverse document frequency). Which conversion method is better? In practice, each has its own use cases. When we only want a rough estimate of text similarity, TF is enough. When we use text similarity to retrieve similar items (for example, computing query relevance in a search engine), TF-IDF works better.
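To make the difference between the two weightings concrete, here is a minimal pure-Python sketch. The helper names (`tf_vectors`, `tfidf_vectors`) and the three toy documents are made up for illustration, and the IDF formula is one common smoothed variant, log((1 + N) / (1 + df)) + 1; other definitions exist.

```python
import math
from collections import Counter

def tf_vectors(docs):
    """Return (vocabulary, raw term-count vectors), one vector per document."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    counts = [Counter(toks) for toks in tokenized]
    return vocab, [[c[w] for w in vocab] for c in counts]

def tfidf_vectors(docs):
    """Scale each term count by a smoothed inverse document frequency."""
    vocab, tf = tf_vectors(docs)
    n = len(docs)
    # Document frequency: in how many documents does each word occur?
    df = [sum(1 for row in tf if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log((1 + n) / (1 + d)) + 1 for d in df]
    return vocab, [[row[j] * idf[j] for j in range(len(vocab))] for row in tf]

# Toy corpus (illustrative only): "the" occurs in every document, "cat" in one.
docs = ["the cat sat", "the dog sat", "the bird flew"]
vocab, tf = tf_vectors(docs)
_, tfidf = tfidf_vectors(docs)
i_the, i_cat = vocab.index("the"), vocab.index("cat")
print(tf[0][i_the], tf[0][i_cat])        # same raw count in document 0
print(tfidf[0][i_the], tfidf[0][i_cat])  # "cat" is weighted above "the"
```

With plain TF, a word that appears in every document counts as much as a rare one; TF-IDF down-weights it, which is why it tends to suit retrieval-style ranking.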

Of course, we can also use word2vec or custom word vectors to convert sentences into vectors. Here is a brief comparison of TF/TF-IDF and word embeddings:

1. TF/TF-IDF computes a single number for each word, whereas a word embedding represents each word as a vector.

2. TF/TF-IDF tends to perform better on text classification tasks, whereas word embeddings are better suited to capturing the semantic information of the context (this may be a consequence of how word embeddings are computed).

To see how cosine similarity is computed, let's use the same example:

Sentence 1: AI is our friend and it has been friendly.
Sentence 2: AI and humans have always been friendly.

Computing the cosine similarity can be divided into the following steps:

First step

Use the bag-of-words method to compute the term frequencies. The following figure shows the word-frequency statistics.
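As a rough sketch of this step, the counts can be produced with `collections.Counter`. The `LEMMAS` map is a hypothetical, example-only substitute for a real lemmatizer:

```python
from collections import Counter

# Hypothetical lemma map covering only the two example sentences.
LEMMAS = {"friendly": "friend", "has": "have"}

def term_frequencies(sentence):
    """Bag-of-words term counts after lowercasing and mapping words to lemmas."""
    words = sentence.lower().rstrip(".").split()
    return Counter(LEMMAS.get(w, w) for w in words)

tf1 = term_frequencies("AI is our friend and it has been friendly.")
tf2 = term_frequencies("AI and humans have always been friendly.")
print(tf1["friend"], tf2["friend"])  # friend is counted twice in sentence 1, once in sentence 2
```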

Second step

The problem with term frequency is that words in longer sentences tend to have higher counts. To remove the influence of sentence length, we can normalize the vectors (for example, with the L2 norm). The procedure is: sum the squares of all the word frequencies, then take the square root. With the L2 norm, the value for sentence 1 is 3.3166 and the value for sentence 2 is 2.6458. Dividing each word's term frequency by the corresponding norm gives the following result:
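The two norms can be checked with a few lines of Python. The dictionaries below hold the lemmatized term counts of the two example sentences; the function names are illustrative.

```python
import math

# Lemmatized term counts of the two example sentences
# (friend counts twice in sentence 1 because friendly maps to friend).
tf1 = {"ai": 1, "is": 1, "our": 1, "friend": 2, "and": 1, "it": 1, "have": 1, "been": 1}
tf2 = {"ai": 1, "and": 1, "humans": 1, "have": 1, "always": 1, "been": 1, "friend": 1}

def l2_norm(tf):
    """Square root of the sum of squared term counts."""
    return math.sqrt(sum(count ** 2 for count in tf.values()))

def l2_normalize(tf):
    """Divide every term count by the vector's L2 norm."""
    norm = l2_norm(tf)
    return {word: count / norm for word, count in tf.items()}

print(round(l2_norm(tf1), 4), round(l2_norm(tf2), 4))  # 3.3166 2.6458
unit1, unit2 = l2_normalize(tf1), l2_normalize(tf2)
```

After normalization, friend in sentence 1 has weight 2/3.3166 ≈ 0.603, every other word in sentence 1 has weight ≈ 0.302, and every word in sentence 2 has weight ≈ 0.378.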

Third step

In the previous step we normalized each sentence vector to length 1, so the cosine similarity is simply their dot product: Cosine Similarity = (0.302*0.378) + (0.603*0.378) + (0.302*0.378) + (0.302*0.378) + (0.302*0.378) = 0.684
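The same number can be obtained directly from the term counts, without normalizing first, by dividing the dot product by the product of the two norms. A minimal sketch (the function name is illustrative):

```python
import math

# Lemmatized term counts of the two example sentences.
tf1 = {"ai": 1, "is": 1, "our": 1, "friend": 2, "and": 1, "it": 1, "have": 1, "been": 1}
tf2 = {"ai": 1, "and": 1, "humans": 1, "have": 1, "always": 1, "been": 1, "friend": 1}

def cosine_from_counts(tf_a, tf_b):
    """Cosine similarity of two sparse term-count vectors stored as dicts."""
    # Only words present in both sentences contribute to the dot product.
    dot = sum(count * tf_b.get(word, 0) for word, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    return dot / (norm_a * norm_b)

print(round(cosine_from_counts(tf1, tf2), 3))  # 0.684
```

Only the five shared words (AI, friend, and, have, been) appear in the sum, which is why the hand calculation has exactly five terms.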

So the cosine similarity of the two sentences is 0.684, whereas their Jaccard similarity is 0.5. The Python code for computing cosine similarity is as follows:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def get_cosine_sim(*strs):
    # Pairwise cosine similarity matrix between all input strings
    vectors = get_vectors(*strs)
    return cosine_similarity(vectors)

def get_vectors(*strs):
    # Bag-of-words term-count vectors, one row per input string
    text = list(strs)
    # The corpus is passed to fit_transform, not to the constructor
    vectorizer = CountVectorizer()
    return vectorizer.fit_transform(text).toarray()

To sum up, how do Jaccard and cosine similarity differ? Mainly in the following ways:

- Jaccard similarity uses set operations, so what matters is only the number of unique words in the two sentences, whereas the size of the vectors used for cosine similarity is determined by the dimensionality of the word-vector representation.

- What does this mean in practice? Suppose the word friend is repeated many times in sentence 1: the cosine similarity changes, but the Jaccard similarity does not. A quick calculation: if friend appears 50 times in sentence 1, the cosine similarity drops to about 0.4, while the Jaccard similarity stays at 0.5.

- Given this, when is Jaccard similarity appropriate? If the text in a business scenario contains many repeated words, and those repetitions are largely irrelevant to the task at hand, use Jaccard similarity, because repetition has no effect on it. If the repetition matters a great deal to the task, use cosine similarity.
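The effect of repetition can be checked with a short sketch that varies the count of friend in sentence 1 and recomputes both measures. The function names are illustrative, and the counts are the lemmatized term frequencies of the two example sentences:

```python
import math

def cosine(tf_a, tf_b):
    """Cosine similarity of two term-count dicts."""
    dot = sum(count * tf_b.get(word, 0) for word, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    return dot / (norm_a * norm_b)

def jaccard(tf_a, tf_b):
    """Jaccard similarity uses only the sets of words; the counts are ignored."""
    a, b = set(tf_a), set(tf_b)
    return len(a & b) / len(a | b)

base1 = {"ai": 1, "is": 1, "our": 1, "friend": 2, "and": 1, "it": 1, "have": 1, "been": 1}
tf2 = {"ai": 1, "and": 1, "humans": 1, "have": 1, "always": 1, "been": 1, "friend": 1}

for repeats in (2, 50):
    tf1 = dict(base1, friend=repeats)  # same words, different count for friend
    print(repeats, round(cosine(tf1, tf2), 3), jaccard(tf1, tf2))
```

With friend counted 50 times, sentence 1's vector points almost entirely along the friend axis, so the angle to sentence 2 widens and the cosine falls to roughly 0.4, while the word sets, and therefore the Jaccard value, are unchanged.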

Finally, here are two concrete application scenarios to think about, considering only Jaccard and cosine similarity:

1. For the product search box on JD.com or Tmall, which similarity works best?

2. For comparing texts transcribed from speech, which is better?

This article is translated, with slight changes, from: Overview of Text Similarity Metrics in Python.

Copyright notice
This article was written by [Necther]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/02/202202142043172994.html