当前位置：网站首页>TFIDF and BM25

TFIDF and BM25

2022-06-25 15:41:00 【xieyan0811】

TFIDF

Just to review tfidf,tf It's word frequency , That is, a word i stay article j Frequency of occurrence in . The denominator is the number of words in the passage , The denominator is a word i Number of occurrences .tf The higher it is, the more important it is , For short text matching , Each word usually appears only once ,tf The size of depends on the denominator , The length of the article .
$tf_{i,j}=\frac{n_{i,j}}{\sum_kn_{k,j}}$
idf It's reverse document frequency , Calculate the frequency of the word in all articles , here , The denominator contains the keyword i The number of articles , The numerator is the number of all articles N. use log Equivalent to the same trend , The value becomes smaller . The more the word appears , The bigger the molecule ,idf The smaller the value. , such as ：“ Of ” Often appear , So it's not a keyword . Dang Ci i stay article j It doesn't appear at all , The denominator is 0, So add... To the denominator 1.
$idf_i=log\frac{N}{df_i+1}$
tf and idf The product of is the word i In the article j Importance of .
$tfidf_{i,j}=tf_{i,j} \times idf_i$
In search , Calculate multiple keywords in the search string And article j The similarity ： Combine the tfidf Add up ：
$\sum_{i} tfidf_{i,j}$
Search before , You need to know the distribution of each word in a given article set .

BM25

BM25 Is based on TF-IDF The improved algorithm ,BM yes Best Match Acronym for best match ,25 It means the 25 Second algorithm iteration .

idf Only minor adjustments have been made in the part ：
$idf_i=log\frac{N-df_i+0.5}{df_i+0.5}$
The denominator part subtracts from all the articles that contain i The article ,0.5 For smoothing .

Next , Right again tf The following adjustments have been made ：
$\frac {(k + 1) \times tf} { k \times (1 - b + b \times \frac{L_d}{L_{avg}}) + tf}$
Here we introduce a super parameter k and b.

First look at the parentheses in the denominator ,Ld Is the length of the article ,Lavg Is the average length of all articles , When the article length is consistent with the average length , The value in brackets is 1, Equivalent to the UN multiplied coefficient ; When the article is shorter than the average length , The value in brackets is less than 1, The smaller the denominator , The greater the result of the above formula , That is to say, the shorter the article , The more important each word is , This is also consistent with intuition . in addition , The influence of length is related to b of ,b The bigger it is , The greater the impact ,b The value is 0-1 Between , When b by 0 when , The effect of length is completely disregarded ,b The general value is 0.75.

k The range used to standardize word frequency , take tf Value compressed to 0~k+1 Between , The function curve is as follows ：
$\frac{(k + 1) \times tf}{k + tf}$

Its horizontal axis is tf, The vertical axis is tfscore, Respectively for k=0,1,2,3,4 drawing . When k=0 when ,tfscore by 1, Do not consider the influence of word frequency , and k The greater the word frequency, the closer it is to the original word frequency . therefore , If the article contains only short text , Or you don't need to pay attention to how many times the word appears , You can set it to k=0.

Sometimes the word i Frequency in search text , The above formula is expanded to ：
$\sum_{t \in q} log[\frac{N-df_i+0.5}{df_i+0.5}] \times \frac{(k_1+1)tf_{td}}{k_1(1-b+b \times \frac{L_d}{L_{avg}})+tf_{td}} \times \frac{(k_2+1)tf_{tq}}{k_2+tf_{tq}}$
among td Refers to the searched text ,tq Refers to search text .

In this way, we can refine the control tf The proportion of , And the length of the article , To adapt to the search and matching tasks in different situations . Pay attention to setting parameters k and b.

Previous BM25 Algorithms are integrated in gensim in , The latest version is gone , If you want to use , It can be extracted from the old version .

原网站

版权声明
本文为[xieyan0811]所创，转载请带上原文链接，感谢
https://yzsam.com/2022/176/202206251500498436.html