当前位置:网站首页>Natural language processing - BM25 commonly used in search
Natural language processing - BM25 commonly used in search
2020-11-06 01:22:00 【Elementary school students in IT field】
BM25 Algorithm is a common formula used to score correlation degree , The idea is simple , The main thing is to calculate a query The relevance of all the words and documents in it , Then I'm adding up the scores , And the relevance score of each word is mainly affected by tf/idf Influence .
About Bim
BIM( Binary hypothesis model ) For word features , Just consider whether the word is in doc There has been , Not considering the relevant characteristics of the word itself ,BM25 stay BIM On the basis of the introduction of words in the query weight , Words in doc Weights in , And some empirical parameters , therefore BM25 In practical application, the effect is much better than BIM Model .
Concrete bm25
bm25 Algorithms are commonly used to calculate query Similar to the relevance of the article . In fact, the principle of this algorithm is very simple , Is what will need to be calculated query Divide words into w1,w2,…,wn, Then find out the relevance of each word and article , Finally, the correlation degree is accumulated , Finally, we can get the result of text similarity calculation .
First Wi It means the first one i The weight of a word , We usually use TF-IDF Algorithm to calculate the weight of words, the second term of the formula R(qi,d) That means we query query Every word and article in d The relevance of , This one involves complex operations , Let's take a slow look at . Generally speaking Wi We usually use the inverse text frequency IDF Calculation formula :
In this formula ,N Represents the total number of documents ,n(qi) Indicates the number of articles containing the word , To avoid that the denominator in the logarithm is equal to 0, We add... To the denominator 0.5, This 0.5 It's called the adjustment coefficient , So when n(qi) When I was younger IDF The more it's worth , The greater the weight of the signifier .
Here's a chestnut :“bm25” This word only appears in a few articles ,n(qi) It will be very small , that “bm25” Of IDF It's worth a lot ;“ We ”,“ yes ”,“ Of ” Such a word , Basically in every article there will be , that n(qi) It's very close to N, therefore IDF Value is very close to 0,
Then let's look at the second term in the formula R(qi,d), Now let's look at the formula for the second term :
In this formula , Generally speaking ,k1、k2 and b It's all regulatory factors ,k1=1、k2=1、b = 0.75,qfi Express qi In the query query Frequency of occurrence in ,fi Express qi In the document d Frequency of occurrence in , Because in general ,qi In the query query Only once will , So the qfi=1 and k2=1 Put it in the formula above , The latter one is equal to 1, In the end, you can get :
Let's see K, In fact, here K The value of is also an abbreviation of the formula , We put K Expand to see :
stay K In the expansion of dl Represents the length of the document ,avg(dl) Represents the average length of the document ,b Is the regulatory factor mentioned earlier , It can be seen from the formula that when the length of the article is fixed than the average length of the article , Regulatory factors b The bigger it is , The greater the weight of the article's length , On the contrary, the smaller . In regulating factors b Fixed time , When the length of the article is larger than the average length of the article , be K The bigger it is ,R(qi,d) The smaller . We put K The expansion of to bm25 Go to... In the formula :
That's all bm25 The flow of the algorithm .
Here is the implementation process :
版权声明
本文为[Elementary school students in IT field]所创,转载请带上原文链接,感谢
边栏推荐
- 嘗試從零開始構建我的商城 (二) :使用JWT保護我們的資訊保安,完善Swagger配置
- Thoughts on interview of Ali CCO project team
- 6.2 handleradapter adapter processor (in-depth analysis of SSM and project practice)
- 华为云“四个可靠”的方法论
- Leetcode's ransom letter
- Analysis of react high order components
- CCR炒币机器人:“比特币”数字货币的大佬,你不得不了解的知识
- Polkadot series (2) -- detailed explanation of mixed consensus
- PHPSHE 短信插件说明
- 加速「全民直播」洪流,如何攻克延时、卡顿、高并发难题?
猜你喜欢
hadoop 命令总结
Do not understand UML class diagram? Take a look at this edition of rural love class diagram, a learn!
Troubleshooting and summary of JVM Metaspace memory overflow
比特币一度突破14000美元,即将面临美国大选考验
EOS创始人BM: UE,UBI,URI有什么区别?
How to encapsulate distributed locks more elegantly
Not long after graduation, he earned 20000 yuan from private work!
vue-codemirror基本用法:实现搜索功能、代码折叠功能、获取编辑器值及时验证
git rebase的時候捅婁子了,怎麼辦?線上等……
如何玩转sortablejs-vuedraggable实现表单嵌套拖拽功能
随机推荐
一篇文章带你了解CSS 分页实例
PHPSHE 短信插件说明
基於MVC的RESTFul風格API實戰
教你轻松搞懂vue-codemirror的基本用法:主要实现代码编辑、验证提示、代码格式化
IPFS/Filecoin合法性:保护个人隐私不被泄露
Python3 e-learning case 4: writing web proxy
小程序入门到精通(二):了解小程序开发4个重要文件
Polkadot series (2) -- detailed explanation of mixed consensus
The choice of enterprise database is usually decided by the system architect - the newstack
html
How to get started with new HTML5 (2)
做外包真的很难,身为外包的我也无奈叹息。
Architecture article collection
Jmeter——ForEach Controller&Loop Controller
采购供应商系统是什么?采购供应商管理平台解决方案
“颜值经济”的野望:华熙生物净利率六连降,收购案遭上交所问询
Examples of unconventional aggregation
一篇文章带你了解CSS 渐变知识
The practice of the architecture of Internet public opinion system
6.6.1 localeresolver internationalization parser (1) (in-depth analysis of SSM and project practice)