[Introduction to Information Retrieval] Chapter 7: Computing scores in a search system
2022-07-02 07:22:00 【lwgkzl】
1. Overview
This chapter addresses the following questions:
- With hundreds of billions of documents, scoring and sorting the whole collection for every query is unrealistic. How can the top-k most relevant documents be retrieved quickly?
- Besides the similarity between the query and the document, are other signals needed when ranking documents? How should these signals be combined?
- Which modules does a complete information retrieval system need?
- Does the vector space model support wildcard queries?
2. Efficient scoring and ranking
This section introduces heuristics for quickly finding K documents that are highly relevant to a query. The documents returned are not guaranteed to be exactly the true top-k, but their scores are close to those of the true top-k.
a. Index elimination:
Consider only documents that contain several of the query terms, or consider only query terms whose idf exceeds a threshold.
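The two pruning rules above can be sketched as follows. This is a minimal illustration, not the book's implementation: the function name, the `postings` layout (term to set of document IDs), the base-10 idf, and the default thresholds are all assumptions made for the example, and the sketch applies both rules together although either can be used on its own.

```python
import math
from collections import defaultdict

def index_elimination_candidates(query_terms, postings, num_docs,
                                 idf_threshold=1.0, min_terms=2):
    """Return the doc IDs that survive both pruning heuristics.

    postings: dict mapping term -> set of doc IDs (assumed layout).
    """
    # Rule 1: drop query terms whose idf is too low (very common terms).
    kept = []
    for term in query_terms:
        df = len(postings.get(term, ()))
        if df and math.log10(num_docs / df) > idf_threshold:
            kept.append(term)

    # Rule 2: keep only documents that contain enough of the remaining terms.
    matches = defaultdict(int)
    for term in kept:
        for doc in postings[term]:
            matches[doc] += 1
    needed = min(min_terms, len(kept))
    return {doc for doc, count in matches.items() if count >= needed}
```

Only the surviving documents then need a full score computation.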
b. Champion lists
For each term in the inverted index, precompute the t documents most relevant to that term. At query time, only these t highest-scoring documents are considered for each term. For a query, the top-k therefore only has to be selected from at most N * t documents taken from the champion lists of its terms, where N is the number of terms in the query.
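A minimal sketch of the champion-list idea described above. The `weights` layout (term to `{doc_id: weight}`) and the scoring by a plain sum of term weights are assumptions made to keep the example short; a real system would typically use cosine scores.

```python
import heapq

def build_champion_lists(weights, t):
    """For each term, keep the t documents with the highest term weight."""
    return {term: heapq.nlargest(t, docs, key=docs.get)
            for term, docs in weights.items()}

def top_k_from_champions(query_terms, champions, weights, k):
    """Score only the union of the query terms' champion lists."""
    candidates = set()
    for term in query_terms:
        candidates.update(champions.get(term, []))

    def score(doc):
        # Simplified score: sum of the query terms' weights in this document.
        return sum(weights.get(term, {}).get(doc, 0.0) for term in query_terms)

    return heapq.nlargest(k, candidates, key=score)
```

The candidate set never exceeds N * t documents, which is what makes the query-time work small; the price is that a document outside every champion list can never be returned.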
c. Static quality scores
Used together with champion lists, the static score of a document can serve as the basis for choosing the top t documents of each term. The static score represents the query-independent quality of the document, for example how users have rated a web page. As shown in the figure below:
d. Cluster pruning
Cluster the document vectors and choose $\sqrt{N}$ cluster centers, where $N$ is the number of documents, so that each cluster is expected to contain about $\sqrt{N}$ documents. At query time, pick the cluster center nearest to the query as the candidate, then select the top-k documents nearest to the query from the roughly $\sqrt{N}$ documents in that cluster.
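A minimal sketch of cluster pruning under stated assumptions: instead of running a real clustering algorithm, it picks about $\sqrt{N}$ documents at random as cluster centers and attaches every document to its nearest center, and it uses cosine similarity on dense vectors. The function names and the `seed` parameter are hypothetical.

```python
import math
import random

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def build_clusters(doc_vectors, seed=0):
    """Pick about sqrt(N) random leaders and attach each doc to its nearest one."""
    rng = random.Random(seed)
    doc_ids = sorted(doc_vectors)
    num_leaders = max(1, round(math.sqrt(len(doc_ids))))
    leaders = rng.sample(doc_ids, num_leaders)
    clusters = {leader: [] for leader in leaders}
    for doc in doc_ids:
        nearest = max(leaders,
                      key=lambda l: cosine(doc_vectors[doc], doc_vectors[l]))
        clusters[nearest].append(doc)
    return clusters

def cluster_pruned_top_k(query_vec, doc_vectors, clusters, k):
    """Score only the members of the cluster whose leader is nearest the query."""
    leader = max(clusters, key=lambda l: cosine(query_vec, doc_vectors[l]))
    members = clusters[leader]
    ranked = sorted(members,
                    key=lambda d: cosine(query_vec, doc_vectors[d]),
                    reverse=True)
    return ranked[:k]
```

Each query costs roughly $\sqrt{N}$ comparisons against the centers plus about $\sqrt{N}$ against the members of one cluster, instead of $N$ comparisons against the whole collection.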
3. Components of an information retrieval system
a. Tiered indexes:
As shown in the figure above, the index is divided into tiers according to score. A search proceeds from the top tier downward until K candidate documents have been found.
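The top-down search over tiers can be sketched as follows. This is an illustration only: each tier is assumed to be a small index mapping a term to a set of document IDs, with the highest-scoring tier first, and the function returns the candidates that would then be passed on to scoring.

```python
def tiered_search(query_terms, tiers, k):
    """Collect candidates tier by tier until at least k have been found.

    tiers: list of dicts (term -> set of doc IDs), best tier first.
    """
    candidates = []
    seen = set()
    for tier in tiers:
        for term in query_terms:
            for doc in sorted(tier.get(term, ())):
                if doc not in seen:
                    seen.add(doc)
                    candidates.append(doc)
        # Stop descending as soon as this tier has supplied enough candidates.
        if len(candidates) >= k:
            break
    return candidates
```

Lower tiers are touched only when the higher ones do not yield enough candidates, so most queries never reach them.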
b. Query-term proximity:
The closer together the query terms appear in a document, the higher that document's score. How to turn this proximity into a score is a matter for machine learning.
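One common way to measure proximity is the width of the smallest window of the document that contains every query term. The sketch below computes that width with a sliding window; it only produces the raw feature, and how the feature is weighted in the final score is left open, as in the text above. The function name is hypothetical.

```python
def smallest_window(doc_tokens, query_terms):
    """Width in tokens of the smallest window containing all query terms.

    Returns None if some query term does not occur in the document.
    """
    needed = set(query_terms)
    if not needed:
        return None
    counts = {}   # query term -> occurrences inside the current window
    best = None
    left = 0
    for right, token in enumerate(doc_tokens):
        if token in needed:
            counts[token] = counts.get(token, 0) + 1
        # Shrink from the left while the window still covers every term.
        while len(counts) == len(needed):
            width = right - left + 1
            if best is None or width < best:
                best = width
            left_token = doc_tokens[left]
            if left_token in needed:
                counts[left_token] -= 1
                if counts[left_token] == 0:
                    del counts[left_token]
            left += 1
    return best
```

A smaller window means the query terms sit closer together, so a scoring function would normally reward smaller values.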
c. Designing the scoring function:
As above, a comprehensive score for a document has to take into account the static score, the similarity between the query and the document, query-term proximity, and other factors. Depending on what the application emphasizes, hand-written rules can be used to combine these into one score, or the individual scores can be treated as features and fed into a machine learning model that produces the final score.
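The hand-written-rule variant can be sketched as a weighted sum of features. The feature names and the weight values below are purely illustrative assumptions; in the machine-learned variant the weights would be fitted from training data instead of being set by hand.

```python
def combined_score(features, weights):
    """Weighted sum of a document's features; missing features count as 0."""
    return sum(weight * features.get(name, 0.0)
               for name, weight in weights.items())

def rank_documents(doc_features, weights):
    """Return doc IDs ordered from highest to lowest combined score."""
    return sorted(doc_features,
                  key=lambda doc: combined_score(doc_features[doc], weights),
                  reverse=True)

# Hypothetical hand-tuned weights for three of the signals discussed above.
example_weights = {"cosine": 0.6, "static": 0.3, "proximity": 0.1}
example_docs = {
    "A": {"cosine": 0.5, "static": 0.9, "proximity": 0.2},
    "B": {"cosine": 0.8, "static": 0.1, "proximity": 0.5},
}
ranking = rank_documents(example_docs, example_weights)
```

With these made-up numbers, document A outranks B despite its lower cosine similarity, because its static score is much higher.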
d. Putting the system together:
On the document side: from indexing to inexact retrieval.
On the user side: from spelling correction to retrieval, and then to scoring and ranking the documents with machine learning algorithms.
4. Vector space model support for query operators
a. Boolean queries
The vector space model can clearly handle a Boolean query on a single term. For Boolean expressions that combine several terms, however, it is not easy to accumulate scores with the vector space model.
b. Wildcard queries
The wildcard can be resolved first into the set of terms it may stand for. Each of these terms is then used as a candidate query, and finally the results of all the candidate queries are merged. The vector space model can therefore handle wildcard queries.
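The expand-then-merge procedure can be sketched as follows. Several things here are assumptions for the sake of a short example: the vocabulary is scanned linearly with the standard-library `fnmatch` module (a real system would use a dedicated wildcard index), `weights` maps each term to `{doc_id: weight}`, and results are merged by summing the weights of all matching terms.

```python
import fnmatch

def expand_wildcard(pattern, vocabulary):
    """Return the vocabulary terms that match a pattern such as 'rom*'."""
    return sorted(term for term in vocabulary
                  if fnmatch.fnmatchcase(term, pattern))

def wildcard_query(pattern, vocabulary, weights):
    """Run every expansion as a query term and merge the per-document scores."""
    merged = {}
    for term in expand_wildcard(pattern, vocabulary):
        for doc, weight in weights.get(term, {}).items():
            merged[doc] = merged.get(doc, 0.0) + weight
    return sorted(merged, key=merged.get, reverse=True)
```

Summing is only one possible merge rule; taking the maximum per document would avoid favoring documents that happen to match many expansions.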
c. Phrase queries:
The vector space model does not take the relative positions of the words in a phrase into account, so it is not suited to phrase queries.
5. Summary
- With hundreds of billions of documents, scoring and sorting the whole collection for every query is unrealistic. How can the top-k most relevant documents be retrieved quickly?
Answer: Use heuristics such as champion lists and tiered indexes to find a reasonably good top-k.
- Besides the similarity between the query and the document, are other signals needed when ranking documents? How should these signals be combined?
Answer: There are many other signals, such as static quality scores and query-term proximity scores. They can be combined into a final score with hand-written rules or with machine learning.
- Which modules does a complete information retrieval system need?
Answer: See the figure in 3.d.
- Does the vector space model support wildcard queries?
Answer: Yes.