
[Introduction to Information Retrieval] Chapter 7: Computing scores in a complete search system

2022-07-02 07:22:00 【lwgkzl】

1. Overview

This chapter addresses the following questions:

With hundreds of billions of documents, ranking the entire collection for every query is unrealistic. How can the k most relevant documents (the top k) be retrieved quickly?
Besides the similarity between the query and the document, are other signals needed when ranking documents? How should these signals be combined?
What modules does a complete information retrieval system need to include?
Does the vector space model support wildcard queries?

2. Efficient scoring and ranking

This section introduces heuristics for quickly finding K documents that are highly relevant to a query. The documents found are not guaranteed to be the true top K, but their scores are close to those of the true top K.

a. Index elimination:
Consider only documents that contain several of the query terms, or consider only documents that contain query terms whose idf exceeds a given threshold.
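
The two filters can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names, the `idf_threshold` and `min_terms` parameters, and the toy index are all assumptions made for the example, and idf is computed as log10(N / df).

```python
import math

def idf(term, index, n_docs):
    """Inverse document frequency; `index` maps term -> set of doc ids."""
    df = len(index.get(term, ()))
    return math.log10(n_docs / df) if df else 0.0

def candidate_docs(query_terms, index, n_docs, idf_threshold=0.0, min_terms=1):
    """Return the doc ids that survive both index-elimination heuristics."""
    # Heuristic 1: ignore query terms whose idf is not above the threshold.
    kept = [t for t in query_terms if idf(t, index, n_docs) > idf_threshold]
    # Heuristic 2: keep only docs that contain at least `min_terms` kept terms.
    counts = {}
    for t in kept:
        for d in index.get(t, ()):
            counts[d] = counts.get(d, 0) + 1
    return {d for d, c in counts.items() if c >= min_terms}

# Hypothetical toy index over 4 documents.
index = {"the": {1, 2, 3, 4}, "tiered": {1, 3}, "index": {1, 2}}
strict = candidate_docs(["the", "tiered", "index"], index, 4,
                        idf_threshold=0.1, min_terms=2)
loose = candidate_docs(["the", "tiered", "index"], index, 4,
                       idf_threshold=0.1, min_terms=1)
```

With these settings the stop-word-like term "the" (idf 0) is skipped entirely, so its long postings list is never traversed.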

b. Champion lists (winner lists)
For each term in the inverted index, precompute the t most relevant documents; this is the term's champion list. At query time, only these t highest-scoring documents are considered for each term. For a query, it is then enough to select the top k from at most N × t documents taken from the champion lists of its terms, where N is the number of terms in the query.
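
A minimal sketch of this idea, under simplifying assumptions: `postings` maps each term to a dictionary of per-document weights (for example tf-idf), and a document's score is just the sum of the weights of the query terms it contains, without cosine normalization. All names here are hypothetical.

```python
import heapq

def build_champion_lists(postings, t):
    """postings: term -> {doc_id: weight}. Keep the t highest-weight docs per term."""
    return {
        term: dict(heapq.nlargest(t, docs.items(), key=lambda kv: kv[1]))
        for term, docs in postings.items()
    }

def top_k(query_terms, champions, k):
    """Score only the docs found in the champion lists of the query terms."""
    scores = {}
    for term in query_terms:
        for doc_id, w in champions.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + w
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

# Hypothetical toy postings.
postings = {
    "cat": {1: 0.9, 2: 0.5, 3: 0.1},
    "dog": {2: 0.8, 3: 0.7, 4: 0.2},
}
champions = build_champion_lists(postings, t=2)
result = top_k(["cat", "dog"], champions, k=2)
```

Here at most 2 × 2 = 4 postings are scored for the two-term query, regardless of how long the full postings lists are. Document 4 never becomes a candidate because it is in no champion list, which is exactly the kind of miss the heuristic accepts.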

c. Static quality scores
Static scores are used together with champion lists: the static score of a document can serve as the basis for choosing the top t documents for each term. A document's static score reflects its quality, for example users' reviews of a website. As shown in the figure below:
[Figure: image not included]
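
One way to combine the two ideas is to rank each term's postings by a net score, here assumed to be the static score g(d) plus the term weight, and keep the top t. The function name, the additive form, and the toy data are assumptions for illustration only.

```python
import heapq

def champions_by_net_score(postings, static_score, t):
    """Keep, per term, the t docs with the highest g(d) + term weight.

    postings: term -> {doc_id: weight}
    static_score: doc_id -> g(d), a query-independent quality score
    """
    out = {}
    for term, docs in postings.items():
        out[term] = heapq.nlargest(
            t, docs, key=lambda d: static_score.get(d, 0.0) + docs[d]
        )
    return out

# Hypothetical data: doc 3 has a low term weight but a high static score.
postings = {"cat": {1: 0.9, 2: 0.5, 3: 0.4}}
static_score = {1: 0.0, 2: 0.2, 3: 0.9}
champs = champions_by_net_score(postings, static_score, t=2)
```

In this toy case document 3 enters the champion list ahead of document 2 because its static score outweighs its lower term weight.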
d. Cluster pruning
Cluster the document vectors and choose $\sqrt{N}$ cluster centers, so that each cluster is expected to contain about $\sqrt{N}$ documents, where N is the number of documents. At query time, pick the cluster center nearest to the query as the candidate, then select the top k documents closest to the query from the roughly $\sqrt{N}$ documents in that cluster.
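
The procedure can be sketched as below. The sketch assumes that the cluster centers are simply $\sqrt{N}$ documents sampled at random (as in the textbook's leader/follower description), that closeness is cosine similarity, and that only the single nearest cluster is searched. The function names and the `seed` parameter are hypothetical.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def build_clusters(docs, seed=0):
    """docs: doc_id -> vector. Sample ~sqrt(N) leaders, attach each doc to its nearest leader."""
    rng = random.Random(seed)
    n_leaders = max(1, round(math.sqrt(len(docs))))
    leaders = rng.sample(sorted(docs), n_leaders)
    clusters = {leader: [] for leader in leaders}
    for d, vec in docs.items():
        nearest = max(leaders, key=lambda l: cosine(vec, docs[l]))
        clusters[nearest].append(d)
    return clusters

def search(query_vec, docs, clusters, k):
    """Find the nearest leader, then rank only the docs in that leader's cluster."""
    leader = max(clusters, key=lambda l: cosine(query_vec, docs[l]))
    members = clusters[leader]
    return sorted(members, key=lambda d: cosine(query_vec, docs[d]), reverse=True)[:k]

# Hypothetical 2-dimensional document vectors.
docs = {1: [1.0, 0.0], 2: [0.9, 0.1], 3: [0.0, 1.0], 4: [0.1, 0.9]}
clusters = build_clusters(docs)
hits = search([1.0, 0.0], docs, clusters, k=2)
```

Each query costs about $\sqrt{N}$ similarity computations against the leaders plus about $\sqrt{N}$ against the members of one cluster, instead of N against the whole collection.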

3. Components of an information retrieval system

a. Tiered indexes:
[Figure: image not included]
As shown in the figure above, the index is divided into tiers by score. A search proceeds from the top tier downward until K candidate documents have been found.
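
The top-down search can be sketched as follows. This is a simplification: each tier is assumed to be its own small index mapping a term to a set of document ids, a document counts as a candidate if it contains any query term, and the search stops as soon as K candidates have been collected. The names are hypothetical.

```python
def tiered_search(query_terms, tiers, k):
    """tiers: list of indexes (term -> set of doc ids), highest-scoring tier first."""
    found = []
    seen = set()
    for tier in tiers:
        # Collect every doc in this tier that contains at least one query term.
        matches = set()
        for term in query_terms:
            matches |= tier.get(term, set())
        for d in sorted(matches - seen):
            found.append(d)
            seen.add(d)
        # Stop descending once enough candidates have been gathered.
        if len(found) >= k:
            break
    return found[:k]

# Hypothetical two-tier index.
tiers = [
    {"car": {1}, "insurance": {2}},
    {"car": {3, 4}, "insurance": {1, 3}},
]
first_two = tiered_search(["car", "insurance"], tiers, k=2)
first_three = tiered_search(["car", "insurance"], tiers, k=3)
```

With k=2 the top tier alone is enough and the second tier is never read; with k=3 the search falls through to the second tier.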

b. Query-term proximity:
The closer together the query terms appear in a document, the higher the document's score should be. How to turn this proximity into a score is a matter that requires machine learning.
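
A common raw measure of proximity is the width of the smallest window in the document that contains every query term. The sketch below computes only that width with a sliding window; it does not convert the width into a score. The function name and the whitespace tokenization are assumptions.

```python
def smallest_window(doc_tokens, query_terms):
    """Width in tokens of the shortest span containing every query term, or None."""
    needed = set(query_terms)
    if not needed:
        return None
    counts = {}   # occurrences of needed terms inside the current window
    best = None
    left = 0
    for right, tok in enumerate(doc_tokens):
        if tok in needed:
            counts[tok] = counts.get(tok, 0) + 1
        # Shrink from the left while the window still covers all terms.
        while len(counts) == len(needed):
            width = right - left + 1
            if best is None or width < best:
                best = width
            ltok = doc_tokens[left]
            if ltok in needed:
                counts[ltok] -= 1
                if counts[ltok] == 0:
                    del counts[ltok]
            left += 1
    return best

tokens = "the quality of mercy is not strained".split()
width = smallest_window(tokens, ["strained", "mercy"])
```

For this sentence the smallest window is "mercy is not strained", four tokens wide. A document that lacks one of the query terms yields `None`.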

c. Designing the scoring function:
As above, a comprehensive score for a document has to take into account its static score, the similarity between the query and the document, term proximity, and other factors. Depending on the priorities of the application, hand-written rules can combine these into a single score, or the individual scores can be treated as features and fed into a machine learning model that does the scoring.
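
The hand-written-rule option can be as simple as a weighted sum of the features. The sketch below assumes three features already scaled to [0, 1] and hand-picked weights; the feature names, the weights, and the toy documents are all hypothetical. In the machine learning variant, the weights would be learned from training data rather than set by hand.

```python
def combined_score(features, weights):
    """Weighted sum of a document's features; missing features count as 0."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())

def rank(docs, weights):
    """docs: doc_id -> feature dict. Return doc ids, best combined score first."""
    return sorted(docs, key=lambda d: combined_score(docs[d], weights), reverse=True)

# Hand-picked weights (an assumption): query similarity matters most.
weights = {"static": 0.2, "cosine": 0.6, "proximity": 0.2}
docs = {
    "d1": {"static": 0.9, "cosine": 0.2, "proximity": 0.1},
    "d2": {"static": 0.1, "cosine": 0.8, "proximity": 0.5},
}
order = rank(docs, weights)
```

Document d1 has the better static score, but d2 ranks first because the weights favor query similarity.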

d. Components of an information retrieval system:
[Figure: image not included]
On the document side: from indexing to inexact retrieval.
On the user side: from spelling correction to retrieval, and then to scoring and ranking documents with machine learning algorithms.

4. Vector space model support for query operators

a. Boolean queries
The vector space model can clearly support a Boolean query on a single term, but a Boolean expression that combines several terms is hard to evaluate with the accumulated scores of the vector space model.
b. Wildcard queries
The wildcard can be resolved first into the set of terms it may stand for. Each possible term is then used as a candidate query, and the results of all these queries are merged. The vector space model can therefore handle wildcard queries.
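
A minimal sketch of this expand-then-merge approach. It assumes the wildcard is resolved by a linear scan of the vocabulary with Python's standard `fnmatch` module (a real system would more likely use a dedicated wildcard index), and that merging means summing the per-term weights. The function names and the toy postings are hypothetical.

```python
from fnmatch import fnmatchcase

def expand_wildcard(pattern, vocabulary):
    """Return every vocabulary term matched by the wildcard pattern."""
    return sorted(t for t in vocabulary if fnmatchcase(t, pattern))

def wildcard_scores(pattern, postings):
    """postings: term -> {doc_id: weight}. Sum the weights over every expansion."""
    scores = {}
    for term in expand_wildcard(pattern, postings):
        for doc_id, w in postings[term].items():
            scores[doc_id] = scores.get(doc_id, 0.0) + w
    return scores

# Hypothetical toy postings.
postings = {
    "rome": {1: 0.5},
    "roman": {1: 0.25, 2: 0.5},
    "paris": {3: 1.0},
}
terms = expand_wildcard("rom*", postings)
scores = wildcard_scores("rom*", postings)
```

The pattern `rom*` expands to "roman" and "rome", and document 1 accumulates weight from both expansions.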
c. Phrase queries:
Because the vector space model does not record the relative positions of the words in a phrase, it is not suited to phrase queries.

5. Summary

With hundreds of billions of documents, ranking the entire collection for every query is unrealistic. How can the k most relevant documents (the top k) be retrieved quickly?
Answer: use heuristics such as champion lists and tiered indexes to find a reasonably good top k.
Besides the similarity between the query and the document, are other signals needed when ranking documents? How should these signals be combined?
Answer: there are many other signals, such as the static score of a document and the term proximity score. They can be combined into a final score with hand-written rules or with machine learning.
What modules does a complete information retrieval system need to include?
Answer: see the figure in Section 3.d.
Does the vector space model support wildcard queries?
Answer: yes.