[Introduction to Information Retrieval] Chapter 7: Computing scores in a search system
2022-07-02 07:22:00 【lwgkzl】
1. Overview
This chapter mainly addresses the following questions:
- With hundreds of billions of documents, ranking the whole collection for every query is unrealistic. How can we quickly retrieve the top-k most relevant documents?
- Besides the similarity between the query and the document, are other signals needed when ranking documents? How should these signals be combined?
- Which modules does a complete information retrieval system need to include?
- Does the vector space model support wildcard queries?
2. Efficient scoring and ranking
This section introduces several heuristics for quickly finding K documents that are highly relevant to a query. The documents found are not guaranteed to be exactly the true top-k, but the K documents returned have scores close to those of the true top-k.
a. Index elimination:
Consider only documents that contain several of the query terms, or consider only documents that contain query terms whose idf exceeds a certain threshold.
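As a minimal sketch of both filters, assuming a toy inverted index that maps each term to a set of document ids (the names `idf`, `candidate_docs`, `idf_threshold`, and `min_terms` are hypothetical and do not come from the chapter):

```python
import math

def idf(term, index, n_docs):
    """Inverse document frequency; 0.0 for terms that are not in the index."""
    df = len(index.get(term, ()))
    return math.log10(n_docs / df) if df else 0.0

def candidate_docs(query_terms, index, n_docs, idf_threshold=0.0, min_terms=1):
    """Return the doc ids that survive index elimination.

    Assumption: index maps term -> set of doc ids.
    """
    # Filter 1: drop query terms whose idf is not above the threshold.
    kept = [t for t in query_terms if idf(t, index, n_docs) > idf_threshold]
    # Filter 2: keep documents that contain at least min_terms of the kept terms.
    counts = {}
    for term in kept:
        for doc_id in index.get(term, ()):
            counts[doc_id] = counts.get(doc_id, 0) + 1
    return {doc_id for doc_id, c in counts.items() if c >= min_terms}

index = {"the": {1, 2, 3, 4}, "catcher": {2, 3}, "rye": {3}}
print(candidate_docs(["the", "catcher", "rye"], index, n_docs=4, idf_threshold=0.1))
```

Only the surviving documents are then passed to the full scoring step, which is where the saving comes from.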
b. Champion lists (winner lists)
For each term in the inverted index, precompute the t documents that are most relevant to that term. At query time, only these t highest-scoring documents are considered for each query term. That is, for a query we only need to select the top-k from the at most N * t documents associated with its terms, where N is the number of terms in the query.
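The idea can be sketched as follows, assuming postings that map each term to a dictionary of per-document weights (for example tf-idf). The function names and the simple additive scoring are illustrative assumptions, not the chapter's exact procedure:

```python
import heapq

def build_champion_lists(postings, t):
    """For each term keep only the t documents with the highest weight.

    Assumption: postings maps term -> {doc_id: weight}.
    """
    return {
        term: dict(heapq.nlargest(t, docs.items(), key=lambda kv: kv[1]))
        for term, docs in postings.items()
    }

def top_k(query_terms, champions, k):
    """Score only documents in the union of the query terms' champion lists."""
    scores = {}
    for term in query_terms:
        for doc_id, weight in champions.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + weight
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

postings = {
    "a": {1: 0.9, 2: 0.5, 3: 0.1},
    "b": {2: 0.8, 3: 0.7, 4: 0.2},
}
champions = build_champion_lists(postings, t=2)
print(top_k(["a", "b"], champions, k=2))
```

The champion lists are built once at indexing time, so the per-query work is bounded by the number of query terms times t rather than by the full postings lengths.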
c. Static document scores
Used together with champion (winner) lists: the static score of a document can serve as the basis for choosing the top t documents for each term. The static score represents the quality of the document itself, independent of the query, for example user ratings of a web page. As shown in the figure below:
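One possible way to combine the two, sketched under the assumption that the t documents for a term are chosen by the sum of the static score and the term weight (the function name and the choice of a plain sum are hypothetical):

```python
def champion_list_with_static_score(term_postings, static_score, t):
    """Pick the t documents for one term, ranked by static score + term weight.

    Assumptions: term_postings maps doc_id -> term weight for a single term,
    and static_score maps doc_id -> query-independent quality score.
    """
    ranked = sorted(
        term_postings.items(),
        key=lambda kv: static_score.get(kv[0], 0.0) + kv[1],
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:t]]

term_postings = {1: 0.2, 2: 0.6, 3: 0.5}
static_score = {1: 0.9, 2: 0.1, 3: 0.3}
print(champion_list_with_static_score(term_postings, static_score, t=2))
```

Here document 1 enters the list despite its low term weight because its static score is high.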
d. Cluster pruning
Cluster the document vectors, choosing $\sqrt{N}$ cluster centers so that each cluster is expected to contain about $\sqrt{N}$ documents, where N is the number of documents. At query time, pick the cluster center nearest to the query as the candidate cluster, then select the top-k documents closest to the query from the roughly $\sqrt{N}$ documents in that cluster.
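A minimal sketch of this procedure is shown below. It assumes the cluster centers are simply randomly sampled documents ("leaders") and that closeness is cosine similarity; the post does not say how the centers are chosen, so both are assumptions, as are the function names:

```python
import math
import random

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def build_clusters(docs, seed=0):
    """Sample about sqrt(N) leaders and attach every document to its nearest leader.

    Assumption: docs maps doc_id -> vector.
    """
    rng = random.Random(seed)
    n_leaders = max(1, round(math.sqrt(len(docs))))
    leaders = rng.sample(sorted(docs), n_leaders)
    clusters = {leader: [] for leader in leaders}
    for doc_id, vec in docs.items():
        nearest = max(leaders, key=lambda l: cosine(vec, docs[l]))
        clusters[nearest].append(doc_id)
    return clusters

def search(query_vec, docs, clusters, k):
    """Find the nearest leader, then rank only that leader's cluster."""
    leader = max(clusters, key=lambda l: cosine(query_vec, docs[l]))
    members = clusters[leader]
    return sorted(members, key=lambda d: cosine(query_vec, docs[d]), reverse=True)[:k]

docs = {1: [1.0, 0.0], 2: [0.9, 0.1], 3: [0.0, 1.0], 4: [0.1, 0.9]}
clusters = build_clusters(docs)
print(search([1.0, 0.0], docs, clusters, k=2))
```

The query is compared against about $\sqrt{N}$ leaders and then about $\sqrt{N}$ cluster members, instead of all N documents, at the cost of possibly missing relevant documents in other clusters.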
3. Components of an information retrieval system
a. Tiered index:
As shown in the figure above, the index is divided into tiers by score. A search proceeds from the top tier downward until K candidate documents have been found.
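The top-down traversal can be sketched as follows, assuming each tier is its own small inverted index mapping terms to sets of document ids, with the highest-scoring tier first (the function name and data layout are hypothetical):

```python
def tiered_search(query_terms, tiers, k):
    """Collect candidates tier by tier, stopping once at least k are found.

    Assumption: tiers is a list of indexes (best tier first),
    each mapping term -> set of doc ids.
    """
    found = []
    seen = set()
    for tier in tiers:
        for term in query_terms:
            for doc_id in sorted(tier.get(term, ())):
                if doc_id not in seen:
                    seen.add(doc_id)
                    found.append(doc_id)
        if len(found) >= k:
            break  # lower tiers are never touched
    return found

tiers = [{"car": {1}}, {"car": {2, 3}}, {"car": {4}}]
print(tiered_search(["car"], tiers, k=2))
```

The returned candidates may number more than k, since a whole tier is consumed at a time; they would then be scored and ranked as usual.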
b. Query-term proximity:
The closer together the query terms appear in a document, the higher that document's score should be. How to turn term proximity into a score is left to machine learning methods.
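One common proximity measure, used here purely as an illustrative assumption since the post does not define one, is the width of the smallest window of document tokens that contains every query term. A sliding-window sketch:

```python
def smallest_window(tokens, query_terms):
    """Width (in tokens) of the smallest window containing all query terms.

    Returns None if some query term never occurs in the document.
    """
    needed = set(query_terms)
    if not needed:
        return None
    counts = {}  # occurrences of needed terms inside the current window
    best = None
    left = 0
    for right, token in enumerate(tokens):
        if token in needed:
            counts[token] = counts.get(token, 0) + 1
        # Shrink from the left while the window still covers every term.
        while len(counts) == len(needed):
            width = right - left + 1
            if best is None or width < best:
                best = width
            left_token = tokens[left]
            if left_token in counts:
                counts[left_token] -= 1
                if counts[left_token] == 0:
                    del counts[left_token]
            left += 1
    return best

tokens = "the quality of mercy is not strained".split()
print(smallest_window(tokens, ["strained", "mercy"]))  # 4
```

A smaller window means the terms are closer together, so a scoring function would typically reward small values.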
c. Designing the scoring function:
As above, a comprehensive score for a document needs to take into account the static score, the query-document similarity, term proximity, and other factors. Depending on what the application emphasizes, hand-written rules can be used to combine them into one score, or these scores can be treated as features and fed into a machine learning model that produces the final score.
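As a sketch of the hand-written-rule option, a weighted sum of the three signals might look like this. The weights are arbitrary placeholders chosen for illustration, and all inputs are assumed to be pre-normalized to the range [0, 1]:

```python
def combined_score(static, cosine_sim, proximity, weights=(0.2, 0.6, 0.2)):
    """Linear combination of ranking signals.

    Assumptions: each signal is already scaled to [0, 1], and the
    default weights are illustrative, not tuned values.
    """
    w_static, w_cosine, w_proximity = weights
    return w_static * static + w_cosine * cosine_sim + w_proximity * proximity

print(combined_score(static=1.0, cosine_sim=0.5, proximity=0.0))
```

In the machine-learned variant, the same three values would be the feature vector and the weights would be fitted from relevance judgments rather than set by hand.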
d. Putting the system together:
On the document side: from indexing through to inexact retrieval.
On the user side: from spelling correction to retrieval, and then to scoring and ranking documents with machine learning algorithms.
4. How the vector space model supports different query operators
a. Boolean queries
The vector space model can clearly support a Boolean query on a single term. For Boolean expressions that combine several terms, however, the accumulated scores of the vector space model do not map easily onto the Boolean semantics.
b. Wildcard queries
The wildcard can be resolved first, expanding it into the query terms it may stand for. All of these terms are then used as candidate queries, and finally the results of all the queries are merged. The vector space model can therefore handle wildcard queries.
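A minimal sketch of this expand-then-merge approach, assuming the vocabulary is small enough to scan and that results are merged by summing per-document weights (both are simplifying assumptions; a real system would use a dedicated wildcard index such as a permuterm or k-gram index to do the expansion):

```python
from fnmatch import fnmatchcase

def expand_wildcard(pattern, vocabulary):
    """Return the vocabulary terms matched by a shell-style wildcard pattern."""
    return sorted(term for term in vocabulary if fnmatchcase(term, pattern))

def wildcard_scores(pattern, postings):
    """Score documents over every expanded term and merge by summing weights.

    Assumption: postings maps term -> {doc_id: weight}.
    """
    scores = {}
    for term in expand_wildcard(pattern, postings):
        for doc_id, weight in postings[term].items():
            scores[doc_id] = scores.get(doc_id, 0.0) + weight
    return scores

postings = {
    "rome": {1: 0.5},
    "roman": {2: 0.4, 1: 0.1},
    "paris": {3: 0.9},
}
print(expand_wildcard("rom*", postings))
print(wildcard_scores("rom*", postings))
```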
c. Phrase queries:
Because the vector space model does not take the relative positions of the words in a phrase into account, it is not suitable for phrase queries.
5. Summary
- With hundreds of billions of documents, ranking the whole collection for every query is unrealistic. How can we quickly retrieve the top-k most relevant documents?
Answer: Use heuristics such as champion (winner) lists and tiered indexes to find a reasonably good top-k.
- Besides the similarity between the query and the document, are other signals needed when ranking documents? How should these signals be combined?
Answer: There are many other signals, such as static document scores and term proximity scores. They can be combined into a final score by hand-written rules or by machine learning.
- Which modules does a complete information retrieval system need to include?
Answer: See section 3.d.
- Does the vector space model support wildcard queries?
Answer: Yes.
6. Mind map