当前位置:网站首页>[introduction to information retrieval] Chapter 1 Boolean retrieval
[introduction to information retrieval] Chapter 1 Boolean retrieval
2022-07-02 07:21:00 【lwgkzl】
The overview
The first chapter mainly introduces information retrieval , The concept of Boolean retrieval and some basic definitions of retrieval , Such as inverted index, etc .
1.1 An example of information retrieval
Information retrieval (information retrieval): From large-scale unstructured corpus ( It's usually text ) Find out the process of meeting users' information needs .
Ad hoc retrieval: This is an application scenario of information retrieval , Generally speaking, the whole information retrieval is divided into two kinds , One is Ad hoc retrieval, That is, the corpus is basically fixed , Use this corpus to meet users' arbitrary queries , The scene is IR The most common , For example, search engine , Library retrieval system , Can be seen as Ad hoc retrieval. The other is filtering retrieval, This kind of demand will not change much , But the corpus will be in iteration at any time , This demand is less , A less appropriate example is the search of sports news websites , User query Probably fixed ( ball-game star , league ), News is dynamically updated .
Indexes (index): The usual search is linear search , Such as linear scanning text , Find the desired field, etc . If there is a tag, it can mark the document where the field is located , Then you can make this tag an index , Such as the number of the Library .
Term (term): In Text Retrieval , The term will be used as the index unit . Segment a sentence , The resulting sub unit is the word item . Such as “ I'm Chinese ” After the participle [“ I ”,“ yes ”,“ Chinese ”], There are three words in the list .
file (document): In the retrieval system , The retrieved object , It could be books , Journalism , Chapter, etc . Of course, we need to assign a number to each document (01,02,03…).
Term - Document correlation matrix (term-document incidence matrix): The basic structure of simple Boolean retrieval , It is a two-dimensional matrix composed of words and documents C, Value taken by 0,1 form , C i j = 1 C_{ij}=1 Cij=1 A word item i In the document j There has been , On the contrary, it means that there is no . The example is as follows :
Boolean retrieval model (bool retrieval model): Boolean search is to search all documents containing the word according to keywords . And it supports multiple keywords “AND”,“OR”,"NOT" operation .
According to the above words - Document correlation matrix , We can do this easily .
If the search also includes “Antony” and “Brutus” Documents , Will “Antony” Corresponding document vector “110001” And “Brutus” Document vector "110100" Just do and operate
110001 & 110100 => 110000
And the first two answers obtained by operation are 1, Explain that one document and the second document are the answers to the search ( It also includes the above two words ). The same can be achieved “OR” and “NOT” operation
Inverted index (inverted index):
Why do we need inverted index : By word item - Document incidence matrix construction of the relationship between terms and documents takes up too much space , Because in real life , There are tens of thousands of words , Documents are at the level of hundreds of billions , Multiplying the two is the volume that no machine can store . Because of the word - In the document matrix , A large number of locations are 0, In fact, it can be used in graph theory ** Adjacency list ** To store , That is, for each word , Save only the document number containing the word .
What is inverted index : That is, the adjacency table with word items as keys , Each line stores an idea of the document number containing the word item .
Inverted record sheet (posting list): Use inverted index to store word items - Adjacency table of the association relationship between documents , As shown in the figure below : On the left is the word column , On the right is a linked list , Store the document number containing the word item
For more efficient retrieval , The words on the left are arranged in order according to the dictionary order , The document numbers on the right are arranged from small to large
1.2 Preliminary construction of inverted index table
It is mainly divided into four steps
- collecting data
- Word segmentation of text data , Construct lexical items
- Normalize the word items .( In English , Remove lexical tenses , etc. )
- Merge the documents corresponding to the word items , And store them in order , You can also count the frequency of vocabulary .
1.3 Handle Boolean search queries
Reverse table merge (posting merge): Because each word item corresponds to an adjacency table ( chart 1-3), Then the previous requirements of Boolean retrieval ,“AND”,“OR”,"NOT" The implementation of is not a simple and or non operation of vectors . Instead, you need to merge multiple linked lists , Find its common elements . The public elements of the linked list can be found in O ( n ) O(n) O(n) In time to complete ( Based on the premise that the document number of the same word is orderly ):
More complex operations have corresponding optimizations , I'm not going to go into details here
1.4 Expand Boolean retrieval
Boolean search can find all documents that meet the conditions , But you can't grade and sort documents , That is, all documents have the same weight , Therefore, it leads to :
Lexical proximity index (term proxlmity): Its main idea is : The two terms used to specify the query should be close to each other in the document , The proximity depends on whether the two words are in the same institutional unit ( The same paragraph , The same sentence )
This provides additional functionality for Boolean retrieval ( not only “AND”“OR”,“NOT”):
“/s”,"/p","/k" It means that two words are in the same sentence , The same paragraph , as well as K Within a word .
limitations : Use “AND” In operation , Will cause high precision, low recall The circumstances of , While using “OR” During operation , Will cause high recall, low precision The situation of , It is difficult to weigh the relationship between the two by constructing query rules .
summary
This chapter mainly introduces :
- The concept and construction of inverted index
- The concept of Boolean retrieval model
- Some basic concepts of information retrieval
Knowledge map in this chapter
边栏推荐
- php中删除指定文件夹下的内容
- 华为机试题-20190417
- Find in laravel8_ in_ Usage of set and upsert
- Oracle rman自动恢复脚本(生产数据向测试迁移)
- Oracle EBS ADI development steps
- 【模型蒸馏】TinyBERT: Distilling BERT for Natural Language Understanding
- Pyspark build temporary report error
- 中年人的认知科普
- spark sql任务性能优化(基础)
- User login function: simple but difficult
猜你喜欢
Oracle EBS ADI development steps
Practice and thinking of offline data warehouse and Bi development
SSM laboratory equipment management
spark sql任务性能优化(基础)
ORACLE 11G利用 ORDS+pljson来实现json_table 效果
Implementation of purchase, sales and inventory system with ssm+mysql
图解Kubernetes中的etcd的访问
Oracle EBs and apex integrated login and principle analysis
How to call WebService in PHP development environment?
DNS攻击详解
随机推荐
【Torch】最简洁logging使用指南
CAD secondary development object
使用Matlab实现:幂法、反幂法(原点位移)
parser.parse_args 布尔值类型将False解析为True
Only the background of famous universities and factories can programmers have a way out? Netizen: two, big factory background is OK
Network security -- intrusion detection of emergency response
实现接口 Interface Iterable<T>
ERNIE1.0 与 ERNIE2.0 论文解读
JSP intelligent community property management system
读《敏捷整洁之道:回归本源》后感
Changes in foreign currency bookkeeping and revaluation general ledger balance table (Part 2)
@Transitional step pit
外币记账及重估总账余额表变化(下)
ssm人事管理系统
ORACLE 11G SYSAUX表空间满处理及move和shrink区别
php中计算树状结构数据中的合计
SSM supermarket order management system
解决万恶的open failed: ENOENT (No such file or directory)/(Operation not permitted)
架构设计三原则
ARP攻击