当前位置:网站首页>[introduction to information retrieval] Chapter 3 fault tolerant retrieval
[introduction to information retrieval] Chapter 3 fault tolerant retrieval
2022-07-02 07:21:00 【lwgkzl】
The overview
This chapter mainly solves the following problems :
- According to the user's inquiry , How to find the inverted list corresponding to the words in the user's query ?
- If the user doesn't remember how to spell a word , How to realize fuzzy query ( Wildcard query )?
- If the user writes a word incorrectly , How to help him correct , So as to return the word that the user really wants to query ?
The above questions correspond to the following three sections .
3.1 Search term dictionary
Preface : In the first two chapters , We perform Boolean queries directly by default according to the word items queried by the user , Directly get his inverted list . But actually , We need to find the corresponding term in the term dictionary first , Can return the inverted table corresponding to the word item . For users query After participle , Get the word item to be queried , We must first look it up in the lexicon , Of course, the simplest method is linear search , However, the time complexity of this method O ( n ) O(n) O(n) Too high . Of course, we can sort the word dictionary according to the dictionary order , In this way, we can happily use binary search O ( l o g n ) O(logn) O(logn) 了 , But things are not so simple , If it is web Search words , The vocabulary needs to be constantly updated , And if you need to ensure that the word list is orderly , Every time the thesaurus is updated, it will also cause a huge waste of performance . So next we will introduce two solutions , Help us locate quickly in the dictionary of words !
String hash : Mengde once said :“ Why melancholy , Only hash ”, The function of string hash is to define a function , Map a string to a number that uniquely corresponds to it , Here we can map directly to the memory address , So every query term , We go through the hash function “ Shua ” You can find the memory address corresponding to the word item at a glance , The time complexity is O ( 1 ) O(1) O(1), It has nothing to do with the size of the lexicon , But too fast is not a good thing , Especially for some friends , Hash functions have the following limitations .
Limitations of string hashing :
- With the dynamic increase of word dictionary , conflict ( Multiple strings are mapped to the same number by this function ) The possibility of
- String hashes cannot handle the following *“ Prefix query ”*, It cannot be adapted to the following “ Wildcard search ”
Then people expect a perfect algorithm to solve this problem , He should be able to query quickly , Also support fast dynamic insertion . He is ------- Binary search tree :
Each leaf node of the binary search tree is a string , Non leaf nodes are used as partition nodes , Divide the string smaller than this node into the left subtree , Divide strings larger than this node into right subtrees ( See Introduction to algorithm for details ). Binary search tree can also be regarded as prefix tree of string , It can be used for prefix retrieval of strings . To make a long story short , Binary search tree can achieve dynamic increase , And the retrieval efficiency is O ( l o g n ) O(logn) O(logn), Then there are some improvements based on the simple binary search tree , among Balanced binary trees Dynamically maintain the balance of binary search tree , Ensure that the time complexity of each retrieval is O ( l o g n ) O(logn) O(logn) about ,B- Trees Expand the binary tree to multi fork , Each branch represents a string of an interval , In order to reduce the cost of maintaining balance every time of balanced binary tree . Please check the saint level magic guide book for specific algorithm introduction —《 Introduction to algorithms 》.
3.2 Wildcard query
When Xiao Ming slowly typed out in the input box heavily “ Horse * mei ” When , What he wants to know is the heroic deeds of a great hero , Unfortunately, Xiao Ming forgot his full name , I remember his name is Ma Yimei . At this time, fuzzy retrieval is needed , The middle one entered by Xiao Ming * It stands for replacing any character . This kind of query is called wildcard query .
a. Single wildcard query
If the user's query contains only one wildcard , It can be easily solved in the existing term retrieval technology , Such as “ Horse * mei ”, Then I can construct two binary search trees , A word item that stores positive (A), Another word item stored after inversion (B). Then I just need to be in the tree A Retrieved in “ Horse ” A set of words prefixed as well as In the tree B Retrieved in “ mei ” A set of words prefixed , Then take the union of the two , It's all the word items corresponding to Ma Hemei .
b. Multiple Wildcard Queries
If the user's query contains multiple wildcards , Such as “ Horse * mei * Skey ”, At this time, we need to use some other means .
Wheel index : Add special characters at the end of a single word $, And add all the forms of its rotation to the lexicon , Then construct a binary search tree based on the rotation word dictionary . The example is shown in the following figure :
What are the benefits of doing so , At this time, we can query “ Horse * mei * Skey ” 了 , First, transform it into Skey $\ Horse * mei *, Then search the binary search tree for the information containing " Skey $\ Horse " The term of ( Prefix search ), Finally, in the retrieved words , By measuring whether the retrieved words contain “ mei ” Use this keyword to filter , After screening, we get the final word list .
Defects of wheel index : Multiply the original large number of words , Because he saved all the results of the original rotation of each word item , Make the computer storage that is not rich even worse .
k-gram Indexes :
This method is more native, It's a single word k-gram Draw out the word items , Constitute the k-gram The dictionary , For example, consider words “ Ma Dongmei autosky ” Of 2-gram, It is 【‘$ Horse ’,‘ Ma Dong ’,‘ Wintersweet ’…‘ Skey ’,‘ The base $’】 To identify the start and end position , Mark him with special characters at the beginning and end $, Then the dictionary contains so many . Then according to the 2-gram Just search the corresponding words : Such as “ Horse * mei ”, It becomes Boolean search “ Horse ” A N D “ mei Horse ” AND “ mei Horse ”AND“ mei ”, Then find the corresponding word .
3.3 Spelling correction
Finally came the spelling correction , Obviously, when users query, due to input method or memory deviation and other reasons , He may lose by mistake . For example, Xiao Ming inquired about a name “ Ma Ximei ”, Obviously, there is no “ Ma Ximei ” The name of this hero . This leads to Spelling correction technology .
Principles and steps :
- When the user retrieves too little information about keywords , We can think that he entered the wrong information , So correct his spelling , And provide the search results of the corrected words as “ Spelling suggestions ”.
- For a misspelled word , We need to find the nearest word in the correct spelling .
- If there are multiple correctly spelled words closest to the wrong spelling , Choose the word with the most frequent occurrence as the final spelling result
According to the above principles and steps , We need to define a concept as the relationship between wrong words and correct words *“ distance ”* What determines .
a. Word independent correction method
This method is aimed at the user entering the wrong word , As shown above , We need to define a standard to measure the gap between the wrong words and the correct words , Here are two ways .
Edit distance : The first measure is edit distance , This one is direct leetcode Understand , link
The minimum number of operations in the topic , That is the distance between two words .
k-gram Coincidence :
ditto , Disassemble the word items into k-gram after , Compare the common of correct words and wrong words k-gram The number of , The more public , The closer the distance between the two words . For example, consider words “ Ma Dongmei ” Of 2-gram, It is 【‘$ Horse ’,‘ Ma Dong ’,‘ Wintersweet ’,‘ mei $’】, And the wrong word “ Ma Ximei ” Of 2-gram by 【‘$ Horse ’,‘ Ma Dong ’,‘ Ximei ’,‘ mei $’】, obviously , The distance between the two is relatively close . But another problem is , If the length of the word item is very long , There may be some deviation, for example, a word item is “ Ma Dongmei - Robertsberg - Ostrovsky ”, Then he and the wrong words “ Ma Ximei ” Of 2-gram Public quantity and “ Ma Dongmei ” It's the same , Obviously, from the perspective of users “ Ma Dongmei ” And “ Ma Ximei ” It's closer . To take this into account , We use an index called the jackard coefficient .
Jacobian coefficient :
among A,B They are the wrong words and the correct words k-gram Set .
b. Context sensitive correction method
This method is applicable to , The words entered by the user are all right , But the combined phrase is not what the user wants to express . If the user enters “ Excavator of flying technology ”, Of course, there are not many results after searching , But among them “ flight ”,“ Technology ” And other words are correct words . At this time, we need to specify it as “ Lanxiang Technology ”. At present, there is no particularly suitable method among the traditional methods , The more violent way is , Try to replace every word in the phrase with a word close to it , Then combine them into new phrases , Then choose the phrase with the highest frequency .
c. Speech based correction method
One method of voice hashing is ** soundex Algorithm **, Words with similar pronunciation can be grouped into the same category .
summary
problem 1. According to the user's inquiry , How to find the inverted list corresponding to the words in the user's query ?
answer 1: Binary search tree perhaps String hash is ok
problem 2. If the user doesn't remember how to spell a word , How to realize fuzzy query ( Wildcard query )?
answer 2: Use wheel index or k-gram Indexes
problem 3. If the user writes a word incorrectly , How to help him correct , So as to return the word that the user really wants to query ?
answer 3: Use algorithms such as editing distance to correct word level errors , Correct phrase level errors based on word level errors .
Knowledge map in this chapter
边栏推荐
- CAD secondary development object
- The boss said: whoever wants to use double to define the amount of goods, just pack up and go
- IDEA2020中测试PySpark的运行出错
- PM2 simple use and daemon
- ssm垃圾分类管理系统
- 【论文介绍】R-Drop: Regularized Dropout for Neural Networks
- Oracle 11g sysaux table space full processing and the difference between move and shrink
- Oracle segment advisor, how to deal with row link row migration, reduce high water level
- view的绘制机制(三)
- Get the uppercase initials of Chinese Pinyin in PHP
猜你喜欢
外币记账及重估总账余额表变化(下)
Ceaspectuss shipping company shipping artificial intelligence products, anytime, anywhere container inspection and reporting to achieve cloud yard, shipping company intelligent digital container contr
The boss said: whoever wants to use double to define the amount of goods, just pack up and go
Build FRP for intranet penetration
IDEA2020中PySpark的两表关联(字段名相同)
【模型蒸馏】TinyBERT: Distilling BERT for Natural Language Understanding
Three principles of architecture design
MapReduce concepts and cases (Shang Silicon Valley Learning Notes)
【信息检索导论】第三章 容错式检索
Agile development of software development pattern (scrum)
随机推荐
PHP uses the method of collecting to insert a value into the specified position in the array
SSM学生成绩信息管理系统
離線數倉和bi開發的實踐和思考
Sparksql data skew
Oracle EBS DataGuard setup
Yolov5 practice: teach object detection by hand
Oracle EBS database monitoring -zabbix+zabbix-agent2+orabbix
图解Kubernetes中的etcd的访问
Pyspark build temporary report error
Oracle 11.2.0.3 handles the problem of continuous growth of sysaux table space without downtime
使用Matlab实现:Jacobi、Gauss-Seidel迭代
CAD secondary development object
SSM student achievement information management system
Oracle apex Ajax process + dy verification
DNS攻击详解
【信息检索导论】第三章 容错式检索
Message queue fnd in Oracle EBS_ msg_ pub、fnd_ Application of message in pl/sql
User login function: simple but difficult
第一个快应用(quickapp)demo
view的绘制机制(一)