当前位置：网站首页>[introduction to information retrieval] Chapter 3 fault tolerant retrieval

[introduction to information retrieval] Chapter 3 fault tolerant retrieval

2022-07-02 07:21:00 【lwgkzl】

The overview

This chapter mainly solves the following problems ：

According to the user's inquiry , How to find the inverted list corresponding to the words in the user's query ？
If the user doesn't remember how to spell a word , How to realize fuzzy query （ Wildcard query ）？
If the user writes a word incorrectly , How to help him correct , So as to return the word that the user really wants to query ？

The above questions correspond to the following three sections .

3.1 Search term dictionary

Preface ： In the first two chapters , We perform Boolean queries directly by default according to the word items queried by the user , Directly get his inverted list . But actually , We need to find the corresponding term in the term dictionary first , Can return the inverted table corresponding to the word item . For users query After participle , Get the word item to be queried , We must first look it up in the lexicon , Of course, the simplest method is linear search , However, the time complexity of this method $O (n)$ Too high . Of course, we can sort the word dictionary according to the dictionary order , In this way, we can happily use binary search $O (l o g n)$ 了 , But things are not so simple , If it is web Search words , The vocabulary needs to be constantly updated , And if you need to ensure that the word list is orderly , Every time the thesaurus is updated, it will also cause a huge waste of performance . So next we will introduce two solutions , Help us locate quickly in the dictionary of words ！

String hash ： Mengde once said ：“ Why melancholy , Only hash ”, The function of string hash is to define a function , Map a string to a number that uniquely corresponds to it , Here we can map directly to the memory address , So every query term , We go through the hash function “ Shua ” You can find the memory address corresponding to the word item at a glance , The time complexity is $O (1)$ , It has nothing to do with the size of the lexicon , But too fast is not a good thing , ~~Especially for some friends~~ , Hash functions have the following limitations .

Limitations of string hashing ：

With the dynamic increase of word dictionary , conflict （ Multiple strings are mapped to the same number by this function ） The possibility of
String hashes cannot handle the following *“ Prefix query ”*, It cannot be adapted to the following “ Wildcard search ”

Then people expect a perfect algorithm to solve this problem , He should be able to query quickly , Also support fast dynamic insertion . He is ------- Binary search tree ：
Insert picture description here
Each leaf node of the binary search tree is a string , Non leaf nodes are used as partition nodes , Divide the string smaller than this node into the left subtree , Divide strings larger than this node into right subtrees （ See Introduction to algorithm for details ）. Binary search tree can also be regarded as prefix tree of string , It can be used for prefix retrieval of strings . To make a long story short , Binary search tree can achieve dynamic increase , And the retrieval efficiency is $O (l o g n)$ , Then there are some improvements based on the simple binary search tree , among Balanced binary trees Dynamically maintain the balance of binary search tree , Ensure that the time complexity of each retrieval is $O (l o g n)$ about ,B- Trees Expand the binary tree to multi fork , Each branch represents a string of an interval , In order to reduce the cost of maintaining balance every time of balanced binary tree . Please check the saint level magic guide book for specific algorithm introduction —《 Introduction to algorithms 》.

3.2 Wildcard query

When Xiao Ming slowly typed out in the input box heavily “ Horse * mei ” When , What he wants to know is the heroic deeds of a great hero , Unfortunately, Xiao Ming forgot his full name , I remember his name is Ma Yimei . At this time, fuzzy retrieval is needed , The middle one entered by Xiao Ming * It stands for replacing any character . This kind of query is called wildcard query .

a. Single wildcard query

If the user's query contains only one wildcard , It can be easily solved in the existing term retrieval technology , Such as “ Horse * mei ”, Then I can construct two binary search trees , A word item that stores positive （A）, Another word item stored after inversion （B）. Then I just need to be in the tree A Retrieved in “ Horse ” A set of words prefixed as well as In the tree B Retrieved in “ mei ” A set of words prefixed , Then take the union of the two , It's all the word items corresponding to Ma Hemei .

b. Multiple Wildcard Queries

If the user's query contains multiple wildcards , Such as “ Horse * mei * Skey ”, At this time, we need to use some other means .
Wheel index ： Add special characters at the end of a single word $, And add all the forms of its rotation to the lexicon , Then construct a binary search tree based on the rotation word dictionary . The example is shown in the following figure ：
Insert picture description here
What are the benefits of doing so , At this time, we can query “ Horse * mei * Skey ” 了 , First, transform it into Skey $\ Horse * mei *, Then search the binary search tree for the information containing " Skey $\ Horse " The term of （ Prefix search ）, Finally, in the retrieved words , By measuring whether the retrieved words contain “ mei ” Use this keyword to filter , After screening, we get the final word list .
Defects of wheel index ： Multiply the original large number of words , Because he saved all the results of the original rotation of each word item , Make the computer storage that is not rich even worse .

k-gram Indexes ：
This method is more native, It's a single word k-gram Draw out the word items , Constitute the k-gram The dictionary , For example, consider words “ Ma Dongmei autosky ” Of 2-gram, It is 【‘$ Horse ’,‘ Ma Dong ’,‘ Wintersweet ’…‘ Skey ’,‘ The base $’】 To identify the start and end position , Mark him with special characters at the beginning and end $, Then the dictionary contains so many . Then according to the 2-gram Just search the corresponding words ： Such as “ Horse * mei ”, It becomes Boolean search “ $Horse ” A N D “ mei$ ”, Then find the corresponding word .

3.3 Spelling correction

Finally came the spelling correction , Obviously, when users query, due to input method or memory deviation and other reasons , He may lose by mistake . For example, Xiao Ming inquired about a name “ Ma Ximei ”, Obviously, there is no “ Ma Ximei ” The name of this hero . This leads to Spelling correction technology .
Principles and steps ：

When the user retrieves too little information about keywords , We can think that he entered the wrong information , So correct his spelling , And provide the search results of the corrected words as “ Spelling suggestions ”.
For a misspelled word , We need to find the nearest word in the correct spelling .
If there are multiple correctly spelled words closest to the wrong spelling , Choose the word with the most frequent occurrence as the final spelling result

According to the above principles and steps , We need to define a concept as the relationship between wrong words and correct words *“ distance ”* What determines .

a. Word independent correction method

This method is aimed at the user entering the wrong word , As shown above , We need to define a standard to measure the gap between the wrong words and the correct words , Here are two ways .

Edit distance ： The first measure is edit distance , This one is direct leetcode Understand , link
Insert picture description here
The minimum number of operations in the topic , That is the distance between two words .

k-gram Coincidence ：
ditto , Disassemble the word items into k-gram after , Compare the common of correct words and wrong words k-gram The number of , The more public , The closer the distance between the two words . For example, consider words “ Ma Dongmei ” Of 2-gram, It is 【‘$ Horse ’,‘ Ma Dong ’,‘ Wintersweet ’,‘ mei $’】, And the wrong word “ Ma Ximei ” Of 2-gram by 【‘$ Horse ’,‘ Ma Dong ’,‘ Ximei ’,‘ mei $’】, obviously , The distance between the two is relatively close . But another problem is , If the length of the word item is very long , There may be some deviation, for example, a word item is “ Ma Dongmei - Robertsberg - Ostrovsky ”, Then he and the wrong words “ Ma Ximei ” Of 2-gram Public quantity and “ Ma Dongmei ” It's the same , Obviously, from the perspective of users “ Ma Dongmei ” And “ Ma Ximei ” It's closer . To take this into account , We use an index called the jackard coefficient .

Jacobian coefficient ：
Insert picture description here
among A,B They are the wrong words and the correct words k-gram Set .

b. Context sensitive correction method

This method is applicable to , The words entered by the user are all right , But the combined phrase is not what the user wants to express . If the user enters “ Excavator of flying technology ”, Of course, there are not many results after searching , But among them “ flight ”,“ Technology ” And other words are correct words . At this time, we need to specify it as “ Lanxiang Technology ”. At present, there is no particularly suitable method among the traditional methods , The more violent way is , Try to replace every word in the phrase with a word close to it , Then combine them into new phrases , Then choose the phrase with the highest frequency .

c. Speech based correction method

One method of voice hashing is ** soundex Algorithm **, Words with similar pronunciation can be grouped into the same category .

summary

problem 1. According to the user's inquiry , How to find the inverted list corresponding to the words in the user's query ？
answer 1： Binary search tree perhaps String hash is ok

problem 2. If the user doesn't remember how to spell a word , How to realize fuzzy query （ Wildcard query ）？
answer 2： Use wheel index or k-gram Indexes

problem 3. If the user writes a word incorrectly , How to help him correct , So as to return the word that the user really wants to query ？
answer 3： Use algorithms such as editing distance to correct word level errors , Correct phrase level errors based on word level errors .