当前位置：网站首页>Solr series of full-text search engines - basic principles of full-text search

Solr series of full-text search engines - basic principles of full-text search

2022-07-03 14:17:00 【Brother Xing plays with the clouds】

scene ： We used Xinhua dictionary when we were children , Mom told you to open the second 38 page , find “ Cheating father ” Where it is , How would you look it up ？ without doubt , Your eyes will come from 38 The first word of the page is scanned from beginning to end , Until I find “ Cheating father ” Two words . This search method is called Sequential scanning . For a small amount of data , Sequential scanning is sufficient . But mom told you to find out what's wrong with dad “ pit ” When the word is on which page , If you scan one by one from the first word on the first page , Then you've really been cheated . At this point you need to use Indexes . The index records “ pit ” Which page is the word on , You just need to find it in the index “ pit ” word , Then find the corresponding page number , The answer comes out . Because looking in the index “ pit ” Words are very fast , Because you know its side , So you can quickly locate the word .

So the catalog of Xinhua Dictionary （ Index table ） How is it written ？ First of all, for the Xinhua Dictionary , After removing the directory , This book is a bunch of unstructured data sets . But intelligent human beings are good at thinking and summarizing , I found that every word corresponds to a page number , such as “ pit ” The word is on the 38 page ,“ Daddy ” Word on page 90 page . So they extract this information from it , Construct a structured data . Similar to the table structure in the database ：

word    page_no
---------------
 pit         38
 Daddy         90
...       ...

This forms a complete directory （ The index library ）, It is very convenient to search . The same principle applies to full-text retrieval , It comes down to two processes ：1. Index creation （Indexing）2. Search index （Search）. So how is the index created ？ What is stored in the index ？ How to find the index when searching ？ With this series of questions, continue to look down .

Indexes

Solr/Lucene A reverse index is used , So-called Reverse index ： It is the mapping process from keywords to documents , The index that holds this mapping information is called a reverse index

On the left is the string sequence
On the right is the document of string （Document） Numbered list , Called inverted table （Posting List）

The field list and the document number list form a dictionary . Now I want to search ”lucene”, So the index directly tells us , contains ”lucene” The documents for are ：2,3,10,35,92, Instead of looking through the entire document library one by one . If you want to search for both ”lucene” Contain, ”solr” Documents , Then the corresponding two inverted tables can be obtained by intersecting ：3、10、35、92.

Index creation

Suppose you have the following two original documents ： Document a ：Students should be allowed to go out with their friends, but not allowed to drink beer. Document 2 ：My friend Jerry went to school to see his students but found them drunk which is not allowed. The creation process is roughly divided into the following steps ：

One ： Give the original document to the word segmentation component (Tokenizer) Participle component (Tokenizer) Will do the following things ( This process is called ：Tokenize), The result of processing is lexical units （Token）

Divide the document into individual words
Remove punctuation
Remove stop words (stop word)
- Stop words (Stop word) There is no specific meaning in a language , Therefore, it will not be used as a search keyword in most cases , This reduces the size of the index when it is created . Stop words in English (Stop word) Such as ：”the”、”a”、”this”, Chinese has ：” Of , have to ” etc. . Word segmentation components in different languages (Tokenizer), Have their own stop words (stop word) aggregate . After participle (Tokenizer) The result is called lexical unit (Token). In the example above , You get the following Lexical unit (Token)： "Students","allowed","go","their","friends","allowed","drink","beer","My","friend","Jerry","went","school","see","his","students","found","them","drunk","allowed"

Two ： Lexical unit (Token) To the language processing component (Linguistic Processor) Language processing components (linguistic processor) It's mainly about the word element (Token) Do some language related processing . For English , Language processing components (Linguistic Processor) In general, do the following ：

Change to lowercase (Lowercase).
Reduce words to root form , Such as ”cars” To ”car” etc. . This operation is called ：stemming.
Turn words into root forms , Such as ”drove” To ”drive” etc. . This operation is called ：lemmatization.

Language processing components (linguistic processor) The result of processing is called word (Term), The words obtained after language processing in the example (Term) as follows ：

"student","allow","go","their","friend","allow","drink","beer","my","friend","jerry","go","school","see","his","student","find","them","drink","allow".

After language processing , Search for drive when drove Can also be searched out .Stemming and lemmatization Similarities and differences ：

Similarities ：
1. Stemming and lemmatization Make words root form .
The two are different ：
1. Stemming It's using ” cut ” The way ：”cars” To ”car”,”driving” To ”drive”.
2. Lemmatization It's using ” shift ” The way ：”drove” To ”drove”,”driving” To ”drive”.
The algorithms are different ：
1. Stemming The main thing is to adopt a fixed algorithm to do this reduction , If you remove ”s”, Remove ”ing” Add ”e”, take ”ational” Turn into ”ate”, take ”tional” Turn into ”tion”.
2. Lemmatization It is mainly stored in a dictionary in a predetermined format . For example, the dictionary has ”driving” To ”drive”,”drove” To ”drive”,”am, is, are” To ”be” Mapping , When making a change , Just convert it in the way agreed in the dictionary .
3. Stemming and lemmatization Not mutually exclusive , There are intersections , Some words can achieve the same transformation in both ways .

3、 ... and ： Get the word (Term) Pass to index component (Indexer) 1. Use the words you get (Term) Create a dictionary

Term Document ID

student 1

allow 1

go 1

their 1

friend 1

allow 1

drink 1

beer 1

my 2

friend 2

jerry 2

go 2

school 2

see 2

his 2

student 2

find 2

them 2

drink 2

allow 2

2. Sort the dictionary alphabetically ：

Term Document ID

allow 1

allow 2

beer 1

drink 1

drink 2

find 2

friend 1

friend 2

go 1

go 2

his 2

jerry 2

my 2

school 2

see 2

student 1

student 2

their 1

them 2

Merge the same words (Term) The owner can only show part of the content (Posting List) Linked list

Document Frequency： Document frequency , Indicates how many documents have this word appeared (Term)
Frequency： Word frequency , Indicates the word in a document (Term) There have been several times

rehearse lines (Term) “allow” Speaking of , There are two documents containing the word (Term), word （Term) There are two in the following document linked list , The first indicates that it contains ”allow” The first document of , namely 1 Document No , In this document ,”allow” There is 2 Time , The second representation contains ”allow” The second document for , yes 2 Document No , In this document ,”allow” There is 1 Time

At this point, the index creation is completed , Search for ”drive” when ,”driving”,”drove”,”driven” Can also be found . Because in the index ,”driving”,”drove”,”driven” Will be processed into ”drive”, In search , If you enter ”driving”, The input query statement is also processed by word segmentation component and language processing component , Change to query ”drive”, So you can search for the document you want .

Search steps

Search for ”microsoft job”, The user's goal is to find a job at Microsoft , If the result is :”Microsoft does a good job at software industry…”, This deviates too far from the expectations of users . How to search reasonably and effectively , Search out the results that users want most ？ The main search steps are as follows ：

One ： Lexical analysis of query content 、 Syntax analysis 、 language processing

Lexical analysis ： Distinguish between words and keywords in the query content , such as ：english and janpan,”and” It's keywords ,”english” and ”janpan” It's a common word .
Form a tree according to the syntax rules of the query syntax

language processing , It is handled in the same way as when creating an index . such as ：leaned–>lean,driven–>drive

Two ： Search index , Get a collection of documents that conform to the syntax tree 3、 ... and ： According to the relevance of the query statement to the document , Sort results

We treat the query statement as a document , The relevance between documents （relevance） scores （scoring）, The higher the score, the more relevant , The higher the ranking . Of course, you can also artificially affect the scoring , For example, Baidu search , It doesn't necessarily rank according to the correlation .

How to judge the relevance between documents ？ A document consists of multiple （ Or a ） word （Term） form , such as ：”solr”, “toturial”, Different words may have different meanings , such as solr Just like toturial important , If a document appears 10 Time toturial, But only once solr, And another document solr There is 4 Time ,toturial once , Then the latter is probably the result of the search we want . This leads to weight （Term weight） The concept of .

The weight Indicates the importance of the word in the document , The more important words are, of course, the higher the weight , Therefore, it has greater influence in calculating document relevance . The process of getting document relevance through the weight between words is called Space vector model algorithm (Vector Space Model)

There are two main ways to influence the importance of a word in a document ：

Term Frequencey（tf）,Term Frequency of occurrence in this document ,ft Bigger means more important
Document Frequency（df）, Indicates how many documents have this Trem,df Bigger means less important Hope is the most precious thing , Everyone has something , Naturally, it is not so valuable , Only what you own means that this thing is precious , The formula of weight ：

Space vector model

The weight of words in the document is regarded as a vector

Document = {term1, term2, …… ,term N}
Document Vector = {weight1, weight2, …… ,weight N}

Think of the statement to be queried as a simple document , Also expressed as a vector ：

Query = {term1, term 2, …… , term N}
Query Vector = {weight1, weight2, …… , weight N}

Put the searched document vector and query vector into N Dimensional space , Each word means one dimension ：

The smaller the angle , The more similar , The more relevant

原网站

版权声明
本文为[Brother Xing plays with the clouds]所创，转载请带上原文链接，感谢
https://yzsam.com/2022/184/202207031352301166.html

当前位置：网站首页>Solr series of full-text search engines - basic principles of full-text search

Solr series of full-text search engines - basic principles of full-text search

Indexes

Index creation

Search steps

Space vector model

边栏推荐

猜你喜欢

随机推荐