当前位置:网站首页>Solr series of full-text search engines - basic principles of full-text search
Solr series of full-text search engines - basic principles of full-text search
2022-07-03 14:17:00 【Brother Xing plays with the clouds】
scene : We used Xinhua dictionary when we were children , Mom told you to open the second 38 page , find “ Cheating father ” Where it is , How would you look it up ? without doubt , Your eyes will come from 38 The first word of the page is scanned from beginning to end , Until I find “ Cheating father ” Two words . This search method is called Sequential scanning . For a small amount of data , Sequential scanning is sufficient . But mom told you to find out what's wrong with dad “ pit ” When the word is on which page , If you scan one by one from the first word on the first page , Then you've really been cheated . At this point you need to use Indexes . The index records “ pit ” Which page is the word on , You just need to find it in the index “ pit ” word , Then find the corresponding page number , The answer comes out . Because looking in the index “ pit ” Words are very fast , Because you know its side , So you can quickly locate the word .
So the catalog of Xinhua Dictionary ( Index table ) How is it written ? First of all, for the Xinhua Dictionary , After removing the directory , This book is a bunch of unstructured data sets . But intelligent human beings are good at thinking and summarizing , I found that every word corresponds to a page number , such as “ pit ” The word is on the 38 page ,“ Daddy ” Word on page 90 page . So they extract this information from it , Construct a structured data . Similar to the table structure in the database :
word page_no
---------------
pit 38
Daddy 90
... ...
This forms a complete directory ( The index library ), It is very convenient to search . The same principle applies to full-text retrieval , It comes down to two processes :1. Index creation (Indexing)2. Search index (Search). So how is the index created ? What is stored in the index ? How to find the index when searching ? With this series of questions, continue to look down .
Indexes
Solr/Lucene A reverse index is used , So-called Reverse index : It is the mapping process from keywords to documents , The index that holds this mapping information is called a reverse index
- On the left is the string sequence
- On the right is the document of string (Document) Numbered list , Called inverted table (Posting List)
The field list and the document number list form a dictionary . Now I want to search ”lucene”, So the index directly tells us , contains ”lucene” The documents for are :2,3,10,35,92, Instead of looking through the entire document library one by one . If you want to search for both ”lucene” Contain, ”solr” Documents , Then the corresponding two inverted tables can be obtained by intersecting :3、10、35、92.
Index creation
Suppose you have the following two original documents : Document a :Students should be allowed to go out with their friends, but not allowed to drink beer. Document 2 :My friend Jerry went to school to see his students but found them drunk which is not allowed. The creation process is roughly divided into the following steps :
One : Give the original document to the word segmentation component (Tokenizer) Participle component (Tokenizer) Will do the following things ( This process is called :Tokenize), The result of processing is lexical units (Token)
- Divide the document into individual words
- Remove punctuation
- Remove stop words (stop word)
- Stop words (Stop word) There is no specific meaning in a language , Therefore, it will not be used as a search keyword in most cases , This reduces the size of the index when it is created . Stop words in English (Stop word) Such as :”the”、”a”、”this”, Chinese has :” Of , have to ” etc. . Word segmentation components in different languages (Tokenizer), Have their own stop words (stop word) aggregate . After participle (Tokenizer) The result is called lexical unit (Token). In the example above , You get the following Lexical unit (Token): "Students","allowed","go","their","friends","allowed","drink","beer","My","friend","Jerry","went","school","see","his","students","found","them","drunk","allowed"
Two : Lexical unit (Token) To the language processing component (Linguistic Processor) Language processing components (linguistic processor) It's mainly about the word element (Token) Do some language related processing . For English , Language processing components (Linguistic Processor) In general, do the following :
- Change to lowercase (Lowercase).
- Reduce words to root form , Such as ”cars” To ”car” etc. . This operation is called :stemming.
- Turn words into root forms , Such as ”drove” To ”drive” etc. . This operation is called :lemmatization.
Language processing components (linguistic processor) The result of processing is called word (Term), The words obtained after language processing in the example (Term) as follows :
"student","allow","go","their","friend","allow","drink","beer","my","friend","jerry","go","school","see","his","student","find","them","drink","allow".
After language processing , Search for drive when drove Can also be searched out .Stemming and lemmatization Similarities and differences :
- Similarities :
- Stemming and lemmatization Make words root form .
- The two are different :
- Stemming It's using ” cut ” The way :”cars” To ”car”,”driving” To ”drive”.
- Lemmatization It's using ” shift ” The way :”drove” To ”drove”,”driving” To ”drive”.
- The algorithms are different :
- Stemming The main thing is to adopt a fixed algorithm to do this reduction , If you remove ”s”, Remove ”ing” Add ”e”, take ”ational” Turn into ”ate”, take ”tional” Turn into ”tion”.
- Lemmatization It is mainly stored in a dictionary in a predetermined format . For example, the dictionary has ”driving” To ”drive”,”drove” To ”drive”,”am, is, are” To ”be” Mapping , When making a change , Just convert it in the way agreed in the dictionary .
- Stemming and lemmatization Not mutually exclusive , There are intersections , Some words can achieve the same transformation in both ways .
3、 ... and : Get the word (Term) Pass to index component (Indexer) 1. Use the words you get (Term) Create a dictionary
Term Document ID
student 1
allow 1
go 1
their 1
friend 1
allow 1
drink 1
beer 1
my 2
friend 2
jerry 2
go 2
school 2
see 2
his 2
student 2
find 2
them 2
drink 2
allow 2
2. Sort the dictionary alphabetically :
Term Document ID
allow 1
allow 1
allow 2
beer 1
drink 1
drink 2
find 2
friend 1
friend 2
go 1
go 2
his 2
jerry 2
my 2
school 2
see 2
student 1
student 2
their 1
them 2
- Merge the same words (Term) The owner can only show part of the content (Posting List) Linked list
- Document Frequency: Document frequency , Indicates how many documents have this word appeared (Term)
- Frequency: Word frequency , Indicates the word in a document (Term) There have been several times
rehearse lines (Term) “allow” Speaking of , There are two documents containing the word (Term), word (Term) There are two in the following document linked list , The first indicates that it contains ”allow” The first document of , namely 1 Document No , In this document ,”allow” There is 2 Time , The second representation contains ”allow” The second document for , yes 2 Document No , In this document ,”allow” There is 1 Time
At this point, the index creation is completed , Search for ”drive” when ,”driving”,”drove”,”driven” Can also be found . Because in the index ,”driving”,”drove”,”driven” Will be processed into ”drive”, In search , If you enter ”driving”, The input query statement is also processed by word segmentation component and language processing component , Change to query ”drive”, So you can search for the document you want .
Search steps
Search for ”microsoft job”, The user's goal is to find a job at Microsoft , If the result is :”Microsoft does a good job at software industry…”, This deviates too far from the expectations of users . How to search reasonably and effectively , Search out the results that users want most ? The main search steps are as follows :
One : Lexical analysis of query content 、 Syntax analysis 、 language processing
- Lexical analysis : Distinguish between words and keywords in the query content , such as :english and janpan,”and” It's keywords ,”english” and ”janpan” It's a common word .
- Form a tree according to the syntax rules of the query syntax
- language processing , It is handled in the same way as when creating an index . such as :leaned–>lean,driven–>drive
Two : Search index , Get a collection of documents that conform to the syntax tree 3、 ... and : According to the relevance of the query statement to the document , Sort results
We treat the query statement as a document , The relevance between documents (relevance) scores (scoring), The higher the score, the more relevant , The higher the ranking . Of course, you can also artificially affect the scoring , For example, Baidu search , It doesn't necessarily rank according to the correlation .
How to judge the relevance between documents ? A document consists of multiple ( Or a ) word (Term) form , such as :”solr”, “toturial”, Different words may have different meanings , such as solr Just like toturial important , If a document appears 10 Time toturial, But only once solr, And another document solr There is 4 Time ,toturial once , Then the latter is probably the result of the search we want . This leads to weight (Term weight) The concept of .
The weight Indicates the importance of the word in the document , The more important words are, of course, the higher the weight , Therefore, it has greater influence in calculating document relevance . The process of getting document relevance through the weight between words is called Space vector model algorithm (Vector Space Model)
There are two main ways to influence the importance of a word in a document :
- Term Frequencey(tf),Term Frequency of occurrence in this document ,ft Bigger means more important
- Document Frequency(df), Indicates how many documents have this Trem,df Bigger means less important Hope is the most precious thing , Everyone has something , Naturally, it is not so valuable , Only what you own means that this thing is precious , The formula of weight :
Space vector model
The weight of words in the document is regarded as a vector
Document = {term1, term2, …… ,term N}
Document Vector = {weight1, weight2, …… ,weight N}
Think of the statement to be queried as a simple document , Also expressed as a vector :
Query = {term1, term 2, …… , term N}
Query Vector = {weight1, weight2, …… , weight N}
Put the searched document vector and query vector into N Dimensional space , Each word means one dimension :
The smaller the angle , The more similar , The more relevant
边栏推荐
- 7-23 currency conversion (using array conversion)
- JVM object lifecycle
- Simulated access
- npm install卡住与node-npy的各种奇怪报错
- Scroll detection of the navigation bar enables the navigation bar to slide and fix with no content
- etcd集群权限管理和账号密码使用
- Redis: operation command of string type data
- MongoDB索引
- Current situation, analysis and prediction of information and innovation industry
- Nucleic acid modified metal organic framework drug carrier | pcn-223 metal organic framework encapsulated ad adamantane | zif-8 encapsulated adriamycin (DOX)
猜你喜欢
Exercise 9-1 time conversion
QT learning 24 layout manager (III)
jvm-运行时数据区
Exercise 8-8 moving letters
Exercise 7-6 count capital consonants
Exercise 10-1 judge the three digits that meet the conditions
好看、好用、强大的手写笔记软件综合评测:Notability、GoodNotes、MarginNote、随手写、Notes Writers、CollaNote、CollaNote、Prodrafts、Noteshelf、FlowUs、OneNote、苹果备忘录
Why are grass-roots colleges and universities with "soil and poverty" called "Northeast small Tsinghua"?
Eight sorts
TS code automatically generates JS
随机推荐
Duet date picker (time plug-in that can manually enter the date)
Exercise 10-8 recursive implementation of sequential output of integers
Metal organic framework (MOFs) antitumor drug carrier | pcn-223 loaded with metronidazole | uio-66 loaded with ciprofloxacin hydrochloride(
虽然不一定最优秀,但一定是最努力的!
Exercise 9-1 time conversion
Exercise 10-3 recursive implementation of exponential functions
Redis: redis data structure and key operation commands
Learn to punch in today
Exercise 8-7 string sorting
The small project (servlet+jsp+mysql+el+jstl) completes a servlet with login function, with the operation of adding, deleting, modifying and querying. Realize login authentication, prevent illegal log
Nucleic acid modified metal organic framework drug carrier | pcn-223 metal organic framework encapsulated ad adamantane | zif-8 encapsulated adriamycin (DOX)
7-19 check denomination (solve binary linear equation)
Formation of mil-100 (FE) coated small molecule aspirin [email protected] (FE) | glycyrrhetinic acid modified metal organ
QT learning 23 layout manager (II)
Exercise 6-6 use a function to output an integer in reverse order
JS download files through URL links
Raft agreement
7-22 tortoise and rabbit race (result oriented)
Back to top implementation
7-11 calculation of residential water charges by sections