Full-Text Search Engine Solr Series: Basic Principles of Full-Text Search
2022-07-03 14:17:00 【Brother Xing plays with the clouds】
Scenario: When we used the Xinhua Dictionary as children, Mom would say, "Turn to page 38 and find where 'keng die' (坑爹, 'cheating dad') is." How would you look it up? No doubt your eyes would scan page 38 from the first character to the last until you found those two characters. This method is called sequential scanning. For a small amount of data, sequential scanning is good enough. But if Mom asked you which page the character "pit" (坑) is on, and you started scanning character by character from the first page, then you really would have been cheated. At this point you need an index. The index records which page "pit" is on; you only need to find "pit" in the index, read the corresponding page number, and you have the answer. Looking up "pit" in the index is very fast, because you know its radical, so you can quickly locate the character.
So how is the table of contents of the Xinhua Dictionary (its index table) compiled? For the dictionary itself, once the table of contents is removed, the book is just a pile of unstructured data. But humans are good at observing and summarizing: every character corresponds to a page number, for example "pit" is on page 38 and "dad" is on page 90. So we extract this information and organize it into structured data, similar to a table in a database:

word    page_no
---------------
pit     38
dad     90
...     ...

This forms a complete table of contents (an index library), and looking things up in it is very convenient. The same principle applies to full-text retrieval, which boils down to two processes: 1. index creation (Indexing) and 2. index search (Search). So how is the index created? What exactly is stored in it? And how is it searched? Keep these questions in mind as you read on.
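To make the lookup concrete, here is a minimal sketch in plain Java (the map contents are just the illustrative entries from the table above):

```java
import java.util.HashMap;
import java.util.Map;

public class DictionaryIndex {
    public static void main(String[] args) {
        // The "table of contents" of the dictionary: each word maps to its page number.
        Map<String, Integer> index = new HashMap<>();
        index.put("pit", 38);
        index.put("dad", 90);

        // Looking a word up in the index is a single fast lookup,
        // instead of scanning every page from the beginning.
        System.out.println("\"pit\" is on page " + index.get("pit")); // prints: "pit" is on page 38
    }
}
```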
Indexes
Solr/Lucene uses an inverted index (also called a reverse index). An inverted index maps keywords to the documents that contain them; the index that stores this mapping is called an inverted index. It has two parts:

- the sequence of strings (the terms) on one side
- for each string, the list of IDs of the documents (Document) that contain it, called the posting list (Posting List), on the other

The term list (the dictionary) and the posting lists together form the inverted index. Suppose we now search for "lucene". The index tells us directly that the documents containing "lucene" are 2, 3, 10, 35 and 92; there is no need to scan the entire document library one by one. If we want the documents that contain both "lucene" and "solr", we intersect the two corresponding posting lists and obtain 3, 10, 35 and 92.
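To make this concrete, here is a minimal sketch in plain Java (the posting list for "solr" is made up so that the intersection matches the numbers above; nothing here is Solr/Lucene API):

```java
import java.util.*;

public class InvertedIndexDemo {
    public static void main(String[] args) {
        // Term dictionary: each term maps to its posting list (sorted document IDs).
        Map<String, List<Integer>> index = new HashMap<>();
        index.put("lucene", Arrays.asList(2, 3, 10, 35, 92));
        index.put("solr",   Arrays.asList(3, 10, 35, 92, 103)); // illustrative

        // Single-term query: just read the posting list.
        System.out.println("lucene -> " + index.get("lucene"));

        // AND query: intersect the two sorted posting lists.
        System.out.println("lucene AND solr -> "
                + intersect(index.get("lucene"), index.get("solr"))); // [3, 10, 35, 92]
    }

    // Linear-time intersection of two sorted lists (the classic merge step).
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = Integer.compare(a.get(i), b.get(j));
            if (cmp == 0) { result.add(a.get(i)); i++; j++; }
            else if (cmp < 0) i++;
            else j++;
        }
        return result;
    }
}
```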
Index creation
Suppose we have the following two original documents:
Document 1: Students should be allowed to go out with their friends, but not allowed to drink beer.
Document 2: My friend Jerry went to school to see his students but found them drunk which is not allowed.
The index creation process is roughly divided into the following steps.
One: Hand the original documents to the tokenization component (Tokenizer). The tokenizer does the following (this process is called Tokenize), and its output is lexical units (Token):

- split the document into individual words
- remove punctuation
- remove stop words (stop words)

Stop words (stop words) have no specific meaning in a language, so in most cases they are not used as search keywords; removing them reduces the size of the index. English stop words include "the", "a", "this"; Chinese ones include "的", "得" and so on. Tokenizers (Tokenizer) for different languages have their own stop-word sets. The output of tokenization is the lexical units (Token). In the example above, we obtain the following tokens (Token):

"Students", "allowed", "go", "their", "friends", "allowed", "drink", "beer", "My", "friend", "Jerry", "went", "school", "see", "his", "students", "found", "them", "drunk", "allowed"
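A minimal tokenizer sketch in plain Java (the stop-word set is a small illustrative subset chosen so the output matches the token list above; it is not Lucene's real analyzer):

```java
import java.util.*;

public class SimpleTokenizer {
    // A tiny illustrative stop-word list; real analyzers ship much larger ones.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "the", "a", "this", "to", "with", "but", "not",
            "should", "be", "out", "of", "is", "which"));

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        // Splitting on anything that is not a letter both separates words
        // and strips punctuation in one step.
        for (String word : text.split("[^A-Za-z]+")) {
            if (word.isEmpty()) continue;
            if (STOP_WORDS.contains(word.toLowerCase())) continue; // drop stop words
            tokens.add(word);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize(
            "Students should be allowed to go out with their friends, "
          + "but not allowed to drink beer."));
        // [Students, allowed, go, their, friends, allowed, drink, beer]
    }
}
```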
Two: Pass the lexical units (Token) to the linguistic processing component (Linguistic Processor). The linguistic processor performs language-specific processing on the tokens (Token). For English it generally does the following:

- convert to lowercase (Lowercase)
- reduce words to their root form, e.g. "cars" to "car"; this operation is called stemming
- reduce words to their dictionary form, e.g. "drove" to "drive"; this operation is called lemmatization

The output of the linguistic processor is called terms (Term). For our example, the terms (Term) obtained after language processing are:

"student", "allow", "go", "their", "friend", "allow", "drink", "beer", "my", "friend", "jerry", "go", "school", "see", "his", "student", "find", "them", "drink", "allow"

Thanks to language processing, a search for "drive" will also match documents that contained "drove". Similarities and differences between stemming and lemmatization (a small sketch combining both follows this list):
- Similarities:
  - Both stemming and lemmatization reduce words to a base form.
- The transformations are different:
  - Stemming works by "chopping": "cars" to "car", "driving" to "drive".
  - Lemmatization works by "mapping": "drove" to "drive", "driving" to "drive".
- The algorithms are different:
  - Stemming mainly applies fixed rules, such as removing "s", removing "ing" and adding "e", turning "ational" into "ate", turning "tional" into "tion".
  - Lemmatization mainly relies on a dictionary stored in a predefined format, for example mappings from "driving" to "drive", "drove" to "drive" and "am, is, are" to "be"; a word is transformed simply by looking it up in this dictionary.
- Stemming and lemmatization are not mutually exclusive; they overlap, and some words can be reduced to the same form by either method.
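A minimal sketch in plain Java of this linguistic processing step, combining lowercasing, a tiny lemma dictionary and a couple of hand-written stemming rules (all entries and rules are illustrative; a real implementation would use a full stemmer such as Porter's and a much larger dictionary):

```java
import java.util.*;

public class LinguisticProcessor {
    // Lemmatization: a dictionary of irregular forms mapped to their lemma (illustrative entries only).
    private static final Map<String, String> LEMMA_DICT = new HashMap<>();
    static {
        LEMMA_DICT.put("drove", "drive");
        LEMMA_DICT.put("driven", "drive");
        LEMMA_DICT.put("went", "go");
        LEMMA_DICT.put("found", "find");
        LEMMA_DICT.put("drunk", "drink");
        LEMMA_DICT.put("am", "be");
        LEMMA_DICT.put("is", "be");
        LEMMA_DICT.put("are", "be");
    }

    public static String process(String token) {
        String term = token.toLowerCase();      // 1. lowercase
        String lemma = LEMMA_DICT.get(term);    // 2. lemmatization: dictionary lookup
        if (lemma != null) return lemma;
        return stem(term);                      // 3. stemming: rule-based suffix stripping
    }

    // Two hand-written rules in the spirit of those quoted above
    // (remove "ing" and add "e", remove "s"); length guards keep short words intact.
    private static String stem(String term) {
        if (term.endsWith("ing") && term.length() > 5) return term.substring(0, term.length() - 3) + "e";
        if (term.endsWith("s")   && term.length() > 3) return term.substring(0, term.length() - 1);
        return term;
    }

    public static void main(String[] args) {
        for (String t : new String[]{"Students", "cars", "driving", "drove", "went", "drunk"}) {
            System.out.println(t + " -> " + process(t)); // student, car, drive, drive, go, drink
        }
    }
}
```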
Three: Pass the terms (Term) to the indexing component (Indexer). The indexer does the following:

1. Use the terms (Term) to build a dictionary:
Term Document ID
student 1
allow 1
go 1
their 1
friend 1
allow 1
drink 1
beer 1
my 2
friend 2
jerry 2
go 2
school 2
see 2
his 2
student 2
find 2
them 2
drink 2
allow 2
2. Sort the dictionary alphabetically:
Term Document ID
allow 1
allow 1
allow 2
beer 1
drink 1
drink 2
find 2
friend 1
friend 2
go 1
go 2
his 2
jerry 2
my 2
school 2
see 2
student 1
student 2
their 1
them 2
3. Merge identical terms (Term) into document posting lists (Posting List):

- Document Frequency: the number of documents in which the term (Term) appears
- Frequency: the term frequency, i.e. the number of times the term (Term) appears in a given document

Take the term (Term) "allow" as an example. Two documents contain this term, so the posting list for "allow" has two entries. The first entry says that the first document containing "allow" is document 1, in which "allow" appears 2 times; the second entry says that the second document containing "allow" is document 2, in which "allow" appears 1 time.
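Putting the three steps together, here is a minimal indexing sketch in plain Java that builds sorted posting lists with document frequency and term frequency from the two example documents (the input is the already-processed term lists from above; class and variable names are illustrative):

```java
import java.util.*;

public class Indexer {
    public static void main(String[] args) {
        // Terms of each document after tokenization and linguistic processing.
        Map<Integer, String[]> docs = new LinkedHashMap<>();
        docs.put(1, new String[]{"student","allow","go","their","friend","allow","drink","beer"});
        docs.put(2, new String[]{"my","friend","jerry","go","school","see","his","student",
                                 "find","them","drink","allow"});

        // term -> (document ID -> term frequency); TreeMap keeps terms sorted alphabetically,
        // like the sorted dictionary in step 2.
        Map<String, Map<Integer, Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String[]> doc : docs.entrySet()) {
            for (String term : doc.getValue()) {
                index.computeIfAbsent(term, t -> new LinkedHashMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }

        // Print each term with its document frequency (df) and posting list (docId=tf).
        for (Map.Entry<String, Map<Integer, Integer>> e : index.entrySet()) {
            System.out.println(e.getKey() + "  df=" + e.getValue().size()
                    + "  postings=" + e.getValue());
        }
        // e.g.  allow  df=2  postings={1=2, 2=1}
    }
}
```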
At this point, index creation is complete. When we later search for "drive", documents containing "driving", "drove" or "driven" can also be found, because in the index "driving", "drove" and "driven" have all been reduced to "drive". Likewise, if the user enters "driving", the query string goes through the same tokenization and language processing and becomes a query for "drive", so the desired documents are still found.
Search steps
Suppose we search for "microsoft job": the user's goal is to find a job at Microsoft. If a result such as "Microsoft does a good job at software industry…" comes back, it is far from what the user expected. So how do we search reasonably and effectively, and return the results the user wants most? The main search steps are as follows:
One: Lexical analysis, syntax analysis and language processing of the query

- Lexical analysis: distinguish ordinary words from keywords in the query. For example, in "english AND japan", "AND" is a keyword while "english" and "japan" are ordinary words.
- Syntax analysis: build a syntax tree according to the rules of the query syntax.
- Language processing: apply the same processing as at index time, e.g. "leaned" to "lean", "driven" to "drive". (A small query-analysis sketch follows this list.)
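A minimal query-analysis sketch in plain Java (the keyword set and the lowercase-only "language processing" are simplifications for illustration):

```java
import java.util.*;

public class QueryProcessor {
    // Keywords of a tiny, illustrative query grammar.
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList("AND", "OR", "NOT"));

    public static void main(String[] args) {
        String query = "english AND japan";

        List<String> ordinaryTerms = new ArrayList<>();
        List<String> operators = new ArrayList<>();

        // Lexical analysis: separate keywords from ordinary words.
        for (String token : query.split("\\s+")) {
            if (KEYWORDS.contains(token)) {
                operators.add(token);
            } else {
                // Language processing: the same steps used at index time must be applied
                // to the query terms (simplified here to lowercasing).
                ordinaryTerms.add(token.toLowerCase());
            }
        }

        System.out.println("terms = " + ordinaryTerms + ", operators = " + operators);
        // A real engine would now build a syntax tree from the operators and evaluate it
        // against the index (e.g. AND = intersect posting lists, as in the earlier sketch).
    }
}
```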
Two: Search the index and obtain the set of documents that satisfy the syntax tree.

Three: Sort the results according to the relevance of the query to each document.
We treat the query string itself as a short document and score (scoring) its relevance (relevance) to each indexed document: the higher the score, the more relevant the document and the higher it ranks. Of course, scoring can also be influenced artificially; Baidu search, for example, does not necessarily rank purely by relevance.

How do we judge the relevance between documents? A document consists of one or more terms (Term), such as "solr" and "tutorial", and different terms matter to different degrees. For instance, "solr" is more important than "tutorial": if one document contains "tutorial" 10 times but "solr" only once, while another contains "solr" 4 times and "tutorial" once, the latter is far more likely to be the result we want. This leads to the concept of term weight (Term weight).

The weight indicates how important a term is within a document. The more important the term, the higher its weight, and the more influence it has when computing document relevance. The process of deriving document relevance from term weights is called the vector space model algorithm (Vector Space Model).

Two main factors influence a term's importance within a document:

- Term Frequency (tf): how many times the term appears in this document; the larger tf is, the more important the term.
- Document Frequency (df): how many documents contain the term; the larger df is, the less important the term. Rarity makes things precious: if everyone has something, it is not particularly valuable; only when few possess it is it truly precious. The weight formula:
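The original page showed the weight formula as an image that is not reproduced here; a standard form consistent with the description above (classic TF-IDF weighting, the basis of traditional Lucene scoring) is w(t, d) = tf(t, d) × log(n / df(t)), where n is the total number of documents in the index. A minimal sketch under that assumption:

```java
public class TermWeight {
    /**
     * Classic TF-IDF style weight: tf * log(n / df).
     * tf = number of times the term occurs in this document,
     * df = number of documents containing the term,
     * n  = total number of documents in the index.
     * (Natural log is used; the base only scales all weights uniformly.)
     */
    static double weight(int tf, int df, int n) {
        return tf * Math.log((double) n / df);
    }

    public static void main(String[] args) {
        int n = 10_000; // illustrative index size
        // "solr" is rare (small df), so each occurrence carries more weight than "tutorial".
        System.out.println("solr,     tf=4:  " + weight(4, 50, n));     // ~21.2
        System.out.println("tutorial, tf=10: " + weight(10, 5_000, n)); // ~6.9
    }
}
```

With these illustrative numbers, the document containing the rare term "solr" 4 times outweighs the one containing the common term "tutorial" 10 times, matching the intuition above.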
Vector space model

We regard the term weights of a document as a vector:

Document = {term1, term2, …, termN}
Document Vector = {weight1, weight2, …, weightN}

The query string is treated as a simple document and likewise represented as a vector:

Query = {term1, term2, …, termN}
Query Vector = {weight1, weight2, …, weightN}

All document vectors and the query vector are placed in an N-dimensional space, where each term is one dimension. The smaller the angle between a document vector and the query vector, the more similar they are and the more relevant the document.
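The angle comparison is normally computed as cosine similarity, score(q, d) = (V(q) · V(d)) / (|V(q)| × |V(d)|). A minimal sketch in plain Java with illustrative dense weight vectors:

```java
public class CosineSimilarity {
    // cos(theta) = (d . q) / (|d| * |q|); the smaller the angle, the closer this is to 1.
    static double cosine(double[] d, double[] q) {
        double dot = 0, normD = 0, normQ = 0;
        for (int i = 0; i < d.length; i++) {
            dot   += d[i] * q[i];
            normD += d[i] * d[i];
            normQ += q[i] * q[i];
        }
        return dot / (Math.sqrt(normD) * Math.sqrt(normQ));
    }

    public static void main(String[] args) {
        // Each dimension is one term; the values are term weights (e.g. tf-idf), made up here.
        double[] doc1  = {2.1, 0.0, 1.3};
        double[] doc2  = {0.0, 3.0, 0.2};
        double[] query = {1.0, 0.0, 1.0};

        System.out.println("doc1 vs query: " + cosine(doc1, query)); // higher -> more relevant
        System.out.println("doc2 vs query: " + cosine(doc2, query));
    }
}
```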