Lucene mind map makes search engines no longer difficult to understand
2022-06-11 00:43:00 【Wenxiaowu】

Today, let's talk about Lucene. Everyone, grab a seat and settle in.

(So what does Lucene actually do?)
First, let's look at a mind map:

These are the full-text search frameworks commonly used in Java; the search functionality of many projects is built on one of these four frameworks.
So what exactly is Lucene?
Lucene is an open-source library for full-text indexing and search: a simple yet powerful core library whose API makes it easy to add search functionality to an application.
Lucene is currently the most popular Java full-text search framework, and the reason is simple: Hibernate Search, Solr, and Elasticsearch are all search engines built as extensions of Lucene.
Hibernate Search is a full-text search tool built on top of Apache Lucene, used mainly to search the Hibernate persistence model.
Elasticsearch is likewise developed in Java and uses Lucene as the core of all its indexing and search functionality, but its goal is to hide Lucene's complexity behind a simple RESTful API, making full-text search easy.
Solr is an open-source, Lucene-based Java search server that is easy to add to web applications. It provides faceted search (statistics), hit highlighting, and multiple output formats (including XML/XSLT and JSON).
So Lucene is pretty impressive, isn't it?
Next, we'll break Lucene down into the following parts to uncover its true face:
- Relevant concepts
- Index building and querying process
- Inverted index
- Visualization tools
- Project application guide
Relevant concepts
Lucene official website: http://lucene.apache.org/
Since Lucene is a full-text search tool, it must have certain storage structures and rules. When we enter keywords, Lucene can follow its internal hierarchy to quickly retrieve the content we need. Several levels and concepts are involved.

Index (Index)
One directory holds one index library: all the files in the same folder together form a single Lucene index. This is similar to the concept of a table in a database.

(Example of a Lucene index directory)
Segment
A Lucene index may consist of multiple sub-indexes, and these sub-indexes are called segments. Each segment is a complete, independent index that can be searched on its own.
Document
An index can contain multiple segments; segments are independent of one another, adding new documents generates new segments, and different segments can be merged. A document is the basic unit in which indexed data is stored, similar to a row in a database or a document in a document database.
Field
A document contains different kinds of information that can be indexed separately, such as the title, time, body text, author, and so on. A field is similar to a column in a database table.
Term
A term is the smallest unit of indexing: a string produced by lexical analysis and language processing. A Field consists of one or more Terms. For example, if the title is "hello lucene", tokenization yields the two terms "hello" and "lucene"; a keyword search for "hello" or "lucene" will then find this title.
Analyzer
A meaningful passage of text must first be split into terms by an Analyzer before it can be searched by keyword. StandardAnalyzer is the analyzer most commonly used with Lucene; for Chinese text there are CJKAnalyzer, SmartChineseAnalyzer, and others.
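To see what an analyzer actually produces, here is a minimal sketch (my own illustration, assuming the lucene-core 7.4.0 dependency listed at the end of this article) that prints the terms StandardAnalyzer extracts from the example title above:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        // Tokenize the example title "hello lucene" from the text above
        try (TokenStream ts = analyzer.tokenStream("Title", "hello lucene")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString()); // prints "hello", then "lucene"
            }
            ts.end();
        }
    }
}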

(Conceptual diagram of Lucene's index storage structure)
The figure above can be read roughly as follows: an index internally consists of multiple segments; when new documents are added, new segments are generated, and different segments can be merged (Segment-0, Segment-1, and Segment-2 merge into Segment-4). A segment contains document numbers together with the documents' index information. Each document has multiple fields that can be indexed, and each field can be given a different type (StringField, TextField).
So, as the diagram shows, Lucene's hierarchy is: Index –> Segment –> Document –> Field –> Term.
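As an aside, segment merging normally happens automatically in the background, but it can also be triggered by hand. Here is a minimal sketch (my own illustration, assuming an index already exists at d:\index, as created by the indexing code later in this article) that uses IndexWriter.forceMerge to collapse all segments into one, like the merge shown in the diagram:

import java.nio.file.FileSystems;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeTest {
    public static void main(String[] args) throws Exception {
        // Open the existing index directory (hypothetical path, matching the code below)
        Directory dir = FSDirectory.open(FileSystems.getDefault().getPath("d:\\index\\"));
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, cfg)) {
            writer.forceMerge(1); // merge all segments down to a single segment
        }
    }
}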
Now that we know Lucene's basic concepts, let's move on to analyzing how it works.
(Why are Lucene's search queries so fast?)
Inverted index
We all know that building an index improves retrieval speed, and this is the key point: Lucene uses an inverted index (also called a reverse index) structure.
Where there is an inverted index (reverse index), there is naturally also a forward index.
A forward index goes from documents to words: in an ordinary scan, we search for the keyword inside each document.
An inverted index goes from words to documents, the opposite of a forward index: keywords are extracted from the documents ahead of time, so at query time a keyword can be matched directly to the documents that contain it.
One common formulation: because it is not the records that determine the attribute values, but the attribute values that determine the positions of the records, it is called an inverted index (inverted index).


(How is it implemented?)
Let's study an example (taken from the Internet):
Suppose there are two documents with the following contents:
Document 1: home sales rise in July.
Document 2: increase in home sales in July.

As the figure above shows, the documents are first tokenized by the analyzer (Analyzer) into terms (term), each paired with its document ID. The resulting terms are then sorted, identical terms are merged while their frequencies are counted, and the IDs of the documents in which each term appears are recorded.
Thus:
In the implementation, Lucene stores the three columns above as the dictionary file (Term Dictionary), the frequency file (frequencies), and the position file (positions). The dictionary file stores not only every keyword but also pointers into the frequency and position files, through which a keyword's frequency and position information can be found.
At query time, suppose we search for the word "sales": Lucene first binary-searches the dictionary to find the word, then follows its pointer into the frequency file to read all the document numbers, and returns the results. The dictionary is usually very small, so the whole process takes only milliseconds.
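To make the term-to-document mapping concrete, here is a toy sketch in plain Java (my own illustration; it models the dictionary and posting lists with ordinary collections, not Lucene's real on-disk structures) built over the two example documents:

import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class TinyInvertedIndex {
    public static void main(String[] args) {
        // The two example documents, lower-cased and punctuation-stripped,
        // as an analyzer would normalize them
        String[] docs = {
            "home sales rise in july",          // document 1
            "increase in home sales in july"    // document 2
        };
        // term -> sorted set of IDs of the documents that contain it
        Map<String, TreeSet<Integer>> postings = new TreeMap<>();
        for (int docId = 1; docId <= docs.length; docId++) {
            for (String term : docs[docId - 1].split("\\s+")) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
        // Query: look the term up directly instead of scanning every document
        System.out.println(postings.get("sales")); // [1, 2]
        System.out.println(postings);              // full dictionary with posting lists
    }
}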
(So that's it!)
Lucene visualization tool: Luke
https://github.com/DmitryKey/luke/releases

Index building and querying process
Above we covered the principles behind Lucene's index; next let's use Lucene at the code level.
Let's start with a picture:

Files must be indexed before they can be retrieved, so the figure above should be read starting from the "files to be retrieved" node.
Indexing process:
1. Build a Document object for each file to be indexed, treating each part of the file as a Field object.
2. Use an Analyzer to tokenize the natural-language text in the documents, and use an IndexWriter to build the index.
3. Use FSDirectory to set how and where the index is stored, and persist the index.
Index retrieval process:
4. Use IndexReader to read the index.
5. Use Term to represent the user's search keyword and the field it belongs to, and QueryParser to represent the user's query criteria.
6. Use IndexSearcher to search the index and return the matching Document objects.
The dotted lines point to each class's package, e.g., Analyzer is in the org.apache.lucene.analysis package.

Index-building code:
// Create an index
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.util.Date;
import java.util.LinkedList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class CreateTest {
    public static void main(String[] args) throws Exception {
        Path indexPath = FileSystems.getDefault().getPath("d:\\index\\");
        // FSDirectory has three main subclasses; open() automatically picks the most
        // suitable one for the current platform:
        //   MMapDirectory: Linux, macOS, Solaris
        //   NIOFSDirectory: other non-Windows JREs
        //   SimpleFSDirectory: other JREs on Windows
        Directory dir = FSDirectory.open(indexPath);
        // Analyzer (tokenizer)
        Analyzer analyzer = new StandardAnalyzer();
        boolean create = true;
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
        if (create) {
            indexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        } else {
            // Lucene has no in-place update: an "update" deletes the old entry
            // and then creates a new one
            indexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        }
        IndexWriter indexWriter = new IndexWriter(dir, indexWriterConfig);
        Document doc = new Document();
        // A StringField's value is indexed but not tokenized; it is treated as one
        // complete token, typically used for fields such as "Country" or "ID".
        // Field.Store indicates whether the original field value is stored in the index:
        //   store it if you want to display the value in query results;
        //   if the content is large and need not be displayed (e.g., a full article body),
        //   it is not suitable to store it in the index.
        doc.add(new StringField("Title", "sean", Field.Store.YES));
        long time = new Date().getTime();
        // LongPoint does not store the field value
        doc.add(new LongPoint("LastModified", time));
        // doc.add(new NumericDocValuesField("LastModified", time));
        // A TextField is indexed and tokenized automatically, typically used for body text
        doc.add(new TextField("Content", "this is a test of sean", Field.Store.NO));
        List<Document> docs = new LinkedList<>();
        docs.add(doc);
        indexWriter.addDocuments(docs);
        // close() commits by default before closing
        indexWriter.close();
    }
}
Corresponding sequence diagram:

Query index code:
// Query the index
import java.nio.file.FileSystems;
import java.nio.file.Path;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class QueryTest {
    public static void main(String[] args) throws Exception {
        Path indexPath = FileSystems.getDefault().getPath("d:\\index\\");
        Directory dir = FSDirectory.open(indexPath);
        // Analyzer (tokenizer)
        Analyzer analyzer = new StandardAnalyzer();
        IndexReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        // Query several fields at once
        // String[] queryFields = {"Title", "Content", "LastModified"};
        // QueryParser parser = new MultiFieldQueryParser(queryFields, analyzer);
        // Query query = parser.parse("sean");
        // Term query: search a single field for an exact term
        // Term term = new Term("Title", "test");
        // Query query = new TermQuery(term);
        // Wildcard query
        // Term term = new Term("Title", "se*");
        // WildcardQuery query = new WildcardQuery(term);
        // Range query
        Query query1 = LongPoint.newRangeQuery("LastModified", 1L, 1637069693000L);
        // Phrase query with multiple terms; slop is the maximum allowed
        // positional distance between the terms
        PhraseQuery.Builder phraseQueryBuilder = new PhraseQuery.Builder();
        phraseQueryBuilder.add(new Term("Content", "test"));
        phraseQueryBuilder.add(new Term("Content", "sean"));
        phraseQueryBuilder.setSlop(10);
        PhraseQuery query2 = phraseQueryBuilder.build();
        // Compound (boolean) query
        BooleanQuery.Builder booleanQueryBuilder = new BooleanQuery.Builder();
        booleanQueryBuilder.add(query1, BooleanClause.Occur.MUST);
        booleanQueryBuilder.add(query2, BooleanClause.Occur.MUST);
        BooleanQuery query = booleanQueryBuilder.build();
        // Sort the returned docs
        // The sort field must exist, otherwise an error is thrown
        Sort sort = new Sort();
        SortField sortField = new SortField("Title", SortField.Type.SCORE);
        sort.setSort(sortField);
        TopDocs topDocs = searcher.search(query, 10, sort);
        if (topDocs.totalHits > 0) {
            for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                int docNum = scoreDoc.doc;
                Document doc = searcher.doc(docNum);
                System.out.println(doc.toString());
            }
        }
    }
}
Corresponding sequence diagram:

Lucene version information (Maven dependencies):
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>7.4.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>7.4.0</version>
</dependency>
Project application guide
In actual development, Lucene is rarely used directly. The mainstream search frameworks today, Solr and Elasticsearch, are both based on Lucene and provide us with simpler APIs. Especially in distributed environments, Elasticsearch helps us solve single-point-of-failure, backup, and cluster-sharding problems, and is more in line with current development trends.