
Lucene mind map makes search engines no longer difficult to understand

2022-06-11 00:43:00 Wenxiaowu

  Today, let's talk about Lucene. Students, grab a bench and have a seat.

(So what does Lucene actually do?)

First, let's look at a mind map:

These are the full-text search frameworks commonly used in Java; the search functionality of many projects is built on one of the four frameworks above.

So what exactly is Lucene?

Lucene is an open-source library for full-text indexing and search: a simple yet powerful core API that makes it easy to add search functionality to an application.

Lucene is currently the most popular Java full-text search framework, and the reason is simple: Hibernate Search, Solr, and Elasticsearch are all search engines built on top of Lucene.

Hibernate Search is a full-text retrieval tool built on Apache Lucene, used mainly for searching Hibernate's persistence model.

Elasticsearch is also developed in Java and uses Lucene at its core for all indexing and search functionality, but its goal is to hide Lucene's complexity behind a simple RESTful API, making full-text search easy.

Solr is an open-source, Lucene-based Java search server that is easy to add to web applications. It provides faceted search (statistics), hit highlighting, and support for multiple output formats (including XML/XSLT and JSON).

So, is Lucene impressive or what?!

Next, we break things into the following parts to uncover Lucene's true face:

  • Relevant concepts

  • Building and querying the index

  • Inverted index

  • Visualization tools

  • Project application guide

Relevant concepts

Lucene official website: http://lucene.apache.org/

Since Lucene is a full-text search tool, it must have some storage structure and rules. When we enter keywords, Lucene follows its internal hierarchy to quickly retrieve the content we need. Several levels and concepts are involved.

Index library (Index)

One directory is one index library: all the files in the same folder make up one Lucene index library. It is similar to the concept of a table in a database.

(Example of a Lucene index)

Segment

A Lucene index may consist of multiple sub-indexes, called segments. Each segment is a complete, independent index that can be searched on its own.

Document

A document is the unit of indexed data storage, similar to a row in a relational database or a document in a document database. An index can contain multiple documents; adding a new document generates a new segment, and independent segments can be merged.

Field

A document contains different types of information that can be indexed separately, such as title, time, body, and author. A field is similar to a column in a database table.

Term

A term is the smallest unit of indexing: a string produced by lexical analysis and language processing. A field consists of one or more terms. For example, if the title is "hello lucene", tokenization yields the two terms "hello" and "lucene"; searching for the keyword "hello" or "lucene" will then find this title.

Analyzer

A meaningful passage of text must be split into terms by an Analyzer before it can be searched by keyword. StandardAnalyzer is the analyzer commonly used in Lucene; for Chinese there are CJKAnalyzer, SmartChineseAnalyzer, and others.
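
To see what an analyzer actually produces, here is a minimal sketch against the Lucene 7.x API that prints the tokens StandardAnalyzer extracts from the title above (the field name "Title" is just an illustrative label):

public class AnalyzerDemo {
    public static void main(String[] args) throws Exception {
        org.apache.lucene.analysis.Analyzer analyzer =
                new org.apache.lucene.analysis.standard.StandardAnalyzer();
        // Tokenize the title "hello lucene"
        try (org.apache.lucene.analysis.TokenStream ts =
                     analyzer.tokenStream("Title", "hello lucene")) {
            org.apache.lucene.analysis.tokenattributes.CharTermAttribute term =
                    ts.addAttribute(org.apache.lucene.analysis.tokenattributes.CharTermAttribute.class);
            ts.reset();                      // mandatory before incrementToken()
            while (ts.incrementToken()) {
                System.out.println(term);    // prints: hello, then lucene
            }
            ts.end();
        }
    }
}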

(Conceptual diagram of Lucene's index storage structure)

The figure above can be read roughly as follows: an index is internally composed of multiple segments; adding a new document generates a new segment; different segments can be merged (Segment-0, Segment-1, and Segment-2 are merged into Segment-4); and a segment contains document numbers along with each document's index information. Each document has multiple indexable fields, and each field can be given a different type (StringField, TextField).
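
Merging can also be requested explicitly. A hedged fragment, assuming an open IndexWriter named indexWriter as in the CreateTest example later in this article:

indexWriter.forceMerge(1);  // merge down to at most 1 segment; I/O-heavy, best for read-mostly indexes
indexWriter.close();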

So, as the diagram shows, Lucene's hierarchy is: Index –> Segment –> Document –> Field –> Term.

Now that we know Lucene's basic concepts, let's move on to how it works.

(Why are Lucene's search queries so fast?)

Inverted index

We all know that improving retrieval speed requires building an index, and here is the key point: Lucene uses an inverted index (also called a reverse index) structure.

Where there is an inverted index, there is naturally also a forward index.

  • A forward index goes from documents to terms: in a naive query, we scan each document looking for the keyword.

  • An inverted index goes from terms to documents, the inverse of the forward index: keywords are extracted from the documents in advance, so at query time we match the keyword directly and immediately get the corresponding documents.

Hence the conclusion: because it is not the record that determines the attribute value, but the attribute value that locates the record, it is called an inverted index.

(How is it implemented?)

Let's study an example (taken from the internet):

Suppose we now have two documents with the following contents:

  • Document 1: home sales rise in July.

  • Document 2: increase in home sales in July.

As the figure above shows, the documents are first tokenized by the analyzer (Analyzer) to obtain terms, each paired with the ID of the document it came from. This term list is then sorted, identical terms are merged, their frequencies are counted, and the IDs of the documents in which they appear are recorded.

Therefore:

In the implementation, Lucene saves these three columns as the term dictionary (Term Dictionary), the frequency file (frequencies), and the position file (positions). The dictionary file stores not only every keyword but also pointers into the frequency and position files, through which a keyword's frequency and position information can be found.

When searching the index, suppose you query the word "sales": Lucene first binary-searches the dictionary to find the word, follows its pointer into the frequency file to read all matching document numbers, and then returns the results. Because the dictionary is usually very small, the whole process takes only milliseconds.
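
To make the mechanism concrete, here is a toy sketch (plain Java, not Lucene's actual implementation) that builds term-to-document postings for the two example documents and answers the "sales" query with a dictionary lookup instead of a document scan:

import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class ToyInvertedIndex {
    public static void main(String[] args) {
        String[] docs = {
            "home sales rise in july",           // document 1
            "increase in home sales in july"     // document 2
        };
        // Dictionary: term -> sorted set of IDs of the documents containing it
        Map<String, SortedSet<Integer>> postings = new TreeMap<>();
        for (int docId = 1; docId <= docs.length; docId++) {
            for (String term : docs[docId - 1].split("\\s+")) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
        // Query "sales": a dictionary lookup, not a scan over every document
        System.out.println("sales -> " + postings.get("sales"));  // prints: sales -> [1, 2]
    }
}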

(So that's how it works!)

Lucene visualization tool: Luke

  • https://github.com/DmitryKey/luke/releases

Building and querying the index

Now that we understand how Lucene builds an index, let's use Lucene at the code level.

Let's start with a picture:

Files must be indexed before they can be searched, so read the figure above starting from the "documents to be indexed" node.

Indexing process:

1. Build a Document object for each file to be indexed, treating each part of the file as a Field object.

2. Use an Analyzer to tokenize the natural-language text in the documents, and use an IndexWriter to build the index.

3. Use an FSDirectory to set how and where the index is stored, and persist the index.

Index retrieval process:

4. Use an IndexReader to read the index.

5. Use a Term to represent the keyword the user searched for and the field it lives in, and a QueryParser to represent the user's query conditions.

6. Use an IndexSearcher to search the index and return the matching Document objects.

The dotted lines point to each class's package. For example, Analyzer is in the org.apache.lucene.analysis package.

Index-building code:

// Create an index
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.util.Date;
import java.util.LinkedList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class CreateTest {

    public static void main(String[] args) throws Exception {
        Path indexPath = FileSystems.getDefault().getPath("d:\\index\\");

        // FSDirectory has three main subclasses; open() automatically picks the
        // most appropriate one for the current platform:
        // MMapDirectory: Linux, macOS, Solaris
        // NIOFSDirectory: other non-Windows JREs
        // SimpleFSDirectory: other JREs on Windows
        Directory dir = FSDirectory.open(indexPath);

        // Analyzer (tokenizer)
        Analyzer analyzer = new StandardAnalyzer();
        boolean create = true;
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
        if (create) {
            indexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        } else {
            // Lucene does not update documents in place; an "update" deletes the
            // old document and then adds a new one
            indexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        }
        IndexWriter indexWriter = new IndexWriter(dir, indexWriterConfig);

        Document doc = new Document();
        // A StringField is indexed but not tokenized: the whole value is treated
        // as a single token. Typically used for fields like "country" or "ID".
        // Field.Store indicates whether the original field value is stored in the
        // index: store it if you want to display the value in the query results;
        // if the content is large and need not be displayed (e.g. a full article
        // body), it is not suitable for storing in the index.
        doc.add(new StringField("Title", "sean", Field.Store.YES));
        long time = new Date().getTime();
        // LongPoint does not store the field value
        doc.add(new LongPoint("LastModified", time));
//        doc.add(new NumericDocValuesField("LastModified", time));
        // A TextField is indexed and tokenized; usually used for an article body
        doc.add(new TextField("Content", "this is a test of sean", Field.Store.NO));

        List<Document> docs = new LinkedList<>();
        docs.add(doc);

        indexWriter.addDocuments(docs);
        // close() commits pending changes by default
        indexWriter.close();
    }
}

Corresponding sequence diagram :
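
One note on updates: as the comment in CreateTest says, Lucene has no in-place update. A minimal hedged fragment (assuming the same open indexWriter and doc as above, plus an import of org.apache.lucene.index.Term):

// Delete-then-add, not an in-place update: updateDocument() first removes any
// document containing the given term, then adds the new document atomically.
indexWriter.updateDocument(new Term("Title", "sean"), doc);
indexWriter.commit();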

Index query code:

// Query the index
import java.nio.file.FileSystems;
import java.nio.file.Path;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class QueryTest {

    public static void main(String[] args) throws Exception {
        Path indexPath = FileSystems.getDefault().getPath("d:\\index\\");
        Directory dir = FSDirectory.open(indexPath);
        // Analyzer (tokenizer)
        Analyzer analyzer = new StandardAnalyzer();

        IndexReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);

        // Query several fields at once
//        String[] queryFields = {"Title", "Content", "LastModified"};
//        QueryParser parser = new MultiFieldQueryParser(queryFields, analyzer);
//        Query query = parser.parse("sean");

        // Search a single field for an exact term
//        Term term = new Term("Title", "test");
//        Query query = new TermQuery(term);

        // Wildcard query
//        Term term = new Term("Title", "se*");
//        WildcardQuery query = new WildcardQuery(term);

        // Range query
        Query query1 = LongPoint.newRangeQuery("LastModified", 1L, 1637069693000L);

        // Multi-keyword (phrase) query; slop is the maximum positional distance
        // allowed between the terms
        PhraseQuery.Builder phraseQueryBuilder = new PhraseQuery.Builder();
        phraseQueryBuilder.add(new Term("Content", "test"));
        phraseQueryBuilder.add(new Term("Content", "sean"));
        phraseQueryBuilder.setSlop(10);
        PhraseQuery query2 = phraseQueryBuilder.build();

        // Composite (boolean) query
        BooleanQuery.Builder booleanQueryBuilder = new BooleanQuery.Builder();
        booleanQueryBuilder.add(query1, BooleanClause.Occur.MUST);
        booleanQueryBuilder.add(query2, BooleanClause.Occur.MUST);
        BooleanQuery query = booleanQueryBuilder.build();

        // Sort the returned docs
        // The sort field must exist, otherwise an error is thrown
        Sort sort = new Sort();
        SortField sortField = new SortField("Title", SortField.Type.SCORE);
        sort.setSort(sortField);

        TopDocs topDocs = searcher.search(query, 10, sort);
        if (topDocs.totalHits > 0) {
            for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                int docNum = scoreDoc.doc;
                Document doc = searcher.doc(docNum);
                System.out.println(doc.toString());
            }
        }
    }
}

Corresponding sequence diagram :
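
The commented-out MultiFieldQueryParser in QueryTest hints at query parsing. For completeness, a minimal hedged fragment using the single-field QueryParser from the lucene-queryparser module (org.apache.lucene.queryparser.classic), reusing analyzer and searcher from QueryTest:

QueryParser parser = new QueryParser("Content", analyzer);
Query q = parser.parse("test AND sean");   // the classic syntax supports AND/OR/NOT, wildcards, ranges
TopDocs hits = searcher.search(q, 10);
System.out.println("hits: " + hits.totalHits);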

Lucene version information (Maven dependencies):

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>7.4.0</version>
</dependency>

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>7.4.0</version>
</dependency>

Project application guide

In actual development, we rarely use Lucene directly. The mainstream search frameworks Solr and Elasticsearch are both built on Lucene and give us a much simpler API. Especially in distributed environments, Elasticsearch solves single points of failure, backup, and cluster sharding for us, which better fits current development trends.

Copyright notice

This article was created by [Wenxiaowu]. When reposting, please include a link to the original. Thanks!
https://yzsam.com/2022/162/202206102327575040.html