
Lucene mind map makes search engines no longer difficult to understand

2022-06-11 00:43:00 Wenxiaowu

  Today, let's talk about Lucene. Students, grab a bench and have a seat.

(So what does Lucene actually do?)

First, let's look at a mind map:

These are the full-text search frameworks commonly used in Java; the search functionality of many projects is built on one of the four frameworks above.

So what exactly is Lucene?

Lucene is an open-source library for full-text indexing and search: a simple yet powerful core API that makes it easy to add search functionality to an application.

Lucene is currently the most popular Java full-text search framework, and the reason is simple: Hibernate Search, Solr, and Elasticsearch are all search engines built on top of Lucene.

Hibernate Search is a full-text retrieval tool built on Apache Lucene, used mainly for searching Hibernate's persistence model.

Elasticsearch is also developed in Java and uses Lucene at its core for all indexing and search functionality, but its goal is to hide Lucene's complexity behind a simple RESTful API, making full-text search easy.

Solr is an open-source, Lucene-based Java search server that is easy to add to web applications. It provides faceted search (statistics), hit highlighting, and support for multiple output formats (including XML/XSLT and JSON).

So, is Lucene impressive or what?!

Next, we break things into the following parts to uncover Lucene's true face:

  • Relevant concepts

  • Building and querying the index

  • Inverted index

  • Visualization tools

  • Project application guide

Relevant concepts

Lucene official website: http://lucene.apache.org/

Since Lucene is a full-text search tool, it must have some storage structure and rules. When we enter keywords, Lucene follows its internal hierarchy to quickly retrieve the content we need. Several levels and concepts are involved.

Index library (Index)

One directory is one index library: all the files in the same folder make up one Lucene index library. It is similar to the concept of a table in a database.

(Example of a Lucene index)

Segment

A Lucene index may consist of multiple sub-indexes, called segments. Each segment is a complete, independent index that can be searched on its own.

Document

A document is the unit of indexed data storage, similar to a row in a relational database or a document in a document database. An index can contain multiple documents; adding a new document generates a new segment, and independent segments can be merged.

Field

A document contains different types of information that can be indexed separately, such as title, time, body, and author. A field is similar to a column in a database table.

Term

A term is the smallest unit of indexing: a string produced by lexical analysis and language processing. A field consists of one or more terms. For example, if the title is "hello lucene", tokenization yields the two terms "hello" and "lucene"; searching for the keyword "hello" or "lucene" will then find this title.

Analyzer

A meaningful passage of text must be split into terms by an Analyzer before it can be searched by keyword. StandardAnalyzer is the analyzer commonly used in Lucene; for Chinese there are CJKAnalyzer, SmartChineseAnalyzer, and others.
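
To see what an analyzer actually produces, here is a minimal sketch against the Lucene 7.x API that prints the tokens StandardAnalyzer extracts from the title above (the field name "Title" is just an illustrative label):

public class AnalyzerDemo {
    public static void main(String[] args) throws Exception {
        org.apache.lucene.analysis.Analyzer analyzer =
                new org.apache.lucene.analysis.standard.StandardAnalyzer();
        // Tokenize the title "hello lucene"
        try (org.apache.lucene.analysis.TokenStream ts =
                     analyzer.tokenStream("Title", "hello lucene")) {
            org.apache.lucene.analysis.tokenattributes.CharTermAttribute term =
                    ts.addAttribute(org.apache.lucene.analysis.tokenattributes.CharTermAttribute.class);
            ts.reset();                      // mandatory before incrementToken()
            while (ts.incrementToken()) {
                System.out.println(term);    // prints: hello, then lucene
            }
            ts.end();
        }
    }
}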

(Conceptual diagram of Lucene's index storage structure)

The figure above can be read roughly as follows: an index is internally composed of multiple segments; adding a new document generates a new segment; different segments can be merged (Segment-0, Segment-1, and Segment-2 are merged into Segment-4); and a segment contains document numbers along with each document's index information. Each document has multiple indexable fields, and each field can be given a different type (StringField, TextField).
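
Merging can also be requested explicitly. A hedged fragment, assuming an open IndexWriter named indexWriter as in the CreateTest example later in this article:

indexWriter.forceMerge(1);  // merge down to at most 1 segment; I/O-heavy, best for read-mostly indexes
indexWriter.close();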

So, as the diagram shows, Lucene's hierarchy is: Index –> Segment –> Document –> Field –> Term.

Now that we know Lucene's basic concepts, let's move on to how it works.

(Why are Lucene's search queries so fast?)

Inverted index

We all know that improving retrieval speed requires building an index, and here is the key point: Lucene uses an inverted index (also called a reverse index) structure.

Where there is an inverted index, there is naturally also a forward index.

  • A forward index goes from documents to terms: in a naive query, we scan each document looking for the keyword.

  • An inverted index goes from terms to documents, the inverse of the forward index: keywords are extracted from the documents in advance, so at query time we match the keyword directly and immediately get the corresponding documents.

Hence the conclusion: because it is not the record that determines the attribute value, but the attribute value that locates the record, it is called an inverted index.

(How is it implemented?)

Let's study an example (taken from the internet):

Suppose we now have two documents with the following contents:

  • Document 1: home sales rise in July.

  • Document 2: increase in home sales in July.

As the figure above shows, the documents are first tokenized by the analyzer (Analyzer) to obtain terms, each paired with the ID of the document it came from. This term list is then sorted, identical terms are merged, their frequencies are counted, and the IDs of the documents in which they appear are recorded.

Therefore:

In the implementation, Lucene saves these three columns as the term dictionary (Term Dictionary), the frequency file (frequencies), and the position file (positions). The dictionary file stores not only every keyword but also pointers into the frequency and position files, through which a keyword's frequency and position information can be found.

When searching the index, suppose you query the word "sales": Lucene first binary-searches the dictionary to find the word, follows its pointer into the frequency file to read all matching document numbers, and then returns the results. Because the dictionary is usually very small, the whole process takes only milliseconds.
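
To make the mechanism concrete, here is a toy sketch (plain Java, not Lucene's actual implementation) that builds term-to-document postings for the two example documents and answers the "sales" query with a dictionary lookup instead of a document scan:

import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class ToyInvertedIndex {
    public static void main(String[] args) {
        String[] docs = {
            "home sales rise in july",           // document 1
            "increase in home sales in july"     // document 2
        };
        // Dictionary: term -> sorted set of IDs of the documents containing it
        Map<String, SortedSet<Integer>> postings = new TreeMap<>();
        for (int docId = 1; docId <= docs.length; docId++) {
            for (String term : docs[docId - 1].split("\\s+")) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
        // Query "sales": a dictionary lookup, not a scan over every document
        System.out.println("sales -> " + postings.get("sales"));  // prints: sales -> [1, 2]
    }
}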

(So that's how it works!)

Lucene visualization tool: Luke

  • https://github.com/DmitryKey/luke/releases

Building and querying the index

Now that we understand how Lucene builds an index, let's use Lucene at the code level.

Let's start with a picture:

Files must be indexed before they can be searched, so read the figure above starting from the "documents to be indexed" node.

Indexing process:

1. Build a Document object for each file to be indexed, treating each part of the file as a Field object.

2. Use an Analyzer to tokenize the natural-language text in the documents, and use an IndexWriter to build the index.

3. Use an FSDirectory to set how and where the index is stored, and persist the index.

Index retrieval process:

4. Use an IndexReader to read the index.

5. Use a Term to represent the keyword the user searched for and the field it lives in, and a QueryParser to represent the user's query conditions.

6. Use an IndexSearcher to search the index and return the matching Document objects.

The dotted lines point to each class's package. For example, Analyzer is in the org.apache.lucene.analysis package.

Index-building code:

// Create an index
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.util.Date;
import java.util.LinkedList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class CreateTest {

    public static void main(String[] args) throws Exception {
        Path indexPath = FileSystems.getDefault().getPath("d:\\index\\");

        // FSDirectory has three main subclasses; open() automatically picks the
        // most appropriate one for the current platform:
        // MMapDirectory: Linux, macOS, Solaris
        // NIOFSDirectory: other non-Windows JREs
        // SimpleFSDirectory: other JREs on Windows
        Directory dir = FSDirectory.open(indexPath);

        // Analyzer (tokenizer)
        Analyzer analyzer = new StandardAnalyzer();
        boolean create = true;
        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(analyzer);
        if (create) {
            indexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        } else {
            // Lucene does not update documents in place; an "update" deletes the
            // old document and then adds a new one
            indexWriterConfig.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        }
        IndexWriter indexWriter = new IndexWriter(dir, indexWriterConfig);

        Document doc = new Document();
        // A StringField is indexed but not tokenized: the whole value is treated
        // as a single token. Typically used for fields like "country" or "ID".
        // Field.Store indicates whether the original field value is stored in the
        // index: store it if you want to display the value in the query results;
        // if the content is large and need not be displayed (e.g. a full article
        // body), it is not suitable for storing in the index.
        doc.add(new StringField("Title", "sean", Field.Store.YES));
        long time = new Date().getTime();
        // LongPoint does not store the field value
        doc.add(new LongPoint("LastModified", time));
//        doc.add(new NumericDocValuesField("LastModified", time));
        // A TextField is indexed and tokenized; usually used for an article body
        doc.add(new TextField("Content", "this is a test of sean", Field.Store.NO));

        List<Document> docs = new LinkedList<>();
        docs.add(doc);

        indexWriter.addDocuments(docs);
        // close() commits pending changes by default
        indexWriter.close();
    }
}

Corresponding sequence diagram :
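
One note on updates: as the comment in CreateTest says, Lucene has no in-place update. A minimal hedged fragment (assuming the same open indexWriter and doc as above, plus an import of org.apache.lucene.index.Term):

// Delete-then-add, not an in-place update: updateDocument() first removes any
// document containing the given term, then adds the new document atomically.
indexWriter.updateDocument(new Term("Title", "sean"), doc);
indexWriter.commit();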

Index query code:

// Query the index
import java.nio.file.FileSystems;
import java.nio.file.Path;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class QueryTest {

    public static void main(String[] args) throws Exception {
        Path indexPath = FileSystems.getDefault().getPath("d:\\index\\");
        Directory dir = FSDirectory.open(indexPath);
        // Analyzer (tokenizer)
        Analyzer analyzer = new StandardAnalyzer();

        IndexReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);

        // Query several fields at once
//        String[] queryFields = {"Title", "Content", "LastModified"};
//        QueryParser parser = new MultiFieldQueryParser(queryFields, analyzer);
//        Query query = parser.parse("sean");

        // Search a single field for an exact term
//        Term term = new Term("Title", "test");
//        Query query = new TermQuery(term);

        // Wildcard query
//        Term term = new Term("Title", "se*");
//        WildcardQuery query = new WildcardQuery(term);

        // Range query
        Query query1 = LongPoint.newRangeQuery("LastModified", 1L, 1637069693000L);

        // Multi-keyword (phrase) query; slop is the maximum positional distance
        // allowed between the terms
        PhraseQuery.Builder phraseQueryBuilder = new PhraseQuery.Builder();
        phraseQueryBuilder.add(new Term("Content", "test"));
        phraseQueryBuilder.add(new Term("Content", "sean"));
        phraseQueryBuilder.setSlop(10);
        PhraseQuery query2 = phraseQueryBuilder.build();

        // Composite (boolean) query
        BooleanQuery.Builder booleanQueryBuilder = new BooleanQuery.Builder();
        booleanQueryBuilder.add(query1, BooleanClause.Occur.MUST);
        booleanQueryBuilder.add(query2, BooleanClause.Occur.MUST);
        BooleanQuery query = booleanQueryBuilder.build();

        // Sort the returned docs
        // The sort field must exist, otherwise an error is thrown
        Sort sort = new Sort();
        SortField sortField = new SortField("Title", SortField.Type.SCORE);
        sort.setSort(sortField);

        TopDocs topDocs = searcher.search(query, 10, sort);
        if (topDocs.totalHits > 0) {
            for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                int docNum = scoreDoc.doc;
                Document doc = searcher.doc(docNum);
                System.out.println(doc.toString());
            }
        }
    }
}

Corresponding sequence diagram :
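
The commented-out MultiFieldQueryParser in QueryTest hints at query parsing. For completeness, a minimal hedged fragment using the single-field QueryParser from the lucene-queryparser module (org.apache.lucene.queryparser.classic), reusing analyzer and searcher from QueryTest:

QueryParser parser = new QueryParser("Content", analyzer);
Query q = parser.parse("test AND sean");   // the classic syntax supports AND/OR/NOT, wildcards, ranges
TopDocs hits = searcher.search(q, 10);
System.out.println("hits: " + hits.totalHits);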

Lucene version information (Maven dependencies):

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>7.4.0</version>
</dependency>

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>7.4.0</version>
</dependency>

Project application guide

In actual development, we rarely use Lucene directly. The mainstream search frameworks Solr and Elasticsearch are both built on Lucene and give us a much simpler API. Especially in distributed environments, Elasticsearch solves single points of failure, backup, and cluster sharding for us, which better fits current development trends.

Copyright notice

This article was created by [Wenxiaowu]. When reposting, please include a link to the original. Thanks!
https://yzsam.com/2022/162/202206102327575040.html