Notes on index building and search execution in Lucene

2022-06-09 05:32:00 【kaims】

Building an index

  1. Create a Directory object to specify where the index is stored.
  2. Create an Analyzer object to specify the analyzer type.
  3. Create an IndexWriter object from the objects of steps 1 and 2.
  4. Create a Document object.
  5. Create Field objects and add them to the Document object.
  6. Use the IndexWriter object to write the Document object to the index.
  7. Close the IndexWriter object.

Directory object

In Lucene, the abstract class Directory has two main subclasses, RAMDirectory and FSDirectory. SimpleFSDirectory, NIOFSDirectory and MMapDirectory are FSDirectory implementations.

SimpleFSDirectory class: a simple FSDirectory implementation with limited concurrency; it becomes a bottleneck when several threads read the same file.
NIOFSDirectory class: uses the FileChannel of java.nio for positional reads, which supports reading from multiple threads (thread safe by default). Only reads go through FileChannel; writes go through FSIndexOutput. Note: NIOFSDirectory is not suitable for Windows. In addition, if a thread accessing it is interrupted or cancelled while blocked on I/O, the underlying file descriptor is closed, and threads that access the NIOFSDirectory afterwards get a ClosedChannelException. In that situation use SimpleFSDirectory instead.
RAMDirectory class: a memory-resident Directory implementation that keeps all index files in memory, which speeds up access. Locking is implemented by SingleInstanceLockFactory (single-instance lock factory) by default. It is not suitable for large indexes or for multithreaded use: if the index files are too large, memory runs out. It is therefore recommended only for small systems; once the index files reach the gigabyte range, use an FSDirectory implementation such as MMapDirectory instead.
MMapDirectory class: an FSDirectory implementation that reads through memory mapping and writes through FSIndexOutput. Make sure enough virtual address space is available when using it. Also, calling close on an IndexInput does not immediately close the underlying file handle; it is released only when the GC reclaims it.
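The two read strategies described for NIOFSDirectory and MMapDirectory can be illustrated with plain JDK calls. The following is a minimal sketch, not Lucene's own code: the class name DirectoryReadSketch, the temporary file and its contents are assumptions made for the illustration. It reads the same file once with a positional FileChannel read and once through a memory mapping.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class; it only mirrors the read strategies named in the text.
class DirectoryReadSketch {

    // Positional read: the offset is passed explicitly, so the channel's own
    // position is never modified and several threads can share one channel.
    static String readAt(FileChannel channel, long offset, int length) throws IOException {
        ByteBuffer buffer = ByteBuffer.allocate(length);
        while (buffer.hasRemaining()) {
            int n = channel.read(buffer, offset + buffer.position());
            if (n < 0) {
                break; // end of file reached
            }
        }
        buffer.flip();
        return StandardCharsets.US_ASCII.decode(buffer).toString();
    }

    // Memory-mapped read: the file region is mapped into virtual address space
    // and then read like an in-memory buffer. The mapping is released by the GC.
    static String readMapped(FileChannel channel, long offset, int length) throws IOException {
        MappedByteBuffer mapped = channel.map(FileChannel.MapMode.READ_ONLY, offset, length);
        byte[] bytes = new byte[length];
        mapped.get(bytes);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    // Writes a small temporary file and reads it back with both strategies.
    static String demo() {
        try {
            Path file = Files.createTempFile("segment", ".bin");
            file.toFile().deleteOnExit();
            Files.write(file, "lucene-index-bytes".getBytes(StandardCharsets.US_ASCII));
            try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
                return readAt(channel, 7, 5) + "|" + readMapped(channel, 0, 6);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "index|lucene"
    }
}
```

The mapped buffer stays valid after the channel is closed, which is the JDK-level reason why a memory-mapped file is only released once the buffer is garbage collected.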

Analyzer object

WhitespaceAnalyzer class: tokenizes at whitespace characters only.
KeywordAnalyzer class: performs no tokenization and treats the entire raw input as a single token, so the output is exactly one token: the original text.
SimpleAnalyzer class: splits at non-letter characters and converts every token to lowercase, so the resulting terms consist of lowercase letters only.
StopAnalyzer class: SimpleAnalyzer plus the removal of stop words.
StandardAnalyzer class: grammar-based tokenization built on JFlex (Chinese is split character by character, English at whitespace), followed by stop-word removal and lowercasing.
ChineseAnalyzer class: behaves much like StandardAnalyzer; its drawback is that it does not support mixed Chinese and English text.
CJKAnalyzer class: written by chedong. It handles English the same way as StandardAnalyzer, but for Chinese it uses bigram segmentation (overlapping two-character tokens) and cannot filter out punctuation.
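To make the differences between these analyzers concrete, here is a small plain-Java sketch that imitates their tokenization rules. It is an approximation for illustration only and does not use Lucene: the class AnalyzerBehaviorSketch, its method names, the sample sentence and the one-word stop list are all assumptions, and the real analyzers handle many more cases (for example Unicode rules and single-character runs).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper class; each method imitates one analyzer from the list above.
class AnalyzerBehaviorSketch {

    // WhitespaceAnalyzer-like: split at whitespace only; case and punctuation are kept.
    static List<String> whitespace(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // KeywordAnalyzer-like: the whole input is a single token.
    static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    // SimpleAnalyzer-like: split at every non-letter and lowercase each token.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // StopAnalyzer-like: SimpleAnalyzer behavior plus stop-word removal.
    static List<String> stop(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>(simple(text));
        tokens.removeAll(stopWords);
        return tokens;
    }

    // CJKAnalyzer-like bigrams: overlapping pairs of adjacent characters.
    // For Chinese text each character would be one ideograph.
    static List<String> bigrams(String run) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + 2 <= run.length(); i++) {
            tokens.add(run.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The ThinkPad X1, 14-inch";
        System.out.println(whitespace(text)); // [The, ThinkPad, X1,, 14-inch]
        System.out.println(keyword(text));    // [The ThinkPad X1, 14-inch]
        System.out.println(simple(text));     // [the, thinkpad, x, inch]
        System.out.println(stop(text, new HashSet<>(Arrays.asList("the")))); // [thinkpad, x, inch]
        System.out.println(bigrams("abcd"));  // [ab, bc, cd]
    }
}
```

Note how the SimpleAnalyzer-like rule drops the digits of "X1" and "14" entirely, which matters when model numbers have to be searchable.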

Field object

A field has three kinds of attributes:

Analyzed or not: whether the content of the field is tokenized. Analysis only makes sense if the content of the field is going to be queried.
Indexed or not: whether the terms produced by analysis, or the whole field value, are put into the index. Only indexed fields can be searched. For example, a product name and a product description are analyzed and indexed, while an order number or an ID card number is indexed without analysis; all of them will serve as query criteria later.
Stored or not: whether the field value is stored in the document. Only stored fields can be read back from a Document. For example, product name and order number: every field that will later be retrieved from a Document must be stored.

Each subclass of Field handles a different type of data and selects a different combination of these attributes. Some common ones:

TextField: holds string data; indexed and analyzed. The original value is stored only when requested with Field.Store.YES. Suited to data that needs full-text search, such as email bodies or web page content.
StringField: holds string data; indexed but not analyzed, so the whole string is a single token. The original value is stored only when requested with Field.Store.YES. Suited to strings that must match exactly, such as article titles, person names or IDs.
IntPoint, LongPoint, FloatPoint, DoublePoint: hold numeric values of the corresponding type; indexed. The original value is not stored (add a separate StoredField if it has to be retrieved). Suited to numeric data.

Executing a search

https://www.cnblogs.com/leeSmall/p/9027172.html

1.  Create a Directory object pointing to where the index is stored.
2.  Create a DirectoryReader object from the Directory object.
3.  Create an IndexSearcher object from the IndexReader object.
4.  Create a Query object and execute the query.
5.  Iterate over the returned query results and output them.
6.  Close the DirectoryReader object.

Query object

TermQuery: exact query on a single term. Note that TermQuery submits the search text as-is; it is not run through an analyzer.

TermQuery tq = new TermQuery(new Term("name", "thinkpad"));
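Because TermQuery skips analysis, the term has to look exactly like what was put into the index. The toy inverted index below sketches why this matters for analyzed versus non-analyzed fields. It is plain Java, not Lucene: the class TermLookupSketch, its methods, the lowercase-and-split analysis rule and the sample values (including the order number) are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy index for a single field; not how Lucene stores postings.
class TermLookupSketch {

    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Stand-in for index-time analysis: split at non-alphanumerics and lowercase.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Analyzed field (TextField-like): every token becomes a term.
    void addAnalyzed(int docId, String value) {
        for (String token : analyze(value)) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    // Non-analyzed field (StringField-like): the whole value is one term.
    void addVerbatim(int docId, String value) {
        postings.computeIfAbsent(value, k -> new TreeSet<>()).add(docId);
    }

    // TermQuery-like lookup: the query term is compared as-is, with no analysis.
    Set<Integer> lookup(String term) {
        return postings.getOrDefault(term, Collections.<Integer>emptySet());
    }

    public static void main(String[] args) {
        TermLookupSketch nameField = new TermLookupSketch();
        nameField.addAnalyzed(1, "ThinkPad X1 Carbon");
        System.out.println(nameField.lookup("ThinkPad")); // [] : the index holds "thinkpad"
        System.out.println(nameField.lookup("thinkpad")); // [1]

        TermLookupSketch orderField = new TermLookupSketch();
        orderField.addVerbatim(2, "SO-2022-0001");
        System.out.println(orderField.lookup("SO-2022-0001")); // [2]
    }
}
```

In practice this means that user input aimed at an analyzed field is usually passed through a query parser with the same analyzer, while TermQuery fits non-analyzed fields such as IDs.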

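RangeQuery: range query (for example TermRangeQuery for terms, or IntPoint.newRangeQuery for numeric points)
PhraseQuery: phrase query that matches several terms appearing as a sequence
MultiPhraseQuery: phrase query that also accepts several alternative terms (OR) at the same position
BooleanQuery: combines several queries with boolean conditions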

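import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Boolean query
String fieldName = "name"; // same field as in the TermQuery example
Query query1 = new TermQuery(new Term(fieldName, "thinkpad"));
Query query2 = new TermQuery(new Term("simpleIntro", "Intel"));
BooleanQuery.Builder booleanQueryBuilder = new BooleanQuery.Builder();
booleanQueryBuilder.add(query1, Occur.SHOULD); // optional clause
booleanQueryBuilder.add(query2, Occur.MUST);   // required clause
BooleanQuery booleanQuery = booleanQueryBuilder.build();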
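How SHOULD and MUST interact is easy to get wrong, so here is a set-based sketch of the matching rule, assuming the default minimum-should-match setting. It is plain Java and ignores scoring: the class BooleanClauseSketch, the method matches and the document ids are hypothetical. Each clause is represented by the set of document ids it matches on its own.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of boolean clause matching; ranking is not modeled.
class BooleanClauseSketch {

    // Returns the ids of the documents accepted by the combined clauses.
    static Set<Integer> matches(List<Set<Integer>> must, List<Set<Integer>> should) {
        Set<Integer> result = new TreeSet<>();
        if (must.isEmpty()) {
            // Only SHOULD clauses: a document has to match at least one of them.
            for (Set<Integer> clause : should) {
                result.addAll(clause);
            }
            return result;
        }
        // With at least one MUST clause, every MUST clause has to match.
        // SHOULD clauses become optional and would only influence ranking.
        result.addAll(must.get(0));
        for (Set<Integer> clause : must) {
            result.retainAll(clause);
        }
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> thinkpad = new TreeSet<>(Arrays.asList(1, 2)); // docs matching query1
        Set<Integer> intel = new TreeSet<>(Arrays.asList(2, 3));    // docs matching query2

        // query1 SHOULD, query2 MUST (the combination used in the snippet above)
        System.out.println(matches(Collections.singletonList(intel),
                Collections.singletonList(thinkpad))); // [2, 3]

        // both SHOULD
        System.out.println(matches(Collections.<Set<Integer>>emptyList(),
                Arrays.asList(thinkpad, intel))); // [1, 2, 3]

        // both MUST
        System.out.println(matches(Arrays.asList(thinkpad, intel),
                Collections.<Set<Integer>>emptyList())); // [2]
    }
}
```

Under this model, document 3 is returned for the SHOULD plus MUST combination even though it does not contain "thinkpad"; the optional clause would only push document 2 higher in the ranking.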

Parser

Classic parsers: QueryParser and MultiFieldQueryParser

// Classic query parser (package org.apache.lucene.queryparser.classic) - single default field
QueryParser parser = new QueryParser("defaultField", analyzer);
Query query = parser.parse("query String");

// Classic query parser - multiple default fields
String[] multiDefaultFields = { "name", "type", "simpleIntro" };
MultiFieldQueryParser multiFieldQueryParser = new MultiFieldQueryParser(multiDefaultFields, analyzer);
// Set the default operator between terms; the default is OR
multiFieldQueryParser.setDefaultOperator(QueryParser.Operator.OR);
Query query4 = multiFieldQueryParser.parse("laptop AND price:1999900");

Parser based on the new flexible framework: StandardQueryParser

// StandardQueryParser lives in org.apache.lucene.queryparser.flexible.standard
StandardQueryParser queryParserHelper = new StandardQueryParser(analyzer);
// Optional settings: several default fields, phrase slop
// queryParserHelper.setMultiFields(CharSequence[] fields);
// queryParserHelper.setPhraseSlop(8);
// Query query = queryParserHelper.parse("a AND b", "defaultField");
// The second argument of parse is the default field
Query query5 = queryParserHelper.parse("(\"Lenovo laptop\" OR simpleIntro:Intel) AND type:computer AND price:1999900", "name");