Notes on index building and search execution in Lucene
2022-06-09 05:32:00 【kaims】
Building the index
1. Create a Directory object to specify where the index is stored.
2. Create an Analyzer object to specify the analyzer type.
3. Create an IndexWriter object from the Directory (step 1) and the Analyzer (step 2).
4. Create a Document object.
5. Create Field objects and add them to the Document object.
6. Use the IndexWriter object to write the Document object to the index.
7. Close the IndexWriter object.
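To make steps 4 to 6 more concrete, here is a toy in-memory inverted index in plain Java. It is not Lucene code: the class `ToyIndexWriter` and its methods are hypothetical stand-ins for what writing a document to the index produces, and the "analyzer" is assumed to be lowercasing plus splitting on anything that is not a letter or digit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy stand-in for an index writer: maps each term to the ids of the documents containing it. */
final class ToyIndexWriter {
    // term -> ascending list of document ids (the "postings list")
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    /** Assumed analyzer: lowercase, then split on anything that is not a letter or digit. */
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Analyzes the text and records the new document id under every distinct term. */
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String term : analyze(text)) {
            List<Integer> docs = postings.computeIfAbsent(term, k -> new ArrayList<>());
            // Avoid recording the same document twice for a repeated term.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
        return docId;
    }

    Map<String, List<Integer>> postings() {
        return postings;
    }

    public static void main(String[] args) {
        ToyIndexWriter writer = new ToyIndexWriter();
        writer.addDocument("ThinkPad laptop");
        writer.addDocument("Gaming laptop, Intel CPU");
        // prints {cpu=[1], gaming=[1], intel=[1], laptop=[0, 1], thinkpad=[0]}
        System.out.println(writer.postings());
    }
}
```

The point of the sketch is only that the index stores analyzed terms rather than the original text, which is why the choice of Analyzer in step 2 determines what can be found later.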
Directory object
In Lucene, the abstract class Directory has two main subclasses: RAMDirectory and FSDirectory. The commonly used implementations are:
SimpleFSDirectory: a simple implementation of FSDirectory. Its concurrency is limited, so it becomes a bottleneck when multiple threads read the same file.
NIOFSDirectory: uses java.nio's FileChannel for positional reads and supports multithreaded reading (thread safe by default). This class only reads through FileChannel; writes go through FSIndexOutput. Note: NIOFSDirectory is not suitable for Windows. In addition, if a thread accessing this class is interrupted or cancelled while blocked on IO, the underlying file descriptor is closed, and later threads that access NIOFSDirectory get a ClosedChannelException. In that case, use SimpleFSDirectory instead.
RAMDirectory: a memory-resident Directory implementation that keeps all index files in memory, which makes access faster. Locking is implemented by SingleInstanceLockFactory by default. This class is not suitable for large indexes or for multithreaded use: if the index files are too large, memory runs out. It is therefore recommended only for small systems. For large indexes (index files in the gigabyte range), use an FSDirectory implementation such as MMapDirectory instead.
MMapDirectory: an FSDirectory implementation that reads through memory mapping and writes through FSIndexOutput. When using this class, make sure enough virtual address space is available. Also, calling close on an IndexInput does not immediately close the underlying file handle; it is released only when the garbage collector reclaims it.
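The two read mechanisms mentioned above are both available in the JDK itself, so they can be sketched without Lucene. The following is a minimal illustration, not Lucene's implementation: `ReadStyles`, `writeTemp` and the temporary file are hypothetical names introduced for the example. `positionalRead` uses `FileChannel.read(ByteBuffer, long)`, the kind of positional read attributed to NIOFSDirectory, and `mappedRead` uses `FileChannel.map`, the kind of memory mapping attributed to MMapDirectory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ReadStyles {
    /** Positional read: the channel's own position is not modified by this call. */
    static String positionalRead(Path file, long offset, int length) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(length);
            long pos = offset;
            while (buf.hasRemaining()) {
                int n = ch.read(buf, pos);
                if (n < 0) {
                    break; // end of file reached
                }
                pos += n;
            }
            buf.flip();
            return StandardCharsets.UTF_8.decode(buf).toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Memory-mapped read: the file region is mapped into virtual address space. */
    static String mappedRead(Path file, long offset, int length) {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, offset, length);
            byte[] bytes = new byte[length];
            map.get(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Helper for the example: writes the content to a temporary file. */
    static Path writeTemp(String content) {
        try {
            Path tmp = Files.createTempFile("toy-segment", ".bin");
            tmp.toFile().deleteOnExit();
            Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = writeTemp("hello lucene");
        System.out.println(positionalRead(file, 6, 6)); // prints lucene
        System.out.println(mappedRead(file, 6, 6));     // prints lucene
    }
}
```

In `mappedRead` the mapping stays valid after the channel is closed and is released only when the buffer is garbage collected, which is the same behavior the note on MMapDirectory describes for file handles.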
Analyzer object
WhitespaceAnalyzer: splits text on whitespace characters only.
KeywordAnalyzer: does no tokenization and treats the entire raw input as a single token, so the output is exactly one token: the original sentence.
SimpleAnalyzer: splits on non-letter characters and converts every token to lowercase, so the resulting terms consist of lowercase letters only.
StopAnalyzer: SimpleAnalyzer plus removal of stop words.
StandardAnalyzer: grammar-based tokenization built on JFlex (Chinese is split into single characters, English is split on whitespace), followed by stop-word removal and conversion of all tokens to lowercase.
ChineseAnalyzer: behaves much like StandardAnalyzer. Its drawback is that it does not support tokenizing mixed Chinese and English text.
CJKAnalyzer: written by chedong. For English it behaves the same as StandardAnalyzer. For Chinese it uses bigram (two-character) segmentation and cannot filter out punctuation.
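The differences between these analyzers are easiest to see on a sample input. The sketch below imitates the behaviors described above in plain Java. It is a simplified model under stated assumptions, not the Lucene classes: `ToyAnalyzers` and its methods are hypothetical, the stop-word set is supplied by the caller, and `bigrams` works on Java `char` units, so it assumes characters in the Basic Multilingual Plane.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class ToyAnalyzers {
    /** WhitespaceAnalyzer-like: split on whitespace only; case and punctuation are kept. */
    static List<String> whitespace(String text) {
        return nonEmpty(text.trim().split("\\s+"));
    }

    /** KeywordAnalyzer-like: the whole input is a single token. */
    static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    /** SimpleAnalyzer-like: split on non-letters and lowercase every token. */
    static List<String> simple(String text) {
        return nonEmpty(text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+"));
    }

    /** StopAnalyzer-like: simple() followed by stop-word removal. */
    static List<String> stop(String text, Set<String> stopWords) {
        List<String> out = new ArrayList<>(simple(text));
        out.removeAll(stopWords);
        return out;
    }

    /** Bigram segmentation: every pair of adjacent characters becomes one token. */
    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            out.add(text.substring(i, i + 2));
        }
        return out;
    }

    private static List<String> nonEmpty(String[] parts) {
        List<String> out = new ArrayList<>();
        for (String p : parts) {
            if (!p.isEmpty()) {
                out.add(p);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "The Quick, Brown Fox";
        System.out.println(whitespace(text)); // prints [The, Quick,, Brown, Fox]
        System.out.println(keyword(text));    // prints [The Quick, Brown Fox]
        System.out.println(simple(text));     // prints [the, quick, brown, fox]
        System.out.println(stop(text, new HashSet<>(Arrays.asList("the")))); // prints [quick, brown, fox]
        // Four CJK characters written as Unicode escapes; yields three overlapping bigrams.
        System.out.println(bigrams("\u4E2D\u6587\u5206\u8BCD").size()); // prints 3
    }
}
```

The Unicode escapes in `main` stand for the four characters 中文分词 and are used only to keep the source file ASCII.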
Field object
A Field has three kinds of attributes:
- Analyzed or not: whether the field's content is tokenized. This only matters if the field's content is going to be queried.
- Indexed or not: whether the terms produced by analyzing the Field, or the whole Field value, are added to the index. Only indexed content can be searched. For example, product names and product descriptions are analyzed and then indexed, while order numbers and ID card numbers are indexed without analysis, because they will be used as query criteria later.
- Stored or not: whether the Field value is stored in the document. Only a stored Field can be read back from the Document. For example, product names and order numbers: any Field that will later be retrieved from the Document must be stored.
Each subclass of Field handles a different type of data and chooses a different combination of these attributes. A few common ones:
TextField: holds string data. Indexed and analyzed; the original value is not stored by default. Suitable for data that needs full-text search, such as email bodies and web page content.
StringField: holds string data. Indexed but not analyzed, so the whole string is a single token; the original value is not stored by default. Suitable for strings that need exact matching, such as article titles, personal names and IDs.
IntPoint, LongPoint, FloatPoint, DoublePoint: hold numeric data of the corresponding type. Indexed; the original value is not stored by default. Suitable for numeric data.
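The effect of the three attributes can be modeled with a small plain-Java sketch. Everything in it is hypothetical (`ToyFieldIndex`, `ToyField`, the `"field:term"` key format), and it only imitates the behavior described above rather than using Lucene's Field classes: an analyzed field is split into lowercase terms, a non-analyzed field is indexed as one exact term, and only stored fields can be read back.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class ToyFieldIndex {
    /** Hypothetical field description carrying the three attributes. */
    static final class ToyField {
        final String name;
        final String value;
        final boolean analyzed;
        final boolean indexed;
        final boolean stored;

        ToyField(String name, String value, boolean analyzed, boolean indexed, boolean stored) {
            this.name = name;
            this.value = value;
            this.analyzed = analyzed;
            this.indexed = indexed;
            this.stored = stored;
        }
    }

    // "field:term" -> ids of documents containing that term in that field
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // document id -> stored field values
    private final List<Map<String, String>> storedValues = new ArrayList<>();

    int addDocument(List<ToyField> fields) {
        int docId = storedValues.size();
        Map<String, String> stored = new HashMap<>();
        for (ToyField f : fields) {
            if (f.indexed) {
                // Analyzed: lowercase terms split on non-alphanumerics. Not analyzed: one exact term.
                String[] terms = f.analyzed
                        ? f.value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")
                        : new String[] { f.value };
                for (String t : terms) {
                    if (!t.isEmpty()) {
                        postings.computeIfAbsent(f.name + ":" + t, k -> new TreeSet<>()).add(docId);
                    }
                }
            }
            if (f.stored) {
                stored.put(f.name, f.value);
            }
        }
        storedValues.add(stored);
        return docId;
    }

    /** Exact term lookup in one field; the query term itself is not analyzed. */
    Set<Integer> search(String field, String term) {
        return postings.getOrDefault(field + ":" + term, Collections.<Integer>emptySet());
    }

    /** Returns the stored value, or null if the field was not stored. */
    String storedValue(int docId, String field) {
        return storedValues.get(docId).get(field);
    }
}
```

With this model, a field created as `new ToyField("orderNo", "A-1001", false, true, true)` is found only by the exact term `A-1001`, while a field created with `stored` set to `false` remains searchable but returns `null` from `storedValue`.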
Executing a search
https://www.cnblogs.com/leeSmall/p/9027172.html
1. Create a Directory object that points to where the index is stored.
2. Create a DirectoryReader object from the Directory object.
3. Create an IndexSearcher object from the IndexReader object.
4. Create a Query object and execute the query.
5. Get the query results, then iterate over them and output them.
6. Close the DirectoryReader object.
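The search flow can be mimicked end to end with a toy searcher in plain Java. This is only an analogy for the steps above, not Lucene code: `ToySearcher` is hypothetical, it builds its own in-memory index in the constructor (standing in for opening a reader), and it ranks hits by raw term frequency, which is a deliberately simple stand-in for Lucene's real scoring.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class ToySearcher {
    private final List<String> docs; // stored documents; list position = document id
    // term -> (document id -> number of occurrences)
    private final Map<String, Map<Integer, Integer>> termFreqs = new HashMap<>();

    /** Stands in for opening a reader: indexes the given documents in memory. */
    ToySearcher(List<String> documents) {
        this.docs = new ArrayList<>(documents);
        for (int id = 0; id < docs.size(); id++) {
            for (String t : docs.get(id).toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (!t.isEmpty()) {
                    termFreqs.computeIfAbsent(t, k -> new HashMap<>()).merge(id, 1, Integer::sum);
                }
            }
        }
    }

    /** Returns up to n document ids containing the term, most occurrences first. */
    List<Integer> search(String term, int n) {
        final Map<Integer, Integer> hits =
                termFreqs.getOrDefault(term, Collections.<Integer, Integer>emptyMap());
        List<Integer> ids = new ArrayList<>(hits.keySet());
        ids.sort((a, b) -> {
            int byFreq = Integer.compare(hits.get(b), hits.get(a)); // higher frequency first
            return byFreq != 0 ? byFreq : Integer.compare(a, b);    // ties: lower id first
        });
        return new ArrayList<>(ids.subList(0, Math.min(n, ids.size())));
    }

    /** Fetches the stored document for a hit. */
    String doc(int id) {
        return docs.get(id);
    }

    public static void main(String[] args) {
        ToySearcher searcher = new ToySearcher(
                Arrays.asList("ThinkPad laptop", "laptop bag for laptop", "Intel CPU"));
        // Iterate over the hits and output them, as in the result-traversal step.
        for (int id : searcher.search("laptop", 10)) {
            System.out.println(id + ": " + searcher.doc(id));
        }
    }
}
```

Running `main` lists document 1 before document 0, because `laptop` occurs twice in document 1.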
Query object
TermQuery: exact query on a single term. Note that TermQuery submits the search text as-is; it does not run the analyzer on it.
TermQuery tq = new TermQuery(new Term("name", "thinkpad"));
RangeQuery: range query.
PhraseQuery: query on several terms as a phrase.
MultiPhraseQuery: query on several terms as a phrase, with OR matching of multiple terms at the same position.
BooleanQuery: query combining multiple conditions.
// Boolean query
String fieldName = "name"; // field to search; "name" is an assumed example value
Query query1 = new TermQuery(new Term(fieldName, "thinkpad"));
Query query2 = new TermQuery(new Term("simpleIntro", "Intel"));
BooleanQuery.Builder booleanQueryBuilder = new BooleanQuery.Builder();
// SHOULD marks an optional clause, MUST marks a required clause
booleanQueryBuilder.add(query1, Occur.SHOULD);
booleanQueryBuilder.add(query2, Occur.MUST);
BooleanQuery booleanQuery = booleanQueryBuilder.build();
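How MUST and SHOULD clauses combine can be illustrated with set operations on document ids. The sketch below is a simplified model in plain Java, not Lucene's implementation: `ToyBooleanQuery` is hypothetical, each clause is represented directly by the set of documents it matches, and scoring is ignored. It assumes the default behavior in which SHOULD clauses become optional as soon as at least one MUST clause is present.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class ToyBooleanQuery {
    enum Occur { MUST, SHOULD, MUST_NOT }

    private final List<Set<Integer>> must = new ArrayList<>();
    private final List<Set<Integer>> should = new ArrayList<>();
    private final List<Set<Integer>> mustNot = new ArrayList<>();

    /** Adds a clause, represented by the set of document ids it matches. */
    ToyBooleanQuery add(Set<Integer> matchingDocs, Occur occur) {
        switch (occur) {
            case MUST:
                must.add(matchingDocs);
                break;
            case SHOULD:
                should.add(matchingDocs);
                break;
            default:
                mustNot.add(matchingDocs);
                break;
        }
        return this;
    }

    /** Document ids that satisfy the combination of clauses. */
    Set<Integer> matches() {
        Set<Integer> result = new TreeSet<>();
        if (!must.isEmpty()) {
            // With a MUST clause present, SHOULD clauses are optional:
            // they would only influence ranking, which this sketch ignores.
            result.addAll(must.get(0));
            for (Set<Integer> m : must) {
                result.retainAll(m);
            }
        } else {
            // Without MUST clauses, at least one SHOULD clause has to match.
            for (Set<Integer> s : should) {
                result.addAll(s);
            }
        }
        for (Set<Integer> n : mustNot) {
            result.removeAll(n);
        }
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> thinkpad = new HashSet<>(Arrays.asList(0, 1)); // docs matching name:thinkpad
        Set<Integer> intel = new HashSet<>(Arrays.asList(1, 2));    // docs matching simpleIntro:Intel
        Set<Integer> hits = new ToyBooleanQuery()
                .add(thinkpad, Occur.SHOULD)
                .add(intel, Occur.MUST)
                .matches();
        System.out.println(hits); // prints [1, 2]
    }
}
```

Applied to the Boolean query above, this means every hit must match query2, while matching query1 is optional and in Lucene would only raise the document's score.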
Parser
- Traditional parsers:
QueryParser and MultiFieldQueryParser
// Traditional query parser - single default field
QueryParser parser = new QueryParser("defaultField", analyzer);
Query query = parser.parse("query String");
// Traditional query parser - multiple default fields
String[] multiDefaultFields = { "name", "type", "simpleIntro" };
MultiFieldQueryParser multiFieldQueryParser = new MultiFieldQueryParser(multiDefaultFields, analyzer);
// Set the default operator used to combine terms; the default is OR
multiFieldQueryParser.setDefaultOperator(Operator.OR);
Query query4 = multiFieldQueryParser.parse("laptop AND price:1999900");
- Parser based on the new flexible query parser framework:
StandardQueryParser
StandardQueryParser queryParserHelper = new StandardQueryParser(analyzer);
// Set default fields
// queryParserHelper.setMultiFields(CharSequence[] fields);
// queryParserHelper.setPhraseSlop(8);
// Query query = queryParserHelper.parse("a AND b", "defaultField");
Query query5 = queryParserHelper.parse("(\"Lenovo laptop\" OR simpleIntro:Intel) AND type:computer AND price:1999900", "name");