Lucene Full-Text Search Toolkit: Learning Notes Summary
2022-06-30 11:43:00 【Full stack programmer webmaster】
Lucene is a full-text retrieval toolkit under Apache (Solr also belongs to Apache; Solr's underlying implementation is Lucene).
One. Classification of data
Structured data: data with a fixed type and length, for example data in a database (MySQL/Oracle) and metadata (the properties of files in Windows).
Unstructured data: data without a fixed type or length, for example the content of an email or of a Word document.
Two. How to find data
Structured data: data in a database can be searched with SQL statements; metadata (in Windows) can be searched through the search bar Windows provides.
Unstructured data: inside a Word document you can search with Ctrl+F. There are two approaches:
Sequential scan: inefficient, but guaranteed to find the content as long as it exists in the document.
Full-text search (inverted lookup): similar to how a word is looked up in a dictionary.
Three. Full-text search
Meaning: extract the content of the files, split the content into individual terms (tokenization), and assemble the terms into an index. When searching, the index is searched first, and the matching documents are then located through the index. This process is called full-text search.
Advantage: fast, efficient search.
Disadvantage: it trades space for time. Full-text search imitates dictionary lookup; a minimal sketch of the idea follows.
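To make the idea concrete, here is a minimal sketch of an inverted index in plain Java. It only illustrates the principle, not how Lucene actually stores its index; all names in it are made up:

import java.util.*;

public class InvertedIndexSketch {
    public static void main(String[] args) {
        String[] docs = {"lucene is a search toolkit", "solr is built on lucene"};
        // The inverted index: each word maps to the IDs of the documents containing it
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            for (String word : docs[docId].split("\\s+")) {
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
            }
        }
        // Searching looks the word up in the index instead of scanning every document
        System.out.println(index.get("lucene")); // prints [0, 1]
    }
}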
Four. Lucene
1. Meaning: Lucene is a full-text retrieval toolkit (a jar). With Lucene you can build a full-text retrieval system: a war package that runs independently under Tomcat and provides full-text retrieval services to outside users.
2. Application fields:
(1) Internet-wide full-text search (search engines such as Baidu/Google);
(2) Site search (for example, the in-site search of Taobao or JD);
(3) Optimizing the database (a LIKE fuzzy query uses a sequential scan and is slow).
3. Lucene structure (much like a dictionary):
Lucene structure = index + Document files (there can be many documents).
4. The Document object
First obtain the file, then create a Document object for it.
A Document object contains key-value pairs of the form [field name (name): field value (value)]; each pair is called a Field.
A Field can store the file name, file size, file type, the file's storage path, the file's content, and so on.
By analogy with a database: one Document corresponds to one row of a table, and one Field corresponds to one column.
Notes:
(1) After the Document object is created, it must be tokenized; whichever analyzer is used here, the same analyzer must be used when querying.
(2) Each Document can have multiple Fields, and different Documents can have different Fields;
the same Document can contain multiple Fields with the same name (same field name, even the same value), as the small sketch below shows.
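A tiny sketch of note (2), assuming the Lucene 4.x API used throughout these notes (the field names and values are made up):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;

public class SameNameFieldDemo {
    public static void main(String[] args) {
        Document doc = new Document();
        // One Document may hold several Fields with the same name;
        // all of the values are indexed and searchable under "author"
        doc.add(new TextField("author", "tom", Store.YES));
        doc.add(new TextField("author", "jerry", Store.YES));
        System.out.println(doc.getValues("author").length); // prints 2
    }
}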
5. Tokenization
Tokenization splits the extracted document content into individual words.
Stop words (a, an, the, 的, 地, 得, 啊, 嗯, 哈哈) are removed during splitting,
because searching for them is meaningless; sentences are broken into words, and punctuation and whitespace are stripped.
Each resulting word is called a token (term). A minimal analyzer sketch follows.
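A minimal sketch (not from the original notes) of what an analyzer produces, using the Lucene 4.x TokenStream API; the sample text and class name are made up:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizeDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer(); // or new IKAnalyzer() for Chinese
        TokenStream ts = analyzer.tokenStream("fileContent", "The Apache Lucene toolkit");
        CharTermAttribute termAttr = ts.addAttribute(CharTermAttribute.class);
        ts.reset(); // must be called before the first incrementToken()
        while (ts.incrementToken()) { // advance to the next token
            System.out.println(termAttr.toString()); // prints: apache, lucene, toolkit
        }
        ts.end();
        ts.close();
    }
}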
Five. The Field domain in the Document object
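(A summary based on the Lucene 4.x API used in the code below; the original notes give no detail here.)
TextField: tokenized and indexed; used below for file names and file content.
LongField: a numeric field; supports range queries such as NumericRangeQuery.
Store.YES / Store.NO: whether the original value is also stored in the index; only stored values can be read back from a search result with document.get(...).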
Six. The workflow of creating an index
Obtain the files that need indexing → create Document objects → tokenize → create the index writer object → add the documents through the index writer → commit and close the index writer stream.
@Test
public void testIndexManager() throws Exception {
List<Document> documents = new ArrayList<>(); // create a collection of document objects
// Read the files that need to be indexed
File f = new File("D:\\Indexsearchsource");
for (File file : f.listFiles()) {
// File name
String fileName = file.getName();
// File content
String fileContent = FileUtils.readFileToString(file);
// File size
Long fileSize = FileUtils.sizeOf(file);
// Put the file name, file content, and file size into Field objects
TextField nameField = new TextField("fileName", fileName, Store.YES);
TextField contentField = new TextField("fileContent", fileContent, Store.YES);
LongField sizeField = new LongField("fileSize", fileSize, Store.YES);
// Add the fields to the Document object
Document document = new Document();
document.add(nameField);
document.add(contentField);
document.add(sizeField);
// Add the document to the collection
documents.add(document);
}
// Create an analyzer
//Analyzer analyzer = new StandardAnalyzer(); // standard analyzer
Analyzer analyzer = new IKAnalyzer(); // IK Chinese analyzer
// Where the index is stored: FSDirectory = disk, RAMDirectory = memory
Directory directory = FSDirectory.open(new File("d:\\indexDir"));
// Writer configuration: which analyzer to use and which Lucene version
IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_4_10_3, analyzer);
// Create the index writer
IndexWriter indexWriter = new IndexWriter(directory, conf);
for (Document document : documents) {
indexWriter.addDocument(document);
}
indexWriter.commit();
indexWriter.close();
}

Seven. Deleting from the full-text index
Deletion uses the IndexWriter object, so the analyzer must be consistent with the one used to create the index.
Delete everything: indexWriter.deleteAll();
Delete by a term: indexWriter.deleteDocuments(new Term("fileName", "apache"));
@Test
public void testIndexDel() throws Exception{
// Create an analyzer. StandardAnalyzer tokenizes English well, but splits Chinese into single characters, so IKAnalyzer is used here
Analyzer analyzer = new IKAnalyzer();
// Specify the directory for indexing and document storage
Directory directory = FSDirectory.open(new File("E:\\dic"));
// Create the writer configuration
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_3, analyzer);
// Create the index writer
IndexWriter indexWriter = new IndexWriter(directory, config);
// Delete all
//indexWriter.deleteAll();
// Delete by file name
//Term: a token, i.e. a single word. First parameter: the field name; second parameter: documents containing this keyword are deleted
indexWriter.deleteDocuments(new Term("fileName", "apache"));
// Submit
indexWriter.commit();
// close
indexWriter.close();
}

Eight. Updating the full-text index
/**
 * An update searches with the given Term; if a match is found, it is deleted and the
 * updated content is written as a new Document object.
 * If no match is found, the updated content is simply added as a new Document object.
 * @throws Exception
 */
@Test
public void testIndexUpdate() throws Exception{
// Create an analyzer. StandardAnalyzer tokenizes English well, but splits Chinese into single characters, so IKAnalyzer is used here
Analyzer analyzer = new IKAnalyzer();
// Specify the directory for indexing and document storage
Directory directory = FSDirectory.open(new File("E:\\dic"));
// Create the writer configuration
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_3, analyzer);
// Create the index writer
IndexWriter indexWriter = new IndexWriter(directory, config);
// Update according to file name
Term term = new Term("fileName", "web");
// The new (replacement) document
Document doc = new Document();
doc.add(new TextField("fileName", "xxxxxx", Store.YES));
doc.add(new TextField("fileContext", "think in java xxxxxxx", Store.NO));
doc.add(new LongField("fileSize", 100L, Store.YES));
// to update
indexWriter.updateDocument(term, doc);
// Submit
indexWriter.commit();
// close
indexWriter.close();
}

Nine. Querying the full-text index (key point)
TermQuery: search by a single term (text fields only).
QueryParser: search a field with query syntax; a default search field can be set. Recommended. (Text fields only; a syntax sketch follows this list.)
NumericRangeQuery: search within a numeric range.
BooleanQuery: combined query across one or more fields; conditions are joined with NOT/AND/OR:
must is equivalent to the AND keyword;
should is equivalent to the OR keyword;
must_not is equivalent to the NOT keyword.
Note: using must_not alone (with no positive clause) is meaningless.
MatchAllDocsQuery: matches all documents.
MultiFieldQueryParser: queries multiple fields at once; a document matches only if the keyword appears in one of those fields.
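As a quick orientation before the individual tests, a sketch of the classic QueryParser syntax; the field names match the tests below, while the example queries and class name are made up:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class QuerySyntaxSketch {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new IKAnalyzer();
        // First parameter: the default search field; second parameter: the analyzer
        QueryParser parser = new QueryParser("fileContent", analyzer);
        Query q1 = parser.parse("apache");                    // bare keyword: searches the default field
        Query q2 = parser.parse("fileName:apache");           // explicit field
        Query q3 = parser.parse("fileName:apache AND fileContent:lucene"); // boolean operators
        System.out.println(q1 + " | " + q2 + " | " + q3);
    }
}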
@Test
public void testIndexSearch() throws Exception {
// The analyzer used for querying must match the one used to create the index
Analyzer analyzer = new IKAnalyzer();
// Directory object
Directory directory = FSDirectory.open(new File("d:\\indexDir"));
// Create the index reader
IndexReader indexReader = DirectoryReader.open(directory);
// Create index search objects
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
// Create the query parser. First parameter: the default search field; second parameter: the analyzer
// Default search field: if the query syntax names a field, that field is searched;
// if only a bare keyword is given, the default field is searched
QueryParser queryParser = new QueryParser("fileContent", analyzer);
// Query syntax: fieldName:keyword
Query query = queryParser.parse("fileName:apache");
TopDocs topDocs = indexSearcher.search(query, 5);
System.out.println("Total records found ====" + topDocs.totalHits);
// Get the result set from the search result object
ScoreDoc[] scoreDocs = topDocs.scoreDocs;
for (ScoreDoc scoreDoc : scoreDocs) {
int docId = scoreDoc.doc; // the document's id in the index
// Read the corresponding document from disk by its ID
Document document = indexReader.document(docId);
String fileName = document.get("fileName");
String fileContent = document.get("fileContent");
String fileSize = document.get("fileSize");
System.out.println(fileName);
//System.out.println(fileContent);
System.out.println(fileSize);
System.out.println("==============");
}
}
@Test
public void testIndexTermQuery() throws Exception{
// Create an analyzer (the analyzer used for indexing and querying must be the same)
Analyzer analyzer = new IKAnalyzer();
// Create a term (a single token)
Term term = new Term("fileName", "apache");
// Use TermQuery to query by the term object
TermQuery termQuery = new TermQuery(term);
// Specify the index and the directory of the document
Directory dir = FSDirectory.open(new File("E:\\dic"));
// Create the index reader
IndexReader indexReader = DirectoryReader.open(dir);
// Create indexed search objects
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
// Search. First parameter: the query object; second parameter: the maximum number of results to return
TopDocs topdocs = indexSearcher.search(termQuery, 5);
// How many records have been searched
System.out.println("=====count=====" + topdocs.totalHits);
// Get the result set from the search result object
ScoreDoc[] scoreDocs = topdocs.scoreDocs;
for(ScoreDoc scoreDoc : scoreDocs){
// obtain docID
int docID = scoreDoc.doc;
// Read the corresponding document from disk by its ID
Document document = indexReader.document(docID);
// get(fieldName) returns the stored value; print it
System.out.println("fileName:" + document.get("fileName"));
System.out.println("fileSize:" + document.get("fileSize"));
System.out.println("===================================");
}
}
@Test
public void testNumericRangeQuery() throws Exception{
// Create an analyzer (the analyzer used for indexing and querying must be the same)
Analyzer analyzer = new IKAnalyzer();
// Query by numeric range: files whose size is between 100 and 1000 bytes
// First parameter: the field name
// Second parameter: the minimum value
// Third parameter: the maximum value
// Fourth parameter: whether the minimum is inclusive
// Fifth parameter: whether the maximum is inclusive
Query query = NumericRangeQuery.newLongRange("fileSize", 100L, 1000L, true, true);
// Specify the index and the directory of the document
Directory dir = FSDirectory.open(new File("E:\\dic"));
// Create the index reader
IndexReader indexReader = DirectoryReader.open(dir);
// Create indexed search objects
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
// Search. First parameter: the query object; second parameter: the maximum number of results to return
TopDocs topdocs = indexSearcher.search(query, 5);
// How many records have been searched
System.out.println("=====count=====" + topdocs.totalHits);
// Get the result set from the search result object
ScoreDoc[] scoreDocs = topdocs.scoreDocs;
for(ScoreDoc scoreDoc : scoreDocs){
// obtain docID
int docID = scoreDoc.doc;
// Read the corresponding document from disk by its ID
Document document = indexReader.document(docID);
// get(fieldName) returns the stored value; print it
System.out.println("fileName:" + document.get("fileName"));
System.out.println("fileSize:" + document.get("fileSize"));
System.out.println("====================================");
}
}
@Test
public void testBooleanQuery() throws Exception{
// Create an analyzer (the analyzer used for indexing and querying must be the same)
Analyzer analyzer = new IKAnalyzer();
// Boolean query: multiple conditions can be combined
// Here: the file name contains apache AND the file size is between 100 and 1000 bytes
BooleanQuery query = new BooleanQuery();
// Query by numeric range: files whose size is between 100 and 1000 bytes
// First parameter: the field name
// Second parameter: the minimum value
// Third parameter: the maximum value
// Fourth parameter: whether the minimum is inclusive
// Fifth parameter: whether the maximum is inclusive
Query numericQuery = NumericRangeQuery.newLongRange("fileSize", 100L, 1000L, true, true);
// Create a term (a single token)
Term term = new Term("fileName", "apache");
// Use TermQuery to query by the term object
TermQuery termQuery = new TermQuery(term);
//Occur is the logical operator joining the clauses
//Occur.MUST is equivalent to the AND keyword
//Occur.SHOULD is equivalent to the OR keyword
//Occur.MUST_NOT is equivalent to the NOT keyword
// Note: using MUST_NOT alone (with no positive clause) is meaningless
query.add(termQuery, Occur.MUST);
query.add(numericQuery, Occur.MUST);
// Specify the index and the directory of the document
Directory dir = FSDirectory.open(new File("E:\\dic"));
// Create the index reader
IndexReader indexReader = DirectoryReader.open(dir);
// Create indexed search objects
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
// Search. First parameter: the query object; second parameter: the maximum number of results to return
TopDocs topdocs = indexSearcher.search(query, 5);
// How many records have been searched
System.out.println("=====count=====" + topdocs.totalHits);
// Get the result set from the search result object
ScoreDoc[] scoreDocs = topdocs.scoreDocs;
for(ScoreDoc scoreDoc : scoreDocs){
// obtain docID
int docID = scoreDoc.doc;
// Read the corresponding document from disk by its ID
Document document = indexReader.document(docID);
// get(fieldName) returns the stored value; print it
System.out.println("fileName:" + document.get("fileName"));
System.out.println("fileSize:" + document.get("fileSize"));
System.out.println("===================================");
}
}
@Test
public void testMatchAllQuery() throws Exception{
// Create an analyzer (the analyzer used for indexing and querying must be the same)
Analyzer analyzer = new IKAnalyzer();
// Query all documents
MatchAllDocsQuery query = new MatchAllDocsQuery();
// Specify the index and the directory of the document
Directory dir = FSDirectory.open(new File("E:\\dic"));
// Create the index reader
IndexReader indexReader = DirectoryReader.open(dir);
// Create indexed search objects
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
// Search. First parameter: the query object; second parameter: the maximum number of results to return
TopDocs topdocs = indexSearcher.search(query, 5);
// How many records have been searched
System.out.println("=====count=====" + topdocs.totalHits);
// Get the result set from the search result object
ScoreDoc[] scoreDocs = topdocs.scoreDocs;
for(ScoreDoc scoreDoc : scoreDocs){
// obtain docID
int docID = scoreDoc.doc;
// Read the corresponding document from disk by its ID
Document document = indexReader.document(docID);
// get(fieldName) returns the stored value; print it
System.out.println("fileName:" + document.get("fileName"));
System.out.println("fileSize:" + document.get("fileSize"));
System.out.println("======================================");
}
}
@Test
public void testMultiFieldQueryParser() throws Exception{
// Create an analyzer (the analyzer used for indexing and querying must be the same)
Analyzer analyzer = new IKAnalyzer();
String [] fields = {"fileName","fileContext"};
// Query both the file name and the file content; only documents containing apache match
MultiFieldQueryParser multiQuery = new MultiFieldQueryParser(fields, analyzer);
// Enter the keyword you want to search
Query query = multiQuery.parse("apache");
// Specify the index and the directory of the document
Directory dir = FSDirectory.open(new File("E:\\dic"));
// Create the index reader
IndexReader indexReader = DirectoryReader.open(dir);
// Create indexed search objects
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
// Search. First parameter: the query object; second parameter: the maximum number of results to return
TopDocs topdocs = indexSearcher.search(query, 5);
// How many records have been searched
System.out.println("=====count=====" + topdocs.totalHits);
// Get the result set from the search result object
ScoreDoc[] scoreDocs = topdocs.scoreDocs;
for(ScoreDoc scoreDoc : scoreDocs){
// obtain docID
int docID = scoreDoc.doc;
// Read the corresponding document from disk by its ID
Document document = indexReader.document(docID);
// get(fieldName) returns the stored value; print it
System.out.println("fileName:" + document.get("fileName"));
System.out.println("fileSize:" + document.get("fileSize"));
System.out.println("===================================");
}
}}