Lucene Full-Text Search Toolkit: Learning Notes Summary
2022-06-30 11:43:00 【Full stack programmer webmaster】
Lucene is a full-text retrieval toolkit under Apache (Solr also belongs to Apache; Solr's underlying implementation is Lucene).
One. Classification of data
Structured data: data with a fixed type and length, for example data in a database (MySQL/Oracle) and metadata (the properties of files in Windows).
Unstructured data: data without a fixed type or length, for example the content of an email or of a Word document.
Two. How to find data
Structured data: data in a database can be searched with SQL statements; metadata (in Windows) can be searched through the search bar Windows provides.
Unstructured data: inside a Word document you can search with Ctrl+F. There are two approaches:
Sequential scan: inefficient, but guaranteed to find the content as long as it exists in the document.
Full-text search (inverted lookup): similar to how a word is looked up in a dictionary.
Three. Full-text search
Meaning: extract the content of the files, split the content into individual terms (tokenization), and assemble the terms into an index. When searching, the index is searched first, and the matching documents are then located through the index. This process is called full-text search.
Advantage: fast, efficient search.
Disadvantage: it trades space for time. Full-text search imitates dictionary lookup; a minimal sketch of the idea follows.
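To make the idea concrete, here is a minimal sketch of an inverted index in plain Java. It only illustrates the principle, not how Lucene actually stores its index; all names in it are made up:

import java.util.*;

public class InvertedIndexSketch {
    public static void main(String[] args) {
        String[] docs = {"lucene is a search toolkit", "solr is built on lucene"};
        // The inverted index: each word maps to the IDs of the documents containing it
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            for (String word : docs[docId].split("\\s+")) {
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(docId);
            }
        }
        // Searching looks the word up in the index instead of scanning every document
        System.out.println(index.get("lucene")); // prints [0, 1]
    }
}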
Four. Lucene
1. Meaning: Lucene is a full-text retrieval toolkit (a jar). With Lucene you can build a full-text retrieval system: a war package that runs independently under Tomcat and provides full-text retrieval services to outside users.
2. Application fields:
(1) Internet-wide full-text search (search engines such as Baidu/Google);
(2) Site search (for example, the in-site search of Taobao or JD);
(3) Optimizing the database (a LIKE fuzzy query uses a sequential scan and is slow).
3. Lucene structure (much like a dictionary):
Lucene structure = index + Document files (there can be many documents).
4. The Document object
First obtain the file, then create a Document object for it.
A Document object contains key-value pairs of the form [field name (name): field value (value)]; each pair is called a Field.
A Field can store the file name, file size, file type, the file's storage path, the file's content, and so on.
By analogy with a database: one Document corresponds to one row of a table, and one Field corresponds to one column.
Notes:
(1) After the Document object is created, it must be tokenized; whichever analyzer is used here, the same analyzer must be used when querying.
(2) Each Document can have multiple Fields, and different Documents can have different Fields;
the same Document can contain multiple Fields with the same name (same field name, even the same value), as the small sketch below shows.
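A tiny sketch of note (2), assuming the Lucene 4.x API used throughout these notes (the field names and values are made up):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;

public class SameNameFieldDemo {
    public static void main(String[] args) {
        Document doc = new Document();
        // One Document may hold several Fields with the same name;
        // all of the values are indexed and searchable under "author"
        doc.add(new TextField("author", "tom", Store.YES));
        doc.add(new TextField("author", "jerry", Store.YES));
        System.out.println(doc.getValues("author").length); // prints 2
    }
}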
5. Tokenization
Tokenization splits the extracted document content into individual words.
Stop words (a, an, the, 的, 地, 得, 啊, 嗯, 哈哈) are removed during splitting,
because searching for them is meaningless; sentences are broken into words, and punctuation and whitespace are stripped.
Each resulting word is called a token (term). A minimal analyzer sketch follows.
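A minimal sketch (not from the original notes) of what an analyzer produces, using the Lucene 4.x TokenStream API; the sample text and class name are made up:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizeDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new StandardAnalyzer(); // or new IKAnalyzer() for Chinese
        TokenStream ts = analyzer.tokenStream("fileContent", "The Apache Lucene toolkit");
        CharTermAttribute termAttr = ts.addAttribute(CharTermAttribute.class);
        ts.reset(); // must be called before the first incrementToken()
        while (ts.incrementToken()) { // advance to the next token
            System.out.println(termAttr.toString()); // prints: apache, lucene, toolkit
        }
        ts.end();
        ts.close();
    }
}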
Five. The Field domain in the Document object
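(A summary based on the Lucene 4.x API used in the code below; the original notes give no detail here.)
TextField: tokenized and indexed; used below for file names and file content.
LongField: a numeric field; supports range queries such as NumericRangeQuery.
Store.YES / Store.NO: whether the original value is also stored in the index; only stored values can be read back from a search result with document.get(...).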
Six. The workflow of creating an index
Obtain the files that need indexing → create Document objects → tokenize → create the index writer object → add the documents through the index writer → commit and close the index writer stream.
@Test
public void testIndexManager() throws Exception {
List<Document> documents = new ArrayList<>(); // create a collection of document objects
// Read the files that need to be indexed
File f = new File("D:\\Indexsearchsource");
for (File file : f.listFiles()) {
// File name
String fileName = file.getName();
// File content
String fileContent = FileUtils.readFileToString(file);
// File size
Long fileSize = FileUtils.sizeOf(file);
// Put the file name, file content, and file size into Field objects
TextField nameField = new TextField("fileName", fileName, Store.YES);
TextField contentField = new TextField("fileContent", fileContent, Store.YES);
LongField sizeField = new LongField("fileSize", fileSize, Store.YES);
// Add the fields to the Document object
Document document = new Document();
document.add(nameField);
document.add(contentField);
document.add(sizeField);
// Add the document to the collection
documents.add(document);
}
// Create an analyzer
//Analyzer analyzer = new StandardAnalyzer(); // standard analyzer
Analyzer analyzer = new IKAnalyzer(); // IK Chinese analyzer
// Where the index is stored: FSDirectory = disk, RAMDirectory = memory
Directory directory = FSDirectory.open(new File("d:\\indexDir"));
// Writer configuration: which analyzer to use and which Lucene version
IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_4_10_3, analyzer);
// Create the index writer
IndexWriter indexWriter = new IndexWriter(directory, conf);
for (Document document : documents) {
indexWriter.addDocument(document);
}
indexWriter.commit();
indexWriter.close();
}

Seven. Deleting from the full-text index
Deletion uses the IndexWriter object, so the analyzer must be consistent with the one used to create the index.
Delete everything: indexWriter.deleteAll();
Delete by a term: indexWriter.deleteDocuments(new Term("fileName", "apache"));
@Test
public void testIndexDel() throws Exception{
// Create an analyzer. StandardAnalyzer tokenizes English well, but splits Chinese into single characters, so IKAnalyzer is used here
Analyzer analyzer = new IKAnalyzer();
// Specify the directory for indexing and document storage
Directory directory = FSDirectory.open(new File("E:\\dic"));
// Create the writer configuration
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_3, analyzer);
// Create the index writer
IndexWriter indexWriter = new IndexWriter(directory, config);
// Delete all
//indexWriter.deleteAll();
// Delete by file name
//Term: a token, i.e. a single word. First parameter: the field name; second parameter: documents containing this keyword are deleted
indexWriter.deleteDocuments(new Term("fileName", "apache"));
// Submit
indexWriter.commit();
// close
indexWriter.close();
}

Eight. Updating the full-text index
/**
 * An update searches with the given Term; if a match is found, it is deleted and the
 * updated content is written as a new Document object.
 * If no match is found, the updated content is simply added as a new Document object.
 * @throws Exception
 */
@Test
public void testIndexUpdate() throws Exception{
// Create an analyzer. StandardAnalyzer tokenizes English well, but splits Chinese into single characters, so IKAnalyzer is used here
Analyzer analyzer = new IKAnalyzer();
// Specify the directory for indexing and document storage
Directory directory = FSDirectory.open(new File("E:\\dic"));
// Create the writer configuration
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_3, analyzer);
// Create the index writer
IndexWriter indexWriter = new IndexWriter(directory, config);
// Update according to file name
Term term = new Term("fileName", "web");
// The new (replacement) document
Document doc = new Document();
doc.add(new TextField("fileName", "xxxxxx", Store.YES));
doc.add(new TextField("fileContext", "think in java xxxxxxx", Store.NO));
doc.add(new LongField("fileSize", 100L, Store.YES));
// to update
indexWriter.updateDocument(term, doc);
// Submit
indexWriter.commit();
// close
indexWriter.close();
}

Nine. Querying the full-text index (key point)
TermQuery: search by a single term (text fields only).
QueryParser: search a field with query syntax; a default search field can be set. Recommended. (Text fields only; a syntax sketch follows this list.)
NumericRangeQuery: search within a numeric range.
BooleanQuery: combined query across one or more fields; conditions are joined with NOT/AND/OR:
must is equivalent to the AND keyword;
should is equivalent to the OR keyword;
must_not is equivalent to the NOT keyword.
Note: using must_not alone (with no positive clause) is meaningless.
MatchAllDocsQuery: matches all documents.
MultiFieldQueryParser: queries multiple fields at once; a document matches only if the keyword appears in one of those fields.
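As a quick orientation before the individual tests, a sketch of the classic QueryParser syntax; the field names match the tests below, while the example queries and class name are made up:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class QuerySyntaxSketch {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new IKAnalyzer();
        // First parameter: the default search field; second parameter: the analyzer
        QueryParser parser = new QueryParser("fileContent", analyzer);
        Query q1 = parser.parse("apache");                    // bare keyword: searches the default field
        Query q2 = parser.parse("fileName:apache");           // explicit field
        Query q3 = parser.parse("fileName:apache AND fileContent:lucene"); // boolean operators
        System.out.println(q1 + " | " + q2 + " | " + q3);
    }
}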
@Test
public void testIndexSearch() throws Exception {
// The analyzer used for querying must match the one used to create the index
Analyzer analyzer = new IKAnalyzer();
// Directory object
Directory directory = FSDirectory.open(new File("d:\\indexDir"));
// Create the index reader
IndexReader indexReader = DirectoryReader.open(directory);
// Create index search objects
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
// Create the query parser. First parameter: the default search field; second parameter: the analyzer
// Default search field: if the query syntax names a field, that field is searched;
// if only a bare keyword is given, the default field is searched
QueryParser queryParser = new QueryParser("fileContent", analyzer);
// Query syntax: fieldName:keyword
Query query = queryParser.parse("fileName:apache");
TopDocs topDocs = indexSearcher.search(query, 5);
System.out.println("Total records found ====" + topDocs.totalHits);
// Get the result set from the search result object
ScoreDoc[] scoreDocs = topDocs.scoreDocs;
for (ScoreDoc scoreDoc : scoreDocs) {
int docId = scoreDoc.doc; // the document's id in the index
// Read the corresponding document from disk by its ID
Document document = indexReader.document(docId);
String fileName = document.get("fileName");
String fileContent = document.get("fileContent");
String fileSize = document.get("fileSize");
System.out.println(fileName);
//System.out.println(fileContent);
System.out.println(fileSize);
System.out.println("==============");
}
}
@Test
public void testIndexTermQuery() throws Exception{
// Create an analyzer (the analyzer used for indexing and querying must be the same)
Analyzer analyzer = new IKAnalyzer();
// Create a term (a single token)
Term term = new Term("fileName", "apache");
// Use TermQuery to query by the term object
TermQuery termQuery = new TermQuery(term);
// Specify the index and the directory of the document
Directory dir = FSDirectory.open(new File("E:\\dic"));
// Create the index reader
IndexReader indexReader = DirectoryReader.open(dir);
// Create indexed search objects
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
// Search. First parameter: the query object; second parameter: the maximum number of results to return
TopDocs topdocs = indexSearcher.search(termQuery, 5);
// How many records have been searched
System.out.println("=====count=====" + topdocs.totalHits);
// Get the result set from the search result object
ScoreDoc[] scoreDocs = topdocs.scoreDocs;
for(ScoreDoc scoreDoc : scoreDocs){
// obtain docID
int docID = scoreDoc.doc;
// Read the corresponding document from disk by its ID
Document document = indexReader.document(docID);
// get(fieldName) returns the stored value; print it
System.out.println("fileName:" + document.get("fileName"));
System.out.println("fileSize:" + document.get("fileSize"));
System.out.println("===================================");
}
}
@Test
public void testNumericRangeQuery() throws Exception{
// Create an analyzer (the analyzer used for indexing and querying must be the same)
Analyzer analyzer = new IKAnalyzer();
// Query by numeric range: files whose size is between 100 and 1000 bytes
// First parameter: the field name
// Second parameter: the minimum value
// Third parameter: the maximum value
// Fourth parameter: whether the minimum is inclusive
// Fifth parameter: whether the maximum is inclusive
Query query = NumericRangeQuery.newLongRange("fileSize", 100L, 1000L, true, true);
// Specify the index and the directory of the document
Directory dir = FSDirectory.open(new File("E:\\dic"));
// Create the index reader
IndexReader indexReader = DirectoryReader.open(dir);
// Create indexed search objects
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
// Search. First parameter: the query object; second parameter: the maximum number of results to return
TopDocs topdocs = indexSearcher.search(query, 5);
// How many records have been searched
System.out.println("=====count=====" + topdocs.totalHits);
// Get the result set from the search result object
ScoreDoc[] scoreDocs = topdocs.scoreDocs;
for(ScoreDoc scoreDoc : scoreDocs){
// obtain docID
int docID = scoreDoc.doc;
// Read the corresponding document from disk by its ID
Document document = indexReader.document(docID);
// get(fieldName) returns the stored value; print it
System.out.println("fileName:" + document.get("fileName"));
System.out.println("fileSize:" + document.get("fileSize"));
System.out.println("====================================");
}
}
@Test
public void testBooleanQuery() throws Exception{
// Create an analyzer (the analyzer used for indexing and querying must be the same)
Analyzer analyzer = new IKAnalyzer();
// Boolean query: multiple conditions can be combined
// Here: the file name contains apache AND the file size is between 100 and 1000 bytes
BooleanQuery query = new BooleanQuery();
// Query by numeric range: files whose size is between 100 and 1000 bytes
// First parameter: the field name
// Second parameter: the minimum value
// Third parameter: the maximum value
// Fourth parameter: whether the minimum is inclusive
// Fifth parameter: whether the maximum is inclusive
Query numericQuery = NumericRangeQuery.newLongRange("fileSize", 100L, 1000L, true, true);
// Create a term (a single token)
Term term = new Term("fileName", "apache");
// Use TermQuery to query by the term object
TermQuery termQuery = new TermQuery(term);
//Occur is the logical operator joining the clauses
//Occur.MUST is equivalent to the AND keyword
//Occur.SHOULD is equivalent to the OR keyword
//Occur.MUST_NOT is equivalent to the NOT keyword
// Note: using MUST_NOT alone (with no positive clause) is meaningless
query.add(termQuery, Occur.MUST);
query.add(numericQuery, Occur.MUST);
// Specify the index and the directory of the document
Directory dir = FSDirectory.open(new File("E:\\dic"));
// Create the index reader
IndexReader indexReader = DirectoryReader.open(dir);
// Create indexed search objects
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
// Search. First parameter: the query object; second parameter: the maximum number of results to return
TopDocs topdocs = indexSearcher.search(query, 5);
// How many records have been searched
System.out.println("=====count=====" + topdocs.totalHits);
// Get the result set from the search result object
ScoreDoc[] scoreDocs = topdocs.scoreDocs;
for(ScoreDoc scoreDoc : scoreDocs){
// obtain docID
int docID = scoreDoc.doc;
// Read the corresponding document from disk by its ID
Document document = indexReader.document(docID);
// get(fieldName) returns the stored value; print it
System.out.println("fileName:" + document.get("fileName"));
System.out.println("fileSize:" + document.get("fileSize"));
System.out.println("===================================");
}
}
@Test
public void testMatchAllQuery() throws Exception{
// Create an analyzer (the analyzer used for indexing and querying must be the same)
Analyzer analyzer = new IKAnalyzer();
// Query all documents
MatchAllDocsQuery query = new MatchAllDocsQuery();
// Specify the index and the directory of the document
Directory dir = FSDirectory.open(new File("E:\\dic"));
// Create the index reader
IndexReader indexReader = DirectoryReader.open(dir);
// Create indexed search objects
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
// Search. First parameter: the query object; second parameter: the maximum number of results to return
TopDocs topdocs = indexSearcher.search(query, 5);
// How many records have been searched
System.out.println("=====count=====" + topdocs.totalHits);
// Get the result set from the search result object
ScoreDoc[] scoreDocs = topdocs.scoreDocs;
for(ScoreDoc scoreDoc : scoreDocs){
// obtain docID
int docID = scoreDoc.doc;
// Read the corresponding document from disk by its ID
Document document = indexReader.document(docID);
// get(fieldName) returns the stored value; print it
System.out.println("fileName:" + document.get("fileName"));
System.out.println("fileSize:" + document.get("fileSize"));
System.out.println("======================================");
}
}
@Test
public void testMultiFieldQueryParser() throws Exception{
// Create an analyzer (the analyzer used for indexing and querying must be the same)
Analyzer analyzer = new IKAnalyzer();
String [] fields = {"fileName","fileContext"};
// Query both the file name and the file content; only documents containing apache match
MultiFieldQueryParser multiQuery = new MultiFieldQueryParser(fields, analyzer);
// Enter the keyword you want to search
Query query = multiQuery.parse("apache");
// Specify the index and the directory of the document
Directory dir = FSDirectory.open(new File("E:\\dic"));
// Create the index reader
IndexReader indexReader = DirectoryReader.open(dir);
// Create indexed search objects
IndexSearcher indexSearcher = new IndexSearcher(indexReader);
// Search. First parameter: the query object; second parameter: the maximum number of results to return
TopDocs topdocs = indexSearcher.search(query, 5);
// How many records have been searched
System.out.println("=====count=====" + topdocs.totalHits);
// Get the result set from the search result object
ScoreDoc[] scoreDocs = topdocs.scoreDocs;
for(ScoreDoc scoreDoc : scoreDocs){
// obtain docID
int docID = scoreDoc.doc;
// Read the corresponding document from disk by its ID
Document document = indexReader.document(docID);
// get(fieldName) returns the stored value; print it
System.out.println("fileName:" + document.get("fileName"));
System.out.println("fileSize:" + document.get("fileSize"));
System.out.println("===================================");
}
}}