Using Stanford Parser (intelligent language processing) to implement word segmentation

2022-07-23 07:50:00 【Wu Nian】

Yesterday I studied Stanford Parser. I wanted to combine its intelligent word segmentation with the idea of a Lucene tokenizer. Because project time was tight, part of this study is unfinished and the code still has bugs. I hope anyone who shares this idea can improve it.

Lucene version: lucene 4.10.3. Jar packages to import: stanford-parser-3.3.0-models.jar, stanford-parser.jar

First, build a tokenizer test class. The code is as follows:

    package main.test;

    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

    public class AnalyzerTest {

        public static void analyzer(Analyzer analyzer, String text) {
            try {
                System.out.println("Analyzer name: " + analyzer.getClass());
                // Obtain the TokenStream
                TokenStream tokenStream = analyzer.tokenStream("", new StringReader(text));
                tokenStream.reset();
                while (tokenStream.incrementToken()) {
                    CharTermAttribute cta1 = tokenStream.getAttribute(CharTermAttribute.class);
                    OffsetAttribute ofa = tokenStream.getAttribute(OffsetAttribute.class);
                    // Position increment attribute: stores the distance between terms
                    // PositionIncrementAttribute pia = tokenStream.getAttribute(PositionIncrementAttribute.class);
                    // System.out.print(pia.getPositionIncrement() + ":");
                    System.out.print("[" + ofa.startOffset() + "-" + ofa.endOffset() + "]-->" + cta1.toString() + "\n");
                }
                tokenStream.end();
                tokenStream.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        public static void main(String[] args) {
            String chText = "Tsinghua university students say they are studying the origin of life";
            Analyzer analyzer = new NlpHhcAnalyzer();
            analyzer(analyzer, chText);
        }
    }

Next, define a new analyzer: extend the Analyzer class and override its createComponents method, which returns a TokenStreamComponents. Note: in Lucene 4.x, TokenStreamComponents wraps, as a single component, the tokenizer and filters known from Lucene 3.x.

    package main.test;

    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;

    public class NlpHhcAnalyzer extends Analyzer {

        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            return new TokenStreamComponents(new aaa(reader));
        }

    }
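
To make the "tokenizer plus filters as one component" idea concrete, here is a minimal plain-Java sketch of the chaining pattern. It does not use Lucene at all: `TermSource`, `tokenizer`, `lowerCaseFilter`, `stopFilter` and `drain` are hypothetical names invented for illustration, and the real Lucene classes carry attributes and offsets rather than bare strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

final class ChainSketch {

    /** Hypothetical stand-in for a token stream: next term, or null when exhausted. */
    interface TermSource {
        String next();
    }

    /** Plays the role of the tokenizer: the source that produces raw terms. */
    static TermSource tokenizer(List<String> terms) {
        Iterator<String> it = terms.iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    /** Plays the role of a filter: wraps another source and lower-cases each term. */
    static TermSource lowerCaseFilter(TermSource in) {
        return () -> {
            String t = in.next();
            return t == null ? null : t.toLowerCase(Locale.ROOT);
        };
    }

    /** Plays the role of a stop filter: skips terms that appear in the stop list. */
    static TermSource stopFilter(TermSource in, List<String> stopWords) {
        return () -> {
            String t = in.next();
            while (t != null && stopWords.contains(t)) {
                t = in.next();
            }
            return t;
        };
    }

    /** Pulls terms until the source reports that it is exhausted. */
    static List<String> drain(TermSource src) {
        List<String> out = new ArrayList<>();
        for (String t = src.next(); t != null; t = src.next()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        TermSource chain = stopFilter(
                lowerCaseFilter(tokenizer(Arrays.asList("The", "Origin", "Of", "Life"))),
                Arrays.asList("the", "of"));
        System.out.println(drain(chain));
    }
}
```

The consumer only ever sees the outermost object of the chain, which is roughly what bundling the tokenizer and its filters into one component gives the caller. In the analyzer above the chain has a single link, the aaa tokenizer, with no filters wrapped around it.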

Implement a new Tokenizer class, aaa. For lack of time this part of the code has not been fully debugged. Anyone who has time is welcome to try improving it.

    package main.test;

    import java.io.IOException;
    import java.io.Reader;
    import java.util.Collections;
    import java.util.Iterator;

    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.util.AttributeFactory;

    import edu.stanford.nlp.ling.Word;
    import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
    import edu.stanford.nlp.trees.Tree;

    public class aaa extends Tokenizer {

        private static final String MODEL_PATH =
                "edu/stanford/nlp/models/lexparser/xinhuaFactoredSegmenting.ser.gz";

        // Term text attribute
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        // Term offset attribute
        private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
        // Loaded in a field initializer so that both constructors get a parser
        private final LexicalizedParser lp = LexicalizedParser.loadModel(MODEL_PATH);

        private String str = "";
        private Iterator<Word> words = Collections.<Word>emptyList().iterator();
        // Where to start searching for the next term in str
        private int searchFrom;

        public aaa(Reader in) {
            super(in);
        }

        protected aaa(AttributeFactory factory, Reader input) {
            super(factory, input);
        }

        @Override
        public void reset() throws IOException {
            super.reset();
            // Read the whole input; read() returns -1 at the end, so do not assume 100 chars
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = input.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            str = sb.toString();
            searchFrom = 0;
            if (str.trim().isEmpty()) {
                words = Collections.<Word>emptyList().iterator();
            } else {
                // Parse once per input. The tree leaves are the segmented words in
                // sentence order; typed dependencies are not guaranteed to be in that order.
                Tree t = lp.parse(str);
                words = t.yieldWords().iterator();
            }
        }

        @Override
        public boolean incrementToken() throws IOException {
            // Clear all term attributes
            clearAttributes();
            if (!words.hasNext()) {
                // Returning false signals that all terms have been emitted
                return false;
            }
            String term = words.next().word().trim();
            // Set the term text (this also sets its length)
            termAtt.setEmpty().append(term);
            // Set the term offsets by locating the term in the original text
            int begin = str.indexOf(term, searchFrom);
            if (begin < 0) {
                begin = searchFrom; // term was altered by the parser; fall back
            }
            int end = Math.min(begin + term.length(), str.length());
            offsetAtt.setOffset(correctOffset(begin), correctOffset(end));
            searchFrom = end;
            // Returning true signals that a term is available
            return true;
        }

        @Override
        public void end() throws IOException {
            super.end();
            // Record the end position of the input as the final offset
            int finalOffset = correctOffset(str.length());
            offsetAtt.setOffset(finalOffset, finalOffset);
        }

    }
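
The two parts of the tokenizer that are easiest to get wrong are reading the input and computing offsets, so here is a small standalone sketch of both, free of Lucene and Stanford Parser. It assumes the segmenter returns words in sentence order; `OffsetSketch`, `readAll` and `locate` are hypothetical names used only for this illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class OffsetSketch {

    /** Reads the reader to its end; read() returns -1 once the input is exhausted. */
    static String readAll(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = in.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    /** Returns a {start, end} pair for each word, searching the text left to right. */
    static List<int[]> locate(String text, List<String> words) {
        List<int[]> spans = new ArrayList<>();
        int searchFrom = 0;
        for (String word : words) {
            int start = text.indexOf(word, searchFrom);
            if (start < 0) {
                start = searchFrom; // word not found verbatim; fall back to the cursor
            }
            int end = Math.min(start + word.length(), text.length());
            spans.add(new int[] {start, end});
            searchFrom = end;
        }
        return spans;
    }

    public static void main(String[] args) throws IOException {
        String text = readAll(new StringReader("the origin of life"));
        List<String> words = Arrays.asList("the", "origin", "of", "life");
        for (int[] s : locate(text, words)) {
            System.out.println("[" + s[0] + "-" + s[1] + "]-->" + text.substring(s[0], s[1]));
        }
    }
}
```

Searching from the end of the previous word keeps offsets from moving backwards and lets the code skip separators such as spaces. Summing word lengths would only give correct offsets when the words cover the text with no gaps.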

Original site

Copyright notice
This article was written by [Wu Nian]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/204/202207222122578166.html
