Analysis and understanding of the jieba ("stutter") word segmentation principle: the application of HMM in Chinese word segmentation, with partial code reading

2022-06-24 11:07:00 Goose

1. Background

This post describes the implementation principle of jieba (结巴, literally "stutter"), the word segmentation library we commonly use in segmentation tasks. A previous post mentioned the application of HMM to word segmentation; this one reviews that material and deepens the understanding of HMM. Although jieba appeared as a segmentation library more than ten years ago, it is still maintained, remains very easy to use, and is a classic codebase to learn from.

2. Chinese word segmentation background

2.1 Characteristics

  1. In Chinese, the word is the smallest meaningful language unit that can act independently;
  2. Chinese text is written character by character; unlike Western languages, there are no spaces between words to mark word boundaries;
  3. Word segmentation is the basic step of Chinese text processing and plays a key role in Chinese information processing.

2.2 Difficulties

  1. Segmentation standards: the definition of a "word" itself is not clear-cut (see 《Statistical Natural Language Processing》 by Zong Chengqing).
  2. Ambiguous segmentation: intersection-type ambiguity, combination-type ambiguity and so on. The classic example 结婚的和尚未结婚的 ("those who are married and those not yet married") can be cut as 结婚/的/和/尚未/结婚/的 or, wrongly, as 结婚/的/和尚/未/结婚/的 ("married monks who have not married").
  3. Out-of-vocabulary words have two interpretations: words not included in the existing dictionary, and words that never appeared in the training corpus; the latter are what is usually meant by OOV (Out of Vocabulary). For large-scale real-world text, OOV words hurt segmentation precision far more than ambiguity does. New Internet slang and coined words generally fall into this category.

2.3 Basic methods

  • Dictionary/lexicon matching methods (rule-based)
    • String-matching segmentation, also called mechanical segmentation: match the string to be segmented against entries in a sufficiently large machine dictionary. Variants include forward versus reverse matching, maximum-length versus minimum-length matching, and pure segmentation versus segmentation combined with tagging; the commonly used ones are forward maximum matching, reverse maximum matching, and minimum segmentation. In practice, mechanical segmentation is used as an initial pass and linguistic information is then used to improve accuracy: words with obvious features are identified first and used as break points to cut the original string into smaller pieces for mechanical matching, which reduces the matching error rate; segmentation can also be combined with part-of-speech tagging.
  • Word-frequency statistics methods (statistics-based)
    • The more often adjacent characters co-occur, the more likely they are to form a word, so word-group frequencies are counted over a corpus. Frequency-based segmentation is a full-segmentation method. jieba is a statistics-based segmenter: it uses dynamic programming to find the maximum-probability path and the maximum segmentation combination based on word frequency; for out-of-vocabulary words it uses an HMM model based on the word-forming ability of Chinese characters, decoded with the Viterbi algorithm.
  • Knowledge/understanding-based methods
    • These methods rely on syntactic analysis combined with semantic analysis, using contextual information to resolve word boundaries. They usually consist of three parts: a segmentation subsystem, a syntactic-semantic subsystem, and a general control component. Under the coordination of the control component, the segmentation subsystem obtains syntactic and semantic information about words and sentences to resolve segmentation ambiguity. Such methods try to give the machine human-like understanding and require a large amount of linguistic knowledge. Because Chinese linguistic knowledge is so general and complex, it is hard to encode all of it in machine-readable form, so knowledge-based segmentation systems are still at the experimental stage.

3. A brief introduction to the jieba algorithm

3.1 Overview

  1. Efficient word-graph scanning based on a prefix dictionary generates a directed acyclic graph (DAG) of all possible word formations of the characters in a sentence;
    1. The prefix dictionary stores the lexicon (i.e. the contents of dict.txt). From it, the DAG of all possible words in the sentence is built and stored as a dictionary of the form {key: [i, j, ...], ...}, where key is a character position in the sentence and the list holds, for every word in the prefix dictionary that starts at position key, the position of that word's last character. In other words, the list stores the possible end positions of words beginning at position key, so each entry pairs a start position with the list of end positions found by dictionary lookup.
    2. For example, the DAG generated for the sentence 抗日战争 contains {0: [0, 1, 3]}: starting at position 0, words end at positions 0, 1 and 3, that is, the spans 0~0, 0~1 and 0~3 — 抗, 抗日 and 抗日战争 — are all words in dict.txt.
  2. Dynamic programming is used to find the maximum-probability path and the maximum segmentation combination based on word frequency. On the DAG above, the dynamic program computes the maximum probability from right to left, one position at a time, until the maximum-probability path, and hence the maximum-probability segmentation combination, is obtained.
  3. For out-of-vocabulary words, an HMM model based on the word-forming ability of Chinese characters is used, decoded with the Viterbi algorithm;
    1. The out-of-vocabulary (OOV) words mentioned here are simply words not recorded in the dictionary dict.txt. They are handled with an HMM model; see the previous post for HMM basics.
    2. The key model parameters are λ = (A, B, π). After training on a large corpus, the author obtained the three files in the finalseg directory: the initial state probabilities π, i.e. the probability that a word begins in a given state — only two are possible, B or S — which form the initial state vector of the HMM, stored in prob_start.py; the hidden-state transition matrix A, i.e. the transition probabilities between the positional states of characters within words (the four BEMS tags: B = begin, E = end, M = middle, S = a single-character word), stored in prob_trans.py; and the observation emission matrix B, i.e. the probability of emitting a character from a positional state — for example P(狗|M) is the probability of the character 狗 ("dog") appearing in the middle of a word — stored in prob_emit.py.

After reading this much you may still be as confused as I was at first; follow along below to see the concrete implementation of each step.

3.2 Efficient word-graph scanning based on a prefix dictionary: generating a directed acyclic graph (DAG) of all possible word formations of the characters in a sentence

3.2.1 Trie Prefix tree

jieba ships with a dictionary called dict.txt containing 349,046 entries; each line holds a word, its occurrence count (trained by the jieba author from the People's Daily corpus and other resources), and its part of speech, as shown in the figure below.

In dictionary-based Chinese word segmentation, the dictionary-matching algorithm is the foundation. To keep segmentation fast, a good dictionary lookup structure is needed.

Here the trie organization is introduced. Word-graph scanning based on a prefix dictionary means storing those 340,000-plus words in a trie data structure. A trie is also called a prefix tree or dictionary tree: words whose first few characters are the same share a prefix, so they can be stored along the same path of the tree, which makes lookups fast.

In a trie (digital search tree), each node holds a single character. If a word is longer than one character, the node for its first character holds a pointer to the node for the next character, and so on. This builds a layered tree: the first level contains the first character of every word, the second level contains the second characters, and so on. Clearly, the maximum height of the trie is the length of the longest word in the dictionary.
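
As a minimal illustration of the prefix-dictionary idea (newer versions of jieba actually replaced the explicit trie with a flat prefix dictionary of this kind), here is a hedged toy sketch in Python. The sample entries and the helper name build_prefix_dict are invented for this example and are not jieba's real data or code:

# A toy prefix dictionary mimicking dict.txt-style lines (word, count, part of speech).
# The four entries below are made up for illustration.
sample_dict = """抗 20 v
抗日 50 j
抗日战争 30 nz
战争 100 n"""

def build_prefix_dict(dict_text):
    """Map every word, and every prefix of a word, to its count (0 for pure prefixes)."""
    freq, total = {}, 0
    for line in dict_text.splitlines():
        word, count = line.split()[:2]
        freq[word] = int(count)
        total += int(count)
        # register all proper prefixes so a lookup can stop early when no word can follow
        for i in range(1, len(word)):
            freq.setdefault(word[:i], 0)
    return freq, total

FREQ, TOTAL = build_prefix_dict(sample_dict)
print(FREQ.get("抗日战争"))  # 30   -> a real word
print(FREQ.get("抗日战"))    # 0    -> only a prefix, not a word itself
print(FREQ.get("和平"))      # None -> neither a word nor a prefix

Registering prefixes gives exactly the trie property described above: while scanning forward through a sentence, as soon as the current fragment is not even a prefix, no longer word can start at that position.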

3.2.2 DAG Directed acyclic graph

A DAG is a directed acyclic graph. The phrase above — "generate a directed acyclic graph of all possible word formations of the characters in a sentence" — means: given a sentence to be segmented, match all of its possible words and assemble them into a word graph, which is a DAG, as shown below:

Figure: the DAG word graph, in which each edge is a word

So how is this done concretely? It takes two steps:

  1. Generate the trie (prefix dictionary) from the dict.txt dictionary;
  2. For the sentence to be segmented, generate the DAG according to the trie.

In plain words: for the sentence to be segmented, look it up against the given dictionary, generate every possible way of cutting it into words, and assemble them into a DAG like the one shown above.

As for the DAG implementation in the source code, the author records, for each word, the position of its first character in the sentence (from 0 to n-1, where n is the sentence length). A Python dictionary is built in which each start position is a key and the value is a Python list holding the end positions of the possible words (look the word up in the dictionary, then the start position plus the word length gives the end position). For example, {0: [1, 2, 3]} is a simple DAG meaning that starting at position 0, words end at positions 1, 2 and 3; that is, the character spans 0~1, 0~2 and 0~3 all exist as words in dict.txt.
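
Here is a hedged sketch of that construction, reusing the toy FREQ from the snippet above. The function name get_DAG mirrors jieba's, but the body is a simplified illustration rather than the library's actual source:

def get_DAG(sentence, freq):
    """For each start index k, list every end index i such that sentence[k:i+1] is a word."""
    dag = {}
    n = len(sentence)
    for k in range(n):
        ends = []
        i, frag = k, sentence[k]
        # walk forward while the fragment is still a known word or prefix
        while i < n and frag in freq:
            if freq[frag] > 0:       # a real word, not just a prefix
                ends.append(i)
            i += 1
            frag = sentence[k:i + 1]
        if not ends:                 # every single character is at least its own segment
            ends.append(k)
        dag[k] = ends
    return dag

print(get_DAG("抗日战争", FREQ))
# {0: [0, 1, 3], 1: [1], 2: [3], 3: [3]} with the toy dictionary above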

3.3 Dynamic programming finds the maximum-probability path and the maximum segmentation combination based on word frequency

While generating the dictionary trie, the author's code also converts each word's occurrence count into a frequency. A frequency is a decimal between 0 and 1: the number of times an event occurred divided by the total number of trials, so when the number of trials is large enough, frequency approximates probability (the limit of the frequency is the probability).

In the dynamic program, the words already found in the sentence (from the DAG) are looked up for their frequency (count / total). If a word is not found (since the search is dictionary-based, this should be rare), the frequency of the rarest word in the dictionary is used instead; that is, P(word) = FREQ.get(word, min_freq). The maximum-probability path is then found by dynamic programming, computing the maximum probability from right to left. (Left to right would also work; right to left is used because the head of a Chinese phrase usually comes last — modifiers tend to precede the core word — so computing from the right gives slightly higher accuracy, similar in spirit to reverse maximum matching.)

The state-transition (recurrence) equation is therefore:

P(Node_N) = 1.0, P(Node_{N-1}) = P(Node_N) * Max(P(last word)); proceeding this way from right to left eventually yields the maximum-probability path, i.e. the maximum-probability segmentation combination.

So in the DAG of the figure above, the maximum-probability segmentation obtained is 有 / 意见 / 分歧, i.e. the path 0-1-3-5.
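
Below is a hedged sketch of this right-to-left dynamic program, again built on the toy FREQ/TOTAL and get_DAG from the snippets above (jieba's real routine works in the same spirit on log frequencies derived from dict.txt; calc_route and segment are illustrative names):

from math import log

def calc_route(sentence, dag, freq, total):
    """route[i] = (best log-probability of segmenting sentence[i:], end index of its first word)."""
    n = len(sentence)
    route = {n: (0.0, 0)}               # the empty suffix has log-probability 0
    for i in range(n - 1, -1, -1):      # fill the table from right to left
        route[i] = max(
            # unknown or prefix-only fragments get a smoothed count of 1
            (log(freq.get(sentence[i:x + 1], 0) or 1) - log(total) + route[x + 1][0], x)
            for x in dag[i]
        )
    return route

def segment(sentence, freq, total):
    dag = get_DAG(sentence, freq)
    route = calc_route(sentence, dag, freq, total)
    words, i = [], 0
    while i < len(sentence):
        x = route[i][1]                 # end index of the best first word starting at i
        words.append(sentence[i:x + 1])
        i = x + 1
    return words

print(segment("抗日战争", FREQ, TOTAL))  # ['抗日战争'] with the toy dictionary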

3.4 For out-of-vocabulary words, an HMM model based on the word-forming ability of Chinese characters, decoded with the Viterbi algorithm

Out-of-vocabulary words (Out Of Vocabulary, OOV) here mean words not recorded in the dictionary dict.txt. In fact, even if every word in dict.txt were deleted, jieba could still perform segmentation, relying only on the HMM model. How? Chinese characters are labelled with the four BEMS states: B is the beginning position of a word, E is the end position, M is a middle position, and S (single) means the character forms a word on its own. In other words, every character of a Chinese word is marked with one of the states (B, E, M, S). For example, 北京 ("Beijing") is tagged BE — 北/B 京/E, 北 is the start position and 京 the end — and 中华民族 ("the Chinese nation") is tagged BMME: begin, middle, middle, end.

After training on a large corpus, the author obtained the three files in the finalseg directory.

The three probability tables to be estimated are:

1) Position transition probabilities, i.e. the transition probabilities among the four states B (begin), M (middle), E (end) and S (single-character word), stored in prob_trans.py; the table contents are:

{'B': {'E': 0.8518218565181658, 'M': 0.14817814348183422},
'E': {'B': 0.5544853051164425, 'S': 0.44551469488355755},
'M': {'E': 0.7164487459986911, 'M': 0.2835512540013088},
'S': {'B': 0.48617017333894563, 'S': 0.5138298266610544}}

For example, P(E|B) = 0.851 and P(M|B) = 0.149 mean that when we are at the beginning of a word, the probability that the next character ends the word is much higher than the probability that it is a middle character. This matches intuition, since two-character words are more common than longer words.

2) Emission probabilities from a position state to a character, e.g. P(和|M) is the probability of the character 和 appearing in the middle of a word; this table is stored in prob_emit.py;

3) The probability that a word begins in a given state — only two are possible, B or S — stored in prob_start.py. This is the initial state vector of the HMM.

In fact, the transitions between BEMS states are somewhat like a bigram (2-gram) model, i.e. transitions between adjacent units. A bigram model considers the probability of one word following another and is a special case of the N-gram model. For example, 北京 ("Beijing") is more likely to follow 中国 ("China") than 北海 ("Beihai") is; that is, 中国北京 is more probable than 中国北海 and is more likely to be a valid Chinese phrase.

Treat the sentence to be segmented as the observation sequence. For the HMM with the four BEMS states, the task is to find the best hidden BEMS state sequence, which requires the Viterbi algorithm. With the pre-trained HMM transition and emission probabilities and the dynamic-programming-based Viterbi algorithm, we can find the BEMS sequence that maximizes the probability; then, by recombining the characters of the sentence so that words start at B and end at E (or stand alone as S), we obtain the segmentation result.

For example, for the sentence 全世界都在学中国话 ("the whole world is learning Chinese") we might get the BEMS sequence S,B,E,S,S,S,B,E,S (this sequence is just an illustration, not necessarily the model's actual output). Grouping consecutive B…E spans into words and keeping each S as a single-character word, and aligning the tags with the characters of the sentence, gives 全/S 世界/BE 都/S 在/S 学/S 中国/BE 话/S — and the sentence has been segmented.
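
A hedged sketch of this recombination step; the tag sequence below is just the illustrative one from the paragraph above, not a real model output:

def tags_to_words(sentence, tags):
    """Group characters into words: a word starts at B (or stands alone as S) and ends at E."""
    words, start = [], 0
    for i, tag in enumerate(tags):
        if tag == 'B':
            start = i
        elif tag == 'E':
            words.append(sentence[start:i + 1])
        elif tag == 'S':
            words.append(sentence[i])
    return words

sentence = "全世界都在学中国话"
tags = list("SBESSSBES")                 # the example sequence from the text
print("/".join(tags_to_words(sentence, tags)))
# 全/世界/都/在/学/中国/话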

Next, let's look at how the Viterbi algorithm of the HMM finds the BEMS sequence.

3.4.1 HMM Basic concepts

To review, a typical HMM is a five-tuple:

  • StatusSet: the set of state values
  • ObservedSet: the set of observation values
  • TransProbMatrix: the transition probability matrix
  • EmitProbMatrix: the emission probability matrix
  • InitStatus: the initial state distribution

The HMM model makes three basic assumptions:

  • Limited-horizon (Markov) assumption (the current state depends only on the previous state):
    • P(Status_i | Status_{i-1}, Status_{i-2}, …, Status_1) = P(Status_i | Status_{i-1})
  • Homogeneity assumption (the transition probabilities do not depend on the time step):
    • P(Status_i | Status_{i-1}) = P(Status_j | Status_{j-1})
  • Observation-independence assumption (each observation depends only on the current state):
    • P(Observed_i | Status_i, Status_{i-1}, …, Status_1) = P(Observed_i | Status_i)

3.4.2 The HMM five-tuple in jieba segmentation

Back to the point: in jieba segmentation the five-tuple is:

  • The state set (StatusSet) is (B, M, E, S): {B: begin, M: middle, E: end, S: single}. Each state marks the position of a character within a word: B means the character begins a word, M means it is in the middle of a word, E means it ends a word, and S means the character is a word by itself. For example, 给你一个隐马尔科夫链的例子 ("give you an example of a hidden Markov chain") can be tagged 给/S 你/S 一个/BE 隐马尔科夫链/BMMMME 的/S 例子/BE 。/S
  • The observation set (ObservedSet) is the set of all Chinese characters (plus punctuation). The state values are what we want to infer: in HMM-based Chinese segmentation, the input is a sentence (the observation sequence) and the output is the state of every character in it.
  • The initial state probability distribution (InitStatus), e.g.:

B -0.26268660809250016 E -3.14e+100 M -3.14e+100 S -1.4652633398537678

The values are log probabilities (taking logarithms turns probability multiplication into addition). -3.14e+100 stands for negative infinity, i.e. a probability of 0.

That is, these are the probabilities that the first character of a sentence belongs to each of the four states {B, E, M, S}.

  • Transition probability matrix (TransProbMatrix). By the limited-horizon assumption above, a transition of the Markov chain depends only on the previous state — Status(i) is related only to Status(i-1) — which greatly simplifies the problem. The matrix is therefore a 4x4 two-dimensional matrix (4 being the size of the state set), with rows and columns both ordered over BEMS. (The stored values are logarithms of the probabilities.)
  • Emission probability matrix (EmitProbMatrix). By the observation-independence assumption above, P(Observed_i, Status_j) = P(Status_j) * P(Observed_i | Status_j), where P(Observed_i | Status_j) is read from the EmitProbMatrix.

3.4.3 Decoding with the HMM Viterbi algorithm

The five elements are tied together by the Viterbi algorithm: the ObservedSet sequence is the input of Viterbi and the StatusSet sequence is its output; in between, the algorithm needs the three model parameters InitStatus, TransProbMatrix and EmitProbMatrix.

Take the following sentence as an example:

小明硕士毕业于中国科学院计算所 ("Xiao Ming graduated with a master's degree from the Institute of Computing Technology, Chinese Academy of Sciences")

Defining the variables

The two-dimensional array weight[4][15]: 4 is the number of states (0:B, 1:E, 2:M, 3:S) and 15 is the number of characters in the input sentence. For example, weight[0][2] represents the possibility of the character 硕 under state B.

The two-dimensional array path[4][15]: again 4 states and 15 characters. path[0][2] records the state of the previous character at the point where weight[0][2] reaches its maximum; for example, path[0][2] = 1 means that when weight[0][2] takes its maximum, the previous character (i.e. 明) is in state E. The previous state is recorded so that, once the Viterbi algorithm has filled the whole weight[4][15] table, the state sequence can be recovered by backtracking through the sentence from right to left.

The concrete calculation proceeds as follows. From InitStatus we have:

B:-0.26268660809250016
E:-3.14e+100
M:-3.14e+100
S:-1.4652633398537678

and from the EmitProbMatrix we get:

Status(B) -> Observed(小)  :  -5.79545
Status(E) -> Observed(小)  :  -7.36797
Status(M) -> Observed(小)  :  -5.09518
Status(S) -> Observed(小)  :  -6.2475

So weight[i][0] can be initialized as follows:

weight[0][0] = -0.26268660809250016 + -5.79545 = -6.05814
weight[1][0] = -3.14e+100 + -7.36797 = -3.14e+100
weight[2][0] = -3.14e+100 + -5.09518 = -3.14e+100
weight[3][0] = -1.4652633398537678 + -6.2475 = -7.71276

Note that the formula above adds rather than multiplies, because logarithms were taken beforehand.

// Iterate over the sentence; index i starts at 1 because column 0 was already initialized above
for(size_t i = 1; i < 15; i++)
{
    // iterate over the possible states of the current character
    for(size_t j = 0; j < 4; j++) 
    {
        weight[j][i] = MIN_DOUBLE;
        path[j][i] = -1;
        // iterate over the possible states of the previous character
        for(size_t k = 0; k < 4; k++)
        {
            double tmp = weight[k][i-1] + _transProb[k][j] + _emitProb[j][sentence[i]];
            if(tmp > weight[j][i]) // keep the largest weight[j][i]
            {
                weight[j][i] = tmp;
                path[j][i] = k;
            }
        }
    }
}

Determine boundary conditions and path backtracking

The boundary conditions are as follows:

For each sentence, the state of the last character can only be E or S; it cannot be M or B.

So in this example we only need to compare weight[1][14] (for E) and weight[3][14] (for S).

In this case:

weight[1][14] = -102.492;

weight[3][14] = -101.632;

Since -101.632 > -102.492, state S beats state E, so path backtracking starts from path[3][14].

The backtracked state sequence (read from right to left) is:

SEBEMBEBEMBEBEB

Reversing it and grouping:

BE/BE/BME/BE/BME/BE/S

So the segmentation result is:

小明 / 硕士 / 毕业于 / 中国 / 科学院 / 计算 / 所 (Xiao Ming / master / graduated from / China / Academy of Sciences / computing / institute)

With this understanding of the whole pipeline, segmentation of out-of-vocabulary text can be carried out.
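
As a hedged complement to the C++ forward pass shown earlier, here is a short Python sketch of the backtracking step; weight and path are assumed to be lists of lists filled exactly like the 4x15 tables above (state order 0:B, 1:E, 2:M, 3:S), and the helper is illustrative rather than jieba's or cppjieba's actual code:

def backtrack(weight, path, n):
    """Recover the BEMS tag sequence from the filled weight[4][n] and path[4][n] tables."""
    states = ['B', 'E', 'M', 'S']
    # boundary condition: a sentence can only end in state E or S
    last = 1 if weight[1][n - 1] > weight[3][n - 1] else 3
    tags = []
    for i in range(n - 1, -1, -1):
        tags.append(states[last])
        last = path[last][i]     # state of the previous character
    tags.reverse()
    return tags

For the example sentence, this walks the path table back from path[3][14] to the first character, producing the reversed sequence and the BE/BE/BME/BE/BME/BE/S grouping described above.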

3.4.4 Further understanding Viterbi Algorithm

The Viterbi algorithm is a dynamic-programming solution for decoding hidden Markov models. The principle of dynamic programming says: suppose there is an optimal path and cut it into N segments; then each of the N segments must itself be the optimal path for its sub-problem — otherwise some other sub-path could replace it and form a path better than the optimal one, which is a contradiction.

Based on this principle, we only need to start at time t = 1 and recursively compute, for each time t, the maximum probability over the partial paths ending in each state i; once we reach t = T and have the maximum probability of the paths ending in each state, the optimal path is obtained by backtracking.

Suppose an HMM with hidden states {1, 2, 3} and observation sequence {red, white, red}, with model parameters as follows; solve for the maximum-probability hidden state sequence.

Computing the probabilities at t = 1

At t = 1 the observation is red; compute, for each of the states 1, 2 and 3, the probability of producing this observation:

From the above, the maximum local probability at t = 1 is attained at state 3; however, this node is not necessarily on the optimal path.

Computing the probabilities at t > 1

At t = 2 the observation is white and at t = 3 it is red; the probabilities are computed as follows:

As in the figure above, at t = 2, for state s = 1 we compute the local probability at time t from each possible state s = {1, 2, 3} at time t-1 and keep the largest, and likewise for the other states. In 0.1*0.5 the 0.5 is the probability of moving from hidden state 1 to hidden state 1; in 0.16*0.3 the 0.3 is the probability of moving from hidden state 2 to hidden state 1; and so on.

Recursion ends

At t = 3, the maximum probability p = 0.0147 is obtained, and the end point of the optimal path is i_3 = 3.

Backtracking the optimal path

Starting from the end point of the optimal path (state 3), look backwards for the best state at each previous moment:

(1) At t = 2: since i_3 = 3, the maximum probability of state 3 comes from state 3 (not shown in the figure above, but the calculation is analogous to that of state 1).

(2) At t = 1: since i_2 = 3, the maximum probability again comes from state 3.

Finally, the optimal path is (3,3,3)
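
Below is a hedged, self-contained Python sketch of this walkthrough. The parameter values are an assumption: they are the classic textbook values for this red/white example, used here because they reproduce the intermediate figures 0.1 and 0.16 as well as the final result p = 0.0147 with path (3, 3, 3) quoted above:

import numpy as np

# States 1, 2, 3 (indices 0..2); observations: 0 = red, 1 = white
A = np.array([[0.5, 0.2, 0.3],    # transition probabilities A[i][j] = P(next state j | state i)
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])
B = np.array([[0.5, 0.5],         # emission probabilities B[i][o] = P(observation o | state i)
              [0.4, 0.6],
              [0.7, 0.3]])
pi = np.array([0.2, 0.4, 0.4])    # initial state distribution
obs = [0, 1, 0]                   # red, white, red

def viterbi(A, B, pi, obs):
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))            # delta[t][i]: best probability of a path ending in state i at time t
    psi = np.zeros((T, N), dtype=int)   # psi[t][i]: best previous state on that path
    delta[0] = pi * B[:, obs[0]]        # t = 1: [0.1, 0.16, 0.28] with these parameters
    for t in range(1, T):
        for j in range(N):
            scores = delta[t - 1] * A[:, j]
            psi[t][j] = scores.argmax()
            delta[t][j] = scores.max() * B[j][obs[t]]
    # backtrack from the best final state
    path = [int(delta[T - 1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    path.reverse()
    return delta[T - 1].max(), [p + 1 for p in path]   # report states as 1..3

prob, path = viterbi(A, B, pi, obs)
print(prob, path)   # ≈ 0.0147 [3, 3, 3]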

4. Concrete implementation

All jieba segmentation goes through the function jieba.cut, which is the entry point for segmentation; it lives in jieba/__init__.py. The code is as follows:

# Main word-segmentation function of jieba; the return value is an iterable generator
 def cut(self, sentence, cut_all=False, HMM=True):
     '''
     The main function that segments an entire sentence that contains
     Chinese characters into seperated words.
     Parameter:
         - sentence: The str(unicode) to be segmented.
         - cut_all: Model type. True for full pattern, False for accurate pattern.
         - HMM: Whether to use the Hidden Markov Model.
     '''
     sentence = strdecode(sentence) # decode to unicode
     # choose the regular expressions for the current mode
     if cut_all:
         re_han = re_han_cut_all
         re_skip = re_skip_cut_all
     else:
         re_han = re_han_default
         re_skip = re_skip_default

     # choose the cut_block segmentation method for the current mode
     if cut_all:
         cut_block = self.__cut_all
     elif HMM:
         cut_block = self.__cut_DAG
     else:
         cut_block = self.__cut_DAG_NO_HMM
     # first split the sentence into blocks with the regular expression
     blocks = re_han.split(sentence)
     for blk in blocks:
         if not blk:
             continue
         if re_han.match(blk): # the block matched by re_han contains Chinese characters
             for word in cut_block(blk): # segment the block according to the chosen mode
                 yield word
         else: # split blk again with the re_skip regular expression
             tmp = re_skip.split(blk) # returns a list
             for x in tmp:
                 if re_skip.match(x):
                     yield x
                 elif not cut_all: # in accurate mode, output the remaining text character by character
                     for xx in x:
                         yield xx
                 else:
                     yield x

The parameter sentence is the text to be segmented; cut_all chooses the segmentation mode (True for full mode, False for accurate mode); HMM controls whether the HMM model is used for unknown words (enabled by default). A flow chart of the process can be drawn by following the cut function above.
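
As a quick usage note for this entry point (assuming jieba has been installed, e.g. with pip install jieba; the exact output depends on the dictionary version shipped with the library):

import jieba

sentence = "小明硕士毕业于中国科学院计算所"
print("/".join(jieba.cut(sentence, HMM=True)))     # accurate mode, HMM enabled for OOV words
print("/".join(jieba.cut(sentence, HMM=False)))    # accurate mode without the HMM
print("/".join(jieba.cut(sentence, cut_all=True))) # full mode: every word found in the DAG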

5. Other Chinese word segmentation tools

  • LAC, Baidu's open-source NLP tool
  • Alibaba Cloud's word segmentation and NER services
  • LTP from Harbin Institute of Technology
  • THULAC from Tsinghua University
  • Stanford CoreNLP (Chinese)
  • FNLP from Fudan University
  • ICTCLAS from the Chinese Academy of Sciences
