
WFST decoding process

2022-07-28 20:10:00 【Hu Xi Hu Xi】

The result of WFST composition is a static graph. A WFST consists of a set of states and the directed transitions between them. It should have one start state and at least one final state. By convention, the start state is drawn as a bold circle and a final state as a double circle.

Viterbi decoding on a WFST has to preserve the traversed path, so the decoding process uses a token passing mechanism. Token passing is a more general version of Viterbi decoding.

My first understanding was: a frame of speech is fed in, and the acoustic model returns the corresponding alignment, i.e. the transition-id? This understanding is actually incorrect. During decoding, the acoustic model is consulted in order to look up the acoustic score by transition-id.

Viterbi decoding on a WFST proceeds frame by frame. First, the acoustic score of each frame is computed (the negated log-likelihood of the feature frame). It is then combined with the weight on the transition arc (the graph cost: pronunciation dictionary, language model and HMM transition probabilities) to give the accumulated cost of each extended path at each time step. These costs are stored in the cost field of a token.

The decoding process compares the accumulated costs stored in the tokens of different paths that reach the same state, keeps the path with the smaller cost, and updates the token information. A token is associated with a state node; if the state node has no token yet, a new one is created. Because the decoding graph is huge, tokens can propagate along many paths, so there are multiple tokens at frame t. These tokens are distinguished by the stateID of the node in the WFST decoding graph, that is, t and stateID together identify a unique token.

Token passing is carried out frame by frame and ends when the last frame has been processed. At that point we look at the tokens on all states and choose the best one or several. From the information stored in these tokens, the corresponding paths can be extracted by tracing back, which gives the recognition result. Backtracking yields a word-level lattice file. lattice-to-phone-lattice then converts it into a phone-level lattice, and lattice-best-path produces the final phone alignment file.
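The token passing procedure described above can be sketched in a few lines of Python. This is only a minimal illustration, not Kaldi's decoder: it assumes every arc consumes exactly one frame (no epsilon arcs), applies no beam pruning, and keeps one token per state. The names Token, token_passing and the toy graph are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Arc: (source state, destination state, input label, output label, graph cost)
Arc = Tuple[int, int, int, str, float]


@dataclass
class Token:
    cost: float               # accumulated cost (graph cost + acoustic cost)
    prev: Optional["Token"]   # back-pointer used for traceback
    olabel: str               # output label of the arc that created this token


def token_passing(arcs: List[Arc], start: int, finals: Dict[int, float],
                  acoustic_costs: List[Dict[int, float]]):
    """acoustic_costs[t][ilabel] is the negated log-likelihood at frame t."""
    tokens = {start: Token(0.0, None, "")}
    for frame in acoustic_costs:
        new_tokens: Dict[int, Token] = {}
        for src, dst, ilabel, olabel, graph_cost in arcs:
            tok = tokens.get(src)
            if tok is None or ilabel not in frame:
                continue
            cost = tok.cost + graph_cost + frame[ilabel]
            best = new_tokens.get(dst)
            # Keep only the cheapest token that reaches each state.
            if best is None or cost < best.cost:
                new_tokens[dst] = Token(cost, tok, olabel)
        tokens = new_tokens
    # After the last frame, pick the best token sitting on a final state.
    candidates = [(tok.cost + finals[s], s)
                  for s, tok in tokens.items() if s in finals]
    if not candidates:
        return None
    best_cost, best_state = min(candidates)
    words = []
    tok = tokens[best_state]
    while tok is not None:          # trace back through the back-pointers
        if tok.olabel:
            words.append(tok.olabel)
        tok = tok.prev
    return best_cost, list(reversed(words))


# Toy graph with invented labels and costs.
arcs = [(0, 1, 1, "hello", 0.5), (0, 1, 2, "hi", 0.25), (1, 2, 3, "world", 0.25)]
frames = [{1: 1.0, 2: 2.0}, {3: 0.5}]
result = token_passing(arcs, start=0, finals={2: 0.0}, acoustic_costs=frames)
print(result)  # (2.25, ['hello', 'world'])
```

In the first frame two arcs reach state 1, and only the cheaper token ("hello", cost 1.5) survives, which is exactly the comparison step described in the text.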

More commonly, decoding does not output only the single best path but a word lattice. In Kaldi, a word lattice is defined as a special WFST in which the weight of each transition consists of two values: the acoustic score and the graph score (which contains the language model score). As in HCLG, the input and output labels of the word lattice are the transition-id and the word-id.
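The two-valued weight can be pictured with a small Python sketch. It is a simplified illustration rather than Kaldi's actual weight class: the name PairWeight and the tie-breaking rule (keep the first argument) are assumptions made here. Extending a path adds both components, and choosing between two paths keeps the one with the smaller total cost.

```python
from typing import NamedTuple


class PairWeight(NamedTuple):
    """Hypothetical stand-in for a lattice weight with two components."""
    graph: float      # graph cost (LM, transition and pronunciation costs)
    acoustic: float   # acoustic cost

    def total(self) -> float:
        return self.graph + self.acoustic


def times(a: PairWeight, b: PairWeight) -> PairWeight:
    # Extending a path: the two components accumulate separately.
    return PairWeight(a.graph + b.graph, a.acoustic + b.acoustic)


def plus(a: PairWeight, b: PairWeight) -> PairWeight:
    # Choosing between paths: keep the one with the smaller total cost.
    return a if a.total() <= b.total() else b


path = times(PairWeight(1.0, 2.0), PairWeight(0.5, 0.5))
better = plus(PairWeight(1.0, 4.0), PairWeight(3.0, 1.0))
print(path)    # PairWeight(graph=1.5, acoustic=2.5)
print(better)  # PairWeight(graph=3.0, acoustic=1.0)
```

Keeping the two components apart is what later makes it possible to change the language model part without touching the acoustic part.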

The input of H is the transition-id and its output is the triphone. (Triphone states can be tied, so that different HMM states share the same model parameters.)

A probing question: why do we need state tying?

Answer: with 218 phonemes, a triphone model has 218 to the power of 3 triphones. (The middle phoneme can be any of the 218, and so can the left phoneme and the right phoneme.)
Without clustering, we would need to train 218*218*218*3 Gaussian mixture models (assuming each triphone has 3 states).
On the one hand, the amount of computation is huge. On the other hand, the data becomes sparse. Therefore the triphone states are tied according to the characteristics of the data.
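The size of the untied model follows directly from the numbers above; a short Python calculation makes it concrete. The variable names are chosen here for illustration only.

```python
num_phones = 218
states_per_triphone = 3  # assumption stated in the text

num_triphones = num_phones ** 3
num_untied_states = num_triphones * states_per_triphone

print(num_triphones)      # 10360232
print(num_untied_states)  # 31080696
```

More than 31 million separate GMMs would each need enough training frames, which is why most of them would be badly estimated without tying.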

We used the phones.txt that ships with aishell, together with our own dictionary and text, to build our own HCLG, and then decoded with it.

Then we took the union of our own HCLG.fst and aishell's own HCLG.fst, and decoded again.

I found that the problem appeared at the second utterance. My current understanding is that fstunion simply merges the original HCLG.fst and our HCLG.fst into one graph, so that when a frame of speech is decoded, both HCLGs are searched at the same time. When the second utterance is traced back, there may be several paths whose costs are all relatively small, and together they form the word-level lattice file. lattice-best-path then determines the best path, and this path happens to lie in the original HCLG.fst. The word-ids on it therefore correspond to the original words.txt and, of course, do not correspond to our own words.txt.

We then removed the second utterance and decoded again with the union FST. The result agreed with decoding directly on our own HCLG.fst. This suggests that even after the union, both HCLGs are still searched at the same time during decoding, and that for every utterance other than the second one, the cost on our own HCLG is probably the smaller one. That is why decoding on the union HCLG and decoding on our own HCLG give the same result.
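The explanation above can be illustrated with a toy Python example. All symbol tables, paths and costs below are invented for illustration and do not come from the actual experiment; the point is only that the best path of a union is the cheaper of the two best paths, and that word-ids from one graph are meaningless in the other graph's words.txt.

```python
# Hypothetical symbol tables (word-id -> word).
own_words = {1: "play", 2: "music"}
orig_words = {1: "please", 2: "pause", 3: "music"}

# Hypothetical best paths for one utterance: (word-id sequence, total cost).
own_best = ([1, 2], 12.0)
orig_best = ([2, 3], 9.5)


def best_in_union(*candidates):
    # The union contains the paths of both graphs, so its best path is
    # simply the cheapest candidate overall.
    return min(candidates, key=lambda c: c[1])


ids, cost = best_in_union(own_best, orig_best)
decoded_with_own_table = [own_words.get(i, "<unk>") for i in ids]
decoded_with_orig_table = [orig_words[i] for i in ids]

print(ids, cost)                # [2, 3] 9.5
print(decoded_with_own_table)   # ['music', '<unk>']
print(decoded_with_orig_table)  # ['pause', 'music']
```

Here the winning path comes from the original graph, so reading its word-ids through our own table gives the wrong words, which matches the behaviour suspected for the second utterance.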

On rescoring: after decoding with HCLG, a larger language model is used to correct the language model scores in the word lattice. The weights on the lattice are stored separately as the acoustic score and the graph score. The language model score is mixed together with the HMM transition probabilities and the pronunciation probabilities of words with multiple pronunciations to form the graph score. Language model rescoring only adjusts the language model score, so we need a way to remove the old language model score from the graph score and then apply the new language model score.
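The arithmetic behind this two-step rescoring can be sketched as follows. This is only a per-path illustration with invented numbers, not what the Kaldi tools do internally; the function name rescore and its parameters are hypothetical.

```python
def rescore(graph_cost: float, old_lm_cost: float, new_lm_cost: float,
            lm_scale: float = 1.0) -> float:
    """Replace the old LM contribution inside a graph cost with a new one."""
    # Step 1: subtract the old LM cost, leaving transition and
    # pronunciation costs untouched.
    without_lm = graph_cost - old_lm_cost
    # Step 2: add the (scaled) cost assigned by the new LM.
    return without_lm + lm_scale * new_lm_cost


# Invented example: graph cost 7.5 = old LM cost 4.0 + other costs 3.5.
new_graph_cost = rescore(graph_cost=7.5, old_lm_cost=4.0, new_lm_cost=2.5)
print(new_graph_cost)  # 6.0
```

The acoustic score is not involved at any point, which is why storing it separately from the graph score matters.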

In the rescoring stage we use the rescoring commands. To remove the old language model:

lattice-lmrescore --lm-scale=-1.0 ark:lat "fstproject --project_output=true G.fst |" ark:nolm.lat

To add the new language model:

lattice-lmrescore --lm-scale=1.0 ark:nolm.lat "fstproject --project_output=true G_union.fst |" ark:newlm.lat

1. The first new language model is built from a text file made up of our own 12 sentences and our own small dictionary.

2. The second new language model is built from a text file made up of our own 12 sentences and a dictionary formed by combining our own small dictionary with aishell's large dictionary.

3. The third new language model is built from a text file in which our own 12 sentences are repeated several times, and a dictionary formed by combining our own small dictionary with aishell's large dictionary.