WFST decoding process
2022-07-28 20:10:00 【Hu Xi Hu Xi】
The result of WFST composition is static. A WFST consists of a set of states and directed transitions between those states. It must have a start state and at least one final state. By convention, a bold circle marks the start state and a double circle marks a final state.
WFST Viterbi decoding needs to keep track of the paths it has traversed, so the decoding process uses the token passing mechanism. Token passing is a more general version of Viterbi decoding.
My first understanding was that feeding a frame of speech into the acoustic model gives back the corresponding alignment, namely a transition-id. This is actually incorrect. During decoding, the acoustic model is consulted in order to look up the acoustic score for a given transition-id.
WFST Viterbi decoding is done frame by frame. First, the acoustic score of each frame is computed (the negated log-likelihood of the feature frame). It is then combined with the weight on the transition arc (the graph cost: pronunciation dictionary, language model and HMM transition probabilities) to get the accumulated cost of each extended path at each time step. These costs are stored in the cost field of a token.
The decoding process compares the accumulated costs stored in the tokens of different paths that arrive at the same state, keeps the path with the smaller cost, and updates the token information. A token is associated with a state node, and if the state has no token yet, a new one is created. Because the decoding graph is huge, tokens can spread along many paths, so there are multiple tokens at frame t. These tokens are distinguished by the stateID of the decoding graph node, so t and stateID together identify a unique token.
Token passing is carried out frame by frame and ends when the last frame has been processed. At that point, look at the tokens on all states and choose the best one or more of them. The information stored on those tokens lets us trace back the corresponding paths, which gives the recognition result. Backtracking produces a word-level lattice file, lattice-to-phone-lattice converts it into a phone-level lattice, and lattice-best-path then gives the final phone alignment file.
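The token passing steps described above can be sketched in a few lines of Python. This is only a toy illustration under simplifying assumptions, not Kaldi's decoder: the function name `token_passing`, the arc tuple layout, and the example costs are all made up here, there are no epsilon-input arcs, no pruning, and final states carry no extra cost.

```python
import math


def token_passing(arcs, start, finals, acoustic_costs):
    """Toy Viterbi decoding by token passing (hypothetical helper).

    arcs: list of (src, dst, ilabel, olabel, graph_cost)
    acoustic_costs: one dict per frame, mapping ilabel -> acoustic cost
    Returns (best_cost, output_labels), or (inf, []) if no final state is reached.
    """
    # A token is (accumulated_cost, backpointer);
    # the backpointer is (previous_token, olabel), or None at the start.
    tokens = {start: (0.0, None)}
    for frame in acoustic_costs:
        new_tokens = {}
        for src, dst, ilabel, olabel, graph_cost in arcs:
            if src not in tokens or ilabel not in frame:
                continue
            cost = tokens[src][0] + graph_cost + frame[ilabel]
            # Keep only the cheapest token per (frame, state).
            if dst not in new_tokens or cost < new_tokens[dst][0]:
                new_tokens[dst] = (cost, (tokens[src], olabel))
        tokens = new_tokens

    # After the last frame, pick the best token sitting on a final state.
    best = None
    for state, tok in tokens.items():
        if state in finals and (best is None or tok[0] < best[0]):
            best = tok
    if best is None:
        return math.inf, []

    # Trace back through the backpointers to recover the output labels.
    words = []
    tok = best
    while tok[1] is not None:
        prev, olabel = tok[1]
        if olabel != "<eps>":
            words.append(olabel)
        tok = prev
    words.reverse()
    return best[0], words


# Made-up two-frame example: two competing paths into final state 3.
example_arcs = [
    (0, 1, "a", "hi", 0.5),
    (0, 2, "b", "ho", 0.2),
    (1, 3, "c", "<eps>", 0.1),
    (2, 3, "c", "<eps>", 0.1),
]
example_frames = [{"a": 1.0, "b": 2.0}, {"c": 0.5}]
print(token_passing(example_arcs, 0, {3}, example_frames))
```

In this example both paths reach state 3 on the second frame, and the token that came through state 1 wins because its accumulated cost is lower.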
In practice it is more common for decoding to output a word lattice than only the single best path. In Kaldi, a word lattice is defined as a special WFST in which the weight of each transition consists of two values, representing the acoustic score and the language score respectively. As in HCLG, the input labels and output labels of the word lattice are transition-ids and word-ids.
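A two-valued weight like this can be sketched as a small Python class. The class and method names below are hypothetical and only loosely modeled on the idea of Kaldi's lattice weight: extending a path adds the two components separately, and choosing between alternatives keeps the one with the smaller total cost. Tie handling is simplified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatticeWeight:
    """Hypothetical two-part arc weight: graph (language) cost and acoustic cost."""
    graph_cost: float
    acoustic_cost: float

    def total(self):
        return self.graph_cost + self.acoustic_cost

    def times(self, other):
        # Extending a path: the two components accumulate independently.
        return LatticeWeight(self.graph_cost + other.graph_cost,
                             self.acoustic_cost + other.acoustic_cost)

    def plus(self, other):
        # Choosing between two alternatives: keep the cheaper one overall.
        return self if self.total() <= other.total() else other


path = LatticeWeight(1.0, 2.0).times(LatticeWeight(0.5, 0.25))
print(path, path.total())
```

Keeping the two components apart is what later makes it possible to change the language part of the score without touching the acoustic part.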

The input of H is the transition-id and the output is the triphone. (Triphone states can be tied, so that different HMM states share model parameters.)
The key question: why do we need state tying?
Answer: with 218 phonemes, a triphone model has 218 to the power of 3 triphones. (The middle phoneme can be any one of the 218, the left phoneme can be any one of the 218, and the right phoneme can be any one of the 218.)
Without clustering, we would need to build 218*218*218*3 mixture GMM models (assuming each triphone has 3 states).
On the one hand, the amount of computation would be huge. On the other hand, it would cause data sparsity. Therefore the states of the triphones are tied according to the characteristics of the data.
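The count above is easy to check with a couple of lines of Python. The function name is made up for illustration, and the figure of 3 states per triphone is the assumption stated in the text.

```python
def untied_model_count(num_phones, states_per_triphone=3):
    """Return (number of triphones, number of state GMMs) with no state tying."""
    num_triphones = num_phones ** 3  # left context * center phone * right context
    return num_triphones, num_triphones * states_per_triphone


triphones, gmms = untied_model_count(218)
print(triphones, gmms)  # 10360232 31080696
```

More than 31 million separate GMMs would each need enough training frames of their own, which is the data sparsity problem that state tying addresses.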
We used the phones.txt that ships with aishell, together with our own dictionary and text, to build our own HCLG, and then decoded:

Then we took the union of our own HCLG.fst and aishell's own HCLG.fst, and decoded again:

I found that decoding went wrong on the second sentence. My understanding at this point is that fstunion simply merges the original HCLG.fst and our HCLG.fst as a plain union, so when a frame of speech is decoded, both HCLGs are searched at the same time. When the second sentence is traced back, several paths may have similarly small costs, and together they form the word-level lattice file. lattice-best-path then picks the optimal path, and this path happens to lie in the original HCLG.fst. The word-ids on it refer to the original words.txt, which of course does not correspond to our own words.txt.
Then we removed the second sentence and decoded again with the FST produced by the union. The result agreed with decoding directly on our own HCLG.fst. This supports the idea that even after the union of two HCLGs, both are searched at the same time during decoding. Apart from the second sentence, the other sentences probably have a smaller cost on our own HCLG, so decoding on the union HCLG gives the same result as decoding on our own HCLG.
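The explanation proposed here can be illustrated with a toy sketch. It is not a reproduction of what Kaldi did in this experiment: the function names, costs, word-ids and symbol table below are all invented. The only idea it encodes is that a path through a union graph lies entirely in one of the two branches, so the best path is the cheaper of the two branch-wise best paths, and its word-ids belong to that branch's symbol table.

```python
def pick_union_best(best_paths):
    """best_paths: dict of graph name -> (cost, [word_ids]) for one utterance.

    Returns the name of the winning branch and its word-ids.
    """
    name = min(best_paths, key=lambda g: best_paths[g][0])
    return name, best_paths[name][1]


def ids_to_words(word_ids, symbol_table):
    """Map word-ids through a words.txt-style table (id -> word)."""
    return [symbol_table.get(i, "<unknown id %d>" % i) for i in word_ids]


# Hypothetical symbol table for our own small graph.
our_words = {1: "turn", 2: "on", 3: "light"}

# Hypothetical utterance where the aishell branch happens to be cheaper.
utterance = {"ours": (12.0, [1, 2, 3]), "aishell": (10.5, [2, 950])}
winner, ids = pick_union_best(utterance)
print(winner, ids_to_words(ids, our_words))
```

When the winning branch is the other graph, reading its word-ids through our own table gives either a wrong word or no word at all, which matches the mismatch described for the second sentence.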

On rescoring: after decoding with HCLG, a larger language model is used to correct the language model scores on the word lattice. The weights on the lattice are stored in two separate parts, the acoustic score and the graph score. The language score is mixed together with the HMM transition probabilities and the pronunciation probabilities of words with multiple pronunciations to form the graph score. Language model rescoring adjusts only the language score, so we need a way to remove the old language model score from the original graph score and then apply the new language model score.
In the rescoring stage we use the rescoring commands: lattice-lmrescore --lm-scale=-1.0 ark:lat "fstproject --project_output=true G.fst |" ark:nolm.lat (removes the old language model), then lattice-lmrescore --lm-scale=1.0 ark:nolm.lat "fstproject --project_output=true G_union.fst |" ark:newlm.lat (adds the new language model). We tried three versions of the new language model:
1. The first new language model was built from a text file made up of our own 12 sentences, together with our own small dictionary.
2. The second new language model used the text file made up of our own 12 sentences, together with a dictionary formed by combining our own small dictionary with aishell's large dictionary.
3. The third new language model used a text file formed by repeating our own 12 sentences several times, together with the dictionary formed by combining our own small dictionary with aishell's large dictionary.
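For a single path, the effect of the two rescoring passes can be written out as simple arithmetic. This is a per-path sketch with made-up numbers and a hypothetical function name. The real commands work on whole lattices by composing them with the projected G FST, so this only shows what the two scale values, -1.0 and 1.0, do to the graph part of the cost.

```python
def rescore_graph_cost(graph_cost, old_lm_cost, new_lm_cost):
    """Swap the language model part of a path's graph cost.

    graph_cost: old LM cost plus everything else in the graph score
    (HMM transition and pronunciation costs). The acoustic cost is untouched.
    """
    without_lm = graph_cost + (-1.0) * old_lm_cost  # first pass, lm-scale=-1.0
    return without_lm + 1.0 * new_lm_cost           # second pass, lm-scale=1.0


# Made-up path: 5.0 from the old LM plus 2.5 from transitions and pronunciations.
print(rescore_graph_cost(graph_cost=7.5, old_lm_cost=5.0, new_lm_cost=4.0))
```

The transition and pronunciation part of the graph cost survives both passes unchanged, and only the language model contribution is replaced.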

aishell_decode:



