
How to solve the difficulty of putting iterative semi-supervised training into practice in ASR? RTC Dev Meetup

2022-06-23 14:46:00 InfoQ

Preface

"Voice processing" is a very important scenario in the field of real-time interaction. In the "RTC Dev Meetup: Technical Practice and Application of Voice Processing in Real-Time Interaction" event launched by Agora, technical experts from Microsoft Research Asia, Agora, and Shumei Technology gave talks around this topic.

This article is based on the talk given at the event by Li Tian, NLP technical director at Shumei Technology. Follow the official account "Agora Developer" and reply with the keyword "DM0428" to download the slides (PPT) for the event.


01 The necessity of semi-supervised training in ASR

Although general-purpose ASR already achieves very high character accuracy, specific scenarios (games, private chat, group chat, live streaming) still suffer from scenario mismatch, which makes applying general-purpose ASR to these domains difficult. The main problems are as follows.

1. Scarcity of labeled data
Annotations for the target scenario are hard to obtain; it is usually impossible to quickly collect the large number of labeled samples a business scenario requires. Even when raw samples are easy to gather, labeling them remains very difficult because annotation costs are high. When starting a project or defining a product direction, you will find that the data problem has to be solved before the ASR task itself. In the past, with phoneme/text split systems, the required data volume was small; today's end-to-end systems, however, often need on the order of 1,000 hours of data just to get started. Whether you label the data yourself or hire a well-known data vendor, that cost is hard to accept before the product has even launched.
2. Instability of annotation quality
In wake-word or Siri-style interaction scenarios, the user knows the back end will transcribe their speech; in most business scenarios, however, ASR transcription is imperceptible to the speaker.

For example, when talking to Siri, if Siri does not hear you clearly, you will try again and articulate more clearly. At the real business level, though, the customer usually does not know that the back end is running ASR transcription — live streaming platforms, for instance. Transcription there may serve content-moderation requirements, so it is impossible to tell the streamer that their voice is being transcribed and that they should enunciate more clearly. Unclear enunciation and broken syntax make the resulting annotation quality very unstable.

So how do we solve these labeling problems? Our business covers a large number of similar social scenarios across the Internet and faces a wide variety of data and domain-specific terms, so obtaining such annotations is very difficult and their quality is hard to guarantee. But unlabeled data from these scenarios can be easily obtained from the same sources, so we believe a semi-supervised scheme is the ideal choice.

If you have worked in NLP or CV, you will already have a clear definition of semi-supervision. In ASR, especially end-to-end ASR, it is currently divided into two families: Self-training and Pre-training. Other approaches are less common, or have not yet achieved a good landing in the ASR field.

The Self-training line mainly revolves around the well-known pseudo-labeling technique, and its core scheme is based on the logic of consistency regularization. In theory, a pseudo label is a noisy version of the true label; training the model on pseudo labels and true labels together is itself a noise-resistant training process that lets the model learn step by step. Pre-training is simpler. If you come from NLP you will know it well: the original goal is to learn a better representation of the target domain on data from that domain. The task usually revolves around reconstructing the meaning or content of the representation and needs no extra labels, so the pre-training task can be built from unlabeled data without human transcriptions; afterwards, the human-transcribed data of the target scenario is used for the ASR training task.
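The self-training loop described above can be sketched in a few lines. This is a toy illustration, not the ASR pipeline itself: the "model" here is a one-dimensional nearest-centroid classifier, standing in for an acoustic model whose pseudo labels would be decoded transcripts.

```python
# Toy sketch of the pseudo-labeling (self-training) loop: train on labeled
# data, label the unlabeled pool, keep confident predictions, retrain on the
# mixture of true and (noisy) pseudo labels, and repeat.

def fit_centroids(samples):
    """'Train': compute the mean feature of each class."""
    sums, counts = {}, {}
    for x, y in samples:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(centroids, x):
    """Return (label, confidence); confidence shrinks with distance."""
    label = min(centroids, key=lambda y: abs(x - centroids[y]))
    return label, 1.0 / (1.0 + abs(x - centroids[label]))

def self_training(labeled, unlabeled, rounds=3, threshold=0.5):
    model = fit_centroids(labeled)                 # cold start on labeled data
    for _ in range(rounds):
        pseudo = []
        for x in unlabeled:                        # generate pseudo labels
            y, conf = predict(model, x)
            if conf >= threshold:                  # keep only confident ones
                pseudo.append((x, y))
        model = fit_centroids(labeled + pseudo)    # retrain on the mixture
    return model

labeled = [(0.0, "a"), (0.2, "a"), (1.0, "b"), (1.2, "b")]
unlabeled = [0.1, 0.15, 1.1, 1.05]
model = self_training(labeled, unlabeled)
print(predict(model, 0.05)[0])  # → a
```

The confidence threshold plays the role of the pseudo-label filtering discussed later in this article.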

02 The development of semi-supervised training in ASR

1. Self-training
Generally speaking, Self-training comes from CV. Since Pseudo-label was first proposed at ICML 2013, various new systems have emerged. "Learning with pseudo-ensembles" (2014, the first such system) merged pseudo labels with model ensembles; "Regularization with Stochastic Transformations and Perturbations for Deep Semi-Supervised Learning" (2016) argued that the pseudo labels themselves should be generated from different perturbations of the same model; and "Mean teachers are better role models: Weight-averaged consistency targets" (2017) focused on generating higher-quality labels, using model weight averaging to obtain a better teacher model and thus guarantee pseudo-label quality.

As early as the 2014 and 2016 papers, ideas later popular in contrastive learning in CV were already present — the formulas and arguments are almost identical in many respects. One could say the development of technology moves in historical cycles.
2. Pre-training
Pre-training is mainly concentrated in NLP. CV, of course, also has systems such as the ladder network that contain pre-training concepts, but the field where pre-training developed best is still NLP. The core reason is that the underlying feature of NLP is the character — a very discrete system, hard to compare with the dense inputs of CV.

Within this line, NLP has developed over many years: from N-gram-based features in 1994, to NN-based systems, then to RNN and LSTM language models produced by redesigning the internal NN architecture; the transformer architecture appeared in 2017, and ELMo was born in 2018. Now, whether it is BERT or GPT, the various downstream tasks in NLP have fully validated the approach.
3. Semi-supervised development in ASR
It is generally split into two eras according to ASR's own timeline:

① The phoneme/text split era: in many cases people still use Kaldi as the underlying business-level ASR solution. The semi-supervised training logic of this scheme is to train the acoustic model into a general phoneme model, then output the text required by the specific business through a downstream language model or rescoring model, achieving a partially semi-supervised effect. In terms of process it is more like transfer learning. But after Alex Graves completed his doctoral thesis on CTC in 2013, end-to-end systems began to emerge. Two years later, the EESEN team applied CTC at the phoneme level, briefly bringing the phoneme/text split system back.

② The end-to-end era: the rise of LAS (Listen, Attend and Spell) systems and of CTC/LAS + LM hybrid systems let end-to-end models begin to surpass Kaldi and traditional phoneme/text split architectures in accuracy, data efficiency, model quality, and inference speed, and the industry stepped into the end-to-end era. In chronological order: CTC, Deep Speech, Listen, Attend and Spell, and hybrid CTC/attention.

After 2017, with Watanabe's CTC/attention hybrid proposal and the release of the ESPnet framework, end-to-end systems matured enough to be applied to various industrial businesses. ESPnet provides a decoding framework as flexible as lattice-based combination: its hypotheses-route design gives subsequent shallow fusion a more flexible integration scheme. If you have ever used ESPnet, you will have seen that the whole hypotheses-path design is very flexible — various techniques can be introduced to jointly score or rescore the routes.

Because phonemes and similar primitives are no longer used, and because CTC and Seq2Seq training costs are very high while real labeled data is hard to obtain, the end-to-end system's dependence on data gradually became the core bottleneck for deployment. If you worked on ASR at a large company in the early days, especially 2015-2016, your practical experience would have been: only consider end-to-end once you have 1,000+ hours of data.

Thus, how to constrain the end-to-end data requirement became the later problem (from 2019-2020 onward): optimize end-to-end training and thereby solve the deployment problem. It is also the core concern of academia and industry. Since then, ASR-oriented Pre-training and Self-training stepped onto the stage of history. Relevant research existed before, but its influence was small; only in 2019 and 2020, when Facebook AI published two very promising papers showing these two directions could be industrialized, did people start to pay attention.

"wav2vec: Unsupervised pre-training for speech recognition" is Facebook's pre-training solution. Its principle is very close to word2vec: it uses negative sampling to train a prediction task over representations of future timesteps. Because the trained representations can serve as features for any downstream audio task, this system is a very important audio technology foundation used by many large companies in the industry.
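The negative-sampling objective can be sketched as an InfoNCE-style contrastive loss: the context representation should score the true future-timestep representation higher than sampled distractors. This is a toy sketch with plain vectors; real wav2vec uses learned convolutional encoders and many negatives drawn from the same audio.

```python
import math

def contrastive_loss(context, future, negatives):
    """InfoNCE-style objective with negative sampling: softmax over the true
    future frame and the sampled negatives, scored by dot-product similarity;
    return the negative log-probability assigned to the true future."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scores = [dot(context, future)] + [dot(context, n) for n in negatives]
    log_z = math.log(sum(math.exp(s) for s in scores))
    return log_z - scores[0]

# The loss is lower when the context actually predicts the future frame:
ctx = [1.0, 0.0]
good = contrastive_loss(ctx, [1.0, 0.0], negatives=[[0.0, 1.0], [-1.0, 0.0]])
bad = contrastive_loss(ctx, [0.0, 1.0], negatives=[[1.0, 0.0], [-1.0, 0.0]])
```

Minimizing this loss pushes the encoder toward representations that are predictive of the near future, which is why they transfer well to downstream tasks.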

"Self-training for end-to-end speech recognition", from Jacob's team at Facebook AI, comprehensively analyzes the practical effect of the pseudo-label system on ASR. They gave strong baselines for pseudo labeling on several core English ASR datasets, and were the first to systematically lay out the core problems the pseudo-label system must solve to land in the ASR field.
4. Pre-training vs. Self-training in ASR
In 2020, as the number of customers gradually increased and more and more scenarios were covered, we faced the need to build a separate ASR for some specific scenarios in order to obtain better model quality than competitors. Simply keeping the phoneme/text structure and swapping language models could not achieve the desired effect across the various domains. At the same time, building a separate end-to-end ASR for every scenario was unacceptable in terms of annotation cost. So we began to consider choosing between Pre-training and Self-training.

Originally we considered adopting systems similar to other large companies', such as wav2vec for pre-training. But after many attempts to run wav2vec in practice, the cost proved very high: the downstream post-pretraining on the target domain, plus the pre-training itself, takes a very long time, which prolongs the model iteration cycle. More importantly, during the pre-training + post-pretraining stage there is no ASR model output at all, which is hard to accept for new business scenarios that require rapid iteration.

Based on the above trade-off, we ultimately preferred the Self-training technical solution for our business: every model trained under Self-training can be evaluated — use it first, optimize it later — which is a business-friendly property.
5. The recent development track of Self-training in ASR
After anchoring on Self-training, we have been researching and tracking this field since 2020. We found that Facebook, Google, and Mitsubishi have done it best; others, such as the veteran ASR company Nuance and some universities, also publish improvements or studies of specific problems. In 2020, the main research threads were as follows:

(1) 2020

Facebook:

SELF-TRAINING FOR END-TO-END SPEECH RECOGNITION,

END-TO-END ASR: FROM SUPERVISED TO SEMI-SUPERVISED LEARNING WITH MODERN ARCHITECTURES,

ITERATIVE PSEUDO-LABELING FOR SPEECH RECOGNITION

The research thread: strong baselines and study of simple pseudo labeling in the CTC framework; the effect of simple pseudo labeling in the CTC/attention hybrid architecture; and the study of multi-round iterative pseudo-label systems.

Google:

Because Google's iterative pseudo-labeling work has a very strong technical background in CV, they immediately presented their multi-round iterative pseudo-label + model ensemble scheme: Noisy Student Training (NST), which took that year's Librispeech 100+860 SOTA. Of course, iterative training has many pitfalls, in particular the explosion in the number of data experiments caused by multiple rounds of iteration; this is addressed explicitly in our own scheme.

Mitsubishi:

The Iterative mode first trains a teacher, then carries out multiple rounds of pseudo-labeling training; each pseudo-labeling round requires re-labeling internally, which makes training very cumbersome over many rounds. So starting in 2021 we gradually saw on-the-fly approaches in various places; for example, Mitsubishi proposed MPL in 2021 (evolved from mean teacher). However, on-the-fly means labels must be generated in real time, and the quality of ASR labels is directly tied to the computational cost of decoding. Ordinary CTC greedy search is fast but produces poor transcriptions, while the more common shallow fusion scheme, which uses multiple models to jointly score the decode, is basically impossible to run in real time during training. So, generally speaking, the final effect of the on-the-fly mode is actually inferior to the Iterative mode.
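To make the speed/quality trade-off concrete, here is the CTC greedy search mentioned above: a single argmax pass per frame followed by the CTC collapse rule (merge repeats, drop blanks). It is fast precisely because it considers only one path and no language model — which is also why its transcriptions are worse than shallow fusion's.

```python
def ctc_greedy_decode(probs, blank=0):
    """Greedy CTC decoding: take the argmax token per frame, then collapse
    consecutive repeats and drop blank tokens.
    probs: per-frame lists of scores over the vocabulary (blank is index 0)."""
    path = [max(range(len(frame)), key=frame.__getitem__) for frame in probs]
    out, prev = [], None
    for tok in path:
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return out

frames = [[0.1, 0.8, 0.1],    # argmax: 1
          [0.1, 0.7, 0.2],    # 1 (repeat, merged)
          [0.9, 0.05, 0.05],  # 0 (blank, dropped)
          [0.1, 0.1, 0.8],    # 2
          [0.2, 0.1, 0.7]]    # 2 (repeat, merged)
print(ctc_greedy_decode(frames))  # → [1, 2]
```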

Others:

Salesforce staged a "renaissance", using pseudo-label training again on the EESEN framework, with labels generated by CTC greedy search. Nuance, as a veteran ASR technology vendor, interpreted the theory of FixMatch to argue that the theoretical essence of semi-supervision is actually consistency training.

(2) 2021

Mitsubishi:

Because of the defects of the on-the-fly mode, Mitsubishi published advanced MPL in 2021, returning to the Iterative mode. They split the teacher model from the subsequent on-the-fly training process, and at the same time switched to the Conformer architecture, which is more robust for audio. It finally surpassed Google's NST scheme and currently ranks second.

Facebook:

In 2021 Facebook AI used a cache mechanism: another process decodes in parallel during model training; when the cache of decoded results is full, training switches to joint training on the cached data and the labeled data; after N steps the cache is emptied and decoding starts again. So although Facebook AI calls it on-the-fly mode, it is essentially still a notion of rounds. Using a 36-layer transformer, they obtained the current Librispeech 100+860 SOTA — it can even rival direct supervised ESPnet training on Librispeech 960.

03 The problems our semi-supervised solution addresses

1. Iterative or on-the-fly
Given our quality requirements and the conclusions of current academia and industry, we finally anchored our technical direction on the Iterative mode.
2. The problems of the Iterative mode
But Iterative-mode training is very cumbersome: pseudo-label data must be regenerated after every round of training, and according to Google's and Facebook's experience, achieving good results requires multiple iterations.

Each iteration therefore has three problems. First, how to generate high-quality pseudo-label data? This is essentially the simplest problem: we have all kinds of decode algorithms — use whichever works best. Second, how to filter out high-quality pseudo-label data? Since we do not know which labels are correct, there will always be some errors no matter how high the quality, so we must study how to reduce their proportion and by what means. Third, the biggest problem in the whole Iterative mode: how to balance labeled and unlabeled data.
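The three per-round problems can be made explicit as pluggable steps in a skeleton of the Iterative mode. This is a structural sketch with toy plumbing, not the actual training code: `decode` generates pseudo labels, `keep` filters them, `mix` balances labeled vs. pseudo data before retraining.

```python
def iterative_pseudo_labeling(train, decode, keep, mix, labeled, unlabeled,
                              rounds=5):
    """Skeleton of the Iterative mode with the three per-round problems as
    callables: generation (decode), quality filtering (keep), and
    labeled/unlabeled ratio balancing (mix)."""
    model = train(labeled)                       # cold start on labeled data
    kept_per_round = []
    for r in range(rounds):
        pseudo = decode(model, unlabeled)        # problem 1: generation
        good = keep(pseudo)                      # problem 2: filtering
        model = train(mix(labeled, good, r))     # problem 3: balancing
        kept_per_round.append(len(good))
    return model, kept_per_round

# Toy plumbing (illustrative only): a "model" is just a sample count, and a
# pseudo label is (utterance, hypothesis, confidence).
train = lambda data: {"n": len(data)}
decode = lambda model, xs: [(x, "hyp", 0.9) for x in xs]
keep = lambda ps: [p for p in ps if p[2] >= 0.8]
mix = lambda lab, good, r: lab + good
model, hist = iterative_pseudo_labeling(train, decode, keep, mix,
                                        labeled=[("x1", "y1")],
                                        unlabeled=["u1", "u2", "u3"])
```

Each round retrains from the balanced mixture, which is exactly why the ratio experiments described next become so expensive: every choice of `mix` is a full training run.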

Google's NST system needs five iterations, and the labeled-to-unlabeled ratio differs in each round: roughly 2:7 in the second round and 1:3 in the third. On Librispeech 100+860, a labeled:unlabeled ratio around 1:3 was verified to be reasonable — but the ratio differs across task lines. Facebook's experiments on the Librispeech+LibriVox dataset show the ratio needs to be 1:10 or above. This means that in actual business deployment the experimental cost is huge. With, say, five iterations, each round requires multiple data experiments with different ratios, then model selection and decode evaluation, and then again multiple ratio experiments in the next round — iterated over five rounds. Because ASR training is expensive, the pace of each round is very painful.

In addition, with limited annotation, how do we cold-start the model? The initial labeled training data is generally very small — in the Iterative mode it is typically only about 1/10 of the available data — so cold starting becomes a core problem.

04 The improved NIPL solution

Based on these questions, we developed our own solution, published as "Improved Noisy Iterative Pseudo-Labeling for Semi-supervised Speech Recognition". Below we briefly walk through it.
1. Model framework
Since 2020 we have no longer used the Kaldi system, switching instead to an ESPnet-like self-developed framework. For the model framework — the shared encoder in front of CTC and the LAS decoder — we use transformers throughout. The left side of Figure 1 is the diagram from Watanabe's CTC/attention hybrid paper; the right side introduces our model framework. As for model parameters: the SharedEncoder's subsampling layer is currently a 2-layer (3*3, 512-channel) CNN with stride 2 — this may differ slightly from the ESPnet version, but is basically the same. The transformer we currently use is 12 layers, 8 heads, 512 dimensions, with FFN size 2048, which is about the same as most transformer-based acoustic models. In addition, the AttentionDecoder uses a 6-layer transformer whose parameter configuration is the same as the encoder's.

For the language model, we added an extra 6-layer transformer LM; its other parameter configuration is the same as BERT: 12 heads, 768 dims, FFN 3072. That is the overall model framework.


■ Figure 1
2. Other general settings
Our experimental data is Librispeech 100+860: the 100-hour split as labeled data, the 860-hour split as unlabeled data. The LM data is the Librispeech training transcripts plus the official 8-million-entry extra text corpus. Our acoustic features are 100-dim Fbank + 3-dim pitch. To reduce the number of text labels we used BPE, reducing the word inventory to 7,002 pieces, which shrinks the final output layer and speeds up CTC training.

For the training configuration's learning rate: the schedule is similar to the transformer schedule, with one difference — instead of decaying until the very end, we decay to the final stable value 5,000 steps in advance and then hold it there for a while. This ties directly into the model-stability techniques below: letting training run stably during that period allows model averaging to keep up.
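The schedule just described can be sketched as follows — linear warmup, inverse-sqrt decay, and then a flat plateau reached ahead of the end of training so that model averaging runs over a stable window. All constants here are illustrative, not the talk's actual values.

```python
def lr(step, peak=1e-3, warmup=25000, floor=5e-5, floor_step=80000):
    """Transformer-like schedule with an early plateau: linear warmup to
    `peak`, inverse-sqrt decay, then the decay is cut off at `floor_step`
    (ahead of the end of training) and held flat at `floor`.
    Constants are illustrative assumptions."""
    if step < warmup:
        return peak * step / warmup        # linear warmup
    if step >= floor_step:
        return floor                       # flat plateau for model averaging
    return max(floor, peak * (warmup / step) ** 0.5)  # inverse-sqrt decay
```

Holding the rate constant at the end is what makes the averaged checkpoints from that window mutually consistent.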
3. How to generate pseudo labels on unlabeled data
The decode algorithms currently common in industry with relatively high quality are the shallow fusion and deep fusion families. We used shallow fusion, merging the acoustic models CTC and LAS with the LM in the search, with beam size 50. The overall process is roughly the same as ESPnet's, with two small changes:

First, we use CTC greedy search to decide the end of a sentence, while ESPnet does not — it has its own end-detect algorithm.

Second, we do not prune paths aggressively, but instead keep as many paths as possible.
4. How to select high-quality pseudo-label data for the next round of semi-supervised training
When pseudo labels are first generated, much of the data is of unflattering quality — especially early in training, such as the first or second round of NST or Iterative Labeling, when the model's WER on librispeech dev and test may be close to 9 or 10 points or more.

In this case, Google and Facebook take a rough percentile cut: similar to the hypothesis scores in ESPnet, probabilities accumulated during decoding are ranked from low to high and, say, the top 90% are kept. But the confidence distribution may have a cliff: the first 85% of the data have very similar probability distributions, and then somewhere between the 85th and 95th percentile the probabilities suddenly drop by several points. To address this, we sample by a distribution test: we first assume the scores follow a Gaussian distribution, then keep only the data inside the two-sided 90% or 95% Gaussian confidence interval for training. Note that a two-sided 90%/95% confidence interval does not mean retaining 90% or 95% of the data — it means retaining the data inside the confidence interval under the Gaussian assumption, which is usually less than a direct 90% cut.
5. How to balance the labeled/unlabeled ratio so the model does not overfit either the labeled or the pseudo-labeled data
Balancing the labeled/unlabeled ratio is the biggest problem in multi-round iterative semi-supervised training. None of the previous studies showed how to perform ratio selection; they only give the approximate ratio for their own tasks. Facebook works on Librispeech 960+LibriVox, with ratios between 1:10 and 1:54; Google works on Librispeech 100+860, with a ratio around 1:3.

These findings cannot guide ratio selection in actual production. Take a live-streaming ASR with 100 hours as the starting point: many homologous unlabeled hours may be easy to obtain, but in what ratio should the unlabeled and labeled data be mixed so that the model does not overfit to the unlabeled data, and how should training be run to keep it stable and effective? Answering that would require endless data experiments. Of course, with enough machines one can brute-force those experiments, but most teams do not have as many machines as Google and Facebook.

So how can each business line get guidance here? We carried out detailed experiments with qualitative and quantitative analysis on Librispeech 100/860 and derived a guideline — currently very accurate — that can teach you how to choose the data balance. Let us first make a hypothesis, directly related to why pseudo-label semi-supervised training works at all. When training on pseudo labels, labeled and unlabeled data are mixed together, and for some of the pseudo-labeled data we do not know whether the label is correct, so training should be made as "conservative" as possible along certain characteristics — do not overfit to wrong or tail data. But a certain diversity of samples must also be ensured, because a completely conservative model will settle into what it believes is the data-level optimum and fall into a local optimum. Multi-round iterative training exacerbates this process, leading to over-training and overfitting.

To identify where to be conservative and where to ensure diversity, we profile the data along three dimensions: first, audio length; second, text/pieces length; third, the distribution of the labels themselves. The problem then becomes: along which dimensions should training be kept conservative, and along which should sample diversity be ensured? Based on this we ran large-scale experiments. Every time new pseudo labels are generated, we build multiple candidate training sets at different ratios. Before each round of training, we compare every candidate with the previous round's training data along these three dimensions and rank all candidates — the 1:2 candidate gets a three-dimensional ranking against the previous round's data, as do the 1:4, 1:5, and 1:6 candidates, and so on.

For the ranking scheme: because frame length and pieces length are one-dimensional statistics, we used the KS test. The label distribution itself is multi-dimensional, so we normalize the term frequencies and use Euclidean distance to evaluate the distribution difference between the current round's data and the previous round's, ranking each candidate.

After many experiments, we found a very clear rule: a smaller difference in the pieces distribution itself, combined with larger differences in the frame-length distribution and the pieces-length distribution, generally leads to a better next-round model. This logic can be described as a general paradigm, as shown in Figure 2.

■ Figure 2
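The two comparison tools named above — the two-sample KS statistic for the one-dimensional length distributions and the Euclidean distance over frequency-normalized label distributions — can be sketched with the standard library:

```python
import bisect
from math import sqrt

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    two empirical CDFs (used for the one-dimensional frame-length and
    pieces-length distributions)."""
    a, b = sorted(a), sorted(b)
    def cdf(xs, v):
        return bisect.bisect_right(xs, v) / len(xs)   # fraction of xs <= v
    return max(abs(cdf(a, v) - cdf(b, v)) for v in set(a) | set(b))

def label_distance(freq_a, freq_b):
    """Euclidean distance between frequency-normalized label (pieces)
    distributions — the multi-dimensional comparison described above."""
    na, nb = sum(freq_a.values()), sum(freq_b.values())
    keys = set(freq_a) | set(freq_b)
    return sqrt(sum((freq_a.get(k, 0) / na - freq_b.get(k, 0) / nb) ** 2
                    for k in keys))
```

Under the rule above, a candidate would be preferred when its `label_distance` to the previous round is small while its `ks_statistic` on the length distributions is large.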
6. Tricks to ensure the model does not overfit to wrong pseudo labels during training
This is a key point we found in the whole system, with two dimensions. The first is the data dimension: we added SpecAugment and SpecAugment++ to make the data more generalized. At the model level, similar to MPL, we generate online and offline models, choosing the online result in the early rounds and the offline result later; generally, after the fifth round the offline result is stably better than the online one. In addition, we raise the dropout gradually from 0.1 to 0.3, because pseudo-label training carries a high risk of overfitting; beyond about 0.4 there is no further gain.
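The per-round anti-overfitting settings just described can be collected into a small config helper. The ramp step and the exact round at which the offline (averaged) model takes over are illustrative assumptions; the text only gives the 0.1 → 0.3 range, the ~0.4 ceiling, and "after the fifth round".

```python
def round_config(r, start=0.1, step=0.05, cap=0.3, offline_from=5):
    """Anti-overfitting settings for pseudo-label round r: dropout ramps
    from `start` toward `cap` across rounds, and the offline (averaged)
    model is preferred from round `offline_from` onward.
    `step` and `offline_from` are illustrative assumptions."""
    return {
        "dropout": min(cap, start + r * step),
        "use_model": "offline" if r >= offline_from else "online",
    }
```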

7. How to cold-start supervised training with a limited number of labeled samples for the best effect

We used two-stage training: a first stage with dropout 0.1 for 30 epochs, matched with a second stage with dropout 0.13 for 100 epochs, gives the best effect. The detailed experimental results are shown in Figure 3. They also reveal a pattern: in a cold start, you should begin with few epochs and a smaller dropout to fit the target quickly, then raise the dropout to a more generalization-friendly configuration and train more rounds to reach the optimum. This cold-start mode is basically on par with the cold-start results of Google's NST system.

■ Figure 3
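The two-stage cold-start recipe above can be written down as a simple per-epoch dropout schedule, using the values reported in the text (30 epochs at 0.1, then 100 epochs at 0.13):

```python
def cold_start_schedule(stage1=(30, 0.1), stage2=(100, 0.13)):
    """Two-stage cold start: a short low-dropout stage to fit the labeled
    data quickly, then a longer, slightly more regularized stage.
    Returns the dropout value to use at each epoch."""
    (e1, d1), (e2, d2) = stage1, stage2
    return [d1] * e1 + [d2] * e2

sched = cold_start_schedule()
```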

Finally, the overall effect of improved NIPL. As of the Interspeech 2022 submission deadline, on Librispeech 100+860 only two teams beat us. The first is Mitsubishi's MPL: their conformer reaches 3.8%/8.2% (test-clean/test-other WER), but controlling for the same transformer architecture, Mitsubishi only reaches 4.8%/10.1%, while we reach 3.93%/9.59%. The other is Facebook's slimIPL: its 36-layer transformer achieves 3.8%/7.5% without any language model, and 2.7%/5.2% with a language model and rescoring. That effect is beyond our expectations: ESPnet's supervised training on the full Librispeech 960 reaches 96.96 accuracy, i.e. 3.04% WER, which means Facebook achieves 2.7%/5.2% using only the 100-hour labels, without the 860-hour transcripts.


05  Q & A

1. What is the effect compared by WER?

Our test-clean is 3.93, test-other is 9.59; we then continued NIPL training for rounds 7 and 8, and test-other dropped further. Test-clean stays at 3.93, but test-other has so far come down to about 9.3. Mitsubishi's conformer is 3.8%/8.2% — lower than our 3.93 — but their transformer is 4.8%/10.1%. Facebook's slimIPL is 3.8%/7.5%; we are a little incredulous about slimIPL, as the effect is a bit scary. So we should be third in the world, better than the NST paper Google published in 2020.

2. On the use of CTC

When CTC first appeared, training was hard to optimize and the data-volume requirements were strict, so CTC usage at the time consisted of odd tricks. EESEN, mentioned above, used CTC to train phonemes and then still attached a WFST like everyone else. Because the number of phonemes is much smaller than the number of words, this greatly reduced the difficulty of CTC training, making it competitive in some settings with criteria such as MMI and LF-MMI. Bare end-to-end CTC ASR has very high data costs.

If you had asked this question in 2020, I would have recommended trying the EESEN project for a new business. But now it is 2022, and the industrial use of CTC has changed greatly. Watanabe's paper tells you that the CTC/LAS hybrid system can work very well, and the data requirements are no longer as demanding as for bare CTC, because the LAS side has many optimization techniques that help training. So CTC+LAS is the relatively standard scheme today. If you do not have your own ASR training platform, I suggest trying ESPnet or WeNet; if streaming recognition is your core business demand, WeNet is the first choice.

Activity Notice

"RTC Dev Meetup - Hangzhou" will focus on big-front-end technology, inviting technical experts from Agora, Ant Group, and Hikvision to share business architectures and cross-platform practices for real-time interaction in the big-front-end era.

Actions speak louder than words — scan the QR code or click here to sign up!



Original article: https://yzsam.com/2022/174/202206231350354756.html

Copyright notice: this article was created by [InfoQ]; please retain the original link when reposting.