FastCorrect: A Fast Error Correction Model for Speech Recognition | RTC Dev Meetup
2022-06-22 08:19:00 [Agora]
Preface
Voice processing is a key scenario in the field of real-time interaction. In the 「RTC Dev Meetup: Technical Practice and Application of Voice Processing in Real-Time Interaction」 event launched by Agora, technical experts from Baidu, Huanyu Technology, and Yitu gave talks on this topic.
This article is based on the talk given at the event by Tan Xu, research director at Microsoft Research Asia. Follow the official account 「Agora Developer」 and reply with the keyword 「DM0428」 to download the slides for the event.
Speech recognition error correction detects and corrects errors in the speech recognition results to further improve recognition accuracy. At present, most error correction models adopt an autoregressive structure based on the attention mechanism, whose high latency hinders online deployment.
This article introduces FastCorrect, a low-latency, high-accuracy error correction model. By using edit alignment and multiple candidate results, it reduces the word error rate by about 10% while accelerating the model 6-9x. The related research papers were published at NeurIPS 2021 and EMNLP 2021.

01 Background information
1、ASR (Automatic Speech Recognition)
Accuracy is the most critical factor limiting the wide application of speech recognition, so reducing the error rate of ASR is very important. There are many ways to improve accuracy and reduce the error rate; the traditional approach is to improve the core speech recognition model itself, and previous research has mainly focused on improving the modeling paradigm and training data of the recognition model. In fact, beyond improving the accuracy of the model itself, we can also post-process the recognition results to further reduce the error rate.
2、ASR post-processing
What can be done in the ASR post-processing stage? The first option is reranking. Speech recognition usually generates multiple candidate transcripts, and a reranking model can sort them and select a better one among the candidates as the final recognition result, improving accuracy. The second option is to correct errors in the recognition results, which can further reduce the error rate. Both are common post-processing methods for reducing the ASR error rate; this talk focuses on error correction.
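As a hedged illustration of the reranking idea, the sketch below scores each candidate with a toy word-frequency table standing in for a real language model; the table, scorer, and function names are invented for this example and are not from the talk:

```python
# Toy frequency table standing in for a real language model (illustrative only).
FREQ = {"I": 100, "have": 80, "cat": 10, "hat": 8, "bat": 2}

def score(sentence):
    """Score a candidate by summing word frequencies (a crude LM proxy)."""
    return sum(FREQ.get(tok, 0) for tok in sentence.split())

def rerank(candidates):
    """Pick the highest-scoring candidate as the final recognition result."""
    return max(candidates, key=score)

print(rerank(["I have bat", "I have cat", "I have hat"]))  # → I have cat
```

A production reranker would use a neural language model or acoustic/LM score fusion, but the selection step has this shape.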
3、 Why choose error correction
We chose error correction because it starts from the existing recognition results and can produce a transcript better than any of them, whereas reranking can only pick the best candidate among the results the recognizer already returned. If error correction works well enough, it therefore has an advantage over reranking.
02 Formulation of the ASR error correction task
Having explained the choice of technical scheme and why we chose error correction, let us define the error correction task. First, a training dataset (S, T) is given, where S is the input speech and T is the corresponding text annotation. The ASR model transcribes the speech into text, producing M(S). The pairs (M(S), T) form the training set on which the error correction model is trained. After training, given an ASR output M(S), the model returns the corrected result.
The error correction task is a typical sequence-to-sequence learning task: the input is the ASR transcript and the output is the corrected text. Since it is a sequence-to-sequence problem, previous work naturally treated it as a sequence modeling task, correcting errors with an encoder-attention-decoder model and autoregressive decoding: the input is the erroneous sentence and the output is the correct one.
Autoregressive decoding generates tokens one by one, for example first A, then the next word B, then C and D. The problem with this approach is that decoding is slow. In our measurements, an online ASR model has an average latency of about 500 ms on CPU; adding an autoregressive error correction model introduces an extra 660 ms of latency, slowing online recognition by more than 2x, as shown in Figure 1.

■ Figure 1
Such a scheme is clearly undesirable for actual deployment, so our goal was to reduce latency while maintaining correction accuracy. We accelerate with a non-autoregressive method: instead of generating one token at a time as above, all tokens are generated at once, which greatly improves decoding speed.
Because non-autoregressive decoding models are widely used in machine translation, we first tried a typical non-autoregressive machine translation model directly, and found that it did not reduce the ASR error rate — it actually increased it. Why? We observed that the non-autoregressive text error correction task in speech recognition differs from machine translation. In machine translation, for example when the input is Chinese and the output is English, every input token needs to be modified to translate the sentence. In the error correction task, however, most of the input sentence is already correct; that is, most words in the input do not need to be changed.
Applying the traditional method directly therefore easily causes two problems: missed corrections and false corrections. This is the central challenge of the error correction task: how to detect errors, and how to correct them, is the key to improving accuracy.
03 Why the naive NAR solution fails
We analyzed this problem in detail, hoping to find task-specific characteristics to guide the design of a dedicated non-autoregressive model. First, machine translation between different languages (for example Chinese to English) involves word reordering, because Chinese and English differ in word order. In the error correction task, by contrast, the text produced by the recognizer and the final correct text contain no reordering errors — the alignment is monotonic.
Second, a word can be wrong in several ways, for example insertion, deletion, and substitution errors. Based on these two pieces of prior knowledge, we can provide a more fine-grained error pattern to guide error detection and correction; this analysis inspired the design of the corresponding models.
04 Introduction to the FastCorrect model series
Microsoft has carried out a series of work on FastCorrect, including FastCorrect 1, FastCorrect 2, and FastCorrect 3, each targeting different problems and scenarios. FastCorrect 1, published at NeurIPS 2021, builds on the task priors analyzed above: it uses the text edit distance to provide insertion, deletion, and substitution guidance signals for correcting the ASR output. It corrects only the single best recognition result, although speech recognition can also produce multiple results via beam search decoding. FastCorrect 1 achieves a 7-9x speedup while delivering an 8% WERR (word error rate reduction). Although this WERR may look small, ASR accuracy is already very high, so an 8% relative reduction is in fact not easy to achieve.
Although speech recognition usually returns a single final candidate, multiple candidates are produced during decoding. If these candidates can provide mutually corroborating information, they can help with error correction. We therefore designed FastCorrect 2, published in EMNLP 2021 Findings, which exploits the synergy among multiple candidates to further reduce the word error rate. Compared with FastCorrect 1, it lowers the error rate further while maintaining a good speedup ratio.
Both works are open-sourced on Microsoft's GitHub ( https://github.com/microsoft/NeuralSpeech ); feel free to try them. The technical implementation details of the two works are introduced below.
1、FastCorrect
The core of FastCorrect is to exploit the prior knowledge in text error correction, namely information about insertion, deletion, and substitution operations. We first align the erroneous text with the correct text, guided by the text edit distance; the alignment tells us which words to delete, which to insert, and which to substitute. With these fine-grained supervision signals, modeling becomes much easier. For deletion we use the concept of duration: for each input word, the duration indicates in advance how many words it will become in the correct target sentence. A duration of zero means the word is deleted, one means it is kept unchanged or substituted, and two or more means an insertion occurs or the word is replaced by several words.
With such fine-grained supervision signals, the model performs better than the end-to-end, purely data-driven approach of machine translation. The non-autoregressive model design consists of three parts: the encoder takes the erroneous text as input and extracts information; the duration predictor predicts how many target tokens each source token should become; and the decoder generates the target tokens.
(1) Edit alignment
Next is edit alignment in FastCorrect. In Figure 2, the sequence on the left, BBDEF, is the ASR output, and the target sequence ABCDF is the actual correct result, which means the recognizer made errors. We align them by edit distance: an upward arrow indicates deletion, a leftward arrow indicates insertion, and a diagonal arrow indicates substitution.

■ Figure 2
Edit distance alignment yields several different paths, each with the same edit distance. For each path, we know how each source token aligns to each target token. We can then choose paths with a higher match degree; for example, paths a and b match better than path c, so we select the alignment relationship from paths a and b. These two paths yield three different alignments: in Align a, the token B corresponds to A and B, and the next B corresponds to C, and so on. Different paths give different possibilities: in Align b1, B may instead correspond to B and C, while in Align b2, D may correspond to C and D. We then check which combination is more common in a text corpus and choose a reasonable alignment based on word collocation frequency.
From the alignment of BBDEF and ABCDF at the bottom of Figure 2, we know how many target tokens each source token should become. For example, in Align b1, the first B becomes 2 tokens, the second B becomes 1, D becomes 1, E becomes 0, and F becomes 1. With these signals, we know exactly how many tokens each source token should be turned into.
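The alignment-to-duration step can be sketched with a standard Levenshtein table plus a backtrace. Note that this toy version follows one optimal path arbitrarily, whereas FastCorrect additionally scores tied paths by n-gram collocation frequency (omitted here), so it may pick a different but equally valid alignment than Align b1:

```python
# A minimal sketch of edit-distance alignment, assuming one character per
# token; the corpus-frequency path selection used by FastCorrect is omitted.

def align_durations(source, target):
    """For each source token, return how many target tokens it maps to
    (0 = deleted, 1 = kept or substituted, >=2 = covers an insertion)."""
    m, n = len(source), len(target)
    # Standard Levenshtein dynamic-programming table.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete source token
                           dp[i][j - 1] + 1,         # insert target token
                           dp[i - 1][j - 1] + cost)  # keep / substitute
    # Backtrace one optimal path, counting target tokens per source token.
    durations = [0] * m
    i, j = m, n
    while i > 0 or j > 0:
        sub_cost = 0 if i and j and source[i - 1] == target[j - 1] else 1
        if i and j and dp[i][j] == dp[i - 1][j - 1] + sub_cost:
            durations[i - 1] += 1          # kept or substituted
            i, j = i - 1, j - 1
        elif i and dp[i][j] == dp[i - 1][j] + 1:
            i -= 1                         # deleted: duration stays 0
        else:
            durations[max(i - 1, 0)] += 1  # insertion: attach to the neighbour
            j -= 1
    return durations

print(align_durations("BBDEF", "ABCDF"))  # one valid duration assignment
```

Whatever path is chosen, the durations always sum to the target length, which is exactly the invariant the duration predictor is trained to reproduce.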
(2) NAR model
As shown in Figure 3, the encoder takes the erroneous sentence as input, the duration predictor predicts how many words each source word will become, and the sentence is then expanded accordingly. For example, since the first B will become two words, we repeat B twice; a word with duration one stays in place; a word that is to be deleted is removed. The expanded sequence is then used as the decoder input and decoded in parallel. This is the core design of the model.

■ Figure 3
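Concretely, the duration-based expansion on the running BBDEF example can be sketched as below. Raw tokens are expanded here for clarity; the real model expands the encoder's hidden states, and the durations come from the learned predictor rather than being given:

```python
# A minimal sketch of duration-based expansion before parallel decoding.

def expand_by_duration(tokens, durations):
    """Repeat each source token `duration` times; duration-0 tokens are
    dropped, so the decoder never has to emit anything for them."""
    expanded = []
    for tok, dur in zip(tokens, durations):
        expanded.extend([tok] * dur)
    return expanded

# The first B becomes two tokens, E is deleted (duration 0).
print(expand_by_duration(list("BBDEF"), [2, 1, 1, 0, 1]))
# → ['B', 'B', 'B', 'D', 'F']
```

The expanded sequence has exactly the target length, so the decoder can rewrite all positions in parallel (here into ABCDF) instead of generating word by word.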
(3) Pre-training
In error correction model training, because the ASR word error rate is relatively low, erroneous cases are generally scarce and effective training data is insufficient, which degrades training. We therefore constructed pseudo paired data: erroneous input sentences paired with correct output sentences. Since relying only on the data provided by the speech recognition model is not enough, we forged such data at large scale for pre-training and then fine-tuned on the real speech recognition dataset. When forging data we simulated deletion, insertion, and substitution operations; to make the resulting error patterns close to real speech recognition, the probabilities of the three operations were set close to those of existing ASR models. We also give priority to homophones when substituting, because ASR errors are generally homophone errors. Such data helps the model train well.
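A hedged sketch of this pseudo-data construction is shown below, with a toy homophone table and made-up operation probabilities; the real pipeline tunes the probabilities to match the deployed recognizer's error statistics and uses a pronunciation lexicon rather than a hand-written table:

```python
import random

# Toy homophone table (illustrative only).
HOMOPHONES = {"their": ["there"], "there": ["their"]}

def corrupt(tokens, p_del=0.01, p_ins=0.01, p_sub=0.03, rng=random):
    """Return a noisy copy of `tokens`, to pair with the clean original."""
    noisy = []
    for tok in tokens:
        r = rng.random()
        if r < p_del:
            continue                                   # simulate a deletion error
        if r < p_del + p_sub and tok in HOMOPHONES:
            noisy.append(rng.choice(HOMOPHONES[tok]))  # homophone substitution
        else:
            noisy.append(tok)
        if rng.random() < p_ins:
            noisy.append(tok)                          # insertion error (duplicated word)
    return noisy
```

Running `corrupt` over a large clean-text corpus yields (noisy, clean) pairs for pre-training, after which the model is fine-tuned on real (M(S), T) pairs.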
(4) Experiments
Next, some experimental details. We focused on Chinese speech recognition and error correction, on both public academic data and Microsoft's internal speech recognition datasets, and selected about 400 million sentences for the pre-training model.

■ Figure 4
The experimental results are shown in Figure 4. The original speech recognizer obtains a word error rate of about 4.83. The autoregressive model mentioned above, i.e. encoder-attention-decoder, achieves a 15% reduction in word error rate, but its latency is relatively high. We also compared methods used in the past, including non-autoregressive machine translation models and text-editing methods. FastCorrect achieves a 13-14% word error rate reduction relative to the original recognition errors, close to the autoregressive model — that is, with almost no loss of correction capability — while its latency is 7x faster than the autoregressive model. FastCorrect therefore keeps the word error rate down while improving speed, meeting the standard for online deployment.
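For reference, WERR as used throughout these results is the relative reduction in word error rate. The corrected WER of 4.11 below is a hypothetical value chosen only to illustrate the ~15% figure, not a number from the paper:

```python
def werr(baseline_wer, corrected_wer):
    """Relative word error rate reduction (WERR)."""
    return (baseline_wer - corrected_wer) / baseline_wer

# Baseline 4.83 is from Figure 4; 4.11 is a hypothetical corrected WER.
print(round(werr(4.83, 4.11), 3))  # → 0.149, i.e. roughly a 15% WERR
```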
We also ran ablations on each module: the pre-training data construction method and the edit distance alignment. As the two datasets in Figure 5 show, removing any related module of FastCorrect causes a drop in accuracy, indicating that these modules are all useful.

■ Figure 5
The autoregressive model is an encoder-decoder whose decoder is the time-consuming part, since it decodes word by word autoregressively. One might ask: to speed up the autoregressive model, could we deepen the encoder and lighten the decoder, achieving the same speedup ratio while keeping accuracy? To answer this, we compared FastCorrect with different autoregressive variants, as shown in Figure 6: AR 6-6 denotes a 6-layer encoder with a 6-layer decoder, and AR 11-1 denotes an 11-layer encoder with a 1-layer decoder. FastCorrect is better: at a comparable word error rate, its speedup is far more pronounced, which dispels that question.

■ Figure 6
As mentioned above, detecting and correcting errors is crucial in text error correction. We also compared detection precision and recall, as well as correction ability. The comparison shows that FastCorrect indeed outperforms previous methods, verifying our earlier conjecture: fine-grained insertion, deletion, and substitution guidance signals derived from prior knowledge help the model detect and correct errors better.
2、FastCorrect 2
(1) Multiple candidates
FastCorrect 2 is an extension of FastCorrect 1. Because an ASR model usually produces multiple candidate sentences, the candidates provide extra information, called the voting effect. Suppose a speech segment yields three candidate sentences: "I have cat", "I have hat", and "I have bat". These three sentences give us additional information. First, the first two words are probably recognized correctly, because all three results contain "I have". The last words differ, which suggests that some or all of them may be wrong; but in general, the word's pronunciation ends with "at". With this information, the difficulty of detecting and correcting the error is greatly reduced: when correcting, we can choose a more reasonable word, narrowing the problem space. This is the design idea of FastCorrect 2.
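The voting effect can be illustrated with a simple position-wise tally over already-aligned candidates. This is a rule-based toy: FastCorrect 2 learns to exploit this signal inside the model rather than voting explicitly:

```python
from collections import Counter

def agreement(cands):
    """For each aligned position, report vote counts and whether all
    candidates agree (agreement suggests the word is likely correct)."""
    report = []
    for tokens in zip(*cands):
        counts = Counter(tokens)
        report.append((dict(counts), len(counts) == 1))
    return report

candidates = [["I", "have", "cat"],
              ["I", "have", "hat"],
              ["I", "have", "bat"]]

for counts, agreed in agreement(candidates):
    print(counts, "agreed" if agreed else "suspect")
```

Here "I" and "have" are unanimous, while the last position disagrees, which is exactly where correction effort should be focused.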
(2) Model structure
The resulting model design is shown in Figure 7. First, the multiple candidate sentences from the recognizer are aligned before being input, because only after alignment can they provide mutually corroborating information. In the earlier example, we need cat, hat, and bat aligned. The aligned sentences are concatenated as input to the encoder, which predicts the duration of each token in each sentence, i.e. how many words it becomes after correction. A selector, supervised by a loss, is also used to choose which candidate is better; correction is then performed based on that candidate. In Figure 7 the third candidate is the best, so it is taken as the decoder input. This is the high-level design of FastCorrect 2.

■ Figure 7
(3) Align multiple candidates
One detail is how to align multiple sentences so that they have an accurate correspondence. We pick any one candidate as the anchor and align the other sentences to it; we will not go into every detail here. The alignment method is the same as described for FastCorrect 1: first compute the edit distance, then obtain the edit paths, and select a reasonable alignment from those paths. In other words, once every sentence has been aligned to the anchor sentence, we have the alignment of all sentences to the anchor; finally the candidates are merged, forming a multi-way alignment that serves as the model input.
As a contrast, if instead of the FastCorrect 2 alignment method you use naive padding, you get the situation in Figure 8(b): the B's are gathered together, but C and D are mixed up. This is odd, because to the model C and D are unrelated. With this overly simple method of putting them in the same position, the model cannot obtain mutually corroborating signals — D, E, and F also get mixed together — so cat, hat, and bat cannot corroborate each other to help correction.

■ Figure 8
(4) Results
Now the results, shown in Figure 9. The first row is the error rate of the raw ASR output; the second row is the error rate after correction with the autoregressive model; the third row is the FastCorrect 1 result. We also evaluated some combined settings. As mentioned earlier, there are two post-processing approaches: reranking and error correction. Since there are multiple candidates and reranking selects among them, we combined the two methods: first select among the candidates by reranking, then correct with FastCorrect 1. Alternatively, given 4 candidates, correct each candidate separately and choose the better result as the final output. FastCorrect 2 instead aligns the multiple candidates and takes the mutual alignment as input.

■ Figure 9
Finally, FastCorrect 2 is better than FastCorrect 1 because it uses more information: in terms of word error rate, FastCorrect 2 delivers more than two additional points of WERR while maintaining good speed. Figure 9 also shows that the R+FC method has advantages, but it is expensive, because it corrects each of the many candidates separately and then reranks; this rules that method out, and we ultimately chose the FastCorrect 2 strategy.
When aligning the dataset, we consider aligning words that sound similar. In the earlier example, how do we align "cat" in "I have cat" with "hat" in "I have hat"? A very important element here is phonetic similarity: "hat" and "cat" sound very close, and giving priority to words with similar pronunciation builds a better alignment. Does the WER suffer if pronunciation similarity is not considered? As Figure 10 shows, removing pronunciation similarity does degrade the result slightly. So if words in the language sound alike and are easy to confuse, align those words together first; note again that naive padding is unreasonable in this setting.

■ Figure 10
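A toy stand-in for the pronunciation-similarity tie-breaker described above: real systems compare phoneme sequences from a pronunciation lexicon, while this sketch just compares letter positions as a rough proxy (the function and its score are invented for illustration):

```python
def phonetic_similarity(a, b):
    """Fraction of positions sharing a character — a crude proxy for
    comparing phoneme sequences from a pronunciation lexicon."""
    overlap = sum(x == y for x, y in zip(a, b))
    return overlap / max(len(a), len(b))

# "hat" should align with "cat" in preference to "have".
print(phonetic_similarity("hat", "cat") > phonetic_similarity("hat", "have"))  # → True
```

When several edit-distance paths have equal cost, preferring the pairing with the higher similarity score yields alignments like cat/hat/bat in the same column.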
We take multiple candidates as input for error correction — so are more candidates always better? Experiments show that the more candidates are used, the worse the latency becomes. As Figure 9 shows, increasing the number of candidates ultimately faces a trade-off between accuracy and latency.
Some may ask whether the gain is simply caused by more data, since compared with one-best correction, multiple candidate sentences are additionally used as model training input. We therefore ran a comparison in which the sentences are split up: four candidates corresponding to one correct sentence are split into four pairs, each pairing one candidate with the correct sentence, increasing the data volume fourfold. This method turned out not to reduce the error rate — it increased it — showing that the result is not caused by more data, but by the better correction enabled by the signals that reasonable alignment provides.
To reduce the ASR error rate and improve accuracy while keeping online latency acceptable, we developed the FastCorrect series of work. As Figure 11 shows, FastCorrect 1 and FastCorrect 2 achieve good results on both academic datasets and Microsoft's internal product datasets, with error rates comparable to the autoregressive correction model at much lower latency. If interested, you can follow our GitHub; we are continuing analysis and design on this problem, using related insights to build the FastCorrect 3 model with even better error detection and correction capability.

■ Figure 11
05 Microsoft's research achievements and projects in the speech field
Microsoft has also carried out a broad series of speech research, as shown in Figure 12, including front-end text analysis for speech synthesis, modeling of low-resource data in speech synthesis, improving inference speed for online deployment, improving the robustness of speech synthesis, and advancing speech synthesis capability overall.

■ Figure 12
Besides, we have extended speech synthesis to new scenarios, such as talking face generation: the input is voice and the output is video of talking faces and gestures. We also work on singing voice and instrumental sound synthesis, have conducted a detailed survey of the TTS field, and have given tutorial lectures. Recently we developed the speech synthesis system NaturalSpeech, whose generated speech reaches human-level quality; if you are interested in speech synthesis, feel free to reach out and exchange ideas.
Microsoft has also carried out work in AI music, such as traditional music information retrieval and understanding tasks, and music generation tasks (including lyric and song writing, style generation, arrangement, singing voice synthesis, and mixing). If you are interested in AI music, you can also follow our open-source projects, shown in Figure 13. On Azure, Microsoft provides speech synthesis, speech recognition, speech translation, and other services; if interested, you can visit the website shown in Figure 14.

■ Figure 13

■ Figure 14
The Machine Learning Group of Microsoft Research Asia is currently recruiting full-time researchers and research interns, in directions including speech, NLP, machine learning, and generative models. Welcome to join us!

06 Q & A
1、The relationship and differences between FastCorrect and BART
BART is a pre-trained NLP model for sequence-to-sequence tasks; it can perform machine translation and applies to any text-related sequence-to-sequence learning task. Text error correction is itself sequence-to-sequence learning, traditionally handled autoregressively, so BART can be used directly, since it also decodes autoregressively. FastCorrect, in contrast, solves the slow decoding of autoregressive methods: it is a non-autoregressive model that, unlike BART's word-by-word generation, emits the whole sentence at once, improving online inference speed. This is the core of our design, so from this perspective the two differ considerably.
2、Is there any scenario-specific design for error correction?
Beyond general-purpose speech recognition models, there are many customized scenarios whose data contains a large number of technical terms. To achieve better recognition, a knowledge base or adaptation mechanism can be introduced into error correction. Suppose a general ASR model is applied to legal, medical, or similar scenarios: the technical terms in these fields are rare, so the topic can be provided to the speech recognition model, telling it the relevant vocabulary of the current passage's scenario for reference during recognition; the same mechanism can be used for error correction. In addition, in the Chinese error correction scenario alignment is relatively easy, but in English and other languages one word may correspond to some characters of another word; how to design methods for these languages is a problem to consider during adaptation.
About the Agora Cloud Marketplace
The Agora Cloud Marketplace is a one-stop real-time interaction solution launched by Agora. By integrating the capabilities of technology partners, it provides developers with a one-stop development experience, covering the selection, pricing, integration, account opening, and purchase of real-time interaction modules. It helps developers quickly add various RTE features and bring applications to market quickly, saving 95% of RTE integration time.
Microsoft's real-time speech recognition (multilingual) service is now available in the Agora Cloud Marketplace. With this service, audio streams can be transcribed into text in real time, and it works seamlessly with speech translation and text-to-speech products and services.
You can click here to try it out.