
Implementing and Deploying Any-to-Any Real-Time Voice Conversion | RTC Dev Meetup

2022-06-21 16:12:00 · Agora

Preface

Voice processing is an important scenario in the field of real-time interaction. At the 「RTC Dev Meetup: Technical Practice and Applications of Voice Processing in Real-Time Interaction」 event, technical experts from Agora, Microsoft, and Sumi gave talks on this topic.

This article is based on the talk given at the event by Feng Jianyuan, an audio experience algorithm expert at Agora. Follow the official account 「Agora Developer」 and reply with the keyword 「DM0428」 to download the slides (PPT) from the event.


01 Limitations of real-time voice conversion based on traditional audio effects

1. What voice conversion changes

■ Figure 1

Identifying a person by their voice involves many dimensions.

First, everyone's vocal tract is different: the opening and closing of the mouth and the vibration of the vocal cords in the throat differ acoustically from person to person, so every voice has a distinct timbre. On top of this, the way we use language produces different prosody. Second, people speak in different rooms, so different reverberation may also color the voice and affect identification. In addition, voice conversion is sometimes used for singing, which may further require instrumental accompaniment and some knowledge of music theory.

Moreover, our perception of sound is also psychological: the same voice may sound magnetic to one listener but rough to another. Psychological perception varies with personal experience, so there will always be some differences.

2. Why people have different timbres

So how do we change a voice in real time? Because the latency requirements are strict, we cannot alter long-range prosody or the choice of words and sentences, but we can change the timbre produced by the vocal tract.

For example, take the sentence "What's the date today?" (今天是几号). The words can be broken down into phonemes; "今" ("jin", the first syllable of "today") decomposes into the two phonemes "j" and "in". It is a voiced sound, produced by vocal-cord vibration, and on such sounds speakers' timbres differ considerably. "是" ("is"), by contrast, is a voiceless sound, produced by airflow shaped by the lips and teeth, and on such sounds speakers' timbres differ very little. Why is that?

■ Figure 2

In essence, the vibration frequency of the vocal cords determines the pitch of the voice. Different people produce the same sound with different vocal-cord vibration frequencies, so their timbres differ greatly on voiced sounds. In the example above, different speakers may pronounce the words of "What's the date today?" quite differently, while the difference on "is" may be small.

Traditional conversion algorithms start from this observation. Different people have different fundamental frequencies, determined by the rate of vocal-cord vibration, and the distribution of the fundamental frequency can be adjusted. Different fundamental frequencies produce different harmonics, which are integer multiples of the fundamental. Changing the pitch in this way changes the voice.
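To make the fundamental-frequency picture concrete, here is a minimal sketch of estimating F0 and voicing from a recording with librosa's pYIN implementation. The file path is a placeholder, and the typical F0 ranges in the comments are common reference values, not measurements from this talk.

```python
import librosa
import numpy as np

# Placeholder path: substitute any mono speech recording.
y, sr = librosa.load("my_voice.wav", sr=16000)

# pYIN estimates F0 per frame and flags voiced frames.
# Roughly 80-180 Hz is typical for adult male speech,
# 160-300 Hz for adult female speech.
f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=60, fmax=400, sr=sr)

voiced_f0 = f0[voiced_flag & ~np.isnan(f0)]
print(f"median F0: {np.median(voiced_f0):.1f} Hz")
print(f"voiced frames: {voiced_flag.mean():.0%}")

# Harmonics sit at integer multiples of F0, e.g. for F0 = 120 Hz:
print("first harmonics:", [120 * k for k in (1, 2, 3, 4)])
```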

■ Figure 3

When we make different sounds, the opening and closing of the mouth determine the resonances of the vocal tract. Different degrees of opening enhance or attenuate different parts of the frequency response: frequencies that coincide with a resonance of the mouth are boosted, while others are attenuated. This is the principle of formants. For voiced sounds, vocal-cord vibration produces the fundamental frequency and its harmonics, and how those harmonics are distributed across frequency is shaped by the opening and closing of the vocal tract. Thus, changing both the fundamental-frequency distribution and the frequency response adjusts the timbre.

3. Voice conversion with traditional audio effects

Traditional algorithms adjust the different dimensions of the voice with different effect units; Figure 4 shows the effects commonly used. Take pitch modification: most pitch-change algorithms use pitch shifting to move the fundamental and its harmonics up or down. In this case, the shift moves the formants at the same time, sliding the spectral energy up or down along with the fundamental frequency.
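As a hedged illustration of that point, the sketch below applies a plain pitch shift with librosa. Because this kind of shift does not preserve the spectral envelope, the formants travel along with the fundamental, which is exactly the artifact described next. The file name and the shift amount are placeholders.

```python
import librosa
import soundfile as sf

y, sr = librosa.load("my_voice.wav", sr=16000)

# Shift up by 4 semitones: F0 and all harmonics move up,
# and so do the formants, since the spectral envelope is
# not preserved by this kind of shift.
y_up = librosa.effects.pitch_shift(y, sr=sr, n_steps=4)

sf.write("my_voice_up4.wav", y_up, sr)
```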

Female voices tend to have a higher fundamental frequency and male voices a lower one; pitch shifting can bring the two into the same range. But the opening and closing of the mouth also varies from word to word, so changing only the fundamental frequency does not give a convincing conversion. In the movie Despicable Me, one of the Minions speaks Spanish; raising the fundamental frequency turns it into a child-like voice, but it does not sound natural, and equalizer adjustment is also needed.

The previous example changes only the fundamental frequency. In practice, formants and the frequency response can also be adjusted with an equalizer or a formant filter; these are key modules for shaping timbre. In addition, a speaker trained in bel canto produces rich high-frequency content and a full sound when singing or speaking; from this perspective, the harmonics can be enhanced with an exciter.
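The exciter stage just mentioned can be sketched in a few lines. Below is a minimal, assumption-laden version of a harmonic exciter: high-pass the signal, drive it through a soft nonlinearity to create new harmonics, and mix it back in. The cutoff, drive, and mix values are illustrative, not tuned presets from any product.

```python
import numpy as np
from scipy.signal import butter, lfilter
import librosa
import soundfile as sf

y, sr = librosa.load("my_voice.wav", sr=16000)

# High-pass the band we want to "excite" (cutoff is illustrative).
b, a = butter(4, 3000, btype="highpass", fs=sr)
band = lfilter(b, a, y)

# A soft nonlinearity generates extra harmonics from that band.
drive = 4.0
excited = np.tanh(drive * band)

# Mix a little of the excited band back with the dry signal.
mix = 0.15
out = y + mix * excited
out /= max(1.0, np.abs(out).max())  # avoid clipping

sf.write("my_voice_excited.wav", out, sr)
```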

■ Figure 4

Most traditional algorithms chain effect units in series to change the voice along different dimensions. Common effect software includes MorphVOX Pro and Voicemod; they provide different effect units, and adjusting the parameters produces different character voices. Choosing a preset is really just stacking different effects. Because everyone's baseline voice is different, it is hard to make everyone sound the same, but directional adjustments are possible.

I used MorphVOX to convert male to female and female to male. Converting a male voice to a female voice with the default settings sounds like a Minion. Converting a female voice to a male voice sounds a bit dull, for example with a heavier nasal quality. This happens because lowering a female voice to a male voice compresses the whole spectrum downward, which loses a lot of high-frequency information.

Traditional effect units therefore have inherent shortcomings that make the conversion imprecise. Getting a better result with a traditional chain requires manual parameter tuning, and accurate timbre conversion is still out of reach; with presets alone, the result is unsatisfactory. Moreover, it is hard to turn an arbitrary voice into a specific target person's voice; only a professional sound engineer can come close.

02 ASR + TTS = VC? Exploring AI-based real-time voice conversion

1. ASR

Traditional voice conversion has many limitations, so can AI-based methods improve real-time conversion? We know that ASR can convert speech to text. In voice conversion, the semantic information must be preserved as well; adjusting the signal only globally may drop words or distort the meaning.

As shown in Figure 5, there are many recognition frameworks, such as hybrid systems like Kaldi. A hybrid framework combines an acoustic model, a pronunciation dictionary, and a language model to decode speech, simultaneously satisfying the two constraints of correct pronunciation and plausible language.

■ Figure 5

From this perspective, a hybrid pipeline is highly interpretable and easy to build, because many ready-made pronunciation dictionaries and language models are available. It also has drawbacks: the decoding process is complicated, and because the framework contains multiple models, a problem in any one of them skews the recognition.

To address this, many end-to-end speech recognition frameworks have emerged, such as ESPnet. As general-purpose models, given enough data they recognize better than hybrid systems. Training is also relatively simple: unlike a hybrid framework, there is no need to train an acoustic model and a language model separately and then decode; only paired speech and text data are required for direct end-to-end training. But the shortcomings are also obvious: end-to-end training is more demanding, requiring larger training sets with more accurate annotation. In addition, different scenarios may need customization and therefore more data, and for an end-to-end model a good enough corpus may simply not exist.

For voice conversion, however, we do not need to recognize text; accurate phoneme recognition is enough. This is where the requirement differs from a standard ASR model.

2. TTS

ASR recognizes the pronunciation; to have the same content spoken by a different voice, we need TTS. Many ready-made frameworks are available here as well. The process is essentially to normalize the text extracted by ASR, predict a spectrogram from it, and then generate the waveform with a vocoder, as shown in Figure 6 (a toy normalization sketch follows the figure). Google's Tacotron and Microsoft's FastSpeech both implement this and can synthesize speech with low latency. In the pipeline of Figure 6, FastSpeech can also skip the vocoder and generate the waveform directly, that is, go end to end from text to speech. Combining ASR with TTS in this way, we can change a voice.

■ Figure 6
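The text-normalization front end mentioned above can be illustrated with a deliberately tiny sketch: expanding digits and a couple of symbols before the spectrum-prediction stage. Real TTS front ends (Chinese ones especially) are far more elaborate; everything below is a toy assumption.

```python
import re

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
SYMBOLS = {"%": " percent", "&": " and"}

def normalize(text: str) -> str:
    """Toy normalization: spell out digits and a few symbols."""
    for sym, word in SYMBOLS.items():
        text = text.replace(sym, word)
    # Spell out each digit individually (real systems group numbers).
    text = re.sub(r"\d", lambda m: f" {DIGITS[int(m.group())]} ", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Room 42 is 80% full"))
# -> "Room four two is eight zero percent full"
```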

The vocoder technology in TTS is actually quite mature. What a vocoder does is predict the waveform from compressed spectral information. Figure 7 shows MUSHRA scores for different vocoders; the score can be read as the naturalness of the generated speech, a judgment of whether the conversion sounds natural.

We can see that sample-by-sample vocoders such as WaveNet and WaveRNN achieve good results in speech synthesis, close to real human speech in naturalness. In addition, HiFi-GAN and MelGAN achieve effects similar to WaveRNN while offering good real-time performance. Vocoder technology has developed to the point of producing good speech, which is the precondition for using it as a voice-conversion back end (a minimal classical-vocoder sketch follows Figure 7).

■ Figure 7
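To make "predicting speech from compressed spectral information" concrete, here is a minimal sketch that uses the classical Griffin-Lim algorithm as a stand-in vocoder: compress audio to an 80-band mel spectrogram, then reconstruct a waveform from it. Neural vocoders like HiFi-GAN perform this same inversion with far higher fidelity; the parameters below are common defaults, not the ones used in the talk.

```python
import librosa
import soundfile as sf

y, sr = librosa.load("my_voice.wav", sr=16000)

# Compress to an 80-band mel spectrogram, the kind of
# representation a neural vocoder is trained to invert.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80
)

# Griffin-Lim-based inversion: a classical (non-neural) vocoder.
y_hat = librosa.feature.inverse.mel_to_audio(
    mel, sr=sr, n_fft=1024, hop_length=256
)

sf.write("my_voice_vocoded.wav", y_hat, sr)
```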

3. VC

The basic steps of voice conversion via a cascade of ASR and TTS are shown in Figure 8. Speaker A says something; ASR extracts speaker-independent text or phoneme information; TTS then renders it in Speaker B's timbre, achieving the voice change.

■ Figure 8

In this pipeline, if the voice can only be converted from one specific person to another, it is a one-to-one system. But ASR is actually speaker-independent, so the same pipeline gives any-to-one: no matter who the speaker is, the ASR model recognizes the words accurately, and everyone becomes Speaker B. The ultimate goal is any-to-any: a single model that turns anyone's voice into any target voice, extending the conversion capability.

We know that many end-to-end models, such as CycleGAN, StarGAN, and VAE-GAN, can perform voice conversion within a closed set, so speakers in the training set can be converted among one another; that is also their designed limitation. If instead we can switch the timbre on the TTS side, customized voices become possible: the output can be anyone's voice. In this sense, the proposition ASR + TTS = VC is actually realizable.

To reach any-to-any, we need a shortcut: we cannot retrain the whole model every time a new person is added. Here we borrow the idea of transfer learning. Figure 9 shows a speech generation method based on transfer learning.

■ Figure 9

The idea of transfer learning is to run an embedding operation on the target voice with a speaker encoder module to extract timbre features, and then feed those features into the speech generation module, the synthesizer, so that speech with different timbres can be generated. Google demonstrated this earlier: adding a speaker encoder module to TTS enables different timbre changes, and when a new ID, that is, a new speaker, is added, only a minute or a few tens of seconds of the speaker's audio is needed to extract the speaker features and generate the corresponding voice. This is exactly the transfer learning approach.

There are now many speaker encoding methods, such as i-vector, x-vector, Google's GE2E (generalized end-to-end) model designed around its loss, as well as Deep Speaker (which performs better for Chinese) and RawNet (which performs better for Korean). So we can decompose the system: besides ASR and TTS, we add this speaker encoder module for adjustment, hit the target speaker's timbre, and achieve any-to-any.
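Whatever its flavor, a speaker encoder ultimately emits a fixed-length embedding per utterance, and timbre similarity reduces to vector similarity. The sketch below shows that final step with stand-in random embeddings; in a real system the vectors would come from a trained GE2E-style model, and the threshold mentioned in the comment is purely illustrative.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)

# Stand-ins for 256-dim embeddings from a trained speaker encoder.
emb_enroll = rng.normal(size=256)                    # target speaker's enrollment
emb_test = emb_enroll + 0.3 * rng.normal(size=256)   # same speaker, new clip
emb_other = rng.normal(size=256)                     # a different speaker

print("same speaker:", cosine_similarity(emb_enroll, emb_test))
print("different speaker:", cosine_similarity(emb_enroll, emb_other))
# A verification system would compare these against a tuned
# threshold (e.g. ~0.75 in some setups; purely illustrative).
```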

03 Algorithm implementation of a real-time voice conversion system

1. Algorithm framework of the real-time voice conversion system

Next, let's look at how the real-time voice conversion system is implemented, as shown in Figure 10 (a structural sketch follows the figure). The target speaker's corpus is fed to voiceprint recognition (the speaker encoding part) to obtain voiceprint features, while phoneme features are extracted by the speech recognition module. The spectrum conversion module then generates the spectral features, and the vocoder turns them into speech.

■ Figure 10
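The structure in Figure 10 can be written down as a minimal wiring sketch. Every method below is a stub standing in for a real trained model (streaming ASR phoneme extractor, speaker encoder, spectrum converter, vocoder); the interfaces and shapes are assumptions for illustration, not Agora's actual implementation.

```python
import numpy as np

class VoiceConverter:
    """Wiring of the Figure-10 pipeline; each stage is a stub."""

    def phonemes(self, audio: np.ndarray) -> np.ndarray:
        # Stub for streaming ASR: speaker-independent phoneme
        # posteriors, one vector per 10 ms frame (shape assumed).
        return np.zeros((len(audio) // 160, 72))

    def speaker_embedding(self, enroll_audio: np.ndarray) -> np.ndarray:
        # Stub for the (offline-trained) speaker encoder.
        return np.zeros(256)

    def convert_spectrum(self, ppg: np.ndarray, spk: np.ndarray) -> np.ndarray:
        # Stub for spectrum conversion: phoneme features plus the
        # target-speaker embedding in, mel frames out.
        return np.zeros((ppg.shape[0], 80))

    def vocode(self, mel: np.ndarray) -> np.ndarray:
        # Stub for the vocoder (e.g. a HiFi-GAN-class model).
        return np.zeros(mel.shape[0] * 160)

    def run(self, source: np.ndarray, target_enroll: np.ndarray) -> np.ndarray:
        spk = self.speaker_embedding(target_enroll)  # done once, offline
        ppg = self.phonemes(source)                  # online, streaming
        mel = self.convert_spectrum(ppg, spk)        # online
        return self.vocode(mel)                      # online

vc = VoiceConverter()
out = vc.run(np.zeros(16000), np.zeros(16000 * 30))
print(out.shape)  # one second in, one second out
```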

The pipeline actually has offline and online parts: the voiceprint recognition module, for example, can be trained offline. The main difficulties are:

① Real-time voice conversion must consider both compute and latency. On the compute side, speech recognition, spectrum conversion, and the vocoder together cost on the order of GFLOPS even with the fastest algorithms in the industry. Deploying that on a phone may not be feasible, but it can be done on a server.

② On the latency side: in real-time communication, if the conversion takes too long, responses feel delayed, and this happens more noticeably in an RTC environment. The end-to-end delay must therefore be as small as possible, generally no more than 1 second; beyond 1 second, the delay of the voice change becomes clearly perceptible. The end-to-end delay also includes encoding/decoding, capture, and playback on the link, so the delay of the algorithm itself must stay below 400 ms (a rough frame-level budget is sketched after this list).

③ For any-to-any conversion, whether the effect is stable and the error rate low enough is also a challenging direction. A speech recognition error rate of 5% is already good; a lower error rate may require a larger model or a more targeted scenario. In addition, timbre similarity depends on whether the voiceprint module extracts the timbre accurately, which is also a major challenge.
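Here is a rough, purely arithmetical sketch of how a 400 ms algorithm budget might be split across modules when each streaming module buffers some lookahead frames. The frame size and per-module lookahead counts are illustrative assumptions, not measured figures from the talk.

```python
FRAME_MS = 10  # assumed hop size of the streaming front end

# Assumed lookahead, in frames, that each module waits for
# before it can emit output (illustrative numbers only).
lookahead_frames = {
    "streaming ASR": 12,
    "spectrum conversion": 10,
    "vocoder": 8,
}

total = 0
for module, frames in lookahead_frames.items():
    ms = frames * FRAME_MS
    total += ms
    print(f"{module:>20}: {ms:4d} ms")

print(f"{'algorithm total':>20}: {total:4d} ms (budget: 400 ms)")
assert total <= 400, "over the real-time budget"
```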

2. The real-time voice conversion system

In fact, different speech recognition, voiceprint recognition, and vocoder frameworks can be combined freely into a complete conversion system. But to keep the algorithm delay within 400 ms, more consideration must be given to how the system is deployed. The example in Figure 11 shows the choice faced when deploying a real-time system: convert the voice in the cloud, or on the device. The two pipelines each have advantages and disadvantages. Let's first look at cloud-side conversion.

■ Figure 11

With cloud conversion, the speaker's audio is captured locally and passed through the APM audio-processing module to remove noise, echo, and so on. The speech is then encoded and transmitted to the server for conversion. This leg is affected by the network, so a NetEQ module may be introduced to resist weak networks; the stream is decoded and converted, and the converted audio is re-encoded and transmitted to the listener, whose side also introduces NetEQ against weak networks before decoding and playback.

Compared with device-side conversion, this route adds a NetEQ pass and an extra encode/decode on the server. Without a weak network, the added delay may be 30 to 40 ms; under a weak network, guarding against packet loss may push it to 100 or 200 ms, or even higher.

The advantage of cloud conversion is obvious: compute on the server is relatively cheap, so at deployment time a better vocoder, ASR, or spectrum conversion can buy better sound quality with compute. With device-side conversion, the voice change can be placed at the sender: there may be more than one receiver, so converting once at the sender removes the NetEQ and codec passes on the server, and the delay on the rest of the link matches the cloud deployment. End to end, even without voice change, around 400 ms of delay is typical; adding the algorithm's 400 ms, device-side conversion stays under 800 ms end to end, greatly reducing the delay consumed in real-time transmission and making communication smoother.

Device-side conversion is still constrained by compute. As mentioned, the device-side budget is on the order of GFLOPS; phones such as the iPhone X and newer carry GPU processing chips, so real-time inference on the device is possible, but the model's compute cost must not be too high.

04 Demo and application scenarios

■ Figure 12

Next, let's look at a demo of the conversion effect built on the pipeline described above, and its application scenarios. The demo is a conversation between a man and a woman. In virtual social settings, the voice does not necessarily match the avatar: for example, a girl may want to present herself as a boy, and without voice conversion the voice would clash with the boy's image, producing a gender mismatch. The before/after comparison in the demo shows that voice conversion solves this, aligning the character with the voice. Also note that the conversation was not interrupted by the voice change.

1. Application scenarios of real-time voice conversion

So where will voice conversion be used? Whether in metaverse voice chat (MetaChat) scenarios or in traditional live voice chat, there are uses. For example, in a chat you may want to customize your avatar, including its appearance and voice, to enhance immersion. If you play a cute girl but have a deep voice, you will be found out quickly. In games, speaking with the target character's voice adds a stronger sense of immersion.

In addition, there are now many virtual digital human scenarios. Many celebrities stream with customized virtual avatars; they do not need to enter the studio themselves, and others can simulate them in the background for round-the-clock, 24-hour streaming. A representative chat-room scenario is online script-murder games: players act out roles across different scenarios and need to switch between different voice lines, which voice conversion makes possible.

2. Toward better real-time voice conversion

There are many directions for optimizing the current real-time conversion; here are a few for discussion. First, as mentioned, to guarantee real-time performance, only the speaker's timbre is currently adjusted; in fact, perceived similarity can be improved through the speaker's prosody and the emotional expression of the delivery. So within the real-time constraint, more exploration is possible to carry expressive features into the converted voice.

Second, vocoders need lower compute and better results. Today's vocoders already perform well on naturalness, but sample-by-sample vocoders still face big compute challenges. Compute and quality therefore need to be balanced, and a more suitable vocoder found at that balance point.

In addition, a more robust phoneme extraction module is currently a major problem for ASR. In noisy or complex scenes with voices other than the speaker's, recognition suffers. In that case, a noise-reduction module can be introduced to extract the target voice, or a phoneme extractor more robust to noise can be developed so that it recognizes speech within noise. In real conversations, whether for ASR or for conversion, multilingual speech also appears, for example Chinese mixed with English or even quoted Japanese. This raises the problem of multilingual recognition, i.e., code-switching, for which there is no perfect solution yet; it is a direction for future improvement.

Finally, hardware determines the upper bound on the complexity of the models that can be deployed. Better on-device hardware for deployment reduces end-to-end delay and enables better conversion. This is another direction that deserves attention.

05 Q & A

1. What methods effectively reduce latency for real-time voice conversion?

For real-time voice conversion, especially when interaction is required: first, it is best to deploy the conversion on the device side, which saves the delay of transmitting to the server and of resisting weak networks. Second, within the algorithms, real-time streaming ASR has lookahead, that is, it must see some future frames to keep recognition accurate. ASR, TTS, and spectrum conversion alike must all control the amount of lookahead so that real-time performance is guaranteed without hurting quality too much. In practice, the overall algorithm delay can be controlled within 400 ms; split across the modules, each gets roughly 100 ms.

2. Where can I try the any-to-any voice conversion? Is it available for personal use?

This part is still being polished, but we already have a demo; it will be released in the coming months for everyone to try.

3. For a beginner who wants to learn audio systematically, are there good learning resources or learning paths?

I previously worked with Geek Time on a column about audio technology ("Mastering Audio Technology"). The audio field may look narrow, but it touches many areas. You can find resources online for systematic study; besides audio itself, it includes audio 3A processing (audio link processing), acoustics, and AI-based techniques.

Start by looking for material from these angles and study them together; the results will be better.

Further reading: How to implement voice "cloning" based on real-time voiceprint voice conversion

Event announcement


At 「RTC Dev Meetup - Hangzhou」, we will focus on big front-end technology, inviting technical experts from Agora, Ant Group, and Hikvision to share business architectures and cross-platform practices for real-time interaction in the big front-end era.

