
AEC: Causes of Echo and Principles of Echo Cancellation

2022-06-10 19:02:00 Zego instant Technology

In the previous lesson, "Advanced Audio and Video Developers: Audio Elements", we got to know our old friend "sound" anew by looking at the three elements of sound, the digitization of analog audio signals, and the characteristics of digital audio signals. Today we continue with more of this old friend's stories in the RTC world.

Sharpening the axe does not delay the woodcutting. Before getting to the main topic, let's look at the basic processing flow of audio and video data in an RTC scenario. To tie it to real applications, we explain it from the perspective of two roles: the host and the audience.

One 、 Audio and video data flow link

Put simply, the host captures and sends audio and video data, the audience receives and plays it, and the two are connected by a real-time network. In more detail: the data captured by the host may contain noise, echo and other problems, and it is large in volume, so it is usually unsuitable for sending over the network directly; the data the audience pulls from the network is encoded and compressed, so it cannot be played directly either. To solve these problems, pre-processing/post-processing and encoding/decoding modules are introduced, forming a basic "symmetric" pipeline linked by the network, as shown in the figure below.

 

Note that the figure shows only the one-way "host → audience" link, which can be called the "single host" scenario. The link can also be two-way, as in the co-hosting ("Lian Mai") scenario we often mention: two users talk to each other with audio and video, for example in a WeChat call. Each side is then both host and audience for the other, and data flows in both directions.

Audio and video data in RTC scenarios basically follows the pipeline above. Today we focus on the "pre-processing" module.

Two 、 The three swordsmen of audio processing: 3A processing

Audio pre-processing has many functions. Some make the sound more "pleasant to hear", such as echo cancellation and noise suppression: they remove "irrelevant signals" from the sound signal and make it "cleaner". Others make the sound more "interesting", such as voice changing, virtual stereo and reverberation: they add rich special effects to the sound.

Among the pre-processing steps that make the sound more "pleasant to hear" are the so-called three swordsmen of audio processing, known as 3A processing:

  • AEC (Acoustic Echo Canceller, echo cancellation)
  • ANS (Ambient Noise Suppression, noise suppression)
  • AGC (Automatic Gain Control, automatic volume gain)

They are regulars in the audio processing pipeline, and we will introduce them one by one in later lessons.

Of the three, noise suppression and automatic gain control have names that roughly suggest what they do; after all, noise and volume are familiar concepts. The "echo cancellation" swordsman seems more mysterious and may be less familiar. Taken literally, it removes the echo signal from the sound.

But what is an echo in an RTC scenario?

Why is there an echo in the voice signal ?

Why do we need to eliminate it ?

We'll unveil its mystery right away !

Three 、 The cause of the echo

We all encounter echoes in daily life. Imagine shouting "Hello!" at the mountains. What happens? First you hear your own "Hello!" immediately, and soon afterwards you hear the mountains "reply" with the same "Hello!". That "reply" from the mountains is an echo.

Physics explains echo simply: sound is produced by the vibration of a sounding body and spreads in all directions. Part of it travels straight to the human ear (direct sound), and part of it hits an obstacle (such as a wall), is reflected, and then reaches the ear (reflected sound).

If the reflected sound arrives more than 0.1 s after the direct sound, the ear can tell the two apart, and the reflected sound is called an "echo". The echo that AEC, the protagonist of today's audio pre-processing story, has to deal with in RTC scenarios is somewhat different from the echo of daily life and physics. Let's use a co-hosting scenario to illustrate.

As shown in the figure below, user A and user B are co-hosting, and each has their own microphone and speaker:

 

Overall, the speech of users A and B is captured by their respective microphones, transmitted to the other side over the network, and played through the other side's speaker. Nothing seems wrong so far. Looking at one specific interaction, though, the following happens:

 

1、 At some moment, user A starts talking. The resulting voice A is captured by microphone A and transmitted over the network to user B, where it becomes voice A1, waiting to be played.

2、 After voice A1 is played by speaker B, it reaches microphone B via the direct path and via reflections from the surroundings, and is captured as voice A2 (echo A2 in the figure).

3、 The captured voice A2 is transmitted over the network back to user A and played through speaker A.

After this process, user A finds that shortly after finishing a sentence, they hear it "repeated" back. This is the "echo" phenomenon in RTC scenarios. Echo A2 consists mainly of direct echo (sound from A1 that enters the microphone straight from the speaker without any reflection) and indirect echo (sound from A1 that enters the microphone after one or more reflections from the environment). (Only acoustic echo is discussed here; line echo caused by faulty device wiring is not covered for now.)

The process above describes the "single talk" case of co-hosting (only one person speaks at a time). If users A and B speak at the same time (double talk), echo A2 is mixed with user B's voice B and sent to user A, which seriously interferes with what user A hears. In the same way, user B runs into the same problem and hears their own echo.

As you can imagine, if this is left untreated, both co-hosts keep hearing their own echoes and cannot hear each other clearly, and the experience is very poor. Whether in everyday voice calls, team voice chat in games, or KTV chorus, echo is a "big problem" that developers must pay attention to and solve. The one dedicated to solving it is one of our three swordsmen: echo cancellation, AEC. Next, let's look at this swordsman's special skill.

Four 、 Echo cancellation principle

1、 The basic logic of echo cancellation

Following the earlier analysis, the echo user A hears while co-hosting is caused by user B's microphone picking up user A's voice. User A is the "innocent victim"; to solve the problem, we have to start on the side of the culprit, user B, as shown in the figure below:

 

In the figure, voice A1 is about to be played through the speaker, so it is a known signal, which we call the reference signal. Echo A2 and voice B are called the far-end echo signal and the near-end voice signal respectively, according to how they are produced. The microphone captures the two together as the mixed signal C (C = A2 + B). The mixed signal C is known, but A2 and B inside it are like salt and sugar dissolved in a glass of water: hard to tell apart.

In short, if we can subtract the far-end echo A2 from signal C, only the "clean" near-end voice B is left (B = C - A2). This is the job of the AEC module in the figure. At first glance this seems simple: since A2 is the echo of the known reference signal A1, the two should sound similar, so why not use A1 in place of A2 and subtract A1 directly from C? Unfortunately, the ideal is appealing but reality is harsh.

Between the moment reference signal A1 is played and the moment echo A2 is captured, the sound travels the speaker → room → microphone path (LRM, Loudspeaker-Room-Microphone). The environment changes in real time, and so does the LRM path. This uncertainty makes the two signals very different as digital signals. If we subtract the reference signal A1 directly from signal C, a lot of residue remains, and the result differs greatly from voice B.

A1 cannot be used directly in place of A2, and A2 cannot be conjured out of thin air. Is there nothing we can do? Of course not. Voice A1 and echo A2 are like twins who look alike but have different personalities: there is still a correlation between them that cannot be ignored. Using this correlation, we have an indirect solution:

Model the LRM path mathematically as a function F(x), so that A2 = F(A1). Subtracting F(A1) from the mixed signal C also achieves echo cancellation. This is the key work of an echo cancellation algorithm. Its basic logic is: estimate the characteristic parameters of the echo path to simulate that path; use the simulated path function F(x) and the reference signal A1 to compute an estimate of the echo signal A2; finally, subtract that estimate from the captured signal C. That is:

The ideal result :C - A2 = B

The actual result :C - F(A1) = B1

Differences between the two :B - B1 = F(A1) - A2

If we could solve the echo path exactly, then F(A1) = A2 and B1 = B, which is perfect echo cancellation. In practice, however, achieving "F(A1) = A2" is very difficult: we have to cope not only with a complex external reflection environment, but also with anomalies that the capture and playback devices may introduce.
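The relationship between the three formulas above can be checked with a toy example. The sketch below is only an illustration under strong assumptions: the echo path F(x) is modelled as a fixed delay plus attenuation (the function `echo_path` and all sample values are made up), whereas a real LRM path is far more complex and time-varying. It compares subtracting A1 directly with subtracting F(A1).

```python
def echo_path(a1, gain=0.5, delay=2):
    """Hypothetical LRM path F(x): delay the reference by `delay` samples and attenuate it."""
    return [0.0] * delay + [gain * x for x in a1[:len(a1) - delay]]

def subtract(x, y):
    """Sample-by-sample difference x - y."""
    return [xi - yi for xi, yi in zip(x, y)]

def energy(x):
    """Sum of squared samples."""
    return sum(v * v for v in x)

a1 = [1.0, -1.0, 0.5, 0.0, -0.5, 1.0, 0.0, 0.0]   # reference signal A1 (made-up samples)
b = [0.0, 0.1, 0.0, -0.1, 0.2, 0.0, -0.2, 0.1]    # near-end voice B (made-up samples)
a2 = echo_path(a1)                                 # echo A2 = F(A1)
c = [e + v for e, v in zip(a2, b)]                 # mixed signal C = A2 + B

naive = subtract(c, a1)               # C - A1: reference used in place of the echo
modelled = subtract(c, echo_path(a1)) # C - F(A1): echo path modelled exactly

naive_error = energy(subtract(naive, b))        # how far C - A1 is from B
modelled_error = energy(subtract(modelled, b))  # how far C - F(A1) is from B
print(naive_error, modelled_error)
```

With an exact model of the path, the error is at the level of floating-point rounding, while subtracting A1 directly leaves a large residue. The whole difficulty of AEC lies in the fact that F(x) is not known in advance and has to be estimated.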

Designing an excellent echo cancellation algorithm is a major undertaking. It is, after all, the core secret of the swordsman AEC, and it takes mathematics, signal processing knowledge and a great deal of practice to master. As application developers we do not have to dig into the details of the algorithm, but we do need to understand the basic principle, which helps us solve echo problems in real application scenarios.

2、 The basic principle of echo cancellation algorithm

 

Following the earlier discussion, the input signals of echo cancellation are mainly the reference signal (voice A1 above), the far-end echo signal (echo A2 above) and the near-end voice signal (voice B above); the desired output is clean near-end voice. The echo path LRM in the environment is unknown, so we usually need a linear filter to simulate it. And because the LRM path is dynamic and time-varying, a filter with fixed parameters cannot meet the need; the filter must also be "adaptive", able to adjust its parameters dynamically according to its own state and changes in the environment.

In short, we need an "adaptive linear filter" to solve for the echo path F(x). In a complex and changing environment, it estimates the echo signal A2 from the reference signal A1 and, using the correlation between the two, removes as much of the echo as possible from signal C.
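The article does not say which adaptive algorithm is used. As one common choice, the sketch below uses NLMS (normalized least mean squares) to show how an adaptive linear filter can learn an echo path from the reference signal alone. The function name, the tap count, the step size `mu` and the simulated `true_path` are all illustrative assumptions, and the example covers only the single-talk case with no near-end voice or noise.

```python
import random

def nlms_echo_cancel(reference, mic, taps=8, mu=0.5, eps=1e-8):
    """Run an NLMS adaptive filter; return (residual signal, learned weights)."""
    weights = [0.0] * taps    # running estimate of the echo path F(x)
    history = [0.0] * taps    # most recent reference samples, newest first
    residual = []
    for x, d in zip(reference, mic):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        e = d - echo_estimate                      # C - F(A1) for this sample
        power = sum(h * h for h in history) + eps  # normalization term
        weights = [w + mu * e * h / power for w, h in zip(weights, history)]
        residual.append(e)
    return residual, weights

random.seed(0)
true_path = [0.0, 0.6, 0.3, -0.1]   # hypothetical echo path, unknown to the filter
reference = [random.uniform(-1.0, 1.0) for _ in range(4000)]   # stands in for A1
echo = [
    sum(g * reference[n - k] for k, g in enumerate(true_path) if n - k >= 0)
    for n in range(len(reference))
]                                    # stands in for A2 (single talk: C = A2)

residual, weights = nlms_echo_cancel(reference, echo)
print([round(w, 3) for w in weights[:4]])
```

The filter never sees `true_path`; it only adjusts its weights so that the residual shrinks, which is the "adaptive" behaviour described above. Production echo cancellers typically work on much longer paths and often in the frequency domain, so this is a sketch of the idea rather than a usable canceller.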

After the adaptive linear filter, the near-end voice B has been "purified" to some extent, but some echo often remains. We then perform residual echo processing according to how much remains, for another round of removal. The amount of residue is judged by analysing the correlation between the residual echo and the far-end reference signal: the higher the correlation, the more echo remains; the lower the correlation, the less remains.
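A minimal way to picture this correlation test is sketched below. It assumes the residual frame and the reference frame are already time-aligned, uses a plain normalized correlation at zero lag, and maps it to a suppression gain with a made-up rule (`1 - correlation`, limited by a `floor`). Real residual echo suppressors usually work per frequency band with smoothed statistics, so every name and constant here is illustrative.

```python
import math

def normalized_correlation(x, y):
    """|<x, y>| / (||x|| * ||y||); returns 0.0 if either frame is silent."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return abs(dot) / norm if norm > 0.0 else 0.0

def residual_suppression_gain(residual, reference, floor=0.1):
    """More correlation with the reference means more leftover echo, so a lower gain."""
    rho = normalized_correlation(residual, reference)
    return max(floor, 1.0 - rho)

reference = [1.0, -1.0, 1.0, -1.0]
leaky_residual = [0.2 * v for v in reference]   # residual that is still pure echo
clean_residual = [1.0, 1.0, 1.0, 1.0]           # residual unrelated to the reference

gain_leaky = residual_suppression_gain(leaky_residual, reference)
gain_clean = residual_suppression_gain(clean_residual, reference)
print(gain_leaky, gain_clean)
```

The frame that still looks like the reference is attenuated to the floor, while the unrelated frame passes through unchanged.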

Finally, a few die-hards may still be left, such as nonlinear components introduced by distortion in the speaker, microphone and other devices. A linear filter cannot remove these nonlinear components, so they need nonlinear clipping.
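The article does not specify which nonlinear method is meant. One classic form of nonlinear processing in echo cancellers is a center clipper, which zeroes any sample whose magnitude falls below a threshold. The sketch below shows only that idea; the function name and the threshold value are assumptions.

```python
def center_clip(samples, threshold):
    """Zero out samples whose magnitude is below `threshold`; pass the rest unchanged."""
    return [0.0 if abs(s) < threshold else s for s in samples]

# Small leftovers (assumed to be residual echo) are removed, louder samples are kept.
frame = [0.01, -0.02, 0.5, -0.6, 0.0]
clipped = center_clip(frame, threshold=0.05)
print(clipped)
```

The trade-off is visible even here: anything quiet is removed, including quiet near-end speech, which is why this stage has to be tuned carefully.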

To sum up: adaptive linear filtering + residual echo suppression + nonlinear clipping together make up a fairly complete echo cancellation process.

3、 Echo cancellation strategies for single talk and double talk

As mentioned earlier, a co-hosting scenario is either single talk or double talk, depending on how many people are speaking at the same time. In the two cases the input signals of echo cancellation differ, and so do the processing strategies.

First, we can compare characteristics of the far-end and near-end signals, such as peak correlation, frequency-domain correlation and amplitude similarity, to decide whether we are in the double-talk state (if both signals have high energy and their correlation is low, it is probably double talk).
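The rule in parentheses can be sketched as a small detector. This is a simplified illustration, not a production double-talk detector: it assumes the far-end frame and the microphone frame are time-aligned, looks only at zero-lag time-domain correlation, and uses made-up thresholds and sinusoidal test signals.

```python
import math

def mean_energy(frame):
    """Average squared amplitude of a frame."""
    return sum(v * v for v in frame) / len(frame)

def normalized_correlation(x, y):
    """|<x, y>| / (||x|| * ||y||); returns 0.0 if either frame is silent."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return abs(dot) / norm if norm > 0.0 else 0.0

def is_double_talk(far_frame, mic_frame, energy_threshold=1e-3, corr_threshold=0.5):
    """Both signals energetic but weakly correlated suggests double talk."""
    both_active = (mean_energy(far_frame) > energy_threshold
                   and mean_energy(mic_frame) > energy_threshold)
    return both_active and normalized_correlation(far_frame, mic_frame) < corr_threshold

far = [math.sin(0.3 * n) for n in range(400)]    # far-end reference (illustrative)
near = [math.sin(1.7 * n) for n in range(400)]   # near-end voice (illustrative)
echo_only = [0.3 * v for v in far]               # single talk: mic hears only echo
echo_plus_near = [e + v for e, v in zip(echo_only, near)]   # double talk

print(is_double_talk(far, echo_only), is_double_talk(far, echo_plus_near))
```

In single talk the microphone signal is just a scaled copy of the reference, so the correlation is close to 1; once near-end voice is mixed in, the correlation drops and the detector fires.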

In the single-talk case, only the far-end user A is speaking, so the signal captured by user B's microphone contains only far-end echo and no near-end voice. Echo cancellation is relatively easy here, and we can even use more aggressive strategies, such as discarding the captured voice signal entirely and filling in an appropriate amount of comfort noise to improve the listening experience. The adaptive linear filter also achieves better cancellation in this case, which reduces the workload of the subsequent residual echo suppression and nonlinear clipping.
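The aggressive single-talk strategy can be sketched as follows. The talk-state flags are assumed to come from some detector, the function names are hypothetical, and the comfort noise here is plain low-level white noise; real implementations usually shape the noise to match the actual background.

```python
import random

def fill_comfort_noise(frame, level=0.001, rng=None):
    """Return low-level uniform noise with the same length as `frame`."""
    rng = rng or random.Random()
    return [rng.uniform(-level, level) for _ in frame]

def process_frame(frame, far_end_active, near_end_active):
    """Drop the frame during far-end single talk; otherwise pass it through."""
    if far_end_active and not near_end_active:
        # The frame contains only echo, so replace it entirely with comfort noise.
        return fill_comfort_noise(frame)
    return frame

echo_frame = [0.5, -0.4, 0.3]
single_talk_out = process_frame(echo_frame, far_end_active=True, near_end_active=False)
double_talk_out = process_frame(echo_frame, far_end_active=True, near_end_active=True)
print(single_talk_out, double_talk_out)
```

Filling with noise instead of digital silence avoids the impression that the line has gone dead each time the far end speaks.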

In the double-talk case, the far-end and near-end users speak at the same time, so the signal captured by the microphone contains both far-end echo and near-end voice. This makes processing harder: the far-end echo must be removed while the quality of the near-end voice is preserved. If the far-end echo has more energy than the near-end voice (for example, 6~8 dB more), it is hard to avoid damaging the near-end voice during cancellation. In that case it may be necessary to reduce how aggressively the adaptive filter cancels, and to adjust the strategies of the subsequent residual echo suppression and nonlinear clipping accordingly.
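To give a feel for what a 6 dB gap means, the sketch below computes the energy ratio between an echo and a near-end signal in decibels. In a real system the two components are not available separately, so this is only an illustration of the arithmetic; the function name and sample values are made up.

```python
import math

def energy_ratio_db(echo, near_end):
    """Echo energy relative to near-end energy, in dB (10 * log10 of the energy ratio)."""
    echo_energy = sum(v * v for v in echo)
    near_energy = sum(v * v for v in near_end)
    return 10.0 * math.log10(echo_energy / near_energy)

near_end = [1.0, -1.0, 1.0, -1.0]
echo = [2.0 * v for v in near_end]   # twice the amplitude, four times the energy
ratio_db = energy_ratio_db(echo, near_end)
print(round(ratio_db, 2))
```

An echo with twice the amplitude of the near-end voice carries four times the energy, which is roughly 6 dB, the lower end of the range mentioned above.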

Echo cancellation has always been a frontier area for the major audio and video vendors. How well residual echo is handled and how well near-end voice quality is protected indicate the level of an echo cancellation algorithm.

The ZEGO SDK uses a self-developed audio and video engine in which, based on extensive practical verification and application feedback, residual echo suppression and nonlinear clipping have been optimized. To meet users' different sound-quality needs, the SDK also supports different levels of echo cancellation (soft, balanced, aggressive); it achieves good echo cancellation in double-talk, music and other scenarios while preserving sound quality as far as possible, which places it at a leading level in the industry. Besides application-level echo cancellation, the SDK also supports using the device's system echo cancellation. System echo cancellation is comparatively more aggressive: the cancellation may be better, but the damage to sound quality is also greater. It has particular advantages in some scenarios, which we explain briefly later.

Five 、 Echo problems in practical application scenarios

So far we have built a systematic understanding of what echo means in RTC scenarios and of the basic principle of echo cancellation (AEC). Next, let's see how to use this knowledge to locate and solve echo problems in real application scenarios.

First, let's be clear: if, while two users are co-hosting, one of them hears "their own echo", then most likely echo cancellation on the other side was not done well. There are also some "low-probability" cases: the user has in-ear monitoring turned on; a faulty headset cable produces line echo; the user's sound card plays the captured audio back repeatedly; or a bug in the business logic makes the user pull back the very audio they sent. In all of these the user ends up hearing their own voice again, but they are not echo problems in the conventional sense and AEC cannot solve them. They have to be avoided through device tuning, usage and business logic.

Once these unconventional "echo" problems are excluded, the remaining "high-probability" problems can generally be analysed with the formula C - F(A1) = B1. Let's go through them one by one:

  Let's continue to elaborate in combination with the above figure :

1、 Problems with signal C

Signal C is the mixed signal captured by the microphone and awaiting echo cancellation; it contains near-end voice and far-end echo. If the far-end echo energy in C is much greater than the near-end voice energy, for example because the speaker is too close to the microphone or the speaker volume is so high that it masks the near-end voice, then echo cancellation inevitably "kills a thousand enemies at the cost of eight hundred of its own". In this case it is advisable to turn the local playback volume down appropriately.

2、 Problems with signal A1

Signal A1 is the reference signal used for echo cancellation. The algorithm estimates the echo, F(A1), from the reference signal, so the accuracy of A1 directly affects the cancellation result. The greater the difference between the sound actually played and the reference signal, the harder the estimate becomes. Ideally the reference signal A1 equals the signal the speaker is about to play, but the following can happen:

A、 The signal actually played has changed: for example, the output device applies sound-effect processing to A1, so the signal actually played differs greatly from the reference signal. F(A1) computed from the wrong signal naturally cannot give good echo cancellation.

B、 The reference signal A1 cannot be obtained: this usually happens because the party performing echo cancellation is not the producer of signal A1. For example, application a performs echo cancellation with its own algorithm, but the signal played by the speaker includes audio produced by application b (say, a music app playing in the background). Because application a is not the producer of that audio and has no system-level permission, it cannot obtain the audio as a reference signal. As the saying goes, "one can't make bricks without straw", and further processing is out of the question.

Echo problems caused by an altered or missing A1 are hard to solve at the algorithm level. Apart from turning off the sound-effect processing and avoiding third-party audio playback, the only hope is the pre-processing module of the system hardware. The system module has the highest privileges and can obtain the final, most complete signal played by the system speaker. This is the natural advantage of system pre-processing over application-level pre-processing.

3、 Problems with the echo path F(x)

If we have the right reference signal and the energy ratio between far-end echo and near-end voice in the mixed signal C is reasonable, yet echo cancellation is still unsatisfactory, the simulation of the echo path may have gone wrong. Excluding problems in the echo cancellation algorithm itself, the cause may be frequent changes in the playback and capture environment (hardware or physical surroundings). For example, the audio device keeps being moved or covered, or the user suddenly walks from an empty room into a noisy corridor. The LRM path then changes, the filter has not yet had time to adapt and converge (or cannot converge at all), and echo appears. Here, on the one hand the echo cancellation algorithm needs further optimization to improve its adaptivity and convergence speed; on the other hand the co-hosting environment should be improved and kept as stable as possible.
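The effect of a sudden path change can be simulated. The sketch below is an illustration under assumptions: it uses a small NLMS filter (one common adaptive algorithm, not necessarily the one any particular SDK uses), two made-up echo paths, and a switch between them halfway through a single-talk signal. It then compares the residual energy just before the change, just after it, and once the filter has re-converged.

```python
import random

def nlms_residual(reference, mic, taps=4, mu=0.5, eps=1e-8):
    """Run an NLMS adaptive filter and return the residual (mic minus estimated echo)."""
    weights = [0.0] * taps
    history = [0.0] * taps    # most recent reference samples, newest first
    residual = []
    for x, d in zip(reference, mic):
        history = [x] + history[:-1]
        e = d - sum(w * h for w, h in zip(weights, history))
        power = sum(h * h for h in history) + eps
        weights = [w + mu * e * h / power for w, h in zip(weights, history)]
        residual.append(e)
    return residual

def echo_sample(path, reference, n):
    """Echo at time n for a given (hypothetical) echo path."""
    return sum(g * reference[n - k] for k, g in enumerate(path) if n - k >= 0)

def energy(x):
    return sum(v * v for v in x)

random.seed(1)
reference = [random.uniform(-1.0, 1.0) for _ in range(2000)]
path_before = [0.5, 0.2, 0.0, 0.0]    # e.g. device lying on a desk (made up)
path_after = [0.1, -0.4, 0.3, 0.0]    # e.g. device picked up and moved (made up)
mic = [
    echo_sample(path_before if n < 1000 else path_after, reference, n)
    for n in range(len(reference))
]

residual = nlms_residual(reference, mic)
before_change = energy(residual[900:1000])   # filter converged to the old path
just_after = energy(residual[1000:1020])     # path changed, filter is now wrong
reconverged = energy(residual[1900:2000])    # filter has adapted to the new path
print(before_change, just_after, reconverged)
```

The burst of residual energy right after the switch is the audible echo described above. If the path keeps changing faster than the filter can converge, that burst never dies away.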

Finally, here is one trick that solves most conventional echo problems: put on a headset.

With headphones on, far-end audio is played straight into the ear and hardly leaks out, so the microphone does not pick it up and echo cancellation is naturally unnecessary. Moreover, in scenarios with extremely high demands on sound quality and real-time performance, such as multi-person real-time KTV chorus, users are advised to wear headphones and turn echo cancellation off, to avoid the sound-quality damage and processing delay that echo cancellation introduces.

Of course, many factors cause echo problems in RTC scenarios and there are many ways to solve them. There is no need to apply them mechanically: we should understand the principle of echo cancellation, let theory guide practice, accumulate experience from practice, and keep improving our own body of knowledge.

As mentioned earlier, echo cancellation has always been one of the frontier areas for the major RTC vendors. There is far more complex, profound and, of course, interesting content waiting for you to explore. Today we have only made a preliminary exploration of the basic principles and simple applications. That is far from enough for designing a good echo cancellation algorithm, but we hope it helps you take the first step in applying it.

Below, a mind map sorts out the contents of the whole article:

【 Supplementary mind map 】

 

 

Thinking questions

On mobile devices, what factors should be considered when choosing between system echo cancellation and application-level echo cancellation?

Answer:

1、 If the requirements for echo cancellation are high and sound quality is not a sensitive concern, system echo cancellation can be used.

2、 If the played signal contains third-party audio that also needs to be cancelled, system echo cancellation is required.

3、 For scenarios that demand high sound quality, application-level echo cancellation is recommended (provided an excellent echo cancellation algorithm is available at the application level).

4、 On iOS devices, Apple's highly customized software and hardware mean the effect of system echo cancellation is dependable. On Android, because vendors differ in capability and in software and hardware specifications, the effect of system echo cancellation is uneven, and some devices may have no system echo cancellation at all; the suggestion is to add application-level echo cancellation as a fallback.

Original site

Copyright notice
This article was written by [Zego instant Technology]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/161/202206101811399282.html