Audio and Video Basics ｜ An Analysis of How ANS Noise Suppression Works

2022-06-22 16:18:00 【Zego instant Technology】

In the previous lesson, 《Advanced course of audio and video development | Lesson 2: Echo cancellation》, we touched on the concept of audio preprocessing and met the first of its three swordsmen: AEC, echo cancellation. Today, let's get to know the second of the three: noise suppression, ANS (Ambient Noise Suppression).

If you often join online meetings, you have surely complained: “It's too noisy, I can't hear anything clearly” or “There's a lot of noise around me, I need to find a quiet place”. The culprit behind all of this is noise, which is everywhere. Like echo, noise seriously degrades the user experience in audio and video scenarios, and it is a hurdle that no developer can avoid. “Moving to a quiet place” certainly works, but it is too crude and not elegant enough. We need the help of the swordsman “noise suppression”.

One、 What is noise in RTC scenarios

The swordsman “noise suppression”, as the name suggests, helps us deal with noise in RTC scenarios. To get to know it well, the first step is to understand what noise is.

Compared with echo, noise is far more common in daily life: the hum of an office air conditioner motor, the roar of an electric drill from the renovation next door, the roar of planes taking off and landing at an airport, the shouting of market vendors, and so on. Any sound that disturbs our normal work, study, or rest can be called noise. So the definition of noise is not strict: apart from the “useful” sound, every other “useless” sound can be regarded as noise, and different scenarios simply define “useful” and “useless” differently.

In RTC scenarios, the most basic requirement is to guarantee the quality of voice communication, so the “useful” sound usually means the human voice, and every signal other than the human voice counts as “useless” noise. According to its time-varying characteristics, this noise can be divided into two categories: steady-state (stationary) noise and unsteady (non-stationary) noise.

Steady-state noise: the key is stability. It is noise that is continuous in time and whose signal characteristics, such as amplitude and spectrum, are stable or change only slowly. Examples are air conditioner and fan noise, which sound like a continuous, steady hum.

Unsteady noise is the counterpart of steady-state noise: its signal characteristics are unstable and change considerably over time, or it may be intermittent or transient. Examples are the babble of voices in a vegetable market, the clatter of a keyboard, a door slamming, and fireworks or firecrackers. Such noise may persist while changing constantly, or appear for an instant and then vanish.

Compared with unsteady noise, steady-state noise is easier to suppress, and the suppression schemes are more mature. Unsteady noise is a hard problem for audio and video developers, and how well it is suppressed is an important measure of the swordsman's ability.

Note that the noise discussed today, whether steady-state or unsteady, is additive noise relative to the “useful sound” (the human voice). Additive noise is independent of the human voice: whether or not the voice is present does not affect the noise, and the mixed signal is obtained by adding the two.

Note: noise that exists only because the human voice exists and that changes with the “useful sound”, such as room reverberation, has a “multiplicative relationship” with the voice and is beyond the scope of today's discussion.

Two、 Analysis of the noise suppression principle

Now that we know what noise in RTC scenarios is and how it is roughly classified, let's join hands with “noise suppression” and tackle it for real. The first step is to understand the basic principle of noise suppression.

1、 A simple understanding of the noise suppression principle

Recall the key work of echo cancellation from the last lesson: use the far-end reference signal about to be played by the loudspeaker to estimate the echo signal, then subtract the echo from the mixed signal captured by the microphone to preserve the near-end voice. With a small adjustment, the same process describes noise suppression: estimate the noise signal and remove it from the mixed signal captured by the microphone to obtain the noise-reduced speech. As shown in the figure above, steady-state or unsteady “additive noise” B and the “useful signal”, voice A, add up to the noisy speech C; after processing by the ANS module, the final output is the noise-reduced speech A1.

That is, with B1 denoting the estimate of the noise B:

C = A + B

A1 = C - B1
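To make the two formulas concrete, here is a minimal time-domain sketch in Python. The function names, the 16 kHz sample rate, and the synthetic sine "voice" are all assumptions made for illustration; a real ANS module works on spectra and never has access to the true noise B.

```python
import math
import random

def mix_additive(voice, noise):
    """Return the noisy signal C = A + B, sample by sample."""
    return [a + b for a, b in zip(voice, noise)]

def remove_estimate(noisy, noise_estimate):
    """Return A1 = C - B1, where B1 is an estimate of the noise."""
    return [c - b1 for c, b1 in zip(noisy, noise_estimate)]

random.seed(0)
n = 160  # 10 ms of audio at an assumed 16 kHz sample rate
voice = [0.5 * math.sin(2 * math.pi * 200 * t / 16000) for t in range(n)]  # stand-in for A
noise = [random.gauss(0.0, 0.05) for _ in range(n)]                        # stand-in for B
noisy = mix_additive(voice, noise)                                         # C

# With a perfect estimate (B1 equal to B) the voice comes back up to
# floating-point rounding; a real ANS only ever has an approximate B1.
recovered = remove_estimate(noisy, noise)
max_err = max(abs(a1 - a) for a1, a in zip(recovered, voice))
```

The quality of A1 depends entirely on how close B1 is to B, which is why the rest of this section concentrates on noise estimation.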

Although the core work can be stated in a single sentence, there is a lot hidden in it. The core work of a traditional ANS algorithm can be split into two modules: a noise estimation module and a noise filtering module.

2、 Noise estimation module

The main job of the noise estimation module is to decide whether the current signal is voice or noise, and how much noise there is. Compared with echo estimation, the difference is that noise is not derived from the far-end reference signal (it is additive noise) and has its own independent sound source, so it cannot be estimated, as echo is, by exploiting correlation with the far-end reference signal. Most of the time, the only raw material we have is the mixed noisy speech C. Accurate noise estimation is very important, but estimating one component of a mixed signal from the mixture alone sounds difficult. Fortunately, there are always more solutions than difficulties, and everything has two sides: since we cannot estimate from “correlation”, why not start from “non-correlation”?

Let's first look at the following two ways of noise estimation ：

（1） Noise estimation based on spectral subtraction

Noise estimation based on spectral subtraction generally assumes that the first few frames of the noisy speech contain no speech activity and are pure noise. The average spectrum of those first frames (amplitude spectrum or energy spectrum) can therefore be taken as the estimate of the noise spectrum.
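The averaging step can be sketched as follows. This is an illustrative sketch only: the helper names, the naive DFT (chosen to stay within the standard library), and the default of five noise-only frames are assumptions, not values taken from any particular implementation.

```python
import cmath
import math

def power_spectrum(frame):
    """Naive DFT power spectrum |X[k]|^2 for k = 0..N/2 (fine for small demo frames)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        spec.append(abs(acc) ** 2)
    return spec

def initial_noise_spectrum(frames, num_noise_frames=5):
    """Average the power spectra of the first frames, assumed to contain noise only."""
    head = [power_spectrum(f) for f in frames[:num_noise_frames]]
    return [sum(bins) / len(head) for bins in zip(*head)]

# Six identical constant frames: all the energy lands in bin 0.
demo_frames = [[1.0] * 8 for _ in range(6)]
demo_noise = initial_noise_spectrum(demo_frames)
```

If speech is already present in those first frames, the estimate is biased upward from the start, which is the weakness the text goes on to point out.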

（2） Noise estimation based on voice activity detection (VAD)

Noise estimation based on voice activity detection examines the mixed signal C frame by frame. If VAD detects no voice in an audio frame, the frame is considered noise and the noise spectrum is updated; otherwise, the noise spectrum of the previous frame is kept.
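A rough sketch of this frame-by-frame rule is shown below. The energy-threshold VAD, the threshold of 3.0, and the recursive smoothing factor of 0.9 are all hypothetical choices for illustration; real VADs use far richer features, and the article does not specify how the update is performed.

```python
def is_speech(frame_power, noise_power, threshold=3.0):
    """Toy energy VAD: call the frame speech if its total power is well above the noise."""
    return sum(frame_power) > threshold * sum(noise_power)

def update_noise(noise_power, frame_power, smoothing=0.9):
    """Update the noise spectrum only on frames the VAD marks as non-speech."""
    if is_speech(frame_power, noise_power):
        return list(noise_power)  # speech detected: keep the previous noise spectrum
    # No speech: blend the new frame into the running noise estimate.
    return [smoothing * n + (1.0 - smoothing) * p
            for n, p in zip(noise_power, frame_power)]

noise = [1.0, 1.0]
after_loud_frame = update_noise(noise, [10.0, 10.0])   # treated as speech
after_quiet_frame = update_noise(noise, [2.0, 0.0])    # treated as noise
```

Note that the estimate is frozen for as long as someone is talking, so any change in the noise during speech goes unnoticed.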

Both methods above usually need a relatively clean noise-only segment and hope that the noise stays as stable as possible. These are demanding requirements on the signal to be processed, and they are especially hard to meet in complex and changeable real-world scenarios.

（3） Noise estimation based on statistical model

In RTC scenarios, another approach is usually used: noise estimation based on a statistical model.

Noise estimation based on a statistical model generally assumes that the speech signal and the noise signal are statistically independent (uncorrelated) and that each follows a specific distribution, which places the problem in a statistical estimation framework. The open-source WebRTC noise reduction module also uses an estimation method based on a statistical model, so we can study its estimation scheme to aid understanding. The WebRTC noise estimation process is as follows:

First, perform an initial noise estimate using quantile noise estimation. It rests on a common observation: even within a speech segment, the input signal may have no speech energy in some frequency bands. So the energy of all frames in a given frequency band can be collected and a quantile chosen: energy below the quantile value is considered noise, and energy above it is considered speech. Compared with frame-by-frame VAD, this refines the granularity of the noise statistics further, so that useful noise information can be extracted even from speech frames.
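The idea can be sketched as a batch computation over a buffer of frames. This is a simplification for illustration, not WebRTC's code: the function name, the batch (sort-based) form, and the quantile of 0.25 are assumptions, and a production implementation would track the quantile recursively rather than sorting stored frames.

```python
def quantile_noise_estimate(power_frames, quantile=0.25):
    """Per frequency bin, take the given quantile of the power seen across frames."""
    num_frames = len(power_frames)
    index = min(int(quantile * num_frames), num_frames - 1)
    estimate = []
    for bin_values in zip(*power_frames):  # one tuple of values per frequency bin
        estimate.append(sorted(bin_values)[index])
    return estimate

# Four frames, two frequency bins. Speech shows up as the large values
# (30.0 in bin 0, 50.0 in bin 1) but never in both bins at once.
frames = [
    [1.0, 5.0],
    [1.2, 50.0],
    [0.9, 5.5],
    [30.0, 4.8],
]
noise_floor = quantile_noise_estimate(frames)
```

Because each bin is treated separately, the frames that carry speech in one bin still contribute noise information in the other.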

Then, update the initial noise estimate. The initial estimate is not accurate enough, but it can serve as the initial condition for the subsequent noise update / estimation process, which is noise estimation based on a modified likelihood ratio function: multiple speech / noise classification features are combined into one model; the model is combined with the a priori / a posteriori SNR from the initial noise estimate to analyze the input spectrum of each frame; the speech / noise probability of the frame is computed to judge whether it is noise; and the initial noise estimate is updated accordingly.

The updated noise estimate is more accurate than the initial one. After the update, the a priori / a posteriori SNR must be recomputed, and it is then used in the subsequent noise filtering module.
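For readers unfamiliar with the two SNR terms, the sketch below uses the common textbook definitions: the a posteriori SNR is the noisy power divided by the noise power, and the a priori SNR is estimated with the "decision-directed" rule, which blends the previous frame's clean-speech estimate with the current a posteriori SNR minus one. The function names and the smoothing constant 0.98 are assumptions, and WebRTC's own definitions may differ in detail.

```python
def posterior_snr(noisy_power, noise_power, floor=1e-12):
    """A posteriori SNR per bin: noisy power divided by estimated noise power."""
    return [y / max(n, floor) for y, n in zip(noisy_power, noise_power)]

def prior_snr_decision_directed(post_snr, prev_clean_power, noise_power,
                                alpha=0.98, floor=1e-12):
    """A priori SNR per bin via the textbook decision-directed estimate."""
    xi = []
    for gamma, s_prev, n in zip(post_snr, prev_clean_power, noise_power):
        from_previous = s_prev / max(n, floor)   # SNR implied by the last clean estimate
        from_current = max(gamma - 1.0, 0.0)     # instantaneous estimate, clipped at 0
        xi.append(alpha * from_previous + (1.0 - alpha) * from_current)
    return xi

gamma = posterior_snr([4.0, 1.0], [2.0, 2.0])
xi = prior_snr_decision_directed(gamma, [0.0, 0.0], [2.0, 2.0])
```

Both quantities are recomputed whenever the noise estimate changes, since the noise power appears in every term.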

To sum up, these are some of the noise estimation schemes used by WebRTC. With more accurate noise information from the noise estimation module, the next step is the other key part of ANS: the noise filtering module.

3、 Noise filtering module

There are many ways to filter noise. One is the spectral subtraction mentioned earlier: the estimated noise spectrum is subtracted from the spectrum of the noisy speech to obtain an estimate of the clean signal spectrum. But as we said, this method does not suit complex and changeable RTC scenarios, so let's go straight to the noise filtering method used by WebRTC: noise filtering based on the Wiener filter.

What is a Wiener filter?

As shown in the figure below: feed the noisy speech A into a noise filter. If the desired output signal is A1 and the actual output signal is A2, the estimation error between A1 and A2 is E. We want A2 to be as close to A1 as possible, so the estimation error E should be as small as possible. The optimal filter for this goal, in the minimum mean-square error sense, is the Wiener filter.

The Wiener filter is a linear filter. Deriving its mathematical expression requires the a priori / a posteriori SNR of the speech signal, and this is where the output of the noise estimation module comes in. Through this information, the WebRTC noise estimation module and noise filtering module are connected.
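In its simplest frequency-domain form, the Wiener filter reduces to a per-bin gain G = xi / (1 + xi), where xi is the a priori SNR. The sketch below shows that textbook form only; the function names and the gain floor of 0.1 (a common trick to limit artifacts) are assumptions and are not taken from WebRTC.

```python
def wiener_gain(prior_snr):
    """Textbook Wiener gain per bin: G = xi / (1 + xi), always in [0, 1)."""
    return [xi / (1.0 + xi) for xi in prior_snr]

def apply_gain(noisy_spectrum, gain, min_gain=0.1):
    """Scale each bin of the noisy spectrum, never attenuating below min_gain."""
    return [max(g, min_gain) * y for g, y in zip(gain, noisy_spectrum)]

gains = wiener_gain([0.0, 1.0, 9.0])            # noise-only, equal, speech-dominated bins
denoised = apply_gain([2.0, 2.0, 2.0], gains)
```

Bins dominated by speech (large xi) pass almost unchanged, while bins dominated by noise (xi near zero) are strongly attenuated.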

Now let's combine the two modules and summarize the core logic:

First, quantile noise estimation yields an initial noise estimate, from which the a posteriori / a priori SNR is computed;

Next, a modified likelihood-ratio-based algorithm, together with a model that integrates speech / noise classification features, analyzes the spectrum of each frame to obtain its speech / noise probability; this probability is used to identify noise and to update the noise estimate to a more accurate one;

Then, the a posteriori / a priori SNR is recomputed from the more accurate noise estimate, and the Wiener filter expression is derived;

Finally, the Wiener filter is applied to complete noise reduction and obtain the noise-reduced speech.

Three、 Some problems of the WebRTC noise reduction module

By studying one of WebRTC's noise reduction schemes above, we gained a general understanding of the basic principle of noise suppression. With that understanding, we can go on to discuss some problems of the WebRTC noise reduction module:

1、 Poor noise reduction on unsteady noise:

If the statistical properties of the noise are unstable, noise estimation based on a statistical model becomes much less effective, which also hurts the noise reduction result.
The algorithm only updates the noise estimate when the probability is high enough to judge the current audio frame as noise. If the noise changes frequently and also changes during speech segments (for example, while someone is talking), the change may go undetected, which weakens suppression.

2、 Poor noise reduction in low-SNR scenarios: when the noise is loud or the speech energy is low, the signal-to-noise ratio is very low and the speech / noise probability judgment succeeds less often. Noise may not be fully suppressed, and speech may even be damaged by being suppressed by mistake.

In practice, if you use WebRTC noise reduction or an improved scheme based on it and the suppression is poor, you can use the two points above for a preliminary analysis.

Four、 AI noise reduction: making up for the shortcomings of traditional algorithms

Poor noise reduction in complex scenarios such as unsteady noise and low SNR is in fact a common problem of traditional noise reduction algorithms, and a technical challenge for every major RTC vendor. How can the shortcomings of traditional algorithms be made up for? AI may be one correct answer.

With the wide adoption of deep learning, many neural-network-based audio noise reduction algorithms have emerged in the industry. They perform well in noise reduction quality and generalization, but they also have problems that cannot be ignored. Some algorithms demand a lot of computing power and are hard to deploy on real user devices; on low-end devices in particular, running an AI algorithm is simply a luxury. Even on high-end devices, allocating computing resources sensibly to avoid wasting performance is a hard problem. In addition, the robustness of AI noise reduction may not be guaranteed across scenarios, especially those not covered by the training data.

To address these problems, ZEGO proposed a lightweight neural network noise reduction method, ZegoAIDenoise. The algorithm performs well on both steady-state and unsteady noise and preserves voice quality and intelligibility, while keeping the performance overhead very low: on an iPhone 6 with a 1.4 GHz clock, CPU overhead is about 1%, comparable to WebRTC's general noise reduction, so it can cover most mid-range and low-end models. (Readers interested in the principles of ZegoAIDenoise can read the previously published article “A sharp weapon for eliminating unsteady noise - AI noise reduction” on the ZEGO blog on CSDN.)

With ZegoAIDenoise, combined with ZEGO's existing self-developed improved noise reduction algorithm, developers using the ZEGO SDK get good elimination of steady-state noise and, at the same time, handling of unsteady noise (including mouse clicks, keyboard sounds, knocking, air conditioner sound, clattering kitchen dishes, restaurant noise, ambient wind, coughing, blowing noise, and other non-voice noise). This brings better call quality to scenarios such as voice chat, online meetings, and team voice chat in games. It improves the real call experience without adding performance overhead or processing delay, helps remote calls work in harsher environments, and broadens the application scenarios of RTC.

Five、 Summary

With that, using the WebRTC noise reduction module as the medium, our introduction to the swordsman ANS comes to an end. Because we stripped away the complicated mathematics as far as possible and only sorted out and summarized the core logic, many places may seem superficial or not very rigorous to developers who actually research the underlying algorithms. If you are interested, you can study the open-source WebRTC source code to learn more. For application developers who have not worked with audio noise reduction algorithms, we hope that, beyond knowing that a few SDK API calls turn on audio noise reduction, you now have a deeper understanding that helps you better understand and solve problems in application development.

Question to think about

Noise suppression may damage the “useful signal”, and this is especially noticeable in music scenarios. Why?

Copyright notice
This article was written by [Zego instant Technology]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/173/202206221502187071.html