
Audio alignment using cross-correlation

2022-07-31 21:37:00 【atypical nonsense】

When computing audio metrics such as SNR, the audio signal must be aligned with the reference signal. Processed or recorded audio is often not aligned with the reference, so we need a way to align the two.

I. Cross-correlation function

Audio alignment can be treated as a delay estimation problem. We previously introduced GCC-PHAT for delay estimation; here we use a simpler estimator, the cross-correlation function. Our earlier article on the time-domain analysis of speech signals introduced the autocorrelation function, and the cross-correlation function of two discrete time-domain signals is computed with a similar formula:
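A common way to write this definition is sketched below. The symbols f and g, the lag index m, and the sign convention are assumptions, since conventions differ between texts; the sum runs over every n for which both samples exist.

```latex
R_{fg}(m) = \sum_{n} f(n + m)\, g(n)
```

Under this convention, if f is a copy of g delayed by d samples, the sum peaks at m = d, so a peak at a positive lag means f lags behind g.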

Recall that generalized cross-correlation applies a weighting to the cross-correlation in the frequency domain, and that the PHAT weighting whitens the result, which makes the peak of the cross-correlation function more prominent. Formula (1) can be modified in a similar way to make its peaks more pronounced:
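As an illustration of delay estimation by cross-correlation, here is a minimal NumPy sketch. The function name `estimate_delay` and the energy normalization are assumptions made for this example, not necessarily the operation the text refers to: dividing by the signal energies only bounds the peak to the range [-1, 1] so it is easier to compare across signals; it does not whiten the spectrum the way PHAT does.

```python
import numpy as np


def estimate_delay(sig, ref):
    """Estimate how many samples `sig` lags behind `ref`.

    Returns (delay, peak): `delay` is the lag of the largest value of the
    energy-normalized cross-correlation, and `peak` is that value.
    A positive delay means `sig` is a delayed copy of `ref`.
    """
    sig = np.asarray(sig, dtype=float)
    ref = np.asarray(ref, dtype=float)

    # Full cross-correlation: c[k] = sum_n sig[n + k] * ref[n]
    xcorr = np.correlate(sig, ref, mode="full")

    # Normalize by the signal energies (assumed normalization, see lead-in).
    norm = np.sqrt(np.sum(sig ** 2) * np.sum(ref ** 2))
    if norm > 0:
        xcorr = xcorr / norm

    # Output index i corresponds to lag i - (len(ref) - 1).
    lags = np.arange(-(len(ref) - 1), len(sig))
    peak_index = int(np.argmax(xcorr))
    return int(lags[peak_index]), float(xcorr[peak_index])


# Synthetic check: delay white noise by 37 samples, keeping the same length.
rng = np.random.default_rng(0)
reference = rng.standard_normal(1000)
delayed = np.concatenate([np.zeros(37), reference])[:1000]
delay, peak = estimate_delay(delayed, reference)
```

`np.correlate` evaluates the sum directly, which is slow for long recordings; an FFT-based routine such as `scipy.signal.correlate` is likely a better fit for full-length audio.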

Computing the cross-correlation is actually similar to computing a convolution. I found a video that illustrates the calculation process.

II. Praat

Many programming languages provide built-in cross-correlation functions. Here we use Praat, a program widely used in speech analysis. Because the speech signals have finite length, Praat adapts the cross-correlation function accordingly. In simple terms, the start time of the cross-correlation sequence is the start time of f minus the end time of g, and its end time is the end time of f minus the start time of g. In other words, the time of the first sample is the time of the first sample of f minus that of the last sample of g, and the time of the last sample is the time of the last sample of f minus that of the first sample of g. The length of the cross-correlation sequence is the number of samples in f plus the number of samples in g, minus 1.
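The time axis described above can be sketched in a few lines of NumPy. This follows the description in the text only and is not Praat's implementation; the function name `xcorr_time_axis` and its parameters are hypothetical.

```python
import numpy as np


def xcorr_time_axis(n_f, n_g, sample_rate, t0_f=0.0, t0_g=0.0):
    """Sample times (in seconds) of the full cross-correlation of f and g.

    n_f, n_g     : number of samples in f and g
    sample_rate  : common sampling rate in Hz
    t0_f, t0_g   : time of the first sample of f and of g
    """
    dt = 1.0 / sample_rate
    last_g = t0_g + (n_g - 1) * dt   # time of the last sample of g
    n_out = n_f + n_g - 1            # length of the cross-correlation sequence
    start = t0_f - last_g            # first sample of f minus last sample of g
    return start + dt * np.arange(n_out)


# Example: f has 5 samples, g has 3 samples, sampled at 1 Hz.
times = xcorr_time_axis(5, 3, sample_rate=1.0)
```

For this example the axis has 7 entries, running from -2 s (first sample of f minus last sample of g) to 4 s (last sample of f minus first sample of g).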

Let's look at the result. We start with two audio clips, shown in the figure below; the audio on the two tracks is clearly offset by a significant delay.
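Once a delay between two such tracks is known (in samples), removing it is a simple shift. The sketch below is one way to do this with NumPy; the helper name `remove_delay` is hypothetical, and zero-padding to keep the original length is an assumed design choice rather than what any particular tool does.

```python
import numpy as np


def remove_delay(sig, delay):
    """Shift `sig` earlier by `delay` samples, keeping its length.

    A positive `delay` drops the first `delay` samples and pads zeros at
    the end; a negative `delay` pads zeros at the start instead.
    """
    sig = np.asarray(sig)
    out = np.zeros_like(sig)
    n = len(sig)
    if abs(delay) >= n:
        return out                      # nothing of the signal remains
    if delay > 0:
        out[: n - delay] = sig[delay:]
    elif delay < 0:
        out[-delay:] = sig[: n + delay]
    else:
        out[:] = sig
    return out


# Example: a signal that starts 2 samples late is moved back to the start.
late = np.array([0, 0, 1, 2, 3])
aligned = remove_delay(late, 2)
```

After the shift, sample-wise metrics such as SNR can be computed over the region where both signals overlap.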