Paper reading - AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
2022-06-13 02:16:00 【zjuPeco】
1 Overview
The voice conversion task takes two pieces of audio as input: one called content_audio and the other called speaker_audio. The model extracts the speech content from content_audio and the speaker characteristics from speaker_audio, then combines the two to output audio in which the speaker from speaker_audio says the content of content_audio.
Models for this task usually suffer from two problems:
- They need paired data, i.e. data where multiple people say the same sentence.
- They cannot be used on voices that did not appear in the training data.

AutoVC successfully solves both problems: it does not require paired data, and it achieves zero-shot conversion.
The official demo is at https://auspicious3000.github.io/autovc-demo/. The results are very good, but performance on Chinese speech is very poor. To make it work for Chinese, it needs to be retrained on a Chinese dataset. Chinese training data is easy to find: any dataset where you know who spoke each sentence will do, so for example any ASR dataset can be used.
2 Model architecture
The overall voice conversion pipeline is shown in Figure 2-1 below.
First, Mel spectrograms are extracted from the input content_wav and speaker_wav. The extracted speaker_mel is then passed through a speaker_embedding model to extract the speaker's characteristics. Next, the speaker characteristics and content_mel are fed into AutoVC to be fused, producing the fused Mel spectrogram merged_mel. Finally, merged_mel is passed through a vocoder to obtain the final output audio.
The purple boxes mark the parts that need training, and the earth-colored boxes mark the replaceable parts.
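To make the data flow concrete, here is a minimal sketch of the pipeline. All function names (get_mel, speaker_encoder, autovc, vocoder) are hypothetical stand-ins for the modules described in the following sections, not the repo's actual API:

```python
# Minimal pipeline sketch; every function name here is a hypothetical stand-in.
content_mel = get_mel("content.wav")           # (frames, 80) Mel spectrogram
speaker_mel = get_mel("speaker.wav")           # (frames, 80) Mel spectrogram

speaker_emb = speaker_encoder(speaker_mel)     # (256,) speaker embedding (Section 3.2)

merged_mel = autovc(content_mel, speaker_emb)  # fused Mel spectrogram (Section 3.3)

output_wav = vocoder(merged_mel)               # content_audio's words spoken in
                                               # speaker_audio's voice (Section 3.4)
```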
3 Module analysis
3.1 Extracting the Mel spectrogram
The input of this module is an audio signal, and the output is the Mel spectrogram of that signal, usually 80-dimensional.
Different models extract the Mel spectrogram with different parameters, and the Mel spectrogram produced here must also match the Mel spectrogram the vocoder expects as input. So when stitching different models into one pipeline, the Mel spectrogram processing must be unified; this is a very important step in the implementation. Otherwise, a model trained with one set of parameters cannot be connected to the models behind it, which causes a lot of trouble.
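As an illustration, a typical 80-bin log-Mel extraction with librosa might look like the sketch below. The sample rate, FFT size, hop length, and log floor are all assumptions, and every one of them must match the settings the vocoder was trained with:

```python
import librosa
import numpy as np

def get_mel(wav_path, sr=16000, n_fft=1024, hop_length=256, n_mels=80):
    """Extract a log-Mel spectrogram. All parameters here are illustrative;
    they must match the extraction settings used to train the vocoder."""
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    # Log compression; the clipping floor is also a convention that must stay consistent.
    log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))
    return log_mel.T  # shape: (frames, n_mels)
```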
3.2 Speaker encoder
This part extracts the speaker's features from the audio. The input is a Mel spectrogram, and the output is a 256-dimensional feature vector.
The so-called speaker features are features that are unrelated to the content and related only to the speaker. The goal of training this model is that the same person saying different sentences yields the same 256-dimensional vector, while different people saying the same sentence yield different 256-dimensional vectors.
This model can be pre-trained. AutoVC uses a D-vector model pre-trained on VoxCeleb1 and LibriSpeech, covering 3549 speakers in total.
Before training, every utterance of each speaker is passed through the speaker encoder separately, and these embeddings are averaged to serve as that speaker's speaker embedding.
At inference time, the input speaker_mel is passed through the speaker encoder, and the output feature is used as the speaker embedding.
This speaker embedding is a very important feature; if it is not good enough, the AutoVC trained on top of it will not be good either.
We can also retrain this model on a Chinese corpus, or take a model that someone else has pre-trained on Chinese, such as MockingBird.
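A minimal sketch of the per-speaker averaging described above; the encoder interface, tensor shapes, and the final L2 normalization are assumptions, not the pre-trained model's actual API:

```python
import torch

def compute_speaker_embedding(speaker_encoder, utterance_mels):
    """Average per-utterance embeddings into one speaker embedding.

    utterance_mels: list of (frames, 80) Mel spectrogram tensors, all from
    the same speaker. The encoder interface and shapes are illustrative.
    """
    with torch.no_grad():
        embs = [speaker_encoder(mel.unsqueeze(0)).squeeze(0) for mel in utterance_mels]
    emb = torch.stack(embs).mean(dim=0)  # (256,) average over all utterances
    return emb / emb.norm()              # L2-normalize, a common convention
```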
3.3 AutoVC
The AutoVC module is the focus of this article. It consists of a content encoder ($E_c$), a speaker encoder ($E_s$), and a decoder. The $E_s$ here is exactly the speaker encoder from Section 3.2; it is pre-trained, so it is drawn in gray. The inference flow is shown in Figure 3-1(a) below, and the training flow in Figure 3-1(b).
Figure 3-1(a) shows the inference flow. $X_1$ is the mel-spectrogram of content_audio, $Z_1$ is the content factor in $X_1$, and $U_1$ is the speaker factor in $X_1$; $E_c$ is the content encoder that extracts content features, and $C_1$ is the extracted content feature. $X_2$ is the mel-spectrogram of speaker_audio, $Z_2$ is the content factor in $X_2$, and $U_2$ is the speaker factor in $X_2$; $E_s$ is the speaker encoder that extracts speaker features, and $S_2$ is the extracted speaker feature. $D$ is the decoder that fuses the features, and $\hat{X}_{1 \rightarrow 2}$ is the final converted mel-spectrogram.
Figure 3-1(b) shows the training flow. During training we have no paired data, i.e. no data where two people say the same sentence, so there is no $S_2$, only $S_1$. This $S_1$ is not obtained by passing $X_1$ alone through $E_s$; instead, all utterances by the speaker of $X_1$ are passed through $E_s$ separately and the results are averaged, i.e. $S_1$ is fixed before training. $\hat{X}_{1 \rightarrow 1}$ is the self-reconstructed Mel spectrogram, and we want it to be as close to $X_1$ as possible.
The author also attaches a postnetwork after the decoder ($D$), which serves as a correction step. Its input is $\hat{X}_{1 \rightarrow 2}$ and its output is $\hat{R}_{1 \rightarrow 2}$; as the symbol suggests, it is a residual. With the residual, the final result is
$$\hat{X}_{final\ 1 \rightarrow 2} = \hat{X}_{1 \rightarrow 2} + \hat{R}_{1 \rightarrow 2} \tag{3-1}$$
The training loss consists of three parts.
The first is the self-reconstruction loss on the decoder output (before the postnet):
$$L_{recon} = E\left[ \left\| \hat{X}_{1 \rightarrow 1} - X_1 \right\|_2^2 \right] \tag{3-2}$$
The second is the self-reconstruction loss on the final output (after the postnet):
$$L_{recon0} = E\left[ \left\| \hat{X}_{final\ 1 \rightarrow 1} - X_1 \right\|_2^2 \right] \tag{3-3}$$
The third is the content loss:
$$L_{content} = E\left[ \left\| E_c(\hat{X}_{final\ 1 \rightarrow 1}) - C_1 \right\|_1 \right] \tag{3-4}$$
The total loss is
$$L = L_{recon} + \mu L_{recon0} + \lambda L_{content} \tag{3-5}$$
where $\mu$ and $\lambda$ are hyperparameters that balance the weights of the different losses. All of these losses serve to ensure the model's self-reconstruction ability.
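A minimal PyTorch sketch of the total loss under the notation above; the model interface (returning the pre-postnet output, the final output, and the content code) and the default weights are assumptions about how an implementation might be organized, not the repo's exact API:

```python
import torch.nn.functional as F

def autovc_loss(model, x1, s1, mu=1.0, lam=1.0):
    """Self-reconstruction training loss for one batch.

    x1: (B, T, 80) Mel spectrograms; s1: (B, 256) precomputed speaker
    embeddings. The model interface here is an illustrative assumption.
    """
    mel_out, mel_out_final, c1 = model(x1, s1)  # decoder out, postnet out, content code

    l_recon = F.mse_loss(mel_out, x1)           # Eq. (3-2), before the postnet
    l_recon0 = F.mse_loss(mel_out_final, x1)    # Eq. (3-3), after the postnet
    # Eq. (3-4): re-encode the reconstruction and match the content code.
    c1_hat = model.content_encoder(mel_out_final, s1)
    l_content = F.l1_loss(c1_hat, c1)

    return l_recon + mu * l_recon0 + lam * l_content  # Eq. (3-5)
```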
3.4 Vocoder
The vocoder's job is to convert the Mel spectrogram back into an audio signal. Normally, this model does not need to be replaced. The original work uses WaveNet, but WaveNet is far too slow: generating 5 seconds of audio takes about 15 minutes, which is impractical.
After open-sourcing the code, the author also provided a well-trained HiFi-GAN as the vocoder. HiFi-GAN is much faster than WaveNet, and its results are also good, although there can be some electronic artifacts. You can fix this by fine-tuning it yourself, but if you train HiFi-GAN, you also have to keep get_mel consistent, as discussed in Section 3.1.
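A sketch of HiFi-GAN inference, loosely following the structure of the jik876/hifi-gan inference script; the config and checkpoint file names are assumptions to be adapted to your setup:

```python
# Sketch of HiFi-GAN inference; file names are assumptions for illustration.
import json
import torch
from models import Generator  # from the jik876/hifi-gan repo
from env import AttrDict      # ditto

with open("config_v1.json") as f:
    h = AttrDict(json.load(f))

generator = Generator(h)
state = torch.load("generator_v1", map_location="cpu")
generator.load_state_dict(state["generator"])
generator.eval()
generator.remove_weight_norm()

with torch.no_grad():
    # merged_mel: (1, 80, T) log-Mel spectrogram, extracted with the SAME
    # parameters HiFi-GAN was trained with (see Section 3.1).
    audio = generator(merged_mel).squeeze()  # float waveform in [-1, 1]
```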
4 The key point
Some may ask: why is the $C_1$ trained this way a pure content feature? Even if $S_1$ is a clean speaker feature, wouldn't $C_1$ still carry some features of the speaker?
The original paper also addresses this question, with a proof under some listed assumptions; the main premise is that $S_1$ is a clean speaker feature.
I will not go through the theory here, only the intuitive understanding. The way to ensure that $C_1$ contains only the content features, and all of the content features, is to control the dimension of $C_1$. In the open-source code, this dimension is 32.
As shown in Figure 3-2: if the dimension of $C_1$ is too large, as in (a), it will also contain speaker features; the self-reconstruction ability is unaffected, but a speaker classification model trained on $C_1$ will reach high accuracy. If the dimension of $C_1$ is too small, as in (b), the final self-reconstruction error will be relatively large. If the dimension of $C_1$ is just right, as in (c), the self-reconstruction error is relatively small, and the accuracy of a speaker classification model trained on $C_1$ is also relatively low.
This is exactly how the author determined the dimension: the paper tabulates, for different dimensions of $C_1$, the resulting self-reconstruction error and speaker classifier accuracy.
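To make the probing idea concrete, here is a hedged sketch: train a small speaker classifier on frozen content codes and use its accuracy, together with the reconstruction error, to judge whether the bottleneck dimension leaks speaker information. The shapes and the linear-probe architecture are illustrative assumptions, not the paper's exact protocol:

```python
import torch
import torch.nn as nn

def speaker_leakage_probe(content_codes, speaker_ids, num_speakers, dim, epochs=10):
    """Train a linear probe on frozen content codes C_1.

    content_codes: (N, dim) float codes from the frozen content encoder;
    speaker_ids: (N,) long tensor of speaker labels. High probe accuracy
    means speaker information leaks through the bottleneck (dim too large).
    """
    probe = nn.Linear(dim, num_speakers)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = ce(probe(content_codes), speaker_ids)
        loss.backward()
        opt.step()
    with torch.no_grad():
        acc = (probe(content_codes).argmax(dim=1) == speaker_ids).float().mean()
    return acc.item()
```

In practice, one would sweep the bottleneck dimension and pick a value where the reconstruction error is still low while the probe accuracy stays near chance.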
References
[1] AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
[2] https://github.com/auspicious3000/autovc
[3] https://github.com/babysor/MockingBird
[4] https://github.com/jik876/hifi-gan