Paper reading - AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
2022-06-13 02:16:00 【zjuPeco】
1 Overview
The voice conversion task takes two pieces of audio as input: one called content_audio and the other called speaker_audio. The model extracts the speech content from content_audio and the speaker characteristics from speaker_audio, then combines the two to output audio in which the speaker from speaker_audio says the content of content_audio.
Models for this task usually suffer from two problems:
- They need paired data, i.e. data where multiple people say the same sentence.
- They cannot be used on voices that did not appear in the training data.

AutoVC successfully solves both problems: it does not require paired data, and it achieves zero-shot conversion.
The official demo is at https://auspicious3000.github.io/autovc-demo/. The results are very good, but performance on Chinese speech is very poor. To make it work for Chinese, it needs to be retrained on a Chinese dataset. Chinese training data is easy to find: any dataset where you know who spoke each sentence will do, so for example any ASR dataset can be used.
2 Model architecture
The overall voice conversion pipeline is shown in Figure 2-1 below.
First, Mel spectrograms are extracted from the input content_wav and speaker_wav. The extracted speaker_mel is then passed through a speaker_embedding model to extract the speaker's characteristics. Next, the speaker characteristics and content_mel are fed into AutoVC to be fused, producing the fused Mel spectrogram merged_mel. Finally, merged_mel is passed through a vocoder to obtain the final output audio.
The purple boxes mark the parts that need training, and the earth-colored boxes mark the replaceable parts.
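To make the data flow concrete, here is a minimal sketch of the pipeline. All function names (get_mel, speaker_encoder, autovc, vocoder) are hypothetical stand-ins for the modules described in the following sections, not the repo's actual API:

```python
# Minimal pipeline sketch; every function name here is a hypothetical stand-in.
content_mel = get_mel("content.wav")           # (frames, 80) Mel spectrogram
speaker_mel = get_mel("speaker.wav")           # (frames, 80) Mel spectrogram

speaker_emb = speaker_encoder(speaker_mel)     # (256,) speaker embedding (Section 3.2)

merged_mel = autovc(content_mel, speaker_emb)  # fused Mel spectrogram (Section 3.3)

output_wav = vocoder(merged_mel)               # content_audio's words spoken in
                                               # speaker_audio's voice (Section 3.4)
```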
3 Module analysis
3.1 Extracting the Mel spectrogram
The input of this module is an audio signal, and the output is the Mel spectrogram of that signal, usually 80-dimensional.
Different models extract the Mel spectrogram with different parameters, and the Mel spectrogram produced here must also match the Mel spectrogram the vocoder expects as input. So when stitching different models into one pipeline, the Mel spectrogram processing must be unified; this is a very important step in the implementation. Otherwise, a model trained with one set of parameters cannot be connected to the models behind it, which causes a lot of trouble.
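As an illustration, a typical 80-bin log-Mel extraction with librosa might look like the sketch below. The sample rate, FFT size, hop length, and log floor are all assumptions, and every one of them must match the settings the vocoder was trained with:

```python
import librosa
import numpy as np

def get_mel(wav_path, sr=16000, n_fft=1024, hop_length=256, n_mels=80):
    """Extract a log-Mel spectrogram. All parameters here are illustrative;
    they must match the extraction settings used to train the vocoder."""
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    # Log compression; the clipping floor is also a convention that must stay consistent.
    log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))
    return log_mel.T  # shape: (frames, n_mels)
```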
3.2 Speaker encoder
This part extracts the speaker's features from the audio. The input is a Mel spectrogram, and the output is a 256-dimensional feature vector.
The so-called speaker features are features that are unrelated to the content and related only to the speaker. The goal of training this model is that the same person saying different sentences yields the same 256-dimensional vector, while different people saying the same sentence yield different 256-dimensional vectors.
This model can be pre-trained. AutoVC uses a D-vector model pre-trained on VoxCeleb1 and LibriSpeech, covering 3549 speakers in total.
Before training, every utterance of each speaker is passed through the speaker encoder separately, and these embeddings are averaged to serve as that speaker's speaker embedding.
At inference time, the input speaker_mel is passed through the speaker encoder, and the output feature is used as the speaker embedding.
This speaker embedding is a very important feature; if it is not good enough, the AutoVC trained on top of it will not be good either.
We can also retrain this model on a Chinese corpus, or take a model that someone else has pre-trained on Chinese, such as MockingBird.
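A minimal sketch of the per-speaker averaging described above; the encoder interface, tensor shapes, and the final L2 normalization are assumptions, not the pre-trained model's actual API:

```python
import torch

def compute_speaker_embedding(speaker_encoder, utterance_mels):
    """Average per-utterance embeddings into one speaker embedding.

    utterance_mels: list of (frames, 80) Mel spectrogram tensors, all from
    the same speaker. The encoder interface and shapes are illustrative.
    """
    with torch.no_grad():
        embs = [speaker_encoder(mel.unsqueeze(0)).squeeze(0) for mel in utterance_mels]
    emb = torch.stack(embs).mean(dim=0)  # (256,) average over all utterances
    return emb / emb.norm()              # L2-normalize, a common convention
```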
3.3 AutoVC
The AutoVC module is the focus of this article. It consists of a content encoder ($E_c$), a speaker encoder ($E_s$), and a decoder. The $E_s$ here is exactly the speaker encoder from Section 3.2; it is pre-trained, so it is drawn in gray. The inference flow is shown in Figure 3-1(a) below, and the training flow in Figure 3-1(b).
Figure 3-1(a) shows the inference flow. $X_1$ is the mel-spectrogram of content_audio, $Z_1$ is the content factor in $X_1$, and $U_1$ is the speaker factor in $X_1$; $E_c$ is the content encoder that extracts content features, and $C_1$ is the extracted content feature. $X_2$ is the mel-spectrogram of speaker_audio, $Z_2$ is the content factor in $X_2$, and $U_2$ is the speaker factor in $X_2$; $E_s$ is the speaker encoder that extracts speaker features, and $S_2$ is the extracted speaker feature. $D$ is the decoder that fuses the features, and $\hat{X}_{1 \rightarrow 2}$ is the final converted mel-spectrogram.
Figure 3-1(b) shows the training flow. During training we have no paired data, i.e. no data where two people say the same sentence, so there is no $S_2$, only $S_1$. This $S_1$ is not obtained by passing $X_1$ alone through $E_s$; instead, all utterances by the speaker of $X_1$ are passed through $E_s$ separately and the results are averaged, i.e. $S_1$ is fixed before training. $\hat{X}_{1 \rightarrow 1}$ is the self-reconstructed Mel spectrogram, and we want it to be as close to $X_1$ as possible.
The author also attaches a postnetwork after the decoder ($D$), which serves as a correction step. Its input is $\hat{X}_{1 \rightarrow 2}$ and its output is $\hat{R}_{1 \rightarrow 2}$; as the symbol suggests, it is a residual. With the residual, the final result is
$$\hat{X}_{final\ 1 \rightarrow 2} = \hat{X}_{1 \rightarrow 2} + \hat{R}_{1 \rightarrow 2} \tag{3-1}$$
The training loss consists of three parts.
The first is the self-reconstruction loss on the decoder output (before the postnet):
$$L_{recon} = E\left[ \left\| \hat{X}_{1 \rightarrow 1} - X_1 \right\|_2^2 \right] \tag{3-2}$$
The second is the self-reconstruction loss on the final output (after the postnet):
$$L_{recon0} = E\left[ \left\| \hat{X}_{final\ 1 \rightarrow 1} - X_1 \right\|_2^2 \right] \tag{3-3}$$
The third is the content loss:
$$L_{content} = E\left[ \left\| E_c(\hat{X}_{final\ 1 \rightarrow 1}) - C_1 \right\|_1 \right] \tag{3-4}$$
The total loss is
$$L = L_{recon} + \mu L_{recon0} + \lambda L_{content} \tag{3-5}$$
where $\mu$ and $\lambda$ are hyperparameters that balance the weights of the different losses. All of these losses serve to ensure the model's self-reconstruction ability.
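A minimal PyTorch sketch of the total loss under the notation above; the model interface (returning the pre-postnet output, the final output, and the content code) and the default weights are assumptions about how an implementation might be organized, not the repo's exact API:

```python
import torch.nn.functional as F

def autovc_loss(model, x1, s1, mu=1.0, lam=1.0):
    """Self-reconstruction training loss for one batch.

    x1: (B, T, 80) Mel spectrograms; s1: (B, 256) precomputed speaker
    embeddings. The model interface here is an illustrative assumption.
    """
    mel_out, mel_out_final, c1 = model(x1, s1)  # decoder out, postnet out, content code

    l_recon = F.mse_loss(mel_out, x1)           # Eq. (3-2), before the postnet
    l_recon0 = F.mse_loss(mel_out_final, x1)    # Eq. (3-3), after the postnet
    # Eq. (3-4): re-encode the reconstruction and match the content code.
    c1_hat = model.content_encoder(mel_out_final, s1)
    l_content = F.l1_loss(c1_hat, c1)

    return l_recon + mu * l_recon0 + lam * l_content  # Eq. (3-5)
```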
3.4 Vocoder
The vocoder's job is to convert the Mel spectrogram back into an audio signal. Normally, this model does not need to be replaced. The original work uses WaveNet, but WaveNet is far too slow: generating 5 seconds of audio takes about 15 minutes, which is impractical.
After open-sourcing the code, the author also provided a well-trained HiFi-GAN as the vocoder. HiFi-GAN is much faster than WaveNet, and its results are also good, although there can be some electronic artifacts. You can fix this by fine-tuning it yourself, but if you train HiFi-GAN, you also have to keep get_mel consistent, as discussed in Section 3.1.
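A sketch of HiFi-GAN inference, loosely following the structure of the jik876/hifi-gan inference script; the config and checkpoint file names are assumptions to be adapted to your setup:

```python
# Sketch of HiFi-GAN inference; file names are assumptions for illustration.
import json
import torch
from models import Generator  # from the jik876/hifi-gan repo
from env import AttrDict      # ditto

with open("config_v1.json") as f:
    h = AttrDict(json.load(f))

generator = Generator(h)
state = torch.load("generator_v1", map_location="cpu")
generator.load_state_dict(state["generator"])
generator.eval()
generator.remove_weight_norm()

with torch.no_grad():
    # merged_mel: (1, 80, T) log-Mel spectrogram, extracted with the SAME
    # parameters HiFi-GAN was trained with (see Section 3.1).
    audio = generator(merged_mel).squeeze()  # float waveform in [-1, 1]
```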
4 The key point
Some may ask: why is the $C_1$ trained this way a pure content feature? Even if $S_1$ is a clean speaker feature, wouldn't $C_1$ still carry some features of the speaker?
The original paper also addresses this question, with a proof under some listed assumptions; the main premise is that $S_1$ is a clean speaker feature.
I will not go through the theory here, only the intuitive understanding. The way to ensure that $C_1$ contains only the content features, and all of the content features, is to control the dimension of $C_1$. In the open-source code, this dimension is 32.
As shown in Figure 3-2: if the dimension of $C_1$ is too large, as in (a), it will also contain speaker features; the self-reconstruction ability is unaffected, but a speaker classification model trained on $C_1$ will reach high accuracy. If the dimension of $C_1$ is too small, as in (b), the final self-reconstruction error will be relatively large. If the dimension of $C_1$ is just right, as in (c), the self-reconstruction error is relatively small, and the accuracy of a speaker classification model trained on $C_1$ is also relatively low.
This is exactly how the author determined the dimension: the paper tabulates, for different dimensions of $C_1$, the resulting self-reconstruction error and speaker classifier accuracy.
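To make the probing idea concrete, here is a hedged sketch: train a small speaker classifier on frozen content codes and use its accuracy, together with the reconstruction error, to judge whether the bottleneck dimension leaks speaker information. The shapes and the linear-probe architecture are illustrative assumptions, not the paper's exact protocol:

```python
import torch
import torch.nn as nn

def speaker_leakage_probe(content_codes, speaker_ids, num_speakers, dim, epochs=10):
    """Train a linear probe on frozen content codes C_1.

    content_codes: (N, dim) float codes from the frozen content encoder;
    speaker_ids: (N,) long tensor of speaker labels. High probe accuracy
    means speaker information leaks through the bottleneck (dim too large).
    """
    probe = nn.Linear(dim, num_speakers)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = ce(probe(content_codes), speaker_ids)
        loss.backward()
        opt.step()
    with torch.no_grad():
        acc = (probe(content_codes).argmax(dim=1) == speaker_ids).float().mean()
    return acc.item()
```

In practice, one would sweep the bottleneck dimension and pick a value where the reconstruction error is still low while the probe accuracy stays near chance.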
References
[1] AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
[2] https://github.com/auspicious3000/autovc
[3] https://github.com/babysor/MockingBird
[4] https://github.com/jik876/hifi-gan