Unet-TTS: Improving Unseen Speaker and Style Transfer in One-shot Voice Cloning
- MFA-based duration alignment
- Multi-speaker TTS with speaker-embedding Instance Normalization; the model also provides a pre-trained Content Encoder (see the sketch below).
- Unet-TTS training
- One-shot voice cloning inference
- C++ inference
Stay tuned!
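
The speaker-embedding Instance Normalization component can be pictured as a conditional instance norm: hidden features are normalized per channel to strip away source-speaker statistics, then re-scaled with a gain and bias predicted from the speaker embedding. Below is a minimal TensorFlow sketch; the SpeakerIN class name, tensor shapes, and Dense projections are illustrative assumptions, not this repo's actual implementation.

import tensorflow as tf

class SpeakerIN(tf.keras.layers.Layer):
    """Instance Normalization conditioned on a speaker embedding (illustrative sketch)."""

    def __init__(self, channels, epsilon=1e-5):
        super().__init__()
        self.epsilon = epsilon
        # Predict a per-channel scale and shift from the speaker embedding.
        self.to_gamma = tf.keras.layers.Dense(channels)
        self.to_beta = tf.keras.layers.Dense(channels)

    def call(self, x, spk_emb):
        # x: [batch, time, channels]; spk_emb: [batch, emb_dim]
        # Per-utterance, per-channel statistics over the time axis ("instance" norm).
        mean, var = tf.nn.moments(x, axes=[1], keepdims=True)
        x_norm = (x - mean) * tf.math.rsqrt(var + self.epsilon)  # remove source statistics
        gamma = self.to_gamma(spk_emb)[:, tf.newaxis, :]  # [batch, 1, channels]
        beta = self.to_beta(spk_emb)[:, tf.newaxis, :]
        return gamma * x_norm + beta  # re-inject target-speaker statistics

Normalizing over time removes utterance-level style statistics, and the embedding-predicted gain and bias then re-impose the target speaker's, which is what enables transfer to unseen speakers.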
Install Requirements
- Install the appropriate versions of TensorFlow and tensorflow-addons for your CUDA version; a quick version check is sketched after the install command below.
- The defaults are TensorFlow 2.6 and tensorflow-addons 0.14.0.
pip install TensorFlowTTS
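
After installing, a quick sanity check confirms that the versions match the defaults above and that TensorFlow can see your CUDA devices:

import tensorflow as tf
import tensorflow_addons as tfa

print(tf.__version__)                          # expect 2.6.x with the defaults above
print(tfa.__version__)                         # expect 0.14.0 with the defaults above
print(tf.config.list_physical_devices("GPU"))  # should list your GPU(s) if CUDA is set up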
Usage
- See UnetTTS_syn.py or the accompanying notebook.
CUDA_VISIBLE_DEVICES=0 python UnetTTS_syn.py
from UnetTTS_syn import UnetTTS

models_and_params = {
    "duration_param": "train/configs/unetts_duration.yaml",
    "duration_model": "models/duration4k.h5",
    "acous_param": "train/configs/unetts_acous.yaml",
    "acous_model": "models/acous12k.h5",
    "vocoder_param": "train/configs/multiband_melgan.yaml",
    "vocoder_model": "models/vocoder800k.h5",
}
feats_yaml = "train/configs/unetts_preprocess.yaml"
text2id_mapper = "models/unetts_mapper.json"

tts_handler = UnetTTS(models_and_params, text2id_mapper, feats_yaml)

# text: input text
# src_audio: reference audio
# dur_stat: phoneme duration statistics to control the speaking rate
syn_audio, _, _ = tts_handler.one_shot_TTS(text, src_audio, dur_stat)
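
Putting the pieces together, an end-to-end call might look like the sketch below. The text, the reference-audio path, the dur_stat value, and the 16 kHz output sample rate are illustrative assumptions; check UnetTTS_syn.py for the exact types one_shot_TTS expects.

import soundfile as sf

text = "Hello, this is a one-shot voice cloning test."  # input text (illustrative)
src_audio = "samples/reference.wav"  # reference audio from the target speaker (hypothetical path)
dur_stat = None  # placeholder; supply real duration statistics to change the speaking rate

syn_audio, _, _ = tts_handler.one_shot_TTS(text, src_audio, dur_stat)

# Save the synthesized waveform; the sample rate here is an assumption.
sf.write("cloned.wav", syn_audio, 16000)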