PyTorch Implementation of ByteDance's Cross-speaker Emotion Transfer Based on Speaker Condition Layer Normalization and Semi-Supervised Training in Text-To-Speech


Cross-Speaker-Emotion-Transfer - PyTorch Implementation



In the commands and paths below, DATASET stands for the name of a dataset, such as RAVDESS.


You can install the Python dependencies with

pip3 install -r requirements.txt

Also install fairseq (official document, github), which is required to use LConvBlock. Please check here if you run into any issue installing it. Note that a Dockerfile is provided for Docker users, but fairseq still has to be installed manually.


You have to download the pretrained models and put them in output/ckpt/DATASET/.

To extract soft emotion tokens from a reference audio, run

python3 --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --ref_audio REF_AUDIO_PATH --restore_step RESTORE_STEP --mode single --dataset DATASET

Or, to use hard emotion tokens from an emotion id, run

python3 --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --emotion_id EMOTION_ID --restore_step RESTORE_STEP --mode single --dataset DATASET

The dictionary of learned speakers can be found at preprocessed_data/DATASET/speakers.json, and the generated utterances will be put in output/result/.
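Before synthesizing, it can help to confirm that the speaker you plan to pass as SPEAKER_ID is actually in the speaker dictionary. The following is a minimal sketch, assuming that speakers.json is a JSON object whose keys are speaker names (for example Actor_22); the helper names load_speakers and check_speaker are hypothetical and not part of this repository.

```python
import json
from pathlib import Path


def load_speakers(dataset, root="preprocessed_data"):
    """Load <root>/<dataset>/speakers.json.

    Assumption: the file holds a JSON object keyed by speaker name.
    """
    path = Path(root) / dataset / "speakers.json"
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def check_speaker(speakers, speaker_id):
    """Return the entry for speaker_id, or raise KeyError listing known speakers."""
    if speaker_id not in speakers:
        known = ", ".join(sorted(speakers))
        raise KeyError(f"unknown speaker {speaker_id!r}; known speakers: {known}")
    return speakers[speaker_id]
```

For example, check_speaker(load_speakers("RAVDESS"), "Actor_22") either returns the stored entry or fails early with the list of valid names.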

Batch Inference

Batch inference is also supported. Try

python3 --source preprocessed_data/DATASET/val.txt --restore_step RESTORE_STEP --mode batch --dataset DATASET

to synthesize all utterances in preprocessed_data/DATASET/val.txt. Please note that only the hard emotion tokens from a given emotion id are supported in this mode.



The supported datasets are

  • RAVDESS: This portion of RAVDESS contains 1440 files: 60 trials per actor x 24 actors = 1440. RAVDESS contains 24 professional actors (12 female, 12 male) vocalizing two lexically matched statements in a neutral North American accent. The speech emotions include calm, happy, sad, angry, fearful, surprise, and disgust expressions. Each expression is produced at two levels of emotional intensity (normal, strong), with an additional neutral expression.
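The file count above can be checked with a little arithmetic, and the emotion labels can be read straight from the file names. The sketch below assumes the naming convention from the RAVDESS dataset documentation (seven two-digit fields: modality, vocal channel, emotion, intensity, statement, repetition, actor) and two repetitions per statement; neither detail is stated in this README, and the function names are hypothetical.

```python
from pathlib import Path

# Emotion and intensity codes as given in the RAVDESS documentation (assumption).
EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}
INTENSITIES = {"01": "normal", "02": "strong"}


def parse_ravdess_name(filename):
    """Decode a RAVDESS file name such as '03-01-04-02-01-02-22.wav'."""
    fields = Path(filename).stem.split("-")
    if len(fields) != 7:
        raise ValueError(f"expected 7 dash-separated fields, got {len(fields)}")
    _modality, _channel, emotion, intensity, statement, repetition, actor = fields
    return {
        "emotion": EMOTIONS[emotion],
        "intensity": INTENSITIES[intensity],
        "statement": int(statement),
        "repetition": int(repetition),
        "actor": int(actor),
        # Odd-numbered actors are male, even-numbered actors are female.
        "gender": "female" if int(actor) % 2 == 0 else "male",
    }


def trials_per_actor(statements=2, repetitions=2):
    """Seven emotions at two intensities, plus neutral at normal intensity only."""
    conditions = 7 * 2 + 1
    return conditions * statements * repetitions
```

With these defaults, trials_per_actor() gives 15 x 2 x 2 = 60 trials, and 60 x 24 actors matches the 1440 files quoted above.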

You can adapt the code to your own language and dataset by following the guide here.


  • For multi-speaker TTS with an external speaker embedder, download the ResCNN Softmax+Triplet pretrained model of philipperemy's DeepSpeaker for the speaker embedding and place it in ./deepspeaker/pretrained_models/.

  • Run

    python3 --dataset DATASET

    for some preparations.

    For the forced alignment, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Pre-extracted alignments for the datasets are provided here. You have to unzip the files into preprocessed_data/DATASET/TextGrid/. Alternatively, you can run the aligner yourself.

    After that, run the preprocessing script by

    python3 --dataset DATASET
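Since preprocessing depends on files placed by hand, a quick check before running it can save a failed run. This is a minimal sketch using only the two locations named above; the function name missing_preprocessing_inputs, the use_deepspeaker flag, and the assumption that the alignments are stored as *.TextGrid files are illustrative and not part of this repository.

```python
from pathlib import Path


def missing_preprocessing_inputs(dataset, root=".", use_deepspeaker=True):
    """Return a list of human-readable problems; an empty list means ready."""
    root = Path(root)
    problems = []

    # The DeepSpeaker weights are only needed with the external speaker embedder.
    if use_deepspeaker:
        model_dir = root / "deepspeaker" / "pretrained_models"
        if not model_dir.is_dir() or not any(model_dir.iterdir()):
            problems.append(f"no pretrained DeepSpeaker model found in {model_dir}")

    # MFA alignments are expected as .TextGrid files somewhere under this folder.
    textgrid_dir = root / "preprocessed_data" / dataset / "TextGrid"
    if not textgrid_dir.is_dir() or not any(textgrid_dir.rglob("*.TextGrid")):
        problems.append(f"no TextGrid alignments found in {textgrid_dir}")

    return problems
```

Calling missing_preprocessing_inputs("RAVDESS") from the repository root and printing the result shows which step still needs to be done.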


Train your model with

python3 --dataset DATASET

Useful options:

  • To use Automatic Mixed Precision, append the --use_amp argument to the above command.
  • The trainer assumes single-node multi-GPU training. To use specific GPUs, prefix the above command with CUDA_VISIBLE_DEVICES=<GPU_IDs>.
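If you launch training from another Python script rather than from a shell, the same GPU selection can be expressed through the child process environment. The sketch below only builds that environment; the helper name gpu_env is hypothetical, and the training command itself is left to the caller.

```python
import os


def gpu_env(gpu_ids, base_env=None):
    """Return a copy of the environment that exposes only the given GPUs.

    CUDA_VISIBLE_DEVICES takes a comma-separated list of device indices.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env
```

The result can be passed as the env argument of subprocess.run together with your training command, for example env=gpu_env([0, 1]) to train on the first two GPUs.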



Run

tensorboard --logdir output/log

to serve TensorBoard on your localhost. The loss curves, synthesized mel-spectrograms, and audio samples are shown.


  • The current implementation is not trained in a semi-supervised way due to the small dataset size. Semi-supervised training can, however, be activated easily by specifying target speakers and passing no emotion ID, with no emotion classifier loss.
  • In the decoder, a 15 x 1 LConv block is used instead of 17 x 1 due to memory issues.
  • There are two options for the speaker embedding in the multi-speaker TTS setting: training a speaker embedder from scratch or using philipperemy's pre-trained DeepSpeaker model (as STYLER did). You can toggle between them in the config ('none' or 'DeepSpeaker').
  • DeepSpeaker clearly separates the speakers of the RAVDESS dataset. The following figure shows the t-SNE plot of the extracted speaker embeddings.

  • For the vocoder, HiFi-GAN and MelGAN are supported.
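To make the "15 x 1 LConv" note more concrete, here is a plain-Python sketch of the lightweight convolution at the core of such a block, following the formulation in the lightweight convolution paper (Wu et al., 2019): each head owns one kernel, the kernel is softmax-normalized, and it is shared by a contiguous group of channels. This is an illustration only, not the fairseq implementation this repository relies on, and it omits the gating, feed-forward, and residual parts of a full LConv block.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def lightweight_conv(x, weight):
    """Depthwise convolution with softmax-normalized, head-shared kernels.

    x:      T time steps, each a list of C channel values.
    weight: H heads, each a list of K raw kernel weights (K odd).
    Channels are split into H contiguous groups; group h uses kernel h.
    The sequence is zero-padded so the output also has T steps.
    """
    T, C = len(x), len(x[0])
    H, K = len(weight), len(weight[0])
    if C % H != 0 or K % 2 == 0:
        raise ValueError("C must be divisible by H and K must be odd")
    group, pad = C // H, K // 2
    kernels = [softmax(w) for w in weight]
    out = [[0.0] * C for _ in range(T)]
    for t in range(T):
        for c in range(C):
            kernel = kernels[c // group]
            acc = 0.0
            for j in range(K):
                src = t + j - pad
                if 0 <= src < T:  # positions outside the sequence count as zero
                    acc += kernel[j] * x[src][c]
            out[t][c] = acc
    return out
```

In this notation, the weight shapes reported in the issues (8 x 15, or 8 x 1 x 15) would correspond to H = 8 heads and a kernel width of K = 15, so switching from 17 to 15 shortens each kernel by two taps.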


Please cite this repository using the "Cite this repository" button in the About section (top right of the main page).


  • loading state dict: size mismatch

    I ran into a problem when using your pre-trained model for synthesis. The following error occurs:

    RuntimeError: Error(s) in loading state_dict for XSpkEmoTrans:
        size mismatch for duratin_predictor.lconv_stack.0.conv_layer.weight: copying a param with shape torch.Size([2, 3]) from checkpoint, the shape in current model is torch.Size([2, 1, 3]).
        size mismatch for decoder.lconv_stack.0.conv_layer.weight: copying a param with shape torch.Size([8, 15]) from checkpoint, the shape in current model is torch.Size([8, 1, 15]).
        size mismatch for decoder.lconv_stack.1.conv_layer.weight: copying a param with shape torch.Size([8, 15]) from checkpoint, the shape in current model is torch.Size([8, 1, 15]).
        size mismatch for decoder.lconv_stack.2.conv_layer.weight: copying a param with shape torch.Size([8, 15]) from checkpoint, the shape in current model is torch.Size([8, 1, 15]).
        size mismatch for decoder.lconv_stack.3.conv_layer.weight: copying a param with shape torch.Size([8, 15]) from checkpoint, the shape in current model is torch.Size([8, 1, 15]).
        size mismatch for decoder.lconv_stack.4.conv_layer.weight: copying a param with shape torch.Size([8, 15]) from checkpoint, the shape in current model is torch.Size([8, 1, 15]).
        size mismatch for decoder.lconv_stack.5.conv_layer.weight: copying a param with shape torch.Size([8, 15]) from checkpoint, the shape in current model is torch.Size([8, 1, 15]).

    opened by cythc 2
  • Closed Issue


    Hi, I synthesized some samples with the provided pretrained models and the speaker embedding from philipperemy's DeepSpeaker repo. However, the results were bad: the speech was garbled and I could not make out any words.

    I am not sure if I am doing anything wrong since I just cloned your repository, downloaded the RAVDESS data and did everything listed in the Based on how I was able to generate samples, I do not think I am doing anything wrong, but was anyone able to synthesize good speech? And to the author of this repo @keonlee9420 do you mind uploading some samples generated from the pretrained models from the

    Thanks in advance.

    opened by jinny1208 0
  • The generated wav is not good


    Hi, thank you for open-sourcing this wonderful work! I followed your instructions: 1) install lightconv_cuda, 2) download the checkpoint, 3) download the speaker embedding npy. However, the generated result is not good.

    Below is the command I ran:

    python3 \
      --text "Hello world" \
      --speaker_id Actor_22 \
      --emotion_id sad \
      --restore_step 450000 \
      --mode single \
      --dataset RAVDESS
    # sh 
    2022-11-30 13:45:22.626404: I tensorflow/stream_executor/platform/default/] Successfully opened dynamic library
    Device of XSpkEmoTrans: cuda
    Removing weight norm...
    Raw Text Sequence: Hello world
    Phoneme Sequence: {HH AH0 L OW1 W ER1 L D}


    python 3.6.8
    fairseq                 0.10.2
    torch                   1.7.0+cu110
    CUDA 11.0

    Hello world_Actor_22_sad


    opened by pangtouyuqqq 1
  • Synthesis with other person out of RAVDESS


    Hello author, first of all, thank you for sharing this repo; it is really nice. I have two questions:

    1. I downloaded CMU data from a single speaker with 100 audio files, created a speaker embedding vector from it, and synthesized with that embedding, but the performance is not good. I cannot make out any words.
    2. Do we need to fine-tune the deep-speaker model to generate speaker embeddings for my data?

    Thank you

    opened by hathubkhn 5
  • Error using the pretrained model


    I'm trying to run synthesize with the pretrained model, like so:

    python3 --text "This sentence is a test" --speaker_id Actor_01 --emotion_id neutral --restore_step 450000  --dataset RAVDESS --mode single

    but I get an error in layer size:

    Traceback (most recent call last):
      File "", line 206, in <module>
        model = get_model(args, configs, device, train=False,
      File "/home/jrings/diviai/installs/Cross-Speaker-Emotion-Transfer/utils/", line 27, in get_model
        model.load_state_dict(model_dict, strict=False)
      File "<...>/torch/nn/modules/", line 1604, in load_state_dict
        raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
    RuntimeError: Error(s) in loading state_dict for XSpkEmoTrans:
    	size mismatch for emotion_emb.etl.embed: copying a param with shape torch.Size([8, 64]) from checkpoint, the shape in current model is torch.Size([9, 64]).
    	size mismatch for duratin_predictor.lconv_stack.0.conv_layer.weight: copying a param with shape torch.Size([2, 1, 3]) from checkpoint, the shape in current model is torch.Size([2, 3]).
    	size mismatch for decoder.lconv_stack.0.conv_layer.weight: copying a param with shape torch.Size([8, 1, 15]) from checkpoint, the shape in current model is torch.Size([8, 15]).
    	size mismatch for decoder.lconv_stack.1.conv_layer.weight: copying a param with shape torch.Size([8, 1, 15]) from checkpoint, the shape in current model is torch.Size([8, 15]).
    	size mismatch for decoder.lconv_stack.2.conv_layer.weight: copying a param with shape torch.Size([8, 1, 15]) from checkpoint, the shape in current model is torch.Size([8, 15]).
    	size mismatch for decoder.lconv_stack.3.conv_layer.weight: copying a param with shape torch.Size([8, 1, 15]) from checkpoint, the shape in current model is torch.Size([8, 15]).
    	size mismatch for decoder.lconv_stack.4.conv_layer.weight: copying a param with shape torch.Size([8, 1, 15]) from checkpoint, the shape in current model is torch.Size([8, 15]).
    	size mismatch for decoder.lconv_stack.5.conv_layer.weight: copying a param with shape torch.Size([8, 1, 15]) from checkpoint, the shape in current model is torch.Size([8, 15]).
    opened by jrings 1
  • speaker embedding npy file not found



    I am facing the following issue while synthesizing using pretrained model.

    Removing weight norm...
    Traceback (most recent call last):
      File "", line 234, in
        )) if load_spker_embed else None
      File "/home/sagar/tts/Cross-Speaker-Emotion-Transfer/venv/lib/python3.7/site-packages/numpy/lib/", line 417, in load
        fid = stack.enter_context(open(os_fspath(file), "rb"))
    FileNotFoundError: [Errno 2] No such file or directory: './preprocessed_data/RAVDESS/spker_embed/Actor_19-spker_embed.npy'

    Please suggest any way out. Thanks in advance -Sagar

    opened by raikarsagar 4
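The size-mismatch reports above come in two flavors. For the lconv_stack weights, the checkpoint and the model hold the same numbers of values and differ only by a singleton dimension ([8, 15] versus [8, 1, 15]), while emotion_emb.etl.embed differs in a real dimension ([8, 64] versus [9, 64]). The sketch below tells the two cases apart from the shapes alone. It is a diagnostic aid with hypothetical function names, not a fix shipped with this repository, and the cause of the singleton difference (possibly a different LConv backend on each side) is an assumption that the issues do not confirm.

```python
def classify_mismatch(ckpt_shape, model_shape):
    """Return 'ok', 'singleton', or 'incompatible' for a pair of shapes."""
    ckpt_shape, model_shape = tuple(ckpt_shape), tuple(model_shape)
    if ckpt_shape == model_shape:
        return "ok"

    def without_ones(shape):
        return tuple(dim for dim in shape if dim != 1)

    # Shapes that agree once size-1 axes are dropped hold the same values
    # in the same order, so a reshape would reconcile them.
    if without_ones(ckpt_shape) == without_ones(model_shape):
        return "singleton"
    return "incompatible"


def diagnose(ckpt_shapes, model_shapes):
    """Map each mismatching parameter name to its verdict."""
    report = {}
    for name, shape in ckpt_shapes.items():
        if name in model_shapes:
            verdict = classify_mismatch(shape, model_shapes[name])
            if verdict != "ok":
                report[name] = verdict
    return report

# With PyTorch, a 'singleton' entry could in principle be reconciled before
# load_state_dict via tensor.reshape(target_shape); an 'incompatible' entry
# (such as a different number of emotion tokens) needs a matching config.
```

The shape dictionaries can be built from a loaded checkpoint and from model.state_dict() by mapping each parameter name to tuple(tensor.shape).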
Keon Lee
Expressive Speech Synthesis | Conversational AI | Open-domain Dialog | NLP | Generative Models | Empathic Computing | HCI