Official PyTorch implementation of SyntaSpeech (IJCAI 2022)

Overview

SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech


This repository is the official PyTorch implementation of our IJCAI-2022 paper, in which we propose SyntaSpeech for syntax-aware non-autoregressive Text-to-Speech.



SyntaSpeech builds on PortaSpeech (NeurIPS 2021) and adds three new features:

  1. We propose a Syntactic Graph Builder (Sec. 3.1) and a Syntactic Graph Encoder (Sec. 3.2), which prove effective at extracting syntactic features that improve the prosody modeling and duration accuracy of the TTS model.
  2. We introduce Multi-Length Adversarial Training (Sec. 3.3), which can replace the flow-based post-net in PortaSpeech, speeding up inference and improving the naturalness of the generated audio.
  3. We support three datasets: LJSpeech (single-speaker English dataset), Biaobei (single-speaker Chinese dataset), and LibriTTS (multi-speaker English dataset).

Environments

conda create -n synta python=3.7
conda activate synta
pip install -U pip
pip install Cython numpy==1.19.1
pip install torch==1.9.0 
pip install -r requirements.txt
# install dgl for graph neural networks; dgl-cu102 supports RTX 2080, dgl-cu113 supports RTX 3090
pip install dgl-cu102 dglgo -f https://data.dgl.ai/wheels/repo.html 
sudo apt install -y sox libsox-fmt-mp3
bash mfa_usr/install_mfa.sh # install forced alignment tools
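
After running the commands above, it can help to confirm that the key packages are importable before starting a long preprocessing or training job. The snippet below is a minimal sketch, not part of this repo: the helper name `check_environment` is hypothetical, and it assumes the packages are imported as `numpy`, `torch`, and `dgl`.

```python
import importlib.util
import sys

# Import names assumed for the packages installed above (an assumption,
# not something this repo defines).
REQUIRED = ("numpy", "torch", "dgl")


def check_environment(modules=REQUIRED):
    """Return {module_name: bool} telling whether each module can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    print("python", ".".join(map(str, sys.version_info[:3])))
    for name, found in check_environment().items():
        print(f"{name:8s} {'found' if found else 'MISSING'}")
```

The check only looks the modules up and does not import them, so it runs quickly and does not touch the GPU.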

Run SyntaSpeech!

Follow the steps below to run this repo.

1. Preparation

Data Preparation

You can directly use our binarized datasets for LJSpeech and Biaobei. Download them and unzip them into the data/binary/ folder.

For LibriTTS, download the raw dataset and process it with our data_gen modules. Detailed instructions can be found in docs/prepare_data.

Vocoder Preparation

We provide pre-trained vocoders for the three datasets: HiFi-GAN for LJSpeech and Biaobei, and ParallelWaveGAN for LibriTTS. Download and unzip them into the checkpoints/ folder.

2. Training Example

You can then train SyntaSpeech on any of the three datasets.

cd <the root_dir of your SyntaSpeech folder>
export PYTHONPATH=./
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/lj/synta.yaml --exp_name lj_synta --reset # training in LJSpeech
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/biaobei/synta.yaml --exp_name biaobei_synta --reset # training in Biaobei
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/libritts/synta.yaml --exp_name libritts_synta --reset # training in LibriTTS

3. Tensorboard

tensorboard --logdir=checkpoints/lj_synta
tensorboard --logdir=checkpoints/biaobei_synta
tensorboard --logdir=checkpoints/libritts_synta

4. Inference Example

CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/lj/synta.yaml --exp_name lj_synta --reset --infer # inference in LJSpeech
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/biaobei/synta.yaml --exp_name biaobei_synta --reset --infer # inference in Biaobei
CUDA_VISIBLE_DEVICES=0 python tasks/run.py --config egs/tts/libritts/synta.yaml --exp_name libritts_synta --reset --infer # inference in LibriTTS
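
The training and inference commands differ only in the config path, the experiment name, and the `--infer` flag, so they can be assembled programmatically when scripting several runs. The following is a small sketch under that assumption; `build_command` and `build_env` are hypothetical helpers, not functions from this repo, and they use only the flags and environment variables that appear in the commands above.

```python
import os
import shlex


def build_command(config, exp_name, infer=False, reset=True):
    """Assemble the tasks/run.py argument list used in the examples above."""
    cmd = ["python", "tasks/run.py", "--config", config, "--exp_name", exp_name]
    if reset:
        cmd.append("--reset")
    if infer:
        cmd.append("--infer")
    return cmd


def build_env(gpu="0"):
    """Copy the current environment and set the two variables the examples use."""
    env = dict(os.environ)
    env["PYTHONPATH"] = "./"
    env["CUDA_VISIBLE_DEVICES"] = gpu
    return env


if __name__ == "__main__":
    cmd = build_command("egs/tts/lj/synta.yaml", "lj_synta", infer=True)
    print(" ".join(shlex.quote(arg) for arg in cmd))
    # To launch from the repository root, one could call:
    # subprocess.run(cmd, env=build_env("0"), check=True)
```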

Audio Demos

Audio samples from the paper can be found on our demo page.

We also provide a Hugging Face demo page for LJSpeech. Try your own sentences there!

Citation

@article{ye2022syntaspeech,
  title={SyntaSpeech: Syntax-Aware Generative Adversarial Text-to-Speech},
  author={Ye, Zhenhui and Zhao, Zhou and Ren, Yi and Wu, Fei},
  journal={arXiv preprint arXiv:2204.11792},
  year={2022}
}

Acknowledgements

Our code is based on the following repos:

Comments
  • pinyin preprocess problem

    005804 你当#1我傻啊#3?脑子#1那么大#2怎么#1塞进去#4? ni3 dang1 wo2 sha3 a5 nao3 zi5 na4 me5 da4 zen3 me5 sai1 jin4 qu4

    txt_struct=[['', ['']], ['你', ['n', 'i3']], ['当', ['d', 'ang1']], ['我', ['uo3']], ['傻', ['sh', 'a3']], ['啊', ['a', '?', 'n', 'ao3']], ['?', ['z', 'i']], ['脑', ['n', 'a4']], ['子', ['m', 'e']], ['那', ['d', 'a4']], ['么', ['z', 'en3']], ['大', ['m', 'e']], ['怎', ['s', 'ai1']], ['么', ['j', 'in4']], ['塞', ['q', 'v4', '?']], ['进', []], ['去', []], ['?', []], ['', ['']]]

    ph_gb_word=['', 'n_i3', 'd_ang1', 'uo3', 'sh_a3', 'a_?n_ao3', 'z_i', 'n_a4', 'm_e', 'd_a4', 'z_en3', 'm_e', 's_ai1', 'j_in4', 'q_v4?', '', '', '', '']

    What is 'a_?_n_ao3'?

    Entries such as ch_a1_d_ou1 and a_?_n_ao3 also appear in the mfa_dict.

    opened by windowxiaoming 2
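
In the txt_struct quoted in this issue, the phonemes drift away from their characters once the "?" is reached, which is why merged entries such as 'a_?n_ao3' appear. As an illustration of the pairing one would expect, here is a minimal sketch that is not the repo's preprocessor: the helper `pair_chars_with_pinyin` is hypothetical, and it assumes prosody markers look like `#1` to `#4` and that every Chinese character maps to exactly one pinyin syllable.

```python
import re


def pair_chars_with_pinyin(text, pinyin):
    """Pair each Chinese character with one syllable; punctuation gets None."""
    syllables = pinyin.split()
    cleaned = re.sub(r"#\d", "", text)  # drop prosody markers such as #1 .. #4
    pairs = []
    i = 0
    for ch in cleaned:
        if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs
            if i >= len(syllables):
                raise ValueError("more characters than syllables")
            pairs.append((ch, syllables[i]))
            i += 1
        elif not ch.isspace():
            pairs.append((ch, None))  # punctuation consumes no syllable
    if i != len(syllables):
        raise ValueError("more syllables than characters")
    return pairs


if __name__ == "__main__":
    text = "你当#1我傻啊#3?脑子#1那么大#2怎么#1塞进去#4?"
    pinyin = "ni3 dang1 wo2 sha3 a5 nao3 zi5 na4 me5 da4 zen3 me5 sai1 jin4 qu4"
    for ch, syllable in pair_chars_with_pinyin(text, pinyin):
        print(ch, syllable)
```

Because punctuation never consumes a syllable here, the characters after "?" stay aligned with their own pinyin.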
  • discriminator output['y_c'] never used

    The discriminator's output['y_c'] is never used, and it is never computed in the discriminator's forward function. What does this variable mean? https://github.com/yerfor/SyntaSpeech/blob/5b07439633a3e714d2a6759ea4097eb36d6cd99a/tasks/tts/synta.py#L81

    opened by mayfool 2
  • A question of KL divergence calculation

    In modules/tts/portaspeech/fvae.py, SyntaFVAE computes loss_kl (line 121) as loss_kl = ((logqx - logpx) * nonpadding_sqz).sum() / nonpadding_sqz.sum() / logqx.shape[1]. Can someone explain why? I think loss_kl should be computed as loss_kl = logqx.exp() * (logqx - logpx). I would be very grateful for a reply!

    opened by JiaYK 2
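
One common reading of the quoted formula is as a Monte Carlo estimate: KL(q || p) = E_{z~q}[log q(z) - log p(z)], so if z is already sampled from q, averaging log q(z) - log p(z) over the samples estimates the KL, and multiplying by q(z) again would weight the samples twice. The sketch below illustrates this with two univariate Gaussians whose KL has a closed form. It assumes that z is drawn from the posterior q, as in a standard VAE; it is not the repo's code, and all function names are hypothetical.

```python
import math
import random


def log_normal(z, mean, std):
    """Log-density of a univariate Gaussian N(mean, std^2) at z."""
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((z - mean) / std) ** 2


def kl_estimates(q_mean, q_std, p_mean, p_std, n=100_000, seed=0):
    """Return (plain, weighted) sample averages with z drawn from q."""
    rng = random.Random(seed)
    plain = 0.0
    weighted = 0.0
    for _ in range(n):
        z = rng.gauss(q_mean, q_std)  # z ~ q
        log_q = log_normal(z, q_mean, q_std)
        diff = log_q - log_normal(z, p_mean, p_std)
        plain += diff                     # term of E_q[log q - log p]
        weighted += math.exp(log_q) * diff  # same term with an extra q(z) weight
    return plain / n, weighted / n


def kl_closed_form(q_mean, q_std, p_mean, p_std):
    """Exact KL(N(q_mean, q_std^2) || N(p_mean, p_std^2))."""
    return (math.log(p_std / q_std)
            + (q_std ** 2 + (q_mean - p_mean) ** 2) / (2 * p_std ** 2)
            - 0.5)


if __name__ == "__main__":
    plain, weighted = kl_estimates(0.5, 0.8, 0.0, 1.0)
    print("closed form:", kl_closed_form(0.5, 0.8, 0.0, 1.0))
    print("plain average:", plain)
    print("q-weighted average:", weighted)
```

In this toy setting the plain average tracks the closed-form value, while the q-weighted average does not. The remaining divisions in the quoted line (by nonpadding_sqz.sum() and logqx.shape[1]) look like normalization over non-padded positions and channels.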
  • mfa for multi speaker.

    In the code, MFA inputs are grouped for better parallelism. For multi-speaker data this may go wrong. For the input g_uang3 zh_ou1 n_v3 d_a4 x_ve2 sh_eng1 d_eng1 sh_an1 sh_i1 l_ian2 s_i4 t_ian1 j_ing3 f_ang1 zh_ao3 d_ao4 i2 s_i4 n_v3 sh_i1, the TextGrid is:

    	item [1]:
    		class = "IntervalTier"
    		name = "words"
    		xmin = 0.0
    		xmax = 9.4444
    		intervals: size = 56
    			intervals [1]:
    				xmin = 0
    				xmax = 0.5700000000000001
    				text = ""
    			intervals [2]:
    				xmin = 0.5700000000000001
    				xmax = 0.61
    				text = "eng"
    			intervals [3]:
    				xmin = 0.61
    				xmax = 0.79
    				text = "s_an1"
    			intervals [4]:
    				xmin = 0.79
    				xmax = 0.89
    				text = "eng"
    			intervals [5]:
    				xmin = 0.89
    				xmax = 1.06
    				text = "i1"
    			intervals [6]:
    				xmin = 1.06
    				xmax = 1.24
    				text = "eng"
    			intervals [7]:
    				xmin = 1.24
    				xmax = 1.3
    				text = ""
    			intervals [8]:
    				xmin = 1.3
    				xmax = 1.36
    				text = "s_an1"
    			intervals [9]:
    				xmin = 1.36
    				xmax = 1.42
    				text = ""
    			intervals [10]:
    				xmin = 1.42
    				xmax = 1.49
    				text = "eng"
    			intervals [11]:
    				xmin = 1.49
    				xmax = 1.67
    				text = "s_i4"
    			intervals [12]:
    				xmin = 1.67
    				xmax = 1.78
    				text = "eng"
    			intervals [13]:
    				xmin = 1.78
    				xmax = 1.91
    				text = ""
    			intervals [14]:
    				xmin = 1.91
    				xmax = 1.96
    				text = "er4"
    			intervals [15]:
    				xmin = 1.96
    				xmax = 2.06
    				text = "eng"
    			intervals [16]:
    				xmin = 2.06
    				xmax = 2.19
    				text = ""
    			intervals [17]:
    				xmin = 2.19
    				xmax = 2.35
    				text = "i1"
    			intervals [18]:
    				xmin = 2.35
    				xmax = 2.53
    				text = "eng"
    			intervals [19]:
    				xmin = 2.53
    				xmax = 3.03
    				text = "i1"
    			intervals [20]:
    				xmin = 3.03
    				xmax = 3.42
    				text = "eng"
    			intervals [21]:
    				xmin = 3.42
    				xmax = 3.48
    				text = "i1"
    			intervals [22]:
    				xmin = 3.48
    				xmax = 3.6
    				text = ""
    			intervals [23]:
    				xmin = 3.6
    				xmax = 3.64
    				text = "eng"
    			intervals [24]:
    				xmin = 3.64
    				xmax = 3.86
    				text = "i1"
    			intervals [25]:
    				xmin = 3.86
    				xmax = 3.99
    				text = "eng"
    			intervals [26]:
    				xmin = 3.99
    				xmax = 4.59
    				text = ""
    			intervals [27]:
    				xmin = 4.59
    				xmax = 4.869999999999999
    				text = "er4"
    			intervals [28]:
    				xmin = 4.869999999999999
    				xmax = 4.9799999999999995
    				text = "eng"
    			intervals [29]:
    				xmin = 4.9799999999999995
    				xmax = 5.1899999999999995
    				text = "s_i4"
    			intervals [30]:
    				xmin = 5.1899999999999995
    				xmax = 5.34
    				text = ""
    			intervals [31]:
    				xmin = 5.34
    				xmax = 5.43
    				text = "eng"
    			intervals [32]:
    				xmin = 5.43
    				xmax = 5.6
    				text = ""
    			intervals [33]:
    				xmin = 5.6
    				xmax = 5.76
    				text = "i1"
    			intervals [34]:
    				xmin = 5.76
    				xmax = 6.279999999999999
    				text = "eng"
    			intervals [35]:
    				xmin = 6.279999999999999
    				xmax = 6.359999999999999
    				text = "s_an1"
    			intervals [36]:
    				xmin = 6.359999999999999
    				xmax = 6.47
    				text = ""
    			intervals [37]:
    				xmin = 6.47
    				xmax = 6.6
    				text = "eng"
    			intervals [38]:
    				xmin = 6.6
    				xmax = 6.9399999999999995
    				text = "i1"
    			intervals [39]:
    				xmin = 6.9399999999999995
    				xmax = 7.039999999999999
    				text = "eng"
    			intervals [40]:
    				xmin = 7.039999999999999
    				xmax = 7.289999999999999
    				text = "s_an1"
    			intervals [41]:
    				xmin = 7.289999999999999
    				xmax = 7.369999999999999
    				text = "eng"
    			intervals [42]:
    				xmin = 7.369999999999999
    				xmax = 7.6
    				text = "s_i4"
    			intervals [43]:
    				xmin = 7.6
    				xmax = 7.699999999999999
    				text = "eng"
    			intervals [44]:
    				xmin = 7.699999999999999
    				xmax = 7.869999999999999
    				text = ""
    			intervals [45]:
    				xmin = 7.869999999999999
    				xmax = 8.049999999999999
    				text = "er4"
    			intervals [46]:
    				xmin = 8.049999999999999
    				xmax = 8.26
    				text = ""
    			intervals [47]:
    				xmin = 8.26
    				xmax = 8.299999999999999
    				text = "eng"
    			intervals [48]:
    				xmin = 8.299999999999999
    				xmax = 8.36
    				text = "s_i4"
    			intervals [49]:
    				xmin = 8.36
    				xmax = 8.389999999999999
    				text = ""
    			intervals [50]:
    				xmin = 8.389999999999999
    				xmax = 8.42
    				text = "eng"
    			intervals [51]:
    				xmin = 8.42
    				xmax = 8.45
    				text = ""
    			intervals [52]:
    				xmin = 8.45
    				xmax = 8.59
    				text = "s_an1"
    			intervals [53]:
    				xmin = 8.59
    				xmax = 8.83
    				text = ""
    			intervals [54]:
    				xmin = 8.83
    				xmax = 9.1
    				text = "eng"
    			intervals [55]:
    				xmin = 9.1
    				xmax = 9.44
    				text = "i1"
    			intervals [56]:
    				xmin = 9.44
    				xmax = 9.4444
    				text = ""
    
    opened by leon2milan 2
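
A mismatch like the one reported here can be detected automatically by comparing the labels in the TextGrid with the words that were sent to MFA. The sketch below parses the long TextGrid text layout shown above with a regular expression and lists the aligned labels that never occur in the input. It is an illustrative helper, not part of this repo, and the function names are hypothetical.

```python
import re

# Matches one interval entry of the long TextGrid text layout shown above.
INTERVAL_RE = re.compile(
    r'intervals \[(\d+)\]:\s*'
    r'xmin = ([0-9.]+)\s*'
    r'xmax = ([0-9.]+)\s*'
    r'text = "([^"]*)"'
)


def parse_intervals(textgrid_text):
    """Return [(xmin, xmax, text), ...] for every interval found in the text."""
    return [
        (float(xmin), float(xmax), label)
        for _, xmin, xmax, label in INTERVAL_RE.findall(textgrid_text)
    ]


def aligned_words(intervals):
    """Non-empty labels, i.e. the words the aligner actually placed."""
    return [label for _, _, label in intervals if label]


def unexpected_words(expected_words, intervals):
    """Aligned labels that never occur in the expected transcript."""
    expected = set(expected_words)
    return [word for word in aligned_words(intervals) if word not in expected]
```

Applied to the TextGrid in this issue, labels such as "eng", "s_an1", and "er4" would be reported, since none of them occurs in the quoted input. That suggests the TextGrid was produced for a different utterance.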
  • Problem with DDP

    Hello, I have been experimenting with this excellent repo, but DDP does not seem to take effect. Am I using it incorrectly?

    CUDA_VISIBLE_DEVICES=0,1,2 python -m torch.distributed.launch --nproc_per_node 3 tasks/run.py --config //fs.yaml --exp_name fs_test_demo --reset

    opened by zhazl 0
Releases(v1.0.0)
Owner
Zhenhui YE
I am currently a second-year computer science Ph.D student at Zhejiang University, working on deep learning and reinforcement learning.