TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for TensorFlow 2 (supports English, Korean, Chinese, and German, and is easy to adapt to other languages)

Overview

😋 TensorFlowTTS

Real-Time State-of-the-art Speech Synthesis for Tensorflow 2

🤪 TensorFlowTTS provides real-time, state-of-the-art speech synthesis architectures such as Tacotron-2, MelGAN, Multi-band MelGAN, FastSpeech, and FastSpeech2, based on TensorFlow 2. With TensorFlow 2, we can speed up training and inference, optimize models further with fake-quantization-aware training and pruning, and make TTS models run faster than real time and deployable on mobile devices or embedded systems.

What's new

Features

  • High performance on speech synthesis.
  • Can be fine-tuned on other languages.
  • Fast, scalable, and reliable.
  • Suitable for deployment.
  • Easy to implement a new model, based on an abstract class.
  • Mixed precision to speed up training where possible.
  • Supports gradient accumulation on single and multiple GPUs.
  • Supports both single and multiple GPUs in the base trainer class.
  • TFLite conversion for all supported models.
  • Android example.
  • Supports many languages (currently Chinese, Korean, and English).
  • Supports C++ inference.
  • Supports converting weights for some models from PyTorch to TensorFlow to accelerate speed.

Requirements

This repository is tested on Ubuntu 18.04 with:

Other TensorFlow versions should work but have not been tested yet. This repo will try to work with the latest stable TensorFlow version. We recommend installing TensorFlow 2.3.0 for training if you want to use multiple GPUs.

Installation

With pip

$ pip install TensorFlowTTS

From source

Examples are included in the repository but are not shipped with the framework. Therefore, to run the latest version of the examples, you need to install from source as shown below.

$ git clone https://github.com/TensorSpeech/TensorFlowTTS.git
$ cd TensorFlowTTS
$ pip install .

If you want to upgrade the repository and its dependencies:

$ git pull
$ pip install --upgrade .

Supported Model architectures

TensorFlowTTS currently provides the following architectures:

  1. MelGAN released with the paper MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis by Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brebisson, Yoshua Bengio, Aaron Courville.
  2. Tacotron-2 released with the paper Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions by Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, Yonghui Wu.
  3. FastSpeech released with the paper FastSpeech: Fast, Robust, and Controllable Text to Speech by Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu.
  4. Multi-band MelGAN released with the paper Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech by Geng Yang, Shan Yang, Kai Liu, Peng Fang, Wei Chen, Lei Xie.
  5. FastSpeech2 released with the paper FastSpeech 2: Fast and High-Quality End-to-End Text to Speech by Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu.
  6. Parallel WaveGAN released with the paper Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram by Ryuichi Yamamoto, Eunwoo Song, Jae-Min Kim.
  7. HiFi-GAN released with the paper HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis by Jungil Kong, Jaehyeon Kim, Jaekyoung Bae.

We are also implementing some techniques to improve quality and convergence speed from the following papers:

  1. Guided Attention Loss released with the paper Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention by Hideyuki Tachibana, Katsuya Uenoyama, Shunsuke Aihara.

Audio Samples

Here are some audio samples on the validation set: tacotron-2, fastspeech, melgan, melgan.stft, fastspeech2, multiband_melgan

Tutorial End-to-End

Prepare Dataset

Prepare a dataset in the following format:

|- [NAME_DATASET]/
|   |- metadata.csv
|   |- wavs/
|       |- file1.wav
|       |- ...

Where metadata.csv has the following format: id|transcription. This is an ljspeech-like format; you can skip the preprocessing steps if your dataset is in another format.

Note that NAME_DATASET should be [ljspeech/kss/baker/libritts] for example.
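
Before preprocessing, it can help to check that a dataset folder really follows the layout above. The following is only a minimal sketch and not part of TensorFlowTTS: the function name check_dataset is hypothetical, and it assumes that each id in metadata.csv corresponds to a file wavs/<id>.wav, as in LJSpeech.

```python
from pathlib import Path


def check_dataset(root):
    """Return (utterance ids, problems) for an ljspeech-like dataset folder.

    Assumes metadata.csv holds one `id|transcription` entry per line and that
    each id maps to wavs/<id>.wav (an assumption borrowed from LJSpeech).
    """
    root = Path(root)
    ids, problems = [], []
    with open(root / "metadata.csv", encoding="utf-8") as f:
        for line_no, line in enumerate(f, 1):
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            # Split on the first "|" only, so extra fields stay in the text.
            utt_id, sep, text = line.partition("|")
            if not sep or not text.strip():
                problems.append(f"line {line_no}: expected 'id|transcription'")
                continue
            if not (root / "wavs" / f"{utt_id}.wav").is_file():
                problems.append(f"line {line_no}: missing wavs/{utt_id}.wav")
                continue
            ids.append(utt_id)
    return ids, problems
```

Calling check_dataset("./ljspeech") returns the usable utterance IDs together with a list of human-readable problems, so an empty second value suggests the folder is ready for preprocessing.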

Preprocessing

The preprocessing has two steps:

  1. Preprocess audio features
    • Convert characters to IDs
    • Compute mel spectrograms
    • Normalize mel spectrograms to [-1, 1] range
    • Split the dataset into train and validation
    • Compute the mean and standard deviation of multiple features from the training split
  2. Standardize mel spectrogram based on computed statistics

To reproduce the steps above:

tensorflow-tts-preprocess --rootdir ./[ljspeech/kss/baker/libritts/thorsten] --outdir ./dump_[ljspeech/kss/baker/libritts/thorsten] --config preprocess/[ljspeech/kss/baker/thorsten]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts/thorsten]
tensorflow-tts-normalize --rootdir ./dump_[ljspeech/kss/baker/libritts/thorsten] --outdir ./dump_[ljspeech/kss/baker/libritts/thorsten] --config preprocess/[ljspeech/kss/baker/libritts/thorsten]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts/thorsten]

Right now we only support ljspeech, kss, baker, libritts and thorsten for dataset argument. In the future, we intend to support more datasets.

Note: To run libritts preprocessing, please first read the instructions in examples/fastspeech2_libritts. The dataset needs to be reformatted before running preprocessing.
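
The standardization in step 2 of the preprocessing can be pictured with a few lines of NumPy. This is a sketch under stated assumptions rather than the code behind tensorflow-tts-normalize: the helper names compute_stats and standardize are hypothetical, the statistics are taken per mel bin, and random arrays stand in for real mel spectrograms.

```python
import numpy as np


def compute_stats(train_mels):
    """Per-bin mean and std over all frames of the training split.

    train_mels: list of arrays shaped [num_frames, num_mels].
    """
    frames = np.concatenate(train_mels, axis=0)
    return frames.mean(axis=0), frames.std(axis=0)


def standardize(mel, mean, std, eps=1e-8):
    """Shift and scale each mel bin to zero mean and unit variance."""
    return (mel - mean) / (std + eps)


# Toy data standing in for real mel spectrograms (80 bins, varying length).
rng = np.random.default_rng(0)
train_mels = [rng.normal(3.0, 2.0, size=(n, 80)) for n in (120, 95, 140)]
valid_mel = rng.normal(3.0, 2.0, size=(60, 80))

mean, std = compute_stats(train_mels)
train_norm = [standardize(m, mean, std) for m in train_mels]
# Validation data reuses the training statistics; it is never used to fit them.
valid_norm = standardize(valid_mel, mean, std)
```

Fitting the statistics on the training split only, and then applying them unchanged to the validation split, keeps validation data from leaking into the normalization.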

After preprocessing, the structure of the project folder should be:

|- [NAME_DATASET]/
|   |- metadata.csv
|   |- wavs/
|       |- file1.wav
|       |- ...
|- dump_[ljspeech/kss/baker/libritts/thorsten]/
|   |- train/
|       |- ids/
|           |- LJ001-0001-ids.npy
|           |- ...
|       |- raw-feats/
|           |- LJ001-0001-raw-feats.npy
|           |- ...
|       |- raw-f0/
|           |- LJ001-0001-raw-f0.npy
|           |- ...
|       |- raw-energies/
|           |- LJ001-0001-raw-energy.npy
|           |- ...
|       |- norm-feats/
|           |- LJ001-0001-norm-feats.npy
|           |- ...
|       |- wavs/
|           |- LJ001-0001-wave.npy
|           |- ...
|   |- valid/
|       |- ids/
|           |- LJ001-0009-ids.npy
|           |- ...
|       |- raw-feats/
|           |- LJ001-0009-raw-feats.npy
|           |- ...
|       |- raw-f0/
|           |- LJ001-0009-raw-f0.npy
|           |- ...
|       |- raw-energies/
|           |- LJ001-0009-raw-energy.npy
|           |- ...
|       |- norm-feats/
|           |- LJ001-0009-norm-feats.npy
|           |- ...
|       |- wavs/
|           |- LJ001-0009-wave.npy
|           |- ...
|   |- stats.npy
|   |- stats_f0.npy
|   |- stats_energy.npy
|   |- train_utt_ids.npy
|   |- valid_utt_ids.npy
|- examples/
|   |- melgan/
|   |- fastspeech/
|   |- tacotron2/
|   ...

  • stats.npy contains the mean and std of the training split mel spectrograms
  • stats_energy.npy contains the mean and std of energy values from the training split
  • stats_f0.npy contains the mean and std of F0 values from the training split
  • train_utt_ids.npy / valid_utt_ids.npy contain the training and validation utterance IDs, respectively

We use a suffix (ids, raw-feats, raw-energy, raw-f0, norm-feats, or wave) for each input type.
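
The naming scheme can be turned into a small helper that lists the files expected for one utterance. This is a hedged sketch: the names FEATURE_SUFFIXES, feature_paths, and missing_features are hypothetical, and the folder-to-suffix mapping is copied from the tree shown above rather than from the library code.

```python
from pathlib import Path

# Sub-folder -> file suffix, copied from the tree above. Note that two
# folders use a plural name while their files use the singular suffix.
FEATURE_SUFFIXES = {
    "ids": "ids",
    "raw-feats": "raw-feats",
    "raw-f0": "raw-f0",
    "raw-energies": "raw-energy",
    "norm-feats": "norm-feats",
    "wavs": "wave",
}


def feature_paths(dump_dir, split, utt_id):
    """Map each feature folder to the expected .npy path for one utterance."""
    base = Path(dump_dir) / split
    return {
        folder: base / folder / f"{utt_id}-{suffix}.npy"
        for folder, suffix in FEATURE_SUFFIXES.items()
    }


def missing_features(dump_dir, split, utt_id):
    """List the feature folders that have no file for this utterance."""
    paths = feature_paths(dump_dir, split, utt_id)
    return [folder for folder, path in paths.items() if not path.is_file()]
```

For example, missing_features("./dump_ljspeech", "train", "LJ001-0001") returns an empty list when every feature file for that utterance is present.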

IMPORTANT NOTES:

  • This preprocessing step is based on ESPnet so you can combine all models here with other models from ESPnet repository.
  • Regardless of how your dataset is formatted, the final structure of the dump folder SHOULD follow the structure above to be usable with the training scripts; otherwise, you can modify the scripts yourself 😄.

Training models

To learn how to train a model from scratch or fine-tune on other datasets/languages, please see the details in the examples directory.

Abstract Class Explanation

Abstract DataLoader Tensorflow-based dataset

A detailed implementation of the abstract dataset class is in tensorflow_tts/dataset/abstract_dataset. There are some functions you need to override and understand:

  1. get_args: This function returns the arguments for the generator function, normally utt_ids.
  2. generator: This function takes its inputs from get_args and returns the inputs for the model. Note that all generator functions return a dictionary whose keys exactly match the model's parameters, because base_trainer uses model(**batch) for the forward step.
  3. get_output_dtypes: This function needs to return the dtype of each element yielded by the generator function.
  4. get_len_dataset: This function returns the length of the dataset, normally len(utt_ids).

IMPORTANT NOTES:

  • The dataset pipeline should be: cache -> shuffle -> map_fn -> get_batch -> prefetch.
  • If you shuffle before cache, the dataset won't be reshuffled when it is iterated again.
  • You should apply map_fn so that every element returned by the generator function has the same length before batching and feeding it into a model.

Some examples that use this abstract_dataset are tacotron_dataset.py, fastspeech_dataset.py, melgan_dataset.py, fastspeech2_dataset.py
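
The contract between the dataset and the model can be illustrated without TensorFlow. The sketch below is framework-free and purely illustrative: ToyDataset, pad_batch, toy_model, and the keys input_ids and mel_gts are hypothetical names, and NumPy arrays stand in for tensors. It mirrors the four methods listed above and pads every element to a common length before stacking a batch.

```python
import numpy as np


class ToyDataset:
    """Framework-free stand-in that mirrors the four methods listed above."""

    def __init__(self, utt_ids, char_ids, mels):
        self.utt_ids = utt_ids
        self.char_ids = char_ids  # utt_id -> 1-D int array
        self.mels = mels          # utt_id -> [num_frames, num_mels] float array

    def get_args(self):
        # Arguments handed to generator(); normally just the utterance ids.
        return [self.utt_ids]

    def generator(self, utt_ids):
        # Keys must match the model's parameter names, since the trainer
        # calls model(**batch).
        for utt_id in utt_ids:
            yield {"input_ids": self.char_ids[utt_id], "mel_gts": self.mels[utt_id]}

    def get_output_dtypes(self):
        return {"input_ids": np.int32, "mel_gts": np.float32}

    def get_len_dataset(self):
        return len(self.utt_ids)


def pad_batch(items):
    """Zero-pad every field along axis 0 so the items stack into one batch."""
    batch = {}
    for key in items[0]:
        arrays = [item[key] for item in items]
        max_len = max(a.shape[0] for a in arrays)
        padded = [
            np.pad(a, [(0, max_len - a.shape[0])] + [(0, 0)] * (a.ndim - 1))
            for a in arrays
        ]
        batch[key] = np.stack(padded)
    return batch


def toy_model(input_ids, mel_gts):
    # Hypothetical model: only reports the shapes it received.
    return input_ids.shape, mel_gts.shape


utt_ids = ["utt1", "utt2"]
dataset = ToyDataset(
    utt_ids,
    char_ids={"utt1": np.arange(5, dtype=np.int32), "utt2": np.arange(8, dtype=np.int32)},
    mels={"utt1": np.zeros((40, 80), np.float32), "utt2": np.zeros((65, 80), np.float32)},
)
batch = pad_batch(list(dataset.generator(*dataset.get_args())))
shapes = toy_model(**batch)
```

Because the dictionary keys match the parameter names of toy_model, the batch can be unpacked directly with **batch, which is the same mechanism base_trainer relies on.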

Abstract Trainer Class

A detailed implementation of base_trainer is in tensorflow_tts/trainer/base_trainer.py. It includes Seq2SeqBasedTrainer and GanBasedTrainer, which inherit from BasedTrainer. All trainers support both single and multiple GPUs. There are some functions you MUST override when implementing a new trainer:

  • compile: This function defines the models and losses.
  • generate_and_save_intermediate_result: This function saves intermediate results such as alignment plots, generated audio, and mel-spectrogram plots.
  • compute_per_example_losses: This function computes per_example_loss for the model; note that every element of the loss MUST have shape [batch_size].

All models in this repo are trained with GanBasedTrainer (see train_melgan.py, train_melgan_stft.py, train_multiband_melgan.py) or Seq2SeqBasedTrainer (see train_tacotron2.py, train_fastspeech.py).
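
The [batch_size] requirement on per-example losses can be sketched in NumPy. This is an illustration under assumptions, not the trainer's code: the function names are hypothetical, and the motivation shown here (each replica sums its per-example losses and divides by the global batch size, so the replica values add up to the batch mean) is one plausible reading of why the shape is required for multi-GPU training.

```python
import numpy as np


def per_example_mae(y_true, y_pred):
    """Mean absolute error reduced over every axis except the batch axis."""
    reduce_axes = tuple(range(1, y_true.ndim))
    return np.abs(y_true - y_pred).mean(axis=reduce_axes)  # shape [batch_size]


def average_over_global_batch(per_example_loss, global_batch_size):
    """Sum one replica's per-example losses and divide by the global batch size.

    Adding up this value across replicas gives the mean over the whole batch.
    """
    return per_example_loss.sum() / global_batch_size


# Toy batch of 4 examples split across 2 hypothetical replicas.
rng = np.random.default_rng(0)
mel_true = rng.normal(size=(4, 50, 80))
mel_pred = rng.normal(size=(4, 50, 80))

losses = per_example_mae(mel_true, mel_pred)
replica_losses = [
    average_over_global_batch(losses[:2], global_batch_size=4),
    average_over_global_batch(losses[2:], global_batch_size=4),
]
total_loss = sum(replica_losses)
```

If a loss were reduced to a scalar too early, the per-replica contributions could no longer be weighted by the global batch size in this way.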

End-to-End Examples

You can learn how to run inference with each model in the notebooks, or see the Colab (for English) or the Colab (for Korean). Here is example code for end-to-end inference with fastspeech and melgan.

import numpy as np
import soundfile as sf
import yaml

import tensorflow as tf

from tensorflow_tts.inference import AutoConfig
from tensorflow_tts.inference import TFAutoModel
from tensorflow_tts.inference import AutoProcessor

# initialize fastspeech model.
fs_config = AutoConfig.from_pretrained('./examples/fastspeech/conf/fastspeech.v1.yaml')
fastspeech = TFAutoModel.from_pretrained(
    config=fs_config,
    pretrained_path="./examples/fastspeech/pretrained/model-195000.h5"
)


# initialize melgan model
melgan_config = AutoConfig.from_pretrained('./examples/melgan/conf/melgan.v1.yaml')
melgan = TFAutoModel.from_pretrained(
    config=melgan_config,
    pretrained_path="./examples/melgan/checkpoint/generator-1500000.h5"
)


# convert the input text to a batch of character ids
processor = AutoProcessor.from_pretrained(pretrained_path="./test/files/ljspeech_mapper.json")

ids = processor.text_to_sequence("Recent research at Harvard has shown meditating for as little as 8 weeks, can actually increase the grey matter in the parts of the brain responsible for emotional regulation, and learning.")
ids = tf.expand_dims(ids, 0)

# fastspeech inference
masked_mel_before, masked_mel_after, duration_outputs = fastspeech.inference(
    ids,
    speaker_ids=tf.zeros(shape=[tf.shape(ids)[0]], dtype=tf.int32),
    speed_ratios=tf.constant([1.0], dtype=tf.float32)
)

# melgan inference
audio_before = melgan.inference(masked_mel_before)[0, :, 0]
audio_after = melgan.inference(masked_mel_after)[0, :, 0]

# save to file
sf.write('./audio_before.wav', audio_before, 22050, "PCM_16")
sf.write('./audio_after.wav', audio_after, 22050, "PCM_16")

Contact

Minh Nguyen Quan Anh: [email protected], erogol: [email protected], Kuan Chen: [email protected], Dawid Kobus: [email protected], Takuya Ebata: [email protected], Trinh Le Quang: [email protected], Yunchao He: [email protected], Alejandro Miguel Velasquez: [email protected]

License

Overall, almost all models here are licensed under Apache 2.0 for all countries in the world, except in Viet Nam, where this framework cannot be used for production in any way without permission from the TensorFlowTTS authors. There is one exception: Tacotron-2 can be used for any purpose. If you are Vietnamese and want to use this framework for production, you must contact us in advance.

Acknowledgement

We want to thank Tomoki Hayashi, who discussed MelGAN, Multi-band MelGAN, FastSpeech, and Tacotron with us at length. This framework is based on his great open-source ParallelWaveGAN project.

Comments
  • FastSpeech2 training with MFA and Phoneme-based

    When training FastSpeech2 (fastspeech2_v2) with phonetic alignments extracted from MFA I get the error described:

    /content/TensorflowTTS/tensorflow_tts/trainers/base_trainer.py in run(self)
         65         )
         66         while True:
    ---> 67             self._train_epoch()
         68 
         69             if self.finish_train:
    
    /content/TensorflowTTS/tensorflow_tts/trainers/base_trainer.py in _train_epoch(self)
         87         for train_steps_per_epoch, batch in enumerate(self.train_data_loader, 1):
         88             # one step training
    ---> 89             self._train_step(batch)
         90 
         91             # check interval
    
    <ipython-input-39-dd452e77975e> in _train_step(self, batch)
         75         """Train model one step."""
         76         charactor, duration, f0, energy, mel = batch
    ---> 77         self._one_step_fastspeech2(charactor, duration, f0, energy, mel)
         78 
         79         # update counts
    
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py in __call__(self, *args, **kwds)
        578         xla_context.Exit()
        579     else:
    --> 580       result = self._call(*args, **kwds)
        581 
        582     if tracing_count == self._get_tracing_count():
    
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py in _call(self, *args, **kwds)
        642         # Lifting succeeded, so variables are initialized and we can run the
        643         # stateless function.
    --> 644         return self._stateless_fn(*args, **kwds)
        645     else:
        646       canon_args, canon_kwds = \
    
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py in __call__(self, *args, **kwargs)
       2418     with self._lock:
       2419       graph_function, args, kwargs = self._maybe_define_function(args, kwargs)
    -> 2420     return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
       2421 
       2422   @property
    
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py in _filtered_call(self, args, kwargs)
       1663          if isinstance(t, (ops.Tensor,
       1664                            resource_variable_ops.BaseResourceVariable))),
    -> 1665         self.captured_inputs)
       1666 
       1667   def _call_flat(self, args, captured_inputs, cancellation_manager=None):
    
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py in _call_flat(self, args, captured_inputs, cancellation_manager)
       1744       # No tape is watching; skip to running the function.
       1745       return self._build_call_outputs(self._inference_function.call(
    -> 1746           ctx, args, cancellation_manager=cancellation_manager))
       1747     forward_backward = self._select_forward_and_backward_functions(
       1748         args,
    
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py in call(self, ctx, args, cancellation_manager)
        596               inputs=args,
        597               attrs=attrs,
    --> 598               ctx=ctx)
        599         else:
        600           outputs = execute.execute_with_cancellation(
    
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
         58     ctx.ensure_initialized()
         59     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
    ---> 60                                         inputs, attrs, num_outputs)
         61   except core._NotOkStatusException as e:
         62     if name is not None:
    
    InvalidArgumentError:  Incompatible shapes: [16,823,80] vs. [16,867,80]
    	 [[node mean_absolute_error/sub (defined at <ipython-input-39-dd452e77975e>:115) ]] [Op:__inference__one_step_fastspeech2_341496]
    
    Errors may have originated from an input operation.
    Input Source operations connected to node mean_absolute_error/sub:
     mel (defined at <ipython-input-39-dd452e77975e>:77)	
     tf_fast_speech2_2/mel_before/BiasAdd (defined at /content/TensorflowTTS/tensorflow_tts/models/fastspeech2.py:196)
    
    Function call stack:
    _one_step_fastspeech2
    

    I did everything I could think of to rule out my durations as the problem including verification that length is the same, so I don't know what happened. Interestingly enough, when training with mixed_precision off the same error happens but with different values:

    InvalidArgumentError:  Incompatible shapes: [16,763,80] vs. [16,806,80]
    	 [[node mean_absolute_error/sub (defined at <ipython-input-39-dd452e77975e>:115) ]] [Op:__inference__one_step_fastspeech2_449871]
    
    Errors may have originated from an input operation.
    Input Source operations connected to node mean_absolute_error/sub:
     tf_fast_speech2_3/mel_before/BiasAdd (defined at /content/TensorflowTTS/tensorflow_tts/models/fastspeech2.py:196)	
     mel (defined at <ipython-input-39-dd452e77975e>:77)
    
    Function call stack:
    _one_step_fastspeech2
    

    Am I missing something?

    enhancement 🚀 question ❓ Feature Request 🤗 FastSpeech Discussion 😁 
    opened by ZDisket 169
  • Fine-Tuning with a small dataset

    Hello!

    I'm trying to evaluate ways to achieve TTS for individuals that have lost their ability to speak, the idea is to allow them to regain speech via TTS but using the voice they had prior to losing their voice. This could happen from various causes such as cancer of the larynx, motor neurone disease, etc.

    These patients have recorded voice banks, a small dataset of phrases recorded prior to losing their ability to speak.

    Conceptually, I wanted to take a pre-trained model and fine-tune it with the individual's voice bank data.

    I'd love some guidance.

    There are a few constraints:

    1. The patient-specific data bank is not a large dataset, it's approximately 100 recorded phrases.
    2. Latency must be low, we hope for real-time TTS. Some approaches use a pre-trained model followed by vocoders, in our experience, this has been too slow, with latencies of about 5 seconds.
    3. The trained model must work on an Android app (I see there is already an Android example, which has been helpful)

    I'd love your guidance on the steps required to achieve this, and any recommendations on which choices would give good results...

    • Which model architectures will tolerate tuning with a small dataset?
    • The patients have British accents, whereas most pre-trained models have American accents. Will this be a problem?

    Do you have any tutorials or examples that show how to achieve a customised voice via fine-tuning?

    question ❓ 
    opened by OscarVanL 127
  • Tacotron2: Everything become nan at 53k steps

    Hi, I am not that experienced in TTS, so I faced many problems before getting the code running with my non-English dataset, which has about 10k sentences (~26h long). However, there are still some issues and questions.

    1. When training reaches 53.5k steps, the model seems to lose "everything". The train and eval losses and the model predictions become nan (but training continues without reporting an exception).
    tensorboard1

    So I stopped training and resumed from 50k; I will wait until 53.5k and see if it happens again. By the way, do my figures look fine? It looks like the model is overfitting; should I wait for a "surprise"?

    2. My language is somewhat under-resourced and there is no (at least I couldn't find one) phoneme dictionary to train a G2P and MFA model. However, unlike English, a character roughly represents a phone, except that some vowels sound longer or shorter depending on the meaning of the host word. So a character-based model seems fine to me. This Tacotron2 has been trained just for duration extraction.

      Which step seems best for duration extraction so far?

    3. How can I improve the quality of duration extraction? extract_duration.py extracts durations from model predictions, but they are supposed to be used with ground-truth mels. Although the sum of the Tacotron2-extracted durations is forced to match the length of the ground-truth mels by alignment = alignment[:real_char_length, :real_mel_length], this is just based on an assumption that predicted mels and their ground-truth counterparts are roughly one-to-one (from index 0).

      So, when the goal of training a Tacotron2 is only to extract good durations, is it a good idea to use the whole dataset for training and make a severely over-fitted model (maybe up to 200k steps or more in my case)?

    4. Any idea on MFA model training for a language with no phone dictionary available? Has anyone tried making a fake phone dictionary like this to force MFA to align characters instead of phonemes?

      ....
      hello h e l l o
      nice n i c e
      ....

    Thanks.

    question ❓ performance 🏍 Tacotron Discussion 😁 wontfix 
    opened by tekinek 57
  • Error Preprocessing KeyError: 'eos'

    I am getting this error when trying to preprocess:

    Traceback (most recent call last):
      File "/home/zak/venv/bin/tensorflow-tts-preprocess", line 8, in <module>
        sys.exit(preprocess())
      File "/home/zak/venv/lib/python3.8/site-packages/tensorflow_tts/bin/preprocess.py", line 442, in preprocess
        for result, mel, energy, f0, features in train_map:
      File "/usr/lib/python3.8/multiprocessing/pool.py", line 448, in <genexpr>
        return (item for chunk in result for item in chunk)
      File "/usr/lib/python3.8/multiprocessing/pool.py", line 865, in next
        raise value
    KeyError: 'eos'

    I didn't have this error before the latest updates; after I reinstalled TensorFlowTTS and tried to preprocess again, I got this. Any ideas?

    Thanks

    bug 🐛 
    opened by Zak-SA 52
  • 🇨🇳 Chinese TTS now available 😘

    Chinese TTS is now available, thanks to @azraelkuan for his support :D. The model uses the Baker dataset (https://www.data-baker.com/open_source.htmlt). The pretrained model is licensed under CC BY-NC-SA 4.0 (https://creativecommons.org/licenses/by-nc-sa/4.0/) since the dataset is non-commercial :D

    Please check out the Colab below and enjoy :D.

    https://colab.research.google.com/drive/1YpSHRBRPBI7cnTkQn1UcVTWEQVbsUm1S?usp=sharing

    Note: these are just initial results; more can be done to make the model better.

    cc: @candlewill @l4zyf9x @machineko

    enhancement 🚀 good first issue 🤔 Feature Request 🤗 wontfix 
    opened by dathudeptrai 46
  • RuntimeError when trying to inference From TFlite for Fastspeech2

    Hi, So I converted Fastspeech2 model to TFlite, when I tried to inference from TFlite I am getting this error

    decoder_output_tflite, mel_output_tflite = infer(input_text)
    interpreter.invoke()
      File "/home/zak/venv/lib/python3.8/site-packages/tensorflow/lite/python/interpreter.py", line 539, in invoke
        self._interpreter.Invoke()
    RuntimeError: tensorflow/lite/kernels/reshape.cc:55 stretch_dim != -1 (0 != -1)Node number 83 (RESHAPE) failed to prepare.

    the code I used for this purpose is

    import numpy as np
    import yaml
    import tensorflow as tf

    from tensorflow_tts.processor import ZAKSpeechProcessor
    from tensorflow_tts.processor.ZAKspeech import ZAKSPEECH_SYMBOLS

    from tensorflow_tts.configs import FastSpeechConfig, FastSpeech2Config
    from tensorflow_tts.configs import MultiBandMelGANGeneratorConfig

    from tensorflow_tts.models import TFFastSpeech, TFFastSpeech2
    from tensorflow_tts.models import TFMBMelGANGenerator

    from IPython.display import Audio

    # Load the TFLite model and allocate tensors.
    interpreter = tf.lite.Interpreter(model_path='fastspeech2_quant.tflite')

    # Get input and output tensors.
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # Prepare input data.
    def prepare_input(input_ids):
        input_ids = tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0)
        return (input_ids,
                tf.convert_to_tensor([0], tf.int32),
                tf.convert_to_tensor([1.0], dtype=tf.float32),
                tf.convert_to_tensor([1.0], dtype=tf.float32),
                tf.convert_to_tensor([1.0], dtype=tf.float32))

    # Test the model on random input data.
    def infer(input_text):
        for x in input_details:
            print(x)
        for x in output_details:
            print(x)
        processor = ZAKSpeechProcessor(data_dir=None, symbols=ZAKSPEECH_SYMBOLS, cleaner_names="arabic_cleaners")
        input_ids = processor.text_to_sequence(input_text.lower())
        interpreter.resize_tensor_input(input_details[0]['index'], [1, len(input_ids)])
        interpreter.resize_tensor_input(input_details[1]['index'], [1])
        interpreter.resize_tensor_input(input_details[2]['index'], [1])
        interpreter.resize_tensor_input(input_details[3]['index'], [1])
        interpreter.resize_tensor_input(input_details[4]['index'], [1])
        interpreter.allocate_tensors()
        input_data = prepare_input(input_ids)
        for i, detail in enumerate(input_details):
            input_shape = detail['shape']
            interpreter.set_tensor(detail['index'], input_data[i])

        interpreter.invoke()

        # The function `get_tensor()` returns a copy of the tensor data.
        # Use `tensor()` in order to get a pointer to the tensor.
        return (interpreter.get_tensor(output_details[0]['index']),
                interpreter.get_tensor(output_details[1]['index']))

    # initialize melgan model
    with open('../examples/multiband_melgan/conf/multiband_melgan.v1.yaml') as f:
        mb_melgan_config = yaml.load(f, Loader=yaml.Loader)
    mb_melgan_config = MultiBandMelGANGeneratorConfig(**mb_melgan_config["multiband_melgan_generator_params"])
    mb_melgan = TFMBMelGANGenerator(config=mb_melgan_config, name='mb_melgan_generator')
    mb_melgan._build()
    mb_melgan.load_weights("../examples/multiband_melgan/exp/train.multiband_melgan.v1/checkpoints/generator-1000000.h5")

    input_text = ""

    decoder_output_tflite, mel_output_tflite = infer(input_text)
    audio_before_tflite = mb_melgan(decoder_output_tflite)[0, :, 0]
    audio_after_tflite = mb_melgan(mel_output_tflite)[0, :, 0]

    appreciate your help

    bug 🐛 wontfix 
    opened by Zak-SA 44
  • fastspeech2 training error

    I have already created durations with MFA, and also ran the two preprocessing scripts (tensorflow-tts-preprocess, tensorflow-tts-normalize) with no errors. But when I ran the training script, the following error occurred:

    2020-08-12 02:19:06,034 (train_fastspeech2:289) INFO: batch_size = 16
    2020-08-12 02:19:06,034 (train_fastspeech2:289) INFO: remove_short_samples = True
    2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: allow_cache = True
    2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: mel_length_threshold = 32
    2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: is_shuffle = True
    2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: optimizer_params = {'initial_learning_rate': 0.001, 'end_learning_rate': 5e-05, 'decay_steps': 150000, 'warmup_proportion': 0.02, 'weight_decay': 0.001}
    2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: train_max_steps = 200000
    2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: save_interval_steps = 5000
    2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: eval_interval_steps = 500
    2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: log_interval_steps = 200
    2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: num_save_intermediate_results = 1
    2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: train_dir = ./dump/train/
    2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: dev_dir = ./dump/valid/
    2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: use_norm = True
    2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: f0_stat = ./dump/stats_f0.npy
    2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: energy_stat = ./dump/stats_energy.npy
    2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: outdir = ./examples/fastspeech2/exp/train.fastspeech2.v1/
    2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: config = ./examples/fastspeech2/conf/fastspeech2.v1.yaml
    2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: resume =
    2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: verbose = 1
    2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: mixed_precision = True
    2020-08-12 02:19:06,035 (train_fastspeech2:289) INFO: version = 0.6.1
    Traceback (most recent call last):
      File "examples/fastspeech2/train_fastspeech2.py", line 400, in <module>
        main()
      File "examples/fastspeech2/train_fastspeech2.py", line 316, in main
        mel_length_threshold=mel_length_threshold,
      File "/home/speechlab/TensorflowTTS/examples/fastspeech2/fastspeech2_dataset.py", line 104, in __init__
        ), f"Number of charactor, mel, duration, f0 and energy files are different"
    AssertionError: Number of charactor, mel, duration, f0 and energy files are different

    How do I solve this problem? Can anybody help me? Thanks a lot!

    bug 🐛 question ❓ 
    opened by mataym 44
  • Tacotron2 produces random mel outputs during inference (french dataset)

    Hi ! I have trained tacotron2 for 52k steps on the SynPaFlex french dataset. I deleted sentences longer than 20 seconds from the dataset and ended up with around 30 hours of single speaker data.

    I made a custom synpaflex.py processor in ./tensorflow_tts/processor/ with these symbols (adapted to french without arpabet) :

    _pad = "pad"
    _eos = "eos"
    _punctuation = "!/\'(),-.:;? "
    _letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzéèàùâêîôûçäëïöüÿœæ"
    
    # Export all symbols:
    SYNPAFLEX_SYMBOLS = (
        [_pad] + list(_punctuation) + list(_letters) + [_eos]
    )
    

    I used basic_cleaners for text cleaning.

    in #182 the issue was similar, but the problem came from using tacotron2.v1.yaml as configuration file. I am using my own tacotron2.synpaflex.v1.yaml for both training and inference.

    During synthesis, mel outputs are completely random : the output is different even if the sentence is kept the exact same. The audio signals sound like a french version of the WaveNet examples where no text has been provided during training, in the "Knowing What to Say" section of this page.

    Here are my tensorboard results : image

    I must be doing something wrong, as I have been able to train on LJSpeech successfully... Any idea?

    bug 🐛 
    opened by samuel-lunii 41
  • Long sentences issue with FS2

    It seems my FastSpeech2 implementation can't handle long sentences in some datasets such as KSS. For LJSpeech and other datasets, other people report that it's still fine. I'm thinking about the maximum length the training set needs so that my FS2 can handle long sentences. In my private dataset, it is always fine. Maybe 15s is enough.

    question ❓ Discussion 😁 wontfix 
    opened by dathudeptrai 41
  • Pretrained fastspeech2 libritts model for testing?

    Hi,

    Thanks for the nice work. Is there a pretrained FastSpeech2 LibriTTS model for testing, like the one trained on LJSpeech data? https://colab.research.google.com/drive/1akxtrLZHKuMiQup00tzO2olCaN-y3KiD?usp=sharing

    question ❓ wontfix 
    opened by ronggong 38
  • Add C++ inference example and code

    This is complete C++ code (from text processing to saving audio) for inference with TensorflowTTS FastSpeech2 (phonetic, MFA-aligned, from my fork) and Multi-Band MelGAN using the TensorFlow C API. It compiles and runs on 64-bit Windows out of the box (solution and project included), but the code is cross-platform provided one supplies the required libraries. The project builds a simple command-line program that takes sentences as input, synthesizes them, and saves the results as WAV files.

    The README links to the compiled binaries, libraries, and a sample model required to compile for Win64.

    It will allow deploying TensorflowTTS models in a portable way into desktop environments.

    enhancement 🚀 Feature Request 🤗 
    opened by ZDisket 38
  • Tacotron2 Pre-training have difficulties

    Hello, I am a student learning Tacotron2 with the KSS dataset.

    After running Tacotron2 KSS pre-training for 120k steps and checking the results in TensorBoard, I get the following values.

    The loss in the "val" section keeps getting higher.

    When I export the output as a wav file, the speech is unintelligible.

    I'd like to ask for your advice on this matter. Screenshot from 2022-12-20 16-09-30

    opened by Gyuub 0
  • Support Arabic Language

    Are you open to supporting the Arabic language? You could use Dr. Nawar Halabi's dataset: https://www.kaggle.com/datasets/bc297d8ca0753cd21cdcacd7bd324c0c607361a14471c801f09b028a1ecb098e

    opened by Muhammad-Abdelsattar 1
  • Fastspeech 2 Training error

    When training a FastSpeech model with python "examples\fastspeech2\train_fastspeech2.py" and valid arguments, the script runs, but then I get an error: AssertionError: Number of charactor, mel, duration, f0 and energy files are different

    I've looked at other issues, but none of them solved my problem. The dataset is on a different hard drive, but I don't get any "File not found" errors. Is there any way to fix this? I preprocessed and normalized with "ljspeech" as the config and dataset, and I'm training with "Fastspeech2.v1" as the config.

    opened by LxtteDev 6
  • not working with large text.

    I want to synthesize long text with FastSpeech, but as I understand it, I need to change the configs. I was not able to find the exact place to make the change. I tried changing some parameters in the config files, but it did not work for me. Which file and which parameter exactly do I need to change?

    mels, audios = do_synthesis(input_text, fastspeech, mb_melgan, "FASTSPEECH", "MB-MELGAN")

    but I got this error:

    InvalidArgumentError: indices[0,2048] = 2049 is not in [0, 2049)
      [[node decoder/position_embeddings/Gather (defined at /usr/local/lib/python3.7/dist-packages/tensorflow_tts/models/fastspeech.py:76) ]] [Op:__inference__inference_63215]

    Errors may have originated from an input operation.
    Input Source operations connected to node decoder/position_embeddings/Gather:
      In[0] decoder/position_embeddings/Gather/resource:
      In[1] mul_1 (defined at /usr/local/lib/python3.7/dist-packages/tensorflow_tts/models/fastspeech.py:872)
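
The error reports a position index of 2049 looked up in a table whose valid range is [0, 2049), which suggests the decoder sequence for this input is longer than the position embedding table allows. One possible workaround, not taken from this thread, is to split the text into shorter chunks and synthesize each chunk separately. The sketch below only does the splitting; `split_into_chunks` is a hypothetical helper and the `max_chars` value is an arbitrary assumption that would need tuning for a given model.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Group sentences into chunks of at most max_chars characters where possible.

    A single sentence longer than max_chars is kept whole rather than cut mid-sentence.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?;])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


chunks = split_into_chunks(
    "First sentence. Second one! A third, longer sentence?", max_chars=30
)
for chunk in chunks:
    print(chunk)
```

Each chunk could then be passed to a synthesis function such as the `do_synthesis` call shown in the issue, and the resulting audio arrays concatenated afterwards.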

    wontfix 
    opened by avraamya 1
Releases(v1.8)
  • v1.8(Aug 21, 2021)

  • v1.6.1(Jun 1, 2021)

  • v1.6(Jun 1, 2021)

    Release Notes

    • Support TFlite C++ inference. (https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/cpptflite)
    • Add an example for FastSpeech2 and MB-Melgan on iOS. (https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/ios)
    • Integrated with Huggingface Hub (PR #555 #564 #566). Our pretrained models are uploaded at https://huggingface.co/tensorspeech
    • Fix convergence problem with hifigan caused by large learning rate (#571)
  • v1.1(Jan 12, 2021)

  • v0.11(Nov 25, 2020)

  • v0.9(Oct 4, 2020)

    Release Notes

    • Supported both TensorFlow 2.2/2.
    • Faster Tacotron-2 training.
    • Stable training fastspeech/fastspeech2/tacotron2/mb-melgan.
    • Supported Eng/Chinese/Korean.
    • Supported ParallelWaveGAN.
    • Added C++ inference code.
  • v0.8(Aug 23, 2020)

  • v0.7(Jul 11, 2020)

    Release Notes

    • First release of TensorflowTTS.
    • Built against TensorFlow 2.2

    Changelog

    • Apply black formatter.
    • Use pytest as default test runner.

    TensorflowTTS Core

    tensorflow_tts.bin

    • Multi-preprocess to calculate mel-spectrogram, f0, energy
    • Add code to calculate mean/std of mel-spectrogram, f0, energy
    • Add code to normalize mel-spectrogram, f0, energy based on its mean/std value
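
Normalizing a feature by its mean and standard deviation, as the items above describe, can be sketched in a few lines. This is an illustration using only the Python standard library, not the code in tensorflow_tts.bin; the function names and the sample f0 values are made up for the example.

```python
from statistics import fmean, pstdev


def compute_stats(values):
    """Return (mean, population standard deviation) of a 1D sequence."""
    return fmean(values), pstdev(values)


def normalize(values, mean, std):
    """Shift and scale values to zero mean and unit variance."""
    if std == 0:
        raise ValueError("standard deviation is zero; cannot normalize")
    return [(v - mean) / std for v in values]


def denormalize(values, mean, std):
    """Invert normalize() using the same statistics."""
    return [v * std + mean for v in values]


f0 = [100.0, 120.0, 140.0]  # made-up f0 values in Hz
mean, std = compute_stats(f0)
normalized = normalize(f0, mean, std)
restored = denormalize(normalized, mean, std)
print(mean, normalized)
```

Keeping the statistics computed on the training set and reusing them at inference time is what lets the normalized features be mapped back to their original scale.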

    tensorflow_tts.config

    • Add configuration for FastSpeech
    • Add configuration for FastSpeech2
    • Add configuration for Tacotron-2
    • Add configuration for MelGAN
    • Add configuration for Multiband-MelGAN

    tensorflow_tts.datasets

    • Add abstract dataset class based on tf.data
    • Add dataloader for mel-spectrogram
    • Add dataloader for audio

    tensorflow_tts.losses

    • Add MultiScale STFT Loss
    • Add Mel-spectrogram Loss

    tensorflow_tts.models

    • Add FastSpeech modeling
    • Add FastSpeech2 modeling
    • Add Melgan modeling
    • Add Multiband-melgan modeling
    • Add Tacotron-2 modeling

    tensorflow_tts.optimizers

    • Add adam-weightdecay optimizers

    tensorflow_tts.processor

    • Add LJSpeech processor for character-based English.

    tensorflow_tts.trainers

    • Add base trainer including GanBasedTrainer and Seq2SeqTrainer

    tensorflow_tts.utils

    • Add seq2seq dynamic decoder
    • Add cleaner for english text
    • Add group convolution for melgan
    • Add batch Griffin-Lim version based on librosa and Tensorflow
    • Add number normalization
    • Add function to detect outlier from 1D array
    • Add weight-norm layer

    NoteBooks

    • Add notebook for GL inference
    • Add notebook for converting FastSpeech/FastSpeech2/Melgan/Mb-melgan/Tacotron-2 to pb and running inference
    • Add notebook for converting FastSpeech/FastSpeech2/Tacotron-2 to tflite and running inference

    Examples

    • Add example for training fastspeech
    • Add example for training fastspeech2
    • Add example for training tacotron-2
    • Add example for training melgan
    • Add example for training melgan.stft
    • Add example for training multiband melgan

    Thanks to our Contributors

    @erogol @azraelkuan @l4zyf9x @myagues @sujeendran @MokkeMeguru @jaeyoo @dathudeptrai
