Repository for the paper: VoiceMe: Personalized voice generation in TTS

Overview

🗣 VoiceMe: Personalized voice generation in TTS

arXiv

Abstract

Novel text-to-speech systems can generate entirely new voices that were not seen during training. However, it remains a difficult task to efficiently create personalized voices from a high dimensional speaker space. In this work, we use speaker embeddings from a state-of-the-art speaker verification model (SpeakerNet) trained on thousands of speakers to condition a TTS model. We employ a human sampling paradigm to explore this speaker latent space. We show that users can create voices that fit well to photos of faces, art portraits, and cartoons. We recruit online participants to collectively manipulate the voice of a speaking face. We show that (1) a separate group of human raters confirms that the created voices match the faces, (2) speaker gender apparent from the face is well-recovered in the voice, and (3) people are consistently moving towards the real voice prototype for the given face. Our results demonstrate that this technology can be applied in a wide number of applications including character voice development in audiobooks and games, personalized speech assistants, and individual voices for people with speech impairment.

Demos

  • 📢 Demo website
  • 🔇 Unmute to listen to the videos on GitHub:
Examples-for-art-works.mp4
Example-chain.mp4

Preprocessing

Set up the repository

git clone https://github.com/polvanrijn/VoiceMe.git
cd VoiceMe
main_dir=$PWD

preprocessing_env="$main_dir/preprocessing-env"
conda create --prefix $preprocessing_env python=3.7
conda activate $preprocessing_env
pip install Cython
pip install git+https://github.com/NVIDIA/[email protected]#egg=nemo_toolkit[all]
pip install requests

Create face styles

We used the same sentence ("Kids are talking by the door", neutral recording) from all 24 speakers of the RAVDESS corpus. You can download all videos by running download_RAVDESS.sh; the stills used in the paper are also included in the repository (stills). To create the AI Gahaku styles, run python ai_gahaku.py. To create the toonified versions, run python toonify.py (you need to add your API key).

Obtain the PCA space

The model used in the paper was trained on SpeakerNet embeddings, so we need to extract the embeddings from a dataset. Here we use the Common Voice data. To download it, run: python preprocess_commonvoice.py --language en

To extract the principal components, run compute_pca.py.
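
For readers unfamiliar with the step, here is a minimal pure-Python sketch of what extracting a principal component involves: center the embeddings, build the covariance matrix, and find its dominant eigenvector by power iteration. It is not the code in compute_pca.py, which may well use a library routine. The function names and the toy 2-D data are assumptions made for illustration.

```python
import math


def center(rows):
    """Subtract the per-dimension mean; return (centered_rows, mean)."""
    n = len(rows)
    dim = len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(dim)]
    centered = [[r[j] - mean[j] for j in range(dim)] for r in rows]
    return centered, mean


def covariance(centered):
    """Sample covariance matrix (dim x dim) of already-centered rows."""
    n = len(centered)
    dim = len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dim)]
        for i in range(dim)
    ]


def first_component(cov, iterations=200):
    """Approximate the dominant eigenvector of cov by power iteration.

    The uniform start vector is an arbitrary choice; the iteration cannot
    recover a component that is exactly orthogonal to it.
    """
    dim = len(cov)
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # all-zero covariance: no direction to find
        v = [x / norm for x in w]
    return v


# Toy data lying on the line y = 2x (stand-in for real embeddings).
toy_embeddings = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
centered_rows, mean_vector = center(toy_embeddings)
component = first_component(covariance(centered_rows))
```

For real speaker embeddings with hundreds of dimensions, a library implementation of PCA would be the practical choice; the sketch only illustrates the idea.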

Synthesis

Setup

We assume you will set up a remote instance for synthesis. Clone the repo and set up the virtual environment:

git clone https://github.com/polvanrijn/VoiceMe.git
cd VoiceMe
main_dir=$PWD

synthesis_env="$main_dir/synthesis-env"
conda create --prefix $synthesis_env python=3.7
conda activate $synthesis_env

##############
# Setup Wav2Lip
##############
git clone https://github.com/Rudrabha/Wav2Lip.git
cd Wav2Lip

# Install Requirements
pip install -r requirements.txt
pip install opencv-python-headless==4.1.2.30
wget "https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth" -O "face_detection/detection/sfd/s3fd.pth"  --no-check-certificate

# Install as package
mv ../setup_wav2lip.py setup.py
pip install -e .
cd ..


##############
# Setup VITS
##############
git clone https://github.com/jaywalnut310/vits
cd vits

# Install Requirements
pip install -r requirements.txt

# Install monotonic_align
mv monotonic_align ../monotonic_align

# Download the VCTK checkpoint
pip install gdown
gdown https://drive.google.com/uc?id=11aHOlhnxzjpdWDpsz1vFDCzbeEfoIxru

# Install as package
mv ../setup_vits.py setup.py
pip install -e .

cd ../monotonic_align
python setup.py build_ext --inplace
cd ..


pip install flask
pip install wget

The last step has to be done manually (let me know if you know an automatic way): download the checkpoint wav2lip_gan.pth from here and put it in Wav2Lip/checkpoints. Also make sure espeak is installed and on your PATH.
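
Because these two prerequisites are easy to miss, a small check before starting the server can help. The following is a sketch only: the function name missing_prerequisites is hypothetical and not part of the repository, and it assumes the checkpoint path is relative to the cloned VoiceMe directory.

```python
import os
import shutil


def missing_prerequisites(repo_dir):
    """Return a list of problems found; an empty list means both checks passed.

    repo_dir is assumed to be the cloned VoiceMe directory that contains
    the Wav2Lip checkout.
    """
    problems = []
    checkpoint = os.path.join(repo_dir, "Wav2Lip", "checkpoints", "wav2lip_gan.pth")
    if not os.path.isfile(checkpoint):
        problems.append("missing checkpoint: " + checkpoint)
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which("espeak") is None:
        problems.append("espeak not found on PATH")
    return problems
```

Calling missing_prerequisites(".") from the repository root and printing the result lists anything that still needs to be set up.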

Running

Start the remote service (I used port 31337):

python server.py --port 31337

You can send an example request from your local machine by running the following (don't forget to change the host and port accordingly):

python request_demo.py
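
If the request fails, it can help to first confirm that the host and port are reachable at all. The sketch below only opens a TCP connection and makes no assumption about the routes or payload that server.py expects. The function name is_reachable is hypothetical.

```python
import socket


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, is_reachable("127.0.0.1", 31337) should return True once the server from the previous step is listening on that port of the same machine.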

We also made a small 'playground' so you can see how slider values influence the voice. Start the local Flask app called client.py.
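
The README does not spell out how slider values become a speaker embedding. One plausible reading, given the PCA step above, is that each slider scales one principal component and the result is added to the mean embedding. The sketch below illustrates that assumption only; the function name and the toy numbers are hypothetical and do not come from client.py.

```python
def sliders_to_embedding(mean, components, sliders):
    """Reconstruct an embedding as mean + sum_i sliders[i] * components[i].

    mean:       list of floats, the average embedding
    components: list of principal components, each the same length as mean
    sliders:    one weight per component (assumed mapping, see lead-in)
    """
    if len(sliders) != len(components):
        raise ValueError("need exactly one slider value per component")
    embedding = list(mean)
    for weight, component in zip(sliders, components):
        for j, value in enumerate(component):
            embedding[j] += weight * value
    return embedding


# Toy example: a 3-dimensional space with two components.
toy_mean = [1.0, 2.0, 3.0]
toy_components = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
toy_embedding = sliders_to_embedding(toy_mean, toy_components, [0.5, -2.0])
```

With all sliders at zero the function returns the mean embedding, which makes the effect of each slider easy to hear in isolation.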

Experiment

The GSP experiment cannot be shared at this moment, as PsyNet is still under development.

Owner
Pol van Rijn
PhD student at Max Planck Institute for Empirical Aesthetics