Repository for the paper: VoiceMe: Personalized voice generation in TTS

Overview

🗣 VoiceMe: Personalized voice generation in TTS

arXiv

Abstract

Novel text-to-speech systems can generate entirely new voices that were not seen during training. However, it remains a difficult task to efficiently create personalized voices from a high dimensional speaker space. In this work, we use speaker embeddings from a state-of-the-art speaker verification model (SpeakerNet) trained on thousands of speakers to condition a TTS model. We employ a human sampling paradigm to explore this speaker latent space. We show that users can create voices that fit well to photos of faces, art portraits, and cartoons. We recruit online participants to collectively manipulate the voice of a speaking face. We show that (1) a separate group of human raters confirms that the created voices match the faces, (2) speaker gender apparent from the face is well-recovered in the voice, and (3) people are consistently moving towards the real voice prototype for the given face. Our results demonstrate that this technology can be applied in a wide number of applications including character voice development in audiobooks and games, personalized speech assistants, and individual voices for people with speech impairment.

Demos

  • 📢 Demo website
  • 🔇 Unmute to listen to the videos on Github:
Examples-for-art-works.mp4
Example-chain.mp4

Preprocessing

Setup the repository

git clone https://github.com/polvanrijn/VoiceMe.git
cd VoiceMe
main_dir=$PWD

preprocessing_env="$main_dir/preprocessing-env"
conda create --prefix $preprocessing_env python=3.7
conda activate $preprocessing_env
pip install Cython
pip install git+https://github.com/NVIDIA/[email protected]#egg=nemo_toolkit[all]
pip install requests

Create face styles

We used the same sentence ("Kids are talking by the door", neutral recording) from the RAVDESS corpus from all 24 speakers. You can download all videos by running download_RAVDESS.sh. However, the stills used in the paper are also part of the repository (stills). We can create the AI Gahaku styles by running python ai_gahaku.py and the toonified version by running python toonify.py (you need to add your API key).

Obtain the PCA space

The model used in the paper was trained on SpeakerNet embeddings, so we to extract the embeddings from a dataset. Here we use the commonvoice data. To download it, run: python preprocess_commonvoice.py --language en

To extract the principal components, run compute_pca.py.

Synthesis

Setup

We'll assume, you'll setup a remote instance for synthesis. Clone the repo and setup the virtual environment:

git clone https://github.com/polvanrijn/VoiceMe.git
cd VoiceMe
main_dir=$PWD

synthesis_env="$main_dir/synthesis-env"
conda create --prefix $synthesis_env python=3.7
conda activate $synthesis_env

##############
# Setup Wav2Lip
##############
git clone https://github.com/Rudrabha/Wav2Lip.git
cd Wav2Lip

# Install Requirements
pip install -r requirements.txt
pip install opencv-python-headless==4.1.2.30
wget "https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth" -O "face_detection/detection/sfd/s3fd.pth"  --no-check-certificate

# Install as package
mv ../setup_wav2lip.py setup.py
pip install -e .
cd ..


##############
# Setup VITS
##############
git clone https://github.com/jaywalnut310/vits
cd vits

# Install Requirements
pip install -r requirements.txt

# Install monotonic_align
mv monotonic_align ../monotonic_align

# Download the VCTK checkpoint
pip install gdown
gdown https://drive.google.com/uc?id=11aHOlhnxzjpdWDpsz1vFDCzbeEfoIxru

# Install as package
mv ../setup_vits.py setup.py
pip install -e .

cd ../monotonic_align
python setup.py build_ext --inplace
cd ..


pip install flask
pip install wget

You'll need to do the last step manually (let me know if you know an automatic way). Download the checkpoint wav2lip_gan.pth from here and put it in Wav2Lip/checkpoints. Make sure you have espeak installed and it is in PATH.

Running

Start the remote service (I used port 31337)

python server.py --port 31337

You can send an example request locally, by running (don't forget to change host and port accordingly):

python request_demo.py

We also made a small 'playground' so you can see how slider values will influence the voice. Start the local flask app called client.py.

Experiment

The GSP experiment cannot be shared at this moment, as PsyNet is still under development.

Owner
Pol van Rijn
PhD student at Max Planck Institute for Empirical Aesthetics
Pol van Rijn
Natural Language Processing Specialization

Natural Language Processing Specialization In this folder, Natural Language Processing Specialization projects and notes can be found. WHAT I LEARNED

Kaan BOKE 3 Oct 06, 2022
小布助手对话短文本语义匹配的一个baseline

oppo-text-match 小布助手对话短文本语义匹配的一个baseline 模型 参考:https://kexue.fm/archives/8213 base版本线下大概0.952,线上0.866(单模型,没做K-flod融合)。 训练 测试环境:tensorflow 1.15 + keras

苏剑林(Jianlin Su) 132 Dec 14, 2022
Materials (slides, code, assignments) for the NYU class I teach on NLP and ML Systems (Master of Engineering).

FREE_7773 Repo containing material for the NYU class (Master of Engineering) I teach on NLP, ML Sys etc. For context on what the class is trying to ac

Jacopo Tagliabue 90 Dec 19, 2022
Package for controllable summarization

summarizers summarizers is package for controllable summarization based CTRLsum. currently, we only supports English. It doesn't work in other languag

Hyunwoong Ko 72 Dec 07, 2022
To create a deep learning model which can explain the content of an image in the form of speech through caption generation with attention mechanism on Flickr8K dataset.

To create a deep learning model which can explain the content of an image in the form of speech through caption generation with attention mechanism on Flickr8K dataset.

Ragesh Hajela 0 Feb 08, 2022
Geometry-Consistent Neural Shape Representation with Implicit Displacement Fields

Geometry-Consistent Neural Shape Representation with Implicit Displacement Fields [project page][paper][cite] Geometry-Consistent Neural Shape Represe

Yifan Wang 100 Dec 19, 2022
A fast, efficient universal vector embedding utility package.

Magnitude: a fast, simple vector embedding utility library A feature-packed Python package and vector storage file format for utilizing vector embeddi

Plasticity 1.5k Jan 02, 2023
Spacy-ginza-ner-webapi - Named Entity Recognition API with spaCy and GiNZA

Named Entity Recognition API with spaCy and GiNZA I wrote a blog post about this

Yuki Okuda 3 Feb 27, 2022
EMNLP'2021: Can Language Models be Biomedical Knowledge Bases?

BioLAMA BioLAMA is biomedical factual knowledge triples for probing biomedical LMs. The triples are collected and pre-processed from three sources: CT

DMIS Laboratory - Korea University 41 Nov 18, 2022
Anomaly Detection 이상치 탐지 전처리 모듈

Anomaly Detection 시계열 데이터에 대한 이상치 탐지 1. Kernel Density Estimation을 활용한 이상치 탐지 train_data_path와 test_data_path에 존재하는 시점 정보를 포함하고 있는 csv 형태의 train data와

CLUST-consortium 43 Nov 28, 2022
FB ID CLONER WUTHOT CHECKPOINT, FACEBOOK ID CLONE FROM FILE

* MY SOCIAL MEDIA : Programming And Memes Want to contact Mr. Error ? CONTACT : [ema

Mr. Error 9 Jun 17, 2021
T‘rex Park is a Youzan sponsored project. Offering Chinese NLP and image models pretrained from E-commerce datasets

T‘rex Park is a Youzan sponsored project. Offering Chinese NLP and image models pretrained from E-commerce datasets (product titles, images, comments, etc.).

55 Nov 22, 2022
Unifying Cross-Lingual Semantic Role Labeling with Heterogeneous Linguistic Resources (NAACL-2021).

Unifying Cross-Lingual Semantic Role Labeling with Heterogeneous Linguistic Resources Description This is the repository for the paper Unifying Cross-

Sapienza NLP group 16 Sep 09, 2022
Document processing using transformers

Doc Transformers Document processing using transformers. This is still in developmental phase, currently supports only extraction of form data i.e (ke

Vishnu Nandakumar 13 Dec 21, 2022
Malware-Related Sentence Classification

Malware-Related Sentence Classification This repo contains the code for the ICTAI 2021 paper "Enrichment of Features for Malware-Related Sentence Clas

Chau Nguyen 1 Mar 26, 2022
EdiTTS: Score-based Editing for Controllable Text-to-Speech

Official implementation of EdiTTS: Score-based Editing for Controllable Text-to-Speech

Neosapience 99 Jan 02, 2023
Create a semantic search engine with a neural network (i.e. BERT) whose knowledge base can be updated

Create a semantic search engine with a neural network (i.e. BERT) whose knowledge base can be updated. This engine can later be used for downstream tasks in NLP such as Q&A, summarization, generation

Diego 1 Mar 20, 2022
German Text-To-Speech Engine using Tacotron and Griffin-Lim

jotts JoTTS is a German text-to-speech engine using tacotron and griffin-lim. The synthesizer model has been trained on my voice using Tacotron1. Due

padmalcom 6 Aug 28, 2022
Deal or No Deal? End-to-End Learning for Negotiation Dialogues

Introduction This is a PyTorch implementation of the following research papers: (1) Hierarchical Text Generation and Planning for Strategic Dialogue (

Facebook Research 1.4k Dec 29, 2022
DELTA is a deep learning based natural language and speech processing platform.

DELTA - A DEep learning Language Technology plAtform What is DELTA? DELTA is a deep learning based end-to-end natural language and speech processing p

DELTA 1.5k Dec 26, 2022