A self-supervised learning framework for audio-visual speech

Overview

AV-HuBERT (Audio-Visual Hidden Unit BERT)

Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction

Robust Self-Supervised Audio-Visual Speech Recognition

lip-reading

Introduction

AV-HuBERT is a self-supervised representation learning framework for audio-visual speech. It achieves state-of-the-art results in lip reading, ASR and audio-visual speech recognition on the LRS3 audio-visual speech benchmark.

If you find AV-HuBERT useful in your research, please use the following BibTeX entry for citation.

@inproceedings{shi2022avhubert,
    author  = {Bowen Shi and Wei-Ning Hsu and Kushal Lakhotia and Abdelrahman Mohamed},
    title = {Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction},
    year = {2022}
}

@article{shi2022avsr,
    author  = {Bowen Shi and Wei-Ning Hsu and Abdelrahman Mohamed},
    title = {Robust Self-Supervised Audio-Visual Speech Recognition},
    journal = {arXiv preprint arXiv:2201.01763}
    year = {2022}
}

License

AV-HuBERT LICENSE AGREEMENT

This License Agreement (as may be amended in accordance with this License Agreement, “License”), between you (“Licensee” or “you”) and Meta Platforms, Inc. (“Meta” or “we”) applies to your use of any computer program, algorithm, source code, object code, or software that is made available by Meta under this License (“Software”) and any specifications, manuals, documentation, and other written information provided by Meta related to the Software (“Documentation”).

By using the Software, you agree to the terms of this License. If you do not agree to this License, then you do not have any rights to use the Software or Documentation (collectively, the “Software Products”), and you must immediately cease using the Software Products.

Pre-trained and fine-tuned models

Please find the checkpoints here

Installation

First, create a conda virtual environment and activate it:

conda create -n avhubert python=3.8 -y
conda activate avhubert

Then, clone this directory:

git clone https://github.com/facebookresearch/av_hubert.git
cd avhubert
git submodule init
git submodule update

Lastly, install Fairseq and the other packages:

pip install -r requirements.txt
cd fairseq
pip install --editable ./

Load a pretrained model

$ cd avhubert
$ python
>>> import fairseq
>>> import hubert_pretraining, hubert
>>> ckpt_path = "/path/to/the/checkpoint.pt"
>>> models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
>>> model = models[0]

Train a new model

Data preparation

Follow the steps in preparation to pre-process:

  • LRS3 and VoxCeleb2 datasets

Follow the steps in clustering (pre-train only) to create:

  • {train,valid}.km frame-aligned pseudo label files. The label_rate is the same as the feature frame rate used for clustering, which is 100Hz for MFCC features and 25Hz for AV-HuBERT features by default.

Pre-train an AV-HuBERT model

Suppose {train,valid}.tsv are saved at /path/to/data, {train,valid}.km are saved at /path/to/labels, the configuration file is saved at /path/to/conf/conf-name, and the label rate is 100Hz.

To train a model, run:

$ cd avhubert
$ fairseq-hydra-train --config-dir /path/to/conf/ --config-name conf-name \
  task.data=/path/to/data task.label_dir=/path/to/label \
  model.label_rate=100 hydra.run.dir=/path/to/experiment/pretrain/ \
  common.user_dir=`pwd`

Finetune an AV-HuBERT model with Seq2Seq

Suppose {train,valid}.tsv are saved at /path/to/data, {train,valid}.wrd are saved at /path/to/labels, the configuration file is saved at /path/to/conf/conf-name.

To fine-tune a pre-trained HuBERT model at /path/to/checkpoint, run:

$ cd avhubert
$ fairseq-hydra-train --config-dir /path/to/conf/ --config-name conf-name \
  task.data=/path/to/data task.label_dir=/path/to/label \
  task.tokenizer_bpe_model=/path/to/tokenizer model.w2v_path=/path/to/checkpoint \
  hydra.run.dir=/path/to/experiment/finetune/ common.user_dir=`pwd`

Decode an AV-HuBERT model

Suppose the test.tsv and test.wrd are the video list and transcripts of the split to be decoded, saved at /path/to/data, and the fine-tuned model is saved at /path/to/checkpoint.

Seq2Seq decoding

task.normalize needs to be consistent with the value used during fine-tuning. Decoding results will be saved at /path/to/experiment/decode/s2s/test.

$ cd avhubert
$ python -B infer_s2s.py --config-dir ./conf/ --config-name conf-name \
  dataset.gen_subset=test common_eval.path=/path/to/checkpoint \
  common_eval.results_path=/path/to/experiment/decode/s2s/test \
  override.modalities=['video'] common.user_dir=`pwd`

The command above uses the default decoding hyperparameter, which can be found in conf/s2s_decode.yaml. override.modalities can be set to ['video'] (for lip reading), or ['audio'] (for ASR) or ['audio','video'] (for audio-visual speech recognition).These parameters can be configured from the command line. For example, to search with a beam size of 20, we can append the command above with generation.beam=20. Important parameters include:

  • generation.beam
  • generation.lenpen

If you want to test your model under noisy environment, append the following to the above command.

+override.noise_wav=/path/to/noise override.noise_prob=1 override.noise_snr={snr}

{snr} is the signal-to-noise ratio (SNR) and /path/to/noise is a folder containing noise manifest files (/path/to/noise/{valid,test}.tsv). See preparation for setting up this folder.

Owner
Meta Research
Meta Research
A robust pointcloud registration pipeline based on correlation.

PHASER: A Robust and Correspondence-Free Global Pointcloud Registration Ubuntu 18.04+ROS Melodic: Overview Pointcloud registration using correspondenc

ETHZ ASL 101 Dec 01, 2022
The code for Expectation-Maximization Attention Networks for Semantic Segmentation (ICCV'2019 Oral)

EMANet News The bug in loading the pretrained model is now fixed. I have updated the .pth. To use it, download it again. EMANet-101 gets 80.99 on the

Xia Li 李夏 663 Nov 30, 2022
Code for "Continuous-Time Meta-Learning with Forward Mode Differentiation" (ICLR 2022)

Continuous-Time Meta-Learning with Forward Mode Differentiation ICLR 2022 (Spotlight) - Installation - Example - Citation This repository contains the

Tristan Deleu 25 Oct 20, 2022
CSAW-M: An Ordinal Classification Dataset for Benchmarking Mammographic Masking of Cancer

CSAW-M This repository contains code for CSAW-M: An Ordinal Classification Dataset for Benchmarking Mammographic Masking of Cancer. Source code for tr

Yue Liu 7 Oct 11, 2022
Koç University deep learning framework.

Knet Knet (pronounced "kay-net") is the Koç University deep learning framework implemented in Julia by Deniz Yuret and collaborators. It supports GPU

1.4k Dec 31, 2022
Preparation material for Dropbox interviews

Dropbox-Onsite-Interviews A guide for the Dropbox onsite interview! The Dropbox interview question bank is very small. The bank has been in a Chinese

386 Dec 31, 2022
Chainer Implementation of Fully Convolutional Networks. (Training code to reproduce the original result is available.)

fcn - Fully Convolutional Networks Chainer implementation of Fully Convolutional Networks. Installation pip install fcn Inference Inference is done as

Kentaro Wada 218 Oct 27, 2022
Detectorch - detectron for PyTorch

Detectorch - detectron for PyTorch (Disclaimer: this is work in progress and does not feature all the functionalities of detectron. Currently only inf

Ignacio Rocco 558 Dec 23, 2022
This repository contains datasets and baselines for benchmarking Chinese text recognition.

Benchmarking-Chinese-Text-Recognition This repository contains datasets and baselines for benchmarking Chinese text recognition. Please see the corres

FudanVI Lab 254 Dec 30, 2022
PyTorch implementation of our paper How robust are discriminatively trained zero-shot learning models?

How robust are discriminatively trained zero-shot learning models? This repository contains the PyTorch implementation of our paper How robust are dis

Mehmet Kerim Yucel 5 Feb 04, 2022
DeepFaceLab fork which provides IPython Notebook to use DFL with Google Colab

DFL-Colab — DeepFaceLab fork for Google Colab This project provides you IPython Notebook to use DeepFaceLab with Google Colaboratory. You can create y

779 Jan 05, 2023
Probabilistic Tensor Decomposition of Neural Population Spiking Activity

Probabilistic Tensor Decomposition of Neural Population Spiking Activity Matlab (recommended) and Python (in developement) implementations of Soulat e

Hugo Soulat 6 Nov 30, 2022
Diverse Image Generation via Self-Conditioned GANs

Diverse Image Generation via Self-Conditioned GANs Project | Paper Diverse Image Generation via Self-Conditioned GANs Steven Liu, Tongzhou Wang, David

Steven Liu 147 Dec 03, 2022
PyTorch code for the NAACL 2021 paper "Improving Generation and Evaluation of Visual Stories via Semantic Consistency"

Improving Generation and Evaluation of Visual Stories via Semantic Consistency PyTorch code for the NAACL 2021 paper "Improving Generation and Evaluat

Adyasha Maharana 28 Dec 08, 2022
Python wrapper class for OpenVINO Model Server. User can submit inference request to OVMS with just a few lines of code

Python wrapper class for OpenVINO Model Server. User can submit inference request to OVMS with just a few lines of code.

Yasunori Shimura 7 Jul 27, 2022
Neural network for digit classification powered by cuda

cuda_nn_mnist Neural network library for digit classification powered by cuda Resources The library was built to work with MNIST dataset. python-mnist

Nikita Ardashev 1 Dec 20, 2021
Pytorch implemenation of Stochastic Multi-Label Image-to-image Translation (SMIT)

SMIT: Stochastic Multi-Label Image-to-image Translation This repository provides a PyTorch implementation of SMIT. SMIT can stochastically translate a

Biomedical Computer Vision Group @ Uniandes 37 Mar 01, 2022
SmartSim Infrastructure Library.

Home Install Documentation Slack Invite Cray Labs SmartSim SmartSim makes it easier to use common Machine Learning (ML) libraries like PyTorch and Ten

Cray Labs 139 Jan 01, 2023
PyArmadillo: an alternative approach to linear algebra in Python

PyArmadillo is a linear algebra library for the Python language, with an emphasis on ease of use.

Terry Zhuo 58 Oct 11, 2022