SHAS: Approaching optimal Segmentation for End-to-End Speech Translation

Overview

SHAS: Approaching optimal Segmentation for End-to-End Speech Translation

In this repo you can find the code of the Supervised Hybrid Audio Segmentation (SHAS) method for End-to-End Speech Translation, proposed in Tsiamas et al. (2022). You can use our method with pre-trained models to segment a collection of audio files or train and fine-tune our method on your own segmented data. We provide instructions to replicate our results from the paper on MuST-C en-de and mTEDx es-en, fr-en, it-en, pt-en. You can also find easy-to-use implementations of other segmentation methods, like fixed-length, VAD, and the hybrid methods of Potapczyk and Przybysz (2020), Gállego et al. (2021), and Gaido et al. (2021).

Follow the instructions here to segment a collection of audio files, or the instruction here to replicate the results of the paper.

Abstract

Speech translation models are unable to directly process long audios, like TED talks, which have to be split into shorter segments. Speech translation datasets provide manual segmentations of the audios, which are not available in real-world scenarios, and existing segmentation methods usually significantly reduce translation quality at inference time. To bridge the gap between the manual segmentation of training and the automatic one at inference, we propose Supervised Hybrid Audio Segmentation (SHAS), a method that can effectively learn the optimal segmentation from any manually segmented speech corpus. First, we train a classifier to identify the included frames in a segmentation, using speech representations from a pre-trained wav2vec 2.0. The optimal splitting points are then found by a probabilistic Divide-and-Conquer algorithm that progressively splits at the frame of lowest probability until all segments are below a pre-specified length. Experiments on MuST-C and mTEDx show that the translation of the segments produced by our method approaches the quality of the manual segmentation on 5 languages pairs. Namely, SHAS retains 95-98% of the manual segmentation's BLEU score, compared to the 87-93% of the best existing methods. Our method is additionally generalizable to different domains and achieves high zero-shot performance in unseen languages.

Results

Citation

If you find SHAS or the contents of this repo useful for your research, please consider citing:

@misc{tsiamas2022shas,
      title={SHAS: Approaching optimal Segmentation for End-to-End Speech Translation}, 
      author={Ioannis Tsiamas and Gerard I. Gállego and José A. R. Fonollosa and Marta R. Costa-jussà},
      year={2022},
      eprint={2202.04774},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}

Usage

Clone this repository to $SHAS_ROOT:

git clone https://github.com/mt-upc/SHAS.git ${SHAS_ROOT}    

Create a conda environment using the environment.yml file and activate it:

conda env create -f ${SHAS_ROOT}/environment.yml && \
conda activate shas

Segmentation with SHAS

Download one of the available pre-trained segmentation frame classifiers required for the SHAS method:

English Spanish French Italian Portuguese Multilingual

Make sure that the audio files you want to segment are in .wav format, mono, and sampled at 16kHz. You can convert them with:

path_to_wavs=...                       # path to the audio files that will be segmented
ls ${path_to_wavs}/*.* | parallel -j 4 ffmpeg -i {} -ac 1 -ar 16000 -hide_banner -loglevel error {.}.wav

Segment a collection of audio files with the SHAS method. This includes inference with the classifier and application of a probabilistic Divide-and-Conquer (pDAC) algorithm:

python ${SHAS_ROOT}/src/supervised_hybrid/segment.py \
  -wavs $path_to_wavs \                       # path to the audio files that will be segmented
  -ckpt $path_to_checkpoint \                 # path to the checkpoint of a trained segmentation frame classifier
  -yaml $path_to_custom_segmentation_yaml \   # where to save the custom segmentation yaml file
  -max $max_segment_length                    # the core parameter of pDAC (in seconds, empirically values between 14-18 work well)

Segmentation with other methods

Length-based (fixed-length) segmentation:

python ${SHAS_ROOT}/src/segmentation_methods/length_based.py \
  -wavs $path_to_wavs \
  -yaml $path_to_custom_segmentation_yaml \
  -n $segment_length    # (in seconds)

Pause-based segmentation with webrtc VAD:

python ${SHAS_ROOT}/src/segmentation_methods/pause_based.py \
  -wavs $path_to_wavs \
  -yaml $path_to_custom_segmentation_yaml \
  -l $frame_length \        # 10, 20 or 30
  -a $aggressiveness_mode   # 1, 2 or 3

Hybrid segmentation with either wav2vec 2.0 or VAD as pause predictor, and either the DAC or Streaming algorithms:

python ${SHAS_ROOT}/src/segmentation_methods/hybrid.py \
  -wavs $path_to_wavs \
  -yaml $path_to_custom_segmentation_yaml \
  -pause $pause_predictor \         # wav2vec or vad
  -alg $algorithm \                 # dac or strm
  -max $max_segment_length \        # (in seconds)
  -min $min_segment_length          # (in seconds) only active for the strm alg

More extensive usage

Follow these steps to replicate the results of the paper. Download the MuST-C and mTEDx data, prepare them for the Segmentation Frame Classifier training, train the classifier, generate a segmentation of a test set, translate the segments with Joint Speech-to-Text models from fairseq, do hypothesis-reference alignment, and compute BLEU scores.

Setting up the environment

Set the environment variables:

export SHAS_ROOT=...                # the path to this repo
export MUSTC_ROOT=...               # the path to save MuST-C v2.0
export MTEDX_ROOT=...               # the path to save mTEDx
export SEGM_DATASETS_ROOT=...       # the path to save the outputs of data_prep/prepare_dataset_for_segmentation
export ST_MODELS_PATH=...           # the path to the pre-trained joint-s2t models from fairseq
export RESULTS_ROOT=...             # the path to the results
export FAIRSEQ_ROOT=...             # the path to our fairseq fork
export MWERSEGMENTER_ROOT=...       # the path to the mwerSegmenter tool

Clone this repository to $SHAS_ROOT:

git clone https://github.com/mt-upc/SHAS.git ${SHAS_ROOT}    

If you want to evaluate a custom segmentation, the translated segments have to be aligned with the reference translations of the manual segmentation. We are using the mwerSegmenter for the alignment. Create a secondary python2 environment for using mwerSegmenter:

conda create -n p2-shas python=2.7

Download mwerSegmenter at ${MWERSEGMENTER_ROOT} and follow the instructions in ${MWERSEGMENTER_ROOT}/README to install it:

mkdir -p $MWERSEGMENTER_ROOT
wget https://www-i6.informatik.rwth-aachen.de/web/Software/mwerSegmenter.tar.gz
tar -zxvf mwerSegmenter.tar.gz -C ${MWERSEGMENTER_ROOT}
rm -r mwerSegmenter.tar.gz

Create a conda environment using the environment.yml file and activate it:

conda env create -f ${SHAS_ROOT}/environment.yml && \
conda activate shas

We are using fairseq for Speech Translation. Install our fork of fairseq:

git clone -b audio-segment-2022 https://github.com/mt-upc/fairseq-internal.git ${FAIRSEQ_ROOT}
pip install --editable ${FAIRSEQ_ROOT}

Note: You can also use the latest public fairseq version, but BLEU scores will have minor differences with the ones reported in the paper.

Data

Download MuST-C v2 en-de to $MUSTC_ROOT:
The dataset is available here. Press the bottom ”click here to download the corpus”, and select version V2.

Download the mTEDx x-en and ASR data to $MTEDX_ROOT:

mkdir -p ${MTEDX_ROOT}
mkdir -p ${MTEDX_ROOT}/log_dir
for lang_pair in {es-en,fr-en,pt-en,it-en,es,fr,pt,it}; do
  wget https://www.openslr.org/resources/100/mtedx_${lang_pair}.tgz -o ${MTEDX_ROOT}/log_dir/${lang_pair} -c -b -O - | tar -xz -C ${MTEDX_ROOT}
done

Convert to mono and downsample at 16kHz:

ls ${MTEDX_ROOT}/*/data/{train,valid,test}/wav/*.flac | parallel -j 12 ffmpeg -i {} -ac 1 -ar 16000 -hide_banner -loglevel error {.}.wav

Prepare the datasets for segmentation

We create two tsv files (talks, segments) for each triplet of dataset-lang_pair-split. These will be used during training to create training examples by random segmentation and during evaluation to create fixed segmentation for inference.

# MuST-C en-de
mkdir -p ${SEGM_DATASETS_ROOT}/MUSTC/en-de
for split in {train,dev,tst-COMMON}; do
  python ${SHAS_ROOT}/src/data_prep/prepare_dataset_for_segmentation.py \
    -y ${MUSTC_ROOT}/en-de/data/${split}/txt/${split}.yaml \
    -w ${MUSTC_ROOT}/en-de/data/${split}/wav \
    -o ${SEGM_DATASETS_ROOT}/MUSTC/en-de
done
# mTEDx
for lang_pair in {es-en,fr-en,pt-en,it-en,es-es,fr-fr,pt-pt,it-it}; do
  mkdir -p ${SEGM_DATASETS_ROOT}/mTEDx/${lang_pair}
  for split in {train,valid,test}; do
    python ${SHAS_ROOT}/src/data_prep/prepare_dataset_for_segmentation.py \
      -y ${MTEDX_ROOT}/${lang_pair}/data/${split}/txt/${split}.yaml \
      -w ${MTEDX_ROOT}/${lang_pair}/data/${split}/wav \
      -o ${SEGM_DATASETS_ROOT}/mTEDx/${lang_pair}
  done
done

Download pre-trained Speech Translation models

For translating the custom segmentations we are using the Joint Speech-to-Text models from fairseq. Download the bilingual model trained on MuST-C en-de and the multlingual model trained on mTEDx:

# joint-s2t-mustc-en-de
en_de_model_path=${ST_MODELS_PATH}/joint-s2t-mustc-en-de
mkdir -p $en_de_model_path
for file in {checkpoint_ave_10.pt,config.yaml,src_dict.txt,dict.txt,spm.model}; docheck
  wget https://dl.fbaipublicfiles.com/joint_speech_text_4_s2t/must_c/en_de/${file} -O $en_de_model_path/${file}
done
# joint-s2t-multilingual
mult_model_path=${ST_MODELS_PATH}/joint-s2t-multilingual
mkdir -p $mult_model_path
for file in {checkpoint17.pt,config.yaml,tgt_dict.txt,dict.txt,spm.model}; do
  wget https://dl.fbaipublicfiles.com/joint_speech_text_4_s2t/iwslt/iwslt_data/${file} -O $mult_model_path/${file}
done

To generate translation with the ST models, we have to modify the path of the spm.model in the task configs and remove some hardcoded paths from the cfg arguments of the checkpoints.

sed -i "s+/path/spm.model+${en_de_model_path}/spm.model+" ${en_de_model_path}/config.yaml
python ${SHAS_ROOT}/src/data_prep/fix_joint_s2t_cfg.py -c ${en_de_model_path}/checkpoint_ave_10.pt
sed -i "s+/path/spm.model+${mult_model_path}/spm.model+" ${mult_model_path}/config.yaml
python ${SHAS_ROOT}/src/data_prep/fix_joint_s2t_cfg.py -c ${mult_model_path}/checkpoint17.pt

Train a Segmentation Frame Classifier (SFC) model

For a monolingual model (for example on English speech):

experiment_name=en_sfc_model
python ${SHAS_ROOT}/src/supervised_hybrid/train.py \
    --datasets ${SEGM_DATASETS_ROOT}/MUSTC/en-de \
    --results_path ${RESULTS_ROOT}/supervised_hybrid \
    --model_name facebook/wav2vec2-xls-r-300m \
    --experiment_name $experiment_name \
    --train_sets tst-COMMON \
    --eval_sets dev \
    --batch_size 14 \
    --learning_rate 2.5e-4 \
    --update_freq 20 \
    --max_epochs 8 \
    --classifier_n_transformer_layers 1 \
    --wav2vec_keep_layers 15

For a multilingual model trained on (English, Spanish, French, Italian, Portuguese) speech:

experiment_name=mult_sfc_model
python ${SHAS_ROOT}/src/supervised_hybrid/train.py \
    --datasets ${SEGM_DATASETS_ROOT}/MUSTC/en-de,${SEGM_DATASETS_ROOT}/mTEDx/es-es,${SEGM_DATASETS_ROOT}/mTEDx/fr-fr,${SEGM_DATASETS_ROOT}/mTEDx/it-it,${SEGM_DATASETS_ROOT}/mTEDx/pt-pt \
    --results_path ${RESULTS_ROOT}/supervised_hybrid \
    --model_name facebook/wav2vec2-xls-r-300m \
    --experiment_name $experiment_name \
    --train_sets train,train,train,train,train \
    --eval_sets dev,valid,valid,valid,valid \
    --batch_size 14 \
    --learning_rate 2.5e-4 \
    --update_freq 20 \
    --max_epochs 8 \
    --classifier_n_transformer_layers 2 \
    --wav2vec_keep_layers 15

(The above commands assume 1 active GPU, adjust accordingly the update_freq if you are using more)

Create a segmentation the SHAS method

Segment a collection of audio files, by doing inference with a trained Segmentation Frame Classifier and applying a probabilistic Divide-and-Conquer (pDAC) algorithm:

python ${SHAS_ROOT}/src/supervised_hybrid/segment.py \
  -wavs $path_to_wavs \                       # path to the audio files that will be segmented
  -ckpt $path_to_checkpoint \                 # path to the checkpoint of a trained segmentation frame classifier
  -yaml $path_to_custom_segmentation_yaml \   # where to save the custom segmentation yaml file
  -max $max_segment_length                    # the core parameter of pDAC (in seconds, empirically values between 14-18 work well)

Translate the segments and evaluate the translations

The eval_custom_segmentation.sh performs the following tasks:

  • (1): translates the segments using an ST model;
  • (2): does hypothesis-reference alignment with mwerSegmenter;
  • (3): computes scores with sacreBLEU;
bash ${SHAS_ROOT}/src/eval_scripts/eval_custom_segmentation.sh \
  $path_to_wavs \                               # path to the audio files that will be segmented
  $path_to_custom_segmentation_yaml \           # path to the custom segmentation yaml from segment.py
  $path_to_original_segmentation_yaml \         # path to the original segmentation yaml
  $path_to_original_segment_transcriptions \    # path to the text file of the original segment transcriptions
  $path_to_original_segment_translations \      # path to the text file of the original segment translations
  $src_lang \                                   # the source language id (for example: en)
  $tgt_lang \                                   # the target language id (for example: de)
  $path_to_st_model_ckpt                        # path to the checkpoint of the joint-s2t model (use the joint-s2t-mustc-en-de for en source and joint-s2t-multilingual for the rest)
Owner
Machine Translation @ UPC
Hi, we are the UPC Machine Translation Group! 👋
Machine Translation @ UPC
华为商城抢购手机的Python脚本 Python script of Huawei Store snapping up mobile phones

HUAWEI STORE GO 2021 说明 基于Python3+Selenium的华为商城抢购爬虫脚本,修改自近两年没更新的项目BUY-HW,为女神抢Nova 8(什么时候华为开始学小米玩饥饿营销了?) 原项目的登陆以及抢购部分已经不可用,本项目对原项目进行了改正以适应新华为商城,并增加一些功能

ZhangLiang 111 Dec 22, 2022
Python code for ICLR 2022 spotlight paper EViT: Expediting Vision Transformers via Token Reorganizations

Expediting Vision Transformers via Token Reorganizations This repository contain

Youwei Liang 101 Dec 26, 2022
中文无监督SimCSE Pytorch实现

A PyTorch implementation of unsupervised SimCSE SimCSE: Simple Contrastive Learning of Sentence Embeddings 1. 用法 无监督训练 python train_unsup.py ./data/ne

99 Dec 23, 2022
Poetry PEP 517 Build Backend & Core Utilities

Poetry Core A PEP 517 build backend implementation developed for Poetry. This project is intended to be a light weight, fully compliant, self-containe

Poetry 293 Jan 02, 2023
In this project, we aim to achieve the task of predicting emojis from tweets. We aim to investigate the relationship between words and emojis.

Making Emojis More Predictable by Karan Abrol, Karanjot Singh and Pritish Wadhwa, Natural Language Processing (CSE546) under the guidance of Dr. Shad

Karanjot Singh 2 Jan 17, 2022
NLP, before and after spaCy

textacy: NLP, before and after spaCy textacy is a Python library for performing a variety of natural language processing (NLP) tasks, built on the hig

Chartbeat Labs Projects 2k Jan 04, 2023
Natural Language Processing with transformers

we want to create a repo to illustrate usage of transformers in chinese

Datawhale 763 Dec 27, 2022
HF's ML for Audio study group

Hugging Face Machine Learning for Audio Study Group Welcome to the ML for Audio Study Group. Through a series of presentations, paper reading and disc

Vaibhav Srivastav 110 Jan 01, 2023
C.J. Hutto 3.8k Dec 30, 2022
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language mod

13.2k Jul 07, 2021
An IVR Chatbot which can exponentially reduce the burden of companies as well as can improve the consumer/end user experience.

IVR-Chatbot Achievements 🏆 Team Uhtred won the Maverick 2.0 Bot-a-thon 2021 organized by AbInbev India. ❓ Problem Statement As we all know that, lot

ARYAMAAN PANDEY 9 Dec 08, 2022
Sentiment Analysis Project using Count Vectorizer and TF-IDF Vectorizer

Sentiment Analysis Project This project contains two sentiment analysis programs for Hotel Reviews using a Hotel Reviews dataset from Datafiniti. The

Simran Farrukh 0 Mar 28, 2022
Contact Extraction with Question Answering.

contactsQA Extraction of contact entities from address blocks and imprints with Extractive Question Answering. Goal Input: Dr. Max Mustermann Hauptstr

Jan 2 Apr 20, 2022
📝An easy-to-use package to restore punctuation of the text.

✏️ rpunct - Restore Punctuation This repo contains code for Punctuation restoration. This package is intended for direct use as a punctuation restorat

Daulet Nurmanbetov 72 Dec 30, 2022
NLP and Text Generation Experiments in TensorFlow 2.x / 1.x

Code has been run on Google Colab, thanks Google for providing computational resources Contents Natural Language Processing(自然语言处理) Text Classificati

1.5k Nov 14, 2022
Speech Recognition Database Management with python

Speech Recognition Database Management The main aim of this project is to recogn

Abhishek Kumar Jha 2 Feb 02, 2022
天池中药说明书实体识别挑战冠军方案;中文命名实体识别;NER; BERT-CRF & BERT-SPAN & BERT-MRC;Pytorch

天池中药说明书实体识别挑战冠军方案;中文命名实体识别;NER; BERT-CRF & BERT-SPAN & BERT-MRC;Pytorch

zxx飞翔的鱼 751 Dec 30, 2022
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context This repository contains the code in both PyTorch and TensorFlow for our paper

Zhilin Yang 3.3k Dec 28, 2022
Transformation spoken text to written text

Transformation spoken text to written text This model is used for formatting raw asr text output from spoken text to written text (Eg. date, number, i

Nguyen Binh 16 Dec 28, 2022
Official implementation of Meta-StyleSpeech and StyleSpeech

Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation Dongchan Min, Dong Bok Lee, Eunho Yang, and Sung Ju Hwang This is an official code

min95 169 Jan 05, 2023