Contrastive Learning for Many-to-many Multilingual Neural Machine Translation(mCOLT/mRASP2), ACL2021

Related tags

Deep LearningmRASP2
Overview

Contrastive Learning for Many-to-many Multilingual Neural Machine Translation(mCOLT/mRASP2), ACL2021

The code for training mCOLT/mRASP2, a multilingual NMT training framework, implemented based on fairseq.

mRASP2: paper

mRASP: paper, code


News

We have released two versions, this version is the original one. In this implementation:

  • You should first merge all data, by pre-pending language token before each sentence to indicate the language.
  • AA/RAS muse be done off-line (before binarize), check this toolkit.

New implementation: https://github.com/PANXiao1994/mRASP2/tree/new_impl

  • Acknowledgement: This work is supported by Bytedance. We thank Chengqi for uploading all files and checkpoints.

Introduction

mRASP2/mCOLT, representing multilingual Contrastive Learning for Transformer, is a multilingual neural machine translation model that supports complete many-to-many multilingual machine translation. It employs both parallel corpora and multilingual corpora in a unified training framework. For detailed information please refer to the paper.

img.png

Pre-requisite

pip install -r requirements.txt

Training Data and Checkpoints

We release our preprocessed training data and checkpoints in the following.

Dataset

We merge 32 English-centric language pairs, resulting in 64 directed translation pairs in total. The original 32 language pairs corpus contains about 197M pairs of sentences. We get about 262M pairs of sentences after applying RAS, since we keep both the original sentences and the substituted sentences. We release both the original dataset and dataset after applying RAS.

Dataset #Pair
32-lang-pairs-TRAIN 197603294
32-lang-pairs-RAS-TRAIN 262662792
mono-split-a -
mono-split-b -
mono-split-c -
mono-split-d -
mono-split-e -
mono-split-de-fr-en -
mono-split-nl-pl-pt -
32-lang-pairs-DEV-en-centric -
32-lang-pairs-DEV-many-to-many -
Vocab -
BPE Code -

Checkpoints & Results

  • Please note that the provided checkpoint is sightly different from that in the paper. In the following sections, we report the results of the provided checkpoints.

English-centric Directions

We report tokenized BLEU in the following table. (check eval.sh for details)

6e6d-no-mono 12e12d-no-mono 12e12d
en2cs/wmt16 21.0 22.3 23.8
cs2en/wmt16 29.6 32.4 33.2
en2fr/wmt14 42.0 43.3 43.4
fr2en/wmt14 37.8 39.3 39.5
en2de/wmt14 27.4 29.2 29.5
de2en/wmt14 32.2 34.9 35.2
en2zh/wmt17 33.0 34.9 34.1
zh2en/wmt17 22.4 24.0 24.4
en2ro/wmt16 26.6 28.1 28.7
ro2en/wmt16 36.8 39.0 39.1
en2tr/wmt16 18.6 20.3 21.2
tr2en/wmt16 22.2 25.5 26.1
en2ru/wmt19 17.4 18.5 19.2
ru2en/wmt19 22.0 23.2 23.6
en2fi/wmt17 20.2 22.1 22.9
fi2en/wmt17 26.1 29.5 29.7
en2es/wmt13 32.8 34.1 34.6
es2en/wmt13 32.8 34.6 34.7
en2it/wmt09 28.9 30.0 30.8
it2en/wmt09 31.4 32.7 32.8

Unsupervised Directions

We report tokenized BLEU in the following table. (check eval.sh for details)

12e12d
en2pl/wmt20 6.2
pl2en/wmt20 13.5
en2nl/iwslt14 8.8
nl2en/iwslt14 27.1
en2pt/opus100 18.9
pt2en/opus100 29.2

Zero-shot Directions

  • row: source language
  • column: target language We report sacreBLEU in the following table.
12e12d ar zh nl fr de ru
ar - 32.5 3.2 22.8 11.2 16.7
zh 6.5 - 1.9 32.9 7.6 23.7
nl 1.7 8.2 - 7.5 10.2 2.9
fr 6.2 42.3 7.5 - 18.9 24.4
de 4.9 21.6 9.2 24.7 - 14.4
ru 7.1 40.6 4.5 29.9 13.5 -

Training

export NUM_GPU=4 && bash train_w_mono.sh ${model_config}
  • We give example of ${model_config} in ${PROJECT_REPO}/examples/configs/parallel_mono_12e12d_contrastive.yml

Inference

  • You must pre-pend the corresponding language token to the source side before binarize the test data.
${final_res_file} python3 ${repo_dir}/scripts/utils.py ${res_file} ${ref_file} || exit 1; ">
fairseq-generate ${test_path} \
    --user-dir ${repo_dir}/mcolt \
    -s ${src} \
    -t ${tgt} \
    --skip-invalid-size-inputs-valid-test \
    --path ${ckpts} \
    --max-tokens ${batch_size} \
    --task translation_w_langtok \
    ${options} \
    --lang-prefix-tok "LANG_TOK_"`echo "${tgt} " | tr '[a-z]' '[A-Z]'` \
    --max-source-positions ${max_source_positions} \
    --max-target-positions ${max_target_positions} \
    --nbest 1 | grep -E '[S|H|P|T]-[0-9]+' > ${final_res_file}
python3 ${repo_dir}/scripts/utils.py ${res_file} ${ref_file} || exit 1;

Synonym dictionaries

We use the bilingual synonym dictionaries provised by MUSE.

We generate multilingual synonym dictionaries using this script, and apply RAS using this script.

Description File Size
dep=1 synonym_dict_raw_dep1 138.0 M
dep=2 synonym_dict_raw_dep2 1.6 G
dep=3 synonym_dict_raw_dep3 2.2 G

Contact

Please contact me via e-mail [email protected] or via wechat/zhihu

Citation

Please cite as:

@inproceedings{mrasp2,
  title = {Contrastive Learning for Many-to-many Multilingual Neural Machine Translation},
  author= {Xiao Pan and
           Mingxuan Wang and
           Liwei Wu and
           Lei Li},
  booktitle = {Proceedings of ACL 2021},
  year = {2021},
}
Official TensorFlow code for the forthcoming paper

~ Efficient-CapsNet ~ Are you tired of over inflated and overused convolutional neural networks? You're right! It's time for CAPSULES :)

Vittorio Mazzia 203 Jan 08, 2023
DFM: A Performance Baseline for Deep Feature Matching

DFM: A Performance Baseline for Deep Feature Matching Python (Pytorch) and Matlab (MatConvNet) implementations of our paper DFM: A Performance Baselin

143 Jan 02, 2023
Code accompanying the paper "How Tight Can PAC-Bayes be in the Small Data Regime?"

How Tight Can PAC-Bayes be in the Small Data Regime? This is the code to reproduce all experiments for the following paper: @inproceedings{Foong:2021:

5 Dec 21, 2021
Official implementation of "Articulation Aware Canonical Surface Mapping"

Articulation-Aware Canonical Surface Mapping Nilesh Kulkarni, Abhinav Gupta, David F. Fouhey, Shubham Tulsiani Paper Project Page Requirements Python

Nilesh Kulkarni 56 Dec 16, 2022
The official PyTorch code implementation of "Personalized Trajectory Prediction via Distribution Discrimination" in ICCV 2021.

Personalized Trajectory Prediction via Distribution Discrimination (DisDis) The official PyTorch code implementation of "Personalized Trajectory Predi

25 Dec 20, 2022
A high-level Python library for Quantum Natural Language Processing

lambeq About lambeq is a toolkit for quantum natural language processing (QNLP). Documentation: https://cqcl.github.io/lambeq/ Getting started Prerequ

Cambridge Quantum 315 Jan 01, 2023
code release for USENIX'22 paper `On the Security Risks of AutoML`

This project is a minimized runnable project cut from trojanzoo, which contains more datasets, models, attacks and defenses. This repo will not be mai

Ren Pang 5 Apr 19, 2022
State-of-the-art language models can match human performance on many tasks

Status: Archive (code is provided as-is, no updates expected) Grade School Math [Blog Post] [Paper] State-of-the-art language models can match human p

OpenAI 259 Jan 08, 2023
Imaging, analysis, and simulation software for radio interferometry

ehtim (eht-imaging) Python modules for simulating and manipulating VLBI data and producing images with regularized maximum likelihood methods. This ve

Andrew Chael 5.2k Dec 28, 2022
Reference code for the paper "Cross-Camera Convolutional Color Constancy" (ICCV 2021)

Cross-Camera Convolutional Color Constancy, ICCV 2021 (Oral) Mahmoud Afifi1,2, Jonathan T. Barron2, Chloe LeGendre2, Yun-Ta Tsai2, and Francois Bleibe

Mahmoud Afifi 76 Jan 07, 2023
Converts given image (png, jpg, etc) to amogus gif.

Image to Amogus Converter Converts given image (.png, .jpg, etc) to an amogus gif! Usage Place image in the /target/ folder (or anywhere realistically

Hank Magan 1 Nov 24, 2021
NeROIC: Neural Object Capture and Rendering from Online Image Collections

NeROIC: Neural Object Capture and Rendering from Online Image Collections This repository is for the source code for the paper NeROIC: Neural Object C

Snap Research 647 Dec 27, 2022
NeurIPS-2021: Neural Auto-Curricula in Two-Player Zero-Sum Games.

NAC Official PyTorch implementation of NAC from the paper: Neural Auto-Curricula in Two-Player Zero-Sum Games. We release code for: Gradient based ora

Xidong Feng 19 Nov 11, 2022
The official PyTorch implementation of paper BBN: Bilateral-Branch Network with Cumulative Learning for Long-Tailed Visual Recognition

BBN: Bilateral-Branch Network with Cumulative Learning for Long-Tailed Visual Recognition Boyan Zhou, Quan Cui, Xiu-Shen Wei*, Zhao-Min Chen This repo

Megvii-Nanjing 616 Dec 21, 2022
Effective Use of Transformer Networks for Entity Tracking

Effective Use of Transformer Networks for Entity Tracking (EMNLP19) This is a PyTorch implementation of our EMNLP paper on the effectiveness of pre-tr

5 Nov 06, 2021
Code for the paper SphereRPN: Learning Spheres for High-Quality Region Proposals on 3D Point Clouds Object Detection, ICIP 2021.

SphereRPN Code for the paper SphereRPN: Learning Spheres for High-Quality Region Proposals on 3D Point Clouds Object Detection, ICIP 2021. Authors: Th

Thang Vu 15 Dec 02, 2022
PyTorch implementation for our NeurIPS 2021 Spotlight paper "Long Short-Term Transformer for Online Action Detection".

Long Short-Term Transformer for Online Action Detection Introduction This is a PyTorch implementation for our NeurIPS 2021 Spotlight paper "Long Short

77 Dec 16, 2022
Using BERT+Bi-LSTM+CRF

Chinese Medical Entity Recognition Based on BERT+Bi-LSTM+CRF Step 1 I share the dataset on my google drive, please download the whole 'CCKS_2019_Task1

Xiang WU 55 Dec 21, 2022
PyTorch implementation of SimCLR: A Simple Framework for Contrastive Learning of Visual Representations

PyTorch implementation of SimCLR: A Simple Framework for Contrastive Learning of Visual Representations

Thalles Silva 1.7k Dec 28, 2022
Steer OpenAI's Jukebox with Music Taggers

TagBox Steer OpenAI's Jukebox with Music Taggers! The closest thing we have to VQGAN+CLIP for music! Unsupervised Source Separation By Steering Pretra

Ethan Manilow 34 Nov 02, 2022