Evaluating Cross-lingual Sentence Representations

Last update: Dec 19, 2022

Related tags

Deep Learning XNLI

Overview

XNLI: The Cross-Lingual NLI Corpus

XNLI is an evaluation corpus for language transfer and cross-lingual sentence classification in 15 languages.

New: XLM and Multilingual BERT use XNLI to evaluate the quality of the cross-lingual representations.

Introduction

Many NLP systems (e.g. sentiment analysis, topic classification, feed ranking) rely on training data in one high-resource language, but cannot be directly used to make predictions for other languages at test time. This problem happens in almost any industrial application that involves cross-lingual data.

Machine translation can be used to translate any sample into the high-resource language to alleviate this issue. However, having an MT system in every direction is costly and may not the best solution for cross-lingual classification. Cross-lingual encoders are a cheaper and more elegant alternative.

To evaluate such cross-lingual sentence understanding methods, we built XNLI, an extension of the SNLI/MultiNLI corpus in 15 languages. XNLI raises the following research question: how can we make predictions in any language at test time, when we only have English training data for that task?

While industrial applications may not include NLI in their routine tasks, we believe that NLI is a good testbed for evaluating cross-lingual sentence representations, and that better approaches for XNLI will result in better XLU methods.

The XNLI corpus

The Cross-lingual Natural Language Inference (XNLI) corpus is a crowd-sourced collection of 5,000 test and 2,500 dev pairs for the MultiNLI corpus. The pairs are annotated with textual entailment and translated into 14 languages: French, Spanish, German, Greek, Bulgarian, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, Hindi, Swahili and Urdu. This results in 112.5k annotated pairs. Each premise can be associated with the corresponding hypothesis in the 15 languages, summing up to more than 1.5M combinations.

Download

XNLI is distributed in a single ZIP file containing the corpus as both JSON lines (jsonl) and tab-separated text (txt). The English training data can be found on the MultiNLI website.

Download: XNLI 1.0 (17MB, ZIP)

XNLI can also be used as a 15way parallel corpus of 10,000 sentences, for building or evaluating Machine Translation systems. XNLI provides additional open parallel data for low-resource languages such as Swahili or Urdu.

Download XNLI-15way (12MB, ZIP)

Useful XLU resources

Parallel data for aligning sentence encoders: OPUS: the open parallel corpus

Monolingual word embeddings in 157 languages: fastText CommonCrawl vectors

Aligning word embeddings: MUSE

Data description paper and citation

A description of the data can be found here or in the corpus package zip. If you use the corpus in an academic paper, please consider citing:

@InProceedings{conneau2018xnli,
  author = "Conneau, Alexis
        and Rinott, Ruty
        and Lample, Guillaume
        and Williams, Adina
        and Bowman, Samuel R.
        and Schwenk, Holger
        and Stoyanov, Veselin",
  title = "XNLI: Evaluating Cross-lingual Sentence Representations",
  booktitle = "Proceedings of the 2018 Conference on Empirical Methods
               in Natural Language Processing",
  year = "2018",
  publisher = "Association for Computational Linguistics",
  location = "Brussels, Belgium",
}

Baselines and MT data

The XNLI paper presents several baselines for language adaptation.

We also release the machine translated data for reproducing the TRANSLATE-TRAIN and TRANSLATE-TEST:

Download: XNLI-MT 1.0 (445MB, ZIP)

License

XNLI is licensed under the license found in the LICENSE file. See details in the XNLI paper.

Evaluating Cross-lingual Sentence Representations

Related tags

Overview

XNLI: The Cross-Lingual NLI Corpus

Introduction

The XNLI corpus

Download

Useful XLU resources

Data description paper and citation

Baselines and MT data

License

Owner

Meta Research

DiscoNet: Learning Distilled Collaboration Graph for Multi-Agent Perception [NeurIPS 2021]

Training RNNs as Fast as CNNs

Official PyTorch Implementation of paper "Deep 3D Mask Volume for View Synthesis of Dynamic Scenes", ICCV 2021.

An Extendible (General) Continual Learning Framework based on Pytorch - official codebase of Dark Experience for General Continual Learning

A python-image-classification web application project, written in Python and served through the Flask Microframework. This Project implements the VGG16 covolutional neural network, through Keras and Tensorflow wrappers, to make predictions on uploaded images.

Multi-Anchor Active Domain Adaptation for Semantic Segmentation (ICCV 2021 Oral)

A CNN model to detect hand gestures.

Reproduce ResNet-v2(Identity Mappings in Deep Residual Networks) with MXNet

Adjust Decision Boundary for Class Imbalanced Learning

Audio Domain Adaptation for Acoustic Scene Classification using Disentanglement Learning

The code for our paper "NSP-BERT: A Prompt-based Zero-Shot Learner Through an Original Pre-training Task —— Next Sentence Prediction"

Pytorch Implementation of Residual Vision Transformers(ResViT)

LSTM and QRNN Language Model Toolkit for PyTorch

An Unsupervised Detection Framework for Chinese Jargons in the Darknet

PyTorch implementation of Neural View Synthesis and Matching for Semi-Supervised Few-Shot Learning of 3D Pose

Complete the code of prefix-tuning in low data setting

Fluency ENhanced Sentence-bert Evaluation (FENSE), metric for audio caption evaluation. And Benchmark dataset AudioCaps-Eval, Clotho-Eval.

Code for the paper "There is no Double-Descent in Random Forests"

Pynomial - a lightweight python library for implementing the many confidence intervals for the risk parameter of a binomial model

My implementation of transformers related papers for computer vision in pytorch