mMARCO
A multilingual version of MS MARCO passage ranking dataset
This repository presents a neural machine translation-based method for translating the MS MARCO passage ranking dataset. The code available here is the same code used in our paper mMARCO: A Multilingual Version of MS MARCO Passage Ranking Dataset.
Translated Datasets
As described in our work, we made available 8 translated versions of the MS MARCO passage ranking dataset. The translated passage collections and the query sets (training and validation) are available at:
Released Model Checkpoints
Our available fine-tuned models are:
Model | Description | MRR@10* |
---|---|---|
ptT5-base-pt-msmarco | a PTT5 model fine-tuned on Portuguese MS MARCO | 0.188 |
ptT5-base-en-pt-msmarco | a PTT5 model fine-tuned on English and Portuguese MS MARCO | 0.343 |
mT5-base-en-pt-msmarco | an mT5 model fine-tuned on both English and Portuguese MS MARCO | 0.375 |
mT5-base-multi-msmarco | an mT5 model fine-tuned on mMARCO | 0.366 |
mMiniLM-pt-msmarco | an mMiniLM model fine-tuned on Portuguese MS MARCO | - |
mMiniLM-en-pt-msmarco | an mMiniLM model fine-tuned on both English and Portuguese MS MARCO | 0.375 |
mMiniLM-multi-msmarco | an mMiniLM model fine-tuned on mMARCO | 0.363 |
* MRR@10 on English MS MARCO
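For readers unfamiliar with the metric in the table above: MRR@10 averages, over all queries, the reciprocal rank of the first relevant passage found among the top 10 results, and counts zero when no relevant passage appears there. The sketch below is only an illustration of that definition. The function name `mrr_at_k`, the dictionary-based input layout, and the toy data are assumptions made for this example, not the evaluation script used in the paper.

```python
from typing import Dict, List, Set


def mrr_at_k(
    rankings: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Mean reciprocal rank of the first relevant passage within the top k."""
    if not relevant:
        return 0.0
    total = 0.0
    for query_id, relevant_ids in relevant.items():
        # Queries with no ranking at all contribute a reciprocal rank of zero.
        ranked = rankings.get(query_id, [])[:k]
        for rank, passage_id in enumerate(ranked, start=1):
            if passage_id in relevant_ids:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(relevant)


# Toy data, made up for illustration only.
rankings = {"q1": ["p3", "p1", "p9"], "q2": ["p4", "p5"]}
relevant = {"q1": {"p1"}, "q2": {"p7"}}
score = mrr_at_k(rankings, relevant)
```

Here `q1` finds its relevant passage at rank 2 and `q2` finds none, so the two reciprocal ranks being averaged are 1/2 and 0.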
Dataset
We translate the MS MARCO passage ranking dataset, a large-scale IR dataset comprising more than half a million anonymized questions sampled from Bing's search query logs.
Translation Model
To translate the MS MARCO dataset, we use MarianNMT, an open-source neural machine translation framework originally written in C++ for fast training and translation. The Language Technology Research Group at the University of Helsinki has made available models for more than a thousand language pairs, which are supported by the HuggingFace framework.
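As a rough illustration of how one of these checkpoints can be loaded through HuggingFace, the sketch below uses the `MarianMTModel` and `MarianTokenizer` classes from the `transformers` library. It is not the repository's translate.py. The helper names `opus_mt_name` and `translate_sentences` are hypothetical, and the code assumes that `transformers`, `torch`, and `sentencepiece` are installed and that a checkpoint exists for the requested language pair.

```python
from typing import List


def opus_mt_name(src: str, tgt: str) -> str:
    """Build a checkpoint name following the Helsinki-NLP/opus-mt-{src}-{tgt} pattern."""
    return f"Helsinki-NLP/opus-mt-{src}-{tgt}"


def translate_sentences(sentences: List[str], src: str, tgt: str) -> List[str]:
    """Translate a small batch of sentences with a MarianMT checkpoint."""
    # Imported lazily so the name helper above works without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    name = opus_mt_name(src, tgt)
    tokenizer = MarianTokenizer.from_pretrained(name)
    model = MarianMTModel.from_pretrained(name)
    # Pad to the longest sentence and truncate anything over the model limit.
    batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

A call such as `translate_sentences(["What is a passage?"], "en", "de")` would download the checkpoint on first use. Not every source and target combination is published, so the name produced by `opus_mt_name` should be checked against the model hub first.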
How To Translate
To allow other users to translate the MS MARCO passage ranking dataset into other languages (or to translate a dataset of their own), we provide the translate.py script. This script expects a .tsv file in which each line follows the format document_id \t document_text.
python translate.py --model_name_or_path Helsinki-NLP/opus-mt-{src}-{tgt} --target_language tgt_code --input_file collection.tsv --output_dir translated_data/
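If you are preparing your own dataset, a small helper can write and check the document_id \t document_text layout before running the script. The following is a minimal sketch. The function names `write_collection` and `read_collection` are hypothetical, and collapsing tabs and newlines inside the text into single spaces is an assumption about what keeps each document on one line, not a documented requirement of translate.py.

```python
import io
from typing import Iterable, Iterator, TextIO, Tuple


def write_collection(docs: Iterable[Tuple[str, str]], handle: TextIO) -> None:
    """Write one 'document_id<TAB>document_text' line per document."""
    for doc_id, text in docs:
        # Tabs and newlines inside the text would break the one-line-per-document layout.
        clean = " ".join(text.split())
        handle.write(f"{doc_id}\t{clean}\n")


def read_collection(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (document_id, document_text) pairs, failing on malformed lines."""
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line:
            continue
        parts = line.split("\t")
        if len(parts) != 2:
            raise ValueError(
                f"line {line_number}: expected 2 tab-separated fields, got {len(parts)}"
            )
        yield parts[0], parts[1]


# Round trip through an in-memory file; the documents are made up.
buffer = io.StringIO()
write_collection([("0", "First passage.\twith a tab"), ("1", "Second\npassage.")], buffer)
buffer.seek(0)
rows = list(read_collection(buffer))
```

Replacing the `io.StringIO` buffer with `open("collection.tsv", "w", encoding="utf-8")` would produce a file on disk in the same layout.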
After translating, it is necessary to reassemble the file, as the documents were split into sentences.
python create_translated_collection.py --input_file translated_data/translated_file --output_file translated_{tgt}_collection
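Conceptually, reassembly groups the translated sentences by document and joins them in their original order. The sketch below illustrates that idea only. It assumes the intermediate data can be read as ordered (document_id, sentence) pairs, which may not match the actual file layout handled by create_translated_collection.py, and the function name `reassemble` is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def reassemble(sentence_rows: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Join translated sentences back into one passage per document_id."""
    grouped: Dict[str, List[str]] = {}
    for doc_id, sentence in sentence_rows:
        # dict preserves insertion order, so documents keep their first-seen order.
        grouped.setdefault(doc_id, []).append(sentence.strip())
    return [(doc_id, " ".join(parts)) for doc_id, parts in grouped.items()]


# Made-up translated sentences for two documents.
sentence_rows = [
    ("7", "Primeira frase."),
    ("7", "Segunda frase."),
    ("8", "Outra passagem."),
]
passages = reassemble(sentence_rows)
```

Joining with a single space is a simplification. It is a reasonable default for the languages that separate sentences with whitespace, but other target languages may need a different separator.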
Translating the entire passages collection of MS MARCO took about 80 hours using a Tesla V100.
How to Cite
If you extend or use this work, please cite the paper where it was introduced:
@misc{bonifacio2021mmarco,
title={mMARCO: A Multilingual Version of MS MARCO Passage Ranking Dataset},
author={Luiz Henrique Bonifacio and Israel Campiotti and Roberto Lotufo and Rodrigo Nogueira},
year={2021},
eprint={2108.13897},
archivePrefix={arXiv},
primaryClass={cs.CL}
}