Yet Another Neural Machine Translation Toolkit

YANMTT

YANMTT is short for Yet Another Neural Machine Translation Toolkit. For the backstory of how I ended up creating this toolkit, scroll to the bottom of this README. Although the name says that it is yet another toolkit, it was written to better understand the flow of training, from data pre-processing, sharding and batching to distributed training and decoding. There is a significant emphasis on multilingualism and cross-lingual learning.

List of features:

  1. Basic NMT pre-training, fine-tuning, decoding
    Distributed training (tested on up to 48 GPUs; we don't have that much money).
    Mixed precision training (optimization issues on multiple GPUs).
    Tempered softmax training, entropy maximization training.
    Joint training using monolingual and parallel corpora.
    MBART pre-training with cross-lingual constraints.
    Sentence representation and attention extraction.
    Scoring translations using trained NMT models. (for reranking, filtering or quality estimation)
  2. Multilingual training
    Fine-grained control over checkpoint saving for optimising per language pair performance.
  3. Fine-grained parameter transfer
    Remap embeddings and layers between pre-trained and fine-tuned models.
    Eliminate components or layers prior to decoding or fine-tuning.
  4. Model compression
    Training compact models from scratch via recurrently stacked layers (similar to what is used in ALBERT).
    Distillation of pre-trained and fine-tuned models. Distillation styles supported: label cross-entropy, attention cross-entropy, layer similarity.
  5. Simultaneous NMT
    Simulated wait-k NMT, where we train and decode wait-k models or decode full-sentence models using wait-k.
  6. Multi-source and Document NMT
    Vanilla multi-source with two input sentences belonging to different languages.
    Document level NMT where one input is the current sentence and the other one is the context.
    Can be combined with wait-k NMT
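
To make the wait-k idea in feature 5 concrete, here is a minimal sketch of the usual wait-k schedule: at target step t (0-indexed) the decoder may attend to the first min(k + t, source length) source tokens. The function names below are hypothetical and are not taken from the toolkit; how YANMTT builds its masks internally is not described in this README, so treat this only as an illustration of the schedule.

```python
from typing import List


def wait_k_visible_source(k: int, src_len: int, tgt_len: int) -> List[int]:
    """For each target step t (0-indexed), return how many source tokens
    are visible under a wait-k schedule: min(k + t, src_len)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return [min(k + t, src_len) for t in range(tgt_len)]


def wait_k_mask(k: int, src_len: int, tgt_len: int) -> List[List[bool]]:
    """Boolean cross-attention mask (hypothetical helper): mask[t][s] is
    True when target step t is allowed to see source position s."""
    visible = wait_k_visible_source(k, src_len, tgt_len)
    return [[s < n for s in range(src_len)] for n in visible]


if __name__ == "__main__":
    # Print one row per target step; "x" marks a visible source position.
    for row in wait_k_mask(k=2, src_len=5, tgt_len=4):
        print("".join("x" if allowed else "." for allowed in row))
```

With a large enough k every row is fully visible, which corresponds to ordinary full-sentence decoding.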

Prerequisites (core):
Python v3.6
PyTorch v1.7.1
HuggingFace Transformers v4.3.2 (install the modified copy of the transformers library provided with this toolkit)
tensorflow-gpu v2.3.0
sentencepiece v0.1.95 (you will need to go to https://github.com/google/sentencepiece and install it, as the spm_train binary will be used later)
gputil v1.4.0
cuda 10.0/10.1/10.2 (tested on 10.0)

How to install:

  1. Clone the repo and go to the toolkit directory via: "git clone https://github.com/prajdabre/yanmtt && cd yanmtt"
  2. Create a virtual environment with python3.6 and activate it via: "virtualenv -p /usr/bin/python3.6 py36 && source py36/bin/activate"
  3. Update pip via "pip install pip --upgrade" and then install the required packages via: "pip install -r requirements.txt"
  4. Install the modified version of transformers provided along with this repo by: "cd transformers && python setup.py install"
  5. Modify the "create_autotokenizer.sh" file by specifying the correct path to sentencepiece trainer ("spm_train") in line 8
  6. Set the python path to the local transformers repo by: PYTHONPATH=$PYTHONPATH:/path/to/this/toolkit/transformers
  7. Make sure that the PATH and LD_LIBRARY_PATH variables point to the appropriate CUDA folders (bin and lib64/lib respectively)
  8. Whenever you do a git pull and the files in the transformers repo have been updated, remember to run "python setup.py install" again to update the compiled python scripts
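
Since steps 4, 6 and 8 all depend on Python picking up the modified transformers copy rather than some other installation, a quick sanity check can help. The sketch below uses only the standard library to report where a package would be imported from; the helper name locate_package is hypothetical and not part of the toolkit.

```python
import importlib.util
from typing import Optional


def locate_package(name: str) -> Optional[str]:
    """Return the filesystem location Python would import `name` from,
    or None if the package cannot be found on the current path."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    if spec is None:
        return None
    # Regular packages expose `origin`; namespace packages only expose
    # their search locations.
    if spec.origin:
        return spec.origin
    locations = list(spec.submodule_search_locations or [])
    return locations[0] if locations else None


if __name__ == "__main__":
    print("transformers resolves to:", locate_package("transformers"))
```

If the printed path does not point at the copy you installed from this repository, revisit the PYTHONPATH setting from step 6 or rerun the install from step 4.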

Scripts and their functionality:

  1. create_autotokenizer.sh and create_autotokenizer.py: These scripts govern the creation of a unigram SPM or BPE tokenizer. The shell script creates the subword segmenter using sentencepiece which can make both SPM and BPE models. All you need is a monolingual corpus for the languages you are interested in. The python script wraps this around an AlbertTokenizer (for SPM) or MBartTokenizer (for BPE), adds special user defined tokens and saves a configuration file for use in the future via an AutoTokenizer.
    Usage: see examples/create_tokenizer.sh

  2. pretrain_nmt.py: This is used to train an MBART model. At the very least you need a monolingual corpus for the languages you are interested in and a tokenizer trained for those languages. This script can also be used to do MBART style training jointly with regular NMT training, although the NMT training is rather basic because there is no evaluation during training. If you want to do advanced NMT training then you should use the "train_nmt.py" script. Ultimately, you should not use the outcome of this script to perform final translations. Additional advanced usages include: simulated wait-k simultaneous NMT, knowledge distillation, and fine-tuning pre-existing MBART models with fine-grained control over what should be initialized or tuned. Read the code and the command line arguments for a better understanding of the advanced features.
    Usage: see examples/train_mbart_model.sh

  3. train_nmt.py: This is used to either train an NMT model from scratch or fine-tune a pre-existing MBART or NMT model. At the very least you need a parallel corpus (preferably split into train, dev and test sets, although we can make do with only a train set) for the language pairs you are interested in. There are several advanced features such as: simulated wait-k simultaneous NMT, knowledge distillation, fine-grained control over what should be initialized or tuned, document NMT, multi-source NMT and multilingual NMT training.
    Usage: see examples/train_or_fine_tune_model.sh

  4. decode_model.py: This is used to decode sentences using a trained model. Additionally you can do translation pair scoring, forced decoding, forced alignment (experimental), encoder/decoder representation extraction and alignment visualization.
    Usage: see examples/decode_or_probe_model.sh

  5. common_utils.py: This contains all housekeeping functions such as corpora splitting, batch generation, loss computation etc. Do take a look at all the methods since you may need to modify them.

  6. average_checkpoints.py: You can average the specified checkpoints using either arithmetic or geometric averaging.
    Usage: see examples/avergage_model_checkpoints.sh

  7. gpu_blocker.py: This is used to temporarily occupy a GPU in case you use a shared GPU environment. Run this in the background before launching the training processes so that, while the training scripts are busy with preprocessing such as sharding or model loading, the GPU you aim for is not occupied by someone else. Usage is shown in the example scripts for training.
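
Item 6 above mentions arithmetic and geometric checkpoint averaging. The following is a minimal, framework-free sketch of both, with each checkpoint modelled as a dictionary from parameter name to a flat list of floats. This is an illustration only and not the code in average_checkpoints.py: the Checkpoint type and the function name are hypothetical, and the geometric variant here assumes strictly positive values, since this README does not say how the script treats negative weights.

```python
import math
from typing import Dict, List

# Hypothetical stand-in for a model state dict: name -> flat list of values.
Checkpoint = Dict[str, List[float]]


def average_checkpoints(checkpoints: List[Checkpoint], method: str = "arithmetic") -> Checkpoint:
    """Average parameters element-wise across checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged: Checkpoint = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter from every checkpoint.
        columns = list(zip(*(ckpt[name] for ckpt in checkpoints)))
        if method == "arithmetic":
            averaged[name] = [sum(vals) / len(vals) for vals in columns]
        elif method == "geometric":
            # Assumption: values are strictly positive; math.log raises
            # ValueError otherwise.
            averaged[name] = [
                math.exp(sum(math.log(v) for v in vals) / len(vals))
                for vals in columns
            ]
        else:
            raise ValueError("method must be 'arithmetic' or 'geometric'")
    return averaged


if __name__ == "__main__":
    ckpts = [{"w": [1.0, 4.0]}, {"w": [4.0, 16.0]}]
    print(average_checkpoints(ckpts, "arithmetic"))
    print(average_checkpoints(ckpts, "geometric"))
```

For the actual command line options, refer to the example script listed under item 6.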

Note:

  1. Whenever running the example usage scripts simply run them as examples/scriptname.sh from the root directory of the toolkit
  2. The data under examples/data is taken from https://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/ and is released as the ALT Parallel Corpus under a Creative Commons Attribution 4.0 International (CC BY 4.0) license

License and copyright:

  1. MIT licence for code that I wrote.
  2. Apache licence for modifications or additions to the huggingface code.

Copyright 2021 National Institute of Information and Communication Technology (Raj Dabre)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Contact:
Contact me (Raj Dabre) at [email protected] or [email protected]

Backstory: Why I made this toolkit

Despite the fact that I enjoy coding, I never really pushed myself throughout my Masters and Ph.D. towards writing a self-contained toolkit. I had always known that coding is an important part of research and although I had made plenty of meaningful changes to several code bases, I never felt like I owned any of those changes. Fast forward to 2020, when I wanted to play with MBART/BART/MASS. It would have been easy to use fairseq or tensor2tensor but then again the feeling of lack of ownership would remain. Huggingface provides a lot of implementations but (at the time) had no actual script to easily do MBART pre-training. All I had was this single comment "https://github.com/huggingface/transformers/issues/5096#issuecomment-645860271". After a bit of hesitation I decided to get my hands dirty and make a quick notebook for MBART pretraining. That snowballed into me writing my own pipeline for data sharding, preprocessing and training. Since I was at it I wrote a pipeline for fine-tuning. Why not go further and write a pipeline for decoding and analysis? Fine-grained control over fine-tuning? Distillation? Multi-source NMT? Document NMT? Simultaneous Wait-K NMT? 3 months later I ended up with this toolkit which I wanted to share with everyone. Since I have worked in low-resource MT and efficient MT, this toolkit will mostly contain implementations that somehow involve transfer learning, compression/distillation and simultaneous NMT. I am pretty sure it's not as fast or perfect as the ones written by the awesome people at GAFA but I will be more than happy if a few people use my toolkit.

Comments
  • ImportError: cannot import name 'AutoTokenizer' from 'transformers' (unknown location)

    Hi, I followed the steps mentioned to install, but meet the following error when trying to run the pre-training command. ImportError: cannot import name 'AutoTokenizer' from 'transformers' (unknown location) P.S: I was able to install this library and do the pre-training a few weeks back but I tried to do it again and see the above error.

    Thanks for your work.

    opened by nikhilbyte 12
  • Unexpected Keyword arguments prompt_params, adaptor_layers, deep_adaptor_tuning, deep_adaptor_tuning_ffn_only, parallel_adaptors

    Hi @prajdabre ,

    Thanks for doing great work with this library. I was trying to use it and ran into this issue where prompt_params, adaptor_layers, deep_adaptor_tuning, deep_adaptor_tuning_ffn_only, parallel_adaptors params are being passed here to forward here but the MBartForConditionalGeneration class's forward function doesn't expect it.

    Wanted to understand from you if the fix is as simple as creating these params in forward function call with default value of None (in which case I'm guessing we would need to make changes in the forward functions implementation itself to use these params).

    Let me know if you think I might be missing something here. Thanks!

    opened by goru001 7
  • Error when trying to pretrain with other language extensions apart from hi

    The command we are using: python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model ai4bharat/IndicBART --tokenizer_name_or_path ai4bharat/IndicBART --langs kn --mono_src /home/aniruddha/all_data/train.kn --batch_size 8 --batch_size_indicates_lines --shard_files --model_path aibharat/IndicBART/model --port 7878


    Traceback (most recent call last):
      File "pretrain_nmt.py", line 970, in <module>
        run_demo()
      File "pretrain_nmt.py", line 967, in run_demo
        mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
      File "/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
        return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
      File "/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
        while not context.join():
      File "/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
        raise Exception(msg)
    Exception:

    -- Process 0 terminated with the following error:
    Traceback (most recent call last):
      File "/home/aniruddha/anaconda3/envs/pretam/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
        fn(i, *args)
      File "/home/aniruddha/yanmtt/pretrain_nmt.py", line 521, in model_create_load_run_save
        lprobs, labels, args.label_smoothing, ignore_index=tok.pad_token_id
      File "/home/aniruddha/yanmtt/common_utils.py", line 147, in label_smoothed_nll_loss
        smooth_loss.masked_fill(pad_mask, 0.0)
    RuntimeError: The expanded size of the tensor (333) must match the existing size (332) at non-singleton dimension 1. Target sizes: [8, 333, 1]. Tensor sizes: [8, 332, 1]


    But when the data file has ".hi " language extension the code works fine.

    opened by raypretam 6
  • AttributeError: 'Seq2SeqLMOutput' object has no attribute 'additional_lm_logits'

    After I follow the installation and run examples/train_mbart_model.sh, I get the below error.

    Loading from checkpoint
    Traceback (most recent call last):
      File "pretrain_nmt.py", line 630, in <module>
        run_demo()
      File "pretrain_nmt.py", line 627, in run_demo
        mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
      File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
        return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
      File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
        while not context.join():
      File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
        raise Exception(msg)
    Exception:

    -- Process 0 terminated with the following error:
    Traceback (most recent call last):
      File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
        fn(i, *args)
      File "/root/yanmtt/pretrain_nmt.py", line 359, in model_create_load_run_save
        if mod_compute.additional_lm_logits is not None:
    AttributeError: 'Seq2SeqLMOutput' object has no attribute 'additional_lm_logits'

    What may be going wrong? The version of transformers I have is 4.3.2.

    opened by pruksmhc 6
  • Error in BART Monolingual Pre-training.

    I am getting the following error while training on the monolingual (Hindi) corpus. I successfully trained the tokenizer on the same corpus using create_autotokenizer.sh.

    Error Logs:
    Shuffling corpus!
    Keyword arguments {'sample': False, 'nbest': 64, 'alpha_or_dropout': 0.1} not recognized.
    (the line above is repeated many times in the log)
    Saving the model
    Loading from checkpoint
    Traceback (most recent call last):
      File "pretrain_nmt.py", line 989, in <module>
        run_demo()
      File "pretrain_nmt.py", line 986, in run_demo
        mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
      File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
        return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
      File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
        while not context.join():
      File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
        raise ProcessRaisedException(msg, error_index, failed_process.pid)
    torch.multiprocessing.spawn.ProcessRaisedException:

    -- Process 0 terminated with the following error:
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
        fn(i, *args)
      File "/workspace/data/yanmtt/pretrain_nmt.py", line 530, in model_create_load_run_save
        mod_compute = model(input_ids=input_ids, attention_mask=input_masks, decoder_input_ids=decoder_input_ids, output_hidden_states=args.distillation, output_attentions=args.distillation, label_mask=label_mask if args.num_domains_for_domain_classifier > 1 else None) ## Run the model and get logits.
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
        return forward_call(*input, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
        output = self._run_ddp_forward(*inputs, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
        return module_to_run(*inputs[0], **kwargs[0])
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
        return forward_call(*input, **kwargs)
    TypeError: forward() got an unexpected keyword argument 'label_mask'

    opened by Ab26 5
  • dependency conflicts in requirements.txt

    Steps to reproduce the error

    conda create --name indicbart python=3.6
    conda activate indicbart
    pip install -r requirements.txt
    

    The error

    The conflict is caused by: 
    The user requested scipy==1.5.4 
    imagehash 4.2.1 depends on scipy 
    missingno 0.5.0 depends on scipy 
    pandas-profiling 3.1.0 depends on scipy>=1.4.1 
    phik 0.12.0 depends on scipy>=1.5.2 
    seaborn 0.11.1 depends on scipy>=1.0 
    tensor2tensor 1.14.0 depends on scipy 
    tensorflow-gpu 2.3.0 depends on scipy==1.4.1
    

    Can you please guide as to how to resolve the dependency issues?

    opened by ShivprasadSagare 5
  • Trying to pretrain mBART model.

    Hi, I'm trying to pre-train mBART model using the following parameters: !python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --fp16 --pretrained_model "facebook/mbart-large-50" --model_path "facebook/mbart-large-50" --tokenizer_name_or_path "facebook/mbart-large-50" --mono_src "/content/yanmtt/cleaned_Sanskrit_text_for_LM.txt" --shard_files --batch_size 16

    I'm getting this error.

    Using label smoothing of 0.1
    Using gradient clipping norm of 1.0
    Using softmax temperature of 1.0
    Masking ratio: 0.3
    Training for: ['']
    Shuffling corpus!
    Zero size batch due to an abnormal example. Skipping empty batch.
    Zero size batch due to an abnormal example. Skipping empty batch.
    Zero size batch due to an abnormal example. Skipping empty batch.
    Saving the model
    Loading from checkpoint
    Traceback (most recent call last):
      File "pretrain_nmt.py", line 888, in <module>
        run_demo()
      File "pretrain_nmt.py", line 885, in run_demo
        mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
      File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 199, in spawn
        return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
      File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
        while not context.join():
      File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 118, in join
        raise Exception(msg)
    Exception:

    -- Process 0 terminated with the following error:
    Traceback (most recent call last):
      File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
        fn(i, *args)
      File "/content/yanmtt/pretrain_nmt.py", line 488, in model_create_load_run_save
        lprobs, labels, args.label_smoothing, ignore_index=tok.pad_token_id
      File "/content/yanmtt/common_utils.py", line 130, in label_smoothed_nll_loss
        nll_loss = -lprobs.gather(dim=-1, index=target)
    RuntimeError: Size does not match at dimension 1 expected index [1, 13, 1] to be smaller than src [1, 12, 250054] apart from dimension 2

    opened by nikhilbyte 4
  • Three custom languages and two tasks — is this a good place to start?

    I have aligned datasets for three different custom languages. Each corpus is a flat text file where each line is a sentence, and documents are separated by empty lines. All sentences and documents match between the datasets. There are two tasks I'd like to be able to perform: 1) translate between the languages, and 2) infill sentences from any single language. For the translation task, given languages A, B, and C, it's actually not likely I'll ever go from C -> A or B -> A, but I definitely want to translate A -> B and A -> C. Other translations that would be helpful would be B -> C and C -> B.

    From the MBART examples at HuggingFace it looks like MBartForConditionalGeneration could perhaps do task 1 (though maybe not in all directions listed above?), and BartForConditionalGeneration could do task 2. But is there any reason why MBartForConditionalGeneration couldn't do both? That is, if I pass an input with a <mask> token to MBART, will it perform the infilling, just as BART would? If so, then does your toolkit make sense as a place to start?

    Any thoughts very much appreciated.

    opened by jbmaxwell 4
  • Tokenization issue with pretrained model

    I am trying to pretrain BART further from the huggingface checkpoint with the below command, and it seems like there is an issue with mismatched amount of arguments for _tokenize.

    The command is below: python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model facebook/bart-large --tokenizer_name_or_path facebook/bart-large --langs en --mono_src examples/data/train.en --batch_size 8

    The error is:
    Using softmax temperature of 1.0
    Masking ratio: 0.3
    Training for: ['en']
    Shuffling corpus!
    Traceback (most recent call last):
      File "pretrain_nmt.py", line 628, in <module>
        run_demo()
      File "pretrain_nmt.py", line 625, in run_demo
        mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
      File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
        return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
      File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
        while not context.join():
      File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
        raise Exception(msg)
    Exception:

    -- Process 0 terminated with the following error:
    Traceback (most recent call last):
      File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
        fn(i, *args)
      File "/root/yanmtt/pretrain_nmt.py", line 221, in model_create_load_run_save
        for input_ids, input_masks, decoder_input_ids, labels in generate_batches_monolingual_masked_or_bilingual(tok, args, rank, files, train_files, ctr): #Batches are generated from here. The argument (0.30, 0.40) is a range which indicates the percentage of the source sentence to be masked in case we want masking during training just like we did during BART pretraining. The argument 3.5 is the lambda to the poisson length sampler which indicates the average length of a word sequence that will be masked. Since this is pretraining we do not do any evaluations even if we train on parallel corpora.
      File "/root/yanmtt/common_utils.py", line 482, in generate_batches_monolingual_masked
        iids = tok(lang + " " + masked_sentence + " ", add_special_tokens=False, return_tensors="pt").input_ids
      File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils_base.py", line 2377, in __call__
        **kwargs,
      File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils_base.py", line 2447, in encode_plus
        **kwargs,
      File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils.py", line 441, in _encode_plus
        first_ids = get_input_ids(text)
      File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils.py", line 410, in get_input_ids
        tokens = self.tokenize(text, **kwargs)
      File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils.py", line 342, in tokenize
        tokenized_text = split_on_tokens(no_split_token, text)
      File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils.py", line 336, in split_on_tokens
        for token in tokenized_text
      File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils.py", line 336, in
        for token in tokenized_text
    TypeError: _tokenize() takes 2 positional arguments but 5 were given

    Upon some further inspection, it seems like in a commit a few days ago, this line was changed to have 4 arguments: https://github.com/prajdabre/yanmtt/blob/main/transformers/src/transformers/tokenization_utils.py#L319

    However, the _tokenize function for BART tokenizer (which inherits all the way down from GPT2 I believe), takes in less arguments: https://github.com/prajdabre/yanmtt/blob/main/transformers/src/transformers/models/gpt2/tokenization_gpt2.py#L241

    opened by pruksmhc 4
  • Pre-training hangs

    I run bash examples/create_tokenizer.sh and then bash examples/create_tokenizer.sh, but the latter shows

    IP address is localhost
    Monolingual training files are: {'hi': 'examples/data/train.hi', 'en': 'examples/data/train.en', 'vi': 'examples/data/train.vi'}
    Sharding files into 1 parts
    For language: hi  the total number of lines are: 18088 and number of lines per shard are: 18088
    File for language hi has been sharded.
    For language: en  the total number of lines are: 18088 and number of lines per shard are: 18088
    File for language en has been sharded.
    For language: vi  the total number of lines are: 18088 and number of lines per shard are: 18088
    File for language vi has been sharded.
    Sharding files into 1 parts
    

    and then hangs without showing anything else. If I press ^C to cancel, the following traceback is shown:

      File "pretrain_nmt.py", line 888, in <module>
        run_demo()
      File "pretrain_nmt.py", line 885, in run_demo
        mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,))         #
      File "/home/user/.conda/envs/yanmtt/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
        return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
      File "/home/user/.conda/envs/yanmtt/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
        while not context.join():
      File "/home/user/.conda/envs/yanmtt/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 101, in join
        timeout=timeout,
      File "/home/user/.conda/envs/yanmtt/lib/python3.6/multiprocessing/connection.py", line 911, in wait
        ready = selector.select(timeout)
      File "/home/user/.conda/envs/yanmtt/lib/python3.6/selectors.py", line 376, in select
        fd_event_list = self._poll.poll(timeout)
    

    I am running YANMTT in a Docker container on a machine with an A100 40GB GPU. The only dependency for which I am using a newer version is torch, as the version in requirements.txt is too old for my GPU.
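
    For context on the log above, which reports the number of lines per shard before the worker processes are spawned: a minimal sketch of splitting a corpus into shards. The function name shard_lines and the use of ceiling division are assumptions for illustration, not YANMTT's actual implementation.

```python
import math


def shard_lines(lines, num_shards):
    """Split a list of lines into num_shards consecutive parts.

    Assumption: lines per shard is ceil(total / num_shards), so the last
    shard may be shorter than the others.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    per_shard = math.ceil(len(lines) / num_shards)
    return [lines[i * per_shard:(i + 1) * per_shard] for i in range(num_shards)]


if __name__ == "__main__":
    corpus = [f"sentence {i}" for i in range(18088)]
    shards = shard_lines(corpus, 1)
    print(len(shards), len(shards[0]))
```

    With one shard, as in the log, every language file ends up in a single part containing all 18088 lines.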

    opened by jaspock 3
  • Using masked inputs at inference time

    I am considering using YANMTT to train my own BART model. However, instead of using it as the initial model for a subsequent fine-tuning process, I am interested in using the BART model itself to generate alternative versions of the input sentence. To do this, I would like to mask a percentage of the words in a sentence at inference time and let the model generate a variation of it via beam search decoding:

    • Original sentence: Mike goes to the bookstore on Thursday
    • Possible masked input sentence: <mask> goes to the bookstore <mask>
    • Possible model output: Jerry happily goes to the bookstore with his friends

    Can this be easily done with YANMTT? I am trying to have my own model for the generation of synthetic samples discussed in the paper "Detecting Hallucinated Content in Conditional Neural Sequence Generation" (section 3.1).
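
    The masking step described above could look roughly like the following sketch. It only prepares the masked input string; it does not call YANMTT or any model. The function name mask_words, the default ratio, and the choice to merge adjacent masked words into a single <mask> (as in the "on Thursday" example) are assumptions for illustration.

```python
import random


def mask_words(sentence, ratio=0.3, mask_token="<mask>", seed=None):
    """Replace roughly `ratio` of the whitespace-separated words with mask_token.

    Assumption: consecutive masked words are collapsed into one mask token,
    so the model is free to generate a span of any length in that position.
    """
    rng = random.Random(seed)
    words = sentence.split()
    if not words:
        return sentence
    n_mask = max(1, round(len(words) * ratio))
    masked_positions = set(rng.sample(range(len(words)), n_mask))
    out = []
    for i, word in enumerate(words):
        if i in masked_positions:
            # Only emit a mask token if the previous output is not already one.
            if not out or out[-1] != mask_token:
                out.append(mask_token)
        else:
            out.append(word)
    return " ".join(out)


if __name__ == "__main__":
    print(mask_words("Mike goes to the bookstore on Thursday", ratio=0.3, seed=0))
```

    The resulting string would then be passed to the decoder in place of the original sentence; which positions get masked depends on the seed.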

    opened by jaspock 3
  • Alternative to installing sentencepiece

    Hello!

    Just wanted to know if there is an alternative to installing sentencepiece. It seems to require sudo access, and I get the following error:

    (bart_pretraining) make install
    [ 10%] Built target sentencepiece_train-static
    Consolidate compiler generated dependencies of target sentencepiece-static
    [ 46%] Built target sentencepiece-static
    Consolidate compiler generated dependencies of target sentencepiece
    [ 82%] Built target sentencepiece
    Consolidate compiler generated dependencies of target spm_decode
    [ 84%] Built target spm_decode
    Consolidate compiler generated dependencies of target sentencepiece_train
    [ 93%] Built target sentencepiece_train
    Consolidate compiler generated dependencies of target spm_normalize
    [ 95%] Built target spm_normalize
    Consolidate compiler generated dependencies of target spm_train
    [ 97%] Built target spm_train
    Consolidate compiler generated dependencies of target spm_export_vocab
    [ 99%] Built target spm_export_vocab
    Consolidate compiler generated dependencies of target spm_encode
    [100%] Built target spm_encode
    Install the project...
    -- Install configuration: ""
    CMake Error at cmake_install.cmake:46 (file):
      file cannot create directory: /usr/local/lib64/pkgconfig. Maybe need administrative privileges.

    when I run make -j $(nproc)

    opened by Sreyan88 2
  • Evaluation during training BARTforConditionalGeneration pre-training on English corpora

    Hello,

    Great repo! It's of great help to me. I just had 2 questions:

    1. How do you do evaluation for pre-training?
    2. Does the pre-training involve both mask infilling and sentence permutation? If it does both, can I do mask infilling only? My main goal is to fine-tune a pre-trained BART with mask infilling on an English corpus.

    Thank You so much!

    opened by Sreyan88 1
  • Mixtures of denoisers

    Currently, I have implemented the mBART (span denoising) and mT5 (span prediction) pre-training approaches, but according to the UL2 paper (https://arxiv.org/pdf/2205.05131.pdf) a more comprehensive mixture of denoisers would help a lot.

    Currently, you may use either the mT5 or the mBART style, but I would like to enable the user to specify a comma-separated list of denoising objectives and a comma-separated list of the probabilities of using these objectives, along with the requisite hyperparameters for each objective. If this is done, we can play with some cool stuff.
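
    A minimal sketch of how such a mixture specification might be parsed and sampled, assuming two comma-separated strings as input. The function names and the objective labels "mbart" and "mt5" are hypothetical, and per-objective hyperparameters are left out.

```python
import random


def parse_denoiser_mixture(objectives, probabilities):
    """Parse comma-separated objectives and probabilities into aligned lists.

    The probabilities are normalised so that they sum to 1.
    """
    names = [name.strip() for name in objectives.split(",") if name.strip()]
    probs = [float(p) for p in probabilities.split(",") if p.strip()]
    if not names or len(names) != len(probs):
        raise ValueError("need one probability per denoising objective")
    if any(p < 0 for p in probs) or sum(probs) <= 0:
        raise ValueError("probabilities must be non-negative and not all zero")
    total = sum(probs)
    return names, [p / total for p in probs]


def sample_objective(names, probs, rng):
    """Pick the denoising objective to use for the next batch."""
    return rng.choices(names, weights=probs, k=1)[0]


if __name__ == "__main__":
    names, probs = parse_denoiser_mixture("mbart, mt5", "3,1")
    rng = random.Random(0)
    print(names, probs, [sample_objective(names, probs, rng) for _ in range(5)])
```

    Sampling one objective per batch keeps each batch homogeneous; sampling per sentence would be an equally plausible design.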

    opened by prajdabre 0
  • Add post-norm to the model

    Currently, the mBART backbone code I use has pre-norm, which is layer(norm(input))+input, whereas some people seem to say that post-norm, which is norm(layer(input)+input), might be better for zero-shot. Lord alone knows what's going to be useful when.

    Having a flag to control pre- and post-norm in the encoder and decoder would be perfect.
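
    A minimal sketch of what such a flag could control, written in plain Python on lists of floats rather than against the actual mBART code. The names residual_block and pre_norm are hypothetical, and the layer norm here has no learned scale or bias.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalise a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def residual_block(x, sublayer, pre_norm=True):
    """Apply a sublayer with a residual connection.

    pre_norm=True:  sublayer(norm(x)) + x
    pre_norm=False: norm(sublayer(x) + x)
    """
    if pre_norm:
        y = sublayer(layer_norm(x))
        return [a + b for a, b in zip(y, x)]
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(y, x)])


if __name__ == "__main__":
    def double(v):
        return [2 * t for t in v]

    x = [1.0, 2.0, 3.0]
    print(residual_block(x, double, pre_norm=True))
    print(residual_block(x, double, pre_norm=False))
```

    With post-norm the block output is always normalised, whereas with pre-norm the unnormalised input passes straight through the residual path.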

    good first issue 
    opened by prajdabre 0
Releases(v2.0)
  • v2.0(Apr 18, 2022)

    This is the second release of YANMTT! I have fixed a number of bugs I noticed in the previous release and added several new features, including but not limited to:

    1. GUI to demo and debug your models trained with YANMTT as well as existing models like mBART, mBART-50, BART, IndicBART.
    2. Mixtures-of-experts layers for massive models.
    3. Adaptor tuning, prompt tuning, and hypercomplex adaptors.
    4. New multi-source fusion methods.
  • v1.0(Mar 10, 2022)

    This is the first release of YANMTT. I made this release because I have made lots of recent changes to the toolkit that I intend to push. With these changes, several behaviors will change, including commands that may not work anymore. In particular, this will affect people who do IndicBART fine-tuning. So if you have used YANMTT before the 10th of March 2022 and wish to resume your experiments, then use the old version of the code by using the tag v1.0.

Owner
Raj Dabre
Researcher at NICT Japan. Working on low resource Machine Translation. Wants to collab with researchers interested in adversarial and reinforcement learning.