Code associated with the "Data Augmentation using Pre-trained Transformer Models" paper

Last update: Dec 31, 2022

Overview

The code implements the following data augmentation methods:

EDA (Baseline)
Backtranslation (Baseline)
CBERT (Baseline)
BERT Prepend (Our paper)
GPT-2 Prepend (Our paper)
BART Prepend (Our paper)
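
The "Prepend" methods condition a pre-trained model on the class label by placing the label in front of each training example. The sketch below illustrates that idea only; it is not the repository's implementation. The function names and the `SEP`/`EOS` placeholder tokens are assumptions made for illustration, and each real model uses its own special tokens.

```python
def prepend_label(label, text, sep="SEP"):
    """Return one conditioning sequence: the label, a separator, then the text."""
    return f"{label} {sep} {text}"


def build_training_text(examples, sep="SEP", eos="EOS"):
    """Join label-prepended examples into one stream for fine-tuning a generator.

    `examples` is a list of (label, text) pairs. `sep` and `eos` are
    placeholder strings (assumptions), not the tokens used by any real model.
    """
    return " ".join(f"{prepend_label(label, text, sep)} {eos}" for label, text in examples)


examples = [
    ("PlayMusic", "play some jazz"),
    ("GetWeather", "will it rain tomorrow"),
]
print(build_training_text(examples))
```

At generation time, a model fine-tuned on such text can be prompted with a label and the separator so that it continues with a new example for that class.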

Datasets

In the paper, we use three datasets: STSA-2, TREC, and SNIPS.

Low-data regime experiment setup

Run the src/utils/download_and_prepare_datasets.sh script to prepare all datasets.
download_and_prepare_datasets.sh performs the following steps:

Downloads the data from GitHub.
Replaces numeric labels with text labels for the STSA-2 and TREC datasets.
Creates 15 random splits of train and dev data for each dataset.
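
The label replacement and splitting steps can be sketched as follows. This is a minimal illustration, not the script's actual logic: the label map, the split sizes, the seeding scheme, and the function names are all assumptions, and the real script may sample differently (for example, per class).

```python
import random

# Hypothetical label map; the mapping used by the real script may differ.
LABEL_NAMES = {"0": "Negative", "1": "Positive"}


def replace_labels(rows, label_names):
    """Swap numeric labels for text labels; rows are (label, text) pairs."""
    return [(label_names.get(label, label), text) for label, text in rows]


def make_splits(rows, num_splits=15, train_size=10, dev_size=5, seed=0):
    """Create num_splits random (train, dev) subsets of rows."""
    splits = []
    for i in range(num_splits):
        rng = random.Random(seed + i)  # one reproducible RNG per split
        sample = rng.sample(rows, train_size + dev_size)  # no repeats within a split
        splits.append((sample[:train_size], sample[train_size:]))
    return splits


# Toy data standing in for a downloaded dataset.
rows = [(str(i % 2), f"example sentence {i}") for i in range(100)]
splits = make_splits(replace_labels(rows, LABEL_NAMES))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 15 10 5
```

Seeding each split separately keeps every split reproducible on its own, which helps when a single low-data run needs to be repeated.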

Dependencies

To run this code, you need the following dependencies:

PyTorch 1.5
fairseq 0.9
transformers 2.9
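
One way to confirm that an environment matches these versions is a small check like the one below. It is a sketch under assumptions: the package names (`torch`, `fairseq`, `transformers`) are the usual PyPI names rather than something this README states, and the helper names are hypothetical.

```python
from importlib import metadata

# Versions listed in this README; the PyPI package names are assumptions.
REQUIRED = {"torch": "1.5", "fairseq": "0.9", "transformers": "2.9"}


def matches(installed, required):
    """True if installed equals required or is a patch release of it (1.5 -> 1.5.1)."""
    return installed == required or installed.startswith(required + ".")


def check_dependencies(required=REQUIRED):
    """Return {package: (installed_version_or_None, ok)} for each required package."""
    report = {}
    for name, wanted in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        ok = installed is not None and matches(installed, wanted)
        report[name] = (installed, ok)
    return report


for name, (installed, ok) in check_dependencies().items():
    print(f"{name}: installed={installed} ok={ok}")
```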

How to run

To run a data augmentation experiment for a given dataset, run the corresponding bash script in the scripts folder. For example, to run data augmentation on the SNIPS dataset:

Run scripts/bart_snips_lower.sh for the BART experiment.
Run scripts/bert_snips_lower.sh for the rest of the data augmentation methods.
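
The choice between the two SNIPS scripts can be wrapped in a small launcher. The sketch below uses only the two script paths named above; the function names and the `dry_run` flag are hypothetical, and it assumes the command is run from the repository root.

```python
import subprocess


def script_for(method):
    """Pick the SNIPS script for a method, following the two examples in this README."""
    if method.lower() == "bart":
        return "scripts/bart_snips_lower.sh"
    return "scripts/bert_snips_lower.sh"  # covers the remaining methods


def run_experiment(method, dry_run=True):
    """Build the bash command; execute it only when dry_run is False."""
    command = ["bash", script_for(method)]
    if not dry_run:
        subprocess.run(command, check=True)  # raises if the script exits non-zero
    return command


print(run_experiment("bart"))
print(run_experiment("cbert"))
```

With `dry_run=True` the function only returns the command, which makes it easy to inspect before starting a long experiment.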

How to cite

@inproceedings{kumar-etal-2020-data,
    title = "Data Augmentation using Pre-trained Transformer Models",
    author = "Kumar, Varun  and
      Choudhary, Ashutosh  and
      Cho, Eunah",
    booktitle = "Proceedings of the 2nd Workshop on Life-long Learning for Spoken Language Systems",
    month = dec,
    year = "2020",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.lifelongnlp-1.3",
    pages = "18--26",
}

Contact

Please reach out to [email protected] with any questions related to this code.

License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 license.
