Code associated with the Don't Stop Pretraining ACL 2020 paper

Overview

dont-stop-pretraining

Code associated with the Don't Stop Pretraining ACL 2020 paper

Citation

@inproceedings{dontstoppretraining2020,
 author = {Suchin Gururangan and Ana Marasović and Swabha Swayamdipta and Kyle Lo and Iz Beltagy and Doug Downey and Noah A. Smith},
 title = {Don't Stop Pretraining: Adapt Language Models to Domains and Tasks},
 year = {2020},
 booktitle = {Proceedings of ACL},
}

Installation

conda env create -f environment.yml
conda activate domains

Working with the latest allennlp version

This repository works with a pinned allennlp version for reproducibility purposes. This pinned version of allennlp relies on pytorch-transformers==1.2.0, which requires you to manually download custom transformer models on disk.

To run this code with the latest allennlp/ transformers version (and use the huggingface model repository to its full capacity) checkout the branch latest-allennlp. Caution that we haven't tested out all models on this branch, so your results may vary from what we report in paper.

If you'd like to use this pinned allennlp version, read on. Otherwise, checkout latest-allennlp.

Available Pretrained Models

We've uploaded DAPT and TAPT models to huggingface.

DAPT models

Available DAPT models:

allenai/cs_roberta_base
allenai/biomed_roberta_base
allenai/reviews_roberta_base
allenai/news_roberta_base

TAPT models

Available TAPT models:

allenai/dsp_roberta_base_dapt_news_tapt_ag_115K
allenai/dsp_roberta_base_tapt_ag_115K
allenai/dsp_roberta_base_dapt_reviews_tapt_amazon_helpfulness_115K
allenai/dsp_roberta_base_tapt_amazon_helpfulness_115K
allenai/dsp_roberta_base_dapt_biomed_tapt_chemprot_4169
allenai/dsp_roberta_base_tapt_chemprot_4169
allenai/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688
allenai/dsp_roberta_base_tapt_citation_intent_1688
allenai/dsp_roberta_base_dapt_news_tapt_hyperpartisan_news_5015
allenai/dsp_roberta_base_dapt_news_tapt_hyperpartisan_news_515
allenai/dsp_roberta_base_tapt_hyperpartisan_news_5015
allenai/dsp_roberta_base_tapt_hyperpartisan_news_515
allenai/dsp_roberta_base_dapt_reviews_tapt_imdb_20000
allenai/dsp_roberta_base_dapt_reviews_tapt_imdb_70000
allenai/dsp_roberta_base_tapt_imdb_20000
allenai/dsp_roberta_base_tapt_imdb_70000
allenai/dsp_roberta_base_dapt_biomed_tapt_rct_180K
allenai/dsp_roberta_base_tapt_rct_180K
allenai/dsp_roberta_base_dapt_biomed_tapt_rct_500
allenai/dsp_roberta_base_tapt_rct_500
allenai/dsp_roberta_base_dapt_cs_tapt_sciie_3219
allenai/dsp_roberta_base_tapt_sciie_3219

The final numbers in each model above are the dataset sizes. Larger dataset sizes (e.g. imdb_70000 vs. imdb_20000) are curated TAPT models. These only exist for imdb, rct, and hyperpartisan_news.

Downloading Pretrained models

You can download a pretrained model using the scripts/download_model.py script.

Just supply a model type and serialization directory, like so:

python -m scripts.download_model \
        --model allenai/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688 \
        --serialization_dir $(pwd)/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688

This will output the allenai/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688 model for Citation Intent corpus in $(pwd)/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688

Downloading data

All task data is available on a public S3 url; check environments/datasets.py.

If you run the scripts/train.py command (see next step), we will automatically download the relevant dataset(s) using the URLs in environments/datasets.py. However, if you'd like to download the data for use outside of this repository, you will have to curl each dataset individually:

curl -Lo train.jsonl https://allennlp.s3-us-west-2.amazonaws.com/dont_stop_pretraining/data/chemprot/train.jsonl
curl -Lo dev.jsonl https://allennlp.s3-us-west-2.amazonaws.com/dont_stop_pretraining/data/chemprot/dev.jsonl
curl -Lo test.jsonl https://allennlp.s3-us-west-2.amazonaws.com/dont_stop_pretraining/data/chemprot/test.jsonl

Example commands

Run basic RoBERTa model

The following command will train a RoBERTa classifier on the Citation Intent corpus. Check environments/datasets.py for other datasets you can pass to the --dataset flag.

python -m scripts.train \
        --config training_config/classifier.jsonnet \
        --serialization_dir model_logs/citation_intent_base \
        --hyperparameters ROBERTA_CLASSIFIER_SMALL \
        --dataset citation_intent \
        --model roberta-base \
        --device 0 \
        --perf +f1 \
        --evaluate_on_test

You can supply other downloaded models to this script, by providing a path to the model:

python -m scripts.train \
        --config training_config/classifier.jsonnet \
        --serialization_dir model_logs/citation-intent-dapt-dapt \
        --hyperparameters ROBERTA_CLASSIFIER_SMALL \
        --dataset citation_intent \
        --model $(pwd)/pretrained_models/dsp_roberta_base_dapt_cs_tapt_citation_intent_1688 \
        --device 0 \
        --perf +f1 \
        --evaluate_on_test

Perform hyperparameter search

First, install allentune: https://github.com/allenai/allentune

Modify search_space/classifier.jsonnet accordingly.

Then run:

allentune search \
            --experiment-name ag_search \
            --num-cpus 56 \
            --num-gpus 4 \
            --search-space search_space/classifier.jsonnet \
            --num-samples 100 \
            --base-config training_config/classifier.jsonnet  \
            --include-package dont_stop_pretraining

Modify --num-gpus and --num-samples accordingly.

xFormers is a modular and field agnostic library to flexibly generate transformer architectures by interoperable and optimized building blocks.

Description xFormers is a modular and field agnostic library to flexibly generate transformer architectures by interoperable and optimized building bl

Facebook Research 2.3k Jan 08, 2023
Write Alphabet, Words and Sentences with your eyes.

The-Next-Gen-AI-Eye-Writer The Eye tracking Technique has become one of the most popular techniques within the human and computer interaction era, thi

Rohan Kasabe 2 Apr 05, 2022
A minimal Conformer ASR implementation adapted from ESPnet.

Conformer ASR A minimal Conformer ASR implementation adapted from ESPnet. Introduction I want to use the pre-trained English ASR model provided by ESP

Niu Zhe 3 Jan 24, 2022
Trains an OpenNMT PyTorch model and SentencePiece tokenizer.

Trains an OpenNMT PyTorch model and SentencePiece tokenizer. Designed for use with Argos Translate and LibreTranslate.

Argos Open Tech 61 Dec 13, 2022
PyTorch Implementation of "Non-Autoregressive Neural Machine Translation"

Non-Autoregressive Transformer Code release for Non-Autoregressive Neural Machine Translation by Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K.

Salesforce 261 Nov 12, 2022
The implementation of Parameter Differentiation based Multilingual Neural Machine Translation

The implementation of Parameter Differentiation based Multilingual Neural Machine Translation .

Qian Wang 21 Dec 17, 2022
This is the 25 + 1 year anniversary version of the 1995 Rachford-Rice contest

Rachford-Rice Contest This is the 25 + 1 year anniversary version of the 1995 Rachford-Rice contest. Can you solve the Rachford-Rice problem for all t

13 Sep 20, 2022
💛 Code and Dataset for our EMNLP 2021 paper: "Perspective-taking and Pragmatics for Generating Empathetic Responses Focused on Emotion Causes"

Perspective-taking and Pragmatics for Generating Empathetic Responses Focused on Emotion Causes Official PyTorch implementation and EmoCause evaluatio

Hyunwoo Kim 50 Dec 21, 2022
Training code for Korean multi-class sentiment analysis

KoSentimentAnalysis Bert implementation for the Korean multi-class sentiment analysis 왜 한국어 감정 다중분류 모델은 거의 없는 것일까?에서 시작된 프로젝트 Environment: Pytorch, Da

Donghoon Shin 3 Dec 02, 2022
2021 AI CUP Competition on Traditional Chinese Scene Text Recognition - Intermediate Contest

繁體中文場景文字辨識 程式碼說明 組別:這就是我 成員:蔣明憲 唐碩謙 黃玥菱 林冠霆 蕭靖騰 目錄 環境套件 安裝方式 資料夾布局 前處理-製作偵測訓練註解檔 前處理-製作分類訓練樣本 part.py : 從 json 裁切出分類訓練樣本 Class.py : 將切出來的樣本按照文字分類到各資料夾

HuanyueTW 3 Jan 14, 2022
Training code of Spatial Time Memory Network. Semi-supervised video object segmentation.

Training-code-of-STM This repository fully reproduces Space-Time Memory Networks Performance on Davis17 val set&Weights backbone training stage traini

haochen wang 128 Dec 11, 2022
Healthsea is a spaCy pipeline for analyzing user reviews of supplementary products for their effects on health.

Welcome to Healthsea ✨ Create better access to health with spaCy. Healthsea is a pipeline for analyzing user reviews to supplement products by extract

Explosion 75 Dec 19, 2022
Code for "Finetuning Pretrained Transformers into Variational Autoencoders"

transformers-into-vaes Code for Finetuning Pretrained Transformers into Variational Autoencoders (our submission to NLP Insights Workshop 2021). Gathe

Seongmin Park 22 Nov 26, 2022
Python code for ICLR 2022 spotlight paper EViT: Expediting Vision Transformers via Token Reorganizations

Expediting Vision Transformers via Token Reorganizations This repository contain

Youwei Liang 101 Dec 26, 2022
天池中药说明书实体识别挑战冠军方案;中文命名实体识别;NER; BERT-CRF & BERT-SPAN & BERT-MRC;Pytorch

天池中药说明书实体识别挑战冠军方案;中文命名实体识别;NER; BERT-CRF & BERT-SPAN & BERT-MRC;Pytorch

zxx飞翔的鱼 751 Dec 30, 2022
BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions

BERTopic BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters allowing for easily interpretable

Maarten Grootendorst 3.6k Jan 07, 2023
A notebook that shows how to import the IITB English-Hindi Parallel Corpus from the HuggingFace datasets repository

We provide a notebook that shows how to import the IITB English-Hindi Parallel Corpus from the HuggingFace datasets repository. The notebook also shows how to segment the corpus using BPE tokenizatio

Computation for Indian Language Technology (CFILT) 9 Oct 13, 2022
Quantifiers and Negations in RE Documents

Quantifiers-and-Negations-in-RE-Documents This project was part of my work for a

Nicolas Ruscher 1 Feb 01, 2022
An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition

CRNN paper:An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition 1. create your ow

Tsukinousag1 3 Apr 02, 2022
Code for the paper "Language Models are Unsupervised Multitask Learners"

Status: Archive (code is provided as-is, no updates expected) gpt-2 Code and models from the paper "Language Models are Unsupervised Multitask Learner

OpenAI 16.1k Jan 08, 2023