EMNLP 2021 - Frustratingly Simple Pretraining Alternatives to Masked Language Modeling

Overview

This is the official implementation for "Frustratingly Simple Pretraining Alternatives to Masked Language Modeling" (EMNLP 2021).

Requirements

  • torch
  • transformers
  • datasets
  • scikit-learn
  • tensorflow
  • spacy

How to pre-train

1. Clone this repository

git clone https://github.com/gucci-j/light-transformer-emnlp2021.git

2. Install required packages

cd ./light-transformer-emnlp2021
pip install -r requirements.txt

requirements.txt is located directly under the light-transformer-emnlp2021 directory.

We also need spaCy's en_core_web_sm model for preprocessing. If you have not installed this model yet, please run:

python -m spacy download en_core_web_sm

3. Preprocess datasets

cd ./src/utils
python preprocess_roberta.py --path=/path/to/save/data/

You need to specify the following argument:

  • path: (str) where to save the processed data

4. Pre-training

You need to specify configs as command line arguments. Sample configs for pre-training MLM are shown below. Running python pretrainer.py --help will display a help message.

cd ../
python pretrainer.py \
--data_dir=/path/to/dataset/ \
--do_train \
--learning_rate=1e-4 \
--weight_decay=0.01 \
--adam_epsilon=1e-8 \
--max_grad_norm=1.0 \
--num_train_epochs=1 \
--warmup_steps=12774 \
--save_steps=12774 \
--seed=42 \
--per_device_train_batch_size=16 \
--logging_steps=100 \
--output_dir=/path/to/save/weights/ \
--overwrite_output_dir \
--logging_dir=/path/to/save/log/files/ \
--disable_tqdm=True \
--prediction_loss_only \
--fp16 \
--mlm_prob=0.15 \
--pretrain_model=RobertaForMaskedLM 
  • pretrain_model should be selected from:
    • RobertaForMaskedLM (MLM)
    • RobertaForShuffledWordClassification (Shuffle)
    • RobertaForRandomWordClassification (Random)
    • RobertaForShuffleRandomThreeWayClassification (Shuffle+Random)
    • RobertaForFourWayTokenTypeClassification (Token Type)
    • RobertaForFirstCharPrediction (First Char)
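
For example, to pre-train with the Shuffle+Random objective instead of MLM, keep the rest of the sample command above and replace its last line with the following (note that --mlm_prob is presumably only used by the MLM objective):

--pretrain_model=RobertaForShuffleRandomThreeWayClassification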

Check the pre-training process

You can monitor the progress of pre-training via TensorBoard. Simply run the following:

tensorboard --logdir=/path/to/log/dir/

Distributed training

pretrainer.py is compatible with distributed training. Sample configs for pre-training MLM are as follows.

python -m torch.distributed.launch \
--nproc_per_node=8 \
pretrainer.py \
--data_dir=/path/to/dataset/ \
--model_path=None \
--do_train \
--learning_rate=5e-5 \
--weight_decay=0.01 \
--adam_epsilon=1e-8 \
--max_grad_norm=1.0 \
--num_train_epochs=1 \
--warmup_steps=24000 \
--save_steps=1000 \
--seed=42 \
--per_device_train_batch_size=8 \
--logging_steps=100 \
--output_dir=/path/to/save/weights/ \
--overwrite_output_dir \
--logging_dir=/path/to/save/log/files/ \
--disable_tqdm \
--prediction_loss_only \
--fp16 \
--mlm_prob=0.15 \
--pretrain_model=RobertaForMaskedLM 

For more details about launch.py, please refer to https://github.com/pytorch/pytorch/blob/master/torch/distributed/launch.py.
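
Note: on recent PyTorch releases (1.10 and later), launch.py is deprecated in favour of torchrun. Assuming pretrainer.py behaves the same under the newer launcher, the equivalent invocation would be:

torchrun --nproc_per_node=8 pretrainer.py <same pretrainer.py arguments as above>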

Mixed precision training

Installation

  • For PyTorch version >= 1.6, there is a native functionality to enable mixed precision training.
  • For older versions, NVIDIA apex must be installed (a sample install sketch follows this list).
    • You might encounter some errors when installing apex due to permission problems. To fix these, specify export TMPDIR='/path/to/your/favourite/dir/' and change permissions of all files under apex/.git/ to 777.
    • You also need to specify an optimisation method from https://nvidia.github.io/apex/amp.html.
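
A minimal install sketch for apex from source (Python-only build); the flags needed to also compile the CUDA/C++ extensions differ between apex versions, so please check the apex README:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir ./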

Usage
To use mixed precision during pre-training, just specify --fp16 as an input argument. For older PyTorch versions, also choose an --fp16_opt_level from O0, O1, O2, or O3.
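
For instance, with an older PyTorch version and apex installed, the two flags would be combined as follows (O1 is chosen purely for illustration):

python pretrainer.py <other arguments as above> --fp16 --fp16_opt_level=O1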

How to fine-tune

GLUE

  1. Download GLUE data

    git clone https://github.com/huggingface/transformers
    python transformers/utils/download_glue_data.py
    
  2. Create a json config file
    You need to create a .json file for configuration or use command line arguments.

    {
        "model_name_or_path": "/path/to/pretrained/weights/",
        "tokenizer_name": "roberta-base",
        "task_name": "MNLI",
        "do_train": true,
        "do_eval": true,
        "data_dir": "/path/to/MNLI/dataset/",
        "max_seq_length": 128,
        "learning_rate": 2e-5,
        "num_train_epochs": 3, 
        "per_device_train_batch_size": 32,
        "per_device_eval_batch_size": 128,
        "logging_steps": 500,
        "logging_first_step": true,
        "save_steps": 1000,
        "save_total_limit": 2,
        "evaluate_during_training": true,
        "output_dir": "/path/to/save/models/",
        "overwrite_output_dir": true,
        "logging_dir": "/path/to/save/log/files/",
        "disable_tqdm": true
    }

    For task_name and data_dir, please choose one from CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE, and WNLI.

  3. Fine-tune

    python run_glue.py /path/to/json/
    

    Instead of specifying a JSON path, you can directly specify configs as input arguments.
    You can also monitor training via TensorBoard.
    The --help option will display a help message.
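
    A sketch of the equivalent command-line form, assuming the flags mirror the JSON keys above (as is standard for Hugging Face training scripts):

    python run_glue.py \
    --model_name_or_path=/path/to/pretrained/weights/ \
    --tokenizer_name=roberta-base \
    --task_name=MNLI \
    --do_train \
    --do_eval \
    --data_dir=/path/to/MNLI/dataset/ \
    --max_seq_length=128 \
    --learning_rate=2e-5 \
    --num_train_epochs=3 \
    --per_device_train_batch_size=32 \
    --per_device_eval_batch_size=128 \
    --output_dir=/path/to/save/models/ \
    --overwrite_output_dir \
    --logging_dir=/path/to/save/log/files/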

SQuAD

  1. Download SQuAD data

    cd ./utils
    python download_squad_data.py --save_dir=/path/to/squad/
    
  2. Fine-tune

    cd ..
    export SQUAD_DIR=/path/to/squad/
    python run_squad.py \
    --model_type roberta \
    --model_name_or_path=/path/to/pretrained/weights/ \
    --tokenizer_name roberta-base \
    --do_train \
    --do_eval \
    --do_lower_case \
    --data_dir=$SQUAD_DIR \
    --train_file $SQUAD_DIR/train-v1.1.json \
    --predict_file $SQUAD_DIR/dev-v1.1.json \
    --per_gpu_train_batch_size 16 \
    --per_gpu_eval_batch_size 32 \
    --learning_rate 3e-5 \
    --weight_decay=0.01 \
    --warmup_steps=3327 \
    --num_train_epochs 10.0 \
    --max_seq_length 384 \
    --doc_stride 128 \
    --logging_steps=278 \
    --save_steps=50000 \
    --patience=5 \
    --objective_type=maximize \
    --metric_name=f1 \
    --overwrite_output_dir \
    --evaluate_during_training \
    --output_dir=/path/to/save/weights/ \
    --logging_dir=/path/to/save/logs/ \
    --seed=42 
    

    Similar to pre-training, you can monitor the fine-tuning status via TensorBoard.
    The --help option will display a help message.

Citation

@inproceedings{yamaguchi-etal-2021-frustratingly,
    title = "Frustratingly Simple Pretraining Alternatives to Masked Language Modeling",
    author = "Yamaguchi, Atsuki  and
      Chrysostomou, George  and
      Margatina, Katerina  and
      Aletras, Nikolaos",
    booktitle = "Proceedings of the 2021 Conference on Empirical
Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2021",
    publisher = "Association for Computational Linguistics",
}

License

MIT License

Owner
Atsuki Yamaguchi
NLP researcher