Pre-training BERT Masked Language Models (MLM)

This repository contains the code to pre-train a BERT model with a custom vocabulary. It was used to pre-train JuriBERT, presented in [https://arxiv.org/abs/2110.01485].

It also contains the code for the classification task that was used to evaluate JuriBERT.

Our models are listed at [http://master2-bigdata.polytechnique.fr/FrenchLinguisticResources/resources#juribert] and can be downloaded upon request.

Instructions

To pre-train a new BERT model, you need the path to a dataset containing raw text. You can optionally specify an existing tokenizer for the model. Paths for saving the model and the checkpoints are required.

python pretrain.py \
  --files /path/to/text \
  --model_path /path/to/save/model \
  --checkpoint /path/to/save/checkpoints \
  --epochs 30 \
  --hidden_layers 2 \
  --hidden_size 128 \
  --attention_heads 2 \
  --save_steps 10 \
  --save_limit 0 \
  --min_freq 0
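
For readers new to the objective being trained, here is a minimal sketch of the masking scheme commonly used for BERT-style MLM pre-training: about 15% of positions are selected for prediction, and of those 80% are replaced by the mask token, 10% by a random token, and 10% are left unchanged. This is an illustration in pure Python, not the code in pretrain.py. The function name `mask_tokens` and the toy vocabulary are hypothetical, and the script's actual masking behavior may differ.

```python
import random

MASK_TOKEN = "<mask>"  # mask token as written in this README's fill-mask example


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) following the usual BERT MLM recipe.

    labels[i] holds the original token at positions selected for prediction
    and None everywhere else.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_TOKEN         # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels
```

During training, the loss would be computed only at positions where the label is not None.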

To fine-tune on a classification task, you need the path to the pre-trained model and a CSV file containing the classification dataset. You need to specify the columns containing the category and the text, as well as the paths for saving the final model and the checkpoints.

python classification.py \
  --model "custom" \
  --pretrained_path /path/to/model.bin \
  --tokenizer_path /path/to/tokenizer.json \
  --data /path/to/data.csv \
  --category "category-column" \
  --text "text-column" \
  --model_path /path/to/save/model \
  --checkpoint /path/to/save/checkpoints
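
As a rough illustration of the expected input, the sketch below reads a CSV file with a header row, pulls out the text and category columns, and maps each category name to an integer label. It uses only the standard library. The helper name `load_classification_rows` is hypothetical, and this is not the loading code used by classification.py, which may handle the data differently.

```python
import csv


def load_classification_rows(csv_file, category_col, text_col):
    """Read texts and integer labels from an open CSV file with a header row."""
    reader = csv.DictReader(csv_file)
    texts, categories = [], []
    for row in reader:
        texts.append(row[text_col])
        categories.append(row[category_col])
    # Assign ids in sorted order so the mapping is reproducible across runs.
    label2id = {name: i for i, name in enumerate(sorted(set(categories)))}
    labels = [label2id[c] for c in categories]
    return texts, labels, label2id
```

With a real file, you would open it with open("/path/to/data.csv", newline="", encoding="utf-8") and pass the file object together with the same column names given to --category and --text.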

You can use --help with either script to see all the available options.

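To test the masked language model, use:

from transformers import pipeline

# `tokenizer` must already be loaded: it is the tokenizer the model was pre-trained with.
fill_mask = pipeline(
    "fill-mask",
    model="/path/to/model",
    tokenizer=tokenizer
)

fill_mask("Paris est la capitale de la <mask>.")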
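
For an input with a single mask, the fill-mask pipeline returns a list of dictionaries with the keys `score`, `token`, `token_str`, and `sequence`. The sketch below shows one way to reduce that list to the top candidates. The helper `top_predictions` is hypothetical, and `example_results` is a hand-written stand-in with made-up scores, not output from JuriBERT.

```python
def top_predictions(results, k=3):
    """Return the k highest-scoring (token_str, score) pairs from fill-mask output."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"].strip(), r["score"]) for r in ranked[:k]]


# Hand-written stand-in for pipeline output; the scores are invented for illustration.
example_results = [
    {"score": 0.10, "token": 12, "token_str": " Belgique",
     "sequence": "Paris est la capitale de la Belgique."},
    {"score": 0.85, "token": 7, "token_str": " France",
     "sequence": "Paris est la capitale de la France."},
]

best = top_predictions(example_results, k=1)
```

In practice you would pass the return value of the fill_mask call above instead of `example_results`.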

Owner

Stella Douka
