This PyTorch package implements MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation (NAACL 2022).

Last update: Dec 24, 2022

MoEBERT

Installation

Create and activate the conda environment.

conda env create -f environment.yml

Install Transformers locally.

pip install -e .

Note: The code is adapted from this codebase. Arguments related to LoRA and adapters can be safely ignored.

Instructions

MoEBERT targets task-specific distillation. Before running any distillation code, fine-tune a pre-trained BERT model on the target task, then pass the path to the fine-tuned model via --model_name_or_path.

Importance Score Computation

To compute the importance scores, use bert_base_mnli_example.sh with the --preprocess_importance argument added and the --do_train argument removed.
If multiple GPUs are used to compute the importance scores, an importance_[rank].pkl file is saved for each GPU. Use merge_importance.py to merge these files.
To use the pre-computed importance scores, pass the file name to --moebert_load_importance.
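The merge step can be pictured with a minimal sketch. This is not the code in merge_importance.py. It assumes, purely for illustration, that each importance_[rank].pkl file unpickles to a flat sequence of per-neuron scores accumulated over that GPU's share of the data, so that shards combine by element-wise addition. The function name merge_importance_files is hypothetical, and the real file layout may differ.

```python
import pickle


def merge_importance_files(paths):
    """Sum per-rank importance scores element-wise.

    Assumption (not confirmed by the repository): every file unpickles to a
    flat sequence of floats, one score per neuron, in the same order.
    """
    if not paths:
        raise ValueError("no importance files given")
    merged = None
    for path in paths:
        with open(path, "rb") as handle:
            scores = list(pickle.load(handle))
        if merged is None:
            # Start the running total with the size of the first shard.
            merged = [0.0] * len(scores)
        if len(scores) != len(merged):
            raise ValueError(f"size mismatch in {path}")
        merged = [total + score for total, score in zip(merged, scores)]
    return merged
```

Under these assumptions, calling merge_importance_files with the list of per-rank files returns a single list that could then be pickled and passed to --moebert_load_importance.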

Knowledge Distillation

For GLUE tasks, see examples/text-classification/run_glue.py.
For question answering tasks, see examples/question-answering/run_qa.py.
Run bash bert_base_mnli_example.sh as an example.
The codebase supports four routing strategies: gate-token, gate-sentence, hash-random and hash-balance. Pass the chosen strategy to --moebert_route_method.
- To use hash-balance, pre-compute a balanced hash list with hash_balance.py, then pass the path to the saved hash list to --moebert_route_hash_list.
- When using a trainable gating mechanism, add a load balancing loss by setting --moebert_load_balance.
- The sentence-based gating mechanism (gate-sentence) is advantageous for inference because it induces significantly less communication overhead than token-level routing methods.
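To make the idea of a balanced hash list concrete, here is a minimal sketch of one way to build a fixed token-to-expert mapping with roughly equal load per expert. It is an illustration under stated assumptions, not the algorithm in hash_balance.py: it assumes balance is measured by token frequency in a corpus, and it uses a simple greedy heuristic (most frequent token first, always to the least-loaded expert). The function name build_balanced_hash_list and its arguments are hypothetical.

```python
import heapq
from collections import Counter


def build_balanced_hash_list(token_ids, num_experts):
    """Greedily map each distinct token id to an expert.

    Assumption: "balanced" means every expert receives a similar total
    number of token occurrences. The repository's script may use a
    different criterion.
    """
    counts = Counter(token_ids)
    # Min-heap of (current load, expert index): the root is the least-loaded expert.
    heap = [(0, expert) for expert in range(num_experts)]
    heapq.heapify(heap)
    assignment = {}
    # Place frequent tokens first; ties are broken by token id for determinism.
    for token, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        load, expert = heapq.heappop(heap)
        assignment[token] = expert
        heapq.heappush(heap, (load + freq, expert))
    return assignment


# Toy corpus: token 1 appears 4 times, token 2 three times, and so on.
corpus = [1] * 4 + [2] * 3 + [3] * 2 + [4] * 1
mapping = build_balanced_hash_list(corpus, num_experts=2)
print(mapping)  # {1: 0, 2: 1, 3: 1, 4: 0}, i.e. 5 occurrences per expert
```

Because the mapping is fixed before training, routing by lookup needs no gate parameters, which is why the load balancing loss applies only to the trainable gates.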

Owner

Simiao Zuo
