SapBERT: Self-alignment pretraining for BERT
This repository holds the code for the SapBERT model presented in our NAACL 2021 paper, Self-Alignment Pretraining for Biomedical Entity Representations [arXiv], and our ACL 2021 paper, Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking [PDF].
Huggingface Models
[SapBERT]
Standard SapBERT as described in [Liu et al., NAACL 2021]. Trained with UMLS 2020AA (English only), using microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext as the base model. Use the [CLS] token (before the pooler) as the representation of the input; see the embedding sketch below the model list.
[SapBERT-XLMR]
Cross-lingual SapBERT as described in [Liu et al., ACL 2021]. Trained with UMLS 2020AB (all languages), using xlm-roberta-base as the base model. Use the [CLS] token (before the pooler) as the representation of the input.
[SapBERT-mean-token]
Same as the standard SapBERT but trained with mean-pooling instead of [CLS] representations.
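For illustration, below is a minimal sketch of loading a SapBERT checkpoint with the transformers library and extracting either the [CLS] (pre-pooler) embedding or a mean-pooled embedding. The Hub model ID is an assumption; substitute the checkpoint you intend to use.

```python
# Minimal sketch (not the repo's code): encode a few entity names with a
# SapBERT checkpoint from the Hugging Face Hub.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

names = ["covid-19", "coronavirus infection", "high fever"]
inputs = tokenizer(names, padding=True, truncation=True,
                   max_length=25, return_tensors="pt")

with torch.no_grad():
    output = model(**inputs)

# [CLS] (before pooler): first token of the last hidden layer.
cls_embeddings = output.last_hidden_state[:, 0, :]            # shape (3, 768)

# Mean-token pooling (for the mean-token variant): average over real tokens.
mask = inputs["attention_mask"].unsqueeze(-1)                 # (3, seq_len, 1)
mean_embeddings = (output.last_hidden_state * mask).sum(1) / mask.sum(1)
```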
Environment
The code is tested with Python 3.8, torch 1.7.0 and Hugging Face transformers 4.4.2. Please see requirements.txt for more details.
Train SapBERT
Prepare the training data as instructed in data/generate_pretraining_data.ipynb.
Run:
cd umls_pretraining
./pretrain.sh 0,1
where 0,1 specifies the GPU devices.
Evaluate SapBERT
Please see evaluation/README.md for details.
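As a rough, hedged illustration of what the evaluation does conceptually (dense nearest-neighbour linking of mentions against a dictionary of concept names), not the actual benchmark script, consider the sketch below. The dictionary entries, CUIs, and the encode helper are hypothetical.

```python
# Hedged sketch of dense nearest-neighbour entity linking with SapBERT
# embeddings. Names and the encode() helper are illustrative only.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

model_name = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def encode(names, batch_size=128):
    """Return [CLS] (pre-pooler) embeddings for a list of strings."""
    all_embs = []
    for i in range(0, len(names), batch_size):
        batch = tokenizer(names[i:i + batch_size], padding=True,
                          truncation=True, max_length=25, return_tensors="pt")
        with torch.no_grad():
            out = model(**batch)
        all_embs.append(out.last_hidden_state[:, 0, :])
    return torch.cat(all_embs, dim=0)

# Hypothetical dictionary of (CUI, concept name) pairs and a query mention.
dictionary = [("C0009450", "coronavirus infection"), ("C0015967", "fever")]
dict_embs = encode([name for _, name in dictionary])
query_emb = encode(["covid-19"])

# Rank dictionary entries by cosine similarity to the mention embedding.
scores = F.cosine_similarity(query_emb, dict_embs)
best = scores.argmax().item()
print(dictionary[best][0])  # predicted CUI for the mention
```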
Citations
@article{liu2021self,
  title={Self-Alignment Pretraining for Biomedical Entity Representations},
  author={Liu, Fangyu and Shareghi, Ehsan and Meng, Zaiqiao and Basaldella, Marco and Collier, Nigel},
  journal={arXiv preprint arXiv:2010.11784},
  year={2020}
}
Acknowledgement
Parts of the code are modified from BioSyn. We thank the authors of BioSyn for open-sourcing their code.
License
SapBERT is MIT licensed. See the LICENSE file for details.