SapBERT: Self-alignment pretraining for BERT
This repository holds the code for the SapBERT model presented in our NAACL 2021 paper, Self-Alignment Pretraining for Biomedical Entity Representations [arXiv], and our ACL 2021 paper, Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking [PDF].
Huggingface Models
[SapBERT]
Standard SapBERT as described in [Liu et al., NAACL 2021]. Trained with UMLS 2020AA (English only), using microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext as the base model. Use the [CLS] token (before the pooler) as the representation of the input; see the embedding sketch below the model list.
[SapBERT-XLMR]
Cross-lingual SapBERT as described in [Liu et al., ACL 2021]. Trained with UMLS 2020AB (all languages), using xlm-roberta-base as the base model. Use the [CLS] token (before the pooler) as the representation of the input.
[SapBERT-mean-token]
Same as the standard SapBERT but trained with mean-pooling instead of [CLS] representations.
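For illustration, below is a minimal sketch of loading a SapBERT checkpoint with the transformers library and extracting either the [CLS] (pre-pooler) embedding or a mean-pooled embedding. The Hub model ID is an assumption; substitute the checkpoint you intend to use.

```python
# Minimal sketch (not the repo's code): encode a few entity names with a
# SapBERT checkpoint from the Hugging Face Hub.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

names = ["covid-19", "coronavirus infection", "high fever"]
inputs = tokenizer(names, padding=True, truncation=True,
                   max_length=25, return_tensors="pt")

with torch.no_grad():
    output = model(**inputs)

# [CLS] (before pooler): first token of the last hidden layer.
cls_embeddings = output.last_hidden_state[:, 0, :]            # shape (3, 768)

# Mean-token pooling (for the mean-token variant): average over real tokens.
mask = inputs["attention_mask"].unsqueeze(-1)                 # (3, seq_len, 1)
mean_embeddings = (output.last_hidden_state * mask).sum(1) / mask.sum(1)
```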
Environment
The code is tested with Python 3.8, torch 1.7.0 and Hugging Face transformers 4.4.2. Please see requirements.txt for more details.
Train SapBERT
Prepare the training data as instructed in data/generate_pretraining_data.ipynb.
Run:
cd umls_pretraining
./pretrain.sh 0,1
where 0,1 specifies the GPU devices.
Evaluate SapBERT
Please see evaluation/README.md for details.
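As a rough, hedged illustration of what the evaluation does conceptually (dense nearest-neighbour linking of mentions against a dictionary of concept names), not the actual benchmark script, consider the sketch below. The dictionary entries, CUIs, and the encode helper are hypothetical.

```python
# Hedged sketch of dense nearest-neighbour entity linking with SapBERT
# embeddings. Names and the encode() helper are illustrative only.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

model_name = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def encode(names, batch_size=128):
    """Return [CLS] (pre-pooler) embeddings for a list of strings."""
    all_embs = []
    for i in range(0, len(names), batch_size):
        batch = tokenizer(names[i:i + batch_size], padding=True,
                          truncation=True, max_length=25, return_tensors="pt")
        with torch.no_grad():
            out = model(**batch)
        all_embs.append(out.last_hidden_state[:, 0, :])
    return torch.cat(all_embs, dim=0)

# Hypothetical dictionary of (CUI, concept name) pairs and a query mention.
dictionary = [("C0009450", "coronavirus infection"), ("C0015967", "fever")]
dict_embs = encode([name for _, name in dictionary])
query_emb = encode(["covid-19"])

# Rank dictionary entries by cosine similarity to the mention embedding.
scores = F.cosine_similarity(query_emb, dict_embs)
best = scores.argmax().item()
print(dictionary[best][0])  # predicted CUI for the mention
```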
Citations
@article{liu2021self,
  title={Self-Alignment Pretraining for Biomedical Entity Representations},
  author={Liu, Fangyu and Shareghi, Ehsan and Meng, Zaiqiao and Basaldella, Marco and Collier, Nigel},
  journal={arXiv preprint arXiv:2010.11784},
  year={2020}
}
Acknowledgement
Parts of the code are modified from BioSyn. We thank the authors of BioSyn for open-sourcing their code.
License
SapBERT is MIT licensed. See the LICENSE file for details.