OntoProtein: Protein Pretraining With Ontology Embedding

This is the implementation of the paper "OntoProtein: Protein Pretraining With Ontology Embedding". OntoProtein is an effective method that injects the structure of GO (Gene Ontology) into a text-enhanced protein pre-training model.

Quick links

  • Overview
  • Requirements
  • Data preparation
  • Protein pre-training model
  • Usage for protein-related tasks

Overview

In this work we present OntoProtein, a knowledge-enhanced protein language model that jointly optimizes the knowledge embedding (KE) and masked language modeling (MLM) objectives, which brings significant improvements to a wide range of protein tasks. We also introduce ProteinKG25, a new large-scale KG dataset, to promote research on protein language pre-training.
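
Conceptually, the joint objective is a weighted sum of the MLM loss and a KE loss over knowledge-graph triplets. The sketch below is illustrative only, with a TransE-style margin loss standing in for the paper's KE objective; joint_loss, margin, and ke_weight are names introduced here, not the repository's API.

import torch

def joint_loss(mlm_loss, head, relation, tail, neg_tail, margin=1.0, ke_weight=1.0):
    # head/relation/tail/neg_tail: (batch, dim) embeddings of a true
    # triplet (h, r, t) and a corrupted tail t'.
    pos_score = torch.norm(head + relation - tail, p=2, dim=-1)
    neg_score = torch.norm(head + relation - neg_tail, p=2, dim=-1)
    ke_loss = torch.relu(margin + pos_score - neg_score).mean()
    # Total objective: masked-language-model loss plus weighted KE loss.
    return mlm_loss + ke_weight * ke_loss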

Requirements

To run our code, please install the dependency packages for the relevant steps below.

Environment for pre-training data generation

python3.8 / biopython 1.37 / goatools

Environment for OntoProtein pre-training

python3.8 / pytorch 1.9 / transformers 4.5.1+ / deepspeed 0.5.1 / lmdb

Environment for protein-related tasks

python3.8 / pytorch 1.9 / transformers 4.5.1+ / lmdb

Note: for the environment configuration of some baseline models and methods in our experiments, e.g. BLAST and DeepGraphGO, we provide the related links below:

BLAST / Interproscan / DeepGraphGO / GNN-PPI

Data preparation

We describe below how to obtain the data used for pre-training OntoProtein, fine-tuning on protein-related tasks, and inference.

Pre-training data

To incorporate Gene Ontology knowledge into language models and train OntoProtein, we construct ProteinKG25, a large-scale KG dataset whose GO terms and protein entities are aligned with textual descriptions and protein sequences, respectively. There are two ways to acquire the pre-training data: 1) download our prepared ProteinKG25, or 2) generate your own pre-training data.

Download released data

We have released our prepared data, ProteinKG25, on Google Drive.

The compressed package includes the following files:

  • go_def.txt: GO term definitions (text data). We concatenate each GO term name and its definition with a colon.
  • go_type.txt: The ontology type to which each GO term belongs. The index corresponds to the GO ID in the go2id.txt file.
  • go2id.txt: The ID mapping of GO terms.
  • go_go_triplet.txt: GO-GO triplet data. These triplets constitute the interior structure of Gene Ontology. The data format is <h r t>, where h and t are the head and tail entities, both GO term nodes, and r is the relation between the two GO terms, e.g. is_a or part_of.
  • protein_seq.txt: Protein sequence data. The protein sequences are used as inputs to the MLM module and as protein representations in the KE module.
  • protein2id.txt: The ID mapping of proteins.
  • protein_go_train_triplet.txt: Protein-GO triplet data. These triplets constitute the exterior structure of Gene Ontology, i.e. gene annotations. The data format is <h r t>, where h and t are the head and tail entities. Unlike a GO-GO triplet, a Protein-GO triplet represents a specific gene annotation: the head entity is a specific protein and the tail entity is the corresponding GO term, e.g. a protein binding function. r is the relation between the protein and the GO term.
  • relation2id.txt: The ID mapping of relations. Relations from both triplet files share a single ID space (a minimal loading sketch follows this list).
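
The sketch below loads the mappings and triplets in Python. It assumes whitespace-separated name/ID pairs in the *2id.txt files and integer-ID <h r t> lines in the triplet files; check the released files for the exact layout.

def load_mapping(path):
    # Each line is assumed to hold a name and its integer ID.
    mapping = {}
    with open(path) as f:
        for line in f:
            name, idx = line.split()
            mapping[name] = int(idx)
    return mapping

go2id = load_mapping("go2id.txt")
protein2id = load_mapping("protein2id.txt")
relation2id = load_mapping("relation2id.txt")

# Triplets are whitespace-separated <h r t> lines of integer IDs.
with open("protein_go_train_triplet.txt") as f:
    triplets = [tuple(map(int, line.split())) for line in f]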

Generate your own pre-training data

To generate your own pre-training data, you need to download the following raw data:

  • go.obo: the structure data of Gene Ontology. See the Gene Ontology website for the download link and detailed format.
  • uniprot_sprot.dat: the protein Swiss-Prot database. [link]
  • goa_uniprot_all.gpa: gene annotation data. [link]

After downloading these raw data, you can execute the following script to generate the pre-training data:

python tools/gen_onto_protein_data.py
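
As a quick sanity check that go.obo downloaded and parses correctly, you can load it with goatools (part of the data-generation environment); GO:0008150 is the biological_process root:

from goatools.obo_parser import GODag

# Parse the ontology and inspect one term as a sanity check.
godag = GODag("go.obo")
term = godag["GO:0008150"]
print(term.name, term.namespace, len(term.children))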

Downstream task data

Our experiments involve several protein-related downstream tasks. [Download datasets]

Protein pre-training model

You can pre-train your own OntoProtein on the pre-training dataset above. We provide the script bash script/run_pretrain.sh to run pre-training. The detailed arguments are all listed in src/training_args.py; you can set the pre-training hyperparameters to your needs.
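
After pre-training, the checkpoint can be loaded like any BERT-style model with transformers. A minimal sketch, assuming a ProtBert-style checkpoint whose tokenizer expects space-separated amino acids; the path and sequence here are illustrative:

from transformers import BertModel, BertTokenizer

model_path = "./model/pretrained/OntoProtein"  # illustrative path
tokenizer = BertTokenizer.from_pretrained(model_path, do_lower_case=False)
model = BertModel.from_pretrained(model_path)

# Encode a toy protein sequence and inspect the hidden states.
inputs = tokenizer("M K T A Y I A K Q R", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence length + 2, hidden size)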

Usage for protein-related tasks

Running examples

The shell files for training and evaluation on each task are provided in script/ and can be run directly.

Alternatively, you can use the training code run_downstream.py and write your own shell files as needed:

  • run_downstream.py: supports the {ss3, ss8, contact, remote_homology, fluorescence, stability} tasks.

Training models

Run the shell files with bash script/run_{task}.sh; the contents of the shell files are as follows:

sh run_main.sh \
    --model ./model/ss3/ProtBertModel \
    --output_file ss3-ProtBert \
    --task_name ss3 \
    --do_train True \
    --epoch 5 \
    --optimizer AdamW \
    --per_device_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --eval_step 100 \
    --eval_batchsize 4 \
    --warmup_ratio 0.08 \
    --frozen_bert False

You can set more detailed parameters in run_main.sh. With the settings above, the effective training batch size is per_device_batch_size × gradient_accumulation_steps = 2 × 8 = 16. The details of run_main.sh are as follows:

LR=3e-5
SEED=3
DATA_DIR=data/datasets
OUTPUT_DIR=data/output_data/$TASK_NAME-$SEED-$OI

python run_downstream.py \
  --task_name $TASK_NAME \
  --data_dir $DATA_DIR \
  --do_train $DO_TRAIN \
  --do_predict True \
  --model_name_or_path $MODEL \
  --per_device_train_batch_size $BS \
  --per_device_eval_batch_size $EB \
  --gradient_accumulation_steps $GS \
  --learning_rate $LR \
  --num_train_epochs $EPOCHS \
  --warmup_ratio $WR \
  --logging_steps $ES \
  --eval_steps $ES \
  --output_dir $OUTPUT_DIR \
  --seed $SEED \
  --optimizer $OPTIMIZER \
  --frozen_bert $FROZEN_BERT \
  --mean_output $MEAN_OUTPUT \

Notice: the best checkpoint is saved in OUTPUT_DIR/.
