ICLR 2022 | OntoProtein: protein pre-training integrated with Gene Ontology knowledge

2022-07-04 01:46:00 【Zhiyuan community】

Paper title: OntoProtein: Protein Pretraining With Gene Ontology Embedding

Authors: Zhang Ningyu (Zhejiang University), Bi Zhen (Zhejiang University), Liang Xiaozhuan (Zhejiang University), Cheng Siyuan (Zhejiang University), Hong Haosen (Zhejiang University), Deng Shumin (Zhejiang University), Lian Jiachang (Zhejiang University), Zhang Qiang (Zhejiang University), Chen Huajun (Zhejiang University)

Conference: ICLR 2022

Paper link: https://arxiv.org/pdf/2201.11147.pdf

Code link: https://github.com/zjunlp/OntoProtein

Reprints are welcome; please credit the source.

1. Introduction

In recent years, pre-trained models have delivered strong results and have topped leaderboards and benchmark datasets across AI, most visibly in natural language processing. Like natural language, the primary structure of a protein is a sequence, which makes it natural to apply language-model pre-training to protein representation learning. However, proteins differ fundamentally from natural language text: they embody a great deal of biological knowledge that pre-training objectives alone struggle to capture. Meanwhile, scientists have accumulated a large body of knowledge about protein structure and function. How can this knowledge be used to improve protein pre-training? This article introduces OntoProtein, a new ICLR 2022 work that proposes a protein pre-training method integrated with a knowledge graph.

2. Protein pre-training

Proteins are the fundamental macromolecules that govern organisms and life itself, and studying them helps us understand human health and develop disease therapies. Proteins have primary, secondary, and tertiary structure, and the primary structure is a sequence, much like language. Inspired by pre-trained models in natural language processing, many protein pre-training models and tools have been proposed, including MSA Transformer [1], ProtTrans [2], Wudao Wensu [3], and Baidu PaddleHelix. Large-scale unsupervised protein pre-training can even capture some information about protein structure and function from the training corpus. However, proteins differ fundamentally from natural language text: they embody a great deal of biology-specific knowledge that is hard to learn directly from the pre-training objective, and the skewed data distribution degrades the representations of low-frequency, long-tail proteins. To address these problems, we draw on the large body of knowledge about protein structure and function accumulated by scientists and propose, for the first time, a protein pre-training method based on a knowledge graph. The construction of the knowledge graph is described first.

3. Gene knowledge graph

We build ProteinKG25, a knowledge graph for pre-training, by taking the public Gene Ontology (GO) knowledge graph and aligning it with protein sequences from the Swiss-Prot database. ProteinKG25 contains 4,990,097 triples, of which 4,879,951 are protein-GO triples and 110,146 are GO-GO triples, and it has been fully released for community use. As shown in the figure below, following the idea that "structure determines function", explicitly telling the model during protein pre-training which kind of structure has which kind of function should help tasks such as protein function prediction and protein-protein interaction prediction.
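To make the two triple types concrete, here is a minimal sketch of how such a knowledge graph could be held in memory and split by type. The `Triple` class, the `triple_kind` helper, the rule of classifying by a `GO:` prefix on the head, and the toy IDs and relation names are all illustrative assumptions, not the actual ProteinKG25 file format.

```python
from collections import Counter
from typing import NamedTuple


class Triple(NamedTuple):
    head: str      # protein accession or GO term ID (hypothetical layout)
    relation: str  # relation name
    tail: str      # GO term ID


def triple_kind(triple: Triple) -> str:
    """Classify a triple as 'GO-GO' or 'protein-GO' by its head entity."""
    return "GO-GO" if triple.head.startswith("GO:") else "protein-GO"


# Toy triples; IDs and relation names are placeholders for illustration.
toy_kg = [
    Triple("P12345", "enables", "GO:0003824"),
    Triple("P12345", "involved_in", "GO:0008152"),
    Triple("GO:0003824", "is_a", "GO:0003674"),
]

counts = Counter(triple_kind(t) for t in toy_kg)
print(counts)

# The sizes reported for ProteinKG25 add up as stated in the text.
assert 4_879_951 + 110_146 == 4_990_097
```

Storing triples as plain `(head, relation, tail)` tuples keeps them hashable, which makes membership checks cheap when filtering negatives later.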

4. Protein pre-training integrated with the gene knowledge graph: OntoProtein

Based on the constructed knowledge graph, we designed a dedicated protein pre-training model, OntoProtein. The pre-training input contains two different kinds of sequence: protein sequences, and text describing protein function, biological processes, and so on. We therefore use two different encoders: protein sequences are encoded with the existing protein pre-trained model ProtBert, and text sequences are encoded with BERT. To pre-train well while fusing in the triple knowledge, we adopt two optimization objectives. The first is the conventional masked language model objective: we randomly mask a token in the sequence and predict it. The second is a triple-based knowledge enhancement objective: we inject biological triple knowledge through embedding learning similar to knowledge graph embedding, as shown in the following formula:
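The two objectives can be illustrated with a small, dependency-free sketch. It rests on assumptions the article does not confirm: masking is simplified to replacing tokens with `[MASK]` at a fixed probability, and the knowledge objective is written as a TransE-style distance with a log-sigmoid margin loss, which is one common choice for knowledge graph embedding. All function names and the `margin` parameter are hypothetical.

```python
import math
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Randomly replace tokens with [MASK]; return masked tokens and labels.

    labels[i] holds the original token where masking happened, else None.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


def transe_distance(head, relation, tail):
    """L2 distance ||h + r - t||; small when the triple is plausible."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def knowledge_loss(pos, negatives, margin=1.0):
    """Assumed margin loss: pull the true triple close, push negatives away."""
    loss = -log_sigmoid(margin - transe_distance(*pos))
    for neg in negatives:
        loss -= log_sigmoid(transe_distance(*neg) - margin) / len(negatives)
    return loss


# Masked language modelling on a toy amino-acid sequence.
protein = list("MKTAYIAKQR")
masked, labels = mask_tokens(protein, mask_prob=0.3, rng=random.Random(0))

# Knowledge objective on toy 2-d embeddings: h + r equals t exactly.
h, r, t = [0.0, 0.0], [1.0, 0.0], [1.0, 0.0]
bad_t = [5.0, 5.0]
loss_good = knowledge_loss((h, r, t), [(h, r, bad_t)])
loss_bad = knowledge_loss((h, r, bad_t), [(h, r, t)])
print(loss_good < loss_bad)
```

In a real model the embeddings would come from the protein and text encoders, and the masked-language-model loss and the knowledge loss would be summed, possibly with a weighting factor, before back-propagation.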

Note that the factual knowledge here comes in two different kinds of triples, GO-GO and protein-GO. We therefore propose a knowledge-enhanced negative sampling method to obtain more representative negative samples and improve pre-training. The sampling method is as follows:
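One way such knowledge-aware sampling could look is sketched below. It assumes that negatives are produced by replacing the tail GO term with another term from the same GO aspect, and that any candidate already present in the knowledge graph is rejected. The article does not spell out these rules, so treat them, along with the function name and the toy IDs, as assumptions.

```python
import random


def sample_negative_tails(triple, known_triples, candidates_by_aspect,
                          aspect_of, num_samples=2, rng=None):
    """Corrupt the tail of `triple` with GO terms from the same aspect.

    Candidates that would reproduce a known triple are skipped, so every
    returned triple is a true negative with respect to `known_triples`.
    """
    rng = rng or random.Random()
    head, relation, tail = triple
    pool = [
        go for go in candidates_by_aspect[aspect_of[tail]]
        if go != tail and (head, relation, go) not in known_triples
    ]
    chosen = rng.sample(pool, min(num_samples, len(pool)))
    return [(head, relation, go) for go in chosen]


# Toy data: made-up GO IDs grouped into two hypothetical aspects.
aspect_of = {"GO:A1": "MF", "GO:A2": "MF", "GO:A3": "MF", "GO:B1": "BP"}
candidates_by_aspect = {"MF": ["GO:A1", "GO:A2", "GO:A3"], "BP": ["GO:B1"]}
known = {("P1", "enables", "GO:A1"), ("P1", "enables", "GO:A2")}

negatives = sample_negative_tails(
    ("P1", "enables", "GO:A1"), known, candidates_by_aspect, aspect_of,
    rng=random.Random(0),
)
print(negatives)
```

Restricting candidates to the same aspect gives harder negatives than uniform sampling, because the corrupted tail is still a term of the same kind as the true one.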

5. Experimental analysis

We ran experiments on the protein benchmark TAPE, as well as on protein-protein interaction and protein function prediction (for the latter, we built a new protein function prediction dataset modeled on the CAFA competition). As shown in the following table, the protein pre-training method fused with the knowledge graph achieves better or comparable performance to some extent. In particular, our method does not use homologous sequence alignment (MSA), so it is difficult for it to surpass MSA Transformer. Please refer to the paper for detailed experimental results. We will organize the pre-trained models and release them on Hugging Face for community use.

6. Summary and outlook

The currently booming field of AI for Science is driving a deep integration of the data-driven Kepler paradigm and the first-principles-driven Newton paradigm. Following the idea that data and knowledge should act as "two driving wheels", this paper proposes OntoProtein, the first protein pre-training method based on a knowledge graph, and verifies the model on several downstream tasks. Going forward, we will maintain OntoProtein so that more researchers can use it, and we plan to explore knowledge-graph-enhanced pre-training that incorporates homologous sequence alignment to achieve better performance.

[1] MSA Transformer. ICML 2021.

[2] ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Learning. TPAMI 2021.

[3] Modeling Protein Using Large-scale Pretrain Language Model. 2021.
