ICLR 2022 | OntoProtein: Protein Pretraining Integrated with Gene Ontology Knowledge
2022-07-04 01:46:00 【Zhiyuan community】
Paper title: OntoProtein: Protein Pretraining With Gene Ontology Embedding
Authors: Zhang Ningyu (Zhejiang University), Bi Zhen (Zhejiang University), Liang Xiaozhuan (Zhejiang University), Cheng Siyuan (Zhejiang University), Hong Haosen (Zhejiang University), Deng Shumin (Zhejiang University), Lian Jiachang (Zhejiang University), Zhang Qiang (Zhejiang University), Chen Huajun (Zhejiang University)
Venue: ICLR 2022
Paper link: https://arxiv.org/pdf/2201.11147.pdf
Code link: https://github.com/zjunlp/OntoProtein
Reposting is welcome; please credit the source.

One. Introduction
Two. Protein Pretraining
Proteins are the basic macromolecules that govern organisms and life itself, and studying them helps us understand human health and develop therapies for disease. Proteins have primary, secondary, and tertiary structure, and the primary structure (the amino acid sequence) shares sequential characteristics with language. Inspired by pretrained models in natural language processing, many protein pretraining models and tools have been proposed, including MSA Transformer [1], ProtTrans [2], Wudao Wensu [3], and Baidu PaddleHelix. Large-scale unsupervised protein pretraining can even pick up a certain amount of protein structure and function from the training corpus. However, proteins differ fundamentally from natural language text: they carry a large amount of biology-specific knowledge that is hard to learn directly from the pretraining objective, and the representations of low-frequency, long-tail proteins suffer from the skewed data distribution. To address these problems, we draw on the vast body of knowledge about protein structure and function accumulated by scientists and propose, for the first time, a protein pretraining method based on a knowledge graph. The construction of the knowledge graph is introduced first.
Three. Gene Knowledge Graph
We built ProteinKG25, a knowledge graph for pretraining, by taking the public Gene Ontology knowledge graph (GO for short) and aligning it with protein sequences from the Swiss-Prot database. ProteinKG25 contains 4,990,097 triples, of which 4,879,951 are protein-GO triples and 110,146 are GO-GO triples, and it has been fully released for community use. As shown in the figure below, following the idea that "structure determines function", explicitly telling the model during pretraining which structures carry which functions should improve tasks such as protein function prediction and protein-protein interaction prediction.
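To make the two kinds of triple concrete, here is a minimal sketch of how such a knowledge graph could be represented and split by triple type. The `Triple` class, the `triple_kind` helper, the entity identifiers, and the relation names are all illustrative assumptions, not the actual ProteinKG25 schema or data.

```python
from collections import Counter
from typing import NamedTuple


class Triple(NamedTuple):
    """One (head, relation, tail) fact in the knowledge graph."""
    head: str
    relation: str
    tail: str


def triple_kind(triple: Triple) -> str:
    """Classify a triple by its head entity: a GO term or a protein."""
    return "GO-GO" if triple.head.startswith("GO:") else "protein-GO"


# Toy data with made-up identifiers (not real ProteinKG25 entries).
toy_triples = [
    Triple("protein_A", "enables", "GO:term_1"),
    Triple("protein_A", "involved_in", "GO:term_2"),
    Triple("protein_B", "located_in", "GO:term_3"),
    Triple("GO:term_1", "is_a", "GO:term_4"),
]

# Count how many triples of each kind the toy graph contains.
kind_counts = Counter(triple_kind(t) for t in toy_triples)
print(kind_counts)

# The sizes reported for ProteinKG25 add up as stated in the text.
assert 4_879_951 + 110_146 == 4_990_097
```

Keeping the triple type derivable from the head entity is only one possible design; a real loader might instead store the two triple sets in separate files.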
Four. Protein Pretraining Integrated with the Gene Knowledge Graph: OntoProtein
Based on the constructed knowledge graph, we designed a dedicated protein pretraining model, OntoProtein. The pretraining input contains two different kinds of sequence: protein sequences, and text describing protein functions, biological processes, and so on. We therefore use two different encoders: protein sequences are encoded with the existing protein pretraining model ProtBert, and text sequences are encoded with BERT. To pretrain well while fusing in the triple knowledge, we adopt two optimization objectives. The first is the conventional masked language modeling objective: we randomly mask a token in the sequence and predict it. The second is a triple-based knowledge enhancement objective: we inject biological triple knowledge through embedding learning similar to knowledge graph embedding, as shown in the following formula:

![]()
Note that the factual knowledge here comes in two different kinds of triples, GO-GO and protein-GO. We therefore propose a knowledge-enhanced negative sampling method to obtain more representative negative samples and improve pretraining. The sampling method is as follows:
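One way such a sampler could work is sketched below: the tail of a true triple is replaced with another GO term from the same GO aspect, and corrupted triples that are already known facts are skipped. Restricting candidates to the same aspect, the `aspect_of` mapping, and every identifier here are assumptions for illustration, not the authors' published procedure.

```python
import random


def sample_negative_tails(triple, known_triples, aspect_of, num_negatives, rng):
    """Corrupt the tail of a triple with GO terms from the same aspect.

    triple:        (head, relation, tail) true fact.
    known_triples: set of true facts that must not be returned as negatives.
    aspect_of:     hypothetical mapping from GO term to its aspect.
    num_negatives: maximum number of corrupted triples to return.
    rng:           random.Random instance, passed in for reproducibility.
    """
    head, relation, tail = triple
    target_aspect = aspect_of[tail]
    # Same-aspect terms give harder negatives than arbitrary terms.
    candidates = sorted(
        term
        for term, aspect in aspect_of.items()
        if aspect == target_aspect and (head, relation, term) not in known_triples
    )
    count = min(num_negatives, len(candidates))
    return [(head, relation, term) for term in rng.sample(candidates, count)]


# Toy ontology with made-up term identifiers.
aspect_of = {
    "GO:mf_1": "molecular_function",
    "GO:mf_2": "molecular_function",
    "GO:mf_3": "molecular_function",
    "GO:bp_1": "biological_process",
}
known_triples = {("protein_A", "enables", "GO:mf_1")}

negatives = sample_negative_tails(
    ("protein_A", "enables", "GO:mf_1"), known_triples, aspect_of, 2, random.Random(0)
)
```

Here the only valid corruptions are the two other molecular-function terms; the biological-process term is never sampled, and the true tail is excluded because the resulting triple is a known fact.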

Five. Experimental Analysis


Six. Summary and Outlook
The booming field of AI for Science is driving a deep integration of the data-driven Kepler paradigm and the first-principles-driven Newton paradigm. Following the idea of "dual drive by data and knowledge", this paper proposes OntoProtein, the first protein pretraining method based on a knowledge graph, and verifies the model on several downstream tasks. In the future, we will maintain OntoProtein for more researchers to use, and we plan to explore knowledge-graph-enhanced pretraining that incorporates homologous sequence alignment to achieve better performance.