ICLR 2022 | OntoProtein: Protein Pretraining Integrated with Gene Ontology Knowledge
2022-07-04 01:46:00 【Zhiyuan community】
Paper title: OntoProtein: Protein Pretraining With Gene Ontology Embedding
Authors: Zhang Ningyu (Zhejiang University), Bi Zhen (Zhejiang University), Liang Xiaozhuan (Zhejiang University), Cheng Siyuan (Zhejiang University), Hong Haosen (Zhejiang University), Deng Shumin (Zhejiang University), Lian Jiazhang (Zhejiang University), Zhang Qiang (Zhejiang University), Chen Huajun (Zhejiang University)
Venue: ICLR 2022
Paper link: https://arxiv.org/pdf/2201.11147.pdf
Code link: https://github.com/zjunlp/OntoProtein
Reprints are welcome; please credit the source.

1. Introduction
2. Protein pretraining
Proteins are the fundamental macromolecules that govern organisms and life itself, and studying them helps us understand human health and develop disease therapies. Proteins have primary, secondary, and tertiary structure, and the primary structure (the amino acid sequence) has sequence characteristics similar to those of language. Inspired by pretrained models in natural language processing, many protein pretraining models and tools have been proposed, including MSA Transformer [1], ProtTrans [2], WuDao Wensu [3], and Baidu PaddleHelix. Large-scale unsupervised protein pretraining can even capture a certain amount of information about protein structure and function from the training corpus. However, proteins differ fundamentally from natural language text: they carry a great deal of biology-specific knowledge that is hard to learn directly from the pretraining objective, and the learned representations of low-frequency, long-tail proteins suffer from the skewed data distribution. To address these problems, we draw on the vast body of knowledge about protein structure and function accumulated by scientists and propose, for the first time, a protein pretraining method based on a knowledge graph. The construction of the knowledge graph is described first.
3. Gene knowledge graph
We build ProteinKG25, a knowledge graph for pretraining, by taking the public gene ontology knowledge graph Gene Ontology (GO for short) and aligning it with protein sequences from the Swiss-Prot database. The knowledge graph contains 4,990,097 triples, of which 4,879,951 are protein-GO triples and 110,146 are GO-GO triples, and it has been fully released for community use. As shown in the figure below, following the idea that "structure determines function", explicitly telling the model during protein pretraining which structures carry which functions should improve performance on tasks such as protein function prediction and protein-protein interaction prediction.
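The two kinds of triples can be pictured with a small data structure. The sketch below is only an illustration: the `Triple` type, the `triple_kind` helper, the relation names, and the toy IDs are hypothetical placeholders, not the ProteinKG25 file format or real annotations. It also checks that the two reported counts add up to the reported total.

```python
from collections import Counter
from typing import NamedTuple


class Triple(NamedTuple):
    head: str      # protein accession or GO term ID
    relation: str  # relation name
    tail: str      # GO term ID


def triple_kind(t: Triple) -> str:
    """Classify a triple as 'go-go' or 'protein-go' from its head ID."""
    return "go-go" if t.head.startswith("GO:") else "protein-go"


# Toy triples with placeholder IDs and relation names (assumptions).
triples = [
    Triple("P00001", "enables", "GO:0000001"),
    Triple("P00002", "involved_in", "GO:0000002"),
    Triple("GO:0000001", "is_a", "GO:0000003"),
]
counts = Counter(triple_kind(t) for t in triples)

# Counts reported in the article for ProteinKG25.
REPORTED = {"protein-go": 4_879_951, "go-go": 110_146}
total = sum(REPORTED.values())  # should equal the reported 4,990,097
```

Keeping the triple kind explicit is convenient later, because the two kinds are treated differently during negative sampling.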
4. Protein pretraining integrated with the gene knowledge graph: OntoProtein
Based on the constructed knowledge graph, we designed a dedicated protein pretraining model, OntoProtein. The pretraining input contains two different kinds of sequences: protein sequences, and text that describes protein functions, biological processes, and so on. We therefore use two different encoders: protein sequences are encoded with the existing protein pretraining model ProtBert, and text sequences are encoded with BERT. To pretrain well and fuse in the triple knowledge, we adopt two optimization objectives. The first is the conventional masked language model objective, in which we randomly mask a token in the sequence and predict it. The second is a triple knowledge enhancement objective, in which we inject biological triple knowledge through an embedding learning procedure similar to knowledge graph embedding, as shown in the following formula:

![]()
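A knowledge-embedding objective of this kind can be sketched as follows. This is a minimal illustration that assumes a TransE-style distance and a margin-based loss with negative samples, combined with the masked language model loss through a weight `alpha`. These choices and all names (`distance`, `ke_loss`, `joint_loss`, `gamma`, `alpha`) are assumptions for illustration, not necessarily the exact formulation in the paper.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def distance(h, r, t) -> float:
    """TransE-style distance ||h + r - t||_2 (assumed scoring function)."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def ke_loss(pos, negs, gamma: float = 1.0) -> float:
    """Margin-based knowledge-embedding loss.

    pos  -- one true triple (h, r, t) of embedding vectors
    negs -- list of corrupted triples (h', r, t')
    """
    # True triples should have a distance well below the margin gamma.
    loss = -math.log(sigmoid(gamma - distance(*pos)))
    # Corrupted triples should have a distance above the margin.
    for neg in negs:
        loss -= math.log(sigmoid(distance(*neg) - gamma)) / len(negs)
    return loss


def joint_loss(mlm_loss: float, ke: float, alpha: float = 1.0) -> float:
    """Weighted sum of the two pretraining objectives (assumed form)."""
    return mlm_loss + alpha * ke


# Toy 2-d embeddings: h + r lands exactly on t, and far away from t_far.
h, r, t, t_far = [0.0, 0.0], [1.0, 0.0], [1.0, 0.0], [5.0, 0.0]
good = ke_loss((h, r, t), [(h, r, t_far)])  # embeddings agree with the fact
bad = ke_loss((h, r, t_far), [(h, r, t)])   # embeddings contradict the fact
```

In the toy example the loss is small when the embeddings agree with the true triple and large when they agree with the corrupted one, which is the behavior a knowledge-enhancement objective is meant to encourage.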
Note that the factual knowledge here consists of two different kinds of triples, GO-GO and protein-GO. We therefore propose a knowledge-enhanced negative sampling method that yields more representative negative samples and improves pretraining. The sampling method is as follows:
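One way to read this is sketched below, assuming that negatives are drawn from GO terms in the same ontology aspect as the term being replaced, that protein-GO triples have only their tail replaced, and that GO-GO triples may have either end replaced. The function name `sample_negatives`, the aspect labels, the relation names, and the toy IDs are hypothetical and are not taken from the OntoProtein code.

```python
import random


def sample_negatives(triple, go_aspect, known, n=2, seed=0):
    """Corrupt a triple using GO terms from the same aspect.

    triple    -- (head, relation, tail)
    go_aspect -- dict mapping GO term ID to its aspect label
    known     -- set of true triples that must not be returned as negatives
    """
    rng = random.Random(seed)
    head, rel, tail = triple
    is_go_go = head in go_aspect  # GO-GO triples have a GO term as head
    negatives = []
    for _ in range(100 * n):  # bounded number of attempts
        if len(negatives) == n:
            break
        # Protein-GO: replace the tail only. GO-GO: replace head or tail.
        corrupt_head = is_go_go and rng.random() < 0.5
        target = head if corrupt_head else tail
        pool = sorted(e for e, a in go_aspect.items()
                      if a == go_aspect[target] and e != target)
        if not pool:
            continue
        new = rng.choice(pool)
        cand = (new, rel, tail) if corrupt_head else (head, rel, new)
        # Skip true facts, duplicates, and self-loops.
        if cand not in known and cand not in negatives and cand[0] != cand[2]:
            negatives.append(cand)
    return negatives


# Toy ontology: three terms in one aspect, two in another (placeholders).
go_aspect = {"GO:A1": "MF", "GO:A2": "MF", "GO:A3": "MF",
             "GO:B1": "BP", "GO:B2": "BP"}
known = {("P1", "enables", "GO:A1"), ("GO:A1", "is_a", "GO:A2")}

protein_negs = sample_negatives(("P1", "enables", "GO:A1"), go_aspect, known)
go_negs = sample_negatives(("GO:A1", "is_a", "GO:A2"), go_aspect, known)
```

Restricting the candidate pool to the same aspect produces negatives that are plausible but wrong, which are harder for the model to reject than arbitrary random replacements.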

5. Experimental analysis


6. Summary and outlook
The booming field of AI for Science is driving the deep integration of the data-driven Kepler paradigm and the first-principles-driven Newton paradigm. Following the academic idea of "two-wheel drive by data and knowledge", this paper proposes OntoProtein, the first protein pretraining method based on a knowledge graph, and verifies the model's effectiveness on several downstream tasks. In the future, we will maintain OntoProtein so that more researchers can use it, and we plan to explore knowledge-graph-enhanced pretraining that incorporates homologous sequence alignment to achieve better performance.