ICLR 2022 | OntoProtein: Protein Pretraining Integrated with Gene Ontology Knowledge
2022-07-04 01:46:00 【Zhiyuan community】
Paper title: OntoProtein: Protein Pretraining With Gene Ontology Embedding
Authors: Zhang Ningyu (Zhejiang University), Bi Zhen (Zhejiang University), Liang Xiaozhuan (Zhejiang University), Cheng Siyuan (Zhejiang University), Hong Haosen (Zhejiang University), Deng Shumin (Zhejiang University), Lian Jiachang (Zhejiang University), Zhang Qiang (Zhejiang University), Chen Huajun (Zhejiang University)
Conference: ICLR 2022
Paper link: https://arxiv.org/pdf/2201.11147.pdf
Code link: https://github.com/zjunlp/OntoProtein
Reprints are welcome; please credit the source.

1. Introduction
2. Protein Pretraining
Proteins are the fundamental macromolecules that govern organisms and life itself, and studying them helps us understand human health and develop disease therapies. Proteins have primary, secondary, and tertiary structures, and the primary structure shares sequence characteristics with language. Inspired by pretrained models in natural language processing, many protein pretraining models and tools have been proposed, including MSA Transformer [1], ProtTrans [2], WuDao·Wensu [3], and Baidu PaddleHelix. Large-scale unsupervised protein pretraining can even capture a certain amount of protein structure and function from the training corpus. However, proteins differ fundamentally from natural language text: they carry a great deal of biology-specific knowledge that is hard to learn directly through the pretraining objective, and the representations of low-frequency, long-tail proteins are affected by the data distribution. To address these problems, we draw on the vast body of knowledge about protein structure and function accumulated by scientists and propose, for the first time, a protein pretraining method based on a knowledge graph. The construction of the knowledge graph is introduced first.
3. Gene Knowledge Graph
By taking the public gene ontology knowledge graph Gene Ontology (abbreviated GO) and aligning it with protein sequences from the Swiss-Prot database, we built ProteinKG25, a knowledge graph for pretraining. It contains 4,990,097 triples, of which 4,879,951 are protein-GO triples and 110,146 are GO-GO triples, and it has been fully released for community use. As shown in the figure below, following the idea that "structure determines function", explicitly telling the model during protein pretraining which kinds of structure have which kinds of function should improve tasks such as protein function prediction and protein interaction prediction.
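To make the two kinds of triples concrete, here is a minimal sketch of how such a graph could be held in memory and split by triple type. The identifiers, relation names, and the `triple_kind` helper are hypothetical placeholders and do not come from ProteinKG25 itself; only the final arithmetic check uses the counts reported above.

```python
from collections import Counter
from typing import NamedTuple


class Triple(NamedTuple):
    head: str      # a protein or a GO term
    relation: str  # relation label
    tail: str      # a GO term in this sketch


def triple_kind(triple: Triple) -> str:
    """Classify a triple as 'GO-GO' or 'protein-GO' from its head prefix."""
    return "GO-GO" if triple.head.startswith("GO:") else "protein-GO"


# Hypothetical toy triples: identifiers and relation names are made up.
toy_graph = [
    Triple("PROTEIN_A", "has_function", "GO:TERM_1"),
    Triple("PROTEIN_B", "involved_in", "GO:TERM_2"),
    Triple("GO:TERM_1", "is_a", "GO:TERM_3"),
]

counts = Counter(triple_kind(t) for t in toy_graph)
print(dict(counts))

# The reported ProteinKG25 counts are consistent with the stated total.
assert 4_879_951 + 110_146 == 4_990_097
```

Keeping the triple type derivable from the head entity is only one possible design; a real loader might instead store the type explicitly alongside each triple.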
4. Protein Pretraining with an Integrated Gene Knowledge Graph: OntoProtein
Based on the constructed knowledge graph, we designed a dedicated protein pretraining model, OntoProtein. Note that the pretraining input contains two different kinds of sequences: protein sequences, and text describing protein function, biological processes, and so on. We therefore use two different encoders. Protein sequences are encoded with the existing protein pretraining model ProtBert, and text sequences are encoded with BERT. To pretrain better and to fuse the triple knowledge, we adopt two optimization objectives. The first is the traditional masked language model: we randomly mask a token in the sequence and predict that token. The second is a triple knowledge-enhancement objective: we inject biological triple knowledge through embedding learning similar to knowledge graph embedding, as shown in the following formula:

![]()
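For readers unfamiliar with knowledge-graph-embedding objectives, the following is a minimal sketch of one common form: a TransE-style distance combined with a margin-based loss over one true triple and several corrupted ones. The scoring function, the `margin` value, and the toy vectors are all assumptions made for illustration; the sketch is not claimed to match the paper's exact loss.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def distance(head: Vector, relation: Vector, tail: Vector) -> float:
    """TransE-style distance ||h + r - t||_2 (an assumed scoring function)."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def ke_loss(pos_dist: float, neg_dists: List[float], margin: float = 1.0) -> float:
    """Reward a small distance for the true triple and a large one for negatives."""
    positive_term = -log_sigmoid(margin - pos_dist)
    negative_term = -sum(log_sigmoid(d - margin) for d in neg_dists) / len(neg_dists)
    return positive_term + negative_term


# Toy 2-d embeddings (hypothetical values, not model outputs).
protein = [0.0, 0.0]
relation = [1.0, 0.0]
true_go = [1.0, 0.0]    # exactly protein + relation, so distance 0
false_go = [-3.0, 0.0]  # far from protein + relation, so distance 4

good = ke_loss(distance(protein, relation, true_go),
               [distance(protein, relation, false_go)])
bad = ke_loss(distance(protein, relation, false_go),
              [distance(protein, relation, true_go)])
print(good < bad)
```

In a real model the head and tail vectors would come from the protein and text encoders rather than fixed lists, and this loss would be optimized jointly with the masked language model objective.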
Note that the factual knowledge here falls into two different kinds of triples, GO-GO and protein-GO. We therefore propose a knowledge-enhanced negative sampling method to obtain more representative negative samples and improve pretraining. The sampling method is as follows:
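As an illustration of the general idea only, and not as the paper's exact procedure, one common way to get more representative negatives in a typed knowledge graph is to corrupt a triple with entities of the same type as the one being replaced. The sketch below assumes that approach: it replaces the tail with another entity from the tail's own group and skips replacements that would recreate a known true triple. The grouping, the identifiers, and the `sample_negatives` helper are hypothetical.

```python
import random
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def sample_negatives(
    triple: Triple,
    entity_groups: Dict[str, List[str]],
    known_triples: Set[Triple],
    num_negatives: int,
    rng: random.Random,
) -> List[Triple]:
    """Corrupt the tail with entities drawn from the tail's own group."""
    head, relation, tail = triple
    # Find the group (an assumed type partition) that contains the true tail.
    group = next(members for members in entity_groups.values() if tail in members)
    # Keep only replacements that do not recreate a known true triple.
    candidates = [
        entity for entity in group
        if entity != tail and (head, relation, entity) not in known_triples
    ]
    if not candidates:
        raise ValueError("no valid negative candidates for this triple")
    return [(head, relation, rng.choice(candidates)) for _ in range(num_negatives)]


# Hypothetical toy data: group names and identifiers are made up.
groups = {
    "group_x": ["GO:TERM_1", "GO:TERM_2", "GO:TERM_3"],
    "group_y": ["GO:TERM_4", "GO:TERM_5"],
}
known = {
    ("PROTEIN_A", "has_function", "GO:TERM_1"),
    ("PROTEIN_A", "has_function", "GO:TERM_2"),
}
negatives = sample_negatives(
    ("PROTEIN_A", "has_function", "GO:TERM_1"), groups, known, 4, random.Random(0)
)
print(negatives)
```

Restricting candidates to the same group tends to produce harder negatives than uniform sampling over all entities, because the corrupted triples are at least type-plausible.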

5. Experimental Analysis


6. Summary and Outlook
The booming field of AI for Science is driving a deep integration of the data-driven Keplerian paradigm and the first-principles-driven Newtonian paradigm. Following the idea of a "dual drive of data and knowledge", this paper proposes OntoProtein, the first protein pretraining method based on a knowledge graph, and verifies the model on several downstream tasks. In the future, we will maintain OntoProtein for more researchers to use, and we plan to explore knowledge-graph-enhanced pretraining that incorporates homologous sequence alignment to achieve better performance.