2020 Bioinformatics | TransformerCPI
2022-07-04 04:21:00 【Stunned flounder】
2020 Bioinformatics | TransformerCPI: Improving compound–protein interaction prediction by sequence-based deep learning with self-attention mechanism and label reversal experiments
Paper: https://academic.oup.com/bioinformatics/article/36/16/4406/5840724?login=false
Code: https://github.com/lifanchen-simm/transformerCPI
Abstract
Identifying compound–protein interactions (CPIs) is a key task in drug discovery and chemical genomics research. Proteins without resolved three-dimensional structures account for a large fraction of potential biological targets, which calls for CPI prediction methods that use only protein sequence information. However, sequence-based CPI models can suffer from specific pitfalls, including the use of inappropriate datasets, hidden ligand bias, and improper dataset splitting, all of which lead to overestimation of their predictive performance.
Results
To address these problems, the authors constructed new datasets specifically for CPI prediction, proposed a new transformer-based neural network named TransformerCPI, and introduced a more rigorous label reversal experiment to test whether a model learns true interaction features. TransformerCPI achieved significantly improved performance in the new experiments, and its predictions can be deconvolved to highlight the important interacting regions of protein sequences and compound atoms, which may contribute to chemical biology studies and provide useful guidance for further ligand structure optimization.
Current problems
- **Use of inappropriate datasets:** What a model learns depends largely on the dataset it is trained on, and an inappropriate dataset easily steers the model away from the intended target. In chemogenomics-based CPI modeling, the overall goal is to predict interactions between different proteins and compounds from abstract representations of protein and ligand features. The DUD-E dataset, for example, was designed for training structure-based virtual screening. In addition, most ligands in DUD-E, MUV, Human, and BindingDB appear in only one class, and their negative samples are generated by algorithms that may introduce undetectable noise. These datasets can be separated using ligand information alone, so there is no guarantee that a model trained on them learns protein information or interaction features.
- **Hidden ligand bias:** For structure-based virtual screening models, 3D-CNN-based models, and other models trained on the DUD-E dataset, it has been pointed out (Sieg et al., 2019) that predictions are driven mainly by ligand patterns rather than interaction features, leading to a mismatch between theoretical modeling and practical application.
- **Improperly split datasets:** Machine learning researchers often split data randomly into training and test sets. With conventional classification metrics evaluated on such randomly split test sets, however, it is unclear whether a model learns true interaction features or other unintended hidden variables, which can produce accurate models that answer the wrong question.
Model architecture of TransformerCPI
The proposed model is based on the transformer framework, an architecture originally designed for neural machine translation. The transformer is an autoregressive encoder–decoder model that combines multi-head attention layers and position-wise feed-forward layers to solve sequence-to-sequence tasks. Many pretrained models built on it are limited to seq2seq tasks, but inspired by its strong ability to capture features between two sequences, the authors modified the transformer architecture to predict CPIs by treating the compound and the protein as two sequences. An overview of TransformerCPI is shown in Figure 2: the transformer decoder is retained, while the encoder and the final linear layers are modified.
To convert a protein sequence into a sequential representation, the authors first split the protein sequence into overlapping 3-gram amino-acid "words", and then translate all words into real-valued embeddings with the pretrained word2vec method. Word2vec is an unsupervised technique for learning high-quality distributed vector representations that capture complex syntactic and semantic word relationships. Built on the Skip-Gram and CBOW architectures, word2vec maps words into low-dimensional real-valued vectors in which semantically similar words lie close to each other.
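As a minimal sketch of this preprocessing step (not the authors' exact pipeline), the snippet below splits a protein sequence into overlapping 3-grams and trains a small gensim word2vec model on them; the vector size and the other hyperparameters are illustrative assumptions rather than values from the paper.

```python
# Toy sketch: overlapping 3-gram "words" + word2vec protein embedding.
from gensim.models import Word2Vec  # gensim >= 4.0 API assumed
import numpy as np

def seq_to_3grams(seq):
    """Split a protein sequence into overlapping 3-gram amino-acid words."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]

# Toy corpus: each protein becomes one "sentence" of 3-gram words.
proteins = ["MKTAYIAKQR", "MENSDLVQKL"]
corpus = [seq_to_3grams(p) for p in proteins]

# Train a skip-gram word2vec model on the 3-gram vocabulary (sizes are illustrative).
w2v = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)

# Encode one protein as a (sequence_length, 100) matrix of real-valued vectors.
protein_matrix = np.stack([w2v.wv[w] for w in seq_to_3grams(proteins[0])])
print(protein_matrix.shape)  # (8, 100)
```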
The protein's sequence feature vectors are then passed to the encoder to learn more abstract representations of the protein. The authors replace the original self-attention layers of the encoder with a simpler structure: since the standard transformer architecture usually requires a huge training corpus and easily overfits on small or medium-sized datasets, the encoder instead uses Conv1D layers combined with a gated linear unit (a gated convolutional network), which showed better performance on the datasets constructed by the authors. The input of the gated convolutional network is the protein feature vectors.
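The following is a minimal PyTorch sketch of such a gated 1-D convolution block (a Conv1d whose output is split into a content half and a gate half via a GLU); the hidden size, kernel width, and residual connection are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConvEncoder(nn.Module):
    """Illustrative gated convolutional encoder over protein word2vec features."""
    def __init__(self, emb_dim=100, hid_dim=64, kernel_size=7):
        super().__init__()
        self.proj = nn.Linear(emb_dim, hid_dim)       # map word2vec features to hidden size
        # 2 * hid_dim output channels: one half is content, the other is the gate.
        self.conv = nn.Conv1d(hid_dim, 2 * hid_dim, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, protein_emb):                   # (batch, seq_len, emb_dim)
        x = self.proj(protein_emb).transpose(1, 2)    # (batch, hid_dim, seq_len)
        conved = self.conv(x)                         # (batch, 2*hid_dim, seq_len)
        gated = F.glu(conved, dim=1)                  # gated linear unit over channels
        return (gated + x).transpose(1, 2)            # residual, back to (batch, seq_len, hid_dim)

enc = GatedConvEncoder()
out = enc(torch.randn(2, 50, 100))
print(out.shape)  # torch.Size([2, 50, 64])
```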
For compounds, the authors adapt a GCN originally designed for semi-supervised node classification to the problem of learning molecular (atom) representations. Once the protein sequence representation and the atom representation are obtained, the interaction features are modeled by the transformer decoder, which consists of self-attention layers and feed-forward layers. The protein sequence is the input of the encoder, the atom sequence is the input of the decoder, and the decoder outputs an interaction sequence that carries interaction features and has the same length as the atom sequence. The authors also modify the decoder's mask operation so that the model can access the entire sequence; this is one of the most critical modifications for turning the transformer architecture from an autoregressive task into a classification task.
Finally, the resulting interaction feature vectors are fed into fully connected layers, which output the probability y that the compound and the protein interact.
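A minimal PyTorch sketch of this interaction step is shown below: a standard transformer decoder layer is applied without a causal mask, taking the atom features as the target sequence and the protein features as the memory, and the pooled interaction features are passed through fully connected layers ending in a sigmoid. The layer sizes and the mean-pooling choice are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class InteractionDecoder(nn.Module):
    """Illustrative decoder + classification head; not the paper's exact layers."""
    def __init__(self, hid_dim=64, n_heads=4):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=hid_dim, nhead=n_heads,
                                           batch_first=True)  # PyTorch >= 1.9 assumed
        self.decoder = nn.TransformerDecoder(layer, num_layers=1)
        self.head = nn.Sequential(nn.Linear(hid_dim, 64), nn.ReLU(),
                                  nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, atom_feats, protein_feats):
        # atom_feats:    (batch, n_atoms, hid_dim), e.g. from a GCN over the molecular graph
        # protein_feats: (batch, seq_len, hid_dim), the encoder output
        # tgt_mask is left as None: no autoregressive masking, so every atom position
        # sees the whole sequence, turning the decoder into a classifier.
        interaction = self.decoder(tgt=atom_feats, memory=protein_feats)
        pooled = interaction.mean(dim=1)          # aggregate interaction features over atoms
        return self.head(pooled).squeeze(-1)      # interaction probability y

model = InteractionDecoder()
y = model(torch.randn(2, 30, 64), torch.randn(2, 50, 64))
print(y.shape)  # torch.Size([2])
```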
Data sets
Public datasets: three benchmark datasets are used, namely the Human dataset, the Caenorhabditis elegans dataset, and the BindingDB dataset.
**Label reversal datasets:** Many previous studies generated negative samples by randomly cross-combining CPI pairs or by similarity-based methods, which may introduce unexpected noise and unnoticed bias. The authors therefore first built a GPCR dataset from the GLASS database, and then built a kinase dataset based on the KIBA dataset.
To confirm which interaction features a model actually learns and to accurately assess the influence of hidden variables, the authors propose a more rigorous label reversal experiment. A ligand in the training set appears only in one class of samples (positive or negative CPI pairs), while in the test set the same ligand appears only in the other class. In this way, the model is forced to use protein information to learn interaction patterns and to make the opposite prediction for those selected ligands: if the model merely memorizes ligand patterns, it cannot make correct predictions, because the memorized ligand carries the wrong (opposite) label. The label reversal experiment is therefore specifically designed to evaluate chemogenomics-based CPI models and can indicate how strongly hidden ligand bias affects them. A toy split illustrating this protocol is sketched below.
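The following is a toy illustration (not the authors' split script) of how such a label reversal split could be constructed from (compound, protein, label) records; the handling of single-label ligands here is an assumption for completeness.

```python
from collections import defaultdict

def label_reversal_split(records, reverse_to_test=0):
    """records: iterable of (compound_id, protein_id, label) with label in {0, 1}.
    For ligands seen with both labels, pairs whose label equals `reverse_to_test`
    are held out for the test set, so the test label is the "reversed" one."""
    labels_per_ligand = defaultdict(set)
    for c, _, y in records:
        labels_per_ligand[c].add(y)

    train, test = [], []
    for c, p, y in records:
        if labels_per_ligand[c] == {0, 1}:            # ligand seen with both labels
            (test if y == reverse_to_test else train).append((c, p, y))
        else:                                         # single-label ligands stay in training
            train.append((c, p, y))
    return train, test
```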
Summary
The transformer architecture with self-attention is modified to solve the sequence-based CPI classification task, yielding a model named TransformerCPI that shows high performance on three benchmark datasets. To address the potential risks of deep learning, the authors constructed new datasets specifically for the chemogenomics-based CPI task and designed a more rigorous label reversal experiment. Compared with other models, TransformerCPI performs excellently in the new experiments, showing that it can learn the desired interaction features and reduce the risk of hidden ligand bias. Finally, the interpretability of the model is studied by mapping attention weights back to protein sequences and compound atoms, which can help determine whether a prediction is reliable and physically meaningful. Overall, TransformerCPI provides a way to interpret the model and offers useful guidance for further experimental optimization.