2022 ICML | Pocket2Mol: Efficient Molecular Sampling Based on 3D Protein Pockets
2022-06-13 04:30:00 【Dazed flounder】
Paper: https://arxiv.org/abs/2205.07249
Code: https://github.com/pengxingang/Pocket2Mol
Pocket2Mol: Efficient Molecular Sampling Based on 3D Protein Pockets
This post introduces a paper by Xingang Peng et al. published at ICML 2022: *Pocket2Mol: Efficient Molecular Sampling Based on 3D Protein Pockets*. The authors propose Pocket2Mol, a new sampling method that satisfies the multiple geometric constraints imposed by protein pockets. It is a two-module E(3)-equivariant generative network that captures both the spatial and bonding relationships among binding-pocket atoms, and samples new drug candidates from a tractable distribution conditioned on the pocket representation, without relying on Markov chain Monte Carlo (MCMC). The improvements for pocket-based drug design are: 1) a new deep geometric neural network that accurately models the 3D structure of the pocket; 2) a new sampling strategy that enables more efficient conditional sampling of 3D coordinates; 3) the ability to sample chemical bonds between pairs of atoms. Experimental results show that molecules sampled from Pocket2Mol have significantly better binding affinity and other drug properties, such as drug-likeness and synthetic accessibility.
Introduction
Early approaches incorporated evaluation functions (such as the docking score between the sampled molecule and the pocket) into pocket-free models to guide the candidate search [1]. Another class of models transforms the 3D pocket structure into molecular SMILES strings or 2D molecular graphs [2], instead of modeling the interaction between small-molecule structures and the 3D pocket. A conditional generative model was later developed to model the 3D atomic density distribution inside the 3D pocket structure; the challenge then shifts to the efficiency of the algorithm that samples structures from the learned distribution. Moreover, previous models overemphasize the 3D positions of atoms and ignore the formation of chemical bonds, which in practice leads to unrealistic connections between atoms.
Related work
Molecular generation based on three-dimensional protein pockets
- An improved GAN model represents the molecules in the protein pocket in a latent space and decodes these representations into SMILES strings with a captioning network. Alternatively, two structural descriptors were designed to encode the pocket, and a conditional RNN generates the SMILES.
- Another line of work considers the 3D structures of both pockets and small molecules. A ligand neural network was proposed to generate 3D molecular structures, with Monte Carlo tree search used to optimize candidate molecules that bind to a specific pocket.
Equivariant networks based on vector features
Global rotational equivariance for 3D objects is usually achieved with GNNs. However, these require the input and hidden features of every layer to be invariant, which is inconsistent with vector features such as the side-chain angles of each amino acid.
Generation of atomic positions
- A common strategy is to predict the distance distributions between the new atom and all previous atoms, and then sample positions from those distributions.
- Another strategy builds a local spherical coordinate system and predicts positions in this local space, but the conversion between Euclidean and spherical space is inefficient and indirect.
Method
The core idea of Pocket2Mol is to learn, given the atoms already placed, the probability distribution over atom and bond types at each position in the pocket. To learn this context-specific distribution, the authors use an autoregressive strategy that predicts randomly masked parts of training drugs.
The generation process
Formally, the protein pocket is represented as a set of atoms with their coordinates, and the generated molecular fragment with $n$ atoms is likewise represented by its atoms, where the $i$-th heavy atom carries its element type, its coordinates, and its valence bonds to the other atoms. Denoting the model by $\varphi$, the generation process is defined autoregressively: each new atom is sampled conditioned on the pocket and the current fragment.
The generation process consists of four main steps, as shown in Figure 1.
(1) First, the frontier predictor of the model, $f_{fro}$, predicts the frontier atoms of the current molecular fragment. Frontiers are defined as atoms of the molecule that can covalently bond to new atoms. If no atom is a frontier, the current molecule is complete and the generation process stops.
(2) Second, the model samples one atom from the frontier set as the focus atom.
(3) Then, based on the focus atom, the position predictor $f_{pos}$ predicts the relative position of the new atom. Finally, the element predictor $f_{ele}$ and the bond-type predictor $f_{bond}$ predict the probabilities of the element types and of the bond types with existing atoms, from which the new atom's element and valence bonds are sampled.
(4) In this way, the new atom is added to the current molecular fragment, and the generation process continues until no frontier atoms remain.
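The four steps above can be sketched as a simple sampling loop. The four callables stand in for the trained predictors $f_{fro}$, $f_{pos}$, $f_{ele}$, $f_{bond}$; their signatures and the dictionary atom format are illustrative, not the repo's actual API:

```python
import random

def generate(predict_frontier, sample_position, sample_element, sample_bonds,
             max_atoms=50):
    """Hedged sketch of Pocket2Mol's four-step autoregressive loop."""
    mol = []                                      # growing molecular fragment
    while len(mol) < max_atoms:
        frontier = predict_frontier(mol)          # (1) which atoms can grow
        if not frontier:                          # no frontier -> molecule done
            break
        focus = random.choice(frontier)           # (2) sample a focus atom
        pos = sample_position(mol, focus)         # (3) relative position...
        elem = sample_element(mol, pos)           #     ...element type...
        bonds = sample_bonds(mol, pos)            #     ...bonds to prior atoms
        mol.append({"element": elem, "pos": pos, "bonds": bonds})  # (4) grow
    return mol
```

With stub predictors this loop terminates as soon as the frontier set comes back empty, mirroring the stopping rule in step (1).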
Model structure
Following the generation process above, the model consists of four modules: an encoder, a frontier predictor, a position predictor, and an element-and-bond predictor.
E(3)-equivariant neural network
Equipping the vertices and edges of a 3D graph with both scalar and vector features helps enhance the expressive power of the neural network. All vertices and edges of the protein pocket $P^{(pro)}$ and the molecular fragment $G^{(mol)}_n$ are therefore associated with scalar and vector features, to better capture 3D geometric information.
The original GVP (geometric vector perceptron) is modified by adding a vector nonlinear activation to the output vectors of the GVP; the modified block is denoted GVL:
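As an illustration, here is a minimal numpy sketch of a GVP-style layer with a sigmoid gate on the output vectors playing the role of the added vector nonlinearity. The weight shapes and activation choices are assumptions for illustration, not the paper's exact definitions; the point is that rotating the input vectors rotates the output vectors while leaving the scalars unchanged:

```python
import numpy as np

def gvp_layer(s, V, Wh, Wm, Wmu, bm):
    """Sketch of a geometric vector perceptron with a vector-gate nonlinearity.
    s: (n_s,) scalar features; V: (n_v, 3) vector features.
    Wh: (h, n_v), Wm: (m, n_s + h), Wmu: (mu, h), bm: (m,)."""
    Vh = Wh @ V                                  # (h, 3) hidden vectors
    norms = np.linalg.norm(Vh, axis=-1)          # rotation-invariant norms
    s_out = np.maximum(Wm @ np.concatenate([s, norms]) + bm, 0.0)  # ReLU
    Vmu = Wmu @ Vh                               # (mu, 3) output vectors
    gate = 1.0 / (1.0 + np.exp(-np.linalg.norm(Vmu, axis=-1)))  # sigmoid gate
    V_out = gate[:, None] * Vmu                  # added vector nonlinearity
    return s_out, V_out
```

Because vectors enter the scalar path only via their norms and are only ever scaled by scalars, the layer is equivariant to rotations by construction.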
Encoder
The protein pocket and the molecular fragment are together represented as a k-nearest-neighbor (KNN) graph, where the vertices are atoms and each atom is connected to its k nearest neighbors. The input vector vertex features include the atomic coordinates, and the vector edge features are the unit direction vectors of the edges in 3D space.
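A plain numpy sketch of building such a KNN graph, including the unit direction vectors used as vector edge features (the function name and return format are illustrative, not the repo's code):

```python
import numpy as np

def knn_graph(coords, k):
    """Directed k-nearest-neighbour edge list over atom coordinates.
    coords: (N, 3). Returns (src, dst) index arrays and unit directions."""
    diff = coords[None, :, :] - coords[:, None, :]   # diff[i, j] = r_j - r_i
    dist = np.linalg.norm(diff, axis=-1)             # pairwise distances
    np.fill_diagonal(dist, np.inf)                   # exclude self-loops
    nbrs = np.argsort(dist, axis=1)[:, :k]           # k closest per atom
    src = np.repeat(np.arange(len(coords)), k)
    dst = nbrs.reshape(-1)
    unit = diff[src, dst] / dist[src, dst][:, None]  # unit direction e_ij
    return src, dst, unit
```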
First, multiple embedding layers embed the vertex features $(v_i^{(0)}, \vec{v}_i^{(0)})$ and edge features $(e_{ij}^{(0)}, \vec{e}_{ij}^{(0)})$. Then $L$ message-passing modules $M_l$ and update modules $U_l$ $(l = 1, \dots, L)$ are interleaved to learn local structural representations.
In the message-passing module, vector messages are computed by multiplying the vector features of vertices and edges by scalar features and summing, so that information is exchanged both between vertices and edges and between scalar and vector features. The update module then uses the aggregated messages to update each vertex's scalar and vector features.
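The key constraint in this scalar–vector mixing is that vector channels may only be scaled by scalars (an equivariant operation), while vectors feed back into scalar channels only through their rotation-invariant norms. A toy numpy sketch of this pattern, omitting the paper's actual gating networks:

```python
import numpy as np

def vector_message(s_i, V_i, s_ij, V_ij):
    """Toy scalar/vector message mixing: illustrative shapes, not the paper's.
    s_i, s_ij: (n_v,) scalar features; V_i, V_ij: (n_v, 3) vector features."""
    # scalar -> vector: scale each vector channel by a scalar coefficient
    msg_V = s_i[:, None] * V_i + s_ij[:, None] * V_ij
    # vector -> scalar: only rotation-invariant norms feed scalar messages
    msg_s = np.concatenate([np.linalg.norm(V_i, axis=-1),
                            np.linalg.norm(V_ij, axis=-1)])
    return msg_s, msg_V
```

Scaling by scalars and taking norms are the two operations that keep the whole message-passing stack E(3)-equivariant.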
Prediction
Frontier prediction: A geometric vector MLP (GV-MLP), denoted $G_{mlp}$, is defined as a GVP block followed by a GVL block. The frontier predictor takes the features of atom $i$ as input and uses one GV-MLP layer to predict the frontier probability $p_{fro}$, as shown below:
Position predictor:
The position predictor takes the features of the focus atom $i$ as input and predicts the relative position of the new atom. Because the model's vector features are equivariant, they can directly generate relative coordinates $\Delta r_i$ with respect to the focus atom's position $r_i$. The output of the position predictor is modeled as a Gaussian mixture with diagonal covariance, $p(\Delta r_i) = \sum_{k=1}^{K} \pi_i^{(k)} \mathcal{N}(\mu_i^{(k)}, \Sigma_i^{(k)})$, where the parameters are predicted by separate neural networks as follows:
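Sampling from such a mixture is straightforward: pick a component according to its weight, then draw from the corresponding diagonal Gaussian and shift by the focus atom's position. A minimal numpy sketch (parameter shapes are illustrative):

```python
import numpy as np

def sample_position(focus_pos, pi, mu, sigma2, rng):
    """Draw a new-atom position from p(dr) = sum_k pi_k N(mu_k, Sigma_k).
    pi: (K,) mixture weights; mu: (K, 3) means; sigma2: (K, 3) diag variances."""
    k = rng.choice(len(pi), p=pi)                  # pick a mixture component
    delta = rng.normal(mu[k], np.sqrt(sigma2[k]))  # dr ~ N(mu_k, Sigma_k)
    return focus_pos + delta                       # absolute 3D position
```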
Element and bond predictor: After the position of the new atom $i$ has been predicted, the element-and-bond predictor predicts the element type of atom $i$ and its valence bonds to every atom $q$ $(\forall q \in V^{(mol)})$ of the existing molecular fragment. Figure 2 shows the structure of the prediction network.
First, the $k$ nearest neighbors $j \in \mathrm{KNN}(i)$ among all atoms are collected; a message-passing module then aggregates local information from these neighboring atoms into the position of the new atom $i$ to form its representation $(v_i, \vec{v}_i)$, from which the element type of atom $i$ is predicted.
In a parallel path, the edge between atoms $i$ and $q$ is represented as $(z_{iq}, \vec{z}_{iq})$, obtained by concatenating the features of atom $i$, the features of atom $q$, and the features of the edge $e_{iq}$, followed by a GV-MLP block, i.e.:
where $(e'_{iq}, \vec{e}\,'_{iq})$ denotes the input edge features after processing by an embedding layer and a GV-MLP block.
For the vector features, a new attention module is proposed, defined as follows:
Training
In the training phase, atoms of the molecule are randomly masked and the model is trained to recover them. Specifically, for each pocket–ligand pair, a mask ratio is sampled from the uniform distribution U[0, 1], and the corresponding number of molecular atoms is masked. The remaining molecular atoms that have valence bonds with masked atoms are defined as frontiers. The position predictor and the element-and-bond predictor then try to predict, for each frontier, the positions, element types, and bonds with the remaining molecular atoms, so as to recover the masked atoms that have valence bonds with the frontier. If all atoms of the molecule are masked, the frontiers are defined as the protein atoms within 4 Å of masked atoms, and the masked atoms around these frontiers are recovered. For element-type prediction, an additional Nothing type is introduced to indicate that no atom occupies the queried position: during training, besides the positions of masked atoms, negative positions are sampled from the surrounding space and labeled Nothing.
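The masking scheme above can be sketched as follows. The function name and bond-set representation are illustrative, and only the molecular-atom case is covered (the 4 Å protein-atom fallback is omitted):

```python
import numpy as np

def mask_molecule(bonds, n_atoms, rng):
    """Draw a mask ratio from U[0, 1], mask that fraction of atoms, and label
    unmasked atoms bonded to a masked atom as frontiers.
    bonds: set of (i, j) atom-index pairs (undirected)."""
    ratio = rng.uniform(0.0, 1.0)
    n_mask = max(1, int(round(ratio * n_atoms)))   # mask at least one atom
    masked = set(rng.choice(n_atoms, size=n_mask, replace=False).tolist())
    sym = bonds | {(j, i) for (i, j) in bonds}     # symmetrize bond list
    frontier = {i for (i, j) in sym if i not in masked and j in masked}
    return masked, frontier
```

During training, the predictors are then supervised to recover exactly the masked atoms adjacent to each frontier.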
The frontier-prediction loss $L_{fro}$ is the binary cross-entropy of the frontier predictions. The position-predictor loss $L_{pos}$ is the negative log-likelihood of the masked atoms' positions. For element-type and bond-type prediction, cross-entropy classification losses are used, denoted $L_{ele}$ and $L_{bond}$.
The overall loss function is the sum of the above four losses. The encoder and all three predictors are optimized jointly with the Adam optimizer.
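As a concrete example, the frontier term and the overall sum can be written in a few lines of numpy; the clipping constant and the mean reduction are standard choices, not taken from the paper:

```python
import numpy as np

def frontier_bce(p_pred, y_true):
    """Binary cross-entropy L_fro over predicted frontier probabilities."""
    eps = 1e-9
    p = np.clip(p_pred, eps, 1 - eps)              # avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

def total_loss(l_fro, l_pos, l_ele, l_bond):
    """Overall objective: unweighted sum of the four prediction losses."""
    return l_fro + l_pos + l_ele + l_bond
```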
Results
Pocket2Mol is an E(3)-equivariant generative network that models the chemical and geometric features of 3D protein pockets and uses a new efficient algorithm to sample new 3D drug candidates. Experiments show that the molecules generated by Pocket2Mol not only have better binding affinity and chemical properties, but also more realistic and accurate structures.
References
- Structure-based de novo drug design using 3D deep generative models.
- From target to drug: Generative modeling for the multimodal structure-based ligand design. Molecular Pharmaceutics.
- De novo molecule design through the molecular generative model conditioned by 3D information of protein binding sites.