2022 ICML | Pocket2Mol: Efficient Molecular Sampling Based on 3D Protein Pockets
2022-06-13 04:30:00 【Dazed flounder】
Paper: https://arxiv.org/abs/2205.07249
Code: https://github.com/pengxingang/Pocket2Mol
Pocket2Mol: Efficient Molecular Sampling Based on 3D Protein Pockets
This post introduces a paper by Xingang Peng et al. published at ICML 2022: *Pocket2Mol: Efficient Molecular Sampling Based on 3D Protein Pockets*. The authors propose Pocket2Mol, a new sampling method that satisfies the multiple geometric constraints imposed by protein pockets. It is a two-module E(3)-equivariant generative network that captures both the spatial and bonding relationships among binding-pocket atoms, and samples new drug candidates from a tractable distribution conditioned on the pocket representation, without relying on Markov chain Monte Carlo (MCMC). The improvements for pocket-based drug design are: 1) a new deep geometric neural network that accurately models the 3D structure of the pocket; 2) a new sampling strategy that enables more efficient conditional sampling of 3D coordinates; 3) the ability to sample chemical bonds between pairs of atoms. Experimental results show that molecules sampled from Pocket2Mol have significantly better binding affinity and other drug properties, such as drug-likeness and synthetic accessibility.
Introduction
Early approaches incorporated evaluation functions (such as the docking score between the sampled molecule and the pocket) into pocket-free models to guide the candidate search [1]. Another class of models transforms the 3D pocket structure into molecular SMILES strings or 2D molecular graphs [2], instead of modeling the interaction between small-molecule structures and the 3D pocket. A conditional generative model was later developed to model the 3D atomic density distribution inside the 3D pocket structure; the challenge then shifts to the efficiency of the algorithm that samples structures from the learned distribution. Moreover, previous models overemphasize the 3D positions of atoms and ignore the formation of chemical bonds, which in practice leads to unrealistic connections between atoms.
Related work
Molecular generation based on three-dimensional protein pockets
- An improved GAN model represents the molecules in the protein pocket in a latent space and decodes these representations into SMILES strings with a captioning network. Alternatively, two structural descriptors were designed to encode the pocket, and a conditional RNN generates the SMILES.
- Another line of work considers the 3D structures of both pockets and small molecules. A ligand neural network was proposed to generate 3D molecular structures, with Monte Carlo tree search used to optimize candidate molecules that bind to a specific pocket.
Equivariant networks based on vector features
Global rotational equivariance for 3D objects is usually achieved with GNNs. However, these require the input and hidden features of every layer to be invariant, which is inconsistent with vector features such as the side-chain angles of each amino acid.
Generation of atomic positions
- A common strategy is to predict the distance distributions between the new atom and all previous atoms, and then sample positions from those distributions.
- Another strategy builds a local spherical coordinate system and predicts positions in this local space, but the conversion between Euclidean and spherical space is inefficient and indirect.
Method
The core idea of Pocket2Mol is to learn, given the atoms already placed, the probability distribution over atom and bond types at each position in the pocket. To learn this context-specific distribution, the authors use an autoregressive strategy that predicts randomly masked parts of training drugs.
The generation process
Formally, the protein pocket is represented as a set of atoms with their coordinates, and the generated molecular fragment with $n$ atoms is likewise represented by its atoms, where the $i$-th heavy atom carries its element type, its coordinates, and its valence bonds to the other atoms. Denoting the model by $\varphi$, the generation process is defined autoregressively: each new atom is sampled conditioned on the pocket and the current fragment.
The generation process consists of four main steps, as shown in Figure 1.
(1) First, the frontier predictor of the model, $f_{fro}$, predicts the frontier atoms of the current molecular fragment. Frontiers are defined as atoms of the molecule that can covalently bond to new atoms. If no atom is a frontier, the current molecule is complete and the generation process stops.
(2) Second, the model samples one atom from the frontier set as the focus atom.
(3) Then, based on the focus atom, the position predictor $f_{pos}$ predicts the relative position of the new atom. Finally, the element predictor $f_{ele}$ and the bond-type predictor $f_{bond}$ predict the probabilities of the element types and of the bond types with existing atoms, from which the new atom's element and valence bonds are sampled.
(4) In this way, the new atom is added to the current molecular fragment, and the generation process continues until no frontier atoms remain.
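The four steps above can be sketched as a simple sampling loop. The four callables stand in for the trained predictors $f_{fro}$, $f_{pos}$, $f_{ele}$, $f_{bond}$; their signatures and the dictionary atom format are illustrative, not the repo's actual API:

```python
import random

def generate(predict_frontier, sample_position, sample_element, sample_bonds,
             max_atoms=50):
    """Hedged sketch of Pocket2Mol's four-step autoregressive loop."""
    mol = []                                      # growing molecular fragment
    while len(mol) < max_atoms:
        frontier = predict_frontier(mol)          # (1) which atoms can grow
        if not frontier:                          # no frontier -> molecule done
            break
        focus = random.choice(frontier)           # (2) sample a focus atom
        pos = sample_position(mol, focus)         # (3) relative position...
        elem = sample_element(mol, pos)           #     ...element type...
        bonds = sample_bonds(mol, pos)            #     ...bonds to prior atoms
        mol.append({"element": elem, "pos": pos, "bonds": bonds})  # (4) grow
    return mol
```

With stub predictors this loop terminates as soon as the frontier set comes back empty, mirroring the stopping rule in step (1).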
Model structure
Following the generation process above, the model consists of four modules: an encoder, a frontier predictor, a position predictor, and an element-and-bond predictor.
E(3)-equivariant neural network
Equipping the vertices and edges of a 3D graph with both scalar and vector features helps enhance the expressive power of the neural network. All vertices and edges of the protein pocket $P^{(pro)}$ and the molecular fragment $G^{(mol)}_n$ are therefore associated with scalar and vector features, to better capture 3D geometric information.
The original GVP (geometric vector perceptron) is modified by adding a vector nonlinear activation to the output vectors of the GVP; the modified block is denoted GVL:
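As an illustration, here is a minimal numpy sketch of a GVP-style layer with a sigmoid gate on the output vectors playing the role of the added vector nonlinearity. The weight shapes and activation choices are assumptions for illustration, not the paper's exact definitions; the point is that rotating the input vectors rotates the output vectors while leaving the scalars unchanged:

```python
import numpy as np

def gvp_layer(s, V, Wh, Wm, Wmu, bm):
    """Sketch of a geometric vector perceptron with a vector-gate nonlinearity.
    s: (n_s,) scalar features; V: (n_v, 3) vector features.
    Wh: (h, n_v), Wm: (m, n_s + h), Wmu: (mu, h), bm: (m,)."""
    Vh = Wh @ V                                  # (h, 3) hidden vectors
    norms = np.linalg.norm(Vh, axis=-1)          # rotation-invariant norms
    s_out = np.maximum(Wm @ np.concatenate([s, norms]) + bm, 0.0)  # ReLU
    Vmu = Wmu @ Vh                               # (mu, 3) output vectors
    gate = 1.0 / (1.0 + np.exp(-np.linalg.norm(Vmu, axis=-1)))  # sigmoid gate
    V_out = gate[:, None] * Vmu                  # added vector nonlinearity
    return s_out, V_out
```

Because vectors enter the scalar path only via their norms and are only ever scaled by scalars, the layer is equivariant to rotations by construction.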
Encoder
The protein pocket and the molecular fragment are together represented as a k-nearest-neighbor (KNN) graph, where the vertices are atoms and each atom is connected to its k nearest neighbors. The input vector vertex features include the atomic coordinates, and the vector edge features are the unit direction vectors of the edges in 3D space.
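A plain numpy sketch of building such a KNN graph, including the unit direction vectors used as vector edge features (the function name and return format are illustrative, not the repo's code):

```python
import numpy as np

def knn_graph(coords, k):
    """Directed k-nearest-neighbour edge list over atom coordinates.
    coords: (N, 3). Returns (src, dst) index arrays and unit directions."""
    diff = coords[None, :, :] - coords[:, None, :]   # diff[i, j] = r_j - r_i
    dist = np.linalg.norm(diff, axis=-1)             # pairwise distances
    np.fill_diagonal(dist, np.inf)                   # exclude self-loops
    nbrs = np.argsort(dist, axis=1)[:, :k]           # k closest per atom
    src = np.repeat(np.arange(len(coords)), k)
    dst = nbrs.reshape(-1)
    unit = diff[src, dst] / dist[src, dst][:, None]  # unit direction e_ij
    return src, dst, unit
```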
First, multiple embedding layers embed the vertex features $(v_i^{(0)}, \vec{v}_i^{(0)})$ and edge features $(e_{ij}^{(0)}, \vec{e}_{ij}^{(0)})$. Then $L$ message-passing modules $M_l$ and update modules $U_l$ $(l = 1, \dots, L)$ are interleaved to learn local structural representations.
In the message-passing module, vector messages are computed by multiplying the vector features of vertices and edges by scalar features and summing, so that information is exchanged both between vertices and edges and between scalar and vector features. The update module then uses the aggregated messages to update each vertex's scalar and vector features.
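The key constraint in this scalar–vector mixing is that vector channels may only be scaled by scalars (an equivariant operation), while vectors feed back into scalar channels only through their rotation-invariant norms. A toy numpy sketch of this pattern, omitting the paper's actual gating networks:

```python
import numpy as np

def vector_message(s_i, V_i, s_ij, V_ij):
    """Toy scalar/vector message mixing: illustrative shapes, not the paper's.
    s_i, s_ij: (n_v,) scalar features; V_i, V_ij: (n_v, 3) vector features."""
    # scalar -> vector: scale each vector channel by a scalar coefficient
    msg_V = s_i[:, None] * V_i + s_ij[:, None] * V_ij
    # vector -> scalar: only rotation-invariant norms feed scalar messages
    msg_s = np.concatenate([np.linalg.norm(V_i, axis=-1),
                            np.linalg.norm(V_ij, axis=-1)])
    return msg_s, msg_V
```

Scaling by scalars and taking norms are the two operations that keep the whole message-passing stack E(3)-equivariant.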
Prediction
Frontier prediction: A geometric vector MLP (GV-MLP), denoted $G_{mlp}$, is defined as a GVP block followed by a GVL block. The frontier predictor takes the features of atom $i$ as input and uses one GV-MLP layer to predict the frontier probability $p_{fro}$, as shown below:
Position predictor:
The position predictor takes the features of the focus atom $i$ as input and predicts the relative position of the new atom. Because the model's vector features are equivariant, they can directly generate relative coordinates $\Delta r_i$ with respect to the focus atom's position $r_i$. The output of the position predictor is modeled as a Gaussian mixture with diagonal covariance, $p(\Delta r_i) = \sum_{k=1}^{K} \pi_i^{(k)} \mathcal{N}(\mu_i^{(k)}, \Sigma_i^{(k)})$, where the parameters are predicted by separate neural networks as follows:
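Sampling from such a mixture is straightforward: pick a component according to its weight, then draw from the corresponding diagonal Gaussian and shift by the focus atom's position. A minimal numpy sketch (parameter shapes are illustrative):

```python
import numpy as np

def sample_position(focus_pos, pi, mu, sigma2, rng):
    """Draw a new-atom position from p(dr) = sum_k pi_k N(mu_k, Sigma_k).
    pi: (K,) mixture weights; mu: (K, 3) means; sigma2: (K, 3) diag variances."""
    k = rng.choice(len(pi), p=pi)                  # pick a mixture component
    delta = rng.normal(mu[k], np.sqrt(sigma2[k]))  # dr ~ N(mu_k, Sigma_k)
    return focus_pos + delta                       # absolute 3D position
```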
Element and bond predictor: After the position of the new atom $i$ has been predicted, the element-and-bond predictor predicts the element type of atom $i$ and its valence bonds to every atom $q$ $(\forall q \in V^{(mol)})$ of the existing molecular fragment. Figure 2 shows the structure of the prediction network.
First, the $k$ nearest neighbors $j \in \mathrm{KNN}(i)$ among all atoms are collected; a message-passing module then aggregates local information from these neighboring atoms into the position of the new atom $i$ to form its representation $(v_i, \vec{v}_i)$, from which the element type of atom $i$ is predicted.
In a parallel path, the edge between atoms $i$ and $q$ is represented as $(z_{iq}, \vec{z}_{iq})$, obtained by concatenating the features of atom $i$, the features of atom $q$, and the features of the edge $e_{iq}$, followed by a GV-MLP block, i.e.:
where $(e'_{iq}, \vec{e}\,'_{iq})$ denotes the input edge features after processing by an embedding layer and a GV-MLP block.
For the vector features, a new attention module is proposed, defined as follows:
Training
In the training phase, atoms of the molecule are randomly masked and the model is trained to recover them. Specifically, for each pocket–ligand pair, a mask ratio is sampled from the uniform distribution U[0, 1], and the corresponding number of molecular atoms is masked. The remaining molecular atoms that have valence bonds with masked atoms are defined as frontiers. The position predictor and the element-and-bond predictor then try to predict, for each frontier, the positions, element types, and bonds with the remaining molecular atoms, so as to recover the masked atoms that have valence bonds with the frontier. If all atoms of the molecule are masked, the frontiers are defined as the protein atoms within 4 Å of masked atoms, and the masked atoms around these frontiers are recovered. For element-type prediction, an additional Nothing type is introduced to indicate that no atom occupies the queried position: during training, besides the positions of masked atoms, negative positions are sampled from the surrounding space and labeled Nothing.
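The masking scheme above can be sketched as follows. The function name and bond-set representation are illustrative, and only the molecular-atom case is covered (the 4 Å protein-atom fallback is omitted):

```python
import numpy as np

def mask_molecule(bonds, n_atoms, rng):
    """Draw a mask ratio from U[0, 1], mask that fraction of atoms, and label
    unmasked atoms bonded to a masked atom as frontiers.
    bonds: set of (i, j) atom-index pairs (undirected)."""
    ratio = rng.uniform(0.0, 1.0)
    n_mask = max(1, int(round(ratio * n_atoms)))   # mask at least one atom
    masked = set(rng.choice(n_atoms, size=n_mask, replace=False).tolist())
    sym = bonds | {(j, i) for (i, j) in bonds}     # symmetrize bond list
    frontier = {i for (i, j) in sym if i not in masked and j in masked}
    return masked, frontier
```

During training, the predictors are then supervised to recover exactly the masked atoms adjacent to each frontier.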
The frontier-prediction loss $L_{fro}$ is the binary cross-entropy of the frontier predictions. The position-predictor loss $L_{pos}$ is the negative log-likelihood of the masked atoms' positions. For element-type and bond-type prediction, cross-entropy classification losses are used, denoted $L_{ele}$ and $L_{bond}$.
The overall loss function is the sum of the above four losses. The encoder and all three predictors are optimized jointly with the Adam optimizer.
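As a concrete example, the frontier term and the overall sum can be written in a few lines of numpy; the clipping constant and the mean reduction are standard choices, not taken from the paper:

```python
import numpy as np

def frontier_bce(p_pred, y_true):
    """Binary cross-entropy L_fro over predicted frontier probabilities."""
    eps = 1e-9
    p = np.clip(p_pred, eps, 1 - eps)              # avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

def total_loss(l_fro, l_pos, l_ele, l_bond):
    """Overall objective: unweighted sum of the four prediction losses."""
    return l_fro + l_pos + l_ele + l_bond
```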
Results
Pocket2Mol is an E(3)-equivariant generative network that models the chemical and geometric features of 3D protein pockets and uses a new efficient algorithm to sample new 3D drug candidates. Experiments show that the molecules generated by Pocket2Mol not only have better binding affinity and chemical properties, but also more realistic and accurate structures.
References
- Structure-based de novo drug design using 3D deep generative models.
- From target to drug: Generative modeling for the multimodal structure-based ligand design. Molecular Pharmaceutics.
- De novo molecule design through the molecular generative model conditioned by 3D information of protein binding sites.