
Knowledge-based BERT: a method to extract molecular features like a computational chemist

2022-06-10 17:08:00 DrugAI

Today we introduce a paper jointly published in Briefings in Bioinformatics by Professor Tingjun Hou's team at the Institute of Intelligent Innovative Drugs of Zhejiang University, Professor Dongsheng Cao's team at Central South University, and the Tencent Quantum Laboratory: "Knowledge-based BERT: a method to extract molecular features like computational chemists". The paper proposes a new pre-training strategy: by learning molecular and atomic features predefined by computational chemists, the model learns to extract molecular features from SMILES the way a computational chemist would. K-BERT shows excellent predictive performance on multiple druggability datasets. In addition, the general-purpose fingerprint K-BERT-FP generated by K-BERT achieves predictive power comparable to MACCS on 15 pharmaceutical datasets. With further pre-training, K-BERT-FP can also learn molecular size and chirality information that traditional binary fingerprints (such as MACCS and ECFP4) cannot characterize.

Research background

Molecular property prediction models based on machine learning have become an important tool for filtering out unpromising lead molecules in the early stages of drug discovery. Compared with the mainstream descriptor-based and graph-based prediction methods, SMILES-based methods extract molecular features directly from SMILES strings without human expert knowledge, but they require more powerful feature-extraction algorithms and more training data, which has made SMILES-based methods less popular than the other two.

Knowledge-based BERT pre-training strategy

The authors propose a new pre-training strategy on top of BERT that lets the model extract molecular features directly from SMILES. Three pre-training tasks are introduced: atomic feature prediction, molecular feature prediction, and contrastive learning. The atomic feature prediction task lets the model learn the information manually extracted in graph-based methods (initial atom features); the molecular feature prediction task lets the model learn the information manually extracted in descriptor-based methods (molecular descriptors/fingerprints); and the contrastive learning task makes the embeddings of different SMILES strings of the same molecule more similar, so that K-BERT can recognize different SMILES strings of the same molecule.

Pre-training task 1 – atomic feature prediction (Figure 1A): the model predicts the atomic features, computed by RDKit, of every heavy atom in a molecule. The atomic features include degree, aromaticity, hydrogen count, chirality, and chirality type, so this can be treated as a multi-task classification task;
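As a rough illustration of how such atom-level labels can be computed, here is a minimal RDKit sketch; the feature set below (degree, aromaticity, hydrogen count, chirality tag) follows the list above, but the paper's exact features and encodings may differ.

```python
from rdkit import Chem

def atom_feature_labels(smiles: str):
    """Per-heavy-atom labels computed with RDKit."""
    mol = Chem.MolFromSmiles(smiles)
    labels = []
    for atom in mol.GetAtoms():  # iterates over heavy atoms (implicit Hs excluded)
        labels.append({
            "symbol": atom.GetSymbol(),
            "degree": atom.GetDegree(),              # number of bonded neighbors
            "aromatic": atom.GetIsAromatic(),        # aromaticity flag
            "num_H": atom.GetTotalNumHs(),           # attached hydrogens
            "chiral_tag": str(atom.GetChiralTag()),  # e.g. CHI_TETRAHEDRAL_CCW
        })
    return labels

print(atom_feature_labels("C[C@H](N)C(=O)O"))  # alanine
```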

Pre-training task 2 – molecular feature prediction (Figure 1B): the model predicts molecular features computed by RDKit. This study uses the MACCS fingerprint, so this task can also be treated as a multi-task classification task (the fingerprint can be replaced with other fingerprints/descriptors);
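The MACCS keys used as labels for this task can be computed with RDKit as follows; each key becomes one binary classification target (RDKit returns a 167-slot bit vector whose bit 0 is unused).

```python
from rdkit import Chem
from rdkit.Chem import MACCSkeys

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
maccs = MACCSkeys.GenMACCSKeys(mol)                # ExplicitBitVect of length 167
bits = list(maccs)                                 # each 0/1 is one prediction target
print(sum(bits), "of", len(bits), "keys set")
```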

Pre-training task 3 – contrastive learning (Figure 1C): for a canonical SMILES input, SMILES randomization produces many different SMILES forms of the same molecule. The goal of this task is to maximize the cosine similarity between the embeddings of different SMILES strings of the same molecule and minimize the similarity between different molecules, so that the model better "understands" SMILES.
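A hedged sketch of such a contrastive objective is shown below in PyTorch: embeddings of two SMILES variants of the same batch of molecules are pulled together while different molecules are pushed apart. The InfoNCE-style loss and the temperature value are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
    """Pull embeddings of the same molecule together, push different molecules apart."""
    a = F.normalize(emb_a, dim=1)
    b = F.normalize(emb_b, dim=1)
    sim = a @ b.T                      # pairwise cosine similarities
    targets = torch.arange(a.size(0))  # row i should best match column i
    return F.cross_entropy(sim / 0.1, targets)  # 0.1 is an assumed temperature

loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```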

Figure 1. Knowledge-based BERT (K-BERT) pre-training strategy

Model training and evaluation

Input representation: every SMILES is tokenized with the method proposed by Schwaller et al., and the resulting tokens (such as 'O', 'Br', and '[C@@H]') are encoded as the input of K-BERT.
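For illustration, the regex tokenizer popularized by Schwaller et al.'s Molecular Transformer work splits a SMILES string into chemically meaningful tokens; the exact pattern used in this paper may differ slightly.

```python
import re

# Regex in the style of Schwaller et al.'s Molecular Transformer tokenizer
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str):
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    assert "".join(tokens) == smiles, "tokenizer dropped characters"
    return tokens

print(tokenize_smiles("C[C@@H](Br)c1ccccc1"))
# ['C', '[C@@H]', '(', 'Br', ')', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1']
```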

Pre-training: the atomic features of every heavy atom and the molecular features of every molecule are computed with RDKit and used for pre-training tasks 1 and 2. RDKit is also used to compute one canonical SMILES and four randomly generated SMILES for every molecule in ChEMBL, which are used for pre-training task 3. About 1.8 million molecules from ChEMBL were used to pre-train K-BERT, with the objective of minimizing the joint loss of the three pre-training tasks.
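Generating one canonical SMILES plus four randomized SMILES per molecule is straightforward with RDKit, as in this minimal sketch:

```python
from rdkit import Chem

def smiles_variants(smiles: str, n_random: int = 4):
    """One canonical SMILES plus n_random randomized SMILES of the same molecule."""
    mol = Chem.MolFromSmiles(smiles)
    canonical = Chem.MolToSmiles(mol)  # canonical by default
    randomized = [Chem.MolToSmiles(mol, canonical=False, doRandom=True)
                  for _ in range(n_random)]
    return [canonical] + randomized

print(smiles_variants("C=CCC(O)CC(C)(C)C"))
```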

Fine-tuning: as shown in Figure 1D, K-BERT has six transformer encoder layers. The parameters of the first five transformer encoder layers are loaded from the pre-trained model, while the sixth transformer encoder layer and the prediction layer are randomly re-initialized. The model is then retrained on the downstream task data.
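The layer-wise loading can be sketched in PyTorch as below; the stand-in model and the parameter-name prefix "layers.{i}." are assumptions, since the real checkpoint keys depend on how K-BERT is implemented.

```python
import torch.nn as nn

# Stand-in for K-BERT: a 6-layer transformer encoder
encoder_layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
model = nn.TransformerEncoder(encoder_layer, num_layers=6)

# Stand-in for torch.load("k_bert_pretrained.pt"): a pre-trained state dict
pretrained = model.state_dict()

# Keep only encoder layers 0-4; layer 5 (and any prediction head) keeps its fresh init
kept = {k: v for k, v in pretrained.items()
        if k.startswith(tuple(f"layers.{i}." for i in range(5)))}
model.load_state_dict(kept, strict=False)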

Data augmentation: each molecule's SMILES is randomly expanded with RDKit into five different SMILES. In the training set, each SMILES is treated as a separate (different) molecule. In the test and validation sets, the different SMILES of the same molecule are treated as one molecule, and the mean of the predictions over the five different SMILES is taken as the prediction for that molecule.
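A self-contained sketch of this test-time averaging, where `predict` is a hypothetical stand-in for the fine-tuned K-BERT scorer:

```python
import numpy as np
from rdkit import Chem

def predict_molecule(smiles: str, predict) -> float:
    """Average a scorer over one canonical and four randomized SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    variants = [Chem.MolToSmiles(mol)] + \
               [Chem.MolToSmiles(mol, doRandom=True) for _ in range(4)]
    return float(np.mean([predict(s) for s in variants]))

print(predict_molecule("CCO", predict=lambda s: len(s) / 10))  # toy scorer
```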

ROC-AUC is used to evaluate the performance of classification models; R², MAE, and RMSE are used to evaluate the performance of regression models.
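These metrics can be computed with scikit-learn, for example:

```python
import numpy as np
from sklearn.metrics import (roc_auc_score, r2_score,
                             mean_absolute_error, mean_squared_error)

# Classification: ROC-AUC
y_true_cls, y_prob = [0, 1, 1, 0], [0.2, 0.8, 0.6, 0.3]
print("ROC-AUC:", roc_auc_score(y_true_cls, y_prob))

# Regression: R2, MAE, RMSE
y_true, y_pred = np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.9, 3.2])
print("R2:  ", r2_score(y_true, y_pred))
print("MAE: ", mean_absolute_error(y_true, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))
```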

Datasets and experimental tasks

Small druggability datasets: the authors tested the performance of K-BERT on 15 small druggability datasets, each containing fewer than 2000 molecules. The specific datasets are: Pgp-substrate (Pgp-sub), human intestinal absorption (HIA), human oral bioavailability 20% (F20%), human oral bioavailability 30% (F30%), CYP substrate (CYP1A2-sub, CYP2C19-sub, CYP2C9-sub, CYP2D6-sub and CYP3A4-sub), half-life (T1/2), drug-induced liver injury (DILI), FDA maximum recommended daily dose (FDAMDD), skin sensitization (SkinSen), carcinogenicity (Carcinogenicity), and respiratory toxicity (Respiratory).

Malaria dataset: the Malaria dataset is a subset collected from the Malaria Treatment Response Portal; its molecules contain chirality information. This dataset is used to evaluate whether K-BERT can learn chirality information.

CHIRAL1 dataset: the CHIRAL1 dataset is a filtered subset of the dopamine D4 receptor docking data reported by Lyu et al. Each molecule in CHIRAL1 has exactly one tetrahedral chiral center, labeled R or S according to its chirality. In this study, a total of 204,778 molecules were used for further pre-training so that K-BERT can learn chirality information.

Experimental results

Performance of K-BERT on the small druggability datasets

The authors first evaluated the performance of K-BERT on the 15 druggability datasets. K-BERT achieved excellent performance (Table 1), obtaining the best results on 8 of the datasets.

Table 1. Performance of K-BERT and other methods on the 15 druggability datasets.

Pre-training improves the model's ability to extract molecular features

Data augmentation gives SMILES-based models a clear advantage. This paper finds that both data augmentation and pre-training enhance the model's ability to extract molecular features from SMILES. As shown in Table 2, the authors trained models with different strategies. Because data augmentation is the variable being compared here, and the contrastive learning task performs a similar augmentation during pre-training, for fairness none of the models in Table 2 used the contrastive learning pre-training task. K-BERT-WCL (pre-trained without the contrastive learning task) significantly outperforms K-BERT-WP (no pre-training), showing that pre-training improves the model's ability to extract molecular features. Meanwhile, the fact that K-BERT-WP-AUG performs better than K-BERT-WP shows that data augmentation also helps the model better understand SMILES and thus improves performance. K-BERT-WCL and K-BERT-WCL-AUG perform almost identically, indicating that data augmentation brings very limited benefit to the pre-trained model. This is in line with expectations: through pre-training, the model has already learned the rules of SMILES well, so augmenting the data with different SMILES of the same molecule amounts to training on the same molecule many times, which naturally yields little further improvement.

Table 2. Performance of K-BERT under different pre-training and fine-tuning strategies.

The contrastive learning task makes the model better "understand" SMILES

The authors compared the average Tanimoto similarity of the embeddings generated by the model for different SMILES of the same molecule. As shown in Figure 2, with the contrastive learning pre-training task, the embedding similarity is significantly improved, which shows that contrastive learning helps the model recognize different SMILES strings of the same molecule. In addition, using the molecule 'C=CCC(O)CC(C)(C)C' (not in the pre-training dataset) as an example, the authors randomly generated ten SMILES strings with RDKit and visualized the embeddings of the different atoms in the molecule with t-SNE (a rough sketch of this pipeline follows Figure 3). The results, shown in Figure 3, indicate that with the contrastive learning pre-training task, the model can identify the same chemical environment across different SMILES.

Figure 2. Comparison of the average Tanimoto similarity of the embeddings of 50 molecules

Figure 3. t-SNE visualization of the embeddings of different atoms
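A hedged sketch of the Figure 3 visualization pipeline; the random vectors below are stand-ins for the per-atom embeddings that K-BERT would actually produce.

```python
import numpy as np
from rdkit import Chem
from sklearn.manifold import TSNE

mol = Chem.MolFromSmiles("C=CCC(O)CC(C)(C)C")
variants = [Chem.MolToSmiles(mol, doRandom=True) for _ in range(10)]

# Stand-in for K-BERT output: one 256-d vector per heavy atom per SMILES variant
atom_embeddings = np.random.randn(len(variants) * mol.GetNumAtoms(), 256)
coords = TSNE(n_components=2, perplexity=15).fit_transform(atom_embeddings)
print(coords.shape)  # (n_atom_tokens, 2): points to color by atom identity
```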

K-BERT can generate a general-purpose molecular fingerprint, K-BERT-FP

The molecular embedding generated by K-BERT can be used as a general-purpose molecular fingerprint, K-BERT-FP (not limited to a single task). The authors compared the performance of K-BERT-FP and MACCS on the druggability datasets (Figure 4); the results show that K-BERT-FP and MACCS achieve comparable predictive power.

Figure 4. Comparison of K-BERT-FP and MACCS

K-BERT can capture molecular size information that MACCS cannot

To show that K-BERT-FP is not a simple replication of MACCS but can capture information that MACCS cannot (such as molecular size), the authors first compared the TMAP visualizations of K-BERT-FP and MACCS on the molecules of the DrugBank dataset. As shown in Figure 5, both K-BERT-FP and MACCS visualize the DrugBank molecules well, and K-BERT-FP does not organize them better than MACCS. This may be because fragment information in large molecules implicitly encodes molecular size, so MACCS can also reflect it to some extent. For a sharper comparison, the authors constructed the Sim-Sub-Dataset, generated by repeating similar fragments (Figure 6). Because MACCS only records whether a molecule contains a given fragment, not how many times the fragment occurs, MACCS cannot reflect the molecular size of such molecules. The authors then compared the ability of K-BERT-FP and MACCS to predict the molecular weights of this dataset. As shown in Table 3, K-BERT-FP significantly outperforms MACCS, which shows that K-BERT-FP captures information that MACCS cannot (a minimal demonstration of this limitation follows Table 3).

Figure 5. TMAP visualizations of K-BERT-FP and MACCS on DrugBank

Figure 6. How the Sim-Sub-Dataset is generated

Table 3. Performance of K-BERT-FP and MACCS on the Sim-Sub-Dataset
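The limitation can be demonstrated directly with RDKit: repeating a fragment nearly triples the molecular weight while the binary MACCS keys still overlap heavily. The molecules below are illustrative and are not taken from the Sim-Sub-Dataset.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import MACCSkeys, Descriptors

short = Chem.MolFromSmiles("CCOCC")           # one ether fragment
long_ = Chem.MolFromSmiles("CCOCCOCCOCCOCC")  # the same fragment repeated

fp_s, fp_l = MACCSkeys.GenMACCSKeys(short), MACCSkeys.GenMACCSKeys(long_)
print("Tanimoto:", DataStructs.TanimotoSimilarity(fp_s, fp_l))  # high key overlap
print("MW:", Descriptors.MolWt(short), "vs", Descriptors.MolWt(long_))
```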

After further pre-training, K-BERT can capture chirality information that MACCS cannot

The authors first compared the ability of K-BERT and MACCS to characterize 2500 pairs of enantiomers from the CHIRAL1 dataset. As shown in Figures 7A and 7B, neither K-BERT nor MACCS distinguishes the CHIRAL1 enantiomers well (a few-line demonstration follows Figure 7). To make K-BERT-FP contain chirality information, the authors further pre-trained K-BERT on the chiral dataset CHIRAL1: for K-BERT-FP-CHIRAL1 the molecular feature prediction pre-training task still targets the MACCS fingerprint, while for K-BERT-FP-CHIRAL1-R-S it is changed to predicting the R/S chirality of the molecule. As shown in Figures 7C and 7D, K-BERT-FP-CHIRAL1-R-S distinguishes the enantiomers well. Moreover, the enantiomer of each molecule in one group can be found in the other group, which shows that K-BERT-FP-CHIRAL1-R-S still encodes structural information while characterizing chirality. The authors also compared the predictive performance of different fingerprints on the chiral dataset Malaria (modeled with XGBoost). The results show that K-BERT-FP-CHIRAL1-R-S outperforms the other fingerprints, which means that with customized pre-training tasks, K-BERT can focus on chirality information and thus improve its ability to extract chiral features.

Figure 7. TMAP visualizations of 2500 pairs of enantiomers from the CHIRAL1 dataset. (A) MACCS TMAP color-coded by R/S chirality; (B) K-BERT-FP TMAP color-coded by R/S chirality; (C) K-BERT-FP-CHIRAL1 TMAP color-coded by R/S chirality; (D) K-BERT-FP-CHIRAL1-R-S TMAP color-coded by R/S chirality.
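Why binary fingerprints miss chirality can be seen in a few lines of RDKit: the two mirror-image forms of a chiral molecule yield identical MACCS keys. The example molecule is illustrative, not drawn from CHIRAL1.

```python
from rdkit import Chem
from rdkit.Chem import MACCSkeys

isomer_a = Chem.MolFromSmiles("C[C@H](N)C(=O)O")   # one enantiomer of alanine
isomer_b = Chem.MolFromSmiles("C[C@@H](N)C(=O)O")  # its mirror image

fp_a = MACCSkeys.GenMACCSKeys(isomer_a)
fp_b = MACCSkeys.GenMACCSKeys(isomer_b)
print(list(fp_a) == list(fp_b))  # True: MACCS cannot tell enantiomers apart
```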

Summary

The authors propose K-BERT, a pre-training strategy that extracts molecular features like a computational chemist. K-BERT learns to extract molecular features from SMILES strings more effectively and shows strong predictive power on druggability prediction datasets. In addition, the authors found that K-BERT can generate a general-purpose molecular fingerprint, K-BERT-FP, which captures molecular size information that MACCS cannot. After further pre-training, K-BERT-FP can also capture chirality information. This shows that, given an understanding of the task at hand, different pre-training tasks can be designed to make K-BERT-FP capture specific molecular characteristics.

Reference

Zhenxing Wu, Dejun Jiang, Jike Wang, Xujun Zhang, Hongyan Du, Lurong Pan, Chang-Yu Hsieh, Dongsheng Cao, Tingjun Hou. Knowledge-based BERT: a method to extract molecular features like computational chemists. Briefings in Bioinformatics, 2022, bbac131. https://doi.org/10.1093/bib/bbac131
