RDKit I: Using RDKit to screen small molecules by structural features
2022-07-29 03:25:00 【Order anything】
Recently I have been working on a small-molecule screening project that involves a number of processing methods. I will summarize the problems I ran into, and their solutions, one by one.
First, a brief introduction to RDKit. RDKit is an open-source Python package for working with small chemical molecules. It originated at Rational Discovery and has since been developed further with contributions from Novartis; its core is written in C++. The source code and documentation are on GitHub:
https://github.com/rdkit/rdkit
To install RDKit in an Anaconda or Miniconda environment:
conda install -yq -c rdkit rdkit
Next, a brief introduction to SMILES. A SMILES string is essentially a way of encoding the two-dimensional structure of a small molecule as text. With RDKit, a structure drawn in ChemDraw and saved in .mol format, or a structure with spatial coordinates in SDF format, can be converted to a SMILES string and then fed into machine learning and deep learning models.
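As a minimal sketch of that conversion, the snippet below round-trips a molecule through a MOL block (the text format stored in .mol files and SDF records) and back to SMILES. Ethanol is an arbitrary example molecule chosen for illustration, not data from the project.

```python
from rdkit import Chem

# Build a molecule from a SMILES string (ethanol, chosen only for illustration)
mol = Chem.MolFromSmiles('CCO')

# Write it out as a MOL block, the text format used by .mol files and SDF records
mol_block = Chem.MolToMolBlock(mol)

# Read the MOL block back and convert it to a canonical SMILES string
mol_from_block = Chem.MolFromMolBlock(mol_block)
smiles = Chem.MolToSmiles(mol_from_block)
print(smiles)
```

For files on disk, Chem.MolFromMolFile reads a single .mol file and Chem.SDMolSupplier iterates over the records of an SDF file; both yield molecule objects that can be passed to Chem.MolToSmiles in the same way.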
Now to the main task:
From one million small molecules, keep only those with fewer than 2 NO2 groups, fewer than 3 Cl atoms, fewer than 2 Br atoms, fewer than 6 F atoms, and fewer than 5 aromatic rings.
Solving this takes two RDKit modules. Only the simplest usage is shown here; if you are interested, see the package documentation or the source code:
1. rdkit.Chem.Lipinski
Lipinski's rule of five is a common set of constraints for small-molecule drugs. RDKit's Lipinski module computes a range of descriptors: given a molecule's SMILES string, you can obtain values such as HeavyAtomCount and NumAromaticRings. The most commonly used include NumAromaticRings, NumHAcceptors, NumHDonors, and NumRotatableBonds.
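A short sketch of how these descriptors are called, assuming aspirin as an arbitrary illustrative input. Each function takes a molecule object rather than the raw SMILES string, so the SMILES is parsed first.

```python
from rdkit import Chem
from rdkit.Chem import Lipinski

# Aspirin, used here only as an illustrative input
mol = Chem.MolFromSmiles('CC(=O)Oc1ccccc1C(=O)O')

# Each descriptor function takes the molecule object and returns an integer
descriptors = {
    'HeavyAtomCount': Lipinski.HeavyAtomCount(mol),
    'NumAromaticRings': Lipinski.NumAromaticRings(mol),
    'NumHAcceptors': Lipinski.NumHAcceptors(mol),
    'NumHDonors': Lipinski.NumHDonors(mol),
    'NumRotatableBonds': Lipinski.NumRotatableBonds(mol),
}
for name, value in descriptors.items():
    print(name, value)
```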
2. rdkit.Chem
The rdkit.Chem package contains functions for working with molecule objects, including atom operations, bond operations, ring operations, and pharmacophore search.
Here we mainly use atom operations, which include the following functions:
- Iterate over the atoms: m.GetAtoms()
- Get the atom index: GetIdx()
- Get the atomic number: GetAtomicNum()
- Get the element symbol: GetSymbol()
- Get the degree, i.e. the number of explicit neighbors (depends on whether H atoms are explicit in the molecule): GetDegree()
- Get the total degree, including hydrogens (does not depend on whether H atoms are explicit): GetTotalDegree()
- Get the formal charge: GetFormalCharge()
- Get the hybridization: GetHybridization()
- Get the explicit valence: GetExplicitValence()
- Get the implicit valence: GetImplicitValence()
- Get the total valence: GetTotalValence()
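The sketch below calls several of the atom methods listed above on one molecule. The molecule (4-chloronitrobenzene) is an arbitrary example, and the final line shows one possible way to count atoms of a given element.

```python
from rdkit import Chem

# 4-chloronitrobenzene, purely illustrative
mol = Chem.MolFromSmiles('O=[N+]([O-])c1ccc(Cl)cc1')

# Print a few per-atom properties for every atom in the molecule
for atom in mol.GetAtoms():
    print(
        atom.GetIdx(),
        atom.GetSymbol(),
        atom.GetAtomicNum(),
        atom.GetDegree(),
        atom.GetTotalDegree(),
        atom.GetFormalCharge(),
        atom.GetHybridization(),
        atom.GetTotalValence(),
    )

# Count the atoms of one element by comparing element symbols
num_cl = sum(1 for atom in mol.GetAtoms() if atom.GetSymbol() == 'Cl')
print('Cl atoms:', num_cl)
```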
There is a very well-written Zhihu article on this topic:
RDKit | Molecular basics and pharmacophore search (Zhihu): https://zhuanlan.zhihu.com/p/143111689
Before starting on the code, a word on the input format, since I have found that for seasoned biochemists who are new to RDKit it helps to spell things out. The input is the molecules' SMILES strings: put the SMILES of all molecules to be screened into a single csv or txt file, with the column named smiles.
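As a rough illustration of that layout, the snippet below reads a tiny in-memory stand-in for such a file. The three SMILES strings are hypothetical placeholders, not molecules from the actual data set.

```python
import io

import pandas as pd

# Hypothetical file contents: a header line "smiles", then one SMILES per line
csv_text = """smiles
CCO
c1ccccc1
O=[N+]([O-])c1ccc(Cl)cc1
"""

# io.StringIO stands in for a real file path such as 'smiles.csv'
df = pd.read_csv(io.StringIO(csv_text))
print(df.columns.tolist())
print(len(df))
```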
Without further ado, here is the code:
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Lipinski
# Read in the data
df = pd.read_csv('smiles.csv')
# Filter by aromatic ring count: keep molecules with fewer than 5 aromatic rings
df['NumAromaticRings'] = df['smiles'].apply(lambda x:Lipinski.NumAromaticRings(Chem.MolFromSmiles(x)))
# This step can also be parallelized if needed (parallel_apply, e.g. from the pandarallel package)
df = df.drop(df[df.NumAromaticRings >= 5].index)
# Count the F, Cl and Br atoms in each molecule
m = [Chem.MolFromSmiles(x) for x in df.smiles.tolist()]
# Number of fluorine atoms per molecule
num_F = []
for i in range(len(m)):
    F = [atom.GetSymbol() for atom in m[i].GetAtoms()].count('F')
    num_F.append(F)
# Number of chlorine atoms per molecule
num_Cl = []
for i in range(len(m)):
    Cl = [atom.GetSymbol() for atom in m[i].GetAtoms()].count('Cl')
    num_Cl.append(Cl)
# Number of bromine atoms per molecule
num_Br = []
for i in range(len(m)):
    Br = [atom.GetSymbol() for atom in m[i].GetAtoms()].count('Br')
    num_Br.append(Br)
# Check the maximum F, Cl and Br counts to see whether filtering is needed
print(f'max_F:{max(num_F)};max_Cl:{max(num_Cl)};max_Br:{max(num_Br)}')
# Filter on the halogen counts
transform = {'num_F':num_F,'num_Cl':num_Cl,'num_Br':num_Br}
atom_nums = pd.DataFrame(transform)
# Reset the index first: rows were dropped above, and concat with axis=1 aligns on the index
df = pd.concat([df.reset_index(drop=True),atom_nums],axis=1)
df.info()
df = df.drop(df[df.num_F >= 6].index)
df = df.drop(df[df.num_Cl >= 3].index)
df = df.drop(df[df.num_Br >= 2].index)
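# Screen for NO2 groups with a plain substring search on the SMILES string
# (this only matches nitro groups written exactly as [N+](=O)[O-])
dff = df[df['smiles'].str.contains(pat='[N+](=O)[O-]',regex=False)]
# Count the NO2 groups in each SMILES that contains at least one
smiles = dff.smiles.tolist()
num_NO2 = [x.count('[N+](=O)[O-]') for x in smiles]
transform2 = {'smiles':smiles,'num_NO2':num_NO2}
FG_nums = pd.DataFrame(transform2)
# Keep only the molecules to reject (2 or more NO2 groups)
FG_nums = FG_nums[FG_nums.num_NO2 >= 2]
# Take the difference with df on smiles: rejected molecules now appear twice and are all dropped
# (keep=False also drops any SMILES that was already duplicated in the input)
df2 = pd.concat([df,FG_nums])
df2 = df2.drop_duplicates(subset=['smiles'],keep=False)
df2 = df2.drop(columns=['num_NO2'])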
# Save the final result
df2.to_csv('result.csv',header = True, index = False)
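As an alternative to the string search, here is a sketch that counts nitro groups by substructure matching, so that the result does not depend on how the group happens to be written in the SMILES. This is not the method used in the code above; the helper name passes_filters and the SMARTS pattern are assumptions made for illustration, and the thresholds are the ones stated in the task.

```python
from rdkit import Chem
from rdkit.Chem import Lipinski

# SMARTS for a charge-separated nitro group (illustrative pattern, an assumption).
# RDKit normalizes N(=O)=O to this form when it sanitizes a parsed SMILES.
NITRO = Chem.MolFromSmarts('[N+](=O)[O-]')


def passes_filters(smiles):
    """Hypothetical helper: True if the molecule meets all five thresholds."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # Unparsable SMILES are rejected rather than raising an error
        return False
    symbols = [atom.GetSymbol() for atom in mol.GetAtoms()]
    return (
        len(mol.GetSubstructMatches(NITRO)) < 2
        and symbols.count('Cl') < 3
        and symbols.count('Br') < 2
        and symbols.count('F') < 6
        and Lipinski.NumAromaticRings(mol) < 5
    )


print(passes_filters('O=[N+]([O-])c1ccc(Cl)cc1'))
print(passes_filters('O=N(=O)c1ccc(N(=O)=O)cc1'))
```

With a DataFrame like the one above, the same check could be applied in one pass with df[df['smiles'].apply(passes_filters)], which also avoids the index alignment and duplicate-row pitfalls of the concat-based approach.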
I post one article a week. Next week's will cover clustering with RDKit.