RDKit: an introduction to SMILES, SMARTS and Morgan fingerprints (ECFP)
2022-07-29 03:25:00 【Order anything】
Before introducing these three encodings, here is a brief overview of the data formats that RDKit can read and convert between: .smi (SMILES strings), .mol (an MDL molfile: a table of atoms and bonds together with the atom coordinates), .sdf (a structure-data file: one or more molfile records with associated data fields), and others. These are the common ones; if you run into a less common format, see the Chem module in the RDKit source package.
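To make the idea of converting between formats concrete, here is a minimal sketch that goes from a SMILES string to mol-block text (the content of a .mol file) and back. Ethanol ('CCO') and the variable names are illustrative assumptions, not taken from the article's data.

```python
from rdkit import Chem

# Start from a SMILES string (ethanol is an illustrative choice).
mol = Chem.MolFromSmiles('CCO')

# Convert to an MDL mol block, i.e. the text a .mol file would contain.
mol_block = Chem.MolToMolBlock(mol)

# Read the mol block back and compare the canonical SMILES of both molecules.
mol_again = Chem.MolFromMolBlock(mol_block)
smiles_before = Chem.MolToSmiles(mol)
smiles_after = Chem.MolToSmiles(mol_again)
round_trip_ok = smiles_before == smiles_after
```

For .sdf files, Chem.SDWriter and Chem.SDMolSupplier play the corresponding writing and reading roles.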
SMILES code
SMILES stands for Simplified Molecular-Input Line-Entry System.
As the name suggests, SMILES is a widely used way of describing a molecule as a text string. A SMILES string describes the atoms and bonds of a molecule in a way that is simple and intuitive for chemists.
Compared with other molecular representations, SMILES has two advantages:
1. Uniqueness: every valid SMILES string corresponds to exactly one chemical structure. The reverse holds only for canonical SMILES: a structure can be written as many different valid SMILES strings, but a canonicalization algorithm (such as the one in RDKit) produces a single canonical string per structure, which gives a one-to-one correspondence within that toolkit.

2. Compactness: compared with most other ways of representing a structure, SMILES saves storage space. A SMILES string takes 50% to 70% less space than an equivalent binary table, and it also compresses well: with Ziv-Lempel compression, the same database can be reduced to 27% of its original size.
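Different valid SMILES strings can describe the same structure, and it is the toolkit's canonical form that provides uniqueness. The sketch below checks this for ethanol; the helper name canonical and the three input strings are illustrative assumptions.

```python
from rdkit import Chem

def canonical(smiles):
    """Return RDKit's canonical SMILES for a SMILES string (hypothetical helper)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f'could not parse SMILES: {smiles!r}')
    # MolToSmiles writes canonical SMILES by default.
    return Chem.MolToSmiles(mol)

# Three different ways of writing ethanol.
variants = ['CCO', 'OCC', 'C(O)C']
canonical_forms = {canonical(s) for s in variants}
```

All three inputs collapse to a single canonical string, so comparing canonical SMILES is a simple way to deduplicate a list of molecules. Canonical forms are specific to the toolkit that generated them.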
For the SMILES syntax rules, please refer to the following blog post. In practice, it does not matter much if you have not mastered the rules, as long as you can use RDKit.
With the Chem module of the RDKit package, once you have the SMILES string of a molecule you can compute some of its physicochemical properties from it. The following code shows the process:
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors, QED
from moses.metrics import SA
# This table has a single column of SMILES strings; the column header is 0
df = pd.read_csv('smiles.csv')
# Parse each SMILES string once and reuse the resulting Mol objects
mols = df['0'].apply(Chem.MolFromSmiles)
# Calculate logP, TPSA, MW, HBA and HBD
df['logP'] = mols.apply(Descriptors.MolLogP)
df['TPSA'] = mols.apply(Descriptors.TPSA)
df['MW'] = mols.apply(Descriptors.MolWt)
df['HBA'] = mols.apply(rdMolDescriptors.CalcNumLipinskiHBA)
df['HBD'] = mols.apply(rdMolDescriptors.CalcNumLipinskiHBD)
# Calculate QED (QED.qed returns the score; QED.properties returns the underlying property tuple)
df['QED'] = mols.apply(QED.qed)
# Calculate SA (synthetic accessibility score)
df['SA'] = mols.apply(SA)
SMARTS code
SMARTS is an extension of the SMILES language described above that is used to write queries. A SMARTS pattern can be thought of as the chemical counterpart of a regular expression for text search (put another way, SMARTS lets you do a fuzzy search over SMILES-encoded structures).
SMARTS is generally used in the following situations:
Searching a molecular database to identify molecules that contain a specific substructure;
Aligning a group of molecules on a common substructure to improve the visual presentation;
Highlighting a substructure in a drawing;
Constraining a substructure during a calculation.
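The first use case, substructure search, can be sketched as follows. The hydroxyl pattern '[OX2H]' and the three-molecule library are illustrative assumptions chosen to keep the example small.

```python
from rdkit import Chem

# SMARTS query: an oxygen with two connections, one of which is a hydrogen (a hydroxyl group).
hydroxyl = Chem.MolFromSmarts('[OX2H]')

# A tiny hypothetical "database" of named SMILES strings.
library = {
    'ethanol': 'CCO',
    'dimethyl ether': 'COC',
    'phenol': 'c1ccccc1O',
}

# For every molecule, collect the atom indices matched by the query.
hits = {}
for name, smiles in library.items():
    mol = Chem.MolFromSmiles(smiles)
    hits[name] = mol.GetSubstructMatches(hydroxyl)
```

An empty tuple means no match. The matched atom indices can be passed on to drawing functions to highlight the substructure.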
For the SMARTS syntax rules, please refer to the official Daylight site. As with SMILES, it does not matter much if you have not mastered the rules, as long as you can use RDKit.
Daylight > SMARTS Examples
https://www.daylight.com/dayhtml_tutorials/languages/smarts/smarts_examples.html#ES_AM
The following code block shows a small SMARTS example:
from rdkit import Chem
from rdkit.Chem import Draw
import pandas as pd
# If the SMILES of a group is known, RDKit can convert it to SMARTS; here an imidazole-type ring and guanidine are converted
# Note: 'C1NCNC1' is the saturated ring (imidazolidine), so its SMARTS matches only single-bonded rings; aromatic imidazole is written 'c1cnc[nH]1'
S1 = Chem.MolToSmarts(Chem.MolFromSmiles('C1NCNC1'))
S2 = Chem.MolToSmarts(Chem.MolFromSmiles('NC(=N)N'))
# S1: [#6]1-[#7]-[#6]-[#7]-[#6]-1; S2 (guanidine): [#7]-[#6](=[#7])-[#7]
# Read in the data: a single column of SMILES strings whose header is 0
df = pd.read_csv("smiles_sorted.csv")
suppl_list = df['0'].tolist()
suppl_end = [Chem.MolFromSmiles(x) for x in suppl_list]
# Define the SMARTS patterns and compile them once
PosIonizable_Guanidine = Chem.MolFromSmarts('[#7]-[#6](=[#7])-[#7]')
PosIonizable_Imidazole = Chem.MolFromSmarts('[#6]1-[#7]-[#6]-[#7]-[#6]-1')
# Screening function for the guanidine and imidazole groups
def pharmacophore_smarts(m):
    '''Takes a single Mol object and returns one boolean per pattern.'''
    PosIonizable_1 = m.HasSubstructMatch(PosIonizable_Guanidine)
    PosIonizable_2 = m.HasSubstructMatch(PosIonizable_Imidazole)
    return PosIonizable_1, PosIonizable_2
# Screen for guanidine and imidazole groups; the result is a DataFrame
result = pd.DataFrame([pharmacophore_smarts(m) for m in suppl_end], columns=['PosIonizable_1', 'PosIonizable_2'])
# Keep only the molecules that contain both groups (use | instead of & to keep molecules that contain either)
result_new = result[result.PosIonizable_1 & result.PosIonizable_2]
result_new.index
# Draw the structures of the molecules that qualify
draw = [suppl_list[i] for i in result_new.index]
mols = [Chem.MolFromSmiles(m) for m in draw]
img = Draw.MolsToGridImage(mols, molsPerRow=4, subImgSize=(300, 300), legends=['' for x in mols], returnPNG=False)
img.save('demo.png')
Morgan fingerprint (ECFP)
A chemical fingerprint is a vector of 1s and 0s in which each bit indicates the presence or absence of a particular structural feature in the molecule. Morgan fingerprints, also known as extended-connectivity fingerprints (ECFP), are a featurized representation that combines several such features. They convert molecules of any size into vectors of a fixed length. This matters because many models require every input to have exactly the same size: with ECFPs, molecules of many different sizes can be fed to the same model. ECFPs are also easy to compare: the similarity of two molecules can be computed from their ECFPs with the Tanimoto similarity.
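For reference, the Tanimoto similarity of two bit-vector fingerprints A and B is the number of bits set in both divided by the number of bits set in either, T(A, B) = |A ∩ B| / |A ∪ B|. The sketch below computes it in plain Python on sets of on-bit indices; the function name, the toy fingerprints and the handling of two empty fingerprints are assumptions made for illustration.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union

# Toy fingerprints: only the indices of the bits set to 1 are stored.
fp_a = {1, 5, 9, 42}
fp_b = {1, 5, 77}

sim = tanimoto(fp_a, fp_b)   # 2 shared bits / 5 distinct bits = 0.4
dist = 1.0 - sim             # the corresponding Tanimoto distance
```

A value of 1.0 means the two fingerprints are identical and 0.0 means they share no bits.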
If you are interested in the encoding and the principle behind Morgan fingerprints, see the following two papers (1. DOI:10.1021/ci9803381; 2. DOI:10.1021/ci100050t).
And here I go again with the same advice: it does not matter much if you have not mastered the encoding rules, as long as you can use RDKit.
The following code shows a small example of using ECFP:
import pandas as pd
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
data = pd.read_csv('moses_qed_props.csv')
# The SMILES strings are read from the SMILES column
data_1 = data.SMILES.tolist()
# Function that generates Morgan fingerprints
def product_fps(data):
    """Takes a list of SMILES strings and returns their Morgan fingerprints."""
    data = [x for x in data if x is not None]
    data_mols = [Chem.MolFromSmiles(s) for s in data]
    # Drop the molecules that RDKit could not parse
    data_mols = [x for x in data_mols if x is not None]
    # Radius 3, folded to 2048 bits
    data_fps = [AllChem.GetMorganFingerprintAsBitVect(x, 3, nBits=2048) for x in data_mols]
    return data_fps
# Function that calculates molecular similarity
def similar(data):
    """Takes a list of SMILES strings and returns the Tanimoto similarity of every pair."""
    fps = product_fps(data)
    similarity = []
    for i in range(len(fps)):
        # Compare fingerprint i with all earlier fingerprints, so each pair is counted once
        sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
        similarity.extend(sims)
    return similarity
# Call the functions and print the results
fps = product_fps(data_1)
print(f'fps:{fps[:20]}')
similarity = similar(data_1)
print(f'similarity:{similarity[:20]}')
(Note: unlike a canonical SMILES string, an ECFP is not unique: different molecules can produce the same fingerprint.)
I publish one RDKit-related article a week. Writing these summaries is not easy, so if this one helped you, I hope you will give it a like. Thank you very much.