
RDKit: an introduction to SMILES, SMARTS, and Morgan fingerprints (ECFP)

2022-07-29 03:25:00 【Order anything】

Before introducing these three encodings, here is a brief overview of the data formats that RDKit can read and convert between: .smi (SMILES strings), .mol (a single molecule as a connection table, i.e. the molecular graph), .sdf (molecules with atom coordinates), and so on. These are the common ones. If you run into a less common format, look at the Chem module in the RDKit source package.
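As a rough illustration of converting between these formats, the sketch below parses a SMILES string, serializes it as a MOL block (the text that a .mol file contains), and reads it back. Ethanol is an arbitrary example molecule chosen here, not something from the original post.

```python
from rdkit import Chem

# Parse a SMILES string (what one line of a .smi file holds) into an RDKit Mol
mol = Chem.MolFromSmiles('CCO')  # ethanol, used only as an example

# Serialize the molecule as a MOL block (the text content of a .mol file)
mol_block = Chem.MolToMolBlock(mol)

# Read the MOL block back and write it out as SMILES again
round_trip = Chem.MolFromMolBlock(mol_block)
smiles_again = Chem.MolToSmiles(round_trip)
print(smiles_again)
```

For .sdf files, RDKit provides Chem.SDMolSupplier for reading and Chem.SDWriter for writing.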

SMILES code

SMILES stands for Simplified Molecular-Input Line-Entry System.

As the name suggests, SMILES is a widely used way of defining molecules with text strings. A SMILES string describes the atoms and bonds of a molecule in a form that chemists find simple and intuitive.

Compared with other ways of representing molecules, SMILES has two advantages:

1. Uniqueness (after canonicalization): every valid SMILES string corresponds to exactly one chemical structure. The reverse holds only for canonical SMILES. A structure can be written as many different valid SMILES strings, but a canonicalization algorithm, such as the one RDKit applies when it writes SMILES, produces a single string per structure, which restores the one-to-one correspondence.

2. Compactness: SMILES needs less storage than most other ways of representing structures. A SMILES string takes 50% to 70% less space than an equivalent connection table, even a binary one. SMILES also compresses very well: with Ziv-Lempel compression, the same database can be reduced to 27% of its original size.
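A small sketch of both points, using example molecules picked here for illustration: the same molecule can be typed in as several valid SMILES strings, RDKit's canonical output maps them all to one string, and a SMILES string is much shorter than the MOL block for the same molecule.

```python
from rdkit import Chem

# Three valid ways of writing ethanol
variants = ['OCC', 'C(O)C', 'CCO']

# MolToSmiles writes canonical SMILES by default, so the set collapses to one entry
canonical = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in variants}
print(canonical)

# Compare the size of a SMILES string with a MOL block (connection table) for aspirin
mol = Chem.MolFromSmiles('CC(=O)Oc1ccccc1C(=O)O')
smiles_len = len(Chem.MolToSmiles(mol))
molblock_len = len(Chem.MolToMolBlock(mol))
print(smiles_len, molblock_len)
```

Canonical SMILES are specific to the toolkit that generates them, so strings should be canonicalized with the same software before they are compared.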

For the SMILES encoding rules, see the blog post below. In practice, you do not need to master the rules, as long as you know how to use RDKit.

SMILES: A simplified molecular language (xk6891's blog, CSDN): https://blog.csdn.net/xk6891/article/details/116380262?spm=1001.2101.3001.6650.3&utm_medium=distribute.pc_relevant.none-task-blog-2~default~CTRLIST~Rate-3-116380262-blog-124738689.pc_relevant_paycolumn_v3&depth_1-utm_source=distribute.pc_relevant.none-task-blog-2~default~CTRLIST~Rate-3-116380262-blog-124738689.pc_relevant_paycolumn_v3&utm_relevant_index=6

(Note: SMILES can also represent chirality.)
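To make the chirality note concrete, here is a minimal sketch that assumes alanine as the example molecule (it is not taken from the original post). The @ and @@ markers encode the two enantiomers, and RDKit reports the stereocenter for each.

```python
from rdkit import Chem

# The two enantiomers of alanine differ only in the @ / @@ marker
ala_a = Chem.MolFromSmiles('C[C@H](N)C(=O)O')
ala_b = Chem.MolFromSmiles('C[C@@H](N)C(=O)O')
# The same skeleton written without stereo information
ala_flat = Chem.MolFromSmiles('CC(N)C(=O)O')

# Each result is a list of (atom index, label) pairs
centers_a = Chem.FindMolChiralCenters(ala_a)
centers_b = Chem.FindMolChiralCenters(ala_b)
# Without includeUnassigned=True, an unspecified center is not reported
centers_flat = Chem.FindMolChiralCenters(ala_flat)
possible_flat = Chem.FindMolChiralCenters(ala_flat, includeUnassigned=True)

print(centers_a, centers_b, centers_flat, possible_flat)
```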

With the Chem module of RDKit, once you have a molecule's SMILES string you can compute a number of its physicochemical properties from it. The following code shows the process:

import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors, QED
from moses.metrics import SA  # synthetic accessibility score from the MOSES package

# This table has a single column of SMILES strings; the column header is "0"
df = pd.read_csv('smiles.csv')

# Parse each SMILES once. MolFromSmiles returns None for invalid input, so drop those rows.
df['mol'] = df['0'].apply(Chem.MolFromSmiles)
df = df.dropna(subset=['mol'])

# Descriptors: logP, TPSA, MW, HBA, HBD, NRB
df['logP'] = df['mol'].apply(Descriptors.MolLogP)
df['TPSA'] = df['mol'].apply(Descriptors.TPSA)
df['MW'] = df['mol'].apply(Descriptors.MolWt)
df['HBA'] = df['mol'].apply(rdMolDescriptors.CalcNumLipinskiHBA)
df['HBD'] = df['mol'].apply(rdMolDescriptors.CalcNumLipinskiHBD)
df['NRB'] = df['mol'].apply(rdMolDescriptors.CalcNumRotatableBonds)

# QED score (QED.qed returns the score; QED.properties returns only the descriptors it is built from)
df['QED'] = df['mol'].apply(QED.qed)

# SA score
df['SA'] = df['mol'].apply(SA)

SMARTS code

SMARTS is an extension of the SMILES language described above and is used to write queries. A SMARTS pattern can be thought of as the molecular counterpart of a regular expression for searching text (put another way, SMARTS is to SMILES roughly what a fuzzy search pattern is to a plain string).
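The regular-expression analogy can be sketched as follows. The pattern [F,Cl,Br,I] behaves like a character class and matches any one of the four halogens; the three test molecules are examples chosen here for illustration.

```python
from rdkit import Chem

# A SMARTS atom list works like a regex character class: any one of F, Cl, Br, I
halogen = Chem.MolFromSmarts('[F,Cl,Br,I]')

smiles_list = ['Clc1ccccc1', 'CCBr', 'CCO']  # chlorobenzene, bromoethane, ethanol

# Keep the molecules that contain at least one halogen atom
hits = [s for s in smiles_list
        if Chem.MolFromSmiles(s).HasSubstructMatch(halogen)]
print(hits)  # ['Clc1ccccc1', 'CCBr']
```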

SMARTS is typically used to:

Search a molecular database for molecules that contain a specific substructure;

Align a set of molecules on a common substructure, so that depictions are easier to compare;

Highlight a substructure in a drawing;

Constrain a substructure during a calculation.
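The first and third uses can be sketched together: search a small set of molecules for a substructure and collect the indices of the matching atoms, which is the information a drawing routine needs for highlighting. The carboxylic acid pattern and the three molecules are assumptions made for this example.

```python
from rdkit import Chem

# Carboxylic acid group: carbon, carbonyl oxygen, hydroxyl oxygen
acid = Chem.MolFromSmarts('C(=O)[OH]')

library = {
    'acetic acid': 'CC(=O)O',
    'ethanol': 'CCO',
    'benzoic acid': 'OC(=O)c1ccccc1',
}

matches = {}
for name, smi in library.items():
    mol = Chem.MolFromSmiles(smi)
    # Each match is a tuple of atom indices, ordered like the atoms in the pattern
    hit_atoms = mol.GetSubstructMatches(acid)
    if hit_atoms:
        matches[name] = hit_atoms

print(matches)
```

The atom index tuples can then be handed to a drawing function, for example through the highlightAtomLists argument of Draw.MolsToGridImage.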

For the SMARTS rules, see the official Daylight documentation. As with SMILES, you do not need to master the rules, as long as you know how to use RDKit.

Daylight > SMARTS Examples: https://www.daylight.com/dayhtml_tutorials/languages/smarts/smarts_examples.html#ES_AM

The following code block shows a small SMARTS example:

from rdkit import Chem
from rdkit.Chem import Draw
import pandas as pd


# If the SMILES of a fragment is known, RDKit can convert it to SMARTS.
# Here the five-membered ring and guanidine are converted to SMARTS.
S1 = Chem.MolToSmarts(Chem.MolFromSmiles('C1NCNC1'))
S2 = Chem.MolToSmarts(Chem.MolFromSmiles('NC(=N)N'))
# Ring: [#6]1-[#7]-[#6]-[#7]-[#6]-1; Guanidine: [#7]-[#6](=[#7])-[#7]
# Note: 'C1NCNC1' is the saturated ring (imidazolidine). Aromatic imidazole is written
# 'c1cnc[nH]1', and the single-bond pattern above does not match its aromatic bonds.

# Read in the data: a single column of SMILES strings with the header "0"
df = pd.read_csv("smiles_sorted.csv")
parsed = [(s, Chem.MolFromSmiles(s)) for s in df['0'].tolist()]
# Drop SMILES that RDKit cannot parse, keeping both lists aligned
suppl_list = [s for s, m in parsed if m is not None]
suppl_end = [m for s, m in parsed if m is not None]

# Compile the SMARTS queries once
PosIonizable_Guanidine = Chem.MolFromSmarts('[#7]-[#6](=[#7])-[#7]')
PosIonizable_Imidazole = Chem.MolFromSmarts('[#6]1-[#7]-[#6]-[#7]-[#6]-1')

# Screening function for the guanidine and imidazole groups
def pharmacophore_smarts(m):
    '''Take one RDKit Mol; return (has guanidine, has five-membered ring pattern).'''
    PosIonizable_1 = m.HasSubstructMatch(PosIonizable_Guanidine)
    PosIonizable_2 = m.HasSubstructMatch(PosIonizable_Imidazole)
    return PosIonizable_1, PosIonizable_2

# Screen every molecule; the result is a DataFrame of booleans
result = pd.DataFrame([pharmacophore_smarts(m) for m in suppl_end],
                      columns=['PosIonizable_1', 'PosIonizable_2'])

# Keep only molecules that contain both groups (use | instead of & to accept either one).
# The original dropped from `result` twice, so the second filter overwrote the first.
result_new = result[result.PosIonizable_1 & result.PosIonizable_2]

# Draw the molecules that passed the screen
mols = [suppl_end[i] for i in result_new.index]
img = Draw.MolsToGridImage(mols, molsPerRow=4, subImgSize=(300, 300),
                           legends=['' for x in mols], returnPNG=False)
img.save('demo.png')

Morgan fingerprint (ECFP)

A chemical fingerprint is a vector of 1s and 0s in which each bit indicates the presence or absence of a particular structural feature in the molecule. Morgan fingerprints, also known as extended-connectivity fingerprints (ECFPs), are a featurized representation that combines many such features. They turn a molecule of any size into a fixed-length vector. This matters because many models require every input to have exactly the same size: with ECFPs, molecules of different sizes can be fed to the same model. ECFPs are also easy to compare: the similarity of two molecules can be computed from their fingerprints with the Tanimoto coefficient (or the corresponding Tanimoto distance).
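A minimal sketch of these two properties, using ethanol and aspirin as arbitrary example molecules and a radius of 2 as an assumed setting: both fingerprints have the same length regardless of molecule size, and the Tanimoto similarity is the number of bits set in both fingerprints divided by the number of bits set in either.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

small = Chem.MolFromSmiles('CCO')                    # ethanol
large = Chem.MolFromSmiles('CC(=O)Oc1ccccc1C(=O)O')  # aspirin

# Both molecules are folded into bit vectors of the same fixed length
fp_small = AllChem.GetMorganFingerprintAsBitVect(small, 2, nBits=2048)
fp_large = AllChem.GetMorganFingerprintAsBitVect(large, 2, nBits=2048)
print(fp_small.GetNumBits(), fp_large.GetNumBits())

# Tanimoto similarity by hand: |A and B| / |A or B| over the set bits
on_small = set(fp_small.GetOnBits())
on_large = set(fp_large.GetOnBits())
manual = len(on_small & on_large) / len(on_small | on_large)

# The same value from RDKit
library_value = DataStructs.TanimotoSimilarity(fp_small, fp_large)
print(manual, library_value)
```

Recent RDKit releases also offer a generator-based fingerprint interface; the function used here is the one that appears in this post's own example.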

If you are interested in how Morgan fingerprints are encoded and the principle behind them, see the following two papers (1. DOI: 10.1021/ci9803381; 2. DOI: 10.1021/ci100050t).

Here it comes again, the same advice as before: you do not need to master the encoding rules, as long as you know how to use RDKit.

The following code shows a small example of using ECFP:

import pandas as pd
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

data = pd.read_csv('moses_qed_props.csv')
# The SMILES strings are read from the column named SMILES
data_1 = data.SMILES.tolist()


# Generate Morgan fingerprints
def product_fps(data):
    """Take a list of SMILES strings; return one Morgan bit vector per parsable molecule."""
    # Skip missing entries (pandas reads empty cells as NaN, which is not a string)
    data = [x for x in data if isinstance(x, str)]
    data_mols = [Chem.MolFromSmiles(s) for s in data]
    # MolFromSmiles returns None for SMILES it cannot parse
    data_mols = [x for x in data_mols if x is not None]
    # Radius 3, folded to 2048 bits
    data_fps = [AllChem.GetMorganFingerprintAsBitVect(x, 3, nBits=2048) for x in data_mols]
    return data_fps

# Calculate pairwise molecular similarity
def similar(data):
    """Take a list of SMILES strings; return the Tanimoto similarity of every pair."""
    fps = product_fps(data)
    similarity = []
    # Compare each fingerprint with all earlier ones, so every pair is counted once
    for i in range(1, len(fps)):
        sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
        similarity.extend(sims)
    return similarity


# Call the functions and print the first results
fps = product_fps(data_1)
print(f'fps:{fps[:20]}')
similarity = similar(data_1)
print(f'similarity:{similarity[:20]}')

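(Note: unlike a canonical SMILES string, an ECFP is not a unique encoding. Different molecules can produce the same bit vector, and the molecule cannot be reconstructed from its fingerprint.)

I post one RDKit-related article a week, summing up the tricky parts. If this helped you, I would appreciate a like. Thank you very much.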

Original site

Copyright notice
This article was created by [Order anything]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/196/202207130553300119.html
