Paper reading (57): 2-hydr_Ensemble: Lysine 2-hydroxyisobutyrylation identification with ensemble method (task)

2022-06-23 18:04:00 【Inge】

Contents

1 Introduction
2 Method

1 Introduction

1.1 Title

2021: Lysine 2-hydroxyisobutyrylation identification with ensemble method

1.2 Abstract

Lysine 2-hydroxyisobutyrylation is a novel type of post-translational modification detected in proteomics. Research on this modification may contribute to the study of, and drug development for, a variety of diseases. This work proposes a new algorithm, 2-hydr_Ensemble, that identifies such modified residues from protein sequence-level information. The method is compared with typical classification models. The results show AUC values of 0.9197, 0.8192, 0.9307, and 0.8897 for HeLa cells, Spirococcus, rice seeds, and Saccharomyces cerevisiae, respectively. Bi-profile Bayes statistical features are further used to extract latent information from several feature vectors.

1.3 Bib

@article{Bao:2021:104351,
author		=	{Wen Zheng Bao and Bin Yang and Bai Tong Chen},
title		=	{2-hydr\_ensemble: Lysine 2-hydroxyisobutyrylation identification with ensemble method},
journal		=	{Chemometrics and Intelligent Laboratory Systems},
volume		=	{215},
pages		=	{104351},
year		=	{2021},
doi			=	{10.1016/j.chemolab.2021.104351}
}

2 Method

2.1 Modified residues

2.1.1 Correlation coefficient (CC)

The Pearson correlation coefficient is a linear correlation coefficient, generally used to measure the correlation between two variables of residues. For two gene sequences $X$ and $Y$, the Pearson correlation coefficient is calculated as follows:
$\tag{1} R_{X,Y}=\frac{\sum{(X-\overline{X})(Y-\overline{Y})}}{\sqrt{\sum{(X-\overline{X})^2}\sum{(Y-\overline{Y})^2}}},$ where $\overline{X}$ and $\overline{Y}$ denote the mean values of $X$ and $Y$.
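As a minimal sketch of the Pearson coefficient, the snippet below uses the standard definition, in which the denominator is the square root of the product of the two sums of squared deviations. The function name `pearson_cc` and the toy vectors are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def pearson_cc(x, y):
    """Pearson correlation coefficient of two equal-length 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations of X from its mean
    dy = y - y.mean()  # deviations of Y from its mean
    # Numerator: sum of products of deviations.
    # Denominator: sqrt of (sum of squared dx) times (sum of squared dy).
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))

# Toy data (hypothetical): y grows roughly linearly with x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
r = pearson_cc(x, y)
```

The result agrees with `np.corrcoef(x, y)[0, 1]`, which is the shorter route in practice.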

2.1.2 Partial correlation coefficient (PCC)

The partial correlation coefficient is the correlation between two variables after removing the influence of the other variables. Because the relationship between two variables can be complex and may be affected by several other variables, the partial correlation coefficient is a better choice than CC. PCC is defined in terms of the corresponding CC. Let $R$ denote the CC matrix and $R^{-1}$ its inverse; then PCC is calculated as:
$\tag{2} R_{X,Y}'=-\frac{R_{X,Y}^{-1}}{\sqrt{R_{X,X}^{-1}R_{Y,Y}^{-1}}},$ where $R_{X,Y}^{-1}$ denotes the $(X,Y)$ entry of $R^{-1}$.
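The partial correlation can be sketched by inverting the correlation matrix. The snippet follows the standard sign convention: the partial correlation is the negative of the normalized off-diagonal entry of $R^{-1}$. The function name, the simulated data, and the assumption that $R$ is non-singular are illustrative choices, not part of the paper.

```python
import numpy as np

def partial_correlation_matrix(data):
    """Partial correlations between columns of data (n_samples x n_variables).

    Each entry (i, j) is the correlation of variables i and j after
    removing the linear influence of all remaining variables.
    """
    R = np.corrcoef(data, rowvar=False)  # CC matrix
    P = np.linalg.inv(R)                 # assumes R is non-singular
    d = np.sqrt(np.diag(P))
    pcc = -P / np.outer(d, d)            # standard sign convention
    np.fill_diagonal(pcc, 1.0)
    return pcc

# Hypothetical example: X and Y are both driven by Z and nothing else.
rng = np.random.default_rng(0)
z = rng.standard_normal(5000)
x = z + 0.5 * rng.standard_normal(5000)
y = z + 0.5 * rng.standard_normal(5000)
data = np.column_stack([x, y, z])

cc = np.corrcoef(data, rowvar=False)
pcc = partial_correlation_matrix(data)
# cc[0, 1] is large because of the shared driver Z, while pcc[0, 1]
# is close to zero once Z is accounted for.
```

This illustrates why PCC is preferred here: a strong CC between two variables can vanish once a common third variable is controlled for.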

2.1.3 Conditional mutual information (CMI)

Mutual information (MI) can measure the non-linear correlation between modified residues and unmodified residues:
$\tag{3} I(X,Y)=\sum_{x\in X}\sum_{y\in Y}p(x,y)\log\frac{p(x,y)}{p(x)p(y)},$ where $p(x)$ is the probability of $x$ and $p(x, y)$ is the joint probability. Both can be obtained by Gaussian kernel probability density estimation:
$\tag{4} p\left(x_{i}\right)=\frac{1}{N} \sum_{j=1}^{N} \frac{1}{(2 \pi)^{n / 2} \sigma_{x}^{n / 2}} \exp \left(-\frac{1}{2}\left(X_{j}-X_{i}\right)^{T} C^{-1}\left(X_{j}-X_{i}\right)\right),$ where $C$ denotes the covariance matrix of $X$, $\sigma_x$ denotes the standard deviation of $C$, and $n$ and $N$ denote the number of genes and the number of gene expression points, respectively. MI can therefore be calculated as:
$\tag{5} I(X, Y)=\frac{1}{2} \log \left(\frac{\sigma_{X}^{2} \sigma_{Y}^{2}}{|C(X, Y)|}\right),$ where $|C(X, Y)|$ is the determinant of the covariance matrix of $X$ and $Y$.
However, MI suffers from overestimation. Conditional mutual information (CMI) was therefore proposed:
$\tag{6} CMI(X, Y \mid Z) = \sum_{x \in X, y \in Y, z \in Z} p(x, y, z) \log \frac{p(x, y \mid z)}{p(x \mid z) p(y \mid z)}.$ If two genes $X$ and $Y$ are unrelated given $Z$, then $CMI(X, Y \mid Z) = 0$.
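Under a Gaussian assumption, Eq. (5) reduces MI to a ratio of variances and a covariance determinant. The sketch below implements that form. For CMI it uses the analogous Gaussian closed form, $\frac{1}{2}\log\frac{|C(X,Z)|\,|C(Y,Z)|}{|C(Z)|\,|C(X,Y,Z)|}$, which is an assumption brought in for illustration and is not stated in the text above. All function names and the simulated data are hypothetical.

```python
import numpy as np

def _det_cov(*cols):
    """Determinant of the sample covariance matrix of the given 1-D variables."""
    c = np.atleast_2d(np.cov(np.vstack(cols)))
    return float(np.linalg.det(c))

def gaussian_mi(x, y):
    """MI under a Gaussian assumption: 1/2 log(var(X) var(Y) / |C(X,Y)|)."""
    return 0.5 * np.log(_det_cov(x) * _det_cov(y) / _det_cov(x, y))

def gaussian_cmi(x, y, z):
    """CMI under a Gaussian assumption (closed form assumed, see lead-in)."""
    num = _det_cov(x, z) * _det_cov(y, z)
    den = _det_cov(z) * _det_cov(x, y, z)
    return 0.5 * np.log(num / den)

# Hypothetical data: X and Y depend on each other only through Z.
rng = np.random.default_rng(1)
z = rng.standard_normal(5000)
x = z + 0.5 * rng.standard_normal(5000)
y = z + 0.5 * rng.standard_normal(5000)

mi = gaussian_mi(x, y)       # clearly positive: X and Y look related
cmi = gaussian_cmi(x, y, z)  # near zero: the relation is explained by Z
```

The contrast between `mi` and `cmi` shows the overestimation problem: MI reports an association between `x` and `y` that disappears once `z` is conditioned on.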

2.1.4 Maximum information coefficient (MIC)

The maximum information coefficient (MIC) measures the linear or non-linear relationship between two variables and requires no assumptions about the distribution of the data. Let $D$ be a set of two-variable data whose elements are ordered pairs, and let $G$ be a grid of size $a \times b$ laid over the data. The maximum mutual information over all grids of size $a \times b$ is calculated as:
$\tag{7} I^*(D,a,b)=\max_{G} I(D|_G),$ where $I(D|_G)$ denotes the mutual information of $D|_G$, the distribution induced by $D$ on the cells of $G$. $M(D)$ is the characteristic matrix of $D$, calculated as:
$\tag{8} M(D)_{a,b}=\frac{I^*(D,a,b)}{\log(\min(a,b))}.$ The MIC between two genes is $\max_{a,b} M(D)_{a,b}$; if the two genes are unrelated, their MIC equals 0.
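The following is a deliberately crude approximation of Eqs. (7) and (8), not the full MIC procedure. It only tries equal-width grids, so the maximization over grid placements in Eq. (7) is skipped, and it caps the grid size with a hypothetical `max_bins` parameter. The function names are illustrative assumptions.

```python
import numpy as np

def grid_mutual_information(x, y, a, b):
    """MI (in nats) of the distribution induced by an a-by-b equal-width grid."""
    counts, _, _ = np.histogram2d(x, y, bins=(a, b))
    pxy = counts / counts.sum()          # joint cell probabilities
    px = pxy.sum(axis=1, keepdims=True)  # marginal over the a bins of x
    py = pxy.sum(axis=0, keepdims=True)  # marginal over the b bins of y
    nz = pxy > 0                         # skip empty cells (0 log 0 = 0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def approx_mic(x, y, max_bins=8):
    """Largest normalized grid MI over equal-width grids up to max_bins per axis."""
    best = 0.0
    for a in range(2, max_bins + 1):
        for b in range(2, max_bins + 1):
            m = grid_mutual_information(x, y, a, b) / np.log(min(a, b))
            best = max(best, m)
    return best

rng = np.random.default_rng(2)
x = np.linspace(0.0, 1.0, 200)
u, v = rng.random(2000), rng.random(2000)

mic_linear = approx_mic(x, x)               # perfect linear relation
mic_quad = approx_mic(x, (x - 0.5) ** 2)    # non-linear, non-monotonic relation
mic_indep = approx_mic(u, v)                # independent variables
```

Even this simplified version shows the intended behavior: the score is high for both the linear and the quadratic relationship and low for independent variables, whereas the Pearson coefficient of the quadratic case would be near zero.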

2.2 Ensemble method

To improve the accuracy of detecting directly modified residues, a new two-level ensemble method is proposed:
1) Given a gene data set $D$ containing $m$ genes and $n$ sample points, generate $K$ data sets $(D^1,D^2,\dots,D^K)$;
2) For each data set $D^i$, use CC, PCC, CMI, and MIC to calculate the correlations between genes directly, obtain four ranked lists $(G_{CC}^i,G_{PCC}^i,G_{CMI}^i,G_{MIC}^i)$, and integrate them into $G^i$;
3) Repeat to generate $(G^1,G^2,\dots,G^K)$;
4) Integrate these into $G$.
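The four steps above can be sketched as follows. Several choices here are assumptions made only for illustration, because the text does not specify them: the $K$ data sets are generated by bootstrap resampling of the sample points, ranked lists are integrated by averaging ranks, and only CC and PCC are plugged in as measures (CMI and MIC would be passed the same way). All function names are hypothetical.

```python
import numpy as np

def abs_cc(data):
    """Absolute Pearson correlations between columns (genes)."""
    return np.abs(np.corrcoef(data, rowvar=False))

def abs_pcc(data):
    """Absolute partial correlations via the (pseudo-)inverse CC matrix."""
    P = np.linalg.pinv(np.corrcoef(data, rowvar=False))
    d = np.sqrt(np.diag(P))
    return np.abs(P / np.outer(d, d))

def rank_pairs(score):
    """Rank gene pairs (upper triangle); rank 1 is the strongest association."""
    iu = np.triu_indices(score.shape[0], k=1)
    vals = score[iu]
    order = np.argsort(-vals)  # indices of pairs in descending score order
    ranks = np.empty(len(vals))
    ranks[order] = np.arange(1, len(vals) + 1)
    return ranks

def two_level_ensemble(data, measures, K=10, seed=0):
    """data: (n_samples, m_genes). Returns the aggregated rank of each gene pair."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    per_dataset = []
    for _ in range(K):
        idx = rng.integers(0, n, size=n)  # step 1: bootstrap data set D^i (assumed)
        Di = data[idx]
        ranks = [rank_pairs(m(Di)) for m in measures]  # step 2: one list per measure
        per_dataset.append(np.mean(ranks, axis=0))     # level 1: integrate into G^i
    return np.mean(per_dataset, axis=0)                # steps 3-4: integrate G^1..G^K

# Hypothetical data: gene 1 depends on gene 0; genes 2 and 3 are independent.
rng = np.random.default_rng(3)
g0 = rng.standard_normal(200)
g1 = g0 + 0.3 * rng.standard_normal(200)
data = np.column_stack([g0, g1, rng.standard_normal(200), rng.standard_normal(200)])

G = two_level_ensemble(data, measures=[abs_cc, abs_pcc], K=10)
# Pairs are ordered (0,1), (0,2), (0,3), (1,2), (1,3), (2,3); lower is stronger.
```

Mean-rank aggregation is used because it puts measures with different scales on a common footing; other rank-fusion rules would slot into the same two places.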