Paper reading (57): 2-hydr_Ensemble: lysine 2-hydroxyisobutyrylation identification with ensemble method (task)
2022-06-23 18:04:00 【Inge】
1 Introduction
1.1 Title
2-hydr_Ensemble: Lysine 2-hydroxyisobutyrylation identification with ensemble method
1.2 Abstract
Lysine 2-hydroxyisobutyrylation is a newly identified type of post-translational modification detected in proteomics. Studying this modification may contribute to research on, and drug development for, a variety of diseases. This work proposes 2-hydr_Ensemble, a new algorithm that identifies modified residues from sequence information at the protein level. The method is compared with typical classification models. The AUC values for HeLa cells, Spirococcus, rice seeds, and Saccharomyces cerevisiae reach 0.9197, 0.8192, 0.9307, and 0.8897, respectively. Statistical features from bi-profile Bayes are further used to extract latent information from several feature vectors.
1.3 Bib
@article{Bao:2021:104351,
  author = {Wen Zheng Bao and Bin Yang and Bai Tong Chen},
  title = {2-hydr\_ensemble: Lysine 2-hydroxyisobutyrylation identification with ensemble method},
  journal = {Chemometrics and Intelligent Laboratory Systems},
  volume = {215},
  pages = {104351},
  year = {2021},
  doi = {10.1016/j.chemolab.2021.104351}
}
2 Method
2.1 Modified residues
2.1.1 Correlation coefficient (CC)
The Pearson correlation coefficient is a linear correlation coefficient, generally used to measure the correlation between two variables of residues. For two gene sequences $X$ and $Y$, the Pearson correlation coefficient is computed as:
$$R_{X,Y}=\frac{\sum{(X-\overline{X})(Y-\overline{Y})}}{\sqrt{\sum{(X-\overline{X})^2}\sum{(Y-\overline{Y})^2}}}, \tag{1}$$
where $\overline{X}$ denotes the mean of $X$.
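As a minimal sketch, the coefficient can be computed directly from the standard definition, in which the denominator is the square root of the product of the two sums of squared deviations. The function name `pearson_cc` and the sample values are illustrative assumptions, not taken from the paper.

```python
import math

def pearson_cc(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Numerator: sum of products of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator terms: sums of squared deviations of each variable.
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical expression values for two genes over four samples.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.5]
print(pearson_cc(x, y))
```

Values near 1 or -1 indicate a strong linear relationship; values near 0 indicate none.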
2.1.2 Partial correlation coefficient (PCC)
The partial correlation coefficient is the correlation between two variables with the influence of the other variables removed. Because the relationship between two variables can be complex and may be affected by several other variables, PCC is a better choice than CC. PCC is defined from the corresponding CC. Let $R$ denote the CC matrix and $R^{-1}$ its inverse; PCC is then computed as:
$$R_{X,Y}'=\frac{R_{X,Y}^{-1}}{\sqrt{R_{X,X}^{-1}R_{Y,Y}^{-1}}} \tag{2}$$
2.1.3 Conditional mutual information (CMI)
Mutual information (MI) can measure the nonlinear correlation between modified and unmodified residues:
$$I(X,Y)=\sum_{x\in X}\sum_{y\in Y}p(x,y)\log\frac{p(x,y)}{p(x)p(y)}, \tag{3}$$
where $p(x)$ is the probability of $x$ and $p(x,y)$ is the joint probability. Both can be obtained by Gaussian kernel probability density estimation:
$$p\left(x_{i}\right)=\frac{1}{N} \sum_{j=1}^{N} \frac{1}{(2 \pi)^{n / 2} \sigma_{x}^{n / 2}} \exp \left(-\frac{1}{2}\left(X_{j}-X_{i}\right)^{T} C^{-1}\left(X_{j}-X_{i}\right)\right), \tag{4}$$
where $C$ is the covariance matrix of $X$, $\sigma_x$ is the standard deviation given by $C$, and $n$ and $N$ are the number of genes and the number of gene expression points, respectively. MI can therefore be computed as:
$$I(X, Y)=\frac{1}{2} \log \left(\frac{\sigma_{X}^{2} \sigma_{Y}^{2}}{|C(X, Y)|}\right), \tag{5}$$
where $|C(X, Y)|$ is the determinant of the covariance matrix of $X$ and $Y$.
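Eq. (5) can be evaluated from sample variances and the sample covariance alone. The sketch below assumes two scalar variables, so the determinant is that of a 2x2 covariance matrix; the function name `gaussian_mi` and the data are illustrative assumptions.

```python
import math

def gaussian_mi(x, y):
    """Mutual information from Eq. (5) for two scalar variables."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Unbiased sample variances and covariance.
    var_x = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    # |C(X, Y)| is the determinant of the 2x2 covariance matrix.
    det = var_x * var_y - cov_xy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular (perfectly correlated data)")
    return 0.5 * math.log(var_x * var_y / det)

# Hypothetical expression values for two genes over four samples.
print(gaussian_mi([1, 2, 3, 4], [1, 3, 2, 4]))
```

In this two-variable case the expression reduces to $-\frac{1}{2}\log(1-r^2)$, where $r$ is the Pearson correlation, so it is zero for uncorrelated data and grows without bound as $|r|$ approaches 1.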
However, MI tends to overestimate the dependence between variables. Conditional mutual information (CMI) was therefore proposed:
$$CMI(X, Y \mid Z) = \sum_{x \in X, y \in Y, z \in Z} p(x, y, z) \log \frac{p(x, y \mid z)}{p(x \mid z) p(y \mid z)}. \tag{6}$$
If two genes $X$ and $Y$ are unrelated given $Z$ (conditionally independent), then $CMI(X, Y \mid Z)=0$.
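For discrete (or pre-binned) data, Eq. (6) can be estimated by replacing each probability with an empirical frequency. This is a plug-in sketch under that assumption; continuous expression values would first need to be discretized or handled with a density estimate, which is not shown. The function name is hypothetical.

```python
import math
from collections import Counter

def conditional_mutual_information(xs, ys, zs):
    """Plug-in estimate of CMI(X, Y | Z) from discrete samples, per Eq. (6)."""
    n = len(xs)
    count_xyz = Counter(zip(xs, ys, zs))
    count_xz = Counter(zip(xs, zs))
    count_yz = Counter(zip(ys, zs))
    count_z = Counter(zs)
    cmi = 0.0
    for (x, y, z), c in count_xyz.items():
        # p(x,y|z) / (p(x|z) p(y|z)) simplifies to n_xyz * n_z / (n_xz * n_yz).
        ratio = c * count_z[z] / (count_xz[(x, z)] * count_yz[(y, z)])
        cmi += (c / n) * math.log(ratio)
    return cmi

# X and Y are identical and independent of Z: conditioning on Z removes nothing.
print(conditional_mutual_information([0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]))
# X and Y are both copies of Z: given Z they carry no further shared information.
print(conditional_mutual_information([0, 1, 0, 1], [0, 1, 0, 1], [0, 1, 0, 1]))
```

The second call illustrates the point of conditioning: two variables driven entirely by a common third variable have high MI but zero CMI.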
2.1.4 Maximum information coefficient (MIC)
The maximum information coefficient (MIC) measures the linear or nonlinear relationship between two variables and makes no assumptions about the distribution of the data. Let $D$ be a set of two-variable data whose elements are ordered pairs, and let $G$ be a grid with $a$ columns and $b$ rows. The maximum mutual information over all grids of size $a \times b$ is:
$$I^*(D,a,b)=\max I(D|_G), \tag{7}$$
where $I(D|_G)$ denotes the mutual information of $D|_G$, the distribution induced by $D$ on the cells of $G$. $M(D)$ is the characteristic matrix of $D$, computed as:
$$M(D)_{a,b}=\frac{I^*(D,a,b)}{\log(\min(a,b))}. \tag{8}$$
$\max(M(D))$ is the MIC between the two genes. If the two genes are unrelated, their MIC equals 0.
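The sketch below is a deliberately simplified approximation of Eqs. (7) and (8). Instead of searching over all grids of a given size as Eq. (7) requires, it evaluates only one equal-frequency grid per size, so it yields a lower bound on the true MIC. The function names and the `max_bins` cap are assumptions for illustration; tied values are split between bins by their position in the input.

```python
import math
from collections import Counter

def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding (almost) equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

def discrete_mi(bx, by):
    """Mutual information (natural log) of two discrete label sequences."""
    n = len(bx)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    return sum((c / n) * math.log(c * n / (px[i] * py[j]))
               for (i, j), c in joint.items())

def approximate_mic(x, y, max_bins=4):
    """Largest normalized MI over equal-frequency grids of size a x b."""
    best = 0.0
    for a in range(2, max_bins + 1):
        for b in range(2, max_bins + 1):
            mi = discrete_mi(equal_frequency_bins(x, a),
                             equal_frequency_bins(y, b))
            # Eq. (8): normalize by log(min(a, b)) so entries lie in [0, 1].
            best = max(best, mi / math.log(min(a, b)))
    return best

# A nonlinear but monotone relationship.
x = list(range(16))
y = [v * v for v in x]
print(approximate_mic(x, y))
```

A full implementation would optimize the grid boundaries for each size, typically with dynamic programming, rather than fixing them at quantiles.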
2.2 Ensemble method
To improve the accuracy of detecting directly modified residues, a new dual ensemble method is proposed:
1) Given a gene data set $D$ containing $m$ genes and $n$ sample points, generate $K$ data sets $(D^1,D^2,\dots,D^K)$;
2) For each data set $D^i$, use CC, PCC, CMI, and MIC to compute the correlations between genes directly, obtain four ranked lists $(G_{CC}^i,G_{PCC}^i,G_{CMI}^i,G_{MIC}^i)$, and integrate them into $G^i$;
3) Repeat step 2 for every data set to obtain $(G^1,G^2,\dots,G^K)$;
4) Integrate these into the final result $G$.
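The four steps above can be sketched as follows. The text does not say how the $K$ data sets are generated or how ranked lists are integrated, so this sketch assumes bootstrap resampling of the sample points and mean-rank aggregation; both are stand-ins, not the paper's stated procedure. For brevity it uses two toy scores (absolute Pearson and absolute Spearman) in place of CC, PCC, CMI, and MIC, and every function name is hypothetical.

```python
import math
import random
from itertools import combinations

def abs_pearson(x, y):
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else abs(sxy) / math.sqrt(sxx * syy)

def ranks(values):
    """Rank positions of the values (ties broken by input order)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for pos, i in enumerate(order):
        result[i] = pos
    return result

def abs_spearman(x, y):
    return abs_pearson(ranks(x), ranks(y))

def rank_pairs(data, score):
    """Map each gene pair to its rank (0 = strongest) under one score."""
    pairs = list(combinations(range(len(data)), 2))
    ordered = sorted(pairs, key=lambda p: score(data[p[0]], data[p[1]]),
                     reverse=True)
    return {pair: rank for rank, pair in enumerate(ordered)}

def mean_rank(rankings):
    """Integrate several rankings by averaging the rank of each pair."""
    return {p: sum(r[p] for r in rankings) / len(rankings) for p in rankings[0]}

def dual_ensemble(data, scores, k=10, seed=0):
    rng = random.Random(seed)
    n = len(data[0])
    per_dataset = []
    for _ in range(k):
        # Step 1 (assumed): bootstrap the sample points to get D^i.
        idx = [rng.randrange(n) for _ in range(n)]
        resampled = [[gene[i] for i in idx] for gene in data]
        # Step 2: one ranking per score, integrated into G^i.
        per_dataset.append(mean_rank([rank_pairs(resampled, s) for s in scores]))
    # Steps 3-4: integrate (G^1, ..., G^K) into the final ordering G.
    final = mean_rank(per_dataset)
    return sorted(final, key=final.get)

# Hypothetical data: gene 1 is a scaled copy of gene 0; gene 2 is unrelated.
data = [[1, 2, 3, 4, 5, 6, 7, 8],
        [2, 4, 6, 8, 10, 12, 14, 16],
        [5, 1, 4, 2, 8, 3, 7, 6]]
print(dual_ensemble(data, [abs_pearson, abs_spearman], k=5, seed=1))
```

Aggregating ranks rather than raw scores sidesteps the fact that the measures live on different scales, which is one plausible reason to integrate ranked lists as the steps describe.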