Interpretation of the paper: develop a prediction model based on multi-layer deep learning to identify DNA N4 methylcytosine modification
2022-07-23 12:22:00 【Windy Street】
Developing a Multi-Layer Deep Learning Based Predictive Model to Identify DNA N4-Methylcytosine Modifications
Article link: https://www.frontiersin.org/articles/10.3389/fbioe.2020.00274/full
DOI: https://doi.org/10.3389/fbioe.2020.00274
Journal: Frontiers in Bioengineering and Biotechnology (JCR Q2)
Impact factor: 5.89
Publication date: April 21, 2020
Data: http://server.malab.cn/Deep4mcPred/Download.html
Web server: http://server.malab.cn/Deep4mcPred
1. Summary
1. A prediction model based on multi-layer deep learning, Deep4mcPred, is proposed. It is the first to integrate a residual network (ResNet) and a recurrent neural network (RNN) into a multi-layer deep learning prediction system.
2. The deep learning model needs no hand-crafted features when training the predictor: it automatically learns high-level features and captures the characteristics of 4mC sites, which helps distinguish 4mC sites.
3. On the benchmark test set, the deep learning method outperforms traditional machine learning predictors, indicating that Deep4mcPred is more effective for DNA 4mC site prediction.
4. The attention mechanism introduced into the deep learning framework can capture key features.
5. A web server was developed: http://server.malab.cn/deep4mcpred.
2. Background
With the development of high-throughput technology, 4mC has been found in bacteria, where it plays an important role in protecting the genome from invasion via restriction-modification (R-M) systems.
Previous methods improved the performance of 4mC site recognition, but the datasets they used were too small to fully represent whole genomes and to train a well-performing model.
3. Data
Chen et al. proposed a gold-standard benchmark dataset for performance evaluation and comparison. However, that dataset is too small to train deep learning models. The authors therefore constructed a larger dataset in this study, strictly following the data processing procedure introduced by Chen et al., to ensure the processed dataset is as representative as possible.
(1) Positive samples
Processing steps:
- Collected all 41 bp sequences containing true 4mC sites from the MethSMRT database.
- Removed sequences whose ModQV scores fell below the default threshold for calling modified positions, following the methylome analysis technical note.
- Used the CD-HIT software (with an 80% threshold) to reduce sequence identity among the positives, avoiding potential performance bias.
Positive samples were collected from three species: Arabidopsis thaliana (A. thaliana), Caenorhabditis elegans (C. elegans) and Drosophila melanogaster (D. melanogaster). Details of the positive samples from the three species are listed in Table 1. 20,000 positive samples were randomly selected for model training.
(2) Negative samples
Negative samples are also cytosine-centered 41 bp sequences, but ones not identified as 4mC by SMRT sequencing. In this case, the number of negative samples for each species is much larger than the corresponding positives. To avoid data imbalance, the same number of sequences as the positives was randomly selected to form the negative set.
4. Method
4.1 Sequence features
One-hot encoding:
"A": (1,0,0,0)
"G": (0,1,0,0)
"C": (0,0,1,0)
"T": (0,0,0,1)
"N": (0,0,0,0)
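The mapping above can be sketched as a small helper (an illustrative snippet, not the authors' code; the `encode` function name is my own):

```python
# One-hot encode a DNA sequence using the mapping listed above.
import numpy as np

ONE_HOT = {
    "A": (1, 0, 0, 0),
    "G": (0, 1, 0, 0),
    "C": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
    "N": (0, 0, 0, 0),  # unknown base maps to all zeros
}

def encode(seq: str) -> np.ndarray:
    """Return a (len(seq), 4) one-hot matrix for a DNA string."""
    return np.array([ONE_HOT[base] for base in seq.upper()], dtype=np.float32)

# Each 41 bp window in the dataset thus becomes a 41x4 input matrix.
```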
4.2 Deep learning model framework

For a given DNA sequence, the neural network consists of four layers: an input layer, a ResNet layer, an LSTM layer and an attention layer, as shown in Figure 1. The first layer is the input layer: sequences in the dataset are one-hot encoded, and the resulting features are fed to the subsequent ResNet layer. With the ResNet design, a deeper network can be built on top of an ordinary CNN to extract effective global features, and its output feature vectors serve as input to the LSTM layer. In the LSTM layer, a bidirectional LSTM model collects feature information from both directions. In the final attention layer, an attention mechanism is introduced to integrate the LSTM layer's outputs and obtain more relevant feature information. Finally, a fully connected (FC) neural network is attached after the attention module, and a softmax activation function performs the prediction.
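A shape-level walk-through of this pipeline, using placeholder tensors rather than a trained model (the sizes other than the 41x4 input are illustrative, not the paper's actual hyperparameters):

```python
# Shape flow: one-hot input -> ResNet features -> BiLSTM outputs
# -> attention pooling -> FC + softmax. Placeholder tensors only.
import numpy as np

seq_len, channels, hidden = 41, 4, 32             # hidden size is illustrative

x = np.zeros((seq_len, channels))                 # one-hot encoded 41 bp window
res_out = np.zeros((seq_len, hidden))             # ResNet layer: per-position features
lstm_out = np.zeros((seq_len, 2 * hidden))        # BiLSTM concatenates both directions
attn = np.full((seq_len, 1), 1.0 / seq_len)       # attention weights sum to 1
context = (attn * lstm_out).sum(axis=0)           # weighted sum over positions
logits = np.zeros(2)                              # FC layer: two class scores
probs = np.exp(logits) / np.exp(logits).sum()     # softmax over {4mC, non-4mC}
```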
4.2.1 Residual neural network (ResNet)
As convolutional neural networks get deeper, optimization becomes harder and accuracy on both training and test data degrades. This is because deepening the network causes exploding and vanishing gradients.
One existing remedy is to normalize the input data and the intermediate layers' data, which lets the network converge under stochastic gradient descent (SGD) during backpropagation. However, this only helps for networks of a few dozen layers; once the network goes deeper, it no longer suffices.
ResNet was proposed to solve this problem: its internal residual blocks use skip connections to alleviate the vanishing-gradient problem caused by increasing depth in convolutional neural networks.
ResNet residual blocks come in two kinds: a two-layer structure and a three-layer structure.
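A minimal two-layer residual block in plain numpy (forward pass only, dense instead of convolutional for brevity; an illustration of the skip connection, not the authors' implementation):

```python
# Two-layer residual block: output = ReLU(F(x) + x). The identity
# shortcut lets gradients bypass the weight layers, easing the
# vanishing-gradient problem in deep networks.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    out = relu(x @ w1)        # first weight layer + activation: F(x), part 1
    out = out @ w2            # second weight layer: F(x), part 2
    return relu(out + x)      # add the identity shortcut, then activate

d = 8
rng = np.random.default_rng(0)
x = rng.standard_normal(d)
# With zero weights F(x) = 0, so the block reduces to ReLU(x):
y = residual_block(x, np.zeros((d, d)), np.zeros((d, d)))
```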
4.2.2 Long short-term memory network (LSTM)
Due to exploding or vanishing gradients, RNNs suffer from a long-term dependency problem: it is hard for them to model long-distance dependencies. A gating mechanism is therefore introduced to control the rate at which information accumulates, selectively adding new information and selectively forgetting accumulated information. Classic gated RNNs include the LSTM (long short-term memory) network and the GRU (gated recurrent unit) network.
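A single LSTM step, showing how the input, forget and output gates control information flow (a didactic numpy sketch of the standard gate formulation, not code from the paper):

```python
# One step of an LSTM cell: gates decide what to forget from the cell
# state c and what new information to add to it.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """Update hidden state h and cell state c for one input x.
    W, U, b hold the stacked parameters of the input (i), forget (f),
    output (o) gates and the candidate update (g)."""
    n = h.size
    z = W @ x + U @ h + b                          # all four pre-activations, stacked
    i, f, o = (sigmoid(z[k * n:(k + 1) * n]) for k in range(3))
    g = np.tanh(z[3 * n:])                         # candidate new information
    c_new = f * c + i * g                          # selectively forget, selectively add
    h_new = o * np.tanh(c_new)
    return h_new, c_new

# With all-zero parameters every gate equals 0.5, so the cell state halves.
n, m = 3, 2
h_new, c_new = lstm_step(np.ones(m), np.zeros(n), np.ones(n),
                         np.zeros((4 * n, m)), np.zeros((4 * n, n)), np.zeros(4 * n))
```

A bidirectional LSTM, as used in the paper, simply runs two such chains over the sequence, one per direction, and concatenates their hidden states.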
4.2.3 Attention mechanism (Attention)
Attention mechanisms can quickly filter high-value information out of noise and have recently shown great success in many related classification tasks. To take advantage of this, the authors apply an attention mechanism after the LSTM layer in their model.
Advantages of the attention mechanism:
- Fewer parameters: compared with CNNs and RNNs, the model is less complex and has fewer parameters, so it demands less computation.
- Faster: attention solves the problem that RNNs cannot be parallelized. Each step of the attention computation does not depend on the result of the previous step, so, like a CNN, it can be processed in parallel.
- Better results: before attention was introduced, one long-standing problem was that long-distance information gets weakened, like a person with a weak memory who cannot recall the distant past. Attention picks out the key points: even in a long sequence, it can grasp the important parts in the middle without losing key information.
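A minimal attention-pooling step over per-position features, of the kind applied after a BiLSTM (an illustrative numpy sketch; the scoring vector `v` stands in for learned attention parameters):

```python
# Attention pooling over per-position features H (T x d): score each
# position against a vector v, normalize the scores with softmax, and
# return the weighted sum as a single context vector.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift by max for numerical stability
    return e / e.sum()

def attention_pool(H, v):
    scores = H @ v            # one relevance score per position
    alpha = softmax(scores)   # weights in (0, 1) that sum to 1
    return alpha @ H, alpha   # context vector (d,) and the weights

# A position with a much higher score dominates the pooled context,
# even if it sits far from the ends of the sequence.
H = np.array([[1.0, 0.0], [0.0, 1.0], [10.0, 0.0]])
v = np.array([1.0, 0.0])
context, alpha = attention_pool(H, v)
```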
4.2.4 Softmax
The vector produced by the attention module is fed into the softmax layer as input for classification.
The softmax function maps neuron outputs to numbers in (0, 1) that sum to 1. In other words, the output score of each class is converted into a relative probability by softmax, so the predicted label can be determined by comparing the predicted probabilities of the classes.
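The softmax computation can be written directly (an illustrative snippet; the max-subtraction is a standard numerical-stability trick):

```python
# Softmax maps raw class scores to probabilities in (0, 1) summing to 1;
# the predicted label is the class with the highest probability.
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())  # subtract max for numerical stability
    return e / e.sum()

probs = softmax(np.array([2.0, 0.5]))  # scores for {4mC, non-4mC}
pred = int(np.argmax(probs))           # index of the most probable class
```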
5. Results
5.1 Comparison of the proposed method with existing methods

5.2 Effect of integrating the attention mechanism on performance


6. Summary
Deep4mcPred is the first deep learning-based prediction method to integrate a residual network (ResNet) and a bidirectional long short-term memory network (BiLSTM) into a multi-layer deep learning prediction model.
No features need to be specified when training the prediction model: it automatically learns high-level features and captures the characteristics of 4mC sites, which helps distinguish true 4mC sites from non-4mC sites.
The attention mechanism introduced into the deep learning framework can capture key features.