(5) Introduction to R language bioinformatics -- ORF and sequence analysis
2022-07-06 12:21:00 【EricFrenzy】
Note: this blog shares my personal learning experience. Please forgive any inaccuracies!
Introduction to the concept
In the human body, expressing a gene on the DNA starts with transcription: the DNA of the gene is transcribed into pre-mRNA, which is further processed into mature mRNA. Ribosomes then use the mRNA to synthesize proteins, which in turn control the organism's responses. On the mRNA, every three bases form a codon, and each codon corresponds to one amino acid. The standard codon table lists which amino acid each codon encodes.
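As a minimal sketch of the codon-to-amino-acid lookup, the snippet below splits a short mRNA into codons and looks each one up. The table `codon_table` holds only a handful of entries from the standard genetic code, and the names `codon_table` and `translate_codons` are illustrative assumptions, not part of the original post.

```r
# Partial codon table (a few entries of the standard genetic code, for illustration only)
codon_table <- c(
  aug = "Met",   # also serves as the start codon
  uuu = "Phe",
  gcu = "Ala",
  aaa = "Lys",
  uaa = "Stop", uag = "Stop", uga = "Stop"
)

# Split an mRNA string into consecutive codons and look each one up
translate_codons <- function(mrna) {
  mrna <- tolower(mrna)
  n <- nchar(mrna) %/% 3                       # number of complete codons
  starts <- seq(1, by = 3, length.out = n)     # first base of each codon
  codons <- substring(mrna, starts, starts + 2)
  unname(codon_table[codons])                  # NA for codons missing from the partial table
}

print(translate_codons("AUGUUUGCUAAAUAA"))
```

A full implementation would need all 64 codons. The partial table here is only meant to show the three-bases-per-amino-acid reading.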
To synthesize a normal protein, the coding sequence must begin with a start codon (ATG in DNA) and end with a stop codon (TAA, TAG or TGA in DNA). DNA, however, contains many such start and stop codons, which give rise to many different sequence combinations. To find every stretch of DNA that has the potential to encode a protein, we search for open reading frames (ORFs).
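To illustrate why so many combinations arise, the sketch below reads one toy sequence in its three forward reading frames. The sequence and the helper name `frame_codons` are made up for illustration. Each offset yields a different set of codons, so a start codon and a stop codon can only delimit an ORF when their positions differ by a multiple of 3.

```r
# Toy DNA sequence, 5' to 3' (invented for this example)
dna <- "catgaaatgcctaagtga"

# Return the codons of one reading frame (offset 0, 1 or 2)
frame_codons <- function(dna, offset) {
  n <- (nchar(dna) - offset) %/% 3                    # complete codons in this frame
  starts <- seq(1 + offset, by = 3, length.out = n)   # first base of each codon
  substring(dna, starts, starts + 2)
}

# Print the codons of each of the three forward frames
for (offset in 0:2) {
  cat("frame", offset + 1, ":", frame_codons(dna, offset), "\n")
}
```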
Code implementation of ORF finding
The procedure for finding ORFs in R is implemented by the following code:
findORF <- function(seq){
  # seq: DNA sequence as a vector of single lowercase bases, e.g. c("a", "t", "g"), in 5' to 3' direction
  findStartCodons <- function(seq){
    # Return the position of the first base of every start codon
    startcodons <- numeric(0) # Create an empty vector
    k <- 1
    for(i in seq_len(max(length(seq)-5, 0))){
      # i is the first base of the codon; the last five positions are skipped because a start codon there leaves no room for a stop codon
      if(seq[i] == "a" && seq[i+1] == "t" && seq[i+2] == "g"){
        # ATG is the start codon
        startcodons[k] <- i # Record the position
        k <- k + 1 # Advance the index
      }
    }
    return(startcodons) # Return the result
  }
  findStopCodons <- function(seq){
    # Return the position of the first base of every stop codon
    stopcodons <- numeric(0) # Create an empty vector
    k <- 1
    for(i in seq_len(max(length(seq)-2, 0))){
      # i is the first base of the codon
      codon <- paste(seq[i:(i+2)], collapse = "")
      if(codon %in% c("taa", "tag", "tga")){
        # TAA, TAG and TGA are the stop codons
        stopcodons[k] <- i # Record the position
        k <- k + 1 # Advance the index
      }
    }
    return(stopcodons) # Return the result
  }
  startcodon <- findStartCodons(seq) # Positions of all start codons
  stopcodon <- findStopCodons(seq) # Positions of all stop codons
  usedStop <- numeric(0) # Stop codons already assigned to an ORF
  ORFs <- character(0) # Valid open reading frames
  k <- 1
  for(i in startcodon){
    # Traverse all start codons
    for(j in stopcodon){
      # Traverse all stop codons in ascending order
      if((j-i)%%3==0 && j > i){
        # Same reading frame: the distance between the two codons is a multiple of 3
        if(j %in% usedStop){
          # This stop codon already closes an ORF that began at an earlier start codon
          break # Leave the inner loop and go to the next start codon
        }else if(j-i < 300){
          # The sequence between the two codons is too short (fewer than 300 bases)
          break # Same as above
        }else{
          ORFs[k] <- paste(i, "to", j) # Record the ORF as a string such as "1 to 3001"; j is the first base of the stop codon
          usedStop[k] <- j # Record the used stop codon
          k <- k + 1 # Advance the index
          break # Leave the inner loop and go to the next start codon
        }
      }
    }
  }
  return(ORFs) # Return the result
}
This ORF-finding algorithm is relatively simple and fast, but it is correspondingly less accurate: for example, it scans only the strand it is given and discards any ORF shorter than 300 bases. A more accurate algorithm is available on the NCBI website.
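Because the function above expects a vector of lowercase bases and reads only the strand it is given, a small amount of preparation helps in practice. The helpers `toBases` and `revComp` below are hypothetical names introduced for this sketch, not part of the original code. They convert a DNA string into the expected input and build the reverse complement so that the opposite strand can also be scanned.

```r
# Convert a DNA string into a vector of single lowercase bases
toBases <- function(dna) {
  strsplit(tolower(dna), "")[[1]]
}

# Reverse complement, so the opposite strand is also read 5' to 3'
# (bases other than a, c, g, t become NA in this simple sketch)
revComp <- function(bases) {
  comp <- c(a = "t", t = "a", g = "c", c = "g")
  unname(rev(comp[bases]))
}

fwd <- toBases("ATGAAACCCTAA")
rev_strand <- revComp(fwd)
print(paste(rev_strand, collapse = ""))

# If findORF from this post has been defined, scan both strands
if (exists("findORF")) {
  print(findORF(fwd))
  print(findORF(rev_strand))
}
```

Positions reported for the reverse strand refer to the reverse-complemented sequence, so they would need to be mapped back before comparing them with forward-strand coordinates.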
Conclusion
Once the ORFs have been found, they can be compared with known sequences in a database to predict useful information such as the composition and function of genes in the species. Next time we will introduce the Needleman-Wunsch algorithm for global sequence alignment, coming soon! Questions and ideas are welcome in the comments!