(5) Introduction to R language bioinformatics -- ORF and sequence analysis

2022-07-06 12:21:00 EricFrenzy

Note: This blog shares my personal learning experience. Please forgive any mistakes!

Introduction to the concept

In the human body, expressing a gene starts with transcription: the gene's DNA is transcribed into pre-mRNA, which is further processed into mature mRNA. Ribosomes then use the mRNA to synthesize proteins, which in turn control how the organism functions. On the mRNA, every three bases form a codon, and each codon corresponds to one amino acid. The following figure shows the codon-to-amino-acid table:
[Figure: codon-to-amino-acid table]
To synthesize a normal protein, the coding sequence on the mRNA must begin with a start codon (marked "start" in the table) and end with a stop codon (marked "stop"). A DNA sequence, however, contains many start and stop codons, which produce many different candidate combinations. To find every stretch of DNA that has the potential to encode a protein, we search for open reading frames (ORF, Open Reading Frame).
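To make the idea of a reading frame concrete, here is a minimal sketch that splits a short sequence into codons in each of the three possible frames. The sequence and the helper name `codonsInFrame` are made up for illustration and are not part of the original code.

```r
# Hypothetical example sequence, stored as a vector of single lowercase bases.
dna <- strsplit("ccatggcttaagatgtga", "")[[1]]

# Split the bases into codons for one reading frame (offset 0, 1 or 2).
codonsInFrame <- function(bases, offset) {
  if (length(bases) - offset < 3) return(character(0)) # too short for a codon
  starts <- seq(from = 1 + offset, to = length(bases) - 2, by = 3)
  vapply(starts, function(s) paste(bases[s:(s + 2)], collapse = ""), character(1))
}

for (offset in 0:2) {
  codons <- codonsInFrame(dna, offset)
  # Tag start and stop codons so they are easy to spot in the output
  tag <- ifelse(codons == "atg", "[start]",
                ifelse(codons %in% c("taa", "tag", "tga"), "[stop]", ""))
  cat("frame", offset + 1, ":", paste0(codons, tag), "\n")
}
```

In this toy sequence, frame 1 reads cca tgg ctt aag atg tga, while frame 3 reads atg gct taa gat gtg. The same bases therefore give different start and stop codons depending on the frame, which is why a start codon can only be paired with a stop codon in the same frame.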

Code implementation of the ORF search

The procedure for finding ORFs in R is as follows:
[Figure: flow chart]
Here is the code:

findORF <- function(seq){
  # seq is the DNA sequence as a vector of single lowercase bases ("a", "c", "g", "t")
  # Pay attention to the direction: it must run 5' to 3'

  findStartCodons <- function(seq){
    # Find the positions of all start codons
    startcodons <- numeric(0) # Create an empty vector
    k <- 1
    for(i in seq_len(max(0, length(seq)-5))){
      # Positions refer to the first base of the codon. The last five bases are not checked, because anything starting there is too short
      if(seq[i] == "a" && seq[i+1] == "t" && seq[i+2] == "g"){
        # ATG is the start codon
        startcodons[k] <- i # Record the position
        k <- k + 1 # Advance the index into the result vector
      }
    }
    return(startcodons) # Return the result
  }

  findStopCodons <- function(seq){
    # Find the positions of all stop codons
    stopcodons <- numeric(0) # Create an empty vector
    k <- 1
    for(i in seq_len(max(0, length(seq)-2))){
      # Positions refer to the first base of the codon
      codon <- paste(seq[i:(i+2)], collapse = "")
      if(codon %in% c("taa", "tag", "tga")){
        # TAA, TAG and TGA are the stop codons
        stopcodons[k] <- i # Record the position
        k <- k + 1 # Advance the index into the result vector
      }
    }
    return(stopcodons) # Return the result
  }
  
  startcodon <- findStartCodons(seq) # Find all start codons
  stopcodon <- findStopCodons(seq) # Find all stop codons
  usedStop <- numeric(0) # Record stop codons that already close an ORF
  ORFs <- character(0) # Record valid open reading frames
  k <- 1
  for(i in startcodon){
    # Traverse all start codons
    for(j in stopcodon){
      # Traverse all stop codons, in ascending position
      if((j-i)%%3==0 && j > i){
        # Same reading frame: the stop codon lies downstream at a distance that is a multiple of 3
        if(j %in% usedStop){
          # This stop codon already closes an ORF found from an earlier start codon
          break # Leave the inner loop and move on to the next start codon
        }else if(j-i < 300){
          # The sequence between the two codons is too short
          break # Same as above
        }else{
          # Positions are those of the first base of each codon, e.g. "1 to 3001"
          ORFs[k] <- paste(i, "to", j)
          usedStop[k] <- j # Record the used stop codon
          k <- k + 1 # Advance the index into the result vector
          break # Leave the inner loop and move on to the next start codon
        }
      }
    }
  }
  return(ORFs) # Return the result
}
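Because `findORF` compares single bases such as `seq[i] == "a"`, it expects a vector of lowercase single characters rather than one long string. A minimal sketch of preparing such an input follows. The test sequence is invented for illustration, and the call to `findORF` is guarded so the snippet also runs on its own.

```r
# Hypothetical test sequence: a start codon, 100 "gct" codons, then a stop codon.
dnaString <- paste0("ATG", strrep("GCT", 100), "TAA")

# Lowercase the string and split it into single bases, the form findORF indexes into.
dna <- strsplit(tolower(dnaString), "")[[1]]

length(dna) # 306 bases

# Run the search only if findORF from the listing above has been defined.
if (exists("findORF")) {
  print(findORF(dna)) # start codon at 1, stop codon at 304: "1 to 304"
}
```

The stop codon begins 303 bases after the start codon, so this sequence just clears the 300-base minimum used in the function.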

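This ORF-finding algorithm is relatively simple and fast, but its accuracy suffers accordingly. For example, it scans only the strand it is given, accepts only ATG as a start codon, and discards anything shorter than 300 bases. A more accurate algorithm is available on the NCBI official website.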
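One simplification visible in the code is that only the strand passed in is searched. A minimal sketch of building the reverse complement, so that the opposite strand could be passed to the same function, might look like this. The helper name `reverseComplement` is an assumption for illustration, not part of the original code.

```r
# Return the reverse complement of a vector of lowercase single bases.
reverseComplement <- function(bases) {
  # chartr swaps a<->t and c<->g element by element; rev reverses the order
  rev(chartr("acgt", "tgca", bases))
}

dna <- strsplit("atggcttaa", "")[[1]] # hypothetical example sequence
rc <- reverseComplement(dna)
paste(rc, collapse = "") # "ttaagccat"
```

Positions reported for the reverse complement count from the 5' end of that strand, so they would need converting before being compared with positions on the original strand.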

Conclusion

After finding the ORFs, we can compare them with known sequences in a database to predict useful information such as the composition and function of the genes in this species. Next time we will introduce the Needleman-Wunsch global sequence alignment algorithm, so stay tuned! Questions and ideas are welcome in the comments!

Original site

Copyright notice
This article was written by [EricFrenzy]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/187/202207060913448110.html