当前位置：网站首页>N methods for obtaining effective length of genes

N methods for obtaining effective length of genes

2022-06-27 23:36:00 【Shengxin skill tree】

Recently, a fan volunteered to share his bioinformatics notes on Jianshu and other platforms on our official account of Shengxin skill tree , Learn from others on the professional stage ！

Very welcome , His previous share is ：Counts FPKM RPKM TPM CPM The transformation of

The following is the nickname of Jianshu : Hey, hey, hey, ha The share of

stay RNAseq In the downstream analysis of , Generally, the raw materials obtained from upstream processing will be processed counts Number converted to FPKM/RPKM or TPM For subsequent presentation or analysis , Its definition and calculation formula are in The previous share is ：Counts FPKM RPKM TPM CPM The transformation of Referred to the .

It should be noted that , In the calculation FPKM/RPKM and TPM when , Gene length generally refers to Gene effective length effective length, That is, the total length of exons or transcripts of the gene , It is more accurate to eliminate the influence of gene length caused by sequencing based on this standard . See the skill tree article of Shengxin ：

The length of the gene | Shengxin rookie group (bio-info-trainee.com)

So here comes the question , In the calculation FPKM/RPKM when , How to obtain the effective length data of each gene ？ I have summarized several ways to obtain the effective length of genes （ Or non redundant total exon length 、 Total transcript length ） Methods , It is now arranged as follows ：

One 、 Obtain the effective length of the gene from the results of the upstream output file

generally speaking ,RNA-seq Get the original counts The most commonly used upstream software for expression matrix is featureCounts and Salmon 了 , In the output of these two types of software , Except for genes （ Or transcripts ） Of counts Out of information , It also contains information about the effective length of genes , Such as featureCounts Of the output file Length This column corresponds to the effective length of the gene .（P.S. I always thought featureCounts Of Length Just simple gene length , Later, after comparison with various methods, it was found that Length This column is already the effective length of genes ... I will also show the results of the comparison of these methods later in the article ）

therefore , The most convenient way is to get it downstream counts Matrix time , The effective length information of gene is also extracted for subsequent gene expression transformation .

1. in the light of featureCounts Output file for

stay R Read from featureCounts Output file for , extract Length And corresponding geneid Information , Again according to counts Medium rowname（geneid） Match sort , You can carry out the follow-up TPM、FPKM The value is calculated .

featurecounts Output file format

rm(list=ls()) 
options(stringsAsFactors = F)  
library(tidyverse) 
# ggplot2 stringer dplyr tidyr readr purrr  tibble forcats 
library(data.table) # Multi core read file  
a1 <- fread('counts.txt', header = T, data.table = F)# load counts, The first column is set to column name  
### counts Matrix construction  
counts <- a1[,7:ncol(a1)] 
# The amount of gene expression in the sample was intercepted counts Part as counts  
rownames(counts) <- a1$Geneid # Take the gene name as the line name  
###  from featurecounts  Original output file counts.txt Extract from Geneid、Length( Length of transcript ), 
geneid_efflen <- subset(a1,select = c("Geneid","Length"))        
colnames(geneid_efflen) <- c("geneid","efflen")   
geneid_efflen_fc <- geneid_efflen # For later comparison  
###  Take out counts in geneid It's the same as efflen 
dim(geneid_efflen) 
efflen <- geneid_efflen[match(rownames(counts),                               
                              geneid_efflen$geneid),"efflen"] ###  Calculation  TPM #TPM (Transcripts Per Kilobase Million)   Transcription per thousand bases, read per million mapping Transcripts 
counts2TPM <- function(count=count, efflength=efflen){   RPK <- count/(efflength/1000)  # Per kilobase reads (“per million” scaling factor)  Length standardization    
                                                      PMSC_rpk <- sum(RPK)/1e6   #RPK Scaling factor per million  (“per million” scaling factor )  Deep standardization   
                                                      RPK/PMSC_rpk                      
                                                     }   
tpm <- as.data.frame(apply(counts,2,counts2TPM)) colSums(tpm)

among geneid_efflen The contents are as follows

geneid_efflen

2. in the light of Salmon Output file for

Salmon The output results of and the explanation of each content are as follows .Salmon Output File Formats - Salmon 1.8.0 documentation It is worth noting that ,salmon It not only gives the effective length of the gene Length（ Length of transcript ）, It also gives EffectiveLength, That is, the effective length of transcripts corrected by considering various factors . It is more recommended to use EffectiveLength Follow up analysis , It results in TPM Values are also based on EffectiveLength Calculated .

Salmon Output result of

Salmon Official interpretation of output results of

We usually use tximport Import salmon Output file for “quant.sf”（ Statistical results of transcripts ） And transcripts id And gene symbol Correspondence file , The following data will be generated ： "abundance" "counts" "length" "countsFromAbundance" tximport Generated Length Namely EffectiveLength, and "abundance" Namely TPM value , We extract Length For subsequent calculation FPKM. Be careful , Because the genes in each sample EffectiveLength There are differences , What we want to extract is actually EffectiveLength Matrix ( Or you can do it every line EffectiveLength Take the mean ).

library(tximport)  #t2s For from gtf Extracted from a file transcript_id and symbol Corresponding relation file of  
t2s <- fread("t2s_vm29_gencode.txt", data.table = F, header = F) 

## establish quant.sf Location path    Import salmon File processing summary  
quantdir <- file.path(getwd(),'salmon'); 
quantdir files <- list.files(pattern="*quant.sf",quantdir,recursive=T); files  # Display all qualified files in the directory 
files <- file.path(quantdir,files);files 
txi_gene <- tximport(files, type = "salmon", tx2gene = t2s) 

## Extract counts/tpm Expression matrix  
counts <- apply(txi_gene$counts,2,as.integer) # take counts Rounding off  
rownames(counts) <- rownames(txi_gene$counts)
tpm <- txi_gene$abundance  ###abundance For genetic Tpm value  

### extract geneid_efflen_mat 
geneid_efflen_mat <- txi_gene$length  ###Length Is the effective length of transcripts of genes  

##  Calculation  TPM 、FPKM  
if (F) { # Directly from txi Of "abundance"   Extract from , Don't run    
  tpm <- data.frame(rownames(counts),row.names = rownames(counts))     
  for (i in 1:ncol(counts)) {       
    count <- counts[,i]        
    efflength <- geneid_efflen_mat[,i]       
    RPK <- count/(efflength/1000)   # Per kilobase reads (reads per million)  Length standardization        
    PMSC_rpk <- sum(RPK)/1e6        #RPK Scaling factor per million  (“per million” scaling factor )  Deep standardization  
    tpm00 <- RPK/PMSC_rpk         
    tpm <- data.frame(tpm,tpm00)      
    rm(tpm00)    
  }    
  tpm <- tpm[,-1];  
  colnames(tpm) <- colnames(counts);  
  head(tpm)       
}      
##  Calculation  fpkm   
if(T){     
  fpkm <- data.frame(rownames(counts),row.names = rownames(counts))    
  for (i in 1:ncol(counts)) {    
    count <- counts[,i]    
    efflength <- geneid_efflen_mat[,i]   
    PMSC_counts <- sum(count)/1e6   #counts Scaling factor per million  (“per million” scaling factor)  Deep standardization     
    FPM <- count/PMSC_counts        # Per million reads/Fragments (Reads/Fragments Per Million)  Length standardization       
    fpkm00 <- FPM/(efflength/1000)     
    fpkm <- data.frame(fpkm,fpkm00)    
    rm(fpkm00)   
  }     
  fpkm <- fpkm[,-1];
  colnames(fpkm) <- colnames(counts) 
}

If you want to extract the effective length of genes in the general sense , Need to use “quant.genes.sf” file （ Statistical results of genes , Need to be in progress salmon Add a parameter to it -g , Followed by gtf file ）, extract Length The information in this column .

a2 <- fread("quant.genes.sf",             
            data.table = F) 
geneid_efflen <- subset(a2, select = c("Name","Length"))
colnames(geneid_efflen) <- c("geneid","efflen")  
geneid_efflen_salmon <- geneid_efflen # For later comparison

Two 、 from gtf Calculate the effective length of the obtained gene in the file

Two kinds of from gtf The method of calculating the effective length of genes in the document （ Sum of non redundant exon lengths ）, Refer to these two articles ： Gene length is not end-start - Simple books (jianshu.com)Htseq Count To Fpkm | KeepNotes blog (bioinfo-scrounger.com) Due to the large amount of data processed , The code runs slowly , So... Is also called in the following code parallel The package performs multi-core operations

1. utilize GenomicFeatures Package import gtf Handle

library(parallel) # Parallel computing   parApply parLapply parSaplly 
cl <- makeCluster(0.75*detectCores())  # Design enable computer 3/4 The core of 

##  utilize GenomicFeatures Package import gtf Handle 
    library(GenomicFeatures)
    txdb <- makeTxDbFromGFF("gencode.vM25.chr_patch_hapl_scaff.annotation.gtf.gz",
 format="gtf") 
    exons_gene <- exonsBy(txdb, by = "gene") ### Extract gene exons 
    head(exons_gene)

    ## Calculate the total exon length ： use reduce Remove overlapping redundancy ,,width Statistical length , Finally, calculate the total length 
    exons_gene_lens <- parLapply(cl,exons_gene,function(x){sum(width(reduce(x)))}) 
    exons_gene_lens[1:10]
    
    ## Convert to dataframe
    geneid_efflen <- data.frame(geneid=names(exons_gene_lens),
                                efflen=as.numeric(exons_gene_lens))
    geneid_efflen_gtf1 <- geneid_efflen

2. utilize rtracklayer Package import gtf Handle

## utilize rtracklayer package import Import processing  
    gtf <- as.data.frame(rtracklayer::import("gencode.vM25.chr_patch_hapl_scaff.annotation.gtf.gz"))
    table(gtf$type)

    exon <- gtf[gtf$type=="exon",
                c("start","end","gene_id")]
    exon_bygeneid <- split(exon,exon$gene_id)   # According to each geneid What it contains exon Sort into a list 
    
    efflen <- parLapply(cl,exon_bygeneid,function(x){
      tmp <- apply(x,1,function(y){  y[1]:y[2]  }) # Output exon All elements of the length value             
      length(unique(unlist(tmp))) # Repeat and count exon The number of length elements 
      }) 

    ## Convert to dataframe
    geneid_efflen <- data.frame(geneid=names(efflen),
                                efflen=as.numeric(efflen))
    geneid_efflen_gtf2 <- geneid_efflen

Comparison of the results obtained

Now we can compare the effective length of genes . Let's start with gtf Whether there is any difference between the two methods for obtaining the effective length of genes in the document . You can find , Only a few efflen There are differences , So there is almost no difference between the two methods ：

from gtf Get in file efflen Comparison

Compare again geneid_efflen_gtf1 and geneid_efflen_salmon, Half of them were found efflen It doesn't match ？ On second thought, this is understandable , Because upstream salmon It is the statistics of transcripts in the sample , This shows that half of the genes in the sample do not express all their transcripts ：

salmon and gtf From efflen Compare

then geneid_efflen_gtf1 and geneid_efflen_fc Compare , It was found that they all matched , This explanation featureCounts Of Length Indeed, it is already the effective length of genes we want （ That is, the total length of non redundant exons ）：

featureCounts and gtf From efflen Compare

summary ：

The easiest way to get the effective length of a gene is directly from featureCounts or salmon Extract from the output file of . But it should be noted that ,featureCounts Medium gene effective length Length That is, the total length of non redundant exons of the gene , and salmon Effective length of gene in Length Is the total length of the transcripts of the target gene , Because only some genes in the sample express all types of transcripts , therefore salmon The total length of the transcripts in C .
salmon The output results not only show the effective length of the gene Length（ Total length of transcripts ）, The effective length of transcripts corrected by considering various factors is also given EffectiveLength.Salmon It is more recommended to use EffectiveLength Follow up analysis , It is considered that it can better eliminate the influence of gene length during sequencing , It results in TPM Values are also based on EffectiveLength Calculated , In the follow-up analysis, we can directly use .
Without upstream raw output file , It can also be taken directly from gtf Calculation method in the document , Get the total length of non redundant exons of each gene to get the effective length of the gene . You can also save geneid_efflen For later reading ： write.csv(geneid_efflen,file = "geneid_efflen_vm25_gencode.csv",row.names = F)

Reference material

The length of the gene | Shengxin rookie group (bio-info-trainee.com)

Counts FPKM RPKM TPM The transformation of - Simple books (jianshu.com)

Gene length is not end-start - Simple books (jianshu.com)

Htseq Count To Fpkm | KeepNotes blog (bioinfo-scrounger.com)

Salmon Output File Formats - Salmon 1.8.0 documentation

原网站

版权声明
本文为[Shengxin skill tree]所创，转载请带上原文链接，感谢
https://yzsam.com/2022/178/202206272105499600.html