当前位置:网站首页>N methods for obtaining effective length of genes
N methods for obtaining effective length of genes
2022-06-27 23:36:00 【Shengxin skill tree】
Recently, a fan volunteered to share his bioinformatics notes on Jianshu and other platforms on our official account of Shengxin skill tree , Learn from others on the professional stage !
Very welcome , His previous share is :Counts FPKM RPKM TPM CPM The transformation of
The following is the nickname of Jianshu : Hey, hey, hey, ha The share of
stay RNAseq In the downstream analysis of , Generally, the raw materials obtained from upstream processing will be processed counts Number converted to FPKM/RPKM or TPM For subsequent presentation or analysis , Its definition and calculation formula are in The previous share is :Counts FPKM RPKM TPM CPM The transformation of Referred to the .
It should be noted that , In the calculation FPKM/RPKM and TPM when , Gene length generally refers to Gene effective length effective length, That is, the total length of exons or transcripts of the gene , It is more accurate to eliminate the influence of gene length caused by sequencing based on this standard . See the skill tree article of Shengxin :
The length of the gene | Shengxin rookie group (bio-info-trainee.com)
So here comes the question , In the calculation FPKM/RPKM when , How to obtain the effective length data of each gene ? I have summarized several ways to obtain the effective length of genes ( Or non redundant total exon length 、 Total transcript length ) Methods , It is now arranged as follows :
One 、 Obtain the effective length of the gene from the results of the upstream output file
generally speaking ,RNA-seq Get the original counts The most commonly used upstream software for expression matrix is featureCounts and Salmon 了 , In the output of these two types of software , Except for genes ( Or transcripts ) Of counts Out of information , It also contains information about the effective length of genes , Such as featureCounts Of the output file Length This column corresponds to the effective length of the gene .(P.S. I always thought featureCounts Of Length Just simple gene length , Later, after comparison with various methods, it was found that Length This column is already the effective length of genes ... I will also show the results of the comparison of these methods later in the article )
therefore , The most convenient way is to get it downstream counts Matrix time , The effective length information of gene is also extracted for subsequent gene expression transformation .
1. in the light of featureCounts Output file for
stay R Read from featureCounts Output file for , extract Length And corresponding geneid Information , Again according to counts Medium rowname(geneid) Match sort , You can carry out the follow-up TPM、FPKM The value is calculated .
featurecounts Output file format
rm(list=ls())
options(stringsAsFactors = F)
library(tidyverse)
# ggplot2 stringer dplyr tidyr readr purrr tibble forcats
library(data.table) # Multi core read file
a1 <- fread('counts.txt', header = T, data.table = F)# load counts, The first column is set to column name
### counts Matrix construction
counts <- a1[,7:ncol(a1)]
# The amount of gene expression in the sample was intercepted counts Part as counts
rownames(counts) <- a1$Geneid # Take the gene name as the line name
### from featurecounts Original output file counts.txt Extract from Geneid、Length( Length of transcript ),
geneid_efflen <- subset(a1,select = c("Geneid","Length"))
colnames(geneid_efflen) <- c("geneid","efflen")
geneid_efflen_fc <- geneid_efflen # For later comparison
### Take out counts in geneid It's the same as efflen
dim(geneid_efflen)
efflen <- geneid_efflen[match(rownames(counts),
geneid_efflen$geneid),"efflen"] ### Calculation TPM #TPM (Transcripts Per Kilobase Million) Transcription per thousand bases, read per million mapping Transcripts
counts2TPM <- function(count=count, efflength=efflen){ RPK <- count/(efflength/1000) # Per kilobase reads (“per million” scaling factor) Length standardization
PMSC_rpk <- sum(RPK)/1e6 #RPK Scaling factor per million (“per million” scaling factor ) Deep standardization
RPK/PMSC_rpk
}
tpm <- as.data.frame(apply(counts,2,counts2TPM)) colSums(tpm)among geneid_efflen The contents are as follows
geneid_efflen
2. in the light of Salmon Output file for
Salmon The output results of and the explanation of each content are as follows .Salmon Output File Formats - Salmon 1.8.0 documentation It is worth noting that ,salmon It not only gives the effective length of the gene Length( Length of transcript ), It also gives EffectiveLength, That is, the effective length of transcripts corrected by considering various factors . It is more recommended to use EffectiveLength Follow up analysis , It results in TPM Values are also based on EffectiveLength Calculated .
Salmon Output result of
Salmon Official interpretation of output results of
We usually use tximport Import salmon Output file for “quant.sf”( Statistical results of transcripts ) And transcripts id And gene symbol Correspondence file , The following data will be generated : "abundance" "counts" "length" "countsFromAbundance" tximport Generated Length Namely EffectiveLength, and "abundance" Namely TPM value , We extract Length For subsequent calculation FPKM. Be careful , Because the genes in each sample EffectiveLength There are differences , What we want to extract is actually EffectiveLength Matrix ( Or you can do it every line EffectiveLength Take the mean ).
library(tximport) #t2s For from gtf Extracted from a file transcript_id and symbol Corresponding relation file of
t2s <- fread("t2s_vm29_gencode.txt", data.table = F, header = F)
## establish quant.sf Location path Import salmon File processing summary
quantdir <- file.path(getwd(),'salmon');
quantdir files <- list.files(pattern="*quant.sf",quantdir,recursive=T); files # Display all qualified files in the directory
files <- file.path(quantdir,files);files
txi_gene <- tximport(files, type = "salmon", tx2gene = t2s)
## Extract counts/tpm Expression matrix
counts <- apply(txi_gene$counts,2,as.integer) # take counts Rounding off
rownames(counts) <- rownames(txi_gene$counts)
tpm <- txi_gene$abundance ###abundance For genetic Tpm value
### extract geneid_efflen_mat
geneid_efflen_mat <- txi_gene$length ###Length Is the effective length of transcripts of genes
## Calculation TPM 、FPKM
if (F) { # Directly from txi Of "abundance" Extract from , Don't run
tpm <- data.frame(rownames(counts),row.names = rownames(counts))
for (i in 1:ncol(counts)) {
count <- counts[,i]
efflength <- geneid_efflen_mat[,i]
RPK <- count/(efflength/1000) # Per kilobase reads (reads per million) Length standardization
PMSC_rpk <- sum(RPK)/1e6 #RPK Scaling factor per million (“per million” scaling factor ) Deep standardization
tpm00 <- RPK/PMSC_rpk
tpm <- data.frame(tpm,tpm00)
rm(tpm00)
}
tpm <- tpm[,-1];
colnames(tpm) <- colnames(counts);
head(tpm)
}
## Calculation fpkm
if(T){
fpkm <- data.frame(rownames(counts),row.names = rownames(counts))
for (i in 1:ncol(counts)) {
count <- counts[,i]
efflength <- geneid_efflen_mat[,i]
PMSC_counts <- sum(count)/1e6 #counts Scaling factor per million (“per million” scaling factor) Deep standardization
FPM <- count/PMSC_counts # Per million reads/Fragments (Reads/Fragments Per Million) Length standardization
fpkm00 <- FPM/(efflength/1000)
fpkm <- data.frame(fpkm,fpkm00)
rm(fpkm00)
}
fpkm <- fpkm[,-1];
colnames(fpkm) <- colnames(counts)
} If you want to extract the effective length of genes in the general sense , Need to use “quant.genes.sf” file ( Statistical results of genes , Need to be in progress salmon Add a parameter to it -g , Followed by gtf file ), extract Length The information in this column .
a2 <- fread("quant.genes.sf",
data.table = F)
geneid_efflen <- subset(a2, select = c("Name","Length"))
colnames(geneid_efflen) <- c("geneid","efflen")
geneid_efflen_salmon <- geneid_efflen # For later comparison Two 、 from gtf Calculate the effective length of the obtained gene in the file
Two kinds of from gtf The method of calculating the effective length of genes in the document ( Sum of non redundant exon lengths ), Refer to these two articles : Gene length is not end-start - Simple books (jianshu.com)Htseq Count To Fpkm | KeepNotes blog (bioinfo-scrounger.com) Due to the large amount of data processed , The code runs slowly , So... Is also called in the following code parallel The package performs multi-core operations
1. utilize GenomicFeatures Package import gtf Handle
library(parallel) # Parallel computing parApply parLapply parSaplly
cl <- makeCluster(0.75*detectCores()) # Design enable computer 3/4 The core of
## utilize GenomicFeatures Package import gtf Handle
library(GenomicFeatures)
txdb <- makeTxDbFromGFF("gencode.vM25.chr_patch_hapl_scaff.annotation.gtf.gz",
format="gtf")
exons_gene <- exonsBy(txdb, by = "gene") ### Extract gene exons
head(exons_gene)
## Calculate the total exon length : use reduce Remove overlapping redundancy ,,width Statistical length , Finally, calculate the total length
exons_gene_lens <- parLapply(cl,exons_gene,function(x){sum(width(reduce(x)))})
exons_gene_lens[1:10]
## Convert to dataframe
geneid_efflen <- data.frame(geneid=names(exons_gene_lens),
efflen=as.numeric(exons_gene_lens))
geneid_efflen_gtf1 <- geneid_efflen2. utilize rtracklayer Package import gtf Handle
## utilize rtracklayer package import Import processing
gtf <- as.data.frame(rtracklayer::import("gencode.vM25.chr_patch_hapl_scaff.annotation.gtf.gz"))
table(gtf$type)
exon <- gtf[gtf$type=="exon",
c("start","end","gene_id")]
exon_bygeneid <- split(exon,exon$gene_id) # According to each geneid What it contains exon Sort into a list
efflen <- parLapply(cl,exon_bygeneid,function(x){
tmp <- apply(x,1,function(y){ y[1]:y[2] }) # Output exon All elements of the length value
length(unique(unlist(tmp))) # Repeat and count exon The number of length elements
})
## Convert to dataframe
geneid_efflen <- data.frame(geneid=names(efflen),
efflen=as.numeric(efflen))
geneid_efflen_gtf2 <- geneid_efflenComparison of the results obtained
Now we can compare the effective length of genes . Let's start with gtf Whether there is any difference between the two methods for obtaining the effective length of genes in the document . You can find , Only a few efflen There are differences , So there is almost no difference between the two methods :
from gtf Get in file efflen Comparison
Compare again geneid_efflen_gtf1 and geneid_efflen_salmon, Half of them were found efflen It doesn't match ? On second thought, this is understandable , Because upstream salmon It is the statistics of transcripts in the sample , This shows that half of the genes in the sample do not express all their transcripts :
salmon and gtf From efflen Compare
then geneid_efflen_gtf1 and geneid_efflen_fc Compare , It was found that they all matched , This explanation featureCounts Of Length Indeed, it is already the effective length of genes we want ( That is, the total length of non redundant exons ):
featureCounts and gtf From efflen Compare
summary :
- The easiest way to get the effective length of a gene is directly from featureCounts or salmon Extract from the output file of . But it should be noted that ,featureCounts Medium gene effective length Length That is, the total length of non redundant exons of the gene , and salmon Effective length of gene in Length Is the total length of the transcripts of the target gene , Because only some genes in the sample express all types of transcripts , therefore salmon The total length of the transcripts in C .
- salmon The output results not only show the effective length of the gene Length( Total length of transcripts ), The effective length of transcripts corrected by considering various factors is also given EffectiveLength.Salmon It is more recommended to use EffectiveLength Follow up analysis , It is considered that it can better eliminate the influence of gene length during sequencing , It results in TPM Values are also based on EffectiveLength Calculated , In the follow-up analysis, we can directly use .
- Without upstream raw output file , It can also be taken directly from gtf Calculation method in the document , Get the total length of non redundant exons of each gene to get the effective length of the gene . You can also save geneid_efflen For later reading : write.csv(geneid_efflen,file = "geneid_efflen_vm25_gencode.csv",row.names = F)
Reference material
The length of the gene | Shengxin rookie group (bio-info-trainee.com)
Counts FPKM RPKM TPM The transformation of - Simple books (jianshu.com)
Gene length is not end-start - Simple books (jianshu.com)
Htseq Count To Fpkm | KeepNotes blog (bioinfo-scrounger.com)
Salmon Output File Formats - Salmon 1.8.0 documentation
边栏推荐
猜你喜欢

图的存储结构

This kind of people began to be robbed by VC with a monthly salary of 80000 yuan

First principles (optimal solution theory)

Applet referer

Usage of vivado vio IP

C language character pointer and string initialization

c语言之字符串数组
![[try to hack] kill evaluation](/img/93/e623e25dc4dec1f656227c7651577e.png)
[try to hack] kill evaluation

基于 ESXi 的黑群晖 DSM 7.0.1 安装 VMware Tools

c语言字符指针、字符串初始化问题
随机推荐
【PCL自学:PCLVisualizer】点云可视化工具PCLVisualizer
华为伙伴暨开发者大会2022 | 麒麟软件携手华为共建计算产业,共创数智未来
【AI应用】NVIDIA GeForce RTX 3060的详情参数
Stream + Nacos
Feign implements path escape through custom annotations
Is it safe to use flush mobile phones to speculate in stocks?
[learn FPGA programming from scratch -48]: Vision - development and application of intelligent sensors
How to set the enterprise wechat group robots to send messages regularly?
6G显卡显存不足出现CUDA Error:out of memory解决办法
[sword finger offer] 48 Longest substring without duplicate characters
【AI应用】NVIDIA Tesla V100S-PCIE-32GB的详情参数
抓出那些重复的基因
PAT乙级1013
Stream + Nacos
【IDEA】IDEA 格式化 代码技巧 idea 格式化 会加 <p> 标签
Swing UI container (I)
【AI应用】NVIDIA Tesla V100-PCIE-32GB的详情参数
本机部署一个MongoDB单节点服务器,并启用auth验证、开启oplog
Structure de stockage des graphiques
Small chip chiplet Technology