当前位置:网站首页>【无标题】
【无标题】
2022-07-28 16:29:00 【皮肤科大白】
首先下载序列
这里我选择在GEO官网的GPL平台下载 : https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL21827
rm(list = ls()) ## 魔幻操作,一键清空~
options(stringsAsFactors = F)
# 注意查看下载文件的大小,检查数据
f='GPL21827_eSet.Rdata'
library(GEOquery)
# 这个包需要注意两个配置,一般来说自动化的配置是足够的。
#Setting options('download.file.method.GEOquery'='auto')
#Setting options('GEOquery.inmemory.gpl'=FALSE)
if(!file.exists(f)){
gset <- getGEO('GPL21827', destdir="." ) ## 平台文件
save(gset,file=f) ## 保存到本地
}
load('GPL21827_eSet.Rdata') ## 载入数据
class(gset)
length(gset)
gset
colnames(Table(gset))
probe2symbol=Table(gset)[,c(1,4)]
all_recs=paste(apply(probe2symbol,1,function(x) paste0('>',x[1],'\n',x[2])),collapse = '\n')
temp <- tempfile() ## 编程技巧,把变量写入临时文件~
write(all_recs, temp)
之所以写出到fastq文件,是因为它可以拿去走比对流程。
然后对人类的参考基因组构建索引并且比对
主要是参考基因组下载会耗费时间,还有构建索引耗时也很可观!
library(Rsubread)
# 推荐从ENSEMBL上面下载成套的参考基因组fa及基因注释GTF文件
dir='~/data/project/qiang/release1/Genomes/'
ref <- file.path(dir,'Homo_sapiens.GRCh38.dna.toplevel.fa')
buildindex(basename="reference_index",reference=ref)
## 是单端数据,fa序列来源于上一个步骤输出的gpl的探针
reads <- temp
align(index="reference_index",readfile1=reads,
output_file="alignResults.BAM",phredOffset=64)
propmapped("alignResults.BAM")
构建大约耗时一个小时,具体如下:
比对速度很快,因为探针序列只有6万多,如下:
读入人类基因组注释文件
也是需要一点点R技巧,可以参考我在生信菜鸟团的博客:http://www.bio-info-trainee.com/3742.html 使用refGenome加上dplyr玩转gtf文件
library(Rsubread)
# 推荐从ENSEMBL上面下载成套的参考基因组fa及基因注释GTF文件
dir='~/data/project/release1/Genomes/'
gtf <- file.path(dir,'Homo_sapiens.GRCh38.82.gtf')
if(!require(refGenome)) install.packages("refGenome")
# create ensemblGenome object for storing Ensembl genomic annotation data
ens <- ensemblGenome()
# read GTF file into ensemblGenome object
read.gtf(ens,useBasedir = F,gtf)
class(ens)
# counts all annotations on each seqname
tableSeqids(ens)
# returns all annotations on mitochondria
extractSeqids(ens, 'MT')
# summarise features in GTF file
tableFeatures(ens)
# create table of genes
my_gene <- getGenePositions(ens)
dim(my_gene)
# length of genes
gt=my_gene
my_gene_length <- gt$end - gt$start
my_density <- density(my_gene_length)
plot(my_density, main = 'Distribution of gene lengths')
## 重点是要成为对象
library(GenomicRanges)
my_gr <- with(my_gene, GRanges(seqid, IRanges(start, end),
strand, id = gene_id))
library(Rsubread)
# 推荐从ENSEMBL上面下载成套的参考基因组fa及基因注释GTF文件
dir='~/data/project/release1/Genomes/'
## 这个文件有1.4G哦!!!
gtf <- file.path(dir,'Homo_sapiens.GRCh38.82.gtf')
if(!require(refGenome)) install.packages("refGenome")
# create ensemblGenome object for storing Ensembl genomic annotation data
ens <- ensemblGenome()
# read GTF file into ensemblGenome object
read.gtf(ens,useBasedir = F,gtf)
## 耗时2分钟,取决于电脑配置
class(ens)
# counts all annotations on each seqname
tableSeqids(ens)
# returns all annotations on mitochondria
extractSeqids(ens, 'MT')
# summarise features in GTF file
tableFeatures(ens)
# create table of genes
my_gene <- getGenePositions(ens)
dim(my_gene)
# 这样就轻松拿到了所有基因的坐标信息啦
坐标映射
把自己制作好的bam文件的坐标,跟提取自gtf文件的坐标信息对应起来,使用GenomicRanges包自带的函数即可。
值得注意的是把bam文件读入R,并且转为grange对象需要一点点技巧,我在生信菜鸟团博客写过:http://www.bio-info-trainee.com/3740.html
library(Rsamtools)
bamFile="alignResults.BAM"
quickBamFlagSummary(bamFile)
# https://kasperdanielhansen.github.io/genbioconductor/html/Rsamtools.html
bam <- scanBam(bamFile)
bam
names(bam[[1]])
tmp=as.data.frame(do.call(cbind,lapply(bam[[1]], as.character)))
tmp=tmp[tmp$flag!=4,] # 60885 probes
# intersect() on two GRanges objects.
library(GenomicRanges)
my_seq <- with(tmp, GRanges(as.character(rname),
IRanges(as.numeric(pos)-60, as.numeric(pos)+60),
as.character(strand),
id = as.character(qname)))
gr3 = intersect(my_seq,my_gr)
gr3
o = findOverlaps(my_seq,my_gr)
o
lo=cbind(as.data.frame(my_seq[queryHits(o)]),
as.data.frame(my_gr[subjectHits(o)]))
head(lo)
write.table(lo,file = 'GPL21827_probe2ensemb.csv',row.names = F,sep = ',')
边栏推荐
- 【presto 】presto 新版本升级详情
- High speed circuit design practice -- Overview
- Master JVM interview topics and answers offer get soft (with learning roadmap)
- MySQL PgSQL realizes the merging of multiple lines of records into one line, grouping and merging, and dividing with specified characters
- Verilog 每日一题(VL2 异步复位的串联T触发器--牛客网)
- Verilog daily question (vl26 simple stopwatch)
- Use Alibaba cloud's free SSL certificate
- 掌握JVM面试专题和答案Offer拿到手软(附学习路线图)
- Verilog 每日一题 (VL30 RAM的简单实现)
- Visual studio 2012/2015 releases web applications together with.Cs source code
猜你喜欢

High speed circuit design practice -- Overview

No interactive operation of shell script

异步FIFO基本原理(基于Verilog的简单实现)

Shell脚本之AWK

技术面轻松通过,HR:只有三年大厂经验的不值20K

clang format

The actual combat of the beego framework of goweb development: Section III program execution process analysis

Verilog 每日一题 (VL28 加减计数器)

Selection and application of inductors in high speed circuits

Application system log structure of elastic stack
随机推荐
【presto 】presto 新版本升级详情
谈谈你知道的发布上线(二)
Several methods of importing excel file data by C #
面试官:算法刷题实录.pdf我居然答不上来
High speed circuit design practice -- Overview
The practice of beego framework developed by goweb: Section 4 database configuration and connection
ionic中的$ionicPopup连续两个调用alert时需要注意的事项
JDWP未授权快速利用
漫谈测试平台—建设模式探讨
Esp-mqtt-at instruction connects Alibaba cloud Internet of things platform
Vscode uses eslint prettier to format code automatically
AMQP protocol details
Linear algebra and matrix theory (IX)
Convert multidimensional object array to one-dimensional array
阿里P8架构师谈:成为架构师必须学好的七大知识点(含面试题)
Solve the problem of exclusive SQL Server database
Role of Fortress machine
Encountered.Sqlite file processing during Android Development
Awk of shell script
About standard IO buffers