当前位置:网站首页>【无标题】
【无标题】
2022-07-28 16:29:00 【皮肤科大白】
首先下载序列
这里我选择在GEO官网的GPL平台下载 : https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL21827
rm(list = ls()) ## 魔幻操作,一键清空~
options(stringsAsFactors = F)
# 注意查看下载文件的大小,检查数据
f='GPL21827_eSet.Rdata'
library(GEOquery)
# 这个包需要注意两个配置,一般来说自动化的配置是足够的。
#Setting options('download.file.method.GEOquery'='auto')
#Setting options('GEOquery.inmemory.gpl'=FALSE)
if(!file.exists(f)){
gset <- getGEO('GPL21827', destdir="." ) ## 平台文件
save(gset,file=f) ## 保存到本地
}
load('GPL21827_eSet.Rdata') ## 载入数据
class(gset)
length(gset)
gset
colnames(Table(gset))
probe2symbol=Table(gset)[,c(1,4)]
all_recs=paste(apply(probe2symbol,1,function(x) paste0('>',x[1],'\n',x[2])),collapse = '\n')
temp <- tempfile() ## 编程技巧,把变量写入临时文件~
write(all_recs, temp)
之所以写出到fastq文件,是因为它可以拿去走比对流程。
然后对人类的参考基因组构建索引并且比对
主要是参考基因组下载会耗费时间,还有构建索引耗时也很可观!
library(Rsubread)
# 推荐从ENSEMBL上面下载成套的参考基因组fa及基因注释GTF文件
dir='~/data/project/qiang/release1/Genomes/'
ref <- file.path(dir,'Homo_sapiens.GRCh38.dna.toplevel.fa')
buildindex(basename="reference_index",reference=ref)
## 是单端数据,fa序列来源于上一个步骤输出的gpl的探针
reads <- temp
align(index="reference_index",readfile1=reads,
output_file="alignResults.BAM",phredOffset=64)
propmapped("alignResults.BAM")
构建大约耗时一个小时,具体如下:
比对速度很快,因为探针序列只有6万多,如下:
读入人类基因组注释文件
也是需要一点点R技巧,可以参考我在生信菜鸟团的博客:http://www.bio-info-trainee.com/3742.html 使用refGenome加上dplyr玩转gtf文件
library(Rsubread)
# 推荐从ENSEMBL上面下载成套的参考基因组fa及基因注释GTF文件
dir='~/data/project/release1/Genomes/'
gtf <- file.path(dir,'Homo_sapiens.GRCh38.82.gtf')
if(!require(refGenome)) install.packages("refGenome")
# create ensemblGenome object for storing Ensembl genomic annotation data
ens <- ensemblGenome()
# read GTF file into ensemblGenome object
read.gtf(ens,useBasedir = F,gtf)
class(ens)
# counts all annotations on each seqname
tableSeqids(ens)
# returns all annotations on mitochondria
extractSeqids(ens, 'MT')
# summarise features in GTF file
tableFeatures(ens)
# create table of genes
my_gene <- getGenePositions(ens)
dim(my_gene)
# length of genes
gt=my_gene
my_gene_length <- gt$end - gt$start
my_density <- density(my_gene_length)
plot(my_density, main = 'Distribution of gene lengths')
## 重点是要成为对象
library(GenomicRanges)
my_gr <- with(my_gene, GRanges(seqid, IRanges(start, end),
strand, id = gene_id))
library(Rsubread)
# 推荐从ENSEMBL上面下载成套的参考基因组fa及基因注释GTF文件
dir='~/data/project/release1/Genomes/'
## 这个文件有1.4G哦!!!
gtf <- file.path(dir,'Homo_sapiens.GRCh38.82.gtf')
if(!require(refGenome)) install.packages("refGenome")
# create ensemblGenome object for storing Ensembl genomic annotation data
ens <- ensemblGenome()
# read GTF file into ensemblGenome object
read.gtf(ens,useBasedir = F,gtf)
## 耗时2分钟,取决于电脑配置
class(ens)
# counts all annotations on each seqname
tableSeqids(ens)
# returns all annotations on mitochondria
extractSeqids(ens, 'MT')
# summarise features in GTF file
tableFeatures(ens)
# create table of genes
my_gene <- getGenePositions(ens)
dim(my_gene)
# 这样就轻松拿到了所有基因的坐标信息啦
坐标映射
把自己制作好的bam文件的坐标,跟提取自gtf文件的坐标信息对应起来,使用GenomicRanges包自带的函数即可。
值得注意的是把bam文件读入R,并且转为grange对象需要一点点技巧,我在生信菜鸟团博客写过:http://www.bio-info-trainee.com/3740.html
library(Rsamtools)
bamFile="alignResults.BAM"
quickBamFlagSummary(bamFile)
# https://kasperdanielhansen.github.io/genbioconductor/html/Rsamtools.html
bam <- scanBam(bamFile)
bam
names(bam[[1]])
tmp=as.data.frame(do.call(cbind,lapply(bam[[1]], as.character)))
tmp=tmp[tmp$flag!=4,] # 60885 probes
# intersect() on two GRanges objects.
library(GenomicRanges)
my_seq <- with(tmp, GRanges(as.character(rname),
IRanges(as.numeric(pos)-60, as.numeric(pos)+60),
as.character(strand),
id = as.character(qname)))
gr3 = intersect(my_seq,my_gr)
gr3
o = findOverlaps(my_seq,my_gr)
o
lo=cbind(as.data.frame(my_seq[queryHits(o)]),
as.data.frame(my_gr[subjectHits(o)]))
head(lo)
write.table(lo,file = 'GPL21827_probe2ensemb.csv',row.names = F,sep = ',')
边栏推荐
- 区分ES6的export与Nodejs的module.exports的区别
- Using SQL server agent job to restore the database regularly
- Ant financial mobile testing tool solopi monitoring part of the source code guide.. Continuous update
- Verilog 每日一题(VL8 使用generate…for语句简化代码)
- C language to achieve minesweeping games
- PCA 报错Error in eigen(crossprod(t(X), t(X)), symmetric = TRUE) : ‘x‘里有无穷值或遗漏值
- JDWP未授权快速利用
- Zero foundation uses unity3d to develop AR applications and download 3D models remotely
- 技术面轻松通过,HR:只有三年大厂经验的不值20K
- Verilog 每日一题(VL29 单端口RAM)
猜你喜欢

Verilog daily question (vl29 single port RAM)

Asynchronous circuit design -- principle and example of synchronous pulser

Verilog daily question (vl4 shift operation and multiplication)

MySQL detailed learning tutorial (recommended Collection)

一文掌握 JVM 面试要点

AMQP协议详解

渗透测试大杀器kali安装配置

Encrypt the video and upload it to OSS to achieve high concurrent access

掌握JVM面试专题和答案Offer拿到手软(附学习路线图)

Master the key points of JVM interview
随机推荐
LNMP源码编译安装
Using OCR to reverse recognize text content in airtest
Sed of shell programming
[CDH] configure CDH components through clouderamanager and collect JMX information with Prometheus monitoring
js将本地时间与服务器时间同步
Punctual atomic serial port protocol
【CDH】通过 ClouderaManager 配置CDH组件用 prometheus 监控采集JMX信息
Iris framework practice of goweb development: project summary and review
MySQL的触发器
Selection of resistance in high speed circuit
Selection and application of inductors in high speed circuits
The practice of the beego framework for goweb development: Section V project construction and user registration
简单易用的APP专项测试工具iTest4.7.0发布啦
Verilog 每日一题 (VL28 加减计数器)
Verilog daily question (vl26 simple stopwatch)
C # basic interview questions (with answers)
连接设计与测试平台——SystemVerilog 接口知识点总结
渗透测试大杀器kali安装配置
Verilog 每日一题(VL8 使用generate…for语句简化代码)
Several methods of importing excel file data by C #