RNA-seq Hands-on Introduction (I): upstream data download, format conversion, and quality-control cleaning
2022-06-28 00:09:00 【Shengxin skill tree】
After two recruitment posts in a row ("Once I brought you 100,000 users, but now I wish you closure" and "Shengxin skill tree knowledge-organizing intern recruitment"), I was lucky enough to get to know several excellent friends. Everyone started following my NGS omics videos to analyze a series of public datasets, and some of them really surprised me: without any communication or guidance, they quietly completed a full hands-on analysis!
His previous posts:
- Converting between counts, FPKM, RPKM, TPM, and CPM
- N ways to obtain the effective length of genes
Below are his detailed notes on our Bilibili transcriptome video course.
Overview of this section:
- 1. Find the GEO accession number in the article and get the SRR numbers of the data from NCBI
- 2. On Linux, use the prefetch command to download the SRA files by SRR number
- 3. Use fasterq-dump/fastq-dump to convert the SRA files to FASTQ format, then compress them with multithreaded pigz (optional)
- 4. Use fastqc and multiqc to check the quality of the sequencing data
- 5. Use trim-galore to remove low-quality bases and adapters
This follows on from the previous section, RNA-seq Hands-on Introduction (0): preparation before the RNA-seq workflow, setting up new Linux and R environments
I. Getting the SRR numbers from NCBI
Data source article: Formative pluripotent stem cells show features of epiblast cells poised for gastrulation | Cell Research (nature.com). The GEO accession number is listed under Data availability in the article: GSE154290
Go to the NCBI website, search for GSE154290, and open the matching result
Under Supplementary file, choose the SRA Run Selector option
Common Fields summarizes the basic information of the dataset; for example, PAIRED in the table means paired-end sequencing data. For this exercise, under Found 27 Items tick two samples each of RNA_mESCs and RNA_EpiSCs, switch to the Selected option under Select, download the Accession List, and copy the SRR numbers from it
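Pasted accession lists sometimes carry blank lines, stray spaces, Windows line endings, or duplicates, any of which can trip up a later loop over the IDs. The sketch below is one way to tidy such a list before use. It assumes one accession per line; the function name clean_ids and the input file name in the usage note are illustrative and not part of the original workflow.

```bash
# clean_ids FILE: print unique, well-formed run accessions, one per line.
# Hypothetical helper; assumes FILE holds one accession per line.
clean_ids() (
  tr -d '\r' < "$1" |          # drop Windows carriage returns
    awk '
      { gsub(/^[ \t]+|[ \t]+$/, "") }     # trim surrounding blanks
      $0 == "" { next }                    # skip empty lines
      !/^[SED]RR[0-9]+$/ {                 # report anything that is not a run accession
        print "skipping malformed line: " $0 > "/dev/stderr"
        next
      }
      !seen[$0]++                          # print each accession once, in input order
    '
)
```

For example, `clean_ids SRR_Acc_List.txt > idname` would write the cleaned list, assuming the Accession List was saved under that name.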
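II. Downloading the SRA data
1. Create and enter the test project folder in the home directory (the later scripts refer to it as ~/test), then paste the SRR numbers into the idname file and end the input with Ctrl+C
mkdir -p ~/test ; cd ~/test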
cat > idname
SRR12207279
SRR12207280
SRR12207283
SRR12207284
^C
2. Create the script file for downloading the SRA data
vim 00_prefetch.sh
The script mainly uses the prefetch command from sra-tools to download the SRA data
#sh content ################################
echo -e "\n \n \n prefetch sra !!! \n \n \n "
date
mkdir -p ~/test/raw/sra/
cd ~/test/raw/sra/
pwd
cat ~/test/idname | while read id
do
# launch each download in the background so that all runs are fetched in parallel
( prefetch -O ./ $id & )
done
3. Run the script in the background and write its output to the log_00 log file
nohup bash 00_prefetch.sh >log_00 2>&1 &
Check the running system tasks and the file structure under the test project
The task is running smoothly. While waiting for the download to finish, go and relax for a while ヽ( ̄▽ ̄)ノ When cat log_00 shows the words "downloaded successfully" for every SRR number, the download is complete. Then check the downloaded data; once everything is confirmed to be there, move on to the file format conversion
[Figure: prefetch.log]
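Besides reading log_00, one way to confirm the download is to check that every ID in idname has a non-empty .sra file. This is a minimal sketch, assuming prefetch placed each run at <dir>/<id>/<id>.sra (the per-run subfolder layout that the next script's move step relies on). The function name missing_sra is hypothetical.

```bash
# missing_sra IDFILE SRADIR: print every ID whose .sra file is absent or empty.
# Hypothetical helper; assumes the layout SRADIR/<id>/<id>.sra.
missing_sra() (
  # "|| [ -n "$id" ]" also handles a last line without a trailing newline
  while read -r id || [ -n "$id" ]; do
    [ -n "$id" ] || continue                  # ignore blank lines
    [ -s "$2/$id/$id.sra" ] || echo "$id"     # -s: file exists and is non-empty
  done < "$1"
)
```

Running `missing_sra ~/test/idname ~/test/raw/sra` prints nothing when all runs are present; any ID it prints would need to be downloaded again.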
III. Converting SRA files to FASTQ format
This step mainly uses the fasterq-dump command from sra-tools to convert the files to FASTQ, then compresses them into .gz files with multithreaded pigz to save space (this can be skipped), and finally runs fastqc and multiqc for quality control and a QC summary of the raw data ~
[Figure: common fasterq-dump/fastq-dump parameters]
As before, first create the 01_sra2fq_qc1.sh script file
vim 01_sra2fq_qc1.sh
###########################################
# Move the .sra files out of their per-run subfolders, then delete the subfolders
date
echo -e "\n \n \n 111# move files !!! \n \n \n "
cd ~/test/raw/sra/
cat ~/test/idname | while read id
do
# skip blank lines so that $id/ can never expand to /
[ -n "$id" ] || continue
mv "$id"/* ./
rm -rf "$id"/
done
date
echo -e "\n \n \n 111# sra>>>fq !!! \n \n \n "
mkdir -p ~/test/raw/fq/
cd ~/test/raw/fq/
pwd
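ls ~/test/raw/sra/*.sra |while read id
do
echo " PROCESS $(basename $id) "
# -3 (--split-3) writes paired reads to _1/_2 files; -e sets the thread count; -O sets the output folder
fasterq-dump -3 -e 12 -O ./ $id
# compress the newly written FASTQ files (names ending in q) with 12 threads
pigz -p 12 ~/test/raw/fq/*q
done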
date
echo -e " \n \n \n 111# qc 1 !!! \n \n \n "
mkdir -p ~/test/raw/qc1/
cd ~/test/raw/qc1/
pwd
ls ~/test/raw/fq/* | xargs fastqc -t 12 -o ./
multiqc ./
echo -e " \n 111# ALL Work Done!!! \n "
date
Run the 01_sra2fq_qc1.sh script
nohup bash 01_sra2fq_qc1.sh >log_01 2>&1 &
Wait for the task to finish, then check the data under the raw folder
tree raw
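Before relying on the compressed FASTQ files, it may be worth confirming that none of them was truncated, for instance if the job was interrupted during compression. This is a minimal sketch using gzip's built-in integrity test (`gzip -t`). The function name corrupt_gz is hypothetical, and it assumes the files end in .gz as produced by pigz in the script above.

```bash
# corrupt_gz DIR: print every .gz file in DIR that fails gzip's integrity test.
# Hypothetical helper; an empty result means all archives decompress cleanly.
corrupt_gz() (
  for f in "$1"/*.gz; do
    [ -e "$f" ] || continue                  # the glob matched nothing
    gzip -t "$f" 2>/dev/null || echo "$f"    # -t: test the archive without extracting it
  done
)
```

For instance, `corrupt_gz ~/test/raw/fq` lists any file that would need to be converted and compressed again.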
IV. Quality-control cleaning
1. Checking raw data quality
Open the multiqc_report.html QC summary page in the qc1 folder from the previous step, focusing on sequencing quality and adapter content. The data quality turns out to be good: the mean quality score is above 30 and the adapter content is very low. For a detailed explanation of the report, see: 20160410 Sequencing analysis: quality control with FastQC - Zhihu (zhihu.com)
Xiao L's bioinformatics learning diary 3: How to judge the quality of raw data? Part 1 (qq.com)
Xiao L's bioinformatics learning diary 4: How to judge the quality of raw data? Part 2 - Zhihu (zhihu.com)
2. Cleaning the data
This step mainly uses trim-galore to remove low-quality bases and adapters. For detailed usage, see: Software for the lncRNA assembly workflow: trim-galore. Common parameters are as follows:
[Figure: common trim-galore parameters]
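For paired-end data, trim_galore has to receive each _1 file together with its _2 mate. As a sketch of one way to build that pairing, the mate's name can be derived from the _1 file name, so that a missing mate is reported instead of silently shifting the columns. The function name list_pairs is hypothetical, and the <sample>_1.fastq.gz / <sample>_2.fastq.gz naming is an assumption about the files produced in the previous step.

```bash
# list_pairs FQDIR: print "<_1 path><TAB><_2 path>" for each sample that has both mates.
# Hypothetical helper; assumes mates are named <sample>_1.fastq.gz and <sample>_2.fastq.gz.
list_pairs() (
  for fq1 in "$1"/*_1.fastq.gz; do
    [ -e "$fq1" ] || continue                  # the glob matched nothing
    fq2="${fq1%_1.fastq.gz}_2.fastq.gz"        # swap the _1 suffix for _2
    if [ -e "$fq2" ]; then
      printf '%s\t%s\n' "$fq1" "$fq2"
    else
      echo "no mate found for $fq1" >&2        # warn instead of emitting a broken pair
    fi
  done
)
```

Redirecting the output, as in `list_pairs ~/test/raw/fq > config`, gives a two-column file that a `while read fq1 fq2` loop can consume line by line.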
vim 2_cleanfq_qc2.sh
##############################################
echo -e " \n \n \n 222# Clean ! trim_galore !!! \n \n \n"
date
mkdir -p ~/test/clean/
cd ~/test/clean/
pwd
##########single###########################################################################
#ls ~/test/raw/fq/*.f* | while read id
#do
# trim_galore -q 25 -j 4 --phred33 --length 35 --stringency 3 \
# --gzip -o ~/test/clean/ $id
#done
#
##########paired###########################################################################
#1) First store the paths of the _1 and _2 files in two separate lists, then merge them into two columns and save as config #########
ls ~/test/raw/fq/*_1* >1
ls ~/test/raw/fq/*_2* >2
paste 1 2 >config
cat config | while read id
do
# split the line into the two mate paths
arr=($id)
fq1=${arr[0]}
fq2=${arr[1]}
# -q 25: quality cutoff; --length 35: drop reads shorter than 35 bp after trimming; --stringency 3: minimum adapter overlap; -j 4: cores
trim_galore -j 4 -q 25 --phred33 --length 35 --stringency 3 \
--paired --gzip -o ~/test/clean/ $fq1 $fq2
done
###########################################################################################
echo -e "\n \n \n 222# qc2 check the cleaning results !!! \n \n \n"
mkdir -p ~/test/clean/qc2
cd ~/test/clean/qc2
pwd
ls ~/test/clean/*f*.gz | xargs fastqc -t 12 -o ~/test/clean/qc2
multiqc ./
echo -e " \n 222# ALL Work Done !!! \n "
date
Run the 2_cleanfq_qc2.sh script
nohup bash 2_cleanfq_qc2.sh >log_2 2>&1 &
3. Checking data quality after cleaning
Open the multiqc_report.html QC summary page under ~/test/clean/qc2; the base quality has improved
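The MultiQC report shows per-base quality. A complementary check is how many reads survived trimming. This is a minimal sketch, assuming standard four-line FASTQ records. The function names count_reads and retained_pct are hypothetical, and the file names in the usage note follow trim_galore's usual _val_1 naming, which is an assumption about this run's output.

```bash
# count_reads FILE.gz: number of reads in a gzipped FASTQ file (4 lines per read).
count_reads() (
  lines=$(gzip -dc "$1" | wc -l | tr -d ' ')   # tr strips the padding some wc versions add
  echo $((lines / 4))
)

# retained_pct RAW.gz CLEAN.gz: percentage of reads that survived trimming.
retained_pct() (
  raw=$(count_reads "$1")
  clean=$(count_reads "$2")
  [ "$raw" -gt 0 ] || { echo "no reads in $1" >&2; exit 1; }
  awk -v r="$raw" -v c="$clean" 'BEGIN { printf "%.1f\n", 100 * c / r }'
)
```

A call might look like `retained_pct ~/test/raw/fq/SRR12207279_1.fastq.gz ~/test/clean/SRR12207279_1_val_1.fq.gz`. An unexpectedly low percentage would suggest revisiting the -q or --length thresholds.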
At this point we have finished downloading the RNA-seq raw data, converting the format, and QC cleaning. The quality-controlled fastq files are stored in the clean folder and can now be used for the next steps, alignment and quantification (hisat2 + featureCounts, or salmon), to finally obtain the counts file we want
References
20160410 Sequencing analysis: quality control with FastQC - Zhihu (zhihu.com)
Xiao L's bioinformatics learning diary 3: How to judge the quality of raw data? Part 1 (qq.com)
Xiao L's bioinformatics learning diary 4: How to judge the quality of raw data? Part 2 - Zhihu (zhihu.com)
Software for the lncRNA assembly workflow: trim-galore
This hands-on tutorial is based on the following videos shared by Shengxin skill tree:
【Shengxin skill tree】Transcriptome sequencing data analysis_bilibili
【Shengxin skill tree】GEO database mining_bilibili