当前位置:网站首页>July 30, 2021 [wgs/gwas] - whole genome analysis process (Part I)
July 30, 2021 [wgs/gwas] - whole genome analysis process (Part I)
2022-06-30 07:39:00 【Muyiqing】
Catalog
Abstract
After half a year , Finally put WGS The previous analysis uses snakemake Set up. . Don't let me be slow , There are really not many projects , The process is not particularly complicated . Previous shell Scripts can also be used , Therefore, it has not been really built yet . Now there are more and more projects , Considering the improvement of work efficiency , I did it a few days ago 2 individual WGS Project , Sort out the process .
Command line
#vim: set syntax=python
#__author__ = "Yang Xin"
#__copyright__ = "Copyright 2021, Wang lab"
#__email__ = "[email protected]"
#__license__ = "SAU"
"""
A WGS analysis workflow using trim_galore, BWA, Samtools and GATK.
"""
configfile:"config.yaml"
import yaml
rule all:
input:
expand("03.snp/{sample}_stat.csv", sample=config["samples"]),# Comparison group mutation output table
expand("{genome}.bwt", genome=config["reference"]),# Reference genome indexing ,bwa Comparison needs
expand("{genome}.fai", genome=config["reference"]),# Reference genome indexing ,GATK Comparison needs
expand("02.align/ref/{genome_profix}.dict", genome_profix=config["profix"]),# Reference genome indexing ,GATK Comparison needs
"01.data/data_quality_stat.txt",# Quality control statistics
#############################quality_control######################
rule trim_galore:
input:
R1 = "01.data/{sample}_1.fq.gz",
R2 = "01.data/{sample}_2.fq.gz"
output:
dir = directory("01.data/trim/{sample}"),# Output sample QC folder
R1 = "01.data/trim/{sample}/{sample}_1_val_1.fq.gz",
R2 = "01.data/trim/{sample}/{sample}_2_val_2.fq.gz",
threads:8
shell:
"trim_galore -j {threads} -q 25 --phred33 --length 36 -e 0.1 --stringency 3 --paired -o {output.dir} --fastqc {input.R1} {input.R2} "
rule fastp:
input:
R1 = "01.data/trim/{sample}/{sample}_1_val_1.fq.gz",
R2 = "01.data/trim/{sample}/{sample}_2_val_2.fq.gz"
output:
temp("01.data/trim/{sample}.json")# Generate temporary files .json, next rule Delete after use .
threads:16
shell:
"fastp -w {threads} -i {input.R1} -I {input.R2} -j {output} "
rule fastp_stat:
input:
expand("01.data/trim/{sample}.json",sample=config["samples"])
output:
"01.data/fastp_stat.txt"
script:
"scripts/fastp_stat.py"
# according to json Document statistics of quality control information of all samples , Previous articles have .
rule sort_fastp_stat:
input:
"01.data/fastp_stat.txt"
output:
"01.data/data_quality_stat.txt"
shell:
"sort 01.data/fastp_stat.txt > 01.data/data_quality_stat.txt;sed -i '1i Sample_ID Clean_reads Clean_bases ≥Q30 GC_Content' 01.data/data_quality_stat.txt;rm -rf 01.data/fastp_stat.txt"
# Format
#################################align#########################
rule bwa_index:
input:
genome = config["reference"]
output:
"{genome}.bwt",
shell:
"bwa index {input.genome}"
rule bwa_map:
input:
R1 = "01.data/trim/{sample}/{sample}_1_val_1.fq.gz",
R2 = "01.data/trim/{sample}/{sample}_2_val_2.fq.gz",
genome_index = expand("{genome}.bwt",genome=config["reference"])
output:
temp("02.align/{sample}_align.sam")
params:
genome = config["reference"],
threads:16
shell:
"bwa mem -R \'@RG\\tID:foo\\tSM:bar\\tLB:Abace\' -t {threads} {params.genome} {input.R1} {input.R2} > {output}"
# Be careful , Special symbols in double quotation marks (()''%& etc. ) There will be ambiguity , You need to use a backslash symbol to comment .
rule samtools_view:#sam2bam
input:
"02.align/{sample}_align.sam"
output:
temp("02.align/{sample}.tmp.bam")
shell:
"samtools view -bS {input} > {output} "
rule samtools_sort:# Sort
input:
"02.align/{sample}.tmp.bam"
output:
"02.align/{sample}.bam",
shell:
"samtools sort {input} -o {output} "
rule samtools_faidx:# Index
input:
genome = config["reference"]
output:
"{genome}.fai"
shell:
"samtools faidx {input.genome}"
########################GATK_filter##################
rule GATK_dict:# Index
input:
genome = config["reference"]
output:
dict = "02.align/ref/{genome_profix}.dict"
shell:
"gatk CreateSequenceDictionary -R {input.genome} -O {output.dict}"
rule GATK_sort:# Use GATK Yes bam File reorder , duplicate removal .
input:
sample = "02.align/{sample}.bam",
output:
sort_bam = temp("02.align/{sample}_sort.bam")
params:
genome = config["reference"],
shell:
"gatk SortSam -I {input.sample} -O {output.sort_bam} -R {params.genome} -SO coordinate --CREATE_INDEX true"
rule GATK_dupli:
input:
sample = "02.align/{sample}_sort.bam",
output:
dup_bam = temp("02.align/{sample}_sort_dup.bam")
params:
genome = config["reference"]
log:
metrics = "02.align/log/{sample}_markdup_metrics.txt"
shell:
"gatk MarkDuplicates -I {input.sample} -O {output.dup_bam} -M {log.metrics}"
rule samtools_index2:
input:
"02.align/{sample}_sort_dup.bam"
output:
"02.align/{sample}_sort_dup.bam.bai"
shell:
"samtools index {input}"
rule Calling:
input:
sample = "02.align/{sample}_sort_dup.bam",
index = "02.align/{sample}_sort_dup.bam.bai"
output:
sample_vcf = temp("03.snp/{sample}.g.vcf.gz")
params:
genome = config["reference"]
shell:
"gatk HaplotypeCaller -ERC GVCF -R {params.genome} -I {input.sample} -O {output.sample_vcf} --annotate-with-num-discovered-alleles true"
# Be careful -ERC Parameters , Species with small genomes may or may not be added , The big genome must be added , And use -L Note according to the chromosome number , Then merge .
rule GVCF2VCF:# Look for mutation sites , The results here include snp and indel
input:
sample = "03.snp/{sample}.g.vcf.gz",
output:
sample_vcf = "03.snp/{sample}.vcf"
params:
genome = config["reference"]
shell:
"gatk GenotypeGVCFs -R {params.genome} -V {input.sample} -O {output.sample_vcf} --annotate-with-num-discovered-alleles true"
rule SNP_calling:# select snp site
input:
sample = "03.snp/{sample}.vcf",
output:
sample_vcf = "03.snp/{sample}_snp.vcf"
params:
genome = config["reference"]
shell:
"gatk SelectVariants -R {params.genome} -V {input.sample} -O {output.sample_vcf} --select-type-to-include SNP"
rule VCF_stat:# Yes snp Statistics of loci
input:
snp_vcf = "03.snp/{sample}_snp.vcf"
output:
"03.snp/{sample}_stat.csv"
shell:
"python scripts/snp_frequency.py {input.snp_vcf}"
Some statistics script introduction documents
summary
This process has been proved to be feasible . Of course , The integrity of the process is not enough , such as snp Sites can also be screened , Or the indel It is also calculated . meanwhile , There are still some documents to be provided , such as config.yaml, fastp_stat.py, snp_frequency.py. Some documents can be found in previous articles , Some scripts I will upload the software needed for the whole process to github above . I hope you can test this process and give feedback on the use suggestions , Welcome to the group , When the QR code expires, you can add VX:bbplayer2021 , remarks Apply to join the student information exchange group .
边栏推荐
- Proteus catalog component names and Chinese English cross reference
- Final review -php learning notes 6- string processing
- Shell command, how much do you know?
- DXP software uses shortcut keys
- 期末复习-PHP学习笔记8-mysql数据库
- STM32 infrared communication 3 brief
- Arm debug interface (adiv5) analysis (I) introduction and implementation [continuous update]
- Thread network
- Variable storage unit and pointer
- Assembly learning register
猜你喜欢
Introduction notes to pytorch deep learning (XII) neural network - nonlinear activation
Final review -php learning notes 8-mysql database
C language implementation of chain stack (without leading node)
Video player (II): video decoding
STM32 register
Introduction to ecostruxure (1) IEC61499 new scheme
Cubemx completes STM32F103 dual serial port 485 transceiver transmission
Binary tree related operations (based on recursion, implemented in C language)
Basic knowledge of compiling learning records
線程池——C語言
随机推荐
Similarities and differences of differential signal, common mode signal and single ended signal (2022.2.14)
深度学习——LSTM
How to batch modify packaging for DXP schematic diagram
期末复习-PHP学习笔记5-PHP数组
Thread network
The most convenient serial port screen chip scheme designed at the charging pile in China
Mailbox application routine of running wild fire RT thread
期末复习-PHP学习笔记11-PHP-PDO数据库抽象层.
DXP software uses shortcut keys
Basic theory of four elements and its application
November 22, 2021 [reading notes] - bioinformatics and functional genomics (Chapter 5, section 4, hidden Markov model)
Final review -php learning notes 7-php and web page interaction
LabVIEW program code update is slow
Xiashuo think tank: 28 updates of the planet reported today (including the information of flirting with girls and Han Tuo on Valentine's day)
Deloitte: investment management industry outlook in 2022
Raspberry pie 4B Getting Started Guide
Program acceleration
Halcon: read the camera and binary it
Matter protocol
期末复习-PHP学习笔记7-PHP与web页面交互