
July 30, 2021 [wgs/gwas] - whole genome analysis process (Part I)

2022-06-30 07:39:00 Muyiqing

Abstract

After half a year, I have finally set up the upstream WGS analysis with snakemake. It is not that I was slow: there simply were not many projects, the pipeline is not particularly complicated, and the old shell scripts still worked, so I never got around to building it properly. Now that projects are coming in more and more, and with work efficiency in mind, I used the two WGS projects I ran a few days ago as an opportunity to organize the workflow.

Command line

#vim: set syntax=python

#__author__ = "Yang Xin"
#__copyright__ = "Copyright 2021, Wang lab"
#__email__ = "[email protected]"
#__license__ = "SAU"

"""
A WGS analysis workflow using trim_galore, BWA, Samtools and GATK.
"""

configfile:"config.yaml"

import yaml

rule all:
    input:
        expand("03.snp/{sample}_stat.csv", sample=config["samples"]),# Comparison group mutation output table 
        expand("{genome}.bwt", genome=config["reference"]),# Reference genome indexing ,bwa Comparison needs 
        expand("{genome}.fai", genome=config["reference"]),# Reference genome indexing ,GATK Comparison needs 
        expand("02.align/ref/{genome_profix}.dict", genome_profix=config["profix"]),# Reference genome indexing ,GATK Comparison needs 
        "01.data/data_quality_stat.txt",# Quality control statistics 
#############################quality_control######################

rule trim_galore:
    input:
        R1 = "01.data/{sample}_1.fq.gz",
        R2 = "01.data/{sample}_2.fq.gz"
    output:
        dir = directory("01.data/trim/{sample}"),  # per-sample QC/trimming output directory
        R1 = "01.data/trim/{sample}/{sample}_1_val_1.fq.gz",
        R2 = "01.data/trim/{sample}/{sample}_2_val_2.fq.gz",
    threads:8
    shell:
        "trim_galore -j {threads} -q 25 --phred33 --length 36 -e 0.1 --stringency 3 --paired -o {output.dir} --fastqc {input.R1} {input.R2} "

rule fastp:
    input:
        R1 = "01.data/trim/{sample}/{sample}_1_val_1.fq.gz",
        R2 = "01.data/trim/{sample}/{sample}_2_val_2.fq.gz"
    output:
        temp("01.data/trim/{sample}.json")# Generate temporary files .json, next rule Delete after use .
    threads:16
    shell:
        "fastp -w {threads} -i {input.R1} -I {input.R2} -j {output} "

rule fastp_stat:
    input:
        expand("01.data/trim/{sample}.json",sample=config["samples"])
    output:
        "01.data/fastp_stat.txt"
    script:
        "scripts/fastp_stat.py"
        # summarize the QC metrics of all samples from the fastp JSON reports; described in an earlier post

rule sort_fastp_stat:
    input:
        "01.data/fastp_stat.txt"
    output:
        "01.data/data_quality_stat.txt"
    shell:
        "sort 01.data/fastp_stat.txt > 01.data/data_quality_stat.txt;sed -i '1i Sample_ID Clean_reads Clean_bases ≥Q30 GC_Content' 01.data/data_quality_stat.txt;rm -rf 01.data/fastp_stat.txt"
        # Format 

#################################align#########################

rule bwa_index:
    input:
        genome = config["reference"]
    output:
        "{genome}.bwt",
    shell:
        "bwa index {input.genome}"

rule bwa_map:
    input:
        R1 = "01.data/trim/{sample}/{sample}_1_val_1.fq.gz",
        R2 = "01.data/trim/{sample}/{sample}_2_val_2.fq.gz",
        genome_index = expand("{genome}.bwt",genome=config["reference"])
    output:
        temp("02.align/{sample}_align.sam")
    params:
        genome = config["reference"],
    threads:16
    shell:
        "bwa mem -R \'@RG\\tID:foo\\tSM:bar\\tLB:Abace\' -t {threads} {params.genome} {input.R1} {input.R2} > {output}"
        # Note: special characters inside the double-quoted shell string (quotes, parentheses, %, &, etc.) can be misinterpreted by Python/Snakemake or the shell and need to be escaped with a backslash (literal braces are doubled instead).
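        # Illustrative example (not from the original post): Snakemake formats the shell
        # string with Python's str.format(), so literal braces must be doubled and quotes
        # escaped, e.g.
        #     shell: "awk \'{{print $1\"\\t\"$2}}\' {input} > {output}"
        # Here {{ }} reach the shell as { }, while {input}/{output} are substituted by Snakemake.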

rule samtools_view:  # convert SAM to BAM
    input:
        "02.align/{sample}_align.sam"
    output:
        temp("02.align/{sample}.tmp.bam")
    shell:
        "samtools view -bS {input} > {output} "

rule samtools_sort:  # sort the BAM
    input:
        "02.align/{sample}.tmp.bam"
    output:
        "02.align/{sample}.bam",
    shell:
        "samtools sort {input} -o {output} "


rule samtools_faidx:  # build the .fai index
    input:
        genome = config["reference"]
    output:
        "{genome}.fai"
    shell:
        "samtools faidx {input.genome}"

########################GATK_filter##################

rule GATK_dict:  # build the sequence dictionary
    input:
        genome = config["reference"]
    output:
        dict = "02.align/ref/{genome_profix}.dict"
    shell:
        "gatk CreateSequenceDictionary -R {input.genome} -O {output.dict}"

rule GATK_sort:  # re-sort the BAM with GATK before duplicate marking
    input:
        sample = "02.align/{sample}.bam",
    output:
        sort_bam = temp("02.align/{sample}_sort.bam")
    params:
        genome = config["reference"],
    shell:
        "gatk SortSam -I {input.sample} -O {output.sort_bam} -R {params.genome} -SO coordinate --CREATE_INDEX true"

rule GATK_dupli:
    input:
        sample = "02.align/{sample}_sort.bam",
    output:
        dup_bam = temp("02.align/{sample}_sort_dup.bam")
    params:
        genome = config["reference"]
    log:
        metrics = "02.align/log/{sample}_markdup_metrics.txt"
    shell:
        "gatk MarkDuplicates -I {input.sample} -O {output.dup_bam} -M {log.metrics}"

rule samtools_index2:
    input:
        "02.align/{sample}_sort_dup.bam"
    output:
        "02.align/{sample}_sort_dup.bam.bai"
    shell:
        "samtools index {input}"

rule Calling:
    input:
        sample = "02.align/{sample}_sort_dup.bam",
        index = "02.align/{sample}_sort_dup.bam.bai"
    output:
        sample_vcf = temp("03.snp/{sample}.g.vcf.gz")
    params:
        genome = config["reference"]
    shell:
        "gatk HaplotypeCaller -ERC GVCF -R {params.genome} -I {input.sample} -O {output.sample_vcf} --annotate-with-num-discovered-alleles true"
# Note the -ERC parameter: for species with small genomes it is optional, but for large genomes it should be used,
# together with -L to call each chromosome separately and then merge the results (see the commented sketch below).
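# A hedged sketch (not part of the original post) of that per-chromosome approach: scatter
# HaplotypeCaller over intervals with -L, then gather the per-chromosome GVCFs with MergeVcfs.
# It assumes a "chromosomes" list is added to config.yaml and would replace the Calling rule above.
#
# rule Calling_by_chrom:
#     input:
#         sample = "02.align/{sample}_sort_dup.bam",
#         index = "02.align/{sample}_sort_dup.bam.bai"
#     output:
#         temp("03.snp/{sample}.{chrom}.g.vcf.gz")
#     params:
#         genome = config["reference"]
#     shell:
#         "gatk HaplotypeCaller -ERC GVCF -R {params.genome} -I {input.sample} "
#         "-L {wildcards.chrom} -O {output}"
#
# rule Gather_chrom_gvcf:
#     input:
#         expand("03.snp/{{sample}}.{chrom}.g.vcf.gz", chrom=config["chromosomes"])
#     output:
#         "03.snp/{sample}.g.vcf.gz"
#     params:
#         gvcfs = lambda wildcards, input: " ".join("-I " + f for f in input)
#     shell:
#         "gatk MergeVcfs {params.gvcfs} -O {output}"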
rule GVCF2VCF:  # genotype the GVCF; the resulting VCF contains both SNPs and indels
    input:
        sample = "03.snp/{sample}.g.vcf.gz",
    output:
        sample_vcf = "03.snp/{sample}.vcf"
    params:
        genome = config["reference"]
    shell:        
        "gatk GenotypeGVCFs -R {params.genome} -V {input.sample} -O {output.sample_vcf} --annotate-with-num-discovered-alleles true"

rule SNP_calling:  # keep only SNP sites
    input:
        sample = "03.snp/{sample}.vcf",
    output:
        sample_vcf = "03.snp/{sample}_snp.vcf"
    params:
        genome = config["reference"]
    shell:                
        "gatk SelectVariants -R {params.genome} -V {input.sample} -O {output.sample_vcf} --select-type-to-include SNP"

rule VCF_stat:  # summarize the SNP sites
    input:
        snp_vcf = "03.snp/{sample}_snp.vcf"
    output:
        "03.snp/{sample}_stat.csv"
    shell:
        "python scripts/snp_frequency.py {input.snp_vcf}"


Statistics scripts referenced above (introduced in earlier posts)

fastp_stat.py
snp_frequency.py
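The original fastp_stat.py is only described in an earlier post. As a rough idea of what it does, here is a minimal reconstruction (my own sketch, not the author's script). It is executed through Snakemake's script: directive, reads the fastp JSON reports listed in snakemake.input, and writes one line per sample with the columns that the sort_fastp_stat rule later labels (Sample_ID, Clean_reads, Clean_bases, ≥Q30, GC_Content). It assumes fastp's standard summary.after_filtering JSON fields.

# scripts/fastp_stat.py -- minimal reconstruction for illustration
import json
import os

with open(snakemake.output[0], "w") as out:
    for path in snakemake.input:
        # sample name from the report file name, e.g. 01.data/trim/A01.json -> A01
        sample = os.path.basename(path).rsplit(".json", 1)[0]
        with open(path) as fh:
            stats = json.load(fh)["summary"]["after_filtering"]
        out.write("{} {} {} {:.2%} {:.2%}\n".format(
            sample,
            stats["total_reads"],
            stats["total_bases"],
            stats["q30_rate"],     # fraction of bases with quality >= Q30
            stats["gc_content"],   # GC fraction after filtering
        ))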

Summary

This workflow has been shown to work in practice. Of course, it is not yet complete: for example, the SNP sites could be filtered further, and indels could also be counted. In addition, a few files still need to be provided, such as config.yaml, fastp_stat.py and snp_frequency.py. Some of them can be found in earlier posts, and I will upload the scripts and the software needed for the whole pipeline to GitHub. I hope you will test this workflow and send feedback and suggestions. You are welcome to join the group; if the QR code has expired, add WeChat bbplayer2021 and mention that you are applying to join the student information exchange group.
[Image: exchange-group QR code]
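As a rough illustration of the SNP filtering mentioned in the summary, a hard-filtering rule could be appended along the following lines. This is a sketch, not part of the original pipeline, and the thresholds are the commonly cited GATK hard-filter defaults rather than values chosen by the author:

rule SNP_filter:
    input:
        sample = "03.snp/{sample}_snp.vcf"
    output:
        "03.snp/{sample}_snp_filtered.vcf"
    params:
        genome = config["reference"]
    shell:
        "gatk VariantFiltration -R {params.genome} -V {input.sample} -O {output} "
        "--filter-expression 'QD < 2.0 || FS > 60.0 || MQ < 40.0' --filter-name snp_hard_filter"

Sites failing the expression are only flagged in the FILTER column; they could then be removed with gatk SelectVariants --exclude-filtered.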

Original post by Muyiqing; please include the original link when reposting: https://yzsam.com/2022/02/202202160539308873.html