当前位置:网站首页>July 30, 2021 [wgs/gwas] - whole genome analysis process (Part I)
July 30, 2021 [wgs/gwas] - whole genome analysis process (Part I)
2022-06-30 07:39:00 【Muyiqing】
Catalog
Abstract
After half a year , Finally put WGS The previous analysis uses snakemake Set up. . Don't let me be slow , There are really not many projects , The process is not particularly complicated . Previous shell Scripts can also be used , Therefore, it has not been really built yet . Now there are more and more projects , Considering the improvement of work efficiency , I did it a few days ago 2 individual WGS Project , Sort out the process .
Command line
#vim: set syntax=python
#__author__ = "Yang Xin"
#__copyright__ = "Copyright 2021, Wang lab"
#__email__ = "[email protected]"
#__license__ = "SAU"
"""
A WGS analysis workflow using trim_galore, BWA, Samtools and GATK.
"""
configfile:"config.yaml"
import yaml
rule all:
input:
expand("03.snp/{sample}_stat.csv", sample=config["samples"]),# Comparison group mutation output table
expand("{genome}.bwt", genome=config["reference"]),# Reference genome indexing ,bwa Comparison needs
expand("{genome}.fai", genome=config["reference"]),# Reference genome indexing ,GATK Comparison needs
expand("02.align/ref/{genome_profix}.dict", genome_profix=config["profix"]),# Reference genome indexing ,GATK Comparison needs
"01.data/data_quality_stat.txt",# Quality control statistics
#############################quality_control######################
rule trim_galore:
input:
R1 = "01.data/{sample}_1.fq.gz",
R2 = "01.data/{sample}_2.fq.gz"
output:
dir = directory("01.data/trim/{sample}"),# Output sample QC folder
R1 = "01.data/trim/{sample}/{sample}_1_val_1.fq.gz",
R2 = "01.data/trim/{sample}/{sample}_2_val_2.fq.gz",
threads:8
shell:
"trim_galore -j {threads} -q 25 --phred33 --length 36 -e 0.1 --stringency 3 --paired -o {output.dir} --fastqc {input.R1} {input.R2} "
rule fastp:
input:
R1 = "01.data/trim/{sample}/{sample}_1_val_1.fq.gz",
R2 = "01.data/trim/{sample}/{sample}_2_val_2.fq.gz"
output:
temp("01.data/trim/{sample}.json")# Generate temporary files .json, next rule Delete after use .
threads:16
shell:
"fastp -w {threads} -i {input.R1} -I {input.R2} -j {output} "
rule fastp_stat:
input:
expand("01.data/trim/{sample}.json",sample=config["samples"])
output:
"01.data/fastp_stat.txt"
script:
"scripts/fastp_stat.py"
# according to json Document statistics of quality control information of all samples , Previous articles have .
rule sort_fastp_stat:
input:
"01.data/fastp_stat.txt"
output:
"01.data/data_quality_stat.txt"
shell:
"sort 01.data/fastp_stat.txt > 01.data/data_quality_stat.txt;sed -i '1i Sample_ID Clean_reads Clean_bases ≥Q30 GC_Content' 01.data/data_quality_stat.txt;rm -rf 01.data/fastp_stat.txt"
# Format
#################################align#########################
rule bwa_index:
input:
genome = config["reference"]
output:
"{genome}.bwt",
shell:
"bwa index {input.genome}"
rule bwa_map:
input:
R1 = "01.data/trim/{sample}/{sample}_1_val_1.fq.gz",
R2 = "01.data/trim/{sample}/{sample}_2_val_2.fq.gz",
genome_index = expand("{genome}.bwt",genome=config["reference"])
output:
temp("02.align/{sample}_align.sam")
params:
genome = config["reference"],
threads:16
shell:
"bwa mem -R \'@RG\\tID:foo\\tSM:bar\\tLB:Abace\' -t {threads} {params.genome} {input.R1} {input.R2} > {output}"
# Be careful , Special symbols in double quotation marks (()''%& etc. ) There will be ambiguity , You need to use a backslash symbol to comment .
rule samtools_view:#sam2bam
input:
"02.align/{sample}_align.sam"
output:
temp("02.align/{sample}.tmp.bam")
shell:
"samtools view -bS {input} > {output} "
rule samtools_sort:# Sort
input:
"02.align/{sample}.tmp.bam"
output:
"02.align/{sample}.bam",
shell:
"samtools sort {input} -o {output} "
rule samtools_faidx:# Index
input:
genome = config["reference"]
output:
"{genome}.fai"
shell:
"samtools faidx {input.genome}"
########################GATK_filter##################
rule GATK_dict:# Index
input:
genome = config["reference"]
output:
dict = "02.align/ref/{genome_profix}.dict"
shell:
"gatk CreateSequenceDictionary -R {input.genome} -O {output.dict}"
rule GATK_sort:# Use GATK Yes bam File reorder , duplicate removal .
input:
sample = "02.align/{sample}.bam",
output:
sort_bam = temp("02.align/{sample}_sort.bam")
params:
genome = config["reference"],
shell:
"gatk SortSam -I {input.sample} -O {output.sort_bam} -R {params.genome} -SO coordinate --CREATE_INDEX true"
rule GATK_dupli:
input:
sample = "02.align/{sample}_sort.bam",
output:
dup_bam = temp("02.align/{sample}_sort_dup.bam")
params:
genome = config["reference"]
log:
metrics = "02.align/log/{sample}_markdup_metrics.txt"
shell:
"gatk MarkDuplicates -I {input.sample} -O {output.dup_bam} -M {log.metrics}"
rule samtools_index2:
input:
"02.align/{sample}_sort_dup.bam"
output:
"02.align/{sample}_sort_dup.bam.bai"
shell:
"samtools index {input}"
rule Calling:
input:
sample = "02.align/{sample}_sort_dup.bam",
index = "02.align/{sample}_sort_dup.bam.bai"
output:
sample_vcf = temp("03.snp/{sample}.g.vcf.gz")
params:
genome = config["reference"]
shell:
"gatk HaplotypeCaller -ERC GVCF -R {params.genome} -I {input.sample} -O {output.sample_vcf} --annotate-with-num-discovered-alleles true"
# Be careful -ERC Parameters , Species with small genomes may or may not be added , The big genome must be added , And use -L Note according to the chromosome number , Then merge .
rule GVCF2VCF:# Look for mutation sites , The results here include snp and indel
input:
sample = "03.snp/{sample}.g.vcf.gz",
output:
sample_vcf = "03.snp/{sample}.vcf"
params:
genome = config["reference"]
shell:
"gatk GenotypeGVCFs -R {params.genome} -V {input.sample} -O {output.sample_vcf} --annotate-with-num-discovered-alleles true"
rule SNP_calling:# select snp site
input:
sample = "03.snp/{sample}.vcf",
output:
sample_vcf = "03.snp/{sample}_snp.vcf"
params:
genome = config["reference"]
shell:
"gatk SelectVariants -R {params.genome} -V {input.sample} -O {output.sample_vcf} --select-type-to-include SNP"
rule VCF_stat:# Yes snp Statistics of loci
input:
snp_vcf = "03.snp/{sample}_snp.vcf"
output:
"03.snp/{sample}_stat.csv"
shell:
"python scripts/snp_frequency.py {input.snp_vcf}"
Some statistics script introduction documents
summary
This process has been proved to be feasible . Of course , The integrity of the process is not enough , such as snp Sites can also be screened , Or the indel It is also calculated . meanwhile , There are still some documents to be provided , such as config.yaml, fastp_stat.py, snp_frequency.py. Some documents can be found in previous articles , Some scripts I will upload the software needed for the whole process to github above . I hope you can test this process and give feedback on the use suggestions , Welcome to the group , When the QR code expires, you can add VX:bbplayer2021 , remarks Apply to join the student information exchange group .
边栏推荐
- Introduction notes to pytorch deep learning (XII) neural network - nonlinear activation
- Analysys analysis: online audio content consumption market analysis 2022
- ADC basic concepts
- Analysis of cross clock transmission in tinyriscv
- 1、 Output debugging information: makefile file debugging information $(warning "tests" $(mkfile\u path)); makefile file path
- Binary tree related operations (based on recursion, implemented in C language)
- C language - student achievement management system
- Minecraft 1.16.5 module development (50) guide book
- right four steps of SEIF SLAM
- Final review -php learning notes 7-php and web page interaction
猜你喜欢

Examen final - notes d'apprentissage PHP 3 - Déclaration de contrôle du processus PHP

Inversion Lemma

Minecraft 1.16.5模组开发(五十) 书籍词典 (Guide Book)

String application -- string violent matching (implemented in C language)

Introduction notes to pytorch deep learning (11) neural network pooling layer

Commands and permissions for directories and files
![2021-10-29 [microbiology] qiime2 sample pretreatment form automation script](/img/4d/3a3d645a27c3561c3ebe20dcd8e142.jpg)
2021-10-29 [microbiology] qiime2 sample pretreatment form automation script
![November 16, 2021 [reading notes] - macro genome analysis process](/img/c4/4c74ff1b4049f5532c871eb00d5ae7.jpg)
November 16, 2021 [reading notes] - macro genome analysis process
![November 19, 2021 [reading notes] a summary of common problems of sneakemake (Part 2)](/img/f8/ca1874eb999dc2bbb3c1392d0b72bc.jpg)
November 19, 2021 [reading notes] a summary of common problems of sneakemake (Part 2)

期末复习-PHP学习笔记1
随机推荐
25岁,从天坑行业提桶跑路,在经历千辛万苦转行程序员,属于我的春天终于来了
LabVIEW程序代码更新缓慢
Basic knowledge points
Matter protocol
Similarities and differences of differential signal, common mode signal and single ended signal (2022.2.14)
At the age of 25, I started to work in the Tiankeng industry with buckets. After going through a lot of hardships to become a programmer, my spring finally came
Shell command, how much do you know?
Account command and account authority
Intersection of two lines
Sublime text 3 configuring the C language running environment
2021 China Enterprise Cloud index insight Report
STM32 register
Local unloading traffic of 5g application
为什么大学毕业了还不知道干什么?
Three software installation methods
深度学习——词汇表征
Lodash filter collection using array of values
How to batch modify packaging for DXP schematic diagram
期末复习-PHP学习笔记4-PHP自定义函数
STM32 control LED lamp