Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis

Overview


Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis. You write a high level configuration file specifying your inputs and analysis parameters. This input drives a parallel run that handles distributed execution, idempotent processing restarts and safe transactional steps. bcbio provides a shared community resource that handles the data processing component of sequencing analysis, providing researchers with more time to focus on the downstream biology.


Features

Quick start

  1. Install bcbio-nextgen with all tool dependencies and data files:

    wget https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/scripts/bcbio_nextgen_install.py
    python bcbio_nextgen_install.py /usr/local/share/bcbio --tooldir=/usr/local \
          --genomes hg38 --aligners bwa --aligners bowtie2

    producing an editable system configuration file referencing the installed software, data and system information.
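
    For reference, the resulting system configuration (bcbio_system.yaml) contains a resources section you can edit to match your hardware; a minimal sketch, with illustrative values:

    resources:
      default:
        memory: 3G
        cores: 16
        jvm_opts: ["-Xms750m", "-Xmx3500m"]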

  2. Automatically create a processing description of sample FASTQ and BAM files from your project, and a CSV file of sample metadata:

    bcbio_nextgen.py -w template freebayes-variant project1.csv sample1.bam sample2_1.fq sample2_2.fq

    This produces a sample description file containing pipeline configuration options.
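
    As an illustration, the generated project1.yaml contains one entry per sample; for the freebayes-variant template it looks roughly like this (paths, genome build and metadata here are placeholders):

    details:
      - description: sample1
        analysis: variant2
        files: [/path/to/sample1.bam]
        genome_build: hg38
        algorithm:
          aligner: bwa
          variantcaller: freebayes
        metadata:
          batch: batch1
    upload:
      dir: ../final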

  3. Run analysis, distributed across 8 local cores:

    cd project1/work
    bcbio_nextgen.py ../config/project1.yaml -n 8
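
    The same run can be distributed on a cluster scheduler instead of local cores, for example (the queue name is a placeholder for your site's queue):

    cd project1/work
    bcbio_nextgen.py ../config/project1.yaml -t ipython -s slurm -q your_queue -n 64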

Documentation

See the full documentation and longer analysis-based articles. We welcome enhancements or problem reports using GitHub and discussion on the biovalidation mailing list.

Contributors

License

The code is freely available under the MIT license.

Comments
  • torque is hanging indefinitely

    torque is hanging indefinitely

    Hey there -

    In trying the updates for #386 we have killed our development install with 756be0ac - any job we try to run, be it human, rat, mouse, or the broken dogs, hangs indefinitely with torque. The nodes get checked out and the engine and clients look to be running via qstat or showq - however nothing is happening on the nodes when I look at top or ps aux. There are plenty of free nodes so this doesn't seem to be a queue issue. The jobs all hang until they hit the timeout and that's all I get. I don't see anything in the logs/ipython logs - engines appear to have started successfully... I've rubbed my eyes and wiped my work dirs a few times to no avail. I checked and indeed running -t local works... Any suggestions or additional info I can provide?

    Thanks!

    opened by caddymob 67
  • Scalpel InDel calling support

    Scalpel InDel calling support

    Looks like vcf support has been added to Scalpel recently: http://sourceforge.net/p/scalpel/code/ci/master/tree/

    Opening this ticket while I'm looking into testing Scalpel and integrating it within bcbio, bear with me

    opened by mjafin 65
  • Problems with logs and joint VCF file generation in latest dev build

    Problems with logs and joint VCF file generation in latest dev build

    Hello,

    After upgrading to the latest development version, the logs and joint VCF file generation no longer seem to work properly. Debug messages don't get printed anymore (neither on stdout nor in the log file), and the bcbio-nextgen-debug.log file is pretty much identical to bcbio-nextgen.log. The only difference is the resource request messages, which appear in the debug log:

    [2018-08-02T10:10Z] Resource requests: bwa, sambamba, samtools; memory: 3.00, 3.00, 3.00; cores: 16, 16, 16
    [2018-08-02T10:10Z] Configuring 2 jobs to run, using 16 cores each with 48.1g of memory reserved for each job
    
    [2018-08-02T10:10Z] Resource requests: gatk, gatk-haplotype, picard; memory: 3.50, 3.00, 3.00; cores: 1, 16, 16
    [2018-08-02T10:10Z] Configuring 32 jobs to run, using 1 cores each with 3.50g of memory reserved for each job
    
    [2018-08-02T10:18Z] Resource requests: bcbio_variation, fastqc, gatk, gatk-vqsr, gemini, kraken, preseq, qsignature, sambamba, samtools; memory: 3.00, 3.00, 3.50, 3.00, 3.00, 3.00, 3.00, 3.00, 3.00, 3.00; cores: 16, 16, 1, 16, 16, 16, 16, 16, 16, 16
    [2018-08-02T10:18Z] Configuring 2 jobs to run, using 16 cores each with 56.1g of memory reserved for each job
    

    The multi-sample <batch>-gatk-haplotype-joint-annotated.vcf.gz did not get generated, even though the sample-specific VCF files are where they should be.

    Furthermore, bcbio-nextgen-commands.log is completely empty.

    To test all of this, I've run a simple variant calling job that worked flawlessly a few days ago, before upgrading Bcbio-nextgen.

    opened by amizeranschi 60
  • RFC / RFE: LOH analysis in tumor-normal samples

    RFC / RFE: LOH analysis in tumor-normal samples

    Given the interest in studies that involve tumor heterogeneity / subclonality, bcbio currently offers "out of the box" support for both somatic variants and CNVs. A useful metric that can be combined with these (and is already used by some tools, like CNVkit's plotting) is LOH, which (to my knowledge) is not yet handled.

    I admit I'm not sure if there is support already for this in bcbio. I know that back in the days I baked VarScan support to actually remove LOH calls from the VCF as they weren't truly somatic calls.

    The biggest problem here is how to actually and reliably extract this information. MuTect[2] might have these in the REJECTed calls (but how to distinguish them?), VarScan 2 calls them (they might just need to be moved away and elsewhere) and I'm not sure how FreeBayes and VarDict handle them.

    Or are there any other tools more suited for this purpose?

    I'm willing to put the money where my mouth is in this case as we're starting to explore this in my institution and having bcbio do that would greatly streamline things.
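
    For context, a tumor/normal batch that requests the CNV callers able to estimate LOH would be configured roughly like this (sample names and caller choice are illustrative; per the v1.2.1 notes further down, PureCN output is what bcbio later uses to build an LOH VCF):

    details:
      - description: patientX-tumor
        metadata: {batch: patientX, phenotype: tumor}
        algorithm:
          variantcaller: [mutect2]
          svcaller: [cnvkit, purecn, titancna]
      - description: patientX-normal
        metadata: {batch: patientX, phenotype: normal}
        algorithm:
          variantcaller: [mutect2]
          svcaller: [cnvkit, purecn, titancna]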

    opened by lbeltrame 59
  • Trio pipeline

    Trio pipeline

    @chapmanb

    1. I would like to run a trio analysis on whole exome samples. Can I use all callers (strelka2, deepvariant, vardict, gatk, etc.) for a trio analysis with samples having the same batch name? Can I use the ensemble method?

    2. I am also trying to do CNV analysis in this trio. Can I add all svcallers? Do all of them work with a single germline sample?

    It would also be nice to specify in the documentation:

    • Which callers can be used for Germline Variant Calling
    • Which callers can only be used for Somatic (Tumor-Normal) Variant Calling
    • Which callers can be used for Germline SV Calling
    • Which callers can only be used for Somatic (Tumor-Normal) SV Calling
    • Which callers can be used for Trio analysis
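
    For reference, a minimal sketch of the kind of trio batch being asked about, with several germline callers combined through ensemble calling (sample names, genome build and caller choice are illustrative):

    details:
      - description: proband
        files: [proband_R1.fq.gz, proband_R2.fq.gz]
        analysis: variant2
        genome_build: hg38
        metadata: {batch: fam01}
        algorithm:
          aligner: bwa
          variantcaller: [gatk-haplotype, strelka2, deepvariant]
          ensemble:
            numpass: 2
      # the mother and father entries repeat the same analysis, genome_build,
      # batch (fam01) and algorithm section with their own input files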

    opened by kokyriakidis 56
  • canfam3 dbSNP - ensembl 75

    canfam3 dbSNP - ensembl 75

    greetings! Can we add the canine dbSNP vcf to the variation resources in 9dcb447, please? I realize recalibration will not be available but getting rsIDs sure would be nice :)

    The vcf can be obtained here: ftp://ftp.ensembl.org/pub/release-75/variation/vcf/canis_familiaris/Canis_familiaris.vcf.gz

    Only thing is the canine genome for bcbio has "chr" prefixes on contigs where the dbSNP does not... I seem to recall you have an ensembl <--> ucsc conversion method from when we added the rn5 genome, so hoping this is easy without just awk'ing on a 'chr' :)

    Thanks!

    opened by caddymob 56
  • Incorrect CNVkit output

    Incorrect CNVkit output

    I’ve used CNVkit a few times, but for this particular sample the results state that everything is at a loss.

    This is the head of the T1.cns file produced by bcbio (I removed the gene column for clarity):

    chromosome | start | end | gene | log2 | baf | depth | probes | weight
    -- | -- | -- | -- | -- | -- | -- | -- | --
    chr1 | 10044 | 3783855 | removed | -1.86927 | 0.402385 | 3.47617 | 3006 | 404.754
    chr1 | 3786057 | 12808529 | removed | -2.93556 | 0.446113 | 1.56538 | 8470 | 1231.48
    chr1 | 12810479 | 14874986 | removed | -4.6436 | 0.433824 | 0.933951 | 1538 | 225.929
    chr1 | 14876201 | 16524732 | removed | -2.55335 | 0.439711 | 1.89716 | 1584 | 229.988
    chr1 | 16525961 | 16962159 | removed | -0.14769 | 0.263566 | 4.08855 | 297 | 42.2095
    chr1 | 16962525 | 46822896 | removed | -3.13869 | 0.444444 | 1.5238 | 26862 | 3916.87
    chr1 | 46824333 | 51700063 | removed | -4.49549 | 0.449153 | 0.982765 | 3712 | 547.092
    

    Note that all of the log2 values are quite negative (-call.cns is similar).

    and this is the result of running cnvkit manually:

    chromosome | start | end | gene | log2 | depth | probes | weight
    -- | -- | -- | -- | -- | -- | -- | --
    chr1 | 65409 | 7106529 | X | -0.06439 | 105.197 | 1521 | 522.478
    chr1 | 7107029 | 1.22E+08 | X |   |   |   |  
    chr1 | 1.22E+08 | 1.25E+08 | X | 1.19374 | 3.32145 | 16 | 7.01406
    chr1 | 1.44E+08 | 1.52E+08 | X |   |   |   |  
    chr1 | 1.52E+08 | 1.52E+08 | X | 0.595503 | 239.102 | 74 | 21.886
    chr1 | 1.52E+08 | 2.48E+08 | X |   |   |   |  
    chr1 | 2.48E+08 | 2.49E+08 | X | 0.304078 | 156.569 | 184 | 63.7607
    chr2 | 41359 | 93085490 | X |   |   |   |  
    chr2 | 94573375 | 1.79E+08 | X |   |   |   |  
    chr2 | 1.79E+08 | 1.79E+08 | X | 0.168907 | 216.203 | 540 | 173.011
    chr2 | 1.79E+08 | 2E+08 | X | 0.031598 | 92.5111 | 1717 | 642.207
    

    While some of the columns are 0, the results are much closer to accurate.

    This is the manual command, which I don't think is particularly unique; it uses the bcbio-generated BAM files:

    cnvkit.py batch final/T1/T1-ready.bam --normal final/N1/N1-ready.bam -p 8 --targets ../S04380110_Padded_hg38_trimmed.bed --fasta /mnt/biodata/genomes/Hsapiens/hg38/seq/hg38.fa --output-dir ./cnvkit/ --diagram --scatter

    opened by choosehappy 52
  • Using UMIs in the bcbio smallRNA pipeline

    Using UMIs in the bcbio smallRNA pipeline

    Hi,

    This is somewhat similar to #2070. We have single-end .fastq files with the following format:

    @NB500965:105:HC5J5BGX2:1:11108:16467:3587 1:N:0:ATCACG
    TTCAAGTAATCCAGGATAGGAACTGTAGGCACCATCAATGACACCGAACGTAGATCGGAAAGCACACGTCTGAACT
    +
    AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEAAEEE/EE

    where the bolded ATCACG = unique sample index and the bolded AACTGTAGGCACCATCAAT = 3' adapter

    Following the 3' adapter is a 12 nt UMI. If I massage the .fastq files so that they are in the format:

    @NB500965:105:HC5J5BGX2:1:11108:16467:3587 1:N:0:ATCACG:UMI_GACACCGAACGTAGA
    TTCAAGTAATCCAGGATAGGAACTGTAGGCACCATCAATGACACCGAACGTAGATCGGAAAGCACACGTCTGAACT
    +
    AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEAAEEE/EE
    

    am I then able to add umi_type: fastq_name to the bcbio .yaml config and run through the small RNA pipeline? Is there a better way of doing this?

    All advice gratefully received.
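
    For reference, the configuration described above would look roughly like this; whether umi_type: fastq_name is honored by the smallRNA-seq pipeline is exactly the open question (the adapter sequence is taken from the read above, the genome build is illustrative):

    details:
      - description: sample1
        files: [sample1_umi_in_name.fastq.gz]
        analysis: smallRNA-seq
        genome_build: hg38
        algorithm:
          adapters: [AACTGTAGGCACCATCAAT]
          umi_type: fastq_name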

    opened by mxhp75 51
  • AssertionError multisample joint calling

    AssertionError multisample joint calling

    Hi Brad and group,

    Recently we did a bcbio run with multisample joint calling and it's failing. When we do single-sample joint calling it works. Attached are the sample yaml and bcbio error files. It's complaining about coordinates, but I am not sure how it worked for a single sample.

    Attached are the bcbio err file and sample file.

    Thanks,

    bcbio.stderr.txt sample.yaml.txt

    opened by DiyaVaka 51
  • RFC: allele fraction thresholds for paired analyses

    RFC: allele fraction thresholds for paired analyses

    MuTect has a threshold setting (--tumor_f_pretest) to select sites with at least a certain fraction of non-REF alleles. Something similar is in VarScan (the minimum frequency to call an allele as heterozygote). MuTect has no preset; VarScan defaults to 0.1.

    I'm wondering if (hence the RFC) this could be handled in the algorithm parameters, or at least harmonized between the two callers. Selecting a proper "frequency" (quotes, because you can't really call it frequency when you have just a sample pair) is important for validation.

    Opinions? Pro, contra?
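
    For context, bcbio exposes a min_allele_fraction algorithm option (a percentage) for this kind of threshold; a minimal sketch of how it would sit in a paired configuration, with illustrative values:

    algorithm:
      variantcaller: [mutect2, vardict]
      min_allele_fraction: 5   # percent; the harmonized threshold discussed above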

    discussion 
    opened by lbeltrame 49
  • error in bcbio structural variant calling

    error in bcbio structural variant calling

    Hi Brad,

    Thanks for your help. I want to call structural variants, but get an error: parallel, svtyper, cnvnator_wrapper.py, cnvnator-multi and annotate_rd.py are not found in PATH, like this:

    [2014-10-27 23:05] Uncaught exception occurred
    Traceback (most recent call last):
      File "/public/software/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 20, in run
        _do_run(cmd, checks, log_stdout)
      File "/public/software/bcbio-nextgen/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 93, in _do_run
        raise subprocess.CalledProcessError(exitcode, error_msg)
    CalledProcessError: Command 'set -o pipefail; speedseq sv -v -B ......
    Sourcing executables from /public/software/bcbio-nextgen/tools/bin/speedseq.config
    ...
    which: no parallel in (/public/software/bcbio-nextgen/tools/bin:/public/software/bcbio-nextgen/anaconda/bin:.....)
    which: no svtyper in (/public/software/bcbio-nextgen/tools/bin:/public/software/bcbio....
    which: no cnvnator_wrapper.py in (/public/software/bcbio-nextgen/tools/bin:/public/software/bcbio....
    which: no cnvnator-multi in (/public/software/bcbio-nextgen/tools/bin:/public/software/bcbio-....
    which: no annotate_rd.py in (/public/software/bcbio-nextgen/tools/bin:/....)
    Calculating alignment stats...
    sambamba-view: (Broken pipe)
    Traceback (most recent call last):
      File "/public/software/bcbio-nextgen/tools/share/lumpy-sv/pairend_distro.py", line 12, in
        import numpy as np
    ImportError: No module named numpy

    How can I fix this, thanks again.

    Shangqian

    opened by shang-qian 47
  • recalibrate=true fails, Unsupported class file major version 55

    recalibrate=true fails, Unsupported class file major version 55

    Version info

    • bcbio version: 1.2.9
    • OS name and version: Ubuntu 18.04.5 LTS

    To Reproduce Exact bcbio command you have used:

    bcbio_nextgen.py ../config/config.yaml -n 8
    

    Your yaml configuration file:

    details:
    - algorithm:
        aligner: bwa
        exclude_regions: [lcr]
        mark_duplicates: true
        recalibrate: true
        variantcaller: [mutect2, strelka2, varscan, vardict]
        variant_regions: /media/gpudrive/apps/bcbio/genomes/Hsapiens/GRCh37/coverage/capture_regions/Exome-NGv3.bed
      analysis: variant2
      description: Patient70-normal
      files:
        - normal_1.fq.gz
        - normal_2.fq.gz
      genome_build: GRCh37
      metadata:
        batch: Patient70
        phenotype: normal
    - algorithm:
        aligner: bwa
        mark_duplicates: true
        recalibrate: true
        remove_lcr: true
        variantcaller: [mutect2, strelka2, varscan, vardict]
        variant_regions: /media/gpudrive/apps/bcbio/genomes/Hsapiens/GRCh37/coverage/capture_regions/Exome-NGv3.bed
      analysis: variant2
      description: Patient70-tumor
      files:
        - tumor_1.fq.gz
        - tumor_2.fq.gz
      genome_build: GRCh37
      metadata:
        batch: Patient70
        phenotype: tumor
    upload:
        dir: ../final
    

    Log files (could be found in work/log). Here are the important parts of the log, I guess:

    [2022-12-25T18:24Z] GATK: BaseRecalibratorSpark
    [2022-12-25T18:25Z] 18:25:59.390 INFO  BaseRecalibratorSpark - ------------------------------------------------------------
    [2022-12-25T18:25Z] 18:25:59.391 INFO  BaseRecalibratorSpark - The Genome Analysis Toolkit (GATK) v4.2.6.1
    [2022-12-25T18:25Z] 18:25:59.391 INFO  BaseRecalibratorSpark - For support and documentation go to https://software.broadinstitute.org/gatk/
    [2022-12-25T18:25Z] 18:25:59.391 INFO  BaseRecalibratorSpark - Executing as [email protected] on Linux v4.15.0-197-generic amd64
    [2022-12-25T18:25Z] 18:25:59.391 INFO  BaseRecalibratorSpark - Java runtime: OpenJDK 64-Bit Server VM v11.0.9.1-internal+0-adhoc..src
    [2022-12-25T18:25Z] 18:25:59.392 INFO  BaseRecalibratorSpark - Start Date/Time: December 25, 2022 at 6:25:03 PM UTC
    [2022-12-25T18:25Z] 18:25:59.392 INFO  BaseRecalibratorSpark - ------------------------------------------------------------
    [2022-12-25T18:25Z] 18:25:59.392 INFO  BaseRecalibratorSpark - ------------------------------------------------------------
    [2022-12-25T18:25Z] 18:25:59.393 INFO  BaseRecalibratorSpark - HTSJDK Version: 2.24.1
    [2022-12-25T18:25Z] 18:25:59.393 INFO  BaseRecalibratorSpark - Picard Version: 2.27.1
    [2022-12-25T18:25Z] 18:25:59.393 INFO  BaseRecalibratorSpark - Built for Spark Version: 2.4.5
    
    ...
    [2022-12-25T18:36Z] java.lang.IllegalArgumentException: Unsupported class file major version 55
    [2022-12-25T18:36Z]     at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:166)
    [2022-12-25T18:36Z]     at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:148)
    [2022-12-25T18:36Z]     at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:136)
    [2022-12-25T18:36Z]     at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:237)
    [2022-12-25T18:36Z]     at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:49)
    [2022-12-25T18:36Z]     at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:517)
    [2022-12-25T18:36Z]     at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:500)
    [2022-12-25T18:36Z]     at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
    [2022-12-25T18:36Z]     at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
    [2022-12-25T18:36Z]     at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
    [2022-12-25T18:36Z]     at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
    [2022-12-25T18:36Z]     at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
    [2022-12-25T18:36Z]     at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:134)
    [2022-12-25T18:36Z]     at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
    [2022-12-25T18:36Z]     at org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:500)
    [2022-12-25T18:36Z]     at org.apache.xbean.asm6.ClassReader.readCode(ClassReader.java:2175)
    [2022-12-25T18:36Z]     at org.apache.xbean.asm6.ClassReader.readMethod(ClassReader.java:1238)
    [2022-12-25T18:36Z]     at org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:631)
    [2022-12-25T18:36Z]     at org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:355)
    [2022-12-25T18:36Z]     at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:307)
    [2022-12-25T18:36Z]     at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:306)
    [2022-12-25T18:36Z]     at scala.collection.immutable.List.foreach(List.scala:392)
    [2022-12-25T18:36Z]     at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:306)
    [2022-12-25T18:36Z]     at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
    [2022-12-25T18:36Z]     at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
    [2022-12-25T18:36Z]     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2100)
    [2022-12-25T18:36Z]     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
    [2022-12-25T18:36Z]     at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:990)
    [2022-12-25T18:36Z]     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    [2022-12-25T18:36Z]     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    [2022-12-25T18:36Z]     at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
    [2022-12-25T18:36Z]     at org.apache.spark.rdd.RDD.collect(RDD.scala:989)
    [2022-12-25T18:36Z]     at org.apache.spark.RangePartitioner$.sketch(Partitioner.scala:309)
    [2022-12-25T18:36Z]     at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:171)
    [2022-12-25T18:36Z]     at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:151)
    [2022-12-25T18:36Z]     at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:62)
    [2022-12-25T18:36Z]     at org.apache.spark.rdd.OrderedRDDFunctions$$anonfun$sortByKey$1.apply(OrderedRDDFunctions.scala:61)
    [2022-12-25T18:36Z]     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    [2022-12-25T18:36Z]     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    [2022-12-25T18:36Z]     at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
    [2022-12-25T18:36Z]     at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:61)
    [2022-12-25T18:36Z]     at org.apache.spark.api.java.JavaPairRDD.sortByKey(JavaPairRDD.scala:936)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.utils.spark.SparkUtils.sortUsingElementsAsKeys(SparkUtils.java:165)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSink.sortSamRecordsToMatchHeader(ReadsSparkSink.java:207)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSink.writeReads(ReadsSparkSink.java:107)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.writeReads(GATKSparkTool.java:374)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.writeReads(GATKSparkTool.java:362)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.tools.spark.ApplyBQSRSpark.runTool(ApplyBQSRSpark.java:90)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:546)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:31)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    [2022-12-25T18:36Z]     at org.broadinstitute.hellbender.Main.main(Main.java:289)
    [2022-12-25T18:36Z] 22/12/25 18:36:31 INFO ShutdownHookManager: Shutdown hook called
    Using GATK jar /pathto/bcbio/anaconda/envs/java/share/gatk4-4.2.6.1-1/gatk-package-4.2.6.1-local.jar
    
    opened by asalimih 0
  • Add --cloudbiolinux argument

    Add --cloudbiolinux argument

    Fixes the following issue: https://github.com/bcbio/bcbio-nextgen/issues/3689

    The problem originated from this commit: https://github.com/bcbio/bcbio-nextgen/commit/d61e77825f46548101db9b64776269f8e96ee220

    opened by amizeranschi 0
  • [main_samview] fail to read the header from

    [main_samview] fail to read the header from "filename.sam".

    Hello, I am getting the following error when trying to run samtools in a sam file:

    [main_samview] fail to read the header from "20201032.sam".
    srun: error: node2-092: task 0: Exited with exit code 1

    But when I checked the sam file (using head) it does contain the headers, so what can be happening?

    @SQ SN:1 LN:278617202
    @SQ SN:2 LN:250202058
    @SQ SN:3 LN:226089100

    My script is as follows:

    #!/bin/bash

    #SBATCH --job-name=samtools
    #SBATCH --time=72:00:00
    #SBATCH --partition=serial
    #SBATCH --ntasks=1
    #SBATCH --mem-per-cpu=100GB
    #SBATCH [email protected]
    #SBATCH --mail-type=fail,end
    #SBATCH --error=%u.%J.err
    #SBATCH --output=%u.%J.out

    # load all modules needed for the current run
    module purge        # clean the current env
    module add slurm    # we always need this one

    # Activate the environment
    module add TOOLS python/miniconda-3.9
    module add bio/samtools/1.16.1/gcc/9.2.0
    source activate ngs-tools

    echo "Starting at `date`"
    echo "Running on hosts: $SLURM_NODELIST"
    echo "Current working directory is `pwd`"

    srun samtools view -bh 20201032.sam > SRR519926.bam
    samtools sort 20201032.bam > SRR519926.sorted.bam
    samtools index 20201032.sorted.bam

    # Save results and final clean up
    source deactivate

    echo "Finished at `date`"

    opened by gabyrudd22 0
  • Error with bcbio_setup_genome.py: AttributeError: 'Namespace' object has no attribute 'cloudbiolinux'

    Error with bcbio_setup_genome.py: AttributeError: 'Namespace' object has no attribute 'cloudbiolinux'

    Hi,

    I'm getting an error when trying to create a custom genome. Here's the command I'm running and the error it produces:

    $ bcbio_setup_genome.py -f GWHBDNW00000000.genome.fasta -g GWHBDNW00000000.gff --gff3 -i bwa seq -n GWHBDNW00000000 -b build1 --buildversion None
    Traceback (most recent call last):
      File "/data/share/bcbio_nextgen/anaconda/bin/bcbio_setup_genome.py", line 249, in <module>
        cbl = get_cloudbiolinux(args, REMOTES)
      File "/data/share/bcbio_nextgen/anaconda/lib/python3.7/site-packages/bcbio/install.py", line 807, in get_cloudbiolinux
        cloudbiolinux_remote = remotes["cloudbiolinux"] % args.cloudbiolinux
    AttributeError: 'Namespace' object has no attribute 'cloudbiolinux'
    
    
    opened by amizeranschi 0
  • Bringing back Docker support, possibly as a replacement for the various Conda environments

    Bringing back Docker support, possibly as a replacement for the various Conda environments

    Inspired by a recent comment from @gabeng, I wanted to ask if it would be a great deal of effort to bring back Docker support and the creation of new Bcbio Docker images.

    One alternative to reviving bcbio-nextgen-vm (although perhaps more laborious) could be to make it possible to replace the Conda environments with several Docker containers in bcbio-nextgen itself, as is done for example in nf-core/sarek. Given how often Conda has been breaking bcbio installs over the last couple of years, it could be worth the effort to replace it, or at least to offer Docker containers as an alternative. This could also pave the way for Kubernetes support at some point.

    Here's a list of the Docker images currently on my system, after a few variant calling experiments with the above pipeline:

    $ docker image ls
    REPOSITORY                                                                 TAG                                          IMAGE ID       CREATED         SIZE
    nfcore/snpeff                                                              5.1.R64-1-1                                  0462080aa43c   2 weeks ago     1.4GB
    nfcore/vep                                                                 106.1.R64-1-1                                e5c98f96ae89   2 weeks ago     1.22GB
    quay.io/biocontainers/mulled-v2-d9e7bad0f7fbc8f4458d5c3ab7ffaaf0235b59fb   551156018e5580fb94d44632dfafbc9c27005a0e-0   5703dbdd3100   2 weeks ago     1.01GB
    quay.io/biocontainers/mulled-v2-780d630a9bb6a0ff2e7b6f730906fd703e40e98f   3bdd798e4b9aed6d3e1aaa1596c913a3eeb865cb-0   c4f4a546ff1b   3 weeks ago     1.26GB
    quay.io/biocontainers/mulled-v2-fe8faa35dbf6dc65a0f7f5d4ea12e31a79f73e40   219b6c272b25e7e642ae3ff0bf0c5c81a5135ab4-0   a3d569a08aa5   3 weeks ago     133MB
    quay.io/biocontainers/gatk4                                                4.3.0.0--py36hdfd78af_0                      0f8cc7afc8e6   7 weeks ago     966MB
    quay.io/biocontainers/bcftools                                             1.16--hfe4b78e_1                             7ec55dde74af   8 weeks ago     198MB
    quay.io/biocontainers/samtools                                             1.16.1--h6899075_1                           09cd4486af55   8 weeks ago     62MB
    quay.io/biocontainers/freebayes                                            1.3.6--hbfe0e7f_2                            9c664cb1521f   2 months ago    326MB
    quay.io/biocontainers/tiddit                                               3.3.2--py310hc2b7f4b_0                       e9c7cf6b37d7   2 months ago    350MB
    quay.io/biocontainers/multiqc                                              1.13--pyhdfd78af_0                           747595fd0a8e   2 months ago    431MB
    google/deepvariant                                                         1.4.0                                        decb60cd33cb   6 months ago    5.72GB
    quay.io/biocontainers/sra-tools                                            2.11.0--pl5321ha49a11a_3                     58aa27074b50   9 months ago    379MB
    quay.io/biocontainers/mosdepth                                             0.3.3--hdfd78af_1                            14b81386a558   10 months ago   22.5MB
    quay.io/biocontainers/fastp                                                0.23.2--h79da9fb_0                           371123966d85   12 months ago   52MB
    quay.io/biocontainers/mulled-v2-5f89fe0cd045cb1d615630b9261a1d17943a9b6a   6a9ff0e76ec016c3d0d27e0c0d362339f2d787e6-0   8bb307eced25   14 months ago   387MB
    quay.io/biocontainers/python                                               3.9--1                                       34c2b9e3810c   17 months ago   191MB
    quay.io/biocontainers/cnvkit                                               0.9.9--pyhdfd78af_0                          65c84d95fbda   18 months ago   1.12GB
    quay.io/biocontainers/tabix                                                1.11--hdfd78af_0                             171149a492ea   19 months ago   94.3MB
    quay.io/biocontainers/manta                                                1.6.0--h9ee0642_1                            0be19048fb6e   20 months ago   200MB
    quay.io/bcbio/bcbio-vc                                                     latest                                       196407441ba3   23 months ago   5.89GB
    quay.io/biocontainers/gawk                                                 5.1.0                                        1f25a9f620a3   2 years ago     38.6MB
    quay.io/biocontainers/vcftools                                             0.1.16--he513fc3_4                           edbf7b8881c0   2 years ago     48MB
    quay.io/biocontainers/fastqc                                               0.11.9--0                                    9d444341a7b2   2 years ago     531MB
    quay.io/biocontainers/bwa                                                  0.7.17--hed695b0_7                           5c6028c4ea33   2 years ago     109MB
    
    opened by amizeranschi 0
  • installing from scratch using either conda or mamba fails on perl version?

    installing from scratch using either conda or mamba fails on perl version?

    Problem: installing from scratch on our HPC fails while trying to install a perl package:

    Encountered problems while solving:
      - package perl-encode-locale-1.05-pl526_6 requires perl >=5.26.2,<5.26.3.0a0, but none of the providers can be installed
    

    I have perl 5.26.3 as the default on the system, with 5.30.2 or 5.34.1 loadable as modules. Thanks for any advice in advance.

    Version info

    • bcbio version (bcbio_nextgen.py --version): 1.2.9
    • OS name and version (lsb_release -ds):Red Hat Enterprise Linux 8.4

    To Reproduce Exact bcbio command you have used:

    /usr/local/bin/python3
     ../bcbio_nextgen_install.py \
    	-u stable \
    	--datatarget variation \
    	--datatarget rnaseq \
    	--cores 32 \
    	--tooldir ${TOOLDIR} \
    	--isolate \
    	--distribution "centos" \
    	--mamba \
    	--genomes ${GENOMES} \
    	${DATADIR}
    

    Your yaml configuration file:

    
    

    Log files (could be found in work/log) Please attach (10MB max): bcbio-nextgen-commands.log, and bcbio-nextgen-debug.log.

    Checking required dependencies
    Installing isolated base python installation
    Installing mamba
    Collecting package metadata (current_repodata.json): done
    Solving environment: done
    
    # All requested packages already installed.
    
    Retrieving notices: ...working... done
    Installing conda-build
    
                      __    __    __    __
                     /  \  /  \  /  \  /  \
                    /    \/    \/    \/    \
    ███████████████/  /██/  /██/  /██/  /████████████████████████
                  /  / \   / \   / \   / \  \____
                 /  /   \_/   \_/   \_/   \    o \__,
                / _/                       \_____/  `
                |/
            ███╗   ███╗ █████╗ ███╗   ███╗██████╗  █████╗
            ████╗ ████║██╔══██╗████╗ ████║██╔══██╗██╔══██╗
            ██╔████╔██║███████║██╔████╔██║██████╔╝███████║
            ██║╚██╔╝██║██╔══██║██║╚██╔╝██║██╔══██╗██╔══██║
            ██║ ╚═╝ ██║██║  ██║██║ ╚═╝ ██║██████╔╝██║  ██║
            ╚═╝     ╚═╝╚═╝  ╚═╝╚═╝     ╚═╝╚═════╝ ╚═╝  ╚═╝
    
            mamba (0.24.0) supported by @QuantStack
    
            GitHub:  https://github.com/mamba-org/mamba
            Twitter: https://twitter.com/QuantStack
    
    █████████████████████████████████████████████████████████████
    
    
    Looking for: ['conda-build', 'mamba=0.24.0']
    
    r/noarch                                                      No change
    r/linux-64                                                    No change
    pkgs/r/linux-64                                               No change
    pkgs/r/noarch                                                 No change
    ursky/linux-64                                                No change
    ursky/noarch                                                  No change
    pkgs/main/noarch                                   817.9kB @ 901.1kB/s  0.3s
    bioconda/linux-64                                    4.5MB @   4.1MB/s  1.2s
    pkgs/main/linux-64                                   5.0MB @   3.9MB/s  1.3s
    conda-forge/noarch                                  10.3MB @   4.4MB/s  2.5s
    bioconda/noarch                                      4.1MB @   1.6MB/s  1.7s
    conda-forge/linux-64                                27.8MB @   4.0MB/s  7.3s
    
    Pinned packages:
      - python 3.7.*
    
    
    Transaction
    
      Prefix: /projects/0/lwc2020006/software/bcbio/1.2.9/data/anaconda
    
      All requested packages already installed
    
    Installing bcbio-nextgen
    --2022-11-21 14:05:48--  https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/requirements-conda.txt
    Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.109.133, ...
    Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 20 [text/plain]
    Saving to: ‘requirements-conda.txt’
    
    requirements-conda.txt       100%[============================================>]      20  --.-KB/s    in 0s
    
    2022-11-21 14:05:48 (2.23 MB/s) - ‘requirements-conda.txt’ saved [20/20]
    
    
                      __    __    __    __
                     /  \  /  \  /  \  /  \
                    /    \/    \/    \/    \
    ███████████████/  /██/  /██/  /██/  /████████████████████████
                  /  / \   / \   / \   / \  \____
                 /  /   \_/   \_/   \_/   \    o \__,
                / _/                       \_____/  `
                |/
            ███╗   ███╗ █████╗ ███╗   ███╗██████╗  █████╗
            ████╗ ████║██╔══██╗████╗ ████║██╔══██╗██╔══██╗
            ██╔████╔██║███████║██╔████╔██║██████╔╝███████║
            ██║╚██╔╝██║██╔══██║██║╚██╔╝██║██╔══██╗██╔══██║
            ██║ ╚═╝ ██║██║  ██║██║ ╚═╝ ██║██████╔╝██║  ██║
            ╚═╝     ╚═╝╚═╝  ╚═╝╚═╝     ╚═╝╚═════╝ ╚═╝  ╚═╝
    
            mamba (0.24.0) supported by @QuantStack
    
            GitHub:  https://github.com/mamba-org/mamba
            Twitter: https://twitter.com/QuantStack
    
    █████████████████████████████████████████████████████████████
    
    
    Looking for: ['bcbio-nextgen']
    
    r/linux-64                                                  Using cache
    r/noarch                                                    Using cache
    conda-forge/linux-64                                        Using cache
    conda-forge/noarch                                          Using cache
    bioconda/linux-64                                           Using cache
    bioconda/noarch                                             Using cache
    pkgs/main/linux-64                                          Using cache
    pkgs/main/noarch                                            Using cache
    pkgs/r/linux-64                                             Using cache
    pkgs/r/noarch                                               Using cache
    ursky/noarch                                                  No change
    ursky/linux-64                                                No change
    
    Pinned packages:
      - python 3.7.*
    
    
    Transaction
    
      Prefix: /projects/0/lwc2020006/software/bcbio/1.2.9/data/anaconda
    
      All requested packages already installed
    
    Collecting package metadata (current_repodata.json): done
    Solving environment: done
    
    # All requested packages already installed.
    
    Retrieving notices: ...working... done
    Installing data and third party dependencies
    Upgrading bcbio
    --2022-11-21 14:06:42--  https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/requirements-conda.txt
    Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.111.133, ...
    Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 20 [text/plain]
    Saving to: ‘bcbio-update-requirements.txt’
    
    bcbio-update-requirements.tx 100%[============================================>]      20  --.-KB/s    in 0s
    
    2022-11-21 14:06:43 (424 KB/s) - ‘bcbio-update-requirements.txt’ saved [20/20]
    
    Upgrade of bcbio-nextgen code complete.
    Upgrading third party tools to latest versions
    --2022-11-21 14:06:50--  https://github.com/chapmanb/cloudbiolinux/archive/master.tar.gz
    Resolving github.com (github.com)... 140.82.121.4
    Connecting to github.com (github.com)|140.82.121.4|:443... connected.
    HTTP request sent, awaiting response... 302 Found
    Location: https://codeload.github.com/chapmanb/cloudbiolinux/tar.gz/refs/heads/master [following]
    --2022-11-21 14:06:51--  https://codeload.github.com/chapmanb/cloudbiolinux/tar.gz/refs/heads/master
    Resolving codeload.github.com (codeload.github.com)... 140.82.121.9
    Connecting to codeload.github.com (codeload.github.com)|140.82.121.9|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: unspecified [application/x-gzip]
    Saving to: ‘STDOUT’
    
         0K ........ ........ ........ ........ ........ ........ 6.55M
      3072K ........ ........ ........ .......                    32.6M=0.5s
    
    2022-11-21 14:06:51 (9.56 MB/s) - written to stdout [5195149]
    
    Reading packages from /gpfs/work2/0/lwc2020006/software/bcbio/1.2.9/tmpbcbio-install/cloudbiolinux/contrib/flavor/ngs_pipeline_minimal/packages-conda.yaml
    Checking for problematic or migrated packages in default environment
    Initalling initial set of packages for default environment with mamba
    # Installing into conda environment default: age-metasv, atropos, bamtools, bamutil, bbmap, bcftools=1.13, bedops, bio-vcf, biobambam=2.0.87, bowtie, break-point-inspector, bwa, cage, cnvkit, coincbc, cramtools, deeptools, express, fastp, fastqc, geneimpacts, genesplicer, gffcompare, goleft, grabix, gsort, gsutil, gvcfgenotyper, h5py=3.3, hdf5=1.10, hisat2, hmmlearn, htseq, impute2, kallisto=0.46, kraken, ldc, macs2, maxentscan, mbuffer, minimap2, mintmap, mirdeep2, mirtop, moreutils, multiqc, multiqc-bcbio, ngs-disambiguate, novoalign, oncofuse, pandoc, parallel, pbgzip, peddy, pizzly, pythonpy, qsignature, rapmap, rtg-tools, sailfish, salmon, samblaster, samtools=1.13, scalpel, seq2c<2016, seqbuster, seqcluster, seqtk, sickle-trim, simple_sv_annotation, singlecell-barcodes, snap-aligner=1.0dev.97, snpeff=5.0, solvebio, spades, star=2.6.1d, stringtie, subread, survivor, tdrmapper, tophat-recondition, trim-galore, ucsc-bedgraphtobigwig, ucsc-bedtobigbed, ucsc-bigbedinfo, ucsc-bigbedsummary, ucsc-bigbedtobed, ucsc-bigwiginfo, ucsc-bigwigsummary, ucsc-bigwigtobedgraph, ucsc-bigwigtowig, ucsc-fatotwobit, ucsc-gtftogenepred, ucsc-liftover, ucsc-wigtobigwig, umis, vardict-java, vardict<=2015, variantbam, varscan, vcfanno, viennarna, vqsr_cnn, wham, ipyparallel=6.3.0, ipython-cluster-helper=0.6.4=py_0, ipython=7.29.0, ipython_genutils=0.2.0=py37_0, traitlets=4.3.3, anaconda-client, awscli, bzip2, ncurses, nodejs, p7zip, readline, s3gof3r, xz, perl-app-cpanminus, perl-archive-extract, perl-archive-zip, perl-bio-db-sam, perl-cgi, perl-dbi, perl-encode-locale, perl-file-fetch, perl-file-sharedir, perl-file-sharedir-install, perl-ipc-system-simple, perl-lwp-protocol-https, perl-lwp-simple, perl-sanger-cgp-battenberg, perl-statistics-descriptive, perl-time-hires, perl-vcftools-vcf, bioconductor-annotate, bioconductor-apeglm, bioconductor-biocgenerics, bioconductor-biocinstaller, bioconductor-biocstyle, bioconductor-biostrings, bioconductor-biovizbase, bioconductor-bsgenome.hsapiens.ucsc.hg19, bioconductor-bsgenome.hsapiens.ucsc.hg38, bioconductor-bubbletree, bioconductor-cn.mops, bioconductor-copynumber, bioconductor-degreport, bioconductor-deseq2, bioconductor-dexseq, bioconductor-dnacopy, bioconductor-genomeinfodb, bioconductor-genomeinfodbdata, bioconductor-genomeinfodbdata, bioconductor-genomicranges, bioconductor-iranges, bioconductor-limma, bioconductor-org.hs.eg.db, bioconductor-purecn>=2.0.1, bioconductor-rhdf5, bioconductor-rtracklayer, bioconductor-rtracklayer, bioconductor-summarizedexperiment, bioconductor-titancna, bioconductor-txdb.hsapiens.ucsc.hg19.knowngene, bioconductor-txdb.hsapiens.ucsc.hg38.knowngene, bioconductor-tximport, bioconductor-vsn, r-base=4.1.1=hb67fd72_0, r-chbutils, r-deconstructsigs, r-devtools, r-dplyr, r-dt, r-ggdendro, r-ggplot2, r-ggrepel, r-gplots, r-gsalib, r-janitor, r-knitr, r-optparse, r-pheatmap, r-plyr, r-pscbs, r-reshape, r-rmarkdown, r-rsqlite, r-sleuth, r-snow, r-stringi, r-tidyverse, r-viridis, r-wasabi, r=4.1=r41hd8ed1ab_1004, xorg-libxt
    Encountered problems while solving:
      - package perl-encode-locale-1.05-pl526_6 requires perl >=5.26.2,<5.26.3.0a0, but none of the providers can be installed
    
    Collecting package metadata (current_repodata.json): ...working... done
    Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.
    Collecting package metadata (repodata.json): ...working... done
    Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.
    Solving environment: ...working...
    
    
    
    
    
    
    Found conflicts! Looking for incompatible packages.
    This can take several minutes.  Press CTRL-C to abort.
    
    
    
    opened by LeoWelter 6
Releases(v1.2.9)
  • v1.2.9(Dec 15, 2021)

    • Fix vcf header bug: T/N SAMPLE lines are back - needed for import to SolveBio
    • add strandedness: auto for -l A option in salmon
    • report 10x more peaks in ChIP/ATAC-seq - use 0.05 qvalue
    • fix misleading RNA-seq duplicated reads statistics: thanks @sib-bcf
    • reorganize conda environments
    • snpEff 5.0
    • strandedness: auto
    • document WGBS pipeline steps
    • make --local an option, not default in bismark alignment - too slow
    • bcbioRNASeq update to 0.3.44
    • pureCN update to 2.0.1
    • octopus update to 0.7.4
    Source code(tar.gz)
    Source code(zip)
  • v1.2.8(Apr 14, 2021)

    • Set ENCODE library complexity flags properly for ChIP-seq. Thanks to @mistrm82.
    • Fix greylisted peaks not being propagated to the output directory. Thanks to @mistrm82.
    • Better error message when no sample barcodes are found for single-cell RNA-seq.
    • Better trimming for 2 wgbs kits
    • enable setting parameters for deduplicate_bismark
    • custom threading for bismark via yaml
    • reproducible WGBS user story with the data from Encode
    • While consensus peak calling, keep the highest scoring peak instead of calling the summit for the highest scoring peak and expanding the peak to 250 bases.
    • Enable consensus peak calling for broad peaks. Thanks to @mistrm82 and @yoonsquared for pointing out this was missing.
    • Re-enable ATAC-seq tests, they work now.
    • svprioritize for mm10
    • purecn_Dx.R - mutational signatures - still requires a manual update of deconstructsigs or release of it
    • make sure purecn uses sv_regions bed to call variants
    • fix misleading disambiguation fastqc read statistics (total, hg38, mm10)
    • wgbs: nebemseq kit: add --maxins 1000 and --local to bismark align
    • WGBS: sorted indexed deduplicated bam for ready.bam
    • print error message when aligner: false and hla typing is on
    • make sure that mark_duplicates is false with collapsed UMI input
    Source code(tar.gz)
    Source code(zip)
  • v1.2.7(Feb 23, 2021)

    • RNASeq: Add gene body coverage plots to multiqc report.
    • Restore ability to opt out of contamination checking via tools_off.
    • Properly invoke threading for verifybamid2.
    • Fix circular import issue when using bcbio functions outside of the main bcbio script.
    • Enable setting custom PureCN options via YAML file.
    Source code(tar.gz)
    Source code(zip)
  • v1.2.6(Feb 5, 2021)

    • RNASeq: Fail more gracefully if SummarizedExperiment object cannot be created.
    • Fixes to handle DRAGEN BAM files from the first stage of UMI processing.
    • Fix issue with double-annotating with dbSNP. Separating out somatic variant annotation into its own vcfanno configuration.
    Source code(tar.gz)
    Source code(zip)
  • v1.2.5(Jan 9, 2021)

    1.2.5 (01 January 2021)

    • Joint calling for RNA-seq variant calling requires setting jointcaller to bring it in line with the configuration options for variant calling.
    • Allow pre-aligned BAMs and gVCFs for RNA-seq joint variant calling. Thanks to @WimSpree for the feature.
    • Allow CollectSequencingArtifacts to be turned off via tools_off: [collectsequencingartifacts].
    • Fix getiterator -> iter deprecation in ElementTree. Thanks to @smoe.
    • Add SummarizedExperiment object from RNA-seq runs, a simplified version of the bcbioRNASeq object.
    • Add umi_type: dragen. This enables bcbio to run with first-pass, pre-consensus called UMI BAM files from DRAGEN.
    • Turn off inferential replicate loading when creating the gene x sample RNA-seq count matrix. This allows loading of thousands of RNA-seq samples.
    • Only make isoform to gene file from express if we have run express.
    • Allow "no consensus peaks found" as a valid endpoint of a ChIP-seq analysis.
    • Allow BCBIO_TEST_DIR environment variable to control where tests end up.
    • Collect OxoG and other sequencing artifacts due to damage.
    • Round tximport estimated counts.
    • Turn off consensus peak calling for broad peaks. Thanks to @lbeltrame and @LMannarino for diagnosing the broad-peaks-run-forever bug.
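
    A minimal sketch of how a few of the options mentioned above are expressed in a sample's algorithm section (purely illustrative; the option names are the ones from the notes):

    algorithm:
      umi_type: dragen                          # first-pass, pre-consensus DRAGEN UMI BAMs
      jointcaller: gatk-haplotype-joint         # RNA-seq joint calling now requires an explicit jointcaller
      tools_off: [collectsequencingartifacts]   # turn off CollectSequencingArtifacts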
    Source code(tar.gz)
    Source code(zip)
  • v1.2.4(Sep 21, 2020)

    1.2.4 (21 September 2020)

    • Remove deprecated --genomicsdb-use-vcf-codec option as this is now the default.
    • Add bismark output to MultiQC.
    • Fix PS genotype field from octopus to have the correct type.
    • Edit VarDict headers to report VCFv4.2, since htsjdk does not fully support VCFv4.3 yet.
    • Attempt to speed up bismark by implementing the parallelization strategy suggested here: https://github.com/FelixKrueger/Bismark/issues/96
    • Add --enumerate option to OptiType to report the top 10 calls and scores, to make it easier to decide how confident we are in a HLA call.
    • Performance improvements when HLA calling during panel sequencing. This skips running bwa-kit during the initial mapping for consensus UMI detection, greatly speeding up panel sequencing runs.
    • Allow custom options to be passed to featureCounts.
    • Fix race condition when running tests.
    • Add TOPMed as a datatarget.
    • Add predicted transcript and peptide output to arriba.
    • Add mm10 as a supported genome for arriba.
    • Skip bcbioRNASeq for more than 100 samples.
    • Add rRNA_pseudogene as a rRNA biotype.
    • Add --genomicsdb-use-vcf-codec when running GenotypeGVCF. See https://gatk.broadinstitute.org/hc/en-us/articles/360040509751-GenotypeGVCFs#--genomicsdb-use-vcf-codec for a discussion. Thanks to @amizeranschi for finding the issue and posting the solution.
    • update VEP to v100
    • Add consensus peak calling using https://bedops.readthedocs.io/en/latest/content/usage-examples/master-list.html to collapse overlapping peaks.
    • Pre-filter consensus peaks by removing peaks with FDR > 0.05 before performing consensus peak calling.
    • Add support for Qiagen's Qiaseq UPX 3' transcriptome kit for DGE. Support for 96 and 384 well configurations by specifying umi_type: qiagen-upx-96 or umi_type: qiagen-upx-384.
    • Add consensus peak counting using featureCounts.
    • Skip using autosomal-reference when calling ataqv for mouse/human, as this has a problem with ataqv (see https://github.com/ParkerLab/ataqv/issues/10 for discussion and followup).
    • Add pre-generated ataqv HTML report to upload directory.
    • Support single-end reads for ATAC-seq.
    • Move featureCount output files to featureCounts directory in project directory.
    • Remove RNA and reads in peak stats from MultiQC table when they are not calculated for a pipeline.
    • Only show somatic variant counts in the general stats table, if germline variants are calculated.
    • Add kit parameter for setting options for pipelines via just listing the kit. Currently only implemented for WGBS.
    Source code(tar.gz)
    Source code(zip)
  • v1.2.3(Apr 7, 2020)

  • v1.2.2(Apr 5, 2020)

    • Fix for not properly looking up R environment variables in the base environment.
    • Remove --use-new-qual-calculator which was eliminated in GATK 4.1.5.0.
    • Ensure header is not written for a Series. In pandas 0.24.0 the default for header was changed from False to True so we have to set it explicitly now.
    • Remove unused Dockerfile. Thanks to @matthdsm.
    • ATAC-seq: Skip peak-calling on fractions with < 1000 reads.
    Source code(tar.gz)
    Source code(zip)
  • v1.2.1(Mar 25, 2020)

    • Update ChIP and ATAC bowtie2 runs to use --very-sensitive.
    • Properly pad TSS BED file for ataqv TSS enrichment metrics.
    • Skip bcbioRNASeq if there are less than three samples.
    • Run joint-calling with single cores to save resources.
    • Re-support PureCN.
    • Skip segments with no informative SNPs when creating the LOH VCF file from PureCN output.
    • Fix for duplicated output for mosdepth in quality control report.
    • Fix for missing rRNA statistics.
    Source code(tar.gz)
    Source code(zip)
  • v1.2.0(Feb 7, 2020)

    • Fix for bismark not being a supported aligner.
    • Run ataqv (https://github.com/ParkerLab/ataqv) to calculate additional ATAC-seq quality control metrics.
    • Workaround for some bcbioRNASeq plots failing with many samples when interesting_groups is not set.
    • Add known_fusions parameter for passing in known fusions to arriba.
    • Fix for tx2gene not working properly on some GTF files.
    • Sort MACS2 output with UNIX sort to avoid memory issues.
    • Run RiP on full peak file for ATAC-seq.
    • Run ataqv on unfiltered BAM file with the full peak file.
    • Run peddy on the population variant file, not the individual sample level file if joint calling was done.
    • Add STAR to MultiQC metrics.
    • Throw an error if STAR is run on a genome with alts.
    • Don't run bcbioRNASeq if there is only one sample. Thanks to @kmendler for the suggestion.
    • Improve arriba sensitivity by setting --peOverlapNbasesMin 10 and --alignSplicedMateMapLminOverLmate 0.5 when running STAR (see https://github.com/suhrig/arriba/issues/41).
    • Make TPM and counts files from tximport automatically.
    • Use --keepDuplicates when making the Salmon index. This keeps transcripts that are identical in the index instead of randomly choosing one. This helps when comparing to other ways of quantifying the transcripts, ensuring all of the transcripts are represented.
    • Remove unnecessary "quant" subdirectory for Salmon runs. This allows MultiQC to properly name the samples.
    • Ensure STAR log file is propagated to the upload directory.
    • Fix issue with memory not being specified properly when running bcbio_prepare_samples.py.
    • Run tximport automatically and store TPM in project/date/tpm and counts in project/date/counts.
    • Calculate ENCODE quality flags for ATAC-seq. See https://www.encodeproject.org/data-standards/terms/#library for a description of what the metrics mean.
    • Fix for command line being too long while joint genotyping thousands of samples.
    • Fix for command line being too long when running the CWL workflow with cromwell.
    Source code(tar.gz)
    Source code(zip)
  • v1.1.9(Dec 6, 2019)

    • Fix for get VEP cache.
    • Support Picard's new syntax for ReorderSam (REFERENCE -> SEQUENCE_DICTIONARY).
    • Remove mitochondrial reads from ChIP/ATAC-seq calling.
    • Add documentation describing ATAC-seq outputs.
    • Add ENCODE library complexity metrics for ATAC/ChIP-seq to MultiQC report (see https://www.encodeproject.org/data-standards/terms/#library for a description of the metrics)
    • Add STAR sample-specific 2-pass. This helps assign a moderate number of reads per gene. Thanks to @naumenko-sa for the initial implementation and push to get this going.
    • Index transcriptomes only once for pseudo/quasi aligner tools. This fixes race conditions that can happen.
    • Add --buildversion option, for tracking which version of a gene build was used. This is used during bcbio_setup_genome.py. Suggested formats are source_version, so Ensembl_94, EnsemblMetazoa_25, FlyBase_26, etc.
    • Sort MACS2 bedgraph files before compressing. Thanks to @LMannarino for the suggestion.
    • Check for the reserved field sample in RNA-seq metadata and quit with a useful error message. Thanks to @marypiper for suggesting this.
    • Split ATAC-seq BAM files into nucleosome-free and mono/di/tri nucleosome files, so we can call peaks on them separately.
    • Call peaks on NF/MN/DN/TN regions separately for each caller during ATAC-seq.
    • Allow viral contamination to be assayed on non-tumor/normal samples.
    • Ensure EBV coverage is calculated when run on genomes with it included as a contig.
    Source code(tar.gz)
    Source code(zip)
  • v1.1.8(Oct 29, 2019)

    • Add antibody configuration option. Setting a specific antibody for ChIP-seq will use appropriate settings for that antibody. See the documentation for supported antibodies.
    • Add use_lowfreq_filter for forcing vardict to report variants with low allelic frequency, useful for calling somatic variants in panels with high coverage.
    • Fix for checking for pre-existing inputs with python3.
    • Add keep_duplicates option for ChIP/ATAC-seq which does not remove duplicates before peak calling. Defaults to False.
    • Add keep_multimappers for ChIP/ATAC-seq which does not remove multimappers before peak calling. Defaults to False.
    • Remove ethnicity as a required column in PED files.
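
    A minimal sketch of how the ChIP/ATAC-seq options above fit into an algorithm section (illustrative; the antibody value is just an example):

    algorithm:
      peakcaller: [macs2]
      antibody: h3k4me3
      keep_duplicates: true     # do not remove duplicates before peak calling
      keep_multimappers: true   # do not remove multimappers before peak calling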
    Source code(tar.gz)
    Source code(zip)
  • v1.1.7(Oct 11, 2019)

  • v1.1.6(Oct 10, 2019)

    • GATK ApplyBQSRSpark: avoid StreamClosed issue with GATK 4.1+
    • RNA-seq: fixes for cufflinks preparation due to python3 transition.
    • RNA-seq: output count tables from tximport for genes and transcripts. These are in bcbioRNASeq/results/date/genes/counts and bcbioRNASeq/results/date/transcripts/counts.
    • qualimap (RNA-seq): disable stranded mode for qualimap, as it gives incorrect results with the hisat2 aligner; RNA-seq runs now use unstranded mode.
    • Add quantify_genome_alignments option to use genome alignments to quantify with Salmon (see the configuration sketch after this list).
    • Add --validateMappings flag to Salmon read quantification mode.
    • The VEP cache is no longer installed during a bcbio run.
    • Add support for Salmon SA method when STAR alignments are not available (for hg38).
    • Add support for the new read model for filtering in Mutect2. This is experimental, and a little flaky, so it can optionally be turned on via: tools_on: mutect2_readmodel. Thanks to @lbeltrame for implementing this feature and doing a ton of work debugging.
    • Swap pandas from_csv call to read_csv.
    • Make STAR respect the transcriptome_gtf option.
    • Prefix regular expressions with r. Thanks to @smoe for finding all of these.
    • Add informative logging messages at beginning of bcbio run. Includes the version and the configuration files being used.
    • Swap samtools mpileup to use bcftools mpileup as samtools mpileup is being deprecated (https://github.com/samtools/samtools/releases/tag/1.9).
    • Ensure locale is set to one supporting UTF-8 bcbio-wide. This may need to get reverted if it introduces issues.
    • Added hg38 support for STAR. We did this by taking hg38 and removing the alts, decoys and HLA sequences.
    • Added support for the arriba fusion caller.
    • Added back missing programs from the version provenance file. Fixed formatting problems introduced by switch to python3.
    • Added initial support for whole genome bisulfite sequencing using bismark. Thanks to @hackdna for implementing this and @jnhutchinson for drafting the initial pipeline. This is a work in progress in collaboration with @gcampanella, who has a similar implementation with some extra features that we will be merging in soon.
    • qualimap for RNA-seq runs on the downsampled BAM files by default. Set tools_on: [qualimap_full] to run on the full BAM files.
    • Add STAR junction files to the files captured at the end of a run.
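    A minimal sketch of how some of the configuration options from this release might appear in a sample YAML; placement under details/algorithm follows bcbio's usual layout, and the paths and values are illustrative assumptions rather than text from the release notes:

      # Hypothetical RNA-seq sample entry touching options mentioned in v1.1.6.
      details:
        - analysis: RNA-seq
          algorithm:
            aligner: star
            transcriptome_gtf: /path/to/annotation.gtf  # example path; STAR now respects this option
            quantify_genome_alignments: true            # quantify with Salmon using genome alignments
            tools_on: [qualimap_full]                   # run qualimap on full rather than downsampled BAMs
        # Hypothetical variant-calling entry enabling the experimental Mutect2 read-model filter.
        - analysis: variant2
          algorithm:
            variantcaller: mutect2
            tools_on: [mutect2_readmodel]               # experimental read-model filtering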