A tool for batch processing large fasta files and an accompanying metadata table for upload to repositories via API

Overview

Fasta Uploader

The Python script fasta_uploader.py splits large fasta files (e.g. 500 MB) and their related (one-to-one) comma- or tab-delimited sample contextual data into smaller batches of 1000 records, or another specified number. The batches can then be uploaded to a given sequence repository if an API endpoint is selected. Currently there is one option for the API interface: VirusSeq.

This tool is developed by the SFU Centre for Infectious Disease Epidemiology and One Health in conjunction with VirusSeq and it works well with DataHarmonizer!

Authors: Damion Dooley, Nithu Sara John

Details

Given a fasta file and a sample metadata file with a column that matches the fasta record identifiers, the script breaks both into corresponding sets of smaller batches of records, which are submitted to an API for processing.
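The pairing and batching described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the script's actual code (which relies on Biopython for fasta parsing). The names read_fasta and make_batches are hypothetical, and silently skipping fasta records that have no matching metadata row is an assumption made for brevity.

```python
import csv
from itertools import islice


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from a fasta-formatted text handle."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            # The identifier is the first whitespace-delimited token of the header.
            header = line[1:].split()
            identifier, chunks = (header[0] if header else ""), []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


def make_batches(fasta_handle, metadata_handle, key_field,
                 batch_size=1000, delimiter="\t"):
    """Yield lists of (identifier, sequence, metadata_row) tuples.

    Each list holds the records from one chunk of up to batch_size fasta
    entries. Records without a metadata row are skipped (an assumption).
    """
    reader = csv.DictReader(metadata_handle, delimiter=delimiter)
    rows = {row[key_field]: row for row in reader}
    records = read_fasta(fasta_handle)
    while True:
        chunk = list(islice(records, batch_size))
        if not chunk:
            break
        yield [(seq_id, seq, rows[seq_id])
               for seq_id, seq in chunk if seq_id in rows]
```

Passing two open file handles and the key column name, for example make_batches(open("consensus_final.fasta"), open("final set 1.csv"), "fasta header name", delimiter=","), would yield one list per batch, which could then be written out to batch files.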

Processing takes three steps:

  1. Construct batches of files. Both input files are read and parsed in one go, so once parsing succeeds, batch file generation is reliable and needs no further error reporting.

    1. Importantly, when rerunning fasta_uploader.py, this step is skipped if batch files already exist, unless the -r --reset parameter is given. Currently the input files are still required in this case.
  2. If the API option is included, submit each *.queued.fasta batch to the API, wait for it to finish or error out (capturing the error report), and proceed to the next batch.

    1. Some types of error trigger an immediate exit via sys.exit(), because they would also occur in subsequent API batch calls. For example, missing tabular data column names will trigger an exit. Once resolved, rerun with -r to force regeneration of output files.
    2. There is an option to try submitting just one of the batches, e.g. the first one, via the "-n 0" parameter. This allows error debugging of only that batch. Once error patterns are determined, the fixes that also apply to the remaining source contextual data can be made there, the first batch can be removed from the source fasta and contextual data files, and the whole batching can be redone using -r (reset) or by manually deleting the output files and rerunning.
  3. The processing status of existing API requests is reported from the API server end. Some requests may be queued by the API server, others may have been processed successfully, and others may have line-by-line errors in field content. fasta_uploader.py converts the latter into new [output file batch.#].queued.fasta and [output batch.#].queued.tsv files, which can be edited and then submitted back to the API by rerunning the program with the same command line parameters.
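The control flow of step 2 can be sketched as below. This is an illustration under stated assumptions rather than the script's own implementation: submit is a hypothetical callable standing in for the real API request, assumed to return a list of per-record error messages (empty on success), and FatalSubmissionError is a hypothetical exception standing in for the sys.exit() path taken on errors that would recur in every batch.

```python
class FatalSubmissionError(Exception):
    """Hypothetical: an error that would recur in every batch (e.g. a missing column)."""


def submit_batches(batches, submit, only_batch=None):
    """Submit batches in order and return {batch_number: [error messages]}.

    An empty list means the batch was accepted. If only_batch is given,
    only that batch number is submitted (mirroring the "-n" option).
    A FatalSubmissionError raised by submit is not caught here, so it
    stops the whole run before any later batch is attempted.
    """
    report = {}
    for number, batch in enumerate(batches):
        if only_batch is not None and number != only_batch:
            continue
        report[number] = list(submit(batch))
    return report
```

Batches whose report entry is non-empty correspond to the ones that would be written back out as new queued files for editing and resubmission.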

Requires Biopython and Requests modules

  • "pip install biopython"
  • "pip install requests"

Usage

Run the command in a folder with the appropriate input files, and output files can be generated there too. Rerun it in the same folder to incrementally fix any submission errors and then restart submission.

python fasta_uploader.py [options]

Options:

-h, --help
  Show this help message and exit.
-f FASTA_FILE, --fasta=FASTA_FILE
  Provide a fasta file name.
-m METADATA_FILE, --metadata=METADATA_FILE
  Provide the name of a comma-delimited (.csv) or tab-delimited (.tsv) sample contextual data file.
-b BATCH, --batch=BATCH
  Provide number of fasta records to include in each batch. Default is 1000.
-o OUTPUT_FILE, --output=OUTPUT_FILE
  Provide an output file name/path.
-k KEY_FIELD, --key=KEY_FIELD
  Provide the metadata field name to match to fasta record identifier.
-n BATCH_NUMBER, --number=BATCH_NUMBER
  Submit only the given batch number to the API instead of all batches.

Parameters involved in optional API call:

-a API, --api=API
  Provide the target API to send data to. A batch submission job will be initiated for it. Default is "VirusSeq_Portal".
-u API_TOKEN, --user=API_TOKEN
  An API user token is required for API access.
-d, --dev
  Test against a development server rather than the live one. Provide an API endpoint URL.
-s, --short
  Report up to the given number of fasta record related errors for each batch submission. Useful for dealing with repeated errors first, based on their first instance.
-r, --reset
  Regenerate all batch files and begin the API resubmission process even if batch files already exist under the given output file pattern.
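For readers who want to see how the file and batching options above fit together, here is a rough argparse sketch. It is not the script's actual parser, and the dest names (fasta_file, batch_number, and so on) are hypothetical. The API-related options are left out because the help text above does not make clear which of them take a value.

```python
import argparse


def build_parser():
    """Build a parser mirroring the file and batching options listed above."""
    parser = argparse.ArgumentParser(prog="fasta_uploader.py")
    parser.add_argument("-f", "--fasta", dest="fasta_file", metavar="FASTA_FILE",
                        help="Provide a fasta file name.")
    parser.add_argument("-m", "--metadata", dest="metadata_file", metavar="METADATA_FILE",
                        help="Provide a .csv or .tsv sample contextual data file name.")
    parser.add_argument("-b", "--batch", dest="batch", type=int, default=1000,
                        metavar="BATCH",
                        help="Number of fasta records per batch. Default is 1000.")
    parser.add_argument("-o", "--output", dest="output_file", metavar="OUTPUT_FILE",
                        help="Provide an output file name/path.")
    parser.add_argument("-k", "--key", dest="key_field", metavar="KEY_FIELD",
                        help="Metadata field name to match to fasta record identifier.")
    parser.add_argument("-n", "--number", dest="batch_number", type=int, default=None,
                        metavar="BATCH_NUMBER",
                        help="Submit only the given batch number.")
    parser.add_argument("-r", "--reset", dest="reset", action="store_true",
                        help="Regenerate all batch files even if they already exist.")
    return parser
```

Calling build_parser().parse_args(["-f", "consensus_final.fasta", "-n", "0"]) would give a namespace with batch left at its default of 1000 and batch_number set to 0.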

For example:

python fasta_uploader.py -f "consensus_final.fasta" -m "final set 1.csv" -k "fasta header name" -a VirusSeq_Portal -u ENTER_API_KEY_HERE

This will convert consensus_final.fasta and related final set 1.csv contextual data records into batches of 1000 records by default, and will begin submitting each batch to the VirusSeq portal.

python fasta_uploader.py -f "consensus_final.fasta" -m "final set 1.csv" -n 0 -k "fasta header name" -a VirusSeq_Portal -u ENTER_API_KEY_HERE

Like the above, but only the first batch is submitted so that one can see any errors and, if they apply to all batches, fix them in the original "final set 1.csv" file. Once batch 0 is fixed, all its records can be removed from the consensus_final.fasta and final set 1.csv files, and the whole job can be resubmitted.
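Removing the records of an already-fixed batch from the source files can be done by hand or with a small helper. The sketch below is one possible approach, not part of fasta_uploader.py; the function names are hypothetical, and it assumes the record identifier is the first whitespace-delimited token of each fasta header line.

```python
def drop_fasta_records(lines, submitted_ids):
    """Return the fasta lines of records whose identifier is not in submitted_ids."""
    kept, keep = [], True
    for line in lines:
        if line.startswith(">"):
            header = line[1:].split()
            # Decide once per record; the sequence lines that follow inherit it.
            keep = not header or header[0] not in submitted_ids
        if keep:
            kept.append(line)
    return kept


def drop_metadata_rows(rows, key_field, submitted_ids):
    """Return the metadata rows (dicts) whose key field is not in submitted_ids."""
    return [row for row in rows if row[key_field] not in submitted_ids]
```

The rows argument can come from csv.DictReader, and the filtered results can be written back with csv.DictWriter and plain file writes before rerunning the batching with -r.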

Owner
Centre for Infectious Disease and One Health
Hsiao Laboratory at Simon Fraser University