orfipy is a tool written in Python/Cython to extract ORFs in an extremely fast and flexible manner

Overview


Introduction

orfipy is a tool written in Python/Cython to extract ORFs in an extremely fast and flexible manner. Other popular ORF searching tools are OrfM and getorf. Compared to OrfM and getorf, orfipy provides the most options to fine-tune ORF searches. orfipy uses multiple CPU cores and is particularly fast for data containing many smaller FASTA sequences, such as de novo transcriptome assemblies. For details, please read the paper cited below.

Please cite as: Urminder Singh, Eve Syrkin Wurtele, orfipy: a fast and flexible tool for extracting ORFs, Bioinformatics, 2021, btab090, https://doi.org/10.1093/bioinformatics/btab090

Installation

Install latest stable version

pip install orfipy

Or install via conda

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

conda create -n orfipy -c bioconda orfipy

Install the development version from source

git clone https://github.com/urmi-21/orfipy.git
cd orfipy
pip install .

or use pip

pip install git+https://github.com/urmi-21/orfipy.git

Examples

Details of the orfipy algorithm are in the paper. See the supplementary information (SI) for the differences between orfipy and other ORF finders, and for how to set orfipy parameters to match the output of other tools.

Below are some usage examples for orfipy

To see the full list of options, use the command:

orfipy -h

Input

orfipy version 0.0.3 and above supports sequences in Fasta/Fastq format (orfipy uses pyfastx). Input files can be gzip-compressed (.gz).

Extract ORFs and write their nucleotide sequences to the file orfs.fa

orfipy input.fasta --dna orfs.fa --min 10 --max 10000 --procs 4 --table 1 --outdir orfs_out

Use the standard codon table, but only ATG as a start codon

orfipy input.fa.gz --dna orfs.fa --start ATG

Note: Users can also provide their own translation table, as a .json file, to orfipy using the --table option. An example of a json file containing a valid translation table is here

See available codon tables

orfipy --show-table

Extract ORFs to a BED file

orfipy input.fasta --bed orfs.bed --min 50 --procs 4
or
orfipy input.fasta --min 50 --procs 4 > orfs.bed 
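The BED output can be post-processed with a few lines of Python. The following is a minimal sketch, assuming the six whitespace-separated columns (sequence name, start, end, ORF description, score, strand) seen in orfipy's BED output; `BedRecord` and `parse_bed_line` are hypothetical helper names, not part of orfipy, and sequence names containing whitespace would need a tab-only split instead.

```python
from collections import namedtuple

# Hypothetical record type for one BED line (not part of orfipy).
BedRecord = namedtuple("BedRecord", "chrom start end name score strand")


def parse_bed_line(line):
    """Parse one 6-column BED line into a BedRecord with integer coordinates."""
    chrom, start, end, name, score, strand = line.split()[:6]
    return BedRecord(chrom, int(start), int(end), name, score, strand)


# Sample lines in the format orfipy prints (tab-separated here).
sample = [
    "region_of_interest\t26\t938\tID=region_of_interest_ORF.3;ORF_type=complete;"
    "ORF_len=912;ORF_frame=3;Start:CTG;Stop:TAA\t0\t+",
    "region_of_interest\t323\t431\tID=region_of_interest_ORF.4;ORF_type=complete;"
    "ORF_len=108;ORF_frame=-2;Start:CTG;Stop:TGA\t0\t-",
]

records = [parse_bed_line(line) for line in sample]

# Keep forward-strand ORFs spanning at least 200 nucleotides.
long_forward = [r for r in records if r.strand == "+" and r.end - r.start >= 200]
print([(r.start, r.end) for r in long_forward])  # prints [(26, 938)]
```

To use it on a real file, replace `sample` with the lines of orfs.bed.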

Extract ORFs to a BED12 file

Note: Add --include-stop to make the orfipy output consistent with the .bed file produced by TransDecoder.Predict.

orfipy testseq.fa --min 100 --bed12 of.bed --partial-5 --partial-3 --include-stop

Extract ORF peptide sequences using the default translation table

orfipy input.fasta --pep orfs_peptides.fa --min 50 --procs 4

API

Users can directly import the ORF search algorithm, written in Cython, into their own Python code.

>>> import orfipy_core 
>>> seq='ATGCATGACTAGCATCAGCATCAGCAT'
>>> for start,stop,strand,description in orfipy_core.orfs(seq,minlen=3,maxlen=1000):
...     print(start,stop,strand,description)
... 
0 9 + ID=Seq_ORF.1;ORF_type=complete;ORF_len=9;ORF_frame=1;Start:ATG;Stop:TAG
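The description field is a single string that mixes `=` and `:` separators. A minimal sketch for turning it into a dictionary follows; `parse_description` is a hypothetical helper, not part of orfipy_core, and it assumes the field layout shown in the output above.

```python
def parse_description(description):
    """Split an ORF description string into a dict of str -> str."""
    fields = {}
    for item in description.split(";"):
        if not item:
            continue
        # Most fields use '=' (ID=..., ORF_len=...); Start and Stop use ':'.
        sep = "=" if "=" in item else ":"
        key, _, value = item.partition(sep)
        fields[key] = value
    return fields


desc = "ID=Seq_ORF.1;ORF_type=complete;ORF_len=9;ORF_frame=1;Start:ATG;Stop:TAG"
info = parse_description(desc)
print(info["ORF_type"], int(info["ORF_len"]), info["Start"], info["Stop"])
# prints: complete 9 ATG TAG
```

All values are kept as strings; convert numeric fields such as ORF_len and ORF_frame with int() where needed.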

The orfipy_core.orfs function accepts the following arguments (default values in square brackets)

  • seq: Required input sequence (str)
  • name ['Seq'] Name (str)
  • minlen [0] min length (int)
  • maxlen [1000000] max length (int)
  • strand ['b'] Strand to use, (b)oth, (f)wd or (r)ev (char)
  • starts [['TTG','CTG','ATG']] Start codons to use (list)
  • stops [['TAA','TAG','TGA']] Stop codons to use (list)
  • include_stop [False] Include stop codon in ORF (bool)
  • partial3 [False] Report ORFs without a stop (bool)
  • partial5 [False] Report ORFs without a start (bool)
  • between_stops [False] Report ORFs defined as between stops (bool)
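To make the meaning of starts, stops, minlen, maxlen and include_stop concrete, here is a naive pure-Python reference for the forward strand only. It is a sketch for illustration, not orfipy's Cython algorithm. `naive_forward_orfs` is a hypothetical name, and two behaviours are assumptions: one ORF is reported per stop codon, beginning at the first start codon after the previous stop in that frame, and length is measured after the stop codon is optionally included. With those assumptions it agrees with the API example output shown above.

```python
def naive_forward_orfs(seq, minlen=0, maxlen=1000000,
                       starts=("TTG", "CTG", "ATG"),
                       stops=("TAA", "TAG", "TGA"),
                       include_stop=False):
    """Yield (start, end, frame) for complete ORFs on the forward strand.

    Illustrative reference only; partial ORFs and the reverse strand
    are not handled.
    """
    seq = seq.upper()
    for frame in range(3):
        orf_start = None  # first start codon seen since the last stop
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if codon in stops:
                if orf_start is not None:
                    end = pos + 3 if include_stop else pos
                    if minlen <= end - orf_start <= maxlen:
                        yield orf_start, end, frame + 1
                orf_start = None
            elif orf_start is None and codon in starts:
                orf_start = pos


seq = "ATGCATGACTAGCATCAGCATCAGCAT"
print(list(naive_forward_orfs(seq, minlen=3, maxlen=1000)))
# prints [(0, 9, 1)]
print(list(naive_forward_orfs(seq, include_stop=True)))
# prints [(0, 12, 1)]
```

For real data, orfipy_core.orfs should be preferred; this sketch is only meant to show how the arguments interact.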

Comparison with getorf and OrfM

Comparison of orfipy features and performance with getorf and OrfM. Tools were run on different data, and ORFs were output to both nucleotide and peptide Fasta files (fasta), to peptide Fasta only (peptide), and to BED (bed). For details, see the publication and SI.

  • orfipy is the most flexible, and is particularly fast for data containing many smaller fasta sequences, such as de novo transcriptome assemblies or collections of microbial genomes.
  • OrfM is fast (faster for Fastq) and uses less memory, but its ORF search options are limited.
  • getorf is memory efficient but slower, and has no Fastq support. It provides some flexibility in ORF searches.

Funding

This work is funded in part by the National Science Foundation award IOS 1546858, "Orphan Genes: An Untapped Genetic Reservoir of Novel Traits". This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562 (Bridges HPC environment through allocations TG-MCB190098 and TG-MCB200123 awarded from XSEDE and HPC Consortium).

Comments
  • Compatibility with lower-case fasta sequences (A weird bug)


    Hello Urminder!

    Something strange happens to me when I try to run orfipy with a particular genome.

    At the end of the program, it does not return the predicted ORFs.

    $ orfipy cdiff.fasta
    orfipy version 0.0.3
    Using translation table: Standard (transl_table=1)
    start: ['TTG', 'CTG', 'ATG']
    stop: ['TAA', 'TAG', 'TGA']
    Setting chunk size 714 MB. Procs 45
    Logs will be saved to: orfipy_cdiff.fasta_out/orfipy_2021_03_04_13_16_06.031643.log
    Processing 8597268 bytes
    Processed 1 sequences in 0.39 seconds

    I tested using prodigal and it works without problems. This is the only genome with which I have this problem, and I can't understand why.

    Can you reproduce the error? Best regards! Enzo.

    cdiff.fasta.gz

    opened by EnzoAndree 7
  • cannot install from bioconda


    I created a new environment to install orfipy, but I still get a conflict error.

    Collecting package metadata (current_repodata.json): done
    Solving environment: failed with initial frozen solve. Retrying with flexible solve.
    Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
    Collecting package metadata (repodata.json): done
    Solving environment: failed with initial frozen solve. Retrying with flexible solve.
    Solving environment: |
    Found conflicts! Looking for incompatible packages.
    This can take several minutes.  Press CTRL-C to abort.
    failed
    
    UnsatisfiableError:
    
    opened by lijing28101 5
  • conda installation issue


    Hi,

    Small thing, when running orfipy on a gzip fastq file I ran into this:

    Traceback (most recent call last):
      File "/home/ben/e/orfipy-0.0.2/bin/orfipy", line 11, in <module>
        sys.exit(main())
      File "/home/ben/e/orfipy-0.0.2/lib/python3.8/site-packages/orfipy/__main__.py", line 338, in main
        orfipy.findorfs.main(infile,
      File "/home/ben/e/orfipy-0.0.2/lib/python3.8/site-packages/orfipy/findorfs.py", line 479, in main
        seqs = Fasta(infasta)
      File "/home/ben/e/orfipy-0.0.2/lib/python3.8/site-packages/pyfaidx/__init__.py", line 996, in __init__
        self.faidx = Faidx(
      File "/home/ben/e/orfipy-0.0.2/lib/python3.8/site-packages/pyfaidx/__init__.py", line 354, in __init__
        raise ImportError(
    ImportError: BioPython >= 1.73 must be installed to read block gzip files.
    

    I suppose this can be fixed by specifying a biopython version constraint in the conda definition? Thanks.

    opened by wwood 4
  • Overlap ORFs density threshold


    Hello, have you developed a way to allow a certain size of overlaps between ORFs in order to maximize the density of longest ORFs along scaffolds?

    Thank you very much for this software, by the way; it is very efficient and easy to use.

    All the best

    Ben

    opened by BenjaminGuinet 3
  • Not all ORFs found?


    I'm trying to understand how this tool works.

    Here is a sequence of ~1 kb length:

    region_of_interest GACTCGGTGCTATGTTCTGAATATTTCTGACTTGCATTTTTAATGGAGATAAAATGAAGCATTTAATACATGACGTAGATGAAGACATGAATGAAACTACAGACAAACTTAACTCTTCTCTCATTCTTCCTTTCAGTAAGGACTATGAGTTCTGTTCAAATGGCGTTTATTTCTATTGTGGAAAGATGGGTTCAGGTAAGACATTTAATTTAATTCGTCATATACTCATAACAGAACGTTTAGGAAATGACTCATATTATGACCAAATCATTATATCAGCAACTTCAGACTCTATGGACTCAACAGCGAAAACATTTATGTCAAAAGTTCAAGCCTCTGTCGTTAAAGTTCCAGACAGTGAACTCATTGAATTTCTTCAACGTTACATTCGACGTAAGAGGAAATATTATGCCATCGTTGAATTTATACAGTCAGGAATGCAAAAGACTTCTGAGGAGATGGAAAGAATTATTGACAAACACCACTTACGTCAGTACTCAGGAGTTTACGATATGAAACGACTGACAAACTACATTCTATCAAAACTTTCAAAATACCCCTTCAAAAAATATCCTTCAAACACTCTGCTCGTTTGCGACGACTTCGCTGGTAAAGGTTTAGTGTCAAAACCAGACTCACCATTAGCTAATATCATTACTAAAGTCAGACATTACCACTTAACTGTAGCAATACTTATGCAAACATGGAGGTTTTTAGCTTTAAACATAAAACGTCTCATAACTGACTTCGTTATCTTTCAAGGTTTCTCACGTTATGATATTGAACTCATTTGGAAACAGTCAGGTATAACATTACCTTTTGAAGAAATTTGGGAAGCATATAAGTCTCTCATCTCTCCTCGTTCATACCTTGAGATTCATATCATGACTAATACCATTAAAGTCAAAAATATTCCATGGGAACGACCAACATTGTTTTAAAGTTTAACCTTCAATTGACTGA

    In the IGV genome viewer, where ATG = green bar and stop codon = red, the forward sequence appears thus with 3-frame translation:

    igv_snapshot

    I note 10 start sites in frame 3, following the first STOP. Each of these, I thought, would constitute an alternate ORF, all ending in the same downstream STOP

    But the output of orfipy is:

    $ orfipy new_seq.fa --min 100 --max 100000 --procs 4 | sort -k2,2n
    orfipy version 0.0.4
    Using translation table: Standard (transl_table=1)
    start: ['TTG', 'CTG', 'ATG']
    stop: ['TAA', 'TAG', 'TGA']
    Setting chunk size 12053 MB. Procs 4
    Logs will be saved to: orfipy_new_seq.fa_out/orfipy_2021_09_01_15_35_34.526156.log
    Processed 1 sequences in 0.02 seconds

    region_of_interest 26 938 ID=region_of_interest_ORF.3;ORF_type=complete;ORF_len=912;ORF_frame=3;Start:CTG;Stop:TAA 0 +
    region_of_interest 90 210 ID=region_of_interest_ORF.1;ORF_type=complete;ORF_len=120;ORF_frame=1;Start:ATG;Stop:TAA 0 +
    region_of_interest 246 513 ID=region_of_interest_ORF.2;ORF_type=complete;ORF_len=267;ORF_frame=1;Start:ATG;Stop:TGA 0 +
    region_of_interest 323 431 ID=region_of_interest_ORF.4;ORF_type=complete;ORF_len=108;ORF_frame=-2;Start:CTG;Stop:TGA 0 -
    region_of_interest 532 667 ID=region_of_interest_ORF.5;ORF_type=complete;ORF_len=135;ORF_frame=-3;Start:CTG;Stop:TAG 0 -

    Can you explain why orfipy excluded so many potential ORFs here? And is there an option to force it to report them?

    opened by krabapple 3
  • Python 3.9 support using conda


    Hello Urminder!

    Amazing work! I didn't think that Cython could reach such speed. I will keep it in mind for my next projects.

    I wanted to report that conda fails to install orfipy when you have Python 3.9 installed. I strongly believe that orfipy should not have problems in Python 3.9.

    Do you plan to enable Python 3.9 support in the conda recipe?

    Best regards! Enzo.

    opened by EnzoAndree 3
  • ImportError: undefined symbol: PySlice_Adjustindices


    Hello @urmi-21 Thanks for the tool. I installed it successfully. However, it does not work. I get the following output for the command:- orfipy -h

    image

    Can you please suggest a solution?

    opened by VJ-Ulaganathan 1
  • Is there a tool to update gtf/gff file according to orfipy results?


    Hi,

    Is there a tool to update gtf/gff(generated by stringtie2 or scallop2) file according to orfipy results? Add splice sites, UTRs, CDSs to existing gtf/gff file.

    Best, Kun

    opened by xiekunwhy 1
  • Full length ORF


    Hi, I am using orfipy. It seems like a great tool for my recent work. Is it possible to get only full-length ORFs and not partial ORFs? Please let me know if this is possible.

    opened by apoosakkannu 1
  • Raises IndexError if no match along the specified strand is found


    You can reproduce the problem by running the following code

    import orfipy_core
    seq = '''GTATCGCTGGAGTCGGGTGATCTCCACGGAGACTCGAGTGGTCTCTTCTTGCCGGGAGCCGTCTTCGCCGGGGTTTCCTCTACCAGACCAAAGGGCTCTAGGACCCTCTTTTTGGCCTGGAAAACCGCCTTACCGAGGTTTCCGCCCCAAGACTTATCGTCCTGGAGCTTTTCCTGAAACTCGGAATCGGCGTGGTTGTACTTGAGGTAAGGATTATCCCCCGCCTCAAGTAGCTTGTTGTATTCGAGATCGTGCTCTCGCGCGACCTCGTCCGCCTTATTGACGGGCTGGCCTTTATCAAGGCCGTTGAAGGGTCCGAGGTATTTGTACCCCGGAACTACTAGACCGCGTTGGTCCTGTTTTTGCTGGTTAGCCTTAGGCCGAGGCGCACCTATGGGCGATGCACAACAGGGTTCCGACGGAGTGGGCAATGCCTCGGGAGATTGGCATTGCGATTCCCAGTGGATGGGCGACCGAGTCATCACCAAGTCCACCCGAACCTGGGTGCTGCCCAGCTACAACAACCACATCTACAAGGAAATCAACTCCACCGGCAACGGACTCAACGGCAGCGCCTACTTTGGATACAGTACTCCCTGGGGATATTTCGACTTTAACCGCTTCCACAGCCACTGGAGCCCCCGAGATTGGCAGCGACTCATCAACAACCACTGGGGCTTCAGACCCAAGGCCATGCACGTCAAAATCTTCAACATCCAAGTCAAAGAAGTCACCACCCAGGACCAGACCACCACCGTCGCCTACTTTGGATACAGTACTCCCTGGGGATATTTCGACTTTAACCGCTTCCAC'''
    orfipy_core.orfs(seq, starts=['ATG'])
    orfipy_core.orfs(seq, starts=['ATG'], strand='f')
    

    The following statement orfipy_core.orfs(seq, starts=['ATG']) runs without any errors. However, orfipy_core.orfs(seq, starts=['ATG'], strand='f') throws IndexError. Posting the stacktrace also

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "orfipy/orfipy_core.pyx", line 28, in orfipy_core.orfs
      File "orfipy/orfipy_core.pyx", line 46, in orfipy_core.orfs
    IndexError: list index out of range
    

    Kindly look into this. Thanks in advance :)

    opened by Prakash2403 1
  • Run Multiple Codon Table Numbers


    Hello, I'm using orfipy for viral detection. I have a range of codon table numbers I'd like to run. Looks like --table only takes a single integer, and the json file only accepts a single . Please correct me if I'm wrong about this! Would be convenient to add an option to search multiple codon tables at the same time. In its current state, I have to run orfipy multiple times and combine the results. Brett

    opened by brettyout 0
  • Suggestion: deterministic ORF IDs


    Right now ORFs are numbered, which causes problems on subsequent runs with slightly changed parameters such as a different minimum length. Maybe encoding the start and stop position would be a better approach so that the order doesn't affect anything. Thanks!

    enhancement 
    opened by Benjamin-Lee 0
  • Description dictionary


    Hello,

    I believe that the description of each ORF would be more accessible as a dictionary, instead of a string that is delimited with ;, :, and =. I am able to convert the string into a desirable dictionary

    {'ID': '1', 'ORF_type': 'complete', 'ORF_len': '912', 'ORF_frame': '1', 'Start': 'TTG', 'Stop': 'TAA'}
    

    through the following code

    import orfipy_core
    
    for start, stop, strand, description in orfipy_core.orfs(mers_sequence.upper()):
        descriptions = {}
        for info in description.split(';'):
            if '=' in info:
                info = info.split('=')
                name, content = info[0], info[1]
                if name == 'ID':
                    content = content.split('.')[1]
            else:
                info = info.split(':')
                name, content = info[0], info[1]
            descriptions[name] = content
    

    however, this functionality would be more conveniently integrated into the basic code of ORFIpy.

    Thank you, Andrew

    enhancement 
    opened by freiburgermsu 0
Releases(v0.0.4)
  • v0.0.4(Jul 20, 2021)

  • v0.0.3(Dec 31, 2020)

    Major changes

    • Switch to pyfastx from pyfaidx
    • Index free strategy to iterate over the inputs
    • Added support for Fastq and gzipped files
    • Better handle large sequences such as whole chromosomes
    • Added basic API for python users
    • Multiple refactors
    • Overall, improved performance
  • v0.0.2(Nov 7, 2020)

  • v0.0.1(Oct 15, 2020)

Owner
Urminder Singh
PhD candidate at Iowa State University