SeqLike - flexible biological sequence objects in Python

Last update: Dec 23, 2022

Overview

Introduction

A single object API that makes working with biological sequences in Python more ergonomic. It'll handle anything like a sequence.

Built around the Biopython SeqRecord class, SeqLikes abstract over the semantics of molecular biology (DNA -> RNA -> AA) and data structures (strings, Seqs, SeqRecords, numerical encodings) to allow manipulation of a biological sequence at the level which is most computationally convenient.

Code samples and examples

Build data-type agnostic functions

def f(seq: SeqLikeType, *args):
	seq = SeqLike(seq, seq_type="nt").to_seqrecord()
	# ...

Streamline conversion to/from ML friendly representations

prediction = model(aaSeqLike('MSKGEELFTG').to_onehot())
new_seq = ntSeqLike(generative_model.sample(), alphabet="-ACGTUN")

Interconvert between AA and NT forms of a sequence

Back-translation is conveniently built-in!

s_nt = ntSeqLike("ATGTCTAAAGGTGAA")
s_nt[0:3] # ATG
s_nt.aa()[0:3] # MSK, nt->aa is well defined
s_nt.aa()[0:3].nt() # ATGTCTAAA, works because SeqLike now has both reps
s_nt[:-1].aa() # TypeError, len(s_nt) not a multiple of 3

s_aa = aaSeqLike("MSKGE")
s_aa.nt() # AttributeError, aa->nt is undefined w/o codon map
s_aa = aaSeqLike(s_aa, codon_map=random_codon_map)
s_aa.nt() # now works, backtranslated to e.g. ATGTCTAAAGGTGAA
s_aa[:1].nt() # ATG, codon_map is maintained
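
The asymmetry in the comments above (nt->aa always works, aa->nt needs a codon map) can be sketched in plain Python. This is only an illustration of the idea, not SeqLike's implementation: `CODON_TABLE`, `translate`, `back_translate`, and `fixed_codon_map` are hypothetical names, and the table covers only the five codons used in the example.

```python
# Illustrative stand-in, not SeqLike's code.
# CODON_TABLE is deliberately tiny: just the codons from the example above.
CODON_TABLE = {"ATG": "M", "TCT": "S", "AAA": "K", "GGT": "G", "GAA": "E"}


def translate(nt: str) -> str:
    """NT -> AA is well defined: each codon maps to exactly one amino acid."""
    if len(nt) % 3 != 0:
        raise TypeError("sequence length is not a multiple of 3")
    return "".join(CODON_TABLE[nt[i:i + 3]] for i in range(0, len(nt), 3))


def back_translate(aa: str, codon_map=None) -> str:
    """AA -> NT is ambiguous, so the caller must supply the codon choice."""
    if codon_map is None:
        raise AttributeError("aa -> nt is undefined without a codon map")
    return "".join(codon_map[residue] for residue in aa)


# Hypothetical codon map: one fixed codon per amino acid.
fixed_codon_map = {aa: codon for codon, aa in CODON_TABLE.items()}

nt = "ATGTCTAAAGGTGAA"
aa = translate(nt)
round_trip = back_translate(aa, fixed_codon_map)
```

With a real genetic code several codons encode the same amino acid, which is why a codon map (random, fixed, or organism-specific) has to be chosen before back-translation makes sense.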

Easily plot multiple sequence alignments

seqs = [s for s in SeqIO.parse("file.fasta", "fasta")]
df = pd.DataFrame(
    {
        "names": [s.name for s in seqs],
        "seqs": [aaSeqLike(s) for s in seqs],
    }
)
df["aligned"] = df["seqs"].seq.align()
df["aligned"].seq.plot()

Flexibly build and parse numerical sequence representations

# Assume you have a dataframe with a column of 10 SeqLikes of length 90
df["seqs"].seq.to_onehot().shape # (10, 90, 23), padded if needed
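
To make the shape in the comment above concrete, here is a small pure-Python sketch of padded one-hot encoding. It assumes the three axes are (number of sequences, padded length, alphabet size); the 5-letter `ALPHABET`, the use of "-" as the padding symbol, and the `to_onehot` helper are assumptions made for this illustration and are not taken from SeqLike.

```python
from typing import List

# Hypothetical 5-letter alphabet; "-" doubles as the padding symbol here.
ALPHABET = "-ACGT"


def to_onehot(seq: str, length: int, alphabet: str = ALPHABET) -> List[List[int]]:
    """Right-pad seq with "-" up to `length`, then one-hot encode each position."""
    padded = seq.ljust(length, "-")
    index = {letter: i for i, letter in enumerate(alphabet)}
    return [
        [1 if index[letter] == i else 0 for i in range(len(alphabet))]
        for letter in padded
    ]


seqs = ["ACGT", "AC", "GGT"]
max_len = max(len(s) for s in seqs)
batch = [to_onehot(s, max_len) for s in seqs]

# (number of sequences, padded length, alphabet size)
shape = (len(batch), len(batch[0]), len(batch[0][0]))
```

Padding every sequence to a common length is what lets a ragged set of sequences be stacked into one rectangular array.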

To see more in action, please check out the docs!

Getting Started

pip install seqlike

Authors

Support

Questions about usage should be posed on Stack Overflow with the #seqlike tag.
Bug reports and feature requests are managed using the GitHub issue tracker.

Contributors ✨

Thanks go to these wonderful people:

Nasos Dousis

andrew giessel

Max Wall

Eric Ma

Mihir Metkar

Marcus Caron

This project follows the all-contributors specification. Contributions of any kind welcome!

Comments

Mutation class #57
This PR is ready for review.

In this PR, we add in Mutation and MutationSet classes. The intent is to represent mutations made to a SeqLike. I've taken inspiration from multiple places, but the biggest one has been the discussion on #57.

The use cases for this Mutation class are primarily two-fold:

1. Adding them to a SeqLike returns another mutated SeqLike.

2. Subtracting (diffing) a SeqLike from another yields a MutationSet.

While working through use case 2, I noticed how it's a bit tricky:

adding a Mutation or MutationSet to a SeqLike results in a new SeqLike,

but subtracting the new SeqLike from the original SeqLike might not necessarily result in the original MutationSet,

yet adding the new MutationSet to the original SeqLike will give back a SeqLike identical to the new SeqLike.
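
One way this can happen is sketched below in plain Python. It is an illustration chosen for this write-up, not necessarily the case shown in the notebook: substitutions are modelled as hypothetical `(wild-type letter, 0-based position, mutant letter)` tuples, and `apply_mutations` and `diff` are stand-ins for adding and subtracting.

```python
from typing import List, Tuple

# A substitution is (wild-type letter, 0-based position, mutant letter).
Substitution = Tuple[str, int, str]


def apply_mutations(seq: str, muts: List[Substitution]) -> str:
    """Stand-in for `SeqLike + MutationSet`."""
    letters = list(seq)
    for wt, pos, mt in muts:
        if letters[pos] != wt:
            raise ValueError(f"expected {wt} at index {pos}, found {letters[pos]}")
        letters[pos] = mt
    return "".join(letters)


def diff(reference: str, other: str) -> List[Substitution]:
    """Stand-in for subtraction: positions where `other` differs from `reference`."""
    return [(a, i, b) for i, (a, b) in enumerate(zip(reference, other)) if a != b]


original = "MKAIL"
# The second entry swaps A for A, so it changes nothing in the sequence.
muts = [("K", 1, "R"), ("A", 2, "A")]

mutated = apply_mutations(original, muts)
recovered = diff(original, mutated)
reapplied = apply_mutations(original, recovered)
```

Here `recovered` has lost the silent `A2A` entry, so it differs from `muts`, yet applying it to the original still reproduces `mutated`.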

An example of this phenomenon is shown in a new notebook, docs/notebooks/mutations.ipynb. For reviewing purposes, that is probably the go-to notebook for understanding what Mutation and MutationSet classes can do; the rest are implementation details.

Would love to get feedback on this PR -- especially if there are semantics that I haven't yet thought of.

TODO list of what's left:

[x] FIX TEST: SeqLike classes have two more class methods. Therefore, 77 is correct. Reference here.

[x] Add .positions class method, which returns a list of positions to mutate.

[x] Switch out magical_parse() for __new__() under Mutation.
opened by ericmjl 6
[BUG] `seq_type` handling during instantiation

I was playing around with the library (it is really nice, great job!) and hit this bug: instantiation like SeqLike('ATCGATC') or SeqLike('ATCGATC', None) will fail. From the doc: 1) seq_type is supposedly optional but is required in the implementation (code) and 2) logic regarding seq_type==None only occurs when sequence is a SeqLike (code) as opposed to a native type, and thus seems not very useful? (There's a def determine__type() (code) that I think is intended for the job but it's not used.)

In addition, some general comments regarding dispatch; feel free to ignore them if they are out of context: def _construct_seqlike shares the same signature for alphabet and codon_map, and def determine__type_and_alphabet shares the same signature for seq_type and sequence. Thus it seems clearer to extract the only argument that varies? E.g. for the latter case we would only dispatch functions based on alphabet (or even directly use if alphabet==None: alphabet=determine_alphabet(_type, sequence) rather than dispatching). The current setup for these two functions seems to create duplicated logic and is error-prone (types for sequence here should be the same?)

Again, happy to file a PR for anything specific above.

opened by pagpires 5
noqa in documentation

Hi team, thanks for creating this tool, it looks really nice!

Just a question regarding #noqa: DAR201: it's generated in the doc in quite a few places (e.g. https://modernatx.github.io/seqlike/reference/seqlike/#seqlike.SeqLike.SeqLike.deepcopy--noqa-dar101). I've tested that it can successfully be omitted via wrapping with  (ref). Curious if the team likes the fix? Happy to submit an MR.

opened by pagpires 5
Adding Damien Farrell as a contributor

As mentioned by @ndousis in an earlier email thread, some of @dmnfarrell's code made it into this library. We would like to acknowledge @dmnfarrell as a contributor in the codebase.

@dmnfarrell is this something you would be amenable to? I would essentially ask the all-contributors bot to add your contribution in.

opened by ericmjl 4
Prepending a string to a SeqLike
Prepending a string to SeqLike results unexpectedly in an appended version:

In [1]: from seqlike import SeqLike

In [2]: "ACTG" + SeqLike("TTTT", "nt", id="test")
Out[2]: *** NT: SeqRecord(seq=Seq('TTTTACTG'), id='test', name='<unknown name>', description='<unknown description>', dbxrefs=[])

This should either (1) result in an error ("cannot prepend a string to a SeqLike"), or (2) yield a new SeqLike with the metadata of the parent SeqLike and the correctly-ordered sequence string ("ACTGTTTT"). If the latter, should this new SeqLike be renumbered?
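
For context on option (2), Python evaluates `"ACTG" + obj` by calling `obj.__radd__("ACTG")` when `str` cannot handle the addition itself. The toy classes below (`Tagged` and `BuggyTagged`, both hypothetical and unrelated to SeqLike's actual code) show how reusing `__add__` for `__radd__` produces exactly this kind of reversed result. Whether that is the cause in SeqLike is an assumption, not something confirmed here.

```python
class Tagged:
    """Toy stand-in for a sequence object that carries an id."""

    def __init__(self, seq: str, id: str):
        self.seq = seq
        self.id = id

    def __add__(self, other):
        # self + other: `other` goes on the right.
        return Tagged(self.seq + str(other), self.id)

    def __radd__(self, other):
        # other + self: called when `other` (e.g. a str) cannot handle the
        # addition itself, so `other` must go on the LEFT.
        return Tagged(str(other) + self.seq, self.id)


class BuggyTagged(Tagged):
    # Reusing __add__ for __radd__ silently appends instead of prepending.
    __radd__ = Tagged.__add__


correct = "ACTG" + Tagged("TTTT", id="test")
buggy = "ACTG" + BuggyTagged("TTTT", id="test")
```

In this sketch `correct.seq` is the prepended string and keeps the parent's id, which corresponds to option (2); raising `TypeError` from `__radd__` would correspond to option (1).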
opened by ndousis 3

Feature/create sequence like

As discussed in #49. The naming here is because we inherit from the collections abstract base class of Sequence. Seqs are coupled to Biopython, Sequences are more generic.

Needs new tests and some documentation, just want to pause here for general feedback and workflow testing.

All current tests pass.

edit for usage:

In [1]: from seqlike.SequenceLike import SequenceLike

In [2]: s = SequenceLike(["abc", "abc", "qwe", "asd"])

In [3]: s
Out[3]: ['abc', 'abc', 'qwe', 'asd']

In [4]: s.to_index()
Out[4]: array([0., 0., 2., 1.])

In [5]: s.to_onehot()
Out[5]: 
array([[1., 0., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.]])

In [6]: s.alphabet
Out[6]: ['abc', 'asd', 'qwe']

In [9]: s2 = SequenceLike([0, 0, 2, 1], alphabet=["abc", "asd", "qwe"], encoding='index')

In [10]: s2
Out[10]: array(['abc', 'abc', 'qwe', 'asd'], dtype=object)
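
The index encoding in the session above can be reproduced with a few lines of plain Python, assuming the alphabet is inferred as the sorted set of unique items. The helpers `to_index` and `from_index` are hypothetical names for this sketch and return lists rather than NumPy arrays.

```python
from typing import Hashable, List, Optional, Sequence, Tuple


def to_index(
    items: Sequence[Hashable], alphabet: Optional[Sequence[Hashable]] = None
) -> Tuple[List[int], List[Hashable]]:
    """Map each item to its position in the alphabet (sorted unique items by default)."""
    alphabet = sorted(set(items)) if alphabet is None else list(alphabet)
    lookup = {symbol: i for i, symbol in enumerate(alphabet)}
    return [lookup[item] for item in items], alphabet


def from_index(indices: Sequence[int], alphabet: Sequence[Hashable]) -> List[Hashable]:
    """Inverse of to_index: look each index up in the alphabet."""
    return [alphabet[i] for i in indices]


items = ["abc", "abc", "qwe", "asd"]
indices, alphabet = to_index(items)
decoded = from_index([0, 0, 2, 1], ["abc", "asd", "qwe"])
```

Because the items only need to be hashable and sortable, the same encoding works for codons or other non-character symbols.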

opened by andrewgiessel 3

Dispatch based on SeqLikeType.__args__

Also a quick note: when dispatching for SeqLike, I guess we can use SeqLikeType.__args__ as the type instead of hardcoding a list of potential types, otherwise it's hard to keep them consistent (since I saw there's a plan to include torch.tensor)

Originally posted by @pagpires in https://github.com/modernatx/seqlike/issues/41#issuecomment-1000942209
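
A minimal sketch of the idea, using only the standard library. `SequenceInput` is a hypothetical stand-in for SeqLikeType (the real Union has different members), and `describe` is an illustrative function rather than part of SeqLike.

```python
from typing import Union, get_args

# Hypothetical stand-in for SeqLikeType; the real Union has different members.
SequenceInput = Union[str, bytes, list]


def describe(value) -> str:
    """Accept any member of the Union without hardcoding the list of types."""
    # get_args(SequenceInput) returns the same tuple as SequenceInput.__args__.
    if isinstance(value, get_args(SequenceInput)):
        return f"accepted {type(value).__name__}"
    raise TypeError(f"unsupported type: {type(value).__name__}")


members = get_args(SequenceInput)
result = describe("ACGT")
```

Adding a new member to the Union then updates the check automatically. One caveat: `isinstance` rejects parameterized generics such as `List[str]`, so this only works when every member of the Union is a plain class.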
good first issue help wanted

opened by ericmjl 3
Simplify .aa() interface, return original SeqRecord attributes

Removed **kwargs from SeqLike aa, and hard-coded arguments in the call to self.translate so as to return original attributes like id and name. This is the behavior I would expect from aa().

I left the interface to SeqLike translate as is to maintain flexibility.

Closes #25

opened by ndousis 3
Add missing setup.py deps, add notebook extras, and move test deps to extras
I had a couple import errors (see below) when testing this package out that are related to undeclared dependencies in setup.py. Some were in requirements.txt, but not declared in setup.py's install_requires, so I moved them all to setup.py and removed requirements.txt (so this doesn't happen again :crossed_fingers:).

I also added 2 extras:

seqlike[test]: this has the pytest* deps (which I removed from the main install_requires)

seqlike[notebook]: adds bokeh, which is (currently) required when using ipython/jupyter (I think there were some refactors to make bokeh a lazy import, but it is still imported eagerly/required in notebooks)
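
As a rough sketch of what this looks like in setup.py, the fragment below lists only packages named in this description; the remaining runtime dependencies and any version pins are not shown, and the variable names are illustrative.

```python
# Sketch of the dependency declarations; not the actual contents of setup.py.
INSTALL_REQUIRES = [
    "lazy_loader",  # one of the previously undeclared runtime dependencies
    # ... other runtime dependencies formerly in requirements.txt
]

EXTRAS_REQUIRE = {
    "test": ["pytest"],    # pytest and its plugins, removed from install_requires
    "notebook": ["bokeh"], # needed when running under ipython/jupyter
}

# In setup.py these would be passed to setuptools as:
# setup(..., install_requires=INSTALL_REQUIRES, extras_require=EXTRAS_REQUIRE)
```

An extra is then requested at install time with, for example, `pip install "seqlike[notebook]"`.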

Here are the errors I saw:

.../lib/python3.7/site-packages/seqlike/SeqLike.py in <module>
      9 from typing import Callable, Optional, Union
     10
---> 11 import lazy_loader as lazy
     12
     13 from Bio.Seq import Seq

ModuleNotFoundError: No module named 'lazy_loader'

.../lib/python3.7/site-packages/seqlike/draw_utils.py in <module>
     19 try:
     20     get_ipython
---> 21     from bokeh.io import output_notebook
     22
     23     output_notebook()

ModuleNotFoundError: No module named 'bokeh'
opened by JacobHayes 2
add commandline wrapper function for Muscle 3.8 #24
The wrapper function muscle_alignment permits alignment using Muscle 3.8:

from seqlike.alignment_commands import muscle_alignment
sequences.seq.align(aligner=muscle_alignment, muscle_arg1=something, muscle_arg2=something)

and addresses #24. All tests pass except tests/test_assets.py::test_free_mono_font_exists. Two notes:

the latest version of Muscle is v5.1 and has a different interface than v3.8; MuscleCommandline is compatible with v3.8.

the preserve_order parameter (preserves original sequence order, as aligner may try to group sequences by similarity) may still be buggy.
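
For illustration, one simple way to restore the input order after an aligner has regrouped the sequences is to look each aligned record up by its id. This is a sketch under the assumption that ids are unique; `restore_order` and the example data are hypothetical and do not reflect how preserve_order is implemented.

```python
from typing import List, Tuple


def restore_order(
    original_ids: List[str], aligned: List[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Reorder (id, aligned_sequence) pairs to match original_ids (ids must be unique)."""
    by_id = dict(aligned)
    return [(seq_id, by_id[seq_id]) for seq_id in original_ids]


original_ids = ["seq1", "seq2", "seq3"]
# Example aligner output, regrouped by similarity.
aligner_output = [("seq3", "AC-GT"), ("seq1", "ACCGT"), ("seq2", "AC-GA")]
restored = restore_order(original_ids, aligner_output)
```

Duplicate or missing ids would break this lookup, which is one plausible source of bugs in any order-preserving step.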
opened by ndousis 2
Support arbitrary alphabets
It'd be nice to support arbitrary alphabets for sequences that are not necessarily string-type. For example, we may want a sequence of codons, or a sequence of other entities.

Doing so would allow us to access the to_onehot() or to_index() capabilities of SeqLike objects without necessarily being bound to BioPython SeqRecord/Seq objects.

Potential challenges:

We would break the "default to SeqRecords pair" that we assume in SeqLike. A list of codons is neither!

We may need to rearchitect the SeqLike object such that there is a .sequence and .alphabet, which the encoder functions (the .to_*() functions) expect (?).

We may need a more generic SeqLike object from which our current SeqLikes inherit.

A good concrete first step here is to create an Abstract Base Class for discussion purposes.
enhancement high-priority
opened by ericmjl 2
Objects to represent mutations
I've encountered the situation where we need to represent mutations of a sequence. Having written essentially the same code over and over, I thought it might be good to talk about some of these ideas here.

I was thinking of something along the lines of two classes: a Mutation and a MutationSet. Defining them as classes allows for certain semantics:

s = SeqLike('MKAIL')
mut = Mutation('A', 2, 'C')  # mut's repr would look like A2C
mut2 = Mutation('K', 1, 'R')

We may consider 'addition' to be an application of mutations to a reference sequence:

s2 = s + mut  # s2 <-- SeqLike('MKCIL')
s2 = s + mut2  # s2 <-- SeqLike('MRAIL')

Mutations can also be offset by position:

mut3 = mut - 1 # mut3 <-- A1C, but would raise an error if wt sequence does not have A at index 1.

If we need to hold multiple mutations together, we might use a MutationSet:

mutations = MutationSet([mut, mut2])  # mutations <-- [K1R, A2C] (automatically sorted by position, then by letter)
s2 = s + mutations  # s2 <-- SeqLike('MRCIL')

MutationSets could also be offset by position:

mutations + 1 # would give us [K2R, A3C]

We could also consider subtraction of two SeqLikes to give us the 'diff' as a MutationSet:

mutset = s2 - s1  # (left is canonically considered the 'wt')
# mutset <-- [R1K, C2A]
mutset = s1 - s2
# mutset <-- [K1R, A2C]

I'm not sure what other semantics I might have missed here. Any thoughts?

UPDATE: I changed the position numbers above to reflect Python indexing rules, not canonical indexing. I am sure we could magically handle both, but IMO because SeqLike uses Python indexing rules, Mutations and MutationSets should also use Python indexing rules for positions.
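
The proposed semantics can be prototyped in a few lines of plain Python. The sketch below is an assumption-laden illustration, not the eventual SeqLike API: it works on plain strings instead of SeqLikes, uses 0-based positions as per the update above, and replaces MutationSet and `+` on sequences with the hypothetical helpers `sort_mutations` and `mutate`.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Mutation:
    wt: str        # expected wild-type letter
    position: int  # 0-based, following Python indexing
    mt: str        # replacement letter

    def __add__(self, offset: int) -> "Mutation":
        return Mutation(self.wt, self.position + offset, self.mt)

    def __sub__(self, offset: int) -> "Mutation":
        return self + (-offset)

    def __str__(self) -> str:
        return f"{self.wt}{self.position}{self.mt}"


def sort_mutations(mutations: Iterable[Mutation]) -> List[Mutation]:
    """Sort by position, then by letter."""
    return sorted(mutations, key=lambda m: (m.position, m.wt, m.mt))


def mutate(seq: str, mutations: Iterable[Mutation]) -> str:
    """Apply substitutions, checking the wild-type letter at each position."""
    letters = list(seq)
    for m in sort_mutations(mutations):
        if letters[m.position] != m.wt:
            raise ValueError(f"{m}: found {letters[m.position]} at index {m.position}")
        letters[m.position] = m.mt
    return "".join(letters)


mut = Mutation("A", 2, "C")
mut2 = Mutation("K", 1, "R")
mutant = mutate("MKAIL", [mut, mut2])
ordered = [str(m) for m in sort_mutations([mut, mut2])]
shifted = [str(m + 1) for m in sort_mutations([mut, mut2])]
```

Checking the wild-type letter before substituting is what makes an offset mutation such as `mut - 1` fail loudly when the reference does not have the expected residue at the new index.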
opened by ericmjl 4
Adjusting alignment parameters?
I have a need to adjust alignment parameters; for example, I have encountered something akin to this issue, and the proposed solution from the author of MAFFT is to adjust one of the MAFFT parameters.

Adjusting alignment parameters via the .seq.align() API might be helpful. A few designs for the user-facing API that I can think of include:

# default aligner is MAFFT, so we can pass through the command line options via kwargs.
sequences.seq.align(ep=1.59, op=0.0)

# want to use MUSCLE instead of MAFFT
from seqlike.AlignCommandLine import MuscleCommandLine as muscle
sequences.seq.align(aligner=muscle, muscle_arg1=something, muscle_arg2=something)
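
A minimal sketch of the kwargs pass-through in the first design, assuming the aligner accepts options as `--name value` flags. `build_command` and the file name `input.fasta` are hypothetical, and nothing here invokes an aligner; it only builds the argument list that could later be handed to `subprocess.run`.

```python
from typing import List


def build_command(executable: str, input_path: str, **options) -> List[str]:
    """Turn keyword arguments into `--name value` command-line flags."""
    command = [executable]
    for name, value in options.items():
        command += [f"--{name}", str(value)]
    return command + [input_path]


cmd = build_command("mafft", "input.fasta", ep=1.59, op=0.0)
```

Building the command as a list rather than a single string avoids shell quoting issues, and keeps the pass-through generic so a different aligner only needs a different flag convention.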
enhancement low-priority
opened by ericmjl 1
Add all-contributors bot

We should ensure that all contributions are recognized. Following the all contributors spec, we should use the bot to help recognize contributors of all kinds to the project.

https://allcontributors.org/

opened by ericmjl 26

Releases(v1.3.4)

Contribution details for each release can be found in CHANGELOG.md.

v1.3.4 (Dec 23, 2022)
v1.3.3 (Nov 4, 2022)
v1.3.2 (Aug 26, 2022)
v1.3.1 (Aug 11, 2022)
v1.3.0 (May 17, 2022)
v1.2.0 (Apr 12, 2022)
v1.1.9 (Apr 12, 2022)
v1.1.8 (Apr 8, 2022)
v1.1.7 (Feb 2, 2022)
v1.1.6 (Nov 19, 2021)
v1.1.5 (Nov 15, 2021)
v1.1.4 (Nov 13, 2021)
v1.1.3 (Nov 3, 2021)

Owner

GitHub Repository https://modernatx.github.io/seqlike

