Structural basis for solubility in protein expression systems

Overview

Structural basis for solubility in protein expression systems

Twitter Follow GitHub repo size

Large-scale protein production for biotechnology and biopharmaceutical applications rely on high protein solubility in expression systems. Solubility has been measured for a significant fraction of E. coli and S. cerevisiae proteomes and these datasets are routinely used to train predictors of protein solubility in different organisms. Thanks to continued advances in experimental structure-determination and modelling, many of these solubility measurements can now be paired with accurate structural models.

The challenge is mentored by Christopher Ing and Mark Fingerhuth.

Aim of the challenge

It is the objective of this project to use our provided dataset of protein structure and solubility value pairs in order to produce a solubility predictor with comparable accuracy to sequence-based predictors reported in the literature. The provided dataset to be used in this project is created by following the dataset curation procedure described in the SOLart paper, and this hackathon project has a similar aim to this manuscript.

The dataset

The process of generating the dataset is described in the SOLArt manuscript. At a high level, all experimentally tested E. coli and S. cerevisiae proteins were matched through Uniprot IDs to known crystallographic structures or high sequence similarity homology models. After balancing the fold types using CATH, a dataset containing a balanced spread of solubility values was produced. The resulting proteins for the training and testing of these models were prepared and disclosed in the supplemental material of this paper as a list of (Uniprot,PDB,Chain,Solubility) pairs. The PDB files were not included in this work so we had to re-extract them from SWISS-MODEL. Whenever a crystallographic structure was present, it was used, assuming high coverage over the Uniprot sequence. In some cases, the original PDB templates used within the original SOLArt paper had been superceded by improved templates, and we opted to take the highest resolution, highest sequence identity, models in our updated dataset. We stripped away all irrelevant chains and heteroatoms.

If issues are identified with individual structures, please refer to the Uniprot ID and manually investigate the best template. In some cases, we needed to improve structure correctness by modelling missing atoms/residues inside the Chemical Computing Group software MOE on a case-by-case basis.

The dataset can be found in the data/ subdirectory - it is already divided into training/ and test/ data. The training/ data comes with solubility_values.csv and solublity_values.yaml (same content just different format) which both contain the solubility target values for all the PDB files provided in that directory. Note that each PDB file is named after the Uniprot identifier of the respective protein and the protein column in the solubility_values.csv also contains the Uniprot identifiers.

The test/ dataset consists of three different subdirectories (protein structures derived from different organisms and with different approaches) and you should NOT use them for any training. Only the yeast_crystal_structs/ directory contains solubility_values.csv and solublity_values.yaml (same content just different format) files which you can use for some local testing & validation. In order to find out your performance on the entire test dataset you need to use the automated benchmarking system (see below).

Example output

Your code should output a file called predictions.csv in the following format:

protein,solubility
P69829,83
P31133,62

whereby the protein column contains the Uniprot ID (corresponds to the filename of the PDB files) and the solubility column contains the predicted solubility value (can be int or float).

Note, that there are three (!) test subsets but you are expected to submit all the predictions in one file (not three) for the benchmarking system to work.

Automated benchmarking system

The continuous integration script in .github/workflows/ci.yml will automatically build the Dockerfile on every commit to the main branch. This docker image will be published as your hackathon submission to https://biolib.com//. For this to work, make sure you set the BIOLIB_TOKEN and BIOLIB_PROJECT_URI accordingly as repository secrets.

To read more about the benchmarking system click here.

Say thanks

Give this repo a star: GitHub Repo stars

Star the ProteinQure org on Github: GitHub Org's stars

Owner
ProteinQure
ProteinQure
urlwatch is intended to help you watch changes in webpages and get notified of any changes.

urlwatch is intended to help you watch changes in webpages and get notified (via e-mail, in your terminal or through various third party services) of any changes.

Thomas Perl 2.5k Jan 08, 2023
Bazel rules to install Python dependencies with Poetry

rules_python_poetry Bazel rules to install Python dependencies from a Poetry project. Works with native Python rules for Bazel. Getting started Add th

Martin Liu 7 Dec 15, 2021
A Python program for calculating the 95%CI for GNSS-derived site velocities

GNSS_Vel_95%CI A Python program for calculating the 95%CI for GNSS-derived site velocities Function_GNSS_95CI.py is a Python function for calculating

<a href=[email protected]"> 4 Dec 16, 2022
Cairo-integer-types - A library for bitwise integer types (e.g. int64 or uint32) in Cairo, with a test suite

The Cairo bitwise integer library (cairo-bitwise-int v0.1.1) The Cairo smart tes

27 Sep 23, 2022
Stop ask your soraka to ult you, just ult yourself

Lollo's auto-ultimate script Are you tired of your low elo friend who can't ult you with soraka when you ask for it? Use Useless Support and just ult

9 Oct 20, 2022
Wordle-solve - Attempting to solve wordle

Wordle Solver Run with python wordle_beater.py. This hardmode wordle solver take

Tom Lockwood 42 Oct 11, 2022
SimplePyBLE - Python bindings for SimpleBLE

The ultimate fully-fledged cross-platform Python BLE library, designed for simplicity and ease of use.

Open Bluetooth Toolbox 27 Aug 28, 2022
Data Science Course at Dept. of Computer Engineering, Chula 2022

2110446 Data Science Course at Chula 2022 Short links for exercises: Week1: Intro to Numpy, Pandas Numpy: https://colab.research.google.com/github/kao

Kao Panboonyuen 17 Nov 27, 2022
디텍션 유틸 모음

Object detection utils 유틸모음 설명 링크 convert convert 관련코드 https://github.com/AI-infinyx/ob_utils/tree/main/convert crawl 구글, 네이버, 빙 등 크롤링 관련 https://gith

codetest 41 Jan 22, 2021
Demo of a WAM Prolog implementation in Python

Prol: WAM demo This is a simplified Warren Abstract Machine (WAM) implementation for Prolog, that showcases the main instructions, compiling, register

Bruno Kim Medeiros Cesar 62 Dec 26, 2022
This repository requires you to solve a problem by writing some basic python code.

Can You Solve a Problem? A beginner friendly repository that requires you to solve familiar problems with python. This could be as simple as implement

Precious Kolawole 11 Nov 30, 2022
a simple functional programming language compiler written in python

Functional Programming Language A compiler for my small functional language. Written in python with SLY lexer/parser generator library. Requirements p

Ashkan Laei 3 Nov 05, 2021
Oblique Strategies for Python

Oblique Strategies for Python

Łukasz Langa 3 Feb 17, 2022
Материалы для курса VK Углубленный Python, весна 2022

VK Углубленный Python, весна 2022 Материалы для курса VK Углубленный Python, весна 2022 Лекции и материалы (слайды, домашки, код с занятий) Введение,

10 Nov 02, 2022
A proof-of-concept package manager for Cairo contracts/libraries

glyph A proof-of-concept package manager for Cairo contracts/libraries. Distribution through pypi. Installation through existing package managers -- p

Sam Barnes 11 Jun 06, 2022
Cross-platform config and manager for click console utilities.

climan Help the project financially: Donate: https://smartlegion.github.io/donate/ Yandex Money: https://yoomoney.ru/to/4100115206129186 PayPal: https

3 Aug 31, 2021
Swubcase - The shitty programming language

What is Swubcase? Swubcase is easy-to-use programming language that can fuck you

5 Jun 19, 2022
🎉 🎉 PyComp - Java Code compiler written in python.

🎉 🎉 PyComp Java Code compiler written in python. This is yet another compiler meant for babcock students project which was created using pure python

Alumona Benaiah 5 Nov 30, 2022
This is the repo for Uncertainty Quantification 360 Toolkit.

UQ360 The Uncertainty Quantification 360 (UQ360) toolkit is an open-source Python package that provides a diverse set of algorithms to quantify uncert

International Business Machines 207 Dec 30, 2022
Extra scripts to improve user experience related to OpenTaiko

OpenTaiko-Utils Extra scripts to improve user experience related to OpenTaiko osu2tja /!\ IMPORTANT NOTE /!\ Converted charts that aren't yours are fo

2 Dec 25, 2022