An easy-to-use feature store


💾 What is a feature store?

A feature store is a data storage system for data science and machine learning. It can store raw data as well as transformed features, which can be fed straight into an ML model or training script.

Feature stores allow data scientists and engineers to be more productive by organising the flow of data into models.

The ByteHub Feature Store is designed to:

  • Be simple to use, with a Pandas-like API;
  • Require no complicated infrastructure, running on a local Python installation or in a cloud environment;
  • Be optimised towards timeseries operations, making it highly suited to applications in finance, energy, and forecasting; and
  • Support simple time/value data as well as complex structures, e.g. dictionaries.

It is built on Dask to support large datasets and cluster compute environments.

🦉 Features

  • Searchable feature information and metadata can be stored locally using SQLite or in a remote database.
  • Timeseries data is saved in Parquet format using Dask, making it readable from a wide range of other tools. Data can reside either on a local filesystem or in a cloud storage service, e.g. AWS S3.
  • Supports timeseries joins, along with filtering and resampling operations to make it easy to load and prepare datasets for ML training.
  • Feature engineering steps can be implemented as transforms. These are saved within the feature store, allowing simple, reusable preparation of raw data.
  • Time travel can retrieve feature values based on when they were created, which can be useful for forecasting applications.
  • Simple APIs to retrieve timeseries dataframes for training, or a dictionary of the most recent feature values for inference (see the sketch below).
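
For example, once features exist (they are created in the quick-start below), the most recent values could be fetched as a dictionary for inference. This is a minimal sketch: the fs.last call and the example output shown in the comment are assumptions, not confirmed by this README.

import bytehub as bh

fs = bh.FeatureStore()

# Assumed API: return the latest value of each feature as a dictionary,
# e.g. something like {'tutorial/numbers': 405, 'tutorial/squared': 164025}
latest = fs.last(['tutorial/numbers', 'tutorial/squared'])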

Also available as ☁️ ByteHub Cloud: a ready-to-use, cloud-hosted feature store.

📖 Documentation and tutorials

See the ByteHub documentation and notebook tutorials to learn more and get started.

🚀 Quick-start

Install using pip:

pip install bytehub

Create a local SQLite feature store by running:

import bytehub as bh
import pandas as pd

fs = bh.FeatureStore()

Data lives inside namespaces within each feature store. They can be used to separate projects or environments. Create a namespace as follows:

fs.create_namespace(
    'tutorial', url='/tmp/featurestore/tutorial', description='Tutorial datasets'
)
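
The namespace url can also point at cloud storage rather than a local path (see the Features section above). As a rough sketch, assuming an s3:// URL is accepted here and that S3 credentials are already configured for Dask, the hypothetical bucket below would hold the Parquet data:

fs.create_namespace(
    'production', url='s3://my-bucket/featurestore/production',  # hypothetical bucket path
    description='Production datasets stored on S3'
)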

Create a feature inside this namespace, which will be used to store a timeseries of pre-prepared data:

fs.create_feature('tutorial/numbers', description='Timeseries of numbers')

Now save some data into the feature store:

dts = pd.date_range('2020-01-01', '2021-02-09')
df = pd.DataFrame({'time': dts, 'value': list(range(len(dts)))})

fs.save_dataframe(df, 'tutorial/numbers')

The data is now stored, ready to be transformed, resampled, merged with other data, and fed to machine-learning models.
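
As a quick illustration of resampling, the daily series could be loaded at a weekly frequency roughly as follows. This is a minimal sketch: load_dataframe is introduced below, but the freq argument used here is an assumption about its API.

df_weekly = fs.load_dataframe(
    'tutorial/numbers',
    from_date='2020-01-01', to_date='2020-03-01',
    freq='1W'  # assumed resampling argument
)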

We can engineer new features from existing ones using the transform decorator. Suppose we want to define a new feature that contains the squared values of tutorial/numbers:

@fs.transform('tutorial/squared', from_features=['tutorial/numbers'])
def squared_numbers(df):
    # The transform function receives a dataframe of the input feature(s)
    return df ** 2  # Square the input values

Now both features are saved in the feature store, and can be queried using:

df_query = fs.load_dataframe(
    ['tutorial/numbers', 'tutorial/squared'],
    from_date='2021-01-01', to_date='2021-01-31'
)

To connect to ByteHub Cloud, first register for an account, then use:

fs = bh.FeatureStore("https://api.bytehub.ai")

This will allow you to store features in your own private namespace on ByteHub Cloud, and save datasets to an AWS S3 storage bucket.

🐾 Roadmap

  • Tasks to automate updates to features using orchestration tools like Airflow