pyhsmm MITpyhsmm - Bayesian inference in HSMMs and HMMs. MIT

Related tags

Data Analysispyhsmm
Overview

Build Status

Bayesian inference in HSMMs and HMMs

This is a Python library for approximate unsupervised inference in Bayesian Hidden Markov Models (HMMs) and explicit-duration Hidden semi-Markov Models (HSMMs), focusing on the Bayesian Nonparametric extensions, the HDP-HMM and HDP-HSMM, mostly with weak-limit approximations.

There are also some extensions:

Installing from PyPI

Give this a shot:

pip install pyhsmm

You may need to install a compiler with -std=c++11 support, like gcc-4.7 or higher.

To install manually from the git repo, you'll need cython. Then try this:

python setup.py install

It might also help to look at the travis file to see how to set up a working install from scratch.

Running

See the examples directory.

For the Python interpreter to be able to import pyhsmm, you'll need it on your Python path. Since the current working directory is usually included in the Python path, you can probably run the examples from the same directory in which you run the git clone with commands like python pyhsmm/examples/hsmm.py. You might also want to add pyhsmm to your global Python path (e.g. by copying it to your site-packages directory).

A Simple Demonstration

Here's how to draw from the HDP-HSMM posterior over HSMMs given a sequence of observations. (The same example, along with the code to generate the synthetic data loaded in this example, can be found in examples/basic.py.)

Let's say we have some 2D data in a data.txt file:

$ head -5 data.txt
-3.711962552600095444e-02 1.456401745267922598e-01
7.553818775915704942e-02 2.457422192223903679e-01
-2.465977987699214502e+00 5.537627981813508793e-01
-7.031638516485749779e-01 1.536468304146855757e-01
-9.224669847039665971e-01 3.680035337673161489e-01

In Python, we can plot the data in a 2D plot, collapsing out the time dimension:

import numpy as np
from matplotlib import pyplot as plt

data = np.loadtxt('data.txt')
plt.plot(data[:,0],data[:,1],'kx')

2D data

We can also make a plot of time versus the first principal component:

from pyhsmm.util.plot import pca_project_data
plt.plot(pca_project_data(data,1))

Data first principal component vs time

To learn an HSMM, we'll use pyhsmm to create a WeakLimitHDPHSMM instance using some reasonable hyperparameters. We'll ask this model to infer the number of states as well, so we'll give it an Nmax parameter:

import pyhsmm
import pyhsmm.basic.distributions as distributions

obs_dim = 2
Nmax = 25

obs_hypparams = {'mu_0':np.zeros(obs_dim),
                'sigma_0':np.eye(obs_dim),
                'kappa_0':0.3,
                'nu_0':obs_dim+5}
dur_hypparams = {'alpha_0':2*30,
                 'beta_0':2}

obs_distns = [distributions.Gaussian(**obs_hypparams) for state in range(Nmax)]
dur_distns = [distributions.PoissonDuration(**dur_hypparams) for state in range(Nmax)]

posteriormodel = pyhsmm.models.WeakLimitHDPHSMM(
        alpha=6.,gamma=6., # better to sample over these; see concentration-resampling.py
        init_state_concentration=6., # pretty inconsequential
        obs_distns=obs_distns,
        dur_distns=dur_distns)

(The first two arguments set the "new-table" proportionality constant for the meta-Chinese Restaurant Process and the other CRPs, respectively, in the HDP prior on transition matrices. For this example, they really don't matter at all, but on real data it's much better to infer these parameters, as in examples/concentration_resampling.py.)

Then, we add the data we want to condition on:

posteriormodel.add_data(data,trunc=60)

The trunc parameter is an optional argument that can speed up inference: it sets a truncation limit on the maximum duration for any state. If you don't pass in the trunc argument, no truncation is used and all possible state duration lengths are considered. (pyhsmm has fancier ways to speed up message passing over durations, but they aren't documented.)

If we had multiple observation sequences to learn from, we could add them to the model just by calling add_data() for each observation sequence.

Now we run a resampling loop. For each iteration of the loop, all the latent variables of the model will be resampled by Gibbs sampling steps, including the transition matrix, the observation means and covariances, the duration parameters, and the hidden state sequence. We'll also copy some samples so that we can plot them.

models = []
for idx in progprint_xrange(150):
    posteriormodel.resample_model()
    if (idx+1) % 10 == 0:
        models.append(copy.deepcopy(posteriormodel))

Now we can plot our saved samples:

fig = plt.figure()
for idx, model in enumerate(models):
    plt.clf()
    model.plot()
    plt.gcf().suptitle('HDP-HSMM sampled after %d iterations' % (10*(idx+1)))
    plt.savefig('iter_%.3d.png' % (10*(idx+1)))

Sampled models

I generated these data from an HSMM that looked like this:

Randomly-generated model and data

So the posterior samples look pretty good!

A convenient shortcut to build a list of sampled models is to write

model_samples = [model.resample_and_copy() for itr in progprint_xrange(150)]

That will build a list of model objects (each of which can be inspected, plotted, pickled, etc, independently) in a way that won't duplicate data that isn't changed (like the observations or hyperparameter arrays) so that memory usage is minimized. It also minimizes file size if you save samples like

import cPickle
with open('sampled_models.pickle','w') as outfile:
    cPickle.dump(model_samples,outfile,protocol=-1)

Extending the Code

To add your own observation or duration distributions, implement the interfaces defined in basic/abstractions.py. To get a flavor of the style, see pybasicbayes.

References

@article{johnson2013hdphsmm,
    title={Bayesian Nonparametric Hidden Semi-Markov Models},
    author={Johnson, Matthew J. and Willsky, Alan S.},
    journal={Journal of Machine Learning Research},
    pages={673--701},
    volume={14},
    month={February},
    year={2013},
}

Authors

Matt Johnson, Alex Wiltschko, Yarden Katz, Chia-ying (Jackie) Lee, Scott Linderman, Kevin Squire, Nick Foti.

Owner
Matthew Johnson
research scientist @ Google Brain
Matthew Johnson
An Indexer that works out-of-the-box when you have less than 100K stored Documents

U100KIndexer An Indexer that works out-of-the-box when you have less than 100K stored Documents. U100K means under 100K. At 100K stored Documents with

Jina AI 7 Mar 15, 2022
Approximate Nearest Neighbor Search for Sparse Data in Python!

Approximate Nearest Neighbor Search for Sparse Data in Python! This library is well suited to finding nearest neighbors in sparse, high dimensional spaces (like text documents).

Meta Research 906 Jan 01, 2023
Retail-Sim is python package to easily create synthetic dataset of retaile store.

Retailer's Sale Data Simulation Retail-Sim is python package to easily create synthetic dataset of retaile store. Simulation Model Simulator consists

Corca AI 7 Sep 30, 2022
Deep universal probabilistic programming with Python and PyTorch

Getting Started | Documentation | Community | Contributing Pyro is a flexible, scalable deep probabilistic programming library built on PyTorch. Notab

7.7k Dec 30, 2022
Python Implementation of Scalable In-Memory Updatable Bitmap Indexing

PyUpBit CS490 Large Scale Data Analytics — Implementation of Updatable Compressed Bitmap Indexing Paper Table of Contents About The Project Usage Cont

Hyeong Kyun (Daniel) Park 1 Jun 28, 2022
Jupyter notebooks for the book "The Elements of Statistical Learning".

This repository contains Jupyter notebooks implementing the algorithms found in the book and summary of the textbook.

Madiyar 369 Dec 30, 2022
This mini project showcase how to build and debug Apache Spark application using Python

Spark app can't be debugged using normal procedure. This mini project showcase how to build and debug Apache Spark application using Python programming language. There are also options to run Spark a

Denny Imanuel 1 Dec 29, 2021
Python Package for DataHerb: create, search, and load datasets.

The Python Package for DataHerb A DataHerb Core Service to Create and Load Datasets.

DataHerb 4 Feb 11, 2022
Detecting Underwater Objects (DUO)

Underwater object detection for robot picking has attracted a lot of interest. However, it is still an unsolved problem due to several challenges. We take steps towards making it more realistic by ad

27 Dec 12, 2022
Business Intelligence (BI) in Python, OLAP

Open Mining Business Intelligence (BI) Application Server written in Python Requirements Python 2.7 (Backend) Lua 5.2 or LuaJIT 5.1 (OML backend) Mong

Open Mining 1.2k Dec 27, 2022
Randomisation-based inference in Python based on data resampling and permutation.

Randomisation-based inference in Python based on data resampling and permutation.

67 Dec 27, 2022
An experimental project I'm undertaking for the sole purpose of increasing my Python knowledge

5ePy is an experimental project I'm undertaking for the sole purpose of increasing my Python knowledge. #Goals Goal: Create a working, albeit lightwei

Hayden Covington 1 Nov 24, 2021
In this tutorial, raster models of soil depth and soil water holding capacity for the United States will be sampled at random geographic coordinates within the state of Colorado.

Raster_Sampling_Demo (Resulting graph of this demo) Background Sampling values of a raster at specific geographic coordinates can be done with a numbe

2 Dec 13, 2022
BinTuner is a cost-efficient auto-tuning framework, which can deliver a near-optimal binary code that reveals much more differences than -Ox settings.

BinTuner is a cost-efficient auto-tuning framework, which can deliver a near-optimal binary code that reveals much more differences than -Ox settings. it also can assist the binary code analysis rese

BinTuner 42 Dec 16, 2022
Zipline, a Pythonic Algorithmic Trading Library

Zipline is a Pythonic algorithmic trading library. It is an event-driven system for backtesting. Zipline is currently used in production as the backte

Quantopian, Inc. 15.7k Jan 07, 2023
Generate lookml for views from dbt models

dbt2looker Use dbt2looker to generate Looker view files automatically from dbt models. Features Column descriptions synced to looker Dimension for eac

lightdash 126 Dec 28, 2022
This python script allows you to manipulate the audience data from Sl.ido surveys

Slido-Automated-VoteBot This python script allows you to manipulate the audience data from Sl.ido surveys Since Slido blocks interference from automat

Pranav Menon 1 Jan 24, 2022
Codes for the collection and predictive processing of bitcoin from the API of coinmarketcap

Codes for the collection and predictive processing of bitcoin from the API of coinmarketcap

Teo Calvo 5 Apr 26, 2022
Streamz helps you build pipelines to manage continuous streams of data

Streamz helps you build pipelines to manage continuous streams of data. It is simple to use in simple cases, but also supports complex pipelines that involve branching, joining, flow control, feedbac

Python Streamz 1.1k Dec 28, 2022
🌍 Create 3d-printable STLs from satellite elevation data 🌏

mapa 🌍 Create 3d-printable STLs from satellite elevation data Installation pip install mapa Usage mapa uses numpy and numba under the hood to crunch

Fabian Gebhart 13 Dec 15, 2022