Fast 1D and 2D histogram functions in Python

Overview

Azure Status asv

About

Sometimes you just want to compute simple 1D or 2D histograms with regular bins. Fast. No nonsense. Numpy's histogram functions are versatile, and can handle for example non-regular binning, but this versatility comes at the expense of performance.

The fast-histogram mini-package aims to provide simple and fast histogram functions for regular bins that don't compromise on performance. It doesn't do anything complicated - it just implements a simple histogram algorithm in C and keeps it simple. The aim is to have functions that are fast but also robust and reliable. The result is a 1D histogram function here that is 7-15x faster than numpy.histogram, and a 2D histogram function that is 20-25x faster than numpy.histogram2d.

To install:

pip install fast-histogram

or if you use conda you can instead do:

conda install -c conda-forge fast-histogram

The fast_histogram module then provides two functions: histogram1d and histogram2d:

from fast_histogram import histogram1d, histogram2d

Example

Here's an example of binning 10 million points into a regular 2D histogram:

In [1]: import numpy as np

In [2]: x = np.random.random(10_000_000)

In [3]: y = np.random.random(10_000_000)

In [4]: %timeit _ = np.histogram2d(x, y, range=[[-1, 2], [-2, 4]], bins=30)
935 ms ± 58.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: from fast_histogram import histogram2d

In [6]: %timeit _ = histogram2d(x, y, range=[[-1, 2], [-2, 4]], bins=30)
40.2 ms ± 624 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

(note that 10_000_000 is possible in Python 3.6 syntax, use 10000000 instead in previous versions)

The version here is over 20 times faster! The following plot shows the speedup as a function of array size for the bin parameters shown above:

Comparison of performance between Numpy and fast-histogram

as well as results for the 1D case, also with 30 bins. The speedup for the 2D case is consistently between 20-25x, and for the 1D case goes from 15x for small arrays to around 7x for large arrays.

Q&A

Why don't the histogram functions return the edges?

Computing and returning the edges may seem trivial but it can slow things down by a factor of a few when computing histograms of 10^5 or fewer elements, so not returning the edges is a deliberate decision related to performance. You can easily compute the edges yourself if needed though, using numpy.linspace.

Doesn't package X already do this, but better?

This may very well be the case! If this duplicates another package, or if it is possible to use Numpy in a smarter way to get the same performance gains, please open an issue and I'll consider deprecating this package :)

One package that does include fast histogram functions (including in n-dimensions) and can compute other statistics is vaex, so take a look there if you need more advanced functionality!

Are the 2D histograms not transposed compared to what they should be?

There is technically no 'right' and 'wrong' orientation - here we adopt the convention which gives results consistent with Numpy, so:

numpy.histogram2d(x, y, range=[[xmin, xmax], [ymin, ymax]], bins=[nx, ny])

should give the same result as:

fast_histogram.histogram2d(x, y, range=[[xmin, xmax], [ymin, ymax]], bins=[nx, ny])

Why not contribute this to Numpy directly?

As mentioned above, the Numpy functions are much more versatile, so they could not be replaced by the ones here. One option would be to check in Numpy's functions for cases that are simple and dispatch to functions such as the ones here, or add dedicated functions for regular binning. I hope we can get this in Numpy in some form or another eventually, but for now, the aim is to have this available to packages that need to support a range of Numpy versions.

Why not use Cython?

I originally implemented this in Cython, but found that I could get a 50% performance improvement by going straight to a C extension.

What about using Numba?

I specifically want to keep this package as easy as possible to install, and while Numba is a great package, it is not trivial to install outside of Anaconda.

Could this be parallelized?

This may benefit from parallelization under certain circumstances. The easiest solution might be to use OpenMP, but this won't work on all platforms, so it would need to be made optional.

Couldn't you make it faster by using the GPU?

Almost certainly, though the aim here is to have an easily installable and portable package, and introducing GPUs is going to affect both of these.

Why make a package specifically for this? This is a tiny amount of functionality

Packages that need this could simply bundle their own C extension or Cython code to do this, but the main motivation for releasing this as a mini-package is to avoid making pure-Python packages into packages that require compilation just because of the need to compute fast histograms.

Can I contribute?

Yes please! This is not meant to be a finished package, and I welcome pull request to improve things.

Owner
Thomas Robitaille
Thomas Robitaille
Create a visualization for Trump's Tweeted Words Using Python

Data Trump's Tweeted Words This plot illustrates twitter word occurences. We already did the coding I needed for this plot, so I was very inspired to

7 Mar 27, 2022
A simple project on Data Visualization for CSCI-40 course.

Simple-Data-Visualization A simple project on Data Visualization for CSCI-40 course - the instructions can be found here SAT results in New York in 20

Hugo Matousek 8 Oct 27, 2021
A Python Binder that merge 2 files with any extension by creating a new python file and compiling it to exe which runs both payloads.

Update ! ANONFILE MIGHT NOT WORK ! About A Python Binder that merge 2 files with any extension by creating a new python file and compiling it to exe w

Vesper 15 Oct 12, 2022
A shimmer pre-load component for Plotly Dash

dash-loading-shimmer A shimmer pre-load component for Plotly Dash Installation Get it with pip: pip install dash-loading-extras Or maybe you prefer Pi

Lucas Durand 4 Oct 12, 2022
Function Plotter: a simple application with GUI to plot mathematical functions

Function-Plotter Function Plotter is a simple application with GUI to plot mathe

Mohamed Nabawe 4 Jan 03, 2022
Machine learning beginner to Kaggle competitor in 30 days. Non-coders welcome. The program starts Monday, August 2, and lasts four weeks. It's designed for people who want to learn machine learning.

30-Days-of-ML-Kaggle 🔥 About the Hands On Program 💻 Machine learning beginner → Kaggle competitor in 30 days. Non-coders welcome The program starts

Roja Achary 145 Jan 01, 2023
An(other) implementation of JSON Schema for Python

jsonschema jsonschema is an implementation of JSON Schema for Python. from jsonschema import validate # A sample schema, like what we'd get f

Julian Berman 4k Jan 04, 2023
Drag’n’drop Pivot Tables and Charts for Jupyter/IPython Notebook, care of PivotTable.js

pivottablejs: the Python module Drag’n’drop Pivot Tables and Charts for Jupyter/IPython Notebook, care of PivotTable.js Installation pip install pivot

Nicolas Kruchten 512 Dec 26, 2022
High-level geospatial data visualization library for Python.

geoplot: geospatial data visualization geoplot is a high-level Python geospatial plotting library. It's an extension to cartopy and matplotlib which m

Aleksey Bilogur 1k Jan 01, 2023
Draw tree diagrams from indented text input

Draw tree diagrams This repository contains two very different scripts to produce hierarchical tree diagrams like this one: $ ./classtree.py collectio

Luciano Ramalho 8 Dec 14, 2022
Some method of processing point cloud

Point-Cloud Some method of processing point cloud inversion the completion pointcloud to incomplete point cloud Some model of encoding point cloud to

Tan 1 Nov 19, 2021
A Python package for caclulations and visualizations in geological sciences.

geo_calcs A Python package for caclulations and visualizations in geological sciences. Free software: MIT license Documentation: https://geo-calcs.rea

Drew Heasman 1 Jul 12, 2022
Editor and Presenter for Manim Generated Content.

Editor and Presenter for Manim Generated Content. Take a look at the Working Example. More information can be found on the documentation. These Browse

Manim Community 149 Dec 29, 2022
📊 Charts with pure python

A zero-dependency python package that prints basic charts to a Jupyter output Charts supported: Bar graphs Scatter plots Histograms 🍑 📊 👏 Examples

Max Humber 54 Oct 04, 2022
A python script and steps to display locations of peers connected to qbittorrent

A python script (along with instructions) to display the locations of all the peers your qBittorrent client is connected to in a Grafana worldmap dash

62 Dec 07, 2022
This is a super simple visualization toolbox (script) for transformer attention visualization ✌

Trans_attention_vis This is a super simple visualization toolbox (script) for transformer attention visualization ✌ 1. How to prepare your attention m

Mingyu Wang 3 Jul 09, 2022
Tools for calculating and visualizing Elo-like ratings of MLB teams using Retosheet data

Overview This project uses historical baseball games data to calculate an Elo-like rating for MLB teams based on regular season match ups. The Elo rat

Lukas Owens 0 Aug 25, 2021
YOPO is an interactive dashboard which generates various standard plots.

YOPO is an interactive dashboard which generates various standard plots.you can create various graphs and charts with a click of a button. This tool uses Dash and Flask in backend.

ADARSH C 38 Dec 20, 2022
matplotlib: plotting with Python

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. Check out our home page for more inform

Matplotlib Developers 16.7k Jan 08, 2023
Visual Python is a GUI-based Python code generator, developed on the Jupyter Notebook environment as an extension.

Visual Python is a GUI-based Python code generator, developed on the Jupyter Notebook environment as an extension.

Visual Python 564 Jan 03, 2023