A Numba-based two-point correlation function calculator using a grid decomposition

Overview

numba-2pcf

tests

A Numba-based two-point correlation function (2PCF) calculator using a grid decomposition. Like Corrfunc, but written in Numba, with simplicity and hackability in mind.

Aside from the 2PCF calculation, the particle_grid module is both simple and fast and may be useful on its own as a way to partition particle sets in 3D.

Installation

$ git clone https://github.com/lgarrison/numba-2pcf.git
$ cd numba-2pcf
$ python -m pip install -e .

Example

from numba_2pcf.cf import numba_2pcf
import numpy as np

rng = np.random.default_rng(123)
N = 10**6
box = 2.
pos = rng.random((N,3), dtype=np.float32)*box

res = numba_2pcf(pos, box, Rmax=0.05, nbin=10)
res.pprint_all()
        rmin                 rmax                 rmid                    xi            npairs 
-------------------- -------------------- -------------------- ----------------------- --------
                 0.0 0.005000000074505806 0.002500000037252903   -0.004519257448573177    65154
0.005000000074505806 0.010000000149011612  0.00750000011175871   0.0020113763064291135   459070
0.010000000149011612  0.01500000022351742 0.012500000186264515    0.000984359247434119  1244770
 0.01500000022351742 0.020000000298023225 0.017500000260770324  -6.616896085054336e-06  2421626
0.020000000298023225  0.02500000037252903 0.022500000335276125  0.00019365366488166558  3993210
 0.02500000037252903  0.03000000044703484 0.027500000409781934   5.769329601057471e-05  5956274
 0.03000000044703484  0.03500000052154064 0.032500000484287736   0.0006815801672250821  8317788
 0.03500000052154064  0.04000000059604645 0.037500000558793545    2.04711840243732e-05 11061240
 0.04000000059604645  0.04500000067055226 0.042500000633299354   9.313641918828885e-05 14203926
 0.04500000067055226  0.05000000074505806  0.04750000070780516 -0.00011690771042793813 17734818

Performance

The goal of this project is not to provide the absolute best performance that given hardware can produce, but it is a goal to provide as good performance as Numba will let us reach (while keeping the code readable). So we pay special attention to things like dtype (use float32 particle inputs when possible!), parallelization, and some early-exit conditions (when we know a pair can't fall in any bin).

As a demonstration that this code provides passably good performance, here's a dummy test of 107 unclustered data points in a 2 Gpc/h box (so number density 1.2e-3), with Rmax=200 Mpc/h and bin width of 1 Mpc/h:

from numba_2pcf.cf import numba_2pcf
import numpy as np

rng = np.random.default_rng(123)
N = 10**6
box = 2000
pos = rng.random((N,3), dtype=np.float32)*box

%timeit numba_2pcf(pos, box, Rmax=150, nbin=150, corrfunc=False, nthread=24)  # 3.5 s
%timeit numba_2pcf(pos, box, Rmax=150, nbin=150, corrfunc=True, nthread=24)  # 1.3 s

So within a factor of 3 of Corrfunc, and we aren't even exploiting the symmetry of the autocorrelation (i.e. we count every pair twice). Not bad!

Testing Against Corrfunc

The code is tested against Corrfunc. And actually, the numba_2pcf() function takes a flag corrfunc=True that calls Corrfunc instead of the Numba implementation to make such testing even easier.

Details

numba_2pcf works a lot like Corrfunc, or any other grid-based 2PCF code: the 3D volume is divided into a grid of cells at least Rmax in size, where Rmax is the maximum radius of the correlation function measurement. Then, we know all valid particle pairs must be in neighboring cells. So the task is simply to loop through each cell in the grid, pairing it with each of its 26 neighbors (plus itself). We parallelize over cell pairs, and add up all the pair counts across threads at the end.

This grid decomposition prunes distant pairwise comparisons, so even though the runtime still formally scales as O(N2), it makes the 2PCF tractable for many realistic problems in cosmology and large-scale structure.

A numba implementation isn't likely to beat Corrfunc on speed, but numba can still be fast enough to be useful (especially when the computation parallelizes well). The idea is that this code provides a "fast enough" parallel implementation while still being highly readable --- the 2PCF implementation is about 150 lines of code, and the gridding scheme 100 lines.

Branches

The particle-jackknife branch contains an implementation of an idea for computing the xi(r) variance based on the variance of the per-particle xi(r) measurements. It doesn't seem to be measuring the right thing, but the code is left for posterity.

Acknowledgments

This repo was generated from @DFM's Cookiecutter Template. Thanks, DFM!

Owner
Lehman Garrison
Flatiron Research Fellow at the Center for Computational Astrophysics
Lehman Garrison
ELFXtract is an automated analysis tool used for enumerating ELF binaries

ELFXtract ELFXtract is an automated analysis tool used for enumerating ELF binaries Powered by Radare2 and r2ghidra This is specially developed for PW

Monish Kumar 49 Nov 28, 2022
Reading streams of Twitter data, save them to Kafka, then process with Kafka Stream API and Spark Streaming

Using Streaming Twitter Data with Kafka and Spark Reading streams of Twitter data, publishing them to Kafka topic, process message using Kafka Stream

Rustam Zokirov 1 Dec 06, 2021
A neural-based binary analysis tool

A neural-based binary analysis tool Introduction This directory contains the demo of a neural-based binary analysis tool. We test the framework using

Facebook Research 208 Dec 22, 2022
Leverage Twitter API v2 to analyze tweet metrics such as impressions and profile clicks over time.

Tweetmetric Tweetmetric allows you to track various metrics on your most recent tweets, such as impressions, retweets and clicks on your profile. The

Mathis HAMMEL 29 Oct 18, 2022
pipeline for migrating lichess data into postgresql

How Long Does It Take Ordinary People To "Get Good" At Chess? TL;DR: According to 5.5 years of data from 2.3 million players and 450 million games, mo

Joseph Wong 182 Nov 11, 2022
Pyspark Spotify ETL

This is my first Data Engineering project, it extracts data from the user's recently played tracks using Spotify's API, transforms data and then loads it into Postgresql using SQLAlchemy engine. Data

16 Jun 09, 2022
A crude Hy handle on Pandas library

Quickstart Hyenas is a curde Hy handle written on top of Pandas API to allow for more elegant access to data-scientist's powerhouse that is Pandas. In

Peter Výboch 4 Sep 05, 2022
A Python 3 library making time series data mining tasks, utilizing matrix profile algorithms

MatrixProfile MatrixProfile is a Python 3 library, brought to you by the Matrix Profile Foundation, for mining time series data. The Matrix Profile is

Matrix Profile Foundation 302 Dec 29, 2022
This program analyzes a DNA sequence and outputs snippets of DNA that are likely to be protein-coding genes.

This program analyzes a DNA sequence and outputs snippets of DNA that are likely to be protein-coding genes.

1 Dec 28, 2021
MeSH2Matrix - A set of Python codes for the generation of biomedical ontologies from the MeSH keywords of the PubMed scholarly publications

A set of Python codes for the generation of biomedical ontologies from the MeSH keywords of the PubMed scholarly publications

SisonkeBiotik 6 Nov 30, 2022
Pizza Orders Data Pipeline Usecase Solved by SQL, Sqoop, HDFS, Hive, Airflow.

PizzaOrders_DataPipeline There is a Tony who is owning a New Pizza shop. He knew that pizza alone was not going to help him get seed funding to expand

Melwin Varghese P 4 Jun 05, 2022
Finds, downloads, parses, and standardizes public bikeshare data into a standard pandas dataframe format

Finds, downloads, parses, and standardizes public bikeshare data into a standard pandas dataframe format.

Brady Law 2 Dec 01, 2021
DataPrep — The easiest way to prepare data in Python

DataPrep — The easiest way to prepare data in Python

SFU Database Group 1.5k Dec 27, 2022
Manage large and heterogeneous data spaces on the file system.

signac - simple data management The signac framework helps users manage and scale file-based workflows, facilitating data reuse, sharing, and reproduc

Glotzer Group 109 Dec 14, 2022
Methylation/modified base calling separated from basecalling.

Remora Methylation/modified base calling separated from basecalling. Remora primarily provides an API to call modified bases for basecaller programs s

Oxford Nanopore Technologies 72 Jan 05, 2023
A Python package for Bayesian forecasting with object-oriented design and probabilistic models under the hood.

Disclaimer This project is stable and being incubated for long-term support. It may contain new experimental code, for which APIs are subject to chang

Uber Open Source 1.6k Dec 29, 2022
signac-flow - manage workflows with signac

signac-flow - manage workflows with signac The signac framework helps users manage and scale file-based workflows, facilitating data reuse, sharing, a

Glotzer Group 44 Oct 14, 2022
Pypeln is a simple yet powerful Python library for creating concurrent data pipelines.

Pypeln Pypeln (pronounced as "pypeline") is a simple yet powerful Python library for creating concurrent data pipelines. Main Features Simple: Pypeln

Cristian Garcia 1.4k Dec 31, 2022
Synthetic Data Generation for tabular, relational and time series data.

An Open Source Project from the Data to AI Lab, at MIT Website: https://sdv.dev Documentation: https://sdv.dev/SDV User Guides Developer Guides Github

The Synthetic Data Vault Project 1.2k Jan 07, 2023