In-place Parallel Super Scalar Samplesort (IPS⁴o)

Related tags

Deep Learningips4o
Overview

In-place Parallel Super Scalar Samplesort (IPS⁴o)

This is the implementation of the algorithm IPS⁴o presented in the paper Engineering In-place (Shared-memory) Sorting Algorithms, which contains an in-depth description of its inner workings, as well as an extensive experimental performance evaluation. Here's the abstract:

We present new sequential and parallel sorting algorithms that now represent the fastest known techniques for a wide range of input sizes, input distributions, data types, and machines. Somewhat surprisingly, part of the speed advantage is due to the additional feature of the algorithms to work in-place, i.e., they do not need a significant amount of space beyond the input array. Previously, the in-place feature often implied performance penalties. Our main algorithmic contribution is a blockwise approach to in-place data distribution that is provably cache-efficient. We also parallelize this approach taking dynamic load balancing and memory locality into account.

Our new comparison-based algorithm In-place Superscalar Samplesort (IPS⁴o), combines this technique with branchless decision trees. By taking cases with many equal elements into account and by adapting the distribution degree dynamically, we obtain a highly robust algorithm that outperforms the best previous in-place parallel comparison-based sorting algorithms by almost a factor of three. That algorithm also outperforms the best comparison-based competitors regardless of whether we consider in-place or not in-place, parallel or sequential settings.

Another surprising result is that IPS⁴o even outperforms the best (in-place or not in-place) integer sorting algorithms in a wide range of situations. In many of the remaining cases (often involving near-uniform input distributions, small keys, or a sequential setting), our new In-place Parallel Super Scalar Radix Sort (IPS²Ra) turns out to be the best algorithm.

Claims to have the -- in some sense -- "best" sorting algorithm can be found in many papers which cannot all be true. Therefore, we base our conclusions on an extensive experimental study involving a large part of the cross product of 21 state-of-the-art sorting codes, 6 data types, 10 input distributions, 4 machines, 4 memory allocation strategies, and input sizes varying over 7 orders of magnitude. This confirms the claims made about the robust performance of our algorithms while revealing major performance problems in many competitors outside the concrete set of measurements reported in the associated publications. This is particularly true for integer sorting algorithms giving one reason to prefer comparison-based algorithms for robust general-purpose sorting.

An initial version of IPS⁴o has been described in our publication on the 25th Annual European Symposium on Algorithms.

Usage

Clone this repository and check out its submodule

git clone --recurse-submodules https://github.com/ips4o/ips4o.git

or use the following commands instead if you want to include this repository as a submodule:

git submodule add https://github.com/ips4o/ips4o.git
git submodule update --recursive --init

IPS⁴o provides a CMake library for simple usage:

add_subdirectory(<path-to-the-ips4o-repository>)
target_link_libraries(<your-target> PRIVATE ips4o)

A minimal working example:

#include "ips4o.hpp"

// sort sequentially
ips4o::sort(begin, end[, comparator]);

// sort in parallel (uses OpenMP if available, std::thread otherwise)
ips4o::parallel::sort(begin, end[, comparator]);

The parallel version of IPS⁴o requires 16-byte atomic compare-and-exchange instructions to run the fastest. Most CPUs and compilers support 16-byte compare-and-exchange instructions nowadays. If the CPU in question does so, IPS⁴o uses 16-byte compare-and-exchange instructions when you set your CPU correctly (e.g., -march=native) or when you enable the instructions explicitly (-mcx16). In this case, you also have to link against GCC's libatomic (-latomic). Otherwise, we emulate some 16-byte compare-and-exchange instructions with locks which may slightly mitigate the performance of IPS⁴o.

If you use the CMake example shown above, we automatically optimize IPS⁴o for the native CPU (e.g., -march=native). You can disable the CMake property IPS4O_OPTIMIZE_FOR_NATIVE to avoid native optimization and you can enable the CMake property IPS4O_USE_MCX16 if you compile with GCC or Clang to enable 16-byte compare-and-exchange instructions explicitly.

IPS⁴o uses C++ threads if not specified otherwise. If you prefer OpenMP threads, you need to enable OpenMP threads, e.g., enable the CMake property IPS4O_USE_OPENMP or add OpenMP to your target. If you enable the CMake property DISABLE_IPS4O_PARALLEL, most of the parallel code will not be compiled and no parallel libraries will be linked. Otherwise, CMake automatically enables C++ threads (e.g., -pthread) and links against TBB and GCC's libatomic. (Only when you compile your code for 16-byte compare-and-exchange instructions you need libatomic.) Thus, you need the Thread Building Blocks (TBB) library to compile and execute the parallel version of IPS⁴o. We search for TBB with find_package(TBB REQUIRED). If you want to execute IPS⁴o in parallel but your TBB library is not accessible via find_package(TBB REQUIRED), you can still compile IPS⁴o with parallel support. Just enable the CMake property DISABLE_IPS4O_PARALLEL, enable C++ threads for your own target and link your own target against your TBB library (and also link your target against libatomic if you want 16-byte atomic compare-and-exchange instruction support).

If you do not set a CMake build type, we use the build type Release which disables debugging (e.g., -DNDEBUG) and enables optimizations (e.g., -O3).

Currently, the code does not compile on Windows.

Licensing

IPS⁴o is free software provided under the BSD 2-Clause License described in the LICENSE file. If you use this implementation of IPS⁴o in an academic setting please cite the paper Engineering In-place (Shared-memory) Sorting Algorithms using the BibTeX entry

@misc{axtmann2020engineering,
  title =	 {Engineering In-place (Shared-memory) Sorting Algorithms},
  author =	 {Michael Axtmann and Sascha Witt and Daniel Ferizovic and Peter Sanders},
  howpublished = {Computing Research Repository (CoRR)},
  year =	 {Sept. 2020},
  archivePrefix ={arXiv},
  eprint =	 {2009.13569},
}
Theano is a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. It can use GPUs and perform efficient symbolic differentiation.

============================================================================================================ `MILA will stop developing Theano https:

9.6k Jan 06, 2023
TART - A PyTorch implementation for Transition Matrix Representation of Trees with Transposed Convolutions

TART This project is a PyTorch implementation for Transition Matrix Representati

Lee Sael 2 Jan 19, 2022
A Real-Time-Strategy game for Deep Learning research

Description DeepRTS is a high-performance Real-TIme strategy game for Reinforcement Learning research. It is written in C++ for performance, but provi

Centre for Artificial Intelligence Research (CAIR) 156 Dec 19, 2022
Labels4Free: Unsupervised Segmentation using StyleGAN

Labels4Free: Unsupervised Segmentation using StyleGAN ICCV 2021 Figure: Some segmentation masks predicted by Labels4Free Framework on real and synthet

70 Dec 23, 2022
Disagreement-Regularized Imitation Learning

Due to a normalization bug the expert trajectories have lower performance than the rl_baseline_zoo reported experts. Please see the following link in

Kianté Brantley 25 Apr 28, 2022
Using Streamlit to host a multi-page tool with model specs and classification metrics, while also accepting user input values for prediction.

Predicitng_viability Using Streamlit to host a multi-page tool with model specs and classification metrics, while also accepting user input values for

Gopalika Sharma 1 Nov 08, 2021
A minimalist implementation of score-based diffusion model

sdeflow-light This is a minimalist codebase for training score-based diffusion models (supporting MNIST and CIFAR-10) used in the following paper "A V

Chin-Wei Huang 89 Dec 20, 2022
Source code for TACL paper "KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation".

KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation Source code for TACL 2021 paper KEPLER: A Unified Model for Kn

THU-KEG 138 Dec 22, 2022
Self-supervised Product Quantization for Deep Unsupervised Image Retrieval - ICCV2021

Self-supervised Product Quantization for Deep Unsupervised Image Retrieval Pytorch implementation of SPQ Accepted to ICCV 2021 - paper Young Kyun Jang

Young Kyun Jang 71 Dec 27, 2022
On Uncertainty, Tempering, and Data Augmentation in Bayesian Classification

Understanding Bayesian Classification This repository hosts the code to reproduce the results presented in the paper On Uncertainty, Tempering, and Da

Sanyam Kapoor 18 Nov 17, 2022
Code for Transformers Solve Limited Receptive Field for Monocular Depth Prediction

Official PyTorch code for Transformers Solve Limited Receptive Field for Monocular Depth Prediction. Guanglei Yang, Hao Tang, Mingli Ding, Nicu Sebe,

stanley 152 Dec 16, 2022
Augmented CLIP - Training simple models to predict CLIP image embeddings from text embeddings, and vice versa.

Train aug_clip against laion400m-embeddings found here: https://laion.ai/laion-400-open-dataset/ - note that this used the base ViT-B/32 CLIP model. S

Peter Baylies 55 Sep 13, 2022
MoveNetを用いたPythonでの姿勢推定のデモ

MoveNet-Python-Example MoveNetのPythonでの動作サンプルです。 ONNXに変換したモデルも同梱しています。変換自体を試したい方はMoveNet_tf2onnx.ipynbを使用ください。 2021/08/24時点でTensorFlow Hubで提供されている以下モデ

KazuhitoTakahashi 38 Dec 17, 2022
Implementation of the CVPR 2021 paper "Online Multiple Object Tracking with Cross-Task Synergy"

Online Multiple Object Tracking with Cross-Task Synergy This repository is the implementation of the CVPR 2021 paper "Online Multiple Object Tracking

54 Oct 15, 2022
RepVGG: Making VGG-style ConvNets Great Again

RepVGG: Making VGG-style ConvNets Great Again (PyTorch) This is a super simple ConvNet architecture that achieves over 80% top-1 accuracy on ImageNet

2.8k Jan 04, 2023
A PyTorch re-implementation of Neural Radiance Fields

nerf-pytorch A PyTorch re-implementation Project | Video | Paper NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis Ben Mildenhall

Krishna Murthy 709 Jan 09, 2023
PyTorch Implementation of CvT: Introducing Convolutions to Vision Transformers

CvT: Introducing Convolutions to Vision Transformers Pytorch implementation of CvT: Introducing Convolutions to Vision Transformers Usage: img = torch

Rishikesh (ऋषिकेश) 193 Jan 03, 2023
Transfer-Learn is an open-source and well-documented library for Transfer Learning.

Transfer-Learn is an open-source and well-documented library for Transfer Learning. It is based on pure PyTorch with high performance and friendly API. Our code is pythonic, and the design is consist

THUML @ Tsinghua University 2.2k Jan 03, 2023
FLAVR is a fast, flow-free frame interpolation method capable of single shot multi-frame prediction

FLAVR is a fast, flow-free frame interpolation method capable of single shot multi-frame prediction. It uses a customized encoder decoder architecture with spatio-temporal convolutions and channel ga

Tarun K 280 Dec 23, 2022
BESS: Balanced Evolutionary Semi-Stacking for Disease Detection via Partially Labeled Imbalanced Tongue Data

Balanced-Evolutionary-Semi-Stacking Code for the paper ''BESS: Balanced Evolutionary Semi-Stacking for Disease Detection via Partially Labeled Imbalan

0 Jan 16, 2022