Parallel t-SNE implementation with Python and Torch wrappers.

Overview

Multicore t-SNE

This is a multicore modification of Barnes-Hut t-SNE by L. Van der Maaten with Python and Torch CFFI-based wrappers. This code also runs faster than sklearn.TSNE on 1 core.

What to expect

Barnes-Hut t-SNE is done in two steps.

  • First step: an efficient data structure for nearest-neighbour search is built and used to compute probabilities. This can be done in parallel for each point in the dataset, which is why we can expect a good speed-up from using more cores.

  • Second step: the embedding is optimized using gradient descent. This part is essentially sequential, so we can only parallelize within an iteration. Some parts can be parallelized effectively, but not all of them are parallelized for now. That is why the second-step speed-up is not as significant as the first-step speed-up, but there is still room for improvement.
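To illustrate why the first step parallelizes so well, here is a minimal pure-Python sketch of the usual t-SNE perplexity calibration. It is not the code from this repository: the function names `row_probabilities` and `all_row_probabilities` are hypothetical, and the neighbour distances are assumed to be already computed. Each point's probabilities depend only on that point's own neighbour distances, so rows can be handed to independent workers.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def row_probabilities(sq_dists, perplexity, tol=1e-5, max_iter=200):
    """Conditional probabilities for one point from squared neighbour distances.

    Binary-searches the Gaussian precision `beta` until the row's entropy
    equals log(perplexity). Assumes 1 < perplexity < len(sq_dists).
    """
    target = math.log(perplexity)
    d_min = min(sq_dists)  # shift distances for numerical stability
    beta, lo, hi = 1.0, 0.0, math.inf
    probs = []
    for _ in range(max_iter):
        weights = [math.exp(-beta * (d - d_min)) for d in sq_dists]
        total = sum(weights)
        probs = [w / total for w in weights]
        entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
        if abs(entropy - target) < tol:
            break
        if entropy > target:  # too flat: sharpen by raising beta
            lo = beta
            beta = beta * 2.0 if math.isinf(hi) else (beta + hi) / 2.0
        else:  # too peaked: flatten by lowering beta
            hi = beta
            beta = (beta + lo) / 2.0
    return probs


def all_row_probabilities(neighbour_sq_dists, perplexity, n_jobs=4):
    """Rows are independent, so they can be mapped over a worker pool."""
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        return list(pool.map(lambda row: row_probabilities(row, perplexity),
                             neighbour_sq_dists))


# Toy input: two points, five neighbours each (made-up distances).
rows = [[1.0, 2.0, 3.0, 4.0, 5.0], [0.5, 0.6, 2.0, 9.0, 10.0]]
P = all_row_probabilities(rows, perplexity=3.0, n_jobs=2)
```

The thread pool here only demonstrates that rows do not depend on each other; in CPython it will not speed up this pure-Python loop, whereas a compiled implementation can run the rows on separate cores.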

So when can you benefit from parallelization? The second step's computation time is almost independent of D and depends mostly on N. The first step's time depends heavily on D, so for small D time(Step 1) << time(Step 2), and for large D time(Step 1) >> time(Step 2). Since only step 1 parallelizes well, the benefit is greatest when D is large enough (MNIST's D = 784 is large; D = 10, even for N = 1000000, is not). I originally wrote the multicore modification for the Springleaf competition, where my data table was about 300000 x 3000 and only several days were left until the end of the competition, so any speed-up was handy.

Benchmark

1 core

Interestingly, this code beats other implementations. We compare to sklearn (Barnes-Hut, of course), L. Van der Maaten's bhtsne, and the py_bh_tsne repo (Cython wrapper for bhtsne with QuadTree), using perplexity = 30 and theta = 0.5 for every run. In fact, the py_bh_tsne repo runs at the same speed as this code when built with more compiler optimization flags.

This is a benchmark for 70000x784 MNIST data:

Method                  | Step 1 (sec) | Step 2 (sec)
MulticoreTSNE(n_jobs=1) | 912          | 350
bhtsne                  | 4257         | 1233
py_bh_tsne              | 1232         | 367
sklearn(0.18)           | ~5400        | ~20920

I did my best to find what is wrong with the sklearn numbers, but this is the best benchmark I could do (you can find the test script in the python/tests folder).

Multicore

This table shows the speed-up relative to 1 core when using n cores.

n_jobs | Step 1 | Step 2
1      | 1x     | 1x
2      | 1.54x  | 1.05x
4      | 2.6x   | 1.2x
8      | 5.6x   | 1.65x
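As a rough worked example, the per-step speed-ups can be combined with the single-core MNIST timings (912 s and 350 s) to project the total runtime. This assumes the speed-ups were measured on the same workload as those timings, which the tables do not state explicitly. The helper names are hypothetical.

```python
# Single-core timings (seconds) from the MNIST benchmark: step 1, step 2.
STEP1_SEC, STEP2_SEC = 912.0, 350.0

# Per-step speed-ups from the multicore table: n_jobs -> (step 1, step 2).
STEP_SPEEDUPS = {1: (1.0, 1.0), 2: (1.54, 1.05), 4: (2.6, 1.2), 8: (5.6, 1.65)}


def projected_total(n_jobs):
    """Projected wall time: each step's time divided by its own speed-up."""
    s1, s2 = STEP_SPEEDUPS[n_jobs]
    return STEP1_SEC / s1 + STEP2_SEC / s2


def overall_speedup(n_jobs):
    """End-to-end speed-up relative to the single-core total."""
    return projected_total(1) / projected_total(n_jobs)


for n in sorted(STEP_SPEEDUPS):
    print(f"n_jobs={n}: ~{projected_total(n):.0f} s total, "
          f"~{overall_speedup(n):.2f}x overall")
```

Under that assumption, 8 cores give roughly a 3.4x end-to-end speed-up on this dataset: once step 1 is accelerated, the weakly parallelized step 2 accounts for more than half of the remaining time.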

How to use

Python and Torch wrappers are available.

Python

Install

Directly from pypi

pip install MulticoreTSNE

From source

Make sure cmake is installed on your system, and you will also need a sensible C++ compiler, such as gcc or llvm-clang. On macOS, you can get both via homebrew.

To install the package, please do:

git clone https://github.com/DmitryUlyanov/Multicore-TSNE.git
cd Multicore-TSNE/
pip install .

Tested with both Python 2.7 and 3.6 (conda) on Ubuntu 14.04.

Run

You can use it as a near drop-in replacement for sklearn.manifold.TSNE.

from MulticoreTSNE import MulticoreTSNE as TSNE

tsne = TSNE(n_jobs=4)
Y = tsne.fit_transform(X)

Please refer to the sklearn TSNE documentation for an explanation of the parameters.

This implementation only supports n_components=2, which is the most common case (use Barnes-Hut t-SNE or sklearn otherwise). Also note that some parameters are there just for the sake of compatibility with sklearn and are otherwise ignored. See the MulticoreTSNE class docstring for more info.

MNIST example

from sklearn.datasets import load_digits
from MulticoreTSNE import MulticoreTSNE as TSNE
from matplotlib import pyplot as plt

# Small 8x8 digits dataset bundled with sklearn (1797 samples, 64 features)
digits = load_digits()
# Embed into 2D using 4 cores
embeddings = TSNE(n_jobs=4).fit_transform(digits.data)
vis_x = embeddings[:, 0]
vis_y = embeddings[:, 1]
# plt.get_cmap is used because plt.cm.get_cmap was removed in newer matplotlib releases
plt.scatter(vis_x, vis_y, c=digits.target, cmap=plt.get_cmap("jet", 10), marker='.')
plt.colorbar(ticks=range(10))
plt.clim(-0.5, 9.5)
plt.show()

Test

You can test it on the MNIST dataset with the following command:

python MulticoreTSNE/examples/test.py <n_jobs>

Note on jupyter use

To make the computation log visible in Jupyter, please install wurlitzer (pip install wurlitzer) and execute this line in any cell beforehand:

%load_ext wurlitzer

Memory leaks are possible if you interrupt the process. It should be OK if you let it run until the end.

Torch

To install, execute the following command from the repository folder:

luarocks make torch/tsne-1.0-0.rockspec

or

luarocks install https://raw.githubusercontent.com/DmitryUlyanov/Multicore-TSNE/master/torch/tsne-1.0-0.rockspec

You can run t-SNE like this:

tsne = require 'tsne'

Y = tsne(X, n_components, perplexity, n_iter, angle, n_jobs)

Only the torch.DoubleTensor type is supported for now.

License

Inherited from the original repo's license.

Future work

  • Allow other types than double
  • Improve step 2 performance (possible)

Citation

Please cite this repository if it was useful for your research:

@misc{Ulyanov2016,
  author = {Ulyanov, Dmitry},
  title = {Multicore-TSNE},
  year = {2016},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/DmitryUlyanov/Multicore-TSNE}},
}

Of course, do not forget to cite L. Van der Maaten's paper.

Owner
Dmitry Ulyanov
Co-Founder at in3D, Phd @ Skoltech