Buckshot++ is a new algorithm that finds highly stable clusters efficiently.

Overview

Buckshot++: An Outlier-Resistant and Scalable Clustering Algorithm. (Inspired by the Buckshot Algorithm.)

Here, we introduce a new algorithm, which we name Buckshot++. Buckshot++ improves upon the k-means by dealing with the main shortcoming thereof, namely, the need to predetermine the number of clusters, K. Typically, K is found in the following manner:

  1. settle on some metric,
  2. evaluate that metric at multiple values of K,
  3. use a greedy stopping rule to determine when to stop (typically the bend in an elbow curve).

There must be a better way. We detail the following 3 improvements that the Buckshot++ algorithm makes to k-means.

  1. Not all metrics are create equal. And since K-means doesn't prescribe which metric to use for finding K, we analyzed that some of the commonly implemented metrics are too inconsistent from one iteration to the next. Buckshot++ prescribes the silhouette score for finding K.
  2. In k-means, every single point is clustered -- even the noise and outliers. But what we really care about is the pattern and not the noise. We show here an elegant way to overcome this problem -- even simpler than k-medoids or k-medians.
  3. Finally, the computational complexity of running k-means multiple times on the whole dataset to find the best K can be prohibitive. We show below a surprisingly simple alternative with better asymptotics.

Details of the Buckshot++ algorithm

ALGORITHM: Buckshot++
INPUTS: population of N vectors
B := number of bootstrap samples
F := max number of clusters to try
M := cluster quality metric
OUTPUT: the optimal K for kmeans

Take B bootstrap samples where each sample is of size 1/B.
for each counter k from 2 to F do
  Compute kmeans with k centers.
  Compute the metric M on the clusters.
Compute the centroid of all metrics vectors.
Get argmax of the centroid vector.

Explanation of Buckshot++

The Buckshot++ algorithm was motivated by the Buckshot algorithm, which essentially finds cluster centers by performing hierarchical clustering on a sample and then performing k-means by taking those cluster centers as inputs. Hierarchical has relatively high time complexity, which is why Buckshot performs hierarchical only on a sample. The key difference between hierarchical and kmeans is that the former is more deterministic/stable but less scalable than the latter, as the next table elucidates.

%matplotlib inline
import pandas as pd
pd.set_option('display.max_rows', 500)
tbl = pd.DataFrame({'k-means': ['O(N * k * d * i)', 'random initial means; local minimum; outlier'],
                    'hierarchical': ['O(N^2 * logN)', 'outlier']}
                   , index=['Computational Complexity', 'Sources of Instability'])
tbl
k-means hierarchical
Computational Complexity O(N * k * d * i) O(N^2 * logN)
Sources of Instability random initial means; local minimum; outlier outlier

Hierarchical's higher time complexity means that, for large inputs, running k-means multiple times is still faster than running hierarchical just once. The Buckshot algorithm runs hierarchical just once on a small sample in order to initialize cluster centers for k-means. Since O(N^2 * logN) grows really fast, the sample must be really small to make it work computationally. But a key critique of Buckshot is failure to find the right structure with a small sample.

Buckshot++'s key innovation lies in the step "Take B bootstrap samples where each sample is of size 1/B." While Buckshot is doing hierarchical on a sample, Buckshot++ is doing multiple kmeans on bootstrap samples. Doing kmeans many times can still finish sooner than doing hierarchical just once, as the time complexities above show. An added bonus is that bootstrapping is a great way to smooth out noise and improve stability. In fact, that is exactly why Bagging (a.k.a. Bootstrap Aggregating) and Random Forests work so well.

Python implementation of Buckshot++

The core algorithm implementation is in the buckshotpp module. We use it below to cluster a news headlines dataset.

from buckshotpp import Clusterings, plot_mult_samples
from numpy.random import choice
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score
import nltk; nltk.download('punkt', quiet=True)
import matplotlib.pyplot as plt; plt.rcParams['figure.dpi'] = 120
import warnings; warnings.filterwarnings('ignore')

vecSpaceMod = Clusterings({'file_loc': 'data/news_headlines.csv',
                           'tf_dampen': True,
                           'common_word_pct': 1,
                           'rare_word_pct': 1,
                           'dim_redu': False}
                         )  # Instantiate a Clusterings object using parameters.
news_df = vecSpaceMod.get_file() # Read news_headlines.csv into a df.
metrics_byK = vecSpaceMod.buckshot(news_df)
plot_mult_samples(metrics_byK, 'silhouette')

png

An insight from this chart

Each green curve is generated from a bootstrap sample, and the red curve is their average. Remember the sources of instability for k-means listed in the table above? Outlier is one. The concept of outlier has somewhat different meaning in the context of clustering. In supervised learning, an outlier is a rare observation that's far from other observations distance-wise. In clustering, a far away observation is its own well-separated cluster. Here, our interpretation is that "rare" is the operative word here and that outliers are singleton clusters that exert undue influence on the formation of other clusters. Look at how bagging led to a more stable estimate of the optimal number of clusters in the graph above.

Not all metrics are create equal

The two internal clustering metrics implemented in scikit-learn are: the Silhouette Coefficient and the Calinski-Harabasz criterion. Comparing the Silhouette plotted above with the Calinski plotted below, it's clear that Calinski is far more extreme, perhaps implausibly extreme.

plot_mult_samples(metrics_byK, 'calinski')

png

Internal or External Clustering Metrics?

This data contains a field named "STORY" that indicates which story a headline belongs to. With this field as the ground truth, we compute Mutual Information (the most common external metric) using the code below. Mutual Information's possible range is 0-1. Using the K resulting from Buckshot++, we obtained a Mutual Information of about 0.6, an indicator that the model performance is reasonable.

X = vecSpaceMod.term_weight_matr(news_df.TITLE)
kmeans_fit = KMeans(20).fit(X)  # the argument comes from inflectin point of silhouette plot
mutual_info = adjusted_mutual_info_score(labels_true=news_df.STORY, labels_pred=kmeans_fit.labels_) 
mutual_info
0.6435601965984835

Practically, does Buckshot++ produce well-separated clusters?

Taking a look at the documents and their corresponding "predictedCluster", the results certainly do seem reasonable.

cluster_results = pd.DataFrame({'predictedCluster': kmeans_fit.labels_,
                                'document': news_df.TITLE})
cluster_results.sort_values(by='predictedCluster', inplace=True)

cluster_results
predictedCluster document
25 0 SAC Capital Starts Anew as Point72
50 0 Zebra Technologies to Acquire Enterprise Busin...
23 0 Fine Tuning: Good Wife just gets better
21 0 Boulder's Wealth May Be A Factor For Lowest Ob...
6 0 Power restored to nuclear plant in Waterford, ...
73 0 Electricity out as Millstone shifts to diesel
59 1 Twitter's head of media Chloe Sladden steps do...
28 1 Twitter's revolving door: media head Chloe Sla...
12 1 Twitter Exec Exodus Continues with Media Chief...
67 2 Sony Xperia C3 arrives with 5MP selfie camera,...
30 2 Leaked: Images Of Sony's Xperia C3 'Selfie Phone'
45 2 Sony Xperia Z2 Encased In A Block Of Ice, Cont...
90 2 Sony Xperia Z4 Concept Emerges as Fan Imagines...
78 2 If you hate the word 'selfie' look away now, t...
71 3 Twitter Executive Quits Amid Stalling Growth
47 3 Twitter COO quits, signalling management shake-up
52 3 Twitter Loses a Powerful Executive
31 3 Second Twitter executive quits hours after Row...
20 3 Twitter COO resigns as growth lags
61 3 Twitter COO Rowghani resigns amid lacklustre g...
57 4 'Goodbye Twitter' COO Ali Rowghani, says bye t...
69 4 Twitter chief operating officer resigns as use...
66 4 UPDATE 3-Twitter chief operating officer resig...
86 4 Twitter chief operating officer Ali Rowghani h...
76 4 Ali Rowghani, Twitter's COO, resigns after mon...
49 4 Twitter COO Ali Rowghani Just Announced Via Tw...
13 4 Twitter COO Ali Rowghani Exits
35 4 Second Twitter exec resigns with goodbye tweet...
39 5 Why almost everything you've been told about u...
77 5 Why Fargo Works So Well as a TV Show
0 6 'Mad Men' Preview: Buckle Up For 7 'Dense' Epi...
4 6 'Mad Men' end in sight for Weiner
36 6 Weiner reflects on the beginning of the end of...
42 7 Giant mystery crater in Siberia has scientists...
85 7 Mysterious giant crater in the earth discovere...
60 7 Massive Crater Discovered in Siberia
92 7 Massive mystery crater at 'end of the world'
16 7 Mysterious crater in Siberia spawns wild Inter...
43 8 Inflation rise stalls wage hopes in the UK
82 8 The Least Obese City in the Country
19 8 Real wages could resume fall as "Easter effect...
55 8 UK Inflation Rise To 1.8% Delays Real Wage Ris...
26 8 Virginia's Governor Challenges Abortion Clinic...
51 8 BREAKING NEWS: Transport costs lead to hike in...
8 8 Cable prices climb 4 times faster than inflati...
79 9 Despite Safety Issues, GM's Sales Still Increa...
17 9 Chrysler Group LLC reports June 2014 US sales ...
40 9 GM June Sales Up 9 Percent, Best June Since 2007
87 9 Ford sales fall, GM barely even; Jeep powers C...
18 10 Gov. McAuliffe Makes Health Announcements
48 10 Microsoft wants Windows XP dead and has announ...
74 10 McAuliffe puts focus on women's health
7 11 Sony makes duckfacing official with Xperia C3,...
54 11 Sony to announce 'Selfie' phone on July 8th wi...
27 11 Sony prepares to launch a smartphone that has ...
91 11 Sony Xperia C3 launches as "world's best selfi...
88 11 Sony unveils Xperia C3 smartphone with LED fla...
11 11 Sony Xperia C3 Boasts 5MP "PROselfie" Front-fa...
44 12 UK CPI rises to 1.8% in April, core CPI hits 2%
75 12 Rising CO2 Levels Will Lower Nutritional Value...
1 12 Here's How Climate Change Will Make Food Less ...
81 12 Rising CO2 levels also make our food less nutr...
80 13 Nutrition in Crops Are Cut down Drastically by...
2 13 Rising carbon dioxide levels reduce nutrients ...
68 13 With carbon dioxide levels up, nutrients in cr...
64 14 Inflation back up: Modest rise to 1.8% in Apri...
83 14 US plants prepare for long-term nuclear waste ...
22 14 Nuclear Plant Operators Deal With Radioactive ...
32 14 US plants prepare long-term nuclear waste stor...
84 15 'Mad Men' takes off on its final flight
3 15 'Mad Men' mixology
5 15 'Mad Men': 7 things to know for Season 7
9 15 Mad Men - the (Blaxploitation) Movie
37 15 TV Review: Mad Men Season 7
46 15 'Mad Men': Season 7 Premiere Guide (Video)
70 15 10 Things You Never Knew About 'Mad Men'!
53 15 'Mad Men' Season 7 Spoilers: Everything We Kno...
72 15 Rich Sommer from AMC's 'Mad Men' Season Premiere
63 16 Fargo (FX) Season Finale 2014 �Morton's Fork�
56 16 Before 'Fargo's' season finale, a sequel (or p...
65 16 'Fargo' Season 1 Spoilers: Episode 10 Synopsis...
62 17 Google Glass headsets get new designs in colla...
41 17 Google's first fashionable Glass frames are de...
89 17 Google Glass Still Trying To Look Cool
34 17 Net-a-Porter Embraces Google Glass
15 18 Routine pelvic exams not recommended under new...
14 18 Doctors group nixes routine pelvic exams
38 18 Metro Detroit doctors wary of recommendation a...
10 18 Doctors against having frequent pelvic exams
58 19 Technology stocks falling for 2nd day in a row
24 19 UPDATE 5-JPMorgan profit weaker than expected ...
29 19 JPMorgan profit weaker than expected
33 19 Marks and Spencer's profits fall for third year

Summary of the key advantages of Buckshot++

  • Accurate method of estimating the number of clusters (a clearly best Silhouette emerged every time, while typical elbow heuristic searches can hit or miss).
  • Scalable (faster search for K achieved by using k-means rather than hierarchical; running k-means on subsample rather than everything).
  • Noise resistant when used in conjunction with k-means++ (sampling with replacement lessens the chance of selecting an outlier in the bootstrap sample).
Owner
John Jung
Senior Machine Learning Engineer
John Jung
A simple E-commerce shop made with Django and Bulma

Interiorshop A Simple E-Commerce app made with Django Instructions Make sure you have python installed Step 1. Open a terminal Step 2. Paste the given

Aditya Priyadarshi 3 Sep 03, 2022
A Django application that provides country choices for use with forms, flag icons static files, and a country field for models.

Django Countries A Django application that provides country choices for use with forms, flag icons static files, and a country field for models. Insta

Chris Beaven 1.2k Jan 07, 2023
Auth module for Django and GarpixCMS

Garpix Auth Auth module for Django/DRF projects. Part of GarpixCMS. Used packages: django rest framework social-auth-app-django django-rest-framework-

GARPIX CMS 18 Mar 14, 2022
Simple Login Logout System using Django, JavaScript and ajax.

Djanog-UserAuthenticationSystem Technology Use #version Python 3.9.5 Django 3.2.7 JavaScript --- Ajax Validation --- Login and Logout Functionality, A

Bhaskar Mahor 3 Mar 26, 2022
Show how the redis works with Python (Django).

Redis Leaderboard Python (Django) Show how the redis works with Python (Django). Try it out deploying on Heroku (See notes: How to run on Google Cloud

Tom Xu 4 Nov 16, 2021
A django model and form field for normalised phone numbers using python-phonenumbers

django-phonenumber-field A Django library which interfaces with python-phonenumbers to validate, pretty print and convert phone numbers. python-phonen

Stefan Foulis 1.3k Dec 31, 2022
An extremely fast JavaScript and CSS bundler and minifier

Website | Getting started | Documentation | Plugins | FAQ Why? Our current build tools for the web are 10-100x slower than they could be: The main goa

Evan Wallace 34.2k Jan 04, 2023
Coltrane - A simple content site framework that harnesses the power of Django without the hassle.

coltrane A simple content site framework that harnesses the power of Django without the hassle. Features Can be a standalone static site or added to I

Adam Hill 58 Jan 02, 2023
Duckiter will Automatically dockerize your Django projects.

Duckiter Duckiter will Automatically dockerize your Django projects. Requirements : - python version : python version 3.6 or upper version - OS :

soroush safari 23 Sep 16, 2021
Full control of form rendering in the templates.

django-floppyforms Full control of form rendering in the templates. Authors: Gregor Müllegger and many many contributors Original creator: Bruno Renié

Jazzband 811 Dec 01, 2022
Stream Framework is a Python library, which allows you to build news feed, activity streams and notification systems using Cassandra and/or Redis. The authors of Stream-Framework also provide a cloud service for feed technology:

Stream Framework Activity Streams & Newsfeeds Stream Framework is a Python library which allows you to build activity streams & newsfeeds using Cassan

Thierry Schellenbach 4.7k Jan 02, 2023
☄️ Google Forms autofill script

lazrr 'Destroy Them With Lazers' - Knife Party, 2011 Google Forms autofill script Installation: pip3 install -r requirements.txt Usage: python3 lazrr.

Serezha Rakhmanov 12 Jun 04, 2022
Django model mixins and utilities.

django-model-utils Django model mixins and utilities. django-model-utils supports Django 2.2+. This app is available on PyPI. Getting Help Documentati

Jazzband 2.4k Jan 04, 2023
A drop-in replacement for django's ImageField that provides a flexible, intuitive and easily-extensible interface for quickly creating new images from the one assigned to the field.

django-versatileimagefield A drop-in replacement for django's ImageField that provides a flexible, intuitive and easily-extensible interface for creat

Jonathan Ellenberger 490 Dec 13, 2022
Tools to easily create permissioned CRUD endpoints in graphene-django.

graphene-django-plus Tools to easily create permissioned CRUD endpoints in graphene-django. Install pip install graphene-django-plus To make use of ev

Zerosoft 74 Aug 09, 2022
Store model history and view/revert changes from admin site.

django-simple-history django-simple-history stores Django model state on every create/update/delete. This app supports the following combinations of D

Jazzband 1.8k Jan 06, 2023
It takes time to start a Django Project and make it almost production-ready.

It takes time to start a Django Project and make it almost production-ready. A developer needs to spend a lot of time installing required libraries, setup a database, setup cache as well as hiding se

Khan Asfi Reza 1 Jan 01, 2022
A Django Demo Project of Students Management System

Django_StudentMS A Django Demo Project of Students Management System. From NWPU Seddon for DB Class Pre. Seddon simplify the code in 2021/10/17. Hope

2 Dec 08, 2021
🗂️ 🔍 Geospatial Data Management and Search API - Django Apps

Geospatial Data API in Django Resonant GeoData (RGD) is a series of Django applications well suited for cataloging and searching annotated geospatial

Resonant GeoData 53 Nov 01, 2022
Django/Jinja template indenter

DjHTML A pure-Python Django/Jinja template indenter without dependencies. DjHTML is a fully automatic template indenter that works with mixed HTML/CSS

Return to the Source 378 Jan 01, 2023