A simple machine learning package to cluster keywords in higher-level groups.

Overview

Simple Keyword Clusterer

A simple machine learning package to cluster keywords in higher-level groups.

Example:
"Senior Frontend Engineer" --> "Frontend Engineer"
"Junior Backend developer" --> "Backend developer"


Installation

pip install simple_keyword_clusterer

Usage

# import the package
from simple_keyword_clusterer import Clusterer

# read your keywords in list
with open("../my_keywords.txt", "r") as f:
    data = f.read().splitlines()

# instantiate object
clusterer = Clusterer()

# apply clustering
df = clusterer.extract(data)

print(df)

clustering_example

Performance

The algorithm will find the optimal number of clusters automatically based on the best Silhouette Score.

You can specify the number of clusters yourself too

# instantiate object
clusterer = Clusterer(n_clusters=4)

# apply clustering
df = clusterer.extract(data)

For best performance, try to reduce the variance of data by providing the same semantic context
(the job title keywords file should remain coherent, in that it shouldn't contain other stuff like gardening keywords).

If items are clearly separable, the algorithm should still be able to provide a useable output.

Customization

You can customize the clustering mechanism through the files

  • blacklist.txt
  • to_normalize.txt

If you notice that the clustering identifies unwanted groups, you can blacklist certain words simply by appending them in the blacklist.txt file.

The to_normalize.txt file contains tuples that identify a transformation to apply to the keyword. For instance

("back end", "backend), ("front end", "frontend), ("sr", "Senior"), ("jr", "junior")

Simply add your tuples to use this functionality.

Dependencies

  • Scikit-learn
  • Pandas
  • Matplotlib
  • Seaborn
  • Numpy
  • NLTK
  • Tqdm

Make sure to download NLTK English stopwords and punctuation with the command

nltk.download("stopwords")
nltk.download('punkt')

Contact

If you feel like contacting me, do so and send me a mail. You can find my contact information on my website.

Owner
Andrea D'Agostino
Andrea D'Agostino
Sequence learning toolkit for Python

seqlearn seqlearn is a sequence classification toolkit for Python. It is designed to extend scikit-learn and offer as similar as possible an API. Comp

Lars 653 Dec 27, 2022
A visual dataflow programming language for sklearn

Persimmon What is it? Persimmon is a visual dataflow language for creating sklearn pipelines. It represents functions as blocks, inputs and outputs ar

Álvaro Bermejo 194 Jan 04, 2023
Reggy - Regressions with arbitrarily complex regularization terms

reggy Regressions with arbitrarily complex regularization terms. Currently suppo

Kim 1 Jan 20, 2022
Bayesian Additive Regression Trees For Python

BartPy Introduction BartPy is a pure python implementation of the Bayesian additive regressions trees model of Chipman et al [1]. Reasons to use BART

187 Dec 16, 2022
Backtesting an algorithmic trading strategy using Machine Learning and Sentiment Analysis.

Trading Tesla with Machine Learning and Sentiment Analysis An interactive program to train a Random Forest Classifier to predict Tesla daily prices us

Renato Votto 31 Nov 17, 2022
Simple data balancing baselines for worst-group-accuracy benchmarks.

BalancingGroups Code to replicate the experimental results from Simple data balancing baselines achieve competitive worst-group-accuracy. Replicating

Facebook Research 29 Dec 02, 2022
Model search (MS) is a framework that implements AutoML algorithms for model architecture search at scale.

Model Search Model search (MS) is a framework that implements AutoML algorithms for model architecture search at scale. It aims to help researchers sp

AriesTriputranto 1 Dec 13, 2021
ParaMonte is a serial/parallel library of Monte Carlo routines for sampling mathematical objective functions of arbitrary-dimensions

ParaMonte is a serial/parallel library of Monte Carlo routines for sampling mathematical objective functions of arbitrary-dimensions, in particular, the posterior distributions of Bayesian models in

Computational Data Science Lab 182 Dec 31, 2022
A comprehensive set of fairness metrics for datasets and machine learning models, explanations for these metrics, and algorithms to mitigate bias in datasets and models.

AI Fairness 360 (AIF360) The AI Fairness 360 toolkit is an extensible open-source library containg techniques developed by the research community to h

1.9k Jan 06, 2023
Built on python (Mathematical straight fit line coordinates error predictor machine learning foundational model)

Sum-Square_Error-Business-Analytical-Tool- Built on python (Mathematical straight fit line coordinates error predictor machine learning foundational m

om Podey 1 Dec 03, 2021
Forecasting prices using Facebook/Meta's Prophet model

CryptoForecasting using Machine and Deep learning (Part 1) CryptoForecasting using Machine Learning The main aspect of predicting the stock-related da

1 Nov 27, 2021
MLFlow in a Dockercontainer based on Azurite and Postgres

mlflow-azurite-postgres docker This is a MLFLow image which works with a postgres DB and a local Azure Blob Storage Instance (Azurite). This image is

2 May 29, 2022
AutoTabular automates machine learning tasks enabling you to easily achieve strong predictive performance in your applications.

AutoTabular automates machine learning tasks enabling you to easily achieve strong predictive performance in your applications. With just a few lines of code, you can train and deploy high-accuracy m

Robin 55 Dec 27, 2022
Price Prediction model is used to develop an LSTM model to predict the future market price of Bitcoin and Ethereum.

Price Prediction model is used to develop an LSTM model to predict the future market price of Bitcoin and Ethereum.

2 Jun 14, 2022
BASTA: The BAyesian STellar Algorithm

BASTA: BAyesian STellar Algorithm Current stable version: v1.0 Important note: BASTA is developed for Python 3.8, but Python 3.7 should work as well.

BASTA team 16 Nov 15, 2022
pure-predict: Machine learning prediction in pure Python

pure-predict speeds up and slims down machine learning prediction applications. It is a foundational tool for serverless inference or small batch prediction with popular machine learning frameworks l

Ibotta 84 Dec 29, 2022
Primitives for machine learning and data science.

An Open Source Project from the Data to AI Lab, at MIT MLPrimitives Pipelines and primitives for machine learning and data science. Documentation: htt

MLBazaar 65 Dec 29, 2022
Factorization machines in python

Factorization Machines in Python This is a python implementation of Factorization Machines [1]. This uses stochastic gradient descent with adaptive re

Corey Lynch 892 Jan 03, 2023
PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows.

An open-source, low-code machine learning library in Python 🚀 Version 2.3.5 out now! Check out the release notes here. Official • Docs • Install • Tu

PyCaret 6.7k Jan 08, 2023
Machine Learning for Time-Series with Python.Published by Packt

Machine-Learning-for-Time-Series-with-Python Become proficient in deriving insights from time-series data and analyzing a model’s performance Links Am

Packt 124 Dec 28, 2022