A fast, flexible, and performant feature selection package for python.

Overview

linselect

A fast, flexible, and performant feature selection package for python.

Package in a nutshell

It's built on stepwise linear regression

When passed data, the underlying algorithm seeks minimal variable subsets that produce good linear fits to the targets. This approach to feature selection strikes a competitive balance between performance, speed, and memory efficiency.

It has a simple API

A simple API makes it easy to quickly rank a data set's features in terms of their added value to a given fit. This is demoed below, where we learn that we can drop column 1 of X and still obtain a fit to y that captures 97.37% of its variance.

from linselect import FwdSelect
import numpy as np

X = np.array([[1,2,4], [1,1,2], [3,2,1], [10,2,2]])
y = np.array([[1], [-1], [-1], [1]])

selector = FwdSelect()
selector.fit(X, y)

print selector.ordered_features
print selector.ordered_cods
# [2, 0, 1]
# [0.47368422, 0.97368419, 1.0]

X_compressed = X[:, selector.ordered_features[:2]]

It's fast

A full sweep on a 1000 feature count data set runs in 10s on my laptop -- about one million times faster (seriously) than standard stepwise algorithms, which are effectively too slow to run at this scale. A 100 count feature set runs in 0.07s.

from linselect import FwdSelect
import numpy as np
import time

X = np.random.randn(5000, 1000)
y = np.random.randn(5000, 1)

selector = FwdSelect()

t1 = time.time()
selector.fit(X, y)
t2 = time.time()
print t2 - t1
# 9.87492

Its scores reveal your effective feature count

By plotting fitted CODs against ranked feature count, one often learns that seemingly high-dimensional problems can actually be understood using only a minority of the available features. The plot below demonstrates this: A fit to one year of AAPL's stock fluctuations -- using just 3 selected stocks as predictors -- nearly matches the performance of a 49-feature fit. The 3-feature fit arguably provides more insight and is certainly easier to reason about (cf. tutorials for details).

apple stock plot

It's flexible

linselect exposes multiple applications of the underlying algorithm. These allow for:

  • Forward, reverse, and general forward-reverse stepwise regression strategies.
  • Supervised applications aimed at a single target variable or simultaneous prediction of multiple target variables.
  • Unsupervised applications. The algorithm can be applied to identify minimal, representative subsets of an available column set. This provides a feature selection analog of PCA -- importantly, one that retains interpretability.

Under the hood

Feature selection algorithms are used to seek minimal column / feature subsets that capture the majority of the useful information contained within a data set. Removal of a selected subset's complement -- the relatively uninformative or redundant features -- can often result in a significant data compression and improved interpretability.

Stepwise selection algorithms work by iteratively updating a model feature set, one at a time [1]. For example, in a given step of a forward process, one considers all of the features that have not yet been added to the model, and then identifies that which would improve the model the most. This is added, and the process is then repeated until all features have been selected. The features that are added first in this way tend to be those that are predictive and also not redundant with those already included in the predictor set. Retaining only these first selected features therefore provides a convenient method for identifying minimal, informative feature subsets.

In general, identifying the optimal feature to add to a model in a given step requires building and scoring each possible updated model variant. This results in a slow process: If there are n features, O(n^2) models must be built to carry out a full ranking. However, the process can be dramatically sped up in the case of linear regression -- thanks to some linear algebra identities that allow one to efficiently update these models as features are either added or removed from their predictor sets [2,3]. Using these update rules, a full feature ranking can be carried out in roughly the same amount of time that is needed to fit only a single model. For n=1000, this means we get an O(n^2) = O(10^6) speed up! linselect makes use of these update rules -- first identified in [2] -- allowing for fast feature selection sweeps.

[1] Introduction to Statistical Learning by G. James, et al -- cf. chapter 6.

[2] M. Efroymson. Multiple regression analysis. Mathematical methods for digital computers, 1:191–203, 1960.

[3] J. Landy. Stepwise regression for unsupervised learning, 2017. arxiv.1706.03265.

Classes, documentation, tests, license

linselect contains three classes: FwdSelect, RevSelect, and GenSelect. As the names imply, these support efficient forward, reverse, and general forward-reverse search protocols, respectively. Each can be used for both supervised and unsupervised analyses.

Docstrings and basic call examples are illustrated for each class in the ./docs folder.

An FAQ and a running list of tutorials are available at efavdb.com/linselect.

Tests: From the root directory,

python setup.py test

This project is licensed under the terms of the MIT license.

Installation

The package can be installed using pip, from pypi

pip install linselect

or from github

pip install git+git://github.com/efavdb/linselect.git

Author

Jonathan Landy - EFavDB

Acknowledgments: Special thanks to P. Callier, P. Spanoudes, and R. Zhou for providing helpful feedback.

Reading streams of Twitter data, save them to Kafka, then process with Kafka Stream API and Spark Streaming

Using Streaming Twitter Data with Kafka and Spark Reading streams of Twitter data, publishing them to Kafka topic, process message using Kafka Stream

Rustam Zokirov 1 Dec 06, 2021
A probabilistic programming library for Bayesian deep learning, generative models, based on Tensorflow

ZhuSuan is a Python probabilistic programming library for Bayesian deep learning, which conjoins the complimentary advantages of Bayesian methods and

Tsinghua Machine Learning Group 2.2k Dec 28, 2022
Investigating EV charging data

Investigating EV charging data Introduction: Got an opportunity to work with a home monitoring technology company over the last 6 months whose goal wa

Yash 2 Apr 07, 2022
Implementation in Python of the reliability measures such as Omega.

OmegaPy Summary Simple implementation in Python of the reliability measures: Omega Total, Omega Hierarchical and Omega Hierarchical Total. Name Link O

Rafael Valero Fernández 2 Apr 27, 2022
Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs (CIKM 2020)

Karate Club is an unsupervised machine learning extension library for NetworkX. Please look at the Documentation, relevant Paper, Promo Video, and Ext

Benedek Rozemberczki 1.8k Jan 09, 2023
Python implementation of Principal Component Analysis

Principal Component Analysis Principal Component Analysis (PCA) is a dimension-reduction algorithm. The idea is to use the singular value decompositio

Ignacio Darago 1 Nov 06, 2021
Find exposed data in Azure with this public blob scanner

BlobHunter A tool for scanning Azure blob storage accounts for publicly opened blobs. BlobHunter is a part of "Hunting Azure Blobs Exposes Millions of

CyberArk 250 Jan 03, 2023
Py-price-monitoring - A Python price monitor

A Python price monitor This project was focused on Brazil, so the monitoring is

Samuel 1 Jan 04, 2022
Desafio 1 ~ Bantotal

Challenge 01 | Bantotal Please read the instructions for the challenge by selecting your preferred language below: Español Português License Copyright

Maratona Behind the Code 44 Sep 28, 2022
An interactive grid for sorting, filtering, and editing DataFrames in Jupyter notebooks

qgrid Qgrid is a Jupyter notebook widget which uses SlickGrid to render pandas DataFrames within a Jupyter notebook. This allows you to explore your D

Quantopian, Inc. 2.9k Jan 08, 2023
Generate lookml for views from dbt models

dbt2looker Use dbt2looker to generate Looker view files automatically from dbt models. Features Column descriptions synced to looker Dimension for eac

lightdash 126 Dec 28, 2022
Methylation/modified base calling separated from basecalling.

Remora Methylation/modified base calling separated from basecalling. Remora primarily provides an API to call modified bases for basecaller programs s

Oxford Nanopore Technologies 72 Jan 05, 2023
MetPy is a collection of tools in Python for reading, visualizing and performing calculations with weather data.

MetPy MetPy is a collection of tools in Python for reading, visualizing and performing calculations with weather data. MetPy follows semantic versioni

Unidata 971 Dec 25, 2022
sportsdataverse python package

sportsdataverse-py See CHANGELOG.md for details. The goal of sportsdataverse-py is to provide the community with a python package for working with spo

Saiem Gilani 37 Dec 27, 2022
PATC: Introduction to Big Data Analytics. Practical Data Analytics for Solving Real World Problems

PATC: Introduction to Big Data Analytics. Practical Data Analytics for Solving Real World Problems

1 Feb 07, 2022
Retail-Sim is python package to easily create synthetic dataset of retaile store.

Retailer's Sale Data Simulation Retail-Sim is python package to easily create synthetic dataset of retaile store. Simulation Model Simulator consists

Corca AI 7 Sep 30, 2022
Python for Data Analysis, 2nd Edition

Python for Data Analysis, 2nd Edition Materials and IPython notebooks for "Python for Data Analysis" by Wes McKinney, published by O'Reilly Media Buy

Wes McKinney 18.6k Jan 08, 2023
Elementary is an open-source data reliability framework for modern data teams. The first module of the framework is data lineage.

Data lineage made simple, reliable, and automated. Effortlessly track the flow of data, understand dependencies and analyze impact. Features Visualiza

898 Jan 09, 2023
Senator Trades Monitor

Senator Trades Monitor This monitor will grab the most recent trades by senators and send them as a webhook to discord. Installation To use the monito

Yousaf Cheema 5 Jun 11, 2022
This cosmetics generator allows you to generate the new Fortnite cosmetics, Search pak and search cosmetics!

COSMETICS GENERATOR This cosmetics generator allows you to generate the new Fortnite cosmetics, Search pak and search cosmetics! Remember to put the l

ᴅᴊʟᴏʀ3xᴢᴏ 11 Dec 13, 2022