Feature-engine is a Python library with multiple transformers to engineer and select features for use in machine learning models.

Last update: Dec 27, 2022

Feature Engine

Feature-engine is a Python library with multiple transformers to engineer and select features for use in machine learning models. Feature-engine's transformers follow scikit-learn's functionality with fit() and transform() methods to first learn the transforming parameters from data and then transform the data.

Feature-engine's transformers currently include functionality for:

Missing Data Imputation
Categorical Variable Encoding
Outlier Capping or Removal
Discretisation
Numerical Variable Transformation
Variable Creation
Variable Selection
Scikit-learn Wrappers

Imputing Methods

MeanMedianImputer
RandomSampleImputer
EndTailImputer
AddMissingIndicator
CategoricalImputer
ArbitraryNumberImputer
DropMissingData

Encoding Methods

OneHotEncoder
OrdinalEncoder
CountFrequencyEncoder
MeanEncoder
WoEEncoder
PRatioEncoder
RareLabelEncoder
DecisionTreeEncoder

Outlier Handling Methods

Winsorizer
ArbitraryOutlierCapper
OutlierTrimmer

Discretisation Methods

EqualFrequencyDiscretiser
EqualWidthDiscretiser
DecisionTreeDiscretiser
ArbitraryDiscretiser

Variable Transformation Methods

LogTransformer
LogCpTransformer
ReciprocalTransformer
PowerTransformer
BoxCoxTransformer
YeoJohnsonTransformer

Scikit-learn Wrapper:

SklearnTransformerWrapper

Variable Creation:

MathematicalCombination
CombineWithReferenceFeature
CyclicalTransformer

Feature Selection:

DropFeatures
DropConstantFeatures
DropDuplicateFeatures
DropCorrelatedFeatures
SmartCorrelationSelection
ShuffleFeaturesSelector
SelectBySingleFeaturePerformance
SelectByTargetMeanPerformance
RecursiveFeatureElimination
RecursiveFeatureAddition

Installing

From PyPI using pip:

pip install feature_engine

From Anaconda:

conda install -c conda-forge feature_engine

Or simply clone it:

git clone https://github.com/feature-engine/feature_engine.git

Usage

>>> import pandas as pd
>>> from feature_engine.encoding import RareLabelEncoder

>>> data = {'var_A': ['A'] * 10 + ['B'] * 10 + ['C'] * 2 + ['D'] * 1}
>>> data = pd.DataFrame(data)
>>> data['var_A'].value_counts()

Out[1]:
A    10
B    10
C     2
D     1
Name: var_A, dtype: int64

>>> rare_encoder = RareLabelEncoder(tol=0.10, n_categories=3)
>>> data_encoded = rare_encoder.fit_transform(data)
>>> data_encoded['var_A'].value_counts()

Out[2]:
A       10
B       10
Rare     3
Name: var_A, dtype: int64
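To see why C and D end up grouped as Rare, the frequency rule can be reproduced by hand with plain pandas. This is a sketch of the idea only, not the library's implementation, and it assumes a category is kept when its relative frequency is at least `tol`.

```python
import pandas as pd

values = pd.Series(["A"] * 10 + ["B"] * 10 + ["C"] * 2 + ["D"] * 1, name="var_A")
tol = 0.10

# Relative frequency of each category: A and B are 10/23, C is 2/23, D is 1/23.
freq = values.value_counts(normalize=True)

# Categories at or above the threshold are kept; the rest are relabelled.
frequent = freq[freq >= tol].index
grouped = values.where(values.isin(frequent), "Rare")
counts = grouped.value_counts().to_dict()
```

C (about 8.7%) and D (about 4.3%) fall below the 10% threshold, which gives the three Rare observations in the output above.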

See more usage examples in the Jupyter Notebooks in the example folder of this repository, or in the documentation.

Contributing

Details on how to contribute can be found in the Contributing Page.

In short:

Local Setup Steps

Fork the repo
Clone your fork to your local computer: git clone https://github.com/<your-username>/feature_engine.git
cd into the repo: cd feature_engine
Install as a developer: pip install -e .
Create and activate a virtual environment with any tool of choice
Install the dependencies as explained in the Contributing Page
Create a feature branch with a meaningful name for your feature: git checkout -b myfeaturebranch
Develop your feature, tests and documentation
Make sure the tests pass
Make a PR

Thank you!!

Opening Pull Requests

PRs are welcome! Please make sure the CI tests pass on your branch.

Tests

We prefer tox. In your environment:

Run pip install tox
cd into the root directory of the repo: cd feature_engine
Run tox

If the tests pass, the code is functional.

You can also run the tests in your environment (without tox). For guidelines on how to do so, check the Contributing Page.

Documentation

Feature-engine documentation is built using Sphinx and is hosted on Read the Docs.

To build the documentation, make sure you have the dependencies installed. From the root directory, run: pip install -r docs/requirements.txt

You can then build the docs: sphinx-build -b html docs build

License

BSD 3-Clause

References

Many of the engineering and encoding functionalities are inspired by this series of articles from the 2009 KDD Competition.

Owner

Soledad Galli
