A library of sklearn compatible categorical variable encoders

Last update: Jan 07, 2023

Overview

Categorical Encoding Methods

A set of scikit-learn-style transformers for encoding categorical variables into numeric features using a range of techniques.

Important Links

Documentation: http://contrib.scikit-learn.org/category_encoders/

Encoding Methods

Unsupervised:

Backward Difference Contrast [2][3]
BaseN [6]
Binary [5]
Count [10]
Hashing [1]
Helmert Contrast [2][3]
Ordinal [2][3]
One-Hot [2][3]
Polynomial Contrast [2][3]
Sum Contrast [2][3]

Supervised:

CatBoost [11]
Generalized Linear Mixed Model [12]
James-Stein Estimator [9]
LeaveOneOut [4]
M-estimator [7]
Target Encoding [7]
Weight of Evidence [8]
Quantile Encoder [13]
Summary Encoder [13]

Installation

The package requires: numpy, statsmodels, and scipy.

To install the package, execute:

$ python setup.py install

or

pip install category_encoders

or

conda install -c conda-forge category_encoders

To install the development version, you may use:

pip install --upgrade git+https://github.com/scikit-learn-contrib/category_encoders

Usage

All of the encoders are fully compatible sklearn transformers, so they can be used in pipelines or in your existing scripts. Supported input formats include numpy arrays and pandas dataframes. If the cols parameter isn't passed, all columns with object or pandas categorical data type will be encoded. Please see the docs for transformer-specific configuration options.

Examples

There are two types of encoders: unsupervised and supervised. An unsupervised example:

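from category_encoders import BinaryEncoder
import pandas as pd
# note: load_boston was removed in scikit-learn 1.2, so this example needs an older release
from sklearn.datasets import load_boston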

# prepare some data
bunch = load_boston()
y = bunch.target
X = pd.DataFrame(bunch.data, columns=bunch.feature_names)

# use binary encoding to encode two categorical features
enc = BinaryEncoder(cols=['CHAS', 'RAD']).fit(X)

# transform the dataset
numeric_dataset = enc.transform(X)
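
To make the idea behind binary encoding concrete, here is a minimal pure-Python sketch. It is not the library's implementation: the function name binary_encode, the choice of ordinal codes starting at 1 in order of first appearance, and the bit order are all assumptions made for illustration.

```python
def binary_encode(values):
    """Map each category to an ordinal code, then spell that code in binary digits."""
    # Assumption of this sketch: codes start at 1, in order of first appearance.
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes) + 1)
    # Number of bits needed to represent the largest code.
    width = max(1, len(codes).bit_length())
    rows = []
    for v in values:
        code = codes[v]
        # Most significant bit first.
        rows.append([(code >> shift) & 1 for shift in range(width - 1, -1, -1)])
    return rows, codes


rows, codes = binary_encode(["a", "b", "c", "a", "d", "e"])
for value, row in zip(["a", "b", "c", "a", "d", "e"], rows):
    print(value, row)
```

Five distinct categories fit into three binary columns here, whereas a one-hot encoding would need five.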

And a supervised example:

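from category_encoders import TargetEncoder
import pandas as pd
# note: load_boston was removed in scikit-learn 1.2, so this example needs an older release
from sklearn.datasets import load_boston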

# prepare some data
bunch = load_boston()
y_train = bunch.target[0:250]
y_test = bunch.target[250:506]
X_train = pd.DataFrame(bunch.data[0:250], columns=bunch.feature_names)
X_test = pd.DataFrame(bunch.data[250:506], columns=bunch.feature_names)

# use target encoding to encode two categorical features
enc = TargetEncoder(cols=['CHAS', 'RAD'])

# transform the datasets
training_numeric_dataset = enc.fit_transform(X_train, y_train)
testing_numeric_dataset = enc.transform(X_test)

When transforming the training data with a supervised method, use fit_transform() rather than fit().transform(), because the two do not necessarily produce the same result. The difference can be observed with the LeaveOneOut encoder, which performs a nested cross-validation on the training data in fit_transform() (to decrease over-fitting of the downstream model) but uses all the training data for scoring in transform() (to get estimates that are as accurate as possible).
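
The following pure-Python sketch illustrates why the two code paths can differ. It is a simplified model, not the library's code: the helper names are hypothetical, and falling back to the global target mean for single-row or unseen categories is an assumption of the sketch.

```python
from collections import defaultdict


def fit_category_stats(categories, targets):
    """Collect the per-category target sum and row count."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    return dict(sums), dict(counts)


def encode_training_rows(categories, targets, sums, counts, prior):
    """Training path: each row is encoded without its own target value."""
    out = []
    for c, t in zip(categories, targets):
        if counts[c] > 1:
            out.append((sums[c] - t) / (counts[c] - 1))
        else:
            out.append(prior)  # assumption: single-row categories fall back to the prior
    return out


def encode_new_rows(categories, sums, counts, prior):
    """Scoring path: use the mean over all training rows of the category."""
    return [sums[c] / counts[c] if c in counts else prior for c in categories]


cats = ["a", "a", "a", "b", "b"]
y = [1.0, 0.0, 1.0, 0.0, 1.0]
sums, counts = fit_category_stats(cats, y)
prior = sum(y) / len(y)

train_encoded = encode_training_rows(cats, y, sums, counts, prior)
refit_on_train = encode_new_rows(cats, sums, counts, prior)
test_encoded = encode_new_rows(["a", "b", "c"], sums, counts, prior)
print(train_encoded)
print(refit_on_train)
```

The two printed lists differ for the same training rows, which is the reason the distinction between fit_transform() and fit().transform() matters for supervised encoders.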

Furthermore, you may benefit from the following wrappers:

PolynomialWrapper, which extends supervised encoders to support polynomial targets
NestedCVWrapper, which helps to prevent overfitting

Additional examples and benchmarks can be found in the examples directory.

Contributing

Category encoders is under active development. If you'd like to be involved, we'd love to have you. Check out the CONTRIBUTING.md file or open an issue on the GitHub project to get started.

References

[1] Kilian Weinberger; Anirban Dasgupta; John Langford; Alex Smola; Josh Attenberg (2009). Feature Hashing for Large Scale Multitask Learning. Proc. ICML.
[2] Contrast Coding Systems for categorical variables. UCLA: Statistical Consulting Group. From https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/.
[3] Gregory Carey (2003). Coding Categorical Variables. From http://psych.colorado.edu/~carey/Courses/PSYC5741/handouts/Coding%20Categorical%20Variables%202006-03-03.pdf
[4] Strategies to encode categorical variables with many categories. From https://www.kaggle.com/c/caterpillar-tube-pricing/discussion/15748#143154.
[5] Beyond One-Hot: an exploration of categorical variables. From http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/
[6] BaseN Encoding and Grid Search in categorical variables. From http://www.willmcginnis.com/2016/12/18/basen-encoding-grid-search-category_encoders/
[7] Daniele Micci-Barreca (2001). A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. SIGKDD Explor. Newsl. 3, 1. From http://dx.doi.org/10.1145/507533.507538
[8] Weight of Evidence (WOE) and Information Value Explained. From https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html
[9] Empirical Bayes for multiple sample sizes. From http://chris-said.io/2017/05/03/empirical-bayes-for-multiple-sample-sizes/
[10] Simple Count or Frequency Encoding. From https://www.datacamp.com/community/tutorials/encoding-methodologies
[11] Transforming categorical features to numerical features. From https://tech.yandex.com/catboost/doc/dg/concepts/algorithm-main-stages_cat-to-numberic-docpage/
[12] Andrew Gelman and Jennifer Hill (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. From https://faculty.psau.edu.sa/filedownload/doc-12-pdf-a1997d0d31f84d13c1cdc44ac39a8f2c-original.pdf
[13] Carlos Mougan, David Masip, Jordi Nin and Oriol Pujol (2021). Quantile Encoder: Tackling High Cardinality Categorical Features in Regression Problems. https://link.springer.com/chapter/10.1007%2F978-3-030-85529-1_14

Comments

Add Multi-Process Supported HashingEncoder

Add a multi-process HashingEncoder called NHashingEncoder. By using multiple processes, it is several times faster than HashingEncoder. On an i5-8259U, encoding 1,000,000 samples with HashingEncoder takes 720+ seconds, while NHashingEncoder with parameter "max_process=4" takes only 230+ seconds. On a 16x3.2GHz CPU + 64G memory Linux machine, encoding 20 million samples with HashingEncoder takes over 4 hours, while NHashingEncoder with parameter "max_process=8" takes only 20 minutes.
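
As a rough illustration of why hashing lends itself to this kind of speed-up, the sketch below encodes rows with the hashing trick and shows that encoding independent chunks gives the same result as encoding everything at once. It is not the NHashingEncoder code: the function names, the md5 hash, and n_components=8 are assumptions of the sketch, and the chunks are processed sequentially here rather than in separate processes.

```python
import hashlib


def hash_encode(rows, n_components=8):
    """Hashing trick: every value in a row increments one of n_components counters."""
    encoded = []
    for row in rows:
        counts = [0] * n_components
        for value in row:
            digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
            counts[int(digest, 16) % n_components] += 1
        encoded.append(counts)
    return encoded


def hash_encode_in_chunks(rows, chunk_size, n_components=8):
    """Encode independent chunks and concatenate the results in order."""
    out = []
    for start in range(0, len(rows), chunk_size):
        # Each chunk depends only on its own rows, so it could be sent to a worker process.
        out.extend(hash_encode(rows[start:start + chunk_size], n_components))
    return out


data = [["red", "small"], ["blue", "large"], ["red", "large"], ["green", "small"]]
whole = hash_encode(data)
chunked = hash_encode_in_chunks(data, chunk_size=2)
print(whole == chunked)
```

Because no state is shared between rows, the per-chunk calls could be distributed with, for example, multiprocessing.Pool.map.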

opened by liushulun 30
Ordinal encoder support new handle unknown handle missing
Here is the first pass at making Ordinal Encoder support the fields handle_unknown and handle_missing as described at https://github.com/scikit-learn-contrib/categorical-encoding/issues/92.

Let's go through the fields and their logic.

handle_unknown

value: unknown values go to -1 at transform time
error: throw ValueError if new categories are encountered at transform time
return_nan: at transform time, return nan

Ok, now handle_missing has configurations for each setting depending on if nan is present at fit time.

handle_missing

value: if NaN is present at fit time, NaN is treated as a category; if NaN is not present at fit time, transform returns -2
return_nan: fit adds a -2 mapping, and transform returns -2 with nan
error: throw an error at fit or transform
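
The rules above can be sketched in a few lines of plain Python. This is an illustration only, not the pull request's code: fit_ordinal and transform_ordinal are hypothetical names, None stands in for NaN, and return_nan is read here simply as "output NaN" without modelling the internal -2 bookkeeping.

```python
def fit_ordinal(values, handle_missing="value"):
    """Build a category -> integer mapping; None stands in for NaN in this sketch."""
    if handle_missing == "error" and None in values:
        raise ValueError("missing value found at fit time")
    mapping = {}
    for v in values:
        if v is None and handle_missing != "value":
            continue  # only 'value' treats a missing entry as its own category
        mapping.setdefault(v, len(mapping) + 1)
    return mapping


def transform_ordinal(values, mapping, handle_unknown="value", handle_missing="value"):
    out = []
    for v in values:
        if v in mapping:
            out.append(mapping[v])
        elif v is None:
            if handle_missing == "error":
                raise ValueError("missing value found at transform time")
            # 'value': missing was not seen at fit time, so it maps to -2
            out.append(-2 if handle_missing == "value" else float("nan"))
        else:
            if handle_unknown == "error":
                raise ValueError(f"unknown category: {v!r}")
            # 'value': unseen categories map to -1
            out.append(-1 if handle_unknown == "value" else float("nan"))
    return out


mapping = fit_ordinal(["a", "b", "a"])
encoded = transform_ordinal(["b", "z", None], mapping)
with_missing_category = transform_ordinal([None], fit_ordinal(["a", None]))
print(encoded, with_missing_category)
```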

Ok, for a total implementation every encoder will have to be changed. What do we want to do to avoid gigantic pull requests? Have a long-lived feature branch?

Ok thoughts,

I am going to implement cucumber tests for handle_unknown and handle_missing because trying to keep it all straight in my head is difficult.

I need to go through inverse transform and check it against every new setting.

My implementation for return_nan makes processing in the downstream encoders more difficult because we are mapping nan to -2.

The relationship between value and indicator for the multi-column encoders and the output of the ordinal encoder currently confuses me. I am going to sit down and write it all out so I know what should lead to what.

Check the changes to the test_ordinal_dist test in test_ordinal. Why was None not being treated as a category?

Tell me what you think and I can get started on the other encoders.
opened by JohnnyC08 23
Fix binary encoder for columntransformer

I discovered that when using the BinaryEncoder in a sklearn.ColumnTransformer, the passed params are lost.

This is because the encoder gets instantiated twice in a ColumnTransformer. Currently, params are not registered to self in BinaryEncoder.__init__(), so they are lost when the ColumnTransformer is put to work.
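
The mechanism can be sketched without scikit-learn. scikit-learn copies an estimator by reading its constructor parameters back from attributes of the same name and building a fresh instance; clone_like below is a simplified stand-in for that behaviour, and both encoder classes are hypothetical.

```python
import inspect


def clone_like(estimator):
    """Simplified stand-in for sklearn.base.clone: rebuild from __init__ parameter names."""
    names = [
        name
        for name in inspect.signature(type(estimator).__init__).parameters
        if name != "self"
    ]
    # Each constructor argument is read back from an attribute of the same name.
    # Simplification: a missing attribute is read as None.
    params = {name: getattr(estimator, name, None) for name in names}
    return type(estimator)(**params)


class LossyEncoder:
    def __init__(self, cols=None, drop_invariant=False):
        # Parameters are passed on to an inner object but never stored on self.
        self._inner = {"cols": cols, "drop_invariant": drop_invariant}


class RegisteredEncoder:
    def __init__(self, cols=None, drop_invariant=False):
        self.cols = cols
        self.drop_invariant = drop_invariant


lossy_copy = clone_like(LossyEncoder(cols=["CHAS"]))
kept_copy = clone_like(RegisteredEncoder(cols=["CHAS"]))
print(lossy_copy._inner["cols"], kept_copy.cols)
```

The copy of LossyEncoder no longer knows about cols, while the copy of RegisteredEncoder keeps it, which is why storing every constructor argument on self fixes the problem.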

Disclaimer: I was able to correctly binary encode in a local debug session. However, as there are so many tests failing on the upstream master currently, it was hard to find out whether my solution has an undesired impact.

Also, I am confused by ordinal.py L323-L326. Is this a bug? It seems to correctly encode both with the -2 and np.nan...

opened by datarian 21
Quantile encoder
This PR (#302), implements two methods from a recently published paper at a conference (MDAI 2021).

Quantile Encoder: Tackling High Cardinality Categorical Features in Regression Problems (Carlos Mougan, David Masip, Jordi Nin, Oriol Pujol)

Encoding methods, full technical development can be followed in the paper:

Quantile Encoder

Tests are implemented and pass. Scikit-learn API semantics are followed. Docs are extended.

If I missed something or any comments, please let me know :)
opened by cmougan 19
1.4.0 Release Organization
Hey all, been away from the project for a bit, but I'm going back through all of the issues and PRs worked on recently (looks like a bunch of good progress!). Special thanks to @janmotl for all of the work as primary maintainer over the past months.

Our last release was 1.3.0 on October 14th. Since then y'all have:

Sped up the TargetEncoder and LeaveOneOutEncoder w/ vectorization (significantly)

Added support for Categorical types in many encoders

Implemented get_feature_names in remaining transformers

Improved testing coverage and quality

Solved edge cases in repeated column names for some transformers

Added support for transforming pandas Series as well as DataFrames and numpy Arrays

Fixed inverse transform for many encoders

Lots of smaller performance enhancements and code cleanups

Which I think is a quite full set of features to constitute a release. I will be opening a separate issue to discuss how we as a community can improve our release cycle, but for now will be going through open issues and tagging anything that should be included before the v1.4.0 release. Any input on what should or shouldn't be completed prior to release is welcome.

Thank you all for the work and support this year, and Happy Holidays.
Release
opened by wdm0006 17

Behavior of OneHotEncoder handle_unknown option

I'm trying to understand the behavior (and intent) of the handle_unknown option for OneHotEncoder (and by extension OrdinalEncoder). The docs imply that this should control NaN handling but below examples seem to indicate otherwise (category_encoders==1.2.8)

In [2]: import pandas as pd
   ...: import numpy as np
   ...: from category_encoders import OneHotEncoder
   ...: 

In [3]: X = pd.DataFrame({'a': ['foo', 'bar', 'bar'],
   ...:                   'b': ['qux', np.nan, 'foo']})
   ...: X
   ...: 
Out[3]: 
     a    b
0  foo  qux
1  bar  NaN
2  bar  foo

In [4]: encoder = OneHotEncoder(cols=['a', 'b'], handle_unknown='ignore', 
   ...:                         impute_missing=True, use_cat_names=True)
   ...: encoder.fit_transform(X)
   ...: 
Out[4]: 
   a_foo  a_bar  b_qux  b_nan  b_foo
0      1      0      1      0      0
1      0      1      0      1      0
2      0      1      0      0      1

In [5]: encoder = OneHotEncoder(cols=['a', 'b'], handle_unknown='impute', 
   ...:                         impute_missing=True, use_cat_names=True)
   ...: encoder.fit_transform(X)
   ...: 
Out[5]: 
   a_foo  a_bar  a_-1  b_qux  b_nan  b_foo  b_-1
0      1      0     0      1      0      0     0
1      0      1     0      0      1      0     0
2      0      1     0      0      0      1     0

In [6]: encoder = OneHotEncoder(cols=['a', 'b'], handle_unknown='error', 
   ...:                         impute_missing=True, use_cat_names=True)
   ...: encoder.fit_transform(X)
   ...: 
Out[6]: 
   a_foo  a_bar  b_qux  b_nan  b_foo
0      1      0      1      0      0
1      0      1      0      1      0
2      0      1      0      0      1

In [7]: encoder = OneHotEncoder(cols=['a', 'b'], handle_unknown='ignore', 
   ...:                         impute_missing=False, use_cat_names=True)
   ...: encoder.fit_transform(X)
   ...: 
Out[7]: 
   a_foo  a_bar  b_qux  b_nan  b_foo
0      1      0      1      0      0
1      0      1      0      1      0
2      0      1      0      0      1

In particular, 'error' and 'ignore' give the same behavior, treating missing observations as another category. 'impute' adds constant zero-valued columns but also treats missing observations as another category. Naively, I would have expected behavior similar to pd.get_dummies(X, dummy_na={True|False}), with handle_unknown=ignore corresponding to dummy_na=False.
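
For reference, the two behaviours being contrasted can be sketched in plain Python. This is neither the pandas nor the category_encoders implementation: one_hot is a hypothetical helper, None stands in for NaN, and columns are kept in order of first appearance rather than sorted.

```python
def one_hot(values, dummy_na=False):
    """One-hot encode a single column; None stands in for a missing value."""
    categories = []
    for v in values:
        if v is not None and v not in categories:
            categories.append(v)
    if dummy_na:
        categories.append(None)  # missing values get their own indicator column
    names = ["nan" if c is None else str(c) for c in categories]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return names, rows


column_b = ["qux", None, "foo"]
names_ignore, rows_ignore = one_hot(column_b, dummy_na=False)
names_as_category, rows_as_category = one_hot(column_b, dummy_na=True)
print(names_ignore, rows_ignore)
print(names_as_category, rows_as_category)
```

With dummy_na=False the missing row is encoded as all zeros; with dummy_na=True it gets its own column, which matches the output shown in the session above.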

bug

opened by multiloc 17

Get feature names

Implemented get_feature_names for HashingEncoder, OneHotEncoder and OrdinalEncoder.

For my purposes, these work now. Not fully tested. It's more of a proposal for a concept. If liked, I will gladly implement for the rest of the encoders, incorporating any feedback.

opened by datarian 16
Question: Difference between TargetEncoder and LeaveOneOutEncoder

It's not really clear to me what the difference between TargetEncoder and LeaveOneOutEncoder is, as both encode using the target with leave-one-out. Can you clarify this, ideally in the docs as well? Does either work for multi-class classification?
question

opened by amueller 13
Implement Target Encoding with Hierarchical Structure Smoothing

From section 4 of the paper cited in TargetEncoding.

Instead of choosing the prior probability of the target as the null hypothesis, it is reasonable to replace it with the estimated probability at the next higher level of aggregation in the attribute hierarchy

In other words, if we have a single zipcode 54321 but 100 zipcodes 54322 and 100 zipcodes 54323, we could use the mean of zipcode level 4 5432X as our mean smoothing term for zipcode 54321, instead of the mean for all zipcodes XXXXX.

This would be a really nice additional piece of functionality to add as an encoder.
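
A minimal sketch of the idea, using the zipcode example above. The function names are hypothetical, and the shrinkage weight n / (n + m) is an m-estimate style choice made for brevity; the weighting function in the cited paper may differ.

```python
from collections import defaultdict


def mean_by(keys, targets):
    """Return the per-key target mean and the per-key row count."""
    sums, counts = defaultdict(float), defaultdict(int)
    for k, t in zip(keys, targets):
        sums[k] += t
        counts[k] += 1
    return {k: sums[k] / counts[k] for k in sums}, dict(counts)


def hierarchical_target_encode(zipcodes, targets, m=10.0):
    """Shrink each zipcode mean toward its 4-digit parent mean, not the global mean."""
    leaf_mean, leaf_n = mean_by(zipcodes, targets)
    parent_mean, _ = mean_by([z[:4] for z in zipcodes], targets)
    encoding = {}
    for z in leaf_mean:
        weight = leaf_n[z] / (leaf_n[z] + m)  # more rows -> trust the leaf mean more
        encoding[z] = weight * leaf_mean[z] + (1 - weight) * parent_mean[z[:4]]
    return encoding


zips = ["54321"] + ["54322"] * 100 + ["54323"] * 100 + ["90210"] * 100
ys = [1.0] + [0.2] * 100 + [0.4] * 100 + [0.9] * 100
encoding = hierarchical_target_encode(zips, ys)
global_mean = sum(ys) / len(ys)
print(round(encoding["54321"], 4), round(global_mean, 4))
```

The single 54321 row is pulled toward the 5432X mean (about 0.30) rather than toward the global mean (about 0.50), which is influenced by the unrelated 90210 rows.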
enhancement

opened by JoshuaC3 13
In the ordinal encoder go ahead and update the existing column instea…

To fix https://github.com/scikit-learn-contrib/categorical-encoding/issues/100

The issue was arising because _tmp columns were being appended to the end of the data frame as part of the transform process.

First, we noticed that the transform process was to append a temporary column, drop the existing column, and rename the temporary column to the existing column name.

So, we went ahead and reduced that step to one step where we update the existing column using our mapping which preserves the order. I wasn't sure why the above mentioned transform method had that many steps and a single update seems to ensure the tests are passing.
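
The effect on column order can be modelled with an insertion-ordered dict standing in for the data frame. Both functions are hypothetical sketches of the two approaches described above, not the project's code.

```python
def transform_via_temp_column(frame, col, mapping):
    """Old approach: append a temporary column, drop the original, rename the temporary."""
    frame = dict(frame)
    frame[col + "_tmp"] = [mapping[v] for v in frame[col]]
    del frame[col]
    frame[col] = frame.pop(col + "_tmp")  # the renamed column now sits at the end
    return frame


def transform_in_place(frame, col, mapping):
    """New approach: overwrite the existing column, which keeps its position."""
    frame = dict(frame)
    frame[col] = [mapping[v] for v in frame[col]]
    return frame


frame = {"city": ["x", "y"], "age": [30, 40]}
mapping = {"x": 1, "y": 2}
old_order = list(transform_via_temp_column(frame, "city", mapping))
new_order = list(transform_in_place(frame, "city", mapping))
print(old_order, new_order)
```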

@janmotl I also noticed in travis the python3 step seems to be running python 2.7 instead of python3. From the install.sh I see mentions of a conda create and in the CI logs I see the mention of a virtualenv being set which I don't see mentioned in the project. Perhaps the travis cache needs to be cleared?

opened by JohnnyC08 13

Differing dimensions for training and test

Hi,

I would like to fit encodings on my training set and then use this fitted encoding to transform both the training and the test set:

import category_encoders as ce

train = ['Brunswick East', 'Fitzroy', 'Williamstown', 'Newport', 'Balwyn North', 'Doncaster', 'Melbourne', 'Albert Park', 'Bentleigh', 'Northcote']
test = ['Fitzroy North', 'Fitzroy', 'Richmond', 'Surrey Hills', 'Blackburn', 'Port Melbourne', 'Footscray', 'Yarraville', 'Carnegie', 'Surrey Hills']

encoder = ce.HelmertEncoder()
encoder.fit(train)

train_t = encoder.transform(train)
test_t = encoder.transform(test)

print(train_t.shape)  # (10, 10)
print(test_t.shape)   # (10, 2)

The problem is that the dimensions do not match. What am I doing wrong, or how can I fix this issue?

Best regards, Felix
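
For orientation, the sketch below shows what a consistent result would look like: the number of Helmert contrast columns is fixed by the categories seen at fit time, so train and test should come out with the same width. The helper names are hypothetical, the contrast matrix follows one common Helmert convention, and mapping unseen categories to all-zero contrasts is an assumption of the sketch rather than the library's documented behaviour.

```python
def helmert_contrasts(n_levels):
    """Helmert contrast matrix with n_levels rows and n_levels - 1 columns."""
    matrix = []
    for row in range(n_levels):
        values = []
        for col in range(n_levels - 1):
            if row <= col:
                values.append(-1)
            elif row == col + 1:
                values.append(col + 1)
            else:
                values.append(0)
        matrix.append(values)
    return matrix


def fit_levels(values):
    """Remember the distinct categories in order of first appearance."""
    levels = []
    for v in values:
        if v not in levels:
            levels.append(v)
    return levels


def helmert_transform(values, levels):
    """Prepend an intercept of 1 and look up each value's contrast row."""
    contrasts = helmert_contrasts(len(levels))
    unknown = [0] * (len(levels) - 1)  # assumption: unseen categories get all zeros
    return [[1] + (contrasts[levels.index(v)] if v in levels else unknown) for v in values]


train = ["Fitzroy", "Newport", "Doncaster", "Melbourne"]
test = ["Fitzroy", "Richmond", "Newport"]
levels = fit_levels(train)
train_t = helmert_transform(train, levels)
test_t = helmert_transform(test, levels)
print(len(train_t[0]), len(test_t[0]))
```

With 10 distinct training categories this scheme gives 9 contrast columns plus an intercept, which is consistent with the (10, 10) training shape reported above; the test set would be expected to have 10 columns as well.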

opened by FelixNeutatz 12

ValueError: `X` and `y` both have indexes, but they do not match.
Expected Behavior

When running any of the category encoders e.g. TargetEncoder() within a pipeline through permutation_test_score() it errors out with the above message. The error occurs in the convert_inputs() function which checks for if any(X.index != y.index): before raising the error.

Actual Behavior

The error is not correct and shouldn't occur. When I ran the same check above on my input X (dataframe) and y (series), the error doesn't occur.

In fact, when I load input data, after splitting the data into X and y, and after label encoding y, I explicitly convert it into a pd.Series and assign it the X.index, so they are in fact identical.

If in contrast, I do not convert the label encoded y into a pd.Series and leave it as an ndarray, then this error doesn't occur!

Also, note that the same pipeline when fitted with the same X, y df and series works absolutely fine.
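
The sketch below models the kind of label comparison described above, using plain lists in place of pandas indexes. It shows one way such a check can fail even though X and y started out aligned: if y is reordered while carrying its labels along, the labels no longer line up position by position with those of X. Whether permutation_test_score reorders y in this way is not established by this report, and indexes_match is a hypothetical helper.

```python
def indexes_match(x_index, y_index):
    """Position-by-position label comparison, analogous to any(X.index != y.index)."""
    return len(x_index) == len(y_index) and all(a == b for a, b in zip(x_index, y_index))


x_index = [0, 1, 2, 3]
y_values = ["a", "b", "c", "d"]
y_index = [0, 1, 2, 3]

# Reorder y and carry its labels along, as positional selection on a labelled series does.
order = [2, 0, 3, 1]
permuted_values = [y_values[i] for i in order]
permuted_index = [y_index[i] for i in order]

# A plain array has no labels, so only positions 0..n-1 remain after reordering.
positional_index = list(range(len(permuted_values)))

aligned_before = indexes_match(x_index, y_index)
aligned_after_permutation = indexes_match(x_index, permuted_index)
aligned_as_array = indexes_match(x_index, positional_index)
print(aligned_before, aligned_after_permutation, aligned_as_array)
```

This would be consistent with the observation that the error disappears when y is left as an ndarray.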

Steps to Reproduce the Problem

See an example of my pipeline below:

Create an arbitrary pipeline as follows:

from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from category_encoders import TargetEncoder

test_pipe = Pipeline([('enc', TargetEncoder()), ('clf', SGDClassifier(loss='log_loss'))])

Run

from sklearn.model_selection import permutation_test_score

score, perm_scores, pvalue = permutation_test_score(test_pipe, X, y)

Specifications

Version: 2.5.1.post0

Platform: Python 3.10.8

Subsystem: Pandas 1.5.1
opened by RNarayan73 1

OneHotEncoder: handle_missing = 'ignore' would be very useful

Expected Behavior

It would be nice to be able to ignore missing values instead of creating new columns with an "_nan" suffix, just as is possible with pandas. What do you think?

Actual Behavior

Doesn't exist in the current latest version (according to my knowledge)

Steps to Reproduce

import pandas as pd
import numpy as np
from category_encoders import OneHotEncoder

encoder = OneHotEncoder(
    cols=None,  # all non-numeric
    return_df=True,
    handle_missing="value",  # would be nice to have the option 'ignore'
    use_cat_names=True,
)
df = pd.DataFrame(
    {"this": ["GREEN", "GREEN", "YELLOW", "YELLOW"], "that": ["A", "B", "A", np.nan]}
)

encoder.fit_transform(df) # unwanted result
pd.get_dummies(df, dummy_na=False) # wanted result

Specifications

Version: 2.5.1.post0

opened by woodly0 0

fix: Broken inverse_transform for OrdinalEncoder when custom mapping …

This PR fixes issue #202. It then allows for inverse transform to be performed with custom dict.

Proposed Changes

The changes are minor and modify line 171 by implementing comment https://github.com/scikit-learn-contrib/category_encoders/issues/202#issuecomment-946159286

opened by fredmontet 2
Intercept in Contrast Coding Schemes
Expected Behavior

The constant (all values 1) intercept column should not be added when applying contrast coding schemes (i.e. backward difference, sum, polynomial and helmert coding)

I don't think this intercept column is needed. If you fit a supervised learning model it is probably gonna help to remove the intercept column. I think it is there because when fitting linear models with statsmodels you have to add the intercept.
However, I don't like that the output of an encoder would then depend on whether the intercept column is already there or not, e.g. if I first apply encoder A on column A and then encoder B on column B, the intercept column of B overwrites A's intercept column, hence not adding a new column. Also, if I have (for some reason) a column called intercept that is not constant, it would get overwritten.

Any opinion? Am I missing something? Is the intercept necessary?

Actual Behavior

A constant column with all values 1 is added

Steps to Reproduce the Problem

Run transform on any fitted contrast coding encoder, e.g.

import category_encoders as encoders

train = ['A', 'B', 'C']
encoder = encoders.BackwardDifferenceEncoder(handle_unknown='value', handle_missing='value')
encoder.fit_transform(train)
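
The overwrite concern can be modelled with a dict standing in for the data frame. add_contrast_columns is a hypothetical stand-in for an encoder step, and the contrast values are purely illustrative.

```python
def add_contrast_columns(frame, col, contrast_columns):
    """Hypothetical encoder step: replace `col` by an intercept plus contrast columns."""
    out = {name: values for name, values in frame.items() if name != col}
    n_rows = len(frame[col])
    out["intercept"] = [1] * n_rows  # silently overwrites any existing 'intercept'
    for name, values in contrast_columns.items():
        out[name] = values
    return out


frame = {"intercept": [0.3, 0.7, 0.1], "A": ["x", "y", "z"], "B": ["p", "q", "p"]}
step1 = add_contrast_columns(frame, "A", {"A_0": [-1, 1, 0], "A_1": [-1, -1, 2]})
step2 = add_contrast_columns(step1, "B", {"B_0": [-1, 1, -1]})
print(list(step1), step1["intercept"])
print(list(step2))
```

The user's non-constant intercept column is replaced by ones after the first step, and the second step adds no further intercept column, so the output depends on what was already in the frame.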
opened by PaulWestenthanner 3
No need to check if # of dimensions of testing set align with training set in target_encoder

https://github.com/scikit-learn-contrib/category_encoders/blob/6a13c14919d56fed8177a173d4b3b82c5ea2fef5/category_encoders/utils.py#L322-L323

For the function _check_transform_inputs(), I do not want an error to be reported when the number of dimensions of the testing set does not match that of the training set. By default, however, they must match. Since the purpose of the target encoder is only to transform the designated columns, there is logically no need to validate that the dimensions align.

opened by hongG1997EQ 1
Memory increase of WOEEncoder for newer category_encoders version
Memory increase of WOEEncoder for category_encoders version >=2.0.0

Hi, I noticed another memory issue with WOEEncoder. I have submitted the same bug before in #335, the difference between two bugs is the different encoder methods used and different datasets. In order to distinguish between the two encoder APIs, I resubmitted a new bug report.

Expected Behavior

Similar memory usage

Actual Behavior

According to the experiment results, when the category_encoders version is 2.0.0 or higher, the memory usage of weight_enc.fit(train[weight_encode], train['target']) increases from 58MB to 206MB.

Memory (MB) | Version
-- | --
209 | 2.3.0
209 | 2.2.2
209 | 2.1.0
209 | 2.0.0
58 | 1.3.0

Steps to Reproduce the Problem

Step 1: Download the dataset

train.zip

Step 2: install category_encoders

pip install category_encoders==#version#

Step 3: change category_encoders version and save the memory usage

import tracemalloc

import numpy as np
import pandas as pd
import category_encoders as ce

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
columns = [x for x in train.columns if x != 'target']
object_col_label = ['bin_0', 'bin_1', 'bin_2', 'bin_3', 'bin_4']
one_hot_encode = ['nom_0', 'nom_1', 'nom_2', 'nom_3', 'nom_4']
target_encode = ['nom_5', 'nom_6', 'nom_7', 'nom_8', 'nom_9']
weight_encode = target_encode + ['ord_4', 'ord_5', 'ord_3'] + one_hot_encode + object_col_label

weight_enc = ce.woe.WOEEncoder(cols=weight_encode)

# measure the memory allocated while fitting the encoder
tracemalloc.start()
weight_enc.fit(train[weight_encode], train['target'])
current3, peak3 = tracemalloc.get_traced_memory()
print("Get_dummies memory usage is {", current3 / 1024 / 1024, "}MB; Peak memory was :{", peak3 / 1024 / 1024, "}MB")

Specifications

Version: 2.3.0, 2.2.2, 2.1.0, 2.0.0, 1.3.0
Platform: ubuntu 16.4
OS: Ubuntu
CPU: Intel(R) Core(TM) i9-9900K CPU
GPU: TITAN V
opened by Piecer-plc 1

Releases(2.5.1.post0)

2.5.1.post0 (Oct 6, 2022): release sdist fix
2.5.1 (Oct 5, 2022): changes according to changelog
2.5.0 (Jun 2, 2022): Release 2.5.0 with changes according to changelog
2.4.1 (May 10, 2022): release 2.4.1 with minor fixes
2.4.0 (Mar 9, 2022): Releasing changes described in changelog.md
2.3.0-rerelease (Oct 13, 2021): Re-running the 2.3.0 release with updated pypi creds
2.3.0 (Oct 7, 2021): Largely a bugfix release after a period of no maintenance. May still be issues with GLMEncoder.
2.2.2 (Apr 29, 2020)
2.2.1 (Apr 29, 2020)
2.2.0 (Apr 29, 2020)
2.0.0 (Apr 28, 2019)
1.2.6 (Jan 22, 2018): A copy of the v1.2.6 release for zenodo

Owner

scikit-learn compatible projects

GitHub Repository http://contrib.scikit-learn.org/category_encoders/
