Extra blocks for scikit-learn pipelines.

Overview

scikit-lego

We love scikit-learn, but very often we find ourselves writing custom transformers, metrics and models. The goal of this project is to consolidate these into a package that offers code quality and testing. This project started as a collaboration between multiple companies in the Netherlands but has since received contributions from around the globe. It was initiated by Matthijs Brouns and Vincent D. Warmerdam as a tool to teach people how to contribute to open source.

Note that we're not formally affiliated with the scikit-learn project at all, but we aim to strictly adhere to their standards.

The same holds for LEGO. LEGO® is a trademark of the LEGO Group of companies which does not sponsor, authorize or endorse this project.

Installation

Install scikit-lego via pip with

python -m pip install scikit-lego

Via conda with

conda install -c conda-forge scikit-lego

Alternatively, to edit and contribute you can fork/clone and run:

python -m pip install -e ".[dev]"
python setup.py develop

Documentation

The documentation can be found here.

Usage

We offer custom metrics, models and transformers. You can import them just like you would in scikit-learn.

# the scikit learn stuff we love
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# the scikit-lego stuff we add
from sklego.preprocessing import RandomAdder
from sklego.mixture import GMMClassifier

...

mod = Pipeline([
    ("scale", StandardScaler()),
    ("random_noise", RandomAdder()),
    ("model", GMMClassifier())
])

...

Features

Here's a list of features that this library currently offers:

  • sklego.datasets.load_abalone loads in the abalone dataset
  • sklego.datasets.load_arrests loads in a dataset with fairness concerns
  • sklego.datasets.load_chicken loads in the joyful chickweight dataset
  • sklego.datasets.load_heroes loads a heroes of the storm dataset
  • sklego.datasets.load_hearts loads a dataset about hearts
  • sklego.datasets.load_penguins loads a lovely dataset about penguins
  • sklego.datasets.fetch_creditcard fetches a fraud dataset from openml
  • sklego.datasets.make_simpleseries makes a simulated timeseries
  • sklego.pandas_utils.add_lags adds lag values in a pandas dataframe
  • sklego.pandas_utils.log_step a useful decorator to log your pipeline steps
  • sklego.dummy.RandomRegressor dummy benchmark that predicts random values
  • sklego.linear_model.DeadZoneRegressor experimental feature that has a deadzone in the cost function
  • sklego.linear_model.DemographicParityClassifier logistic classifier constrained on demographic parity
  • sklego.linear_model.EqualOpportunityClassifier logistic classifier constrained on equal opportunity
  • sklego.linear_model.ProbWeightRegression linear model that treats coefficients as probabilistic weights
  • sklego.linear_model.LowessRegression locally weighted linear regression
  • sklego.linear_model.LADRegression least absolute deviation regression
  • sklego.linear_model.ImbalancedLinearRegression punish over/under-estimation of a model directly
  • sklego.naive_bayes.GaussianMixtureNB classifies by training a 1D GMM per column per class
  • sklego.naive_bayes.BayesianGaussianMixtureNB classifies by training a bayesian 1D GMM per class
  • sklego.mixture.BayesianGMMClassifier classifies by training a bayesian GMM per class
  • sklego.mixture.BayesianGMMOutlierDetector detects outliers based on a trained bayesian GMM
  • sklego.mixture.GMMClassifier classifies by training a GMM per class
  • sklego.mixture.GMMOutlierDetector detects outliers based on a trained GMM
  • sklego.meta.ConfusionBalancer experimental feature that allows you to balance the confusion matrix
  • sklego.meta.DecayEstimator adds decay to the sample_weight that the model accepts
  • sklego.meta.EstimatorTransformer adds a model output as a feature
  • sklego.meta.OutlierClassifier turns outlier models into classifiers for gridsearch
  • sklego.meta.GroupedPredictor can split the data into runs and run a model on each
  • sklego.meta.GroupedTransformer can split the data into runs and run a transformer on each
  • sklego.meta.SubjectiveClassifier experimental feature to add a prior to your classifier
  • sklego.meta.Thresholder meta model that allows you to gridsearch over the threshold
  • sklego.meta.RegressionOutlierDetector meta model that finds outliers by adding a threshold to regression
  • sklego.meta.ZeroInflatedRegressor predicts zero or applies a regression based on a classifier
  • sklego.preprocessing.ColumnCapper limits extreme values of the model features
  • sklego.preprocessing.ColumnDropper drops a column from pandas
  • sklego.preprocessing.ColumnSelector selects columns based on column name
  • sklego.preprocessing.InformationFilter transformer that can de-correlate features
  • sklego.preprocessing.IdentityTransformer returns the same data, allows for concatenating pipelines
  • sklego.preprocessing.OrthogonalTransformer makes all features linearly independent
  • sklego.preprocessing.PandasTypeSelector selects columns based on pandas type
  • sklego.preprocessing.PatsyTransformer applies a patsy formula
  • sklego.preprocessing.RandomAdder adds randomness in training
  • sklego.preprocessing.RepeatingBasisFunction repeating feature engineering, useful for timeseries
  • sklego.preprocessing.DictMapper assign numeric values on categorical columns
  • sklego.preprocessing.OutlierRemover experimental method to remove outliers during training
  • sklego.model_selection.KlusterFoldValidation experimental feature that does K folds based on clustering
  • sklego.model_selection.TimeGapSplit timeseries Kfold with a gap between train/test
  • sklego.pipeline.DebugPipeline adds debug information to make debugging easier
  • sklego.pipeline.make_debug_pipeline shorthand function to create a debuggable pipeline
  • sklego.metrics.correlation_score calculates correlation between model output and feature
  • sklego.metrics.equal_opportunity_score calculates equal opportunity metric
  • sklego.metrics.p_percent_score proxy for model fairness with regards to sensitive attribute
  • sklego.metrics.subset_score calculates a score on a subset of your data (meant for fairness tracking)

New Features

We want to be rather open in what we accept, but we do demand three things before a feature is added to the project:

  1. any new feature contributes towards a demonstrable real-world use case
  2. any new feature passes standard unit tests (we use the ones from scikit-learn)
  3. the feature has been discussed in the issue list beforehand

We automate all of our testing and use pre-commit hooks to keep the code working.

Comments
  • [FEATURE] Bayesian Kernel Density Classifier

    • I've been using this Bayesian kernel density classifier for a few years and I thought I should move it out from my poorly organized project to this one here.

    The prior is $P(y=0)$. I primarily use it for spatial problems.

    It is similar to the GMM Classifier, with only two differences I can think of:

    • Hyperparameters are easier to decide on.
    • Scaling is worse, I believe because the KDE part scales linearly with the sample size.
    import numpy as np
    from sklearn.base import BaseEstimator, ClassifierMixin
    from sklearn.neighbors import KernelDensity
    
    
    # noinspection PyPep8Naming
    class BayesianKernelDensityClassifier(BaseEstimator, ClassifierMixin):
        """
        Bayesian Classifier that uses Kernel Density Estimations to generate the joint distribution
        Parameters:
            - bandwidth: float
            - kernel: for scikit learn KernelDensity
        """
        def __init__(self, bandwidth=0.2, kernel='gaussian'):
            self.classes_, self.models_, self.priors_logp_ = [None] * 3
            self.bandwidth = bandwidth
            self.kernel = kernel
    
        def fit(self, X, y):
            self.classes_ = np.sort(np.unique(y))
            # one KDE per class, fitted on the rows that belong to that class
            training_sets = [X[y == yi] for yi in self.classes_]
            self.models_ = [KernelDensity(bandwidth=self.bandwidth, kernel=self.kernel).fit(x_subset)
                            for x_subset in training_sets]
    
            # log prior of each class: its share of the training rows
            self.priors_logp_ = [np.log(x_subset.shape[0] / X.shape[0]) for x_subset in training_sets]
            return self
    
        def predict_proba(self, X):
            # log density per class, shape (n_samples, n_classes)
            logp = np.array([model.score_samples(X) for model in self.models_]).T
            result = np.exp(logp + self.priors_logp_)
            return result / result.sum(1, keepdims=True)
    
        def predict(self, X):
            return self.classes_[np.argmax(self.predict_proba(X), 1)]
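
Read as a formula, predict_proba above applies Bayes' rule with a kernel density estimate as the class-conditional density. The notation below is an assumption added for illustration: $\hat{p}(x \mid y=c)$ stands for the KDE fitted on the rows of class $c$, and $n_c / n$ for the share of training rows in that class.

$$
P(y=c \mid x) = \frac{\hat{p}(x \mid y=c)\, P(y=c)}{\sum_{c'} \hat{p}(x \mid y=c')\, P(y=c')},
\qquad P(y=c) = \frac{n_c}{n}
$$

The code evaluates the numerator in log space (score_samples plus the log prior), exponentiates, and divides by the row sum.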
    
    enhancement 
    opened by arose13 26
  • [FEATURE] A light version that does not depend on cvxpy (and others?)

    Please explain clearly what you'd like to see added.

    • [x] convince us of the use-case, we're open to many suggestions but we prefer to solve problems with pipelines that are at least somewhat general

    When using the package in a Docker container without C installed, installation may fail on CVXPY.

    • [x] ~~add a screenshot if applicable (ML stuff is hard to explain with words, pictures say 1000 words)~~

    • [x] ~~make sure that the feature you want is not already supported by sklearn~~

    Happy to work on this if you agree

    enhancement 
    opened by pim-hoeven 19
  • Bugfix for #158

    Fixed the bug caused when using the grouped_estimator with a string column as grouping variable.

    Solution: Try the checks without adjustments, if that fails: remove the grouping column from the array or dataframe.

    opened by pim-hoeven 18
  • Timegap optimize simplify

    • replace the two parameters df and date_col with a single date_serie
    • ensure no mutation by working on a copy
    • check that the indices match and that the index is unique
    • optimise the iloc/loc/get_loc calls with a join and index
    • add plotting function with unit test
    • add summary function with unit test
    • update notebook doc
    opened by stephanecollot 17
  • [BUG] Stacking classifier cannot use Thresholder function - no .predict_proba

    Description:

    I'm able to use the Thresholder on sklearn's voting classifier, but not on the stacking classifier. It throws the error below, which I believe is incorrect: StackingClassifier does have predict_proba. Maybe I'm misunderstanding the use case, but this seems to fit.

    ValueError: The Thresholder meta model only works on classifcation models with .predict_proba.

    Code for reproduction (using the sklearn sample data for StackingClassifier):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import LinearSVC
    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.ensemble import StackingClassifier
    from sklego.meta import Thresholder
    
    X, y = load_iris(return_X_y=True)
    estimators = [
        ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
        ('svr', make_pipeline(StandardScaler(), LinearSVC(random_state=42)))]
    clf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
    clf.fit(X, y)
    
    a = Thresholder(clf, threshold=0.2)
    a.fit(X, y)
    a.predict(X)
    

    Full trace:

    ValueError                                Traceback (most recent call last)
    <ipython-input-26-1b89dbfa16b8> in <module>
         16 
         17 a = Thresholder(clf, threshold=0.2)
    ---> 18 a.fit(X_train_std, np.ceil(y_train[targets[2]]))
         19 a.predict(X_train_std)
    
    ~\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\sklego\meta\thresholder.py in fit(self, X, y, sample_weight)
         54         self.estimator_ = clone(self.model)
         55         if not isinstance(self.estimator_, ProbabilisticClassifier):
    ---> 56             raise ValueError(
         57                 "The Thresholder meta model only works on classifcation models with .predict_proba."
         58             )
    
    ValueError: The Thresholder meta model only works on classifcation models with .predict_proba.
    
    bug 
    opened by L-Marriott 16
  • GMM methods - classification *and* outlier detection

    As far as outlier detection is concerned, this is the current flow:

    import numpy as np
    import pandas as pd
    import plotnine as p9
    
    from sklego.mixture import GMMOutlierDetector
    
    X = np.random.normal(-10, 1, (2000, 2))
    mod = GMMOutlierDetector(n_components=1, threshold=0.99).fit(X)
    
    df = pd.DataFrame({"x1": X[:, 0], "x2": X[:, 1],
                       "loglik": mod.gmm.score_samples(X), 
                       "prediction": mod.predict(X).astype(str)})
    
    (p9.ggplot() + p9.geom_point(data=df, mapping=p9.aes("x1", "x2", color="loglik")))
    

    image

    (p9.ggplot() + p9.geom_point(data=df, mapping=p9.aes("x1", "x2", color="prediction")))
    

    image

    opened by koaning 16
  • Add log names and types

    I feel that by copying the log_step function many times and slightly changing the logging section I'm repeating a lot of code. Do you have any suggestion to avoid this?

    This is WIP, some TODOs:

    • rename log_step -> log_shape.
    • tests, docs.
    • log_dtypes
    opened by david26694 15
  • [FEATURE] Pass additional parameters to fit underlying estimator in `EstimatorTransformer`

    In EstimatorTransformer the underlying estimator is being fitted without the ability to pass along additional arguments to self.estimator_.fit.

    This limits use cases for EstimatorTransformer. For example, if the underlying estimator is an XGBClassifier we would like to be able to pass eval_set to monitor validation performance and enable early stopping. This is currently not possible. Adding *args, **kwargs should fix this issue.
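
A minimal sketch of the requested behaviour, assuming keyword arguments only. The class name FitParamsEstimatorTransformer is hypothetical and this is not the sklego implementation; it only shows fit forwarding extra parameters (here sample_weight) to the wrapped estimator.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.linear_model import LinearRegression


class FitParamsEstimatorTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical transformer that exposes an estimator's predictions as a feature."""

    def __init__(self, estimator, predict_func="predict"):
        self.estimator = estimator
        self.predict_func = predict_func

    def fit(self, X, y, **fit_params):
        # Clone so the estimator passed in by the user is left untouched,
        # then forward any extra keyword arguments to its fit method.
        self.estimator_ = clone(self.estimator)
        self.estimator_.fit(X, y, **fit_params)
        return self

    def transform(self, X):
        output = getattr(self.estimator_, self.predict_func)(X)
        # Return a 2D array so the output can be stacked with other features.
        return output.reshape(-1, 1) if output.ndim == 1 else output


X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + 1
transformer = FitParamsEstimatorTransformer(LinearRegression())
features = transformer.fit(X, y, sample_weight=np.ones(10)).transform(X)
```

The same mechanism would carry an argument such as eval_set through to an estimator whose fit method accepts it.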

    https://github.com/koaning/scikit-lego/blob/b4d087f0131ff164e6feebf238356ba6512b3635/sklego/meta/estimator_transformer.py#L31

    enhancement 
    opened by CarloLepelaars 14
  • Add Repeating Basis Functions

    I worked on this a while ago in PR #147, but I started from a fresh branch because I decided to limit the scope only to repeating basis functions (#20), excluding the spanning basis functions (#29).

    Feedback adapted so far:

    • Write a transformer when X contains only one column
    • Write a wrapper which selects one column, when X has more than one column
    • Added basic tests
    • Added docstring

    To Do:

    • Write documentation

    Could someone review the code? I'll work on the documentation in the meantime as well.

    Tagging people involved in PR #147: @kayhoogland @koaning @MBrouns

    opened by RensDimmendaal 14
  • Fix start of train split in TimeGapSplit and added n_split parameter

    Addresses changes in #192 and #232. I am currently working on a time series problem with vibration data where I needed functionality like that suggested in #232, so I decided to add it here.

    I tried to explain the changes in functionality in the docstring:

    Each validation fold doesn't overlap. The entire 'window' moves by 1 valid_duration until there is not enough data. If this would lead to more splits than specified with n_splits, the 'window' moves by the valid_duration times the ratio of possible splits to requested splits:

    n_possible_splits = (total_length - train_duration - gap_duration) // valid_duration
    time_shift = valid_duration * n_possible_splits / n_splits

    so the CV spans the whole dataset. If train_duration is not passed but n_splits is, the training duration is increased to

    train_duration = total_length - (self.gap_duration + self.valid_duration * self.n_splits)

    such that shifting the entire window by one validation duration spans the whole training set.
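
To make the arithmetic in the docstring concrete, here is a small worked example. The numbers are hypothetical (think of them as days) and the plain variables stand in for the TimeGapSplit attributes; only the formulas come from the description above.

```python
# Hypothetical durations, all in the same unit (e.g. days).
total_length = 100
train_duration = 30
gap_duration = 5
valid_duration = 10
n_splits = 3

# How many non-overlapping validation folds would fit if the window moved
# by one valid_duration at a time.
n_possible_splits = (total_length - train_duration - gap_duration) // valid_duration

# More folds are possible than requested, so the window moves in larger steps.
time_shift = valid_duration * n_possible_splits / n_splits

# When train_duration is not given, it is derived from n_splits instead.
derived_train_duration = total_length - (gap_duration + valid_duration * n_splits)

print(n_possible_splits, time_shift, derived_train_duration)
```

With these values six folds would fit, three are requested, and the window therefore moves by 20 instead of 10 per split.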

    The changes are also added to the docs notebook for visualization.

    opened by rpauli 13
  • [FEATURE] WontPredict: meta model.

    In the world of hype @MBrouns and myself came up with a very normal thought.

    Screenshot 2019-10-28 at 20 00 04

    One way to accomplish this is by introducing a new meta model: WontPredict.

    This thread is meant as a place to discuss the implementation.

    enhancement 
    opened by koaning 13
  • [FEATURE] Adding the MRMR (Maximum Relevance Minimum Redundancy) feature selection

    Hi!

    The only feature selection methods that scikit-learn offers are quite naive. MRMR seems like a more advanced and reasonable approach to selecting informative, non-redundant features, as described here.

    Long story short:

    1. Pick a feature that is most informative in some metric (e.g. F-statistic).
    2. Pick the next feature that is very informative, but doesn't correlate with the previous feature too much (e.g. the average absolute Pearson correlation between the current feature and the feature selected in step 1).
    3. Pick the next feature that is very informative, but doesn't correlate with the previous 2 features too much.
    4. Pick the next feature that is very informative, but doesn't correlate with the previous 3 features too much.
    5. (repeat until K features selected)

    Here, K, the measure of information and correlation can be specified by the user.
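
The steps above can be sketched with NumPy alone. This is an illustration under stated assumptions, not a proposed implementation: relevance is measured with the absolute Pearson correlation to the target instead of the F-statistic, redundancy is the mean absolute correlation with the already selected features, the two are combined as a ratio, and the function name mrmr_select is hypothetical. Constant columns are not handled.

```python
import numpy as np


def mrmr_select(X, y, k):
    """Greedy MRMR sketch: returns the indices of k selected columns of X."""
    n_features = X.shape[1]
    k = min(k, n_features)

    # Relevance: absolute Pearson correlation of each feature with the target.
    relevance = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)]
    )
    # Pairwise absolute correlations between features, used for redundancy.
    feature_corr = np.abs(np.corrcoef(X, rowvar=False))

    # Step 1: start with the most relevant feature.
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        remaining = [j for j in range(n_features) if j not in selected]
        # Redundancy: mean absolute correlation with the features picked so far.
        redundancy = feature_corr[np.ix_(remaining, selected)].mean(axis=1)
        scores = relevance[remaining] / (redundancy + 1e-12)
        selected.append(remaining[int(np.argmax(scores))])
    return selected


rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
# Column 1 is a near copy of column 0; column 2 is independent of both.
X = np.column_stack([a, a + 0.01 * rng.normal(size=500), b])
y = a + 0.5 * b
chosen = mrmr_select(X, y, k=2)
```

In this toy data the near-duplicate column is highly relevant but also highly redundant, so the second pick is the independent column even though its correlation with the target is lower.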

    enhancement 
    opened by Garve 7
  • [FEATURE] allow for all kwargs when using @log_step

    Hi,

    When using @log_step to debug a pandas pipeline, the decorated function must accept a single positional argument df: pd.DataFrame.

    However, if the user passes all the parameters as kwargs, there is an error.

    It would be useful if @log_step checked the first kwarg and, if it is a pd.DataFrame, converted it into a positional arg. A possible implementation, before running the def wrapper(), is as follows:

        _kwargs = {**kwargs}
        first_arg = next(iter(_kwargs))
        if isinstance(_kwargs[first_arg], pd.DataFrame) and len(args) == 0:
            args = args + (_kwargs.pop(first_arg),)
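
A self-contained sketch of this idea, using only the standard library. The decorator name log_rows and its frame_type parameter are hypothetical, and a plain list stands in for pd.DataFrame so the example runs without pandas. It assumes the first keyword argument corresponds to the first parameter of the decorated function.

```python
import functools


def log_rows(frame_type):
    """Hypothetical logging decorator that also accepts the frame as a keyword."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # If nothing was passed positionally, promote the first keyword
            # argument to a positional one when it has the expected type.
            if not args and kwargs:
                first_key = next(iter(kwargs))
                if isinstance(kwargs[first_key], frame_type):
                    kwargs = dict(kwargs)  # copy so the caller's dict is not mutated
                    args = (kwargs.pop(first_key),)
            result = func(*args, **kwargs)
            n_in = len(args[0]) if args else None
            print(f"[{func.__name__}] rows in={n_in} rows out={len(result)}")
            return result

        return wrapper

    return decorator


@log_rows(list)
def head(rows, n=2):
    return rows[:n]


by_keyword = head(rows=[1, 2, 3], n=1)
by_position = head([1, 2, 3])
```

With pandas, frame_type would be pd.DataFrame; both call styles then reach the logging code with the frame available as args[0].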
    
    
    
    enhancement 
    opened by sephib 6
  • [FEATURE] - Grid search across model parameters AND thresholds with Thresholder() without refitting

    Thanks for this great set of extensions to sklearn.

    The Thresholder() model is quite close to something I've been looking for for a while.

    I'm looking to include threshold optimisation as part of a broader parameter search.

    I can perhaps best describe the desired behaviour as follows

    for each parameter set in grid:
        fit model with parameters
        for each threshold in thresholds:
            evaluate model
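
The loop above can be written out with plain scikit-learn. This is a sketch under assumptions, not a Thresholder or GridSearchCV feature: it uses a single train/validation split instead of cross-validation, LogisticRegression as a stand-in model, F1 as the metric, and hypothetical grid values.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import ParameterGrid, train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)

param_grid = ParameterGrid({"C": [0.1, 1.0, 10.0]})
thresholds = [0.2, 0.5, 0.8]

results = []
for params in param_grid:
    # Fit once per parameter set ...
    model = LogisticRegression(max_iter=1000, **params).fit(X_train, y_train)
    proba = model.predict_proba(X_valid)[:, 1]
    # ... then reuse the predicted probabilities for every threshold.
    for threshold in thresholds:
        pred = (proba >= threshold).astype(int)
        results.append(
            {**params, "threshold": threshold, "f1": f1_score(y_valid, pred)}
        )

best = max(results, key=lambda r: r["f1"])
```

Three models are fitted for nine evaluated combinations, which is the saving the request is after.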
    

    However, if I pass a model that has not yet been fit to Thresholder(), then, even with refit=False, the same model is also fit once for each threshold.

    Is there an easy way around this? Thinking about this the best way to achieve this would be tinkering with the GridSearchCV code, but perhaps you have an idea and would also find this interesting?

    Thanks!

    enhancement 
    opened by mcallaghan 1
  • Selectors : allow results in empty dataframe

    Before working on a large PR, please check with @koaning or @MBrouns that they agree with the direction of the PR. This discussion should take place in a Github issue before working on the PR, unless it's a minor change like spelling in the docs.

    Description

    Suppose you want to build a semi-automatic Pipeline. The pipeline may look like:

    
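    import pandas as pd
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.pipeline import Pipeline, FeatureUnion
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklego.preprocessing import IdentityTransformer, PandasTypeSelector
    
    # StringIndexer is not defined in this snippet; presumably a user-defined transformer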
    
    transformer = Pipeline([
        ('features', FeatureUnion(n_jobs=1, transformer_list=[
            # Part 1
            ('boolean', Pipeline([
                ('selector', PandasTypeSelector(include='bool')),
            ])),  # booleans close
            
            ('numericals', Pipeline([
                ('selector', PandasTypeSelector(include='number')),
                ('scaler', StandardScaler()),
                ('add_pca', FeatureUnion([
                    ('orig', IdentityTransformer()),
                    ('pca', PCA(2))
                ]))
            ])),  # numericals close
            
            # Part 2
            ('categoricals', Pipeline([
                ('selector', PandasTypeSelector(include='category')),
                ('labeler', StringIndexer()),
                ('encoder', OneHotEncoder(handle_unknown='ignore')),
            ]))  # categoricals close
        
        ])),  # features close
    ])  # pipeline close
    

    The data may contain boolean, numerical and categorical variables, but any of them may also be absent. The current behaviour is to raise an Exception when a selector returns an empty DataFrame. I think we can expose a parameter to let the user choose.

    Fixes # (issue)

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)

    Checklist:

    • [ ] My code follows the style guidelines (flake8)
    • [ ] I have commented my code, particularly in hard-to-understand areas
    • [ ] I have made corresponding changes to the documentation (also to the readme.md)
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] I have added tests to check whether the new feature adheres to the sklearn convention
    • [ ] New and existing unit tests pass locally with my changes

    If you feel your PR is ready for a review, ping @koaning or @mbrouns.

    opened by eromoe 3
  • [WIP] `get_feature_names_out` for `sklego.preprocessing`.

    This PR solves issue #543 and implements get_feature_names_out for all relevant transformers in sklego.preprocessing (i.e. transformers that do not contain the TrainOnlyTransformerMixin).

    Functionality is implemented through adding the _ClassNamePrefixFeaturesOutMixin to the class and making sure self._n_features_out is defined in .fit. This is also generally how scikit-learn implements get_feature_names_out for many of its transformers (Example). Unit tests are added for all new functionality.

    P.S. Don't pay attention to the commit history before October 10th. These changes have already been merged into koaning/scikit-lego/main, but are still displayed here as commit history. I will try to fix this. Suggestions for removing these redundant commits from the commit history of this CarloLepelaars/scikit-lego/ fork are welcome.

    • [x] ~~Find alternative solution for using _ClassNamePrefixFeaturesOutMixin so it works with scikit-learn on Python 3.7. (Remove Python 3.7. support)~~
    • [x] Add implementation of get_feature_names_out to contributing guidelines so people implement this for each new preprocessor that is not TrainOnly.
    • [x] Remove Python 3.7. GitHub Actions pipelines and update Optional dependencies GitHub Actions pipeline to use Python 3.8.
    • [x] Add general unit test that checks if get_feature_names_out can be called for all relevant preprocessors and EstimatorTransformer.
    opened by CarloLepelaars 14
Releases(0.6.14)
Owner
vincent d warmerdam
Solving problems involving data. Mostly NLP these days. AskMeAnything[tm].