A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning

Overview

imbalanced-learn

imbalanced-learn is a Python package offering a number of re-sampling techniques commonly used in datasets showing strong between-class imbalance. It is compatible with scikit-learn and is part of the scikit-learn-contrib projects.

Documentation

Installation documentation, API documentation, and examples can be found in the documentation.

Installation

Dependencies

imbalanced-learn is tested to work under Python 3.6+. The dependency requirements are based on the last scikit-learn release:

  • scipy(>=0.19.1)
  • numpy(>=1.13.3)
  • scikit-learn(>=0.23)
  • joblib(>=0.11)
  • keras 2 (optional)
  • tensorflow (optional)

Additionally, to run the examples, you need matplotlib(>=2.0.0) and pandas(>=0.22).
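
To check an existing environment against the minimum versions listed above, a small script along these lines can help. This is only a sketch: the helper names (version_tuple, at_least, check_minimums) are illustrative and not part of imbalanced-learn, and the version parsing is deliberately simple (it compares leading numeric components only).

```python
from importlib import metadata

# Minimum versions copied from the dependency list above.
MINIMUMS = {
    "scipy": "0.19.1",
    "numpy": "1.13.3",
    "scikit-learn": "0.23",
    "joblib": "0.11",
}


def version_tuple(version):
    """Turn '1.13.3' into (1, 13, 3), keeping the leading digits of each part."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            break  # a part such as 'dev0' ends the numeric prefix
        parts.append(int(digits))
    return tuple(parts)


def at_least(installed, minimum):
    """Compare two version strings after padding them to the same length."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_minimums(minimums=MINIMUMS):
    """Return {package: status}, status being 'ok', 'too old' or 'missing'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if at_least(installed, minimum) else "too old"
    return report


for package, status in check_minimums().items():
    print(f"{package}: {status}")
```

A package reported as "missing" or "too old" can then be installed or upgraded with pip or conda before installing imbalanced-learn.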

Installation

From PyPI or conda-forge repositories

imbalanced-learn is currently available on PyPI and you can install it via pip:

pip install -U imbalanced-learn

The package is also released on the Anaconda Cloud platform:

conda install -c conda-forge imbalanced-learn

From source available on GitHub

If you prefer, you can clone the repository and install from source. Use the following commands to get a copy from GitHub and install all dependencies:

git clone https://github.com/scikit-learn-contrib/imbalanced-learn.git
cd imbalanced-learn
pip install .

Be aware that you can install in developer mode with:

pip install --no-build-isolation --editable .

If you wish to make pull-requests on GitHub, we advise you to install pre-commit:

pip install pre-commit
pre-commit install

Testing

After installation, you can use pytest to run the test suite:

make coverage

Development

The development of this scikit-learn-contrib project follows that of the scikit-learn community, so you can refer to their Development Guide.

About

If you use imbalanced-learn in a scientific publication, we would appreciate citations to the following paper:

@article{JMLR:v18:16-365,
author  = {Guillaume  Lema{{\^i}}tre and Fernando Nogueira and Christos K. Aridas},
title   = {Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning},
journal = {Journal of Machine Learning Research},
year    = {2017},
volume  = {18},
number  = {17},
pages   = {1-5},
url     = {http://jmlr.org/papers/v18/16-365}
}

Most classification algorithms will only perform optimally when the number of samples of each class is roughly the same. Highly skewed datasets, where the minority class is heavily outnumbered by one or more classes, have proven to be a challenge while at the same time becoming more and more common.

One way of addressing this issue is by re-sampling the dataset so as to offset this imbalance, with the hope of arriving at a more robust and fair decision boundary than you would otherwise.

Re-sampling techniques are divided into four categories:
  1. Under-sampling the majority class(es).
  2. Over-sampling the minority class.
  3. Combining over- and under-sampling.
  4. Creating ensemble balanced sets.
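
The first two categories can be illustrated with a plain-Python sketch. It is not the library's implementation: the function names random_over_sample and random_under_sample are made up for this example, and under-sampling is done here without replacement for simplicity.

```python
import random
from collections import Counter, defaultdict


def _group_by_class(y):
    """Map each class label to the list of sample indices carrying it."""
    groups = defaultdict(list)
    for index, label in enumerate(y):
        groups[label].append(index)
    return groups


def random_over_sample(X, y, seed=0):
    """Duplicate minority samples (with replacement) up to the majority size."""
    rng = random.Random(seed)
    groups = _group_by_class(y)
    target = max(len(indices) for indices in groups.values())
    keep = []
    for indices in groups.values():
        extra = [rng.choice(indices) for _ in range(target - len(indices))]
        keep.extend(indices + extra)
    return [X[i] for i in keep], [y[i] for i in keep]


def random_under_sample(X, y, seed=0):
    """Drop majority samples (without replacement) down to the minority size."""
    rng = random.Random(seed)
    groups = _group_by_class(y)
    target = min(len(indices) for indices in groups.values())
    keep = []
    for indices in groups.values():
        keep.extend(rng.sample(indices, target))
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy dataset: 10 samples of class 0 and 2 samples of class 1.
X = [[float(i)] for i in range(12)]
y = [0] * 10 + [1] * 2
_, y_over = random_over_sample(X, y)
_, y_under = random_under_sample(X, y)
print(Counter(y_over))   # both classes now have 10 samples
print(Counter(y_under))  # both classes now have 2 samples
```

Over-sampling keeps every original sample but repeats minority ones, while under-sampling discards majority samples; combining the two is the third category.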

Below is a list of the methods currently implemented in this module.

  • Under-sampling
    1. Random majority under-sampling with replacement
    2. Extraction of majority-minority Tomek links [1]
    3. Under-sampling with Cluster Centroids
    4. NearMiss-(1 & 2 & 3) [2]
    5. Condensed Nearest Neighbour [3]
    6. One-Sided Selection [4]
    7. Neighbourhood Cleaning Rule [5]
    8. Edited Nearest Neighbours [6]
    9. Instance Hardness Threshold [7]
    10. Repeated Edited Nearest Neighbours [14]
    11. AllKNN [14]
  • Over-sampling
    1. Random minority over-sampling with replacement
    2. SMOTE - Synthetic Minority Over-sampling Technique [8]
    3. SMOTENC - SMOTE for Nominal and Continuous [8]
    4. SMOTEN - SMOTE for Nominal [8]
    5. bSMOTE(1 & 2) - Borderline SMOTE of types 1 and 2 [9]
    6. SVM SMOTE - Support Vectors SMOTE [10]
    7. ADASYN - Adaptive synthetic sampling approach for imbalanced learning [15]
    8. KMeans-SMOTE [17]
    9. ROSE - Random OverSampling Examples [19]
  • Over-sampling followed by under-sampling
    1. SMOTE + Tomek links [12]
    2. SMOTE + ENN [11]
  • Ensemble classifier using samplers internally
    1. Easy Ensemble classifier [13]
    2. Balanced Random Forest [16]
    3. Balanced Bagging
    4. RUSBoost [18]
  • Mini-batch resampling for Keras and Tensorflow

The different algorithms are presented in the sphinx-gallery.
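
As an illustration of how a synthetic over-sampler differs from plain duplication, the core idea of SMOTE [8] (interpolating between a minority sample and one of its nearest minority neighbours) can be sketched in plain Python. This is a simplified sketch and not the code used by the package; smote_sketch and its parameters are hypothetical names chosen for the example.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than other samples
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample (excluding itself)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority, n_new=5)
print(len(new_points))  # 5
```

Because every synthetic point needs a neighbour from the same class, a class with a single sample cannot be over-sampled this way; the sketch raises a ValueError in that case.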

References:

[1] : I. Tomek, “Two modifications of CNN,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 6, pp. 769-772, 1976.
[2] : I. Mani, J. Zhang. “kNN approach to unbalanced data distributions: A case study involving information extraction,” In Proceedings of the Workshop on Learning from Imbalanced Data Sets, pp. 1-7, 2003.
[3] : P. E. Hart, “The condensed nearest neighbor rule,” IEEE Transactions on Information Theory, vol. 14(3), pp. 515-516, 1968.
[4] : M. Kubat, S. Matwin, “Addressing the curse of imbalanced training sets: One-sided selection,” In Proceedings of the 14th International Conference on Machine Learning, vol. 97, pp. 179-186, 1997.
[5] : J. Laurikkala, “Improving identification of difficult small classes by balancing class distribution,” Proceedings of the 8th Conference on Artificial Intelligence in Medicine in Europe, pp. 63-66, 2001.
[6] : D. Wilson, “Asymptotic Properties of Nearest Neighbor Rules Using Edited Data,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 2(3), pp. 408-421, 1972.
[7] : M. R. Smith, T. Martinez, C. Giraud-Carrier, “An instance level analysis of data complexity,” Machine learning, vol. 95(2), pp. 225-256, 2014.
[8] : N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002.
[9] : H. Han, W.-Y. Wang, B.-H. Mao, “Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning,” In Proceedings of the 1st International Conference on Intelligent Computing, pp. 878-887, 2005.
[10] : H. M. Nguyen, E. W. Cooper, K. Kamei, “Borderline over-sampling for imbalanced data classification,” In Proceedings of the 5th International Workshop on computational Intelligence and Applications, pp. 24-29, 2009.
[11] : G. E. A. P. A. Batista, R. C. Prati, M. C. Monard, “A study of the behavior of several methods for balancing machine learning training data,” ACM Sigkdd Explorations Newsletter, vol. 6(1), pp. 20-29, 2004.
[12] : G. E. A. P. A. Batista, A. L. C. Bazzan, M. C. Monard, “Balancing training data for automated annotation of keywords: A case study,” In Proceedings of the 2nd Brazilian Workshop on Bioinformatics, pp. 10-18, 2003.
[13] : X.-Y. Liu, J. Wu and Z.-H. Zhou, “Exploratory undersampling for class-imbalance learning,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 39(2), pp. 539-550, 2009.
[14] : I. Tomek, “An experiment with the edited nearest-neighbor rule,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 6(6), pp. 448-452, 1976.
[15] : H. He, Y. Bai, E. A. Garcia, S. Li, “ADASYN: Adaptive synthetic sampling approach for imbalanced learning,” In Proceedings of the 5th IEEE International Joint Conference on Neural Networks, pp. 1322-1328, 2008.
[16] : C. Chao, A. Liaw, and L. Breiman. "Using random forest to learn imbalanced data." University of California, Berkeley 110 (2004): 1-12.
[17] : Felix Last, Georgios Douzas, Fernando Bacao, "Oversampling for Imbalanced Learning Based on K-Means and SMOTE"
[18] : Seiffert, C., Khoshgoftaar, T. M., Van Hulse, J., & Napolitano, A. "RUSBoost: A hybrid approach to alleviating class imbalance." IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans 40.1 (2010): 185-197.
[19] : Menardi, G., Torelli, N.: "Training and assessing classification rules with unbalanced data", Data Mining and Knowledge Discovery, 28, (2014): 92–122

Comments
  • Speed improvements

    I have a dataset which has around 150,000 entries. Exploring SMOTE sampling seems to be pretty slow as only a single core is used to perform calculations. Am I missing a configuration property? How else could I improve the speed of SMOTE?

    Type: Enhancement 
    opened by geoHeil 31
  • Issues using SMOTE

    Hi. First of all, thank you for providing us with this nice library.

    I have an imbalanced dataset and I've loaded it using pandas. When I supply the dataset as input to SMOTE, I get the following error:

    ValueError: Expected n_neighbors <= n_samples,  but n_samples = 1, n_neighbors = 6
    

    Thanks in Advance

    opened by Ayyappatheegala 30
  • [BUG] SMOTEEN and SMOTETomek run for ages on larger datasets on the new update

    I've been using SMOTETomek in production with success for a while. The 0.7.6 version runs through the dataset in around 5-8 min. After updating, the new version ran for 1.5 h before I killed the process.

    balancer = SMOTETomek(random_state=2425, n_jobs=-1)
    df_resampled, target_resampled = balancer.fit_resample(dataframe, target)
    return df_resampled, target_resampled
    
    opened by jruokolainen 29
  • [MRG] ENH: K-Means SMOTE implementation

    What does this implement/fix? Explain your changes.

    This pull request implements K-Means SMOTE, as described in Oversampling for Imbalanced Learning Based on K-Means and SMOTE by Last et al.

    Any other comments?

    The density estimation function has been changed slightly from the reference paper, as the power term yielded very large numbers. This caused the weighting to favour a single cluster.

    opened by StephanHeijl 25
  • [MRG] Address issue #113 - Create toy example for testing

    Address issue #113

    • Over-sampling
      • [x] ADASYN
      • [x] SMOTE
      • [x] ROS
    • Under-sampling
      • [x] CC
      • [x] CNN
      • [x] ENN
      • [x] RENN => PR #135 needs to be merged before writing this code
      • [x] AllKNN => PR #136 needs to be merged before writing this code
      • [x] IHT
      • [x] NearMiss
      • [x] OSS
      • [x] RUS
      • [x] Tomek
    • Combine
      • [x] SMOTE ENN
      • [x] SMOTE Tomek
    • Ensemble
      • [x] Easy Ensemble => PR #117 needs to be merged before writing this code
      • [x] Balance Cascade
    opened by glemaitre 25
  • [MRG+1] Rename all occurrences of size_ngh to n_neighbors for consistency with scikit-learn

    For consistency, I think that we should follow scikit-learn conventions in naming the parameters. I propose to change the size_ngh parameter to n_neighbors. Unfortunately, this change will affect the public API. It is an early modification, but it will break users' code. I don't know if we could merge this change without a deprecation warning.

    opened by chkoar 25
  • MNT blackify source code and add pre-commit

    Reference Issue

    Addressing https://github.com/scikit-learn-contrib/imbalanced-learn/issues/684

    What does this implement/fix? Explain your changes.

    Integrating black into the codebase, to keep the code format consistent.

    • [x] Integrate black
    • [x] Run black over all files
    • [x] Add black into precommit hook

    Any other comments?

    Open questions -

    1. Which requirements file should the black dependency be added to?
    2. line-length for black is currently set as 79. Is that alright?
    opened by akash-suresh 23
  • conda install version 0.3.0

    I used

    conda install -c glemaitre imbalanced-learn

    to install Imbalanced-learn. Instead of getting version 0.3.0, I have the older version

    #
    imbalanced-learn          0.2.1                    py27_0    glemaitre
    

    How do I install version 0.3.0 via conda install?

    Type: Bug Type: CI/CD 
    opened by ljiang14 22
  • ValueError: could not convert string to float: 'aaa'

    I have imbalanced classes with 10,000 1s and 10m 0s. I want to undersample before I convert category columns to dummies to save memory. I expected it would ignore the content of x and randomly select based on y. However I get the above error. What am I not understanding and how do I do this without converting category features to dummies first?

    clf_sample = RandomUnderSampler(ratio=.025)
    x = pd.DataFrame(np.random.random((100,5)), columns=list("abcde"))
    x.loc[:, "b"] = "aaa"
    clf_sample.fit(x, y.head(100))
    
    opened by simonm3 22
  • `ratio` should allow to specify which class to target when resampling

    TomekLinks and EditedNearestNeighbours only remove samples from the majority class. However, both methods are often used for data cleaning (removing samples from both classes) rather than for under-sampling (removing samples from the majority class only). Thus SMOTETomek and SMOTEENN are not implemented as proposed by Batista, Prati and Monard (2004), because the authors use Tomek links and ENN to remove samples from both the majority and the minority class.

    It would be great to have a parameter that lets you choose whether to remove samples from both classes or only from the majority class.

    Type: Enhancement Status: Blocker 
    opened by lmittmann 22
  • EHN: implementation of SMOTE-NC for continuous and categorical mixed types

    Reference Issue

    #401

    What does this implement/fix? Explain your changes.

    Implements SMOTE-NC as per paragraph 6.1 of the original SMOTE paper by N. V. Chawla, K. W. Bowyer, L. O. Hall and W. P. Kegelmeyer.

    Any other comments?

    Some parts are missing to make it ready to merge, but I would like to get an opinion on implementation first, especially on the part which deals with sparse matrices as I do not have much experience with them.

    Points to pay attention to:

    • working with sparse matrices
    • 2 FIXME points in code
    • 'fit' method expects 'feature_indices' keyword argument and issues a warning if it is not provided falling back to normal SMOTE. Raising an error would probably be better but this would break common estimator tests from sklearn (via imblearn/tests/test_common)
    opened by ddudnik 21
  • ValueError: Found array with dim 4. RandomOverSampler expected <= 2

    I want to apply RandomOverSampler to an image classification task, but I get "ValueError: Found array with dim 4. RandomOverSampler expected <= 2." How can I use imbalanced-learn in this case?

    opened by LHXqwq 1
  • [MRG] [ENH] Add sample_indices_ for SMOTE/ADASYN classes

    Adds the attribute sample_indices to the SMOTE/ADASYN classes, containing a tuple of the samples used to generate each new sample. For samples from the original dataset, it is the index of the original sample.

    Reference Issue

    Fixes #772

    What does this implement/fix? Explain your changes.

    • Adds a get_sample_indices() function that returns a tuple of the sample indices from which each new sample was created. For original samples of the dataset, [index, 0] is returned. Implemented for the SMOTE and ADASYN classes.
    • Adds tests for get_sample_indices() function.
    opened by JurajSlivka 2
  • WIP ENH Add fixture in common tests

    Reference Issue

    Fixes #672

    What does this implement/fix? Explain your changes.

    • Created a common fixture to create a sample dataset used in tests.
    • Replaced boilerplate code with fixture in necessary tests in estimator_checks.py.
    opened by awinml 1
  • Add MLSMOTE algorithm to imblearn

    What does this implement/fix? Explain your changes.

    This is an implementation of the Multilabel SMOTE (MLSMOTE) algorithm described in the paper:

    Charte, F. & Rivera Rivas, Antonio & Del Jesus, María José & Herrera, Francisco. (2015). MLSMOTE: Approaching imbalanced multilabel learning through synthetic instance generation. Knowledge-Based Systems. -. 10.1016/j.knosys.2015.07.019.

    It is an oversampling technique for which, AFAIK, there is no open-source implementation yet.

    Addresses: https://github.com/scikit-learn-contrib/imbalanced-learn/issues/340

    Any other comments?

    The implementation is ready to be reviewed. Once reviewed, I can squash the commits for cleaner history.

    opened by balvisio 6
  • ValueError: Found array with 0 sample(s) (shape=(0, 19)) while a minimum of 1 is required.

    So I'm new at programming and machine learning, and I'm using this code I found from a journal for spam detection. When I try to use it, the result is an error, even though I already prepared the data correctly. The error message is 'ValueError: Found array with 0 sample(s) (shape=(0, 19)) while a minimum of 1 is required.' Can anyone please help me out with this issue? The complete code is available at https://github.com/ijdutse/spd

    #!/usr/bin/env python3
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from collections import defaultdict, Counter
    from datetime import datetime
    import preprocessor as p
    import random, os, utils, smart_open, json, codecs, pickle, time
    import gensim
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    from scipy.fftpack import fft
    
    data_sources = ['phone.json']
    
    def main():
        spd = Spd(data_sources) #class instantiation
        start = time.process_time()
        relevant_tweets = spd.detector(data_sources)
        stop = time.process_time()
        return relevant_tweets
    
    
    
    
    class Spd:
        """ some functions to accept raw files, extract relevant fields and filter our irrelevent content"""
        def __init__(self, data_sources):
            self.data_sources = data_sources
        pass
            
        # first function in the class:
        def extractor(self, data_sources): # accept list of files consisting of raw tweets in form of json object
            data_extracts = {'TweetID':[],'ScreenName':[],'RawTweets':[],'CreatedAt':[],'RetweetCount':[],\
                             'FollowersCount':[],'FriendsCount':[], 'StatusesCount':[],'FavouritesCount':[],\
                             'UserName':[],'Location':[],'AccountCreated':[],'Language':[],'Description':[],\
                             'UserURL':[],'VerifiedAccount':[],'CleanTweets':[],'UserID':[], 'TimeZone':[],'TweetFavouriteCount':[]}
            non_english_tweets = 0 # keep track of the non-English tweets
            with codecs.open('phone.json', 'r') as f: # data_source is read from extractor() function
                for line in f.readlines():
                    non_English = 0
                    try:
                        line = json.loads(line)
                        if line['lang'] in ['en','en-gb','en-GB','en-AU','en-IN','en_US']:
                            data_extracts['Language'].append(line['Language'])
                            data_extracts['TweetID'].append(line['TweetID'])
                            data_extracts['RawTweets'].append(line['RawTweets'])
                            data_extracts['CleanTweets'].append(p.clean(line['RawTweets']))
                            data_extracts['CreatedAt'].append(line['CreatedAt'])
                            data_extracts['AccountCreated'].append(line['AccountCreated'])                       
                            data_extracts['ScreenName'].append(line['ScreenName'])                          
                            data_extracts['RetweetCount'].append(line['RetweetCount'])
                            data_extracts['FollowersCount'].append(line['FollowersCount'])
                            data_extracts['FriendsCount'].append(line['FriendsCount'])
                            data_extracts['StatusesCount'].append(line['StatusesCount'])
                            data_extracts['FavouritesCount'].append(line['FavouritesCount'])
                            data_extracts['UserName'].append(line['UserName'])
                            data_extracts['Location'].append(line['Location'])
                            data_extracts['Description'].append(line['Description'])
                            data_extracts['UserURL'].append(line['UserURL'])
                            data_extracts['VerifiedAccount'].append(line['VerifiedAccount'])
                            data_extracts['UserID'].append(line['UserID'])
                            data_extracts['TimeZone'].append(line['TimeZone'])
                            data_extracts['TweetFavouriteCount'].append(line['TweetFavouriteCount'])
                        else:
                            non_english_tweets +=1
                    except:
                        continue
                df0 = pd.DataFrame(data_extracts) #convert data extracts to pandas DataFrame
                df0['CreatedAt']=pd.to_datetime(data_extracts['CreatedAt'],errors='coerce') # convert to datetime
                df0['AccountCreated']=pd.to_datetime(data_extracts['AccountCreated'],errors='coerce')
                df0 = df0.dropna(subset=['AccountCreated','CreatedAt']) # drop na in datetime
                AccountAge = [] # compute the account age of accounts
                date_format = "%Y-%m-%d  %H:%M:%S"
                for dr,dc in zip(df0.CreatedAt, df0.AccountCreated):
                    #try:
                    dr = str(dr)
                    dc = str(dc)
                    d1 = datetime.strptime(dr,date_format)
                    d2 = datetime.strptime(dc,date_format)
                    dif = d1 - d2
                    AccountAge.append(dif.days)
                    #except:
                        #continue
                df0['AccountAge']=AccountAge
                # add/define additional features ...
                df0['Retweets'] = df0.RawTweets.apply(lambda x: str(x).split()[0]=='RT' )
                df0['RawTweetsLen'] = df0.RawTweets.apply(lambda x: len(str(x))) # modified
                df0['DescriptionLen'] = df0.Description.apply(lambda x: len(str(x)))
                df0['UserNameLen'] = df0.UserName.apply(lambda x: len(str(x)))
                df0['ScreenNameLen'] = df0.ScreenName.apply(lambda x: len(str(x)))
                df0['LocationLen'] = df0.Location.apply(lambda x: len(str(x)))
                df0['Activeness'] = df0.StatusesCount.truediv(df0.AccountAge)
                df0['Friendship'] = df0.FriendsCount.truediv(df0.FollowersCount)
                df0['Followership'] = df0.FollowersCount.truediv(df0.FriendsCount)
                df0['Interestingness'] = df0.FavouritesCount.truediv(df0.StatusesCount)
                df0['BidirFriendship'] = (df0.FriendsCount + df0.FollowersCount).truediv(df0.FriendsCount)
                df0['BidirFollowership'] = (df0.FriendsCount + df0.FollowersCount).truediv(df0.FollowersCount)
                df0['NamesRatio'] = df0.ScreenNameLen.truediv(df0.UserNameLen)
                df0['CleanTweetsLen'] = df0.CleanTweets.apply(lambda x: len(str(x)))
                df0['LexRichness'] = df0.CleanTweetsLen.truediv(df0.RawTweetsLen)       
                # Remove all RTs, set UserID as index and save relevant files:
                df0 = df0[df0.Retweets.values==False] # remove retweets
                df0 = df0.set_index('UserID')
                df0 = df0[~df0.index.duplicated()] # remove duplicates in the tweet
                #df0.to_csv(data_source[:15]+'all_extracts.csv') #save all extracts as csv
                df0.to_csv(data_sources[:5]+'all_extracts.csv') #save all extracts as csv 
                with open(data_sources[:5]+'non_English.txt','w') as d: # save count of non-English tweets
                    d.write('{}'.format(non_english_tweets))
                    d.close()
            return df0
    
        
        def detector(self, data_sources): # accept list of raw tweets as json objects
            self.data_sources = data_sources
            for data_sources in data_sources:
                self.data_sources = data_sources
                df0 = self.extractor(data_sources)
                #drop fields not required for predicition
                X = df0.drop(['Language','TweetID','RawTweets','CleanTweets','CreatedAt','AccountCreated','ScreenName',\
                     'Retweets','UserName','Location','Description','UserURL','VerifiedAccount','RetweetCount','TimeZone','TweetFavouriteCount'], axis=1)
                X = X.replace([np.inf,-np.inf],np.nan) # replace infinity values to avoid 0 division ...
                X = X.dropna()
                # reload the trained model for use:
                spd_filter=pickle.load(open('trained_rf.pkl','rb'))
                PredictedClass = spd_filter.predict(X) # Predict spam or automated accounts/tweets:
                X['PredictedClass'] = PredictedClass # include the predicted class in the dataframe
                nonspam = df0.loc[X.PredictedClass.values==1] # sort out the nonspam accounts
                spam = df0.loc[X.PredictedClass.values==0] # sort out spam/automated accounts
                #relevant_tweets = nonspam[['CreatedAt', 'CleanTweets']]
                relevant_tweets = nonspam[['CreatedAt','AccountCreated','ScreenName','Location','TimeZone','Description','VerifiedAccount','RawTweets', 'CleanTweets','TweetFavouriteCount','Retweets']]
                relevant_tweets = relevant_tweets.reset_index() # reset index and remove it from the dataframe
                #relevant_tweets = relevant_tweets.drop('UserID', axis=1) 
                # save files:
                X.to_csv(data_sources[:5]+'_all_predicted_classes.csv') #save all extracts as csv, used to be 15
                nonspam.to_csv(data_sources[:5]+'_nonspam_accounts.csv')
                spam.to_csv(data_sources[:5]+'_spam_accounts.csv')
                relevant_tweets.to_csv(data_sources[:5]+'_relevant_tweets.csv') # relevant tweets for subsequent analysis
            return relevant_tweets # or return relevant_tweets, nonspam, spam
    
    if __name__ =='__main__':
        main()
    
    Status: More Info Needed 
    opened by balalaunicorn 2
  • [BUG] The estimator_ in CondensedNearestNeighbour() is incorrect for multiple classes

    Describe the bug

    The estimator_ object fit by CondensedNearestNeighbour() (and probably other sampling strategies) is incorrect when y has multiple classes (and possibly also for binary classes). In particular, the estimator is only fit to a subset of 2 of the classes.

    Steps/Code to Reproduce

    from sklearn.datasets import make_blobs
    from sklearn.neighbors import KNeighborsClassifier
    from imblearn.under_sampling import CondensedNearestNeighbour
    
    n_clusters = 10
    X, y = make_blobs(n_samples=2000, centers=n_clusters, n_features=2, cluster_std=.5, random_state=0)
    
    n_neighbors = 1
    condenser = CondensedNearestNeighbour(sampling_strategy='all', n_neighbors=n_neighbors)
    X_cond, y_cond = condenser.fit_resample(X, y)
    print('condenser.estimator_.classes_', condenser.estimator_.classes_) # this should have 10 classes, which it does!
    print("condenser.estomator_ accuracy", condenser.estimator_.score(X, y))
    
    condenser.estimator_.classes_ [5 9]
    condenser.estomator_ accuracy 0.2
    
    # I think the estimator we want should look like this
    knn_cond_manual = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X_cond, y_cond)
    print('knn_cond_manual.classes_', knn_cond_manual.classes_)  # yes 10 classes!
    print("Manual KNN on condensted data accuracy", knn_cond_manual.score(X, y)) # good accuracy!
    
    knn_cond_manual.classes_ [0 1 2 3 4 5 6 7 8 9]
    Manual KNN on condensted data accuracy 0.996
    

    The issue

    The issue is that we set estimator_ in each run of the loop in _fit_resample (e.g. this line). We should really set estimator_ after the loop ends, on the condensed dataset.

    This looks like it's also an issue with OneSidedSelection and possibly other samplers.

    Fix

    I think we should just add the following directly before the return statement in fit_resample

    X_condensed, y_condensed = _safe_indexing(X, idx_under), _safe_indexing(y, idx_under)
    self.estimator_.fit(X_condensed, y_condensed)
    return X_condensed, y_condensed
    

    Versions

    
    System:
        python: 3.8.12 (default, Oct 12 2021, 06:23:56)  [Clang 10.0.0 ]
    executable: /Users/iaincarmichael/anaconda3/envs/comp_onc/bin/python
       machine: macOS-10.16-x86_64-i386-64bit
    
    Python dependencies:
          sklearn: 1.1.1
              pip: 21.2.4
       setuptools: 58.0.4
            numpy: 1.21.4
            scipy: 1.7.3
           Cython: 0.29.25
           pandas: 1.3.5
       matplotlib: 3.5.0
           joblib: 1.1.0
    threadpoolctl: 2.2.0
    
    Built with OpenMP: True
    
    threadpoolctl info:
           filepath: /Users/iaincarmichael/anaconda3/envs/comp_onc/lib/python3.8/site-packages/sklearn/.dylibs/libomp.dylib
             prefix: libomp
           user_api: openmp
       internal_api: openmp
            version: None
        num_threads: 8
    
           filepath: /Users/iaincarmichael/anaconda3/envs/comp_onc/lib/python3.8/site-packages/numpy/.dylibs/libopenblas.0.dylib
             prefix: libopenblas
           user_api: blas
       internal_api: openblas
            version: 0.3.17
        num_threads: 4
    threading_layer: pthreads
       architecture: Haswell
    
           filepath: /Users/iaincarmichael/anaconda3/envs/comp_onc/lib/libmkl_rt.1.dylib
             prefix: libmkl_rt
           user_api: blas
       internal_api: mkl
            version: 2021.4-Product
        num_threads: 4
    threading_layer: intel
    
           filepath: /Users/iaincarmichael/anaconda3/envs/comp_onc/lib/libomp.dylib
             prefix: libomp
           user_api: openmp
       internal_api: openmp
            version: None
        num_threads: 8
    
    Type: Bug Package: under_sampling 
    opened by idc9 0
Releases(0.10.0)
  • 0.10.0(Dec 9, 2022)

    Changelog

    Bug fixes

    • Make sure that Substitution is working with python -OO that replaces doc by None. #953 by Guillaume Lemaitre.

    Compatibility

    Deprecation

    Enhancements

    • Add support to accept compatible NearestNeighbors objects by only duck-typing. For instance, it allows cuML instances to be accepted. #858 by NV-jpt and Guillaume Lemaitre.
  • 0.9.1(May 16, 2022)

  • 0.9.0(Jan 16, 2022)

  • 0.8.1(Sep 29, 2021)

  • 0.8.0(Feb 18, 2021)

    Version 0.8.0

    February 18, 2021

    Changelog

    New features

    • Add the function imblearn.metrics.macro_averaged_mean_absolute_error returning the average across classes of the MAE. This metric is used in ordinal classification. #780 by Aurélien Massiot.
    • Add the class imblearn.metrics.pairwise.ValueDifferenceMetric to compute pairwise distances between samples containing only categorical values. #796 by Guillaume Lemaitre.
    • Add the class imblearn.over_sampling.SMOTEN to over-sample data only containing categorical features. #802 by Guillaume Lemaitre.
    • Add the possibility to pass any type of sampler to imblearn.ensemble.BalancedBaggingClassifier, unlocking the implementation of methods based on resampled bagging. #808 by Guillaume Lemaitre.
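
To make the macro-averaged MAE concrete, here is a standard-library sketch of the computation as described above: the MAE is computed within each true class and the per-class values are then averaged. The function name `macro_averaged_mae` is hypothetical and this is not the library's implementation.

```python
from collections import defaultdict


def macro_averaged_mae(y_true, y_pred):
    """Average of the per-class mean absolute errors."""
    errors = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        errors[t].append(abs(t - p))  # group errors by the true class
    per_class = [sum(v) / len(v) for v in errors.values()]
    return sum(per_class) / len(per_class)


y_true = [1, 1, 1, 1, 2, 3]
y_pred = [1, 1, 1, 1, 3, 1]

plain_mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
macro_mae = macro_averaged_mae(y_true, y_pred)
```

On this toy data the plain MAE is dominated by the majority class, while the macro average weights each class equally, so errors on the rare classes count for more.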

    Enhancements

    • Add option output_dict in imblearn.metrics.classification_report_imbalanced to return a dictionary instead of a string. #770 by Guillaume Lemaitre.
    • Added an option to generate a smoothed bootstrap in imblearn.over_sampling.RandomOverSampler. It is controlled by the parameter shrinkage. This method is also known as Random Over-Sampling Examples (ROSE). #754 by Andrea Lorenzon and Guillaume Lemaitre.
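
The idea of a smoothed bootstrap can be sketched with the standard library alone: draw a sample with replacement, then perturb it with Gaussian noise whose scale grows with a shrinkage factor. The function `smoothed_bootstrap` is a hypothetical name, and scaling the noise by the per-feature standard deviation is a simplifying assumption; the bandwidth rule used by the library may differ.

```python
import random
import statistics


def smoothed_bootstrap(samples, n_new, shrinkage=1.0, seed=0):
    """Resample `samples` with replacement and add Gaussian noise."""
    rng = random.Random(seed)
    n_features = len(samples[0])
    # Per-feature standard deviation sets the scale of the perturbation
    # (a simplification of the smoothing bandwidth).
    scales = [
        shrinkage * statistics.pstdev([row[j] for row in samples])
        for j in range(n_features)
    ]
    new_samples = []
    for _ in range(n_new):
        base = rng.choice(samples)  # plain bootstrap draw
        new_samples.append(
            [base[j] + rng.gauss(0.0, scales[j]) for j in range(n_features)]
        )
    return new_samples


minority = [[0.0, 1.0], [1.0, 3.0], [2.0, 2.0]]
exact_copies = smoothed_bootstrap(minority, 5, shrinkage=0.0)
smoothed = smoothed_bootstrap(minority, 5, shrinkage=0.5)
```

With `shrinkage=0.0` the noise vanishes and the sketch reduces to ordinary random over-sampling, which duplicates existing points.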

    Bug fixes

    • Fix a bug in imblearn.under_sampling.ClusterCentroids where voting="hard" could have led to selecting a sample from any class instead of the targeted class. #769 by Guillaume Lemaitre.
    • Fix a bug in imblearn.FunctionSampler where validation was performed even with validate=False when calling fit. #790 by Guillaume Lemaitre.

    Maintenance

    • Remove requirements files in favour of adding the packages in the extras_require within the setup.py file. #816 by Guillaume Lemaitre.
    • Change the website template to use pydata-sphinx-theme. #801 by Guillaume Lemaitre.

    Deprecation

    • The context manager imblearn.utils.testing.warns is deprecated in 0.8 and will be removed in 1.0. #815 by Guillaume Lemaitre.
  • 0.7.0(Jun 9, 2020)

  • 0.6.2(Feb 16, 2020)

    This is a bug-fix release to resolve some issues regarding the handling of the input and output formats of the arrays.

    Changelog

    • Allow column vectors to be passed as targets. #673 by @chkoar.
    • Better input/output handling for pandas, numpy and plain lists. #681 by @chkoar.
  • 0.6.1(Dec 7, 2019)

    This is a bug-fix release to primarily resolve some packaging issues in version 0.6.0. It also includes minor documentation improvements and some bug fixes.

    Changelog

    Bug fixes

    • Fix a bug in :class:imblearn.ensemble.BalancedRandomForestClassifier leading to a wrong number of samples used during fitting due to max_samples and therefore a bad computation of the OOB score. :pr:656 by :user:Guillaume Lemaitre <glemaitre>.
  • 0.6.0(Dec 5, 2019)

    Changelog

    Changed models

    The following models might give some different sampling due to changes in scikit-learn:

    • :class:imblearn.under_sampling.ClusterCentroids
    • :class:imblearn.under_sampling.InstanceHardnessThreshold

    The following samplers will give different results due to changes linked to the internal usage of the random state:

    • :class:imblearn.over_sampling.SMOTENC

    Bug fixes

    • :class:imblearn.under_sampling.InstanceHardnessThreshold now takes into account the random_state and will give deterministic results. In addition, cross_val_predict is used to take advantage of the parallelism. :pr:599 by :user:Shihab Shahriar Khan <Shihab-Shahriar>.

    • Fix a bug in :class:imblearn.ensemble.BalancedRandomForestClassifier leading to a wrong computation of the OOB score. :pr:656 by :user:Guillaume Lemaitre <glemaitre>.

    Maintenance

    • Update imports from scikit-learn after some modules were made private. The following imports have been changed: :class:sklearn.ensemble._base._set_random_states, :class:sklearn.ensemble._forest._parallel_build_trees, :class:sklearn.metrics._classification._check_targets, :class:sklearn.metrics._classification._prf_divide, :class:sklearn.utils.Bunch, :class:sklearn.utils._safe_indexing, :class:sklearn.utils._testing.assert_allclose, :class:sklearn.utils._testing.assert_array_equal, :class:sklearn.utils._testing.SkipTest. :pr:617 by :user:Guillaume Lemaitre <glemaitre>.

    • Synchronize :mod:imblearn.pipeline with :mod:sklearn.pipeline. :pr:620 by :user:Guillaume Lemaitre <glemaitre>.

    • Synchronize :class:imblearn.ensemble.BalancedRandomForestClassifier and add parameters max_samples and ccp_alpha. :pr:621 by :user:Guillaume Lemaitre <glemaitre>.

    Enhancement

    • :class:imblearn.under_sampling.RandomUnderSampler, :class:imblearn.over_sampling.RandomOverSampler, and :func:imblearn.datasets.make_imbalance accept a Pandas DataFrame as input and will output a Pandas DataFrame. Similarly, they accept a Pandas Series as input and will output a Pandas Series. :pr:636 by :user:Guillaume Lemaitre <glemaitre>.

    • :class:imblearn.FunctionSampler accepts a parameter validate allowing to check or not the input X and y. :pr:637 by :user:Guillaume Lemaitre <glemaitre>.

    • :class:imblearn.under_sampling.RandomUnderSampler, :class:imblearn.over_sampling.RandomOverSampler can resample when non finite values are present in X. :pr:643 by :user:Guillaume Lemaitre <glemaitre>.

    • All samplers will output a Pandas DataFrame if a Pandas DataFrame was given as an input. :pr:644 by :user:Guillaume Lemaitre <glemaitre>.

    • The sample generation in :class:imblearn.over_sampling.SMOTE, :class:imblearn.over_sampling.BorderlineSMOTE, :class:imblearn.over_sampling.SVMSMOTE, :class:imblearn.over_sampling.KMeansSMOTE, :class:imblearn.over_sampling.SMOTENC is now vectorized, giving an additional speed-up when X is sparse. :pr:596 by :user:Matt Eding <MattEding>.
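
What "vectorized" sample generation means can be sketched with NumPy for the dense case. This is an illustration of SMOTE-style interpolation under stated assumptions, not the library's code: the neighbor indices are hard-coded rather than computed, the variable names are hypothetical, and sparse input is not covered.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minority-class samples and, for each one, the indices of two neighbors
# (assumed to be precomputed by a nearest-neighbors search).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
neighbors = np.array([[1, 2], [0, 3], [0, 3], [1, 2]])

n_new = 6
rows = rng.integers(0, X.shape[0], size=n_new)          # base sample per new point
cols = rng.integers(0, neighbors.shape[1], size=n_new)  # which neighbor to use
steps = rng.uniform(size=(n_new, 1))                    # interpolation factor in [0, 1)

# One array expression replaces a Python loop over the new samples.
X_new = X[rows] + steps * (X[neighbors[rows, cols]] - X[rows])

# Equivalent loop, kept only to show what the expression above computes.
X_loop = np.array(
    [X[r] + s * (X[neighbors[r, c]] - X[r]) for r, c, s in zip(rows, cols, steps[:, 0])]
)
```

Each new point lies on the segment between a base sample and one of its neighbors; the vectorized form computes all of them in a single broadcasted operation.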

    Deprecation

    • The following classes have been removed after 2 deprecation cycles: ensemble.BalanceCascade and ensemble.EasyEnsemble. :pr:617 by :user:Guillaume Lemaitre <glemaitre>.

    • The following functions have been removed after 2 deprecation cycles: utils.check_ratio. :pr:617 by :user:Guillaume Lemaitre <glemaitre>.

    • The parameters ratio and return_indices have been removed from all samplers. :pr:617 by :user:Guillaume Lemaitre <glemaitre>.

    • The parameters m_neighbors, out_step, kind, svm_estimator have been removed from the :class:imblearn.over_sampling.SMOTE. :pr:617 by :user:Guillaume Lemaitre <glemaitre>.

  • 0.5.0(Jun 28, 2019)

    Version 0.5.0

    Changed models

    The following models or functions might give different results even if the data X and y are the same.

    • :class:imblearn.ensemble.RUSBoostClassifier default estimator changed from :class:sklearn.tree.DecisionTreeClassifier with full depth to a decision stump (i.e., tree with max_depth=1).

    Documentation

    • Correct the definition of the ratio when using a float in sampling strategy for the over-sampling and under-sampling. :issue:525 by :user:Ariel Rossanigo <arielrossanigo>.

    • Add :class:imblearn.over_sampling.BorderlineSMOTE and :class:imblearn.over_sampling.SVMSMOTE in the API documentation. :issue:530 by :user:Guillaume Lemaitre <glemaitre>.

    Enhancement

    • Add Parallelisation for SMOTEENN and SMOTETomek. :pr:547 by :user:Michael Hsieh <Microsheep>.

    • Add :class:imblearn.utils._show_versions. Updated the contribution guide and issue template showing how to print system and dependency information from the command line. :pr:557 by :user:Alexander L. Hayes <batflyer>.

    • Add :class:imblearn.over_sampling.KMeansSMOTE, an over-sampler that clusters points before applying SMOTE. :pr:435 by :user:Stephan Heijl <StephanHeijl>.

    Maintenance

    • Make it possible to import imblearn and access submodule. :pr:500 by :user:Guillaume Lemaitre <glemaitre>.

    • Remove support for Python 2, remove deprecation warning from scikit-learn 0.21. :pr:576 by :user:Guillaume Lemaitre <glemaitre>.

    Bug

    • Fix wrong usage of :class:keras.layers.BatchNormalization in porto_seguro_keras_under_sampling.py example. The batch normalization was moved before the activation function and the bias was removed from the dense layer. :pr:531 by :user:Guillaume Lemaitre <glemaitre>.

    • Fix a bug that converted the matrices to the COO sparse format when stacking them in :class:imblearn.over_sampling.SMOTENC. This bug only affected old scipy versions. :pr:539 by :user:Guillaume Lemaitre <glemaitre>.

    • Fix bug in :class:imblearn.pipeline.Pipeline where None could be the final estimator. :pr:554 by :user:Oliver Rausch <orausch>.

    • Fix bug in :class:imblearn.over_sampling.SVMSMOTE and :class:imblearn.over_sampling.BorderlineSMOTE where the default parameter of n_neighbors was not set properly. :pr:578 by :user:Guillaume Lemaitre <glemaitre>.

    • Fix bug by changing the default depth in :class:imblearn.ensemble.RUSBoostClassifier to get a decision stump as a weak learner as in the original paper. :pr:545 by :user:Christos Aridas <chkoar>.

    • Allow importing keras directly from tensorflow in :mod:imblearn.keras. :pr:531 by :user:Guillaume Lemaitre <glemaitre>.

  • 0.4.3(Nov 6, 2018)

  • 0.4.2(Oct 21, 2018)

    Version 0.4.2

    Bug fixes

    • Fix a bug in imblearn.over_sampling.SMOTENC in which the median of the standard deviation was used instead of half of the median of the standard deviation. By Guillaume Lemaitre in #491.
    • Raise an error when passing a target that is not supported, i.e. a regression target or multilabel targets. Imbalanced-learn does not support this case. By Guillaume Lemaitre in #490.
  • 0.4.1(Oct 12, 2018)

    Version 0.4

    October, 2018

    Version 0.4 is the last version of imbalanced-learn to support Python 2.7 and Python 3.4. Imbalanced-learn 0.5 will require Python 3.5 or higher.

    Highlights

    This release brings a set of new features as well as some API changes to strengthen the foundation of imbalanced-learn.

    As new features, 2 new modules, imblearn.keras and imblearn.tensorflow, have been added in which imbalanced-learn samplers can be used to generate balanced mini-batches.

    The module imblearn.ensemble has been consolidated with new classifiers: imblearn.ensemble.BalancedRandomForestClassifier, imblearn.ensemble.EasyEnsembleClassifier, imblearn.ensemble.RUSBoostClassifier.

    Support for strings has been added in imblearn.over_sampling.RandomOverSampler and imblearn.under_sampling.RandomUnderSampler. In addition, a new class imblearn.over_sampling.SMOTENC allows generating samples from data sets containing both continuous and categorical features.

    The imblearn.over_sampling.SMOTE has been simplified and broken down into 2 additional classes: imblearn.over_sampling.SVMSMOTE and imblearn.over_sampling.BorderlineSMOTE.

    There are also some changes regarding the API: the parameter sampling_strategy has been introduced to replace the ratio parameter. In addition, the return_indices argument has been deprecated and all samplers will expose a sample_indices_ attribute whenever this is possible.
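
The API change described above can be pictured with a standard-library sketch: a dict-style sampling strategy maps each class label to the number of samples to keep, and the indices of the retained samples are recorded so they can be inspected afterwards. The function `random_under_sample` is a hypothetical illustration of that idea and not the library's sampler.

```python
import random
from collections import Counter


def random_under_sample(y, sampling_strategy, seed=0):
    """Return the sorted indices of the samples kept for each class."""
    rng = random.Random(seed)
    kept = []
    for label, n_wanted in sampling_strategy.items():
        candidates = [i for i, lab in enumerate(y) if lab == label]
        kept.extend(rng.sample(candidates, n_wanted))  # draw without replacement
    return sorted(kept)


y = ["a"] * 8 + ["b"] * 3

# Keep 3 samples of each class; the returned indices play the role that
# an attribute such as sample_indices_ plays on a fitted sampler.
sample_indices = random_under_sample(y, {"a": 3, "b": 3})
y_resampled = [y[i] for i in sample_indices]
class_counts = Counter(y_resampled)
```

Exposing the selected indices as a result, rather than through an optional return value, keeps the resampling call signature the same for every sampler.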

  • 0.4.0(Oct 12, 2018)

  • 0.3.4(Sep 7, 2018)

  • 0.3.3(Feb 22, 2018)

  • 0.3.1(Oct 9, 2017)

  • 0.3.0(Oct 9, 2017)

    What's new in version 0.3.0

    Testing

    • Pytest is used instead of nosetests. :issue:321 by Joan Massich_.

    Documentation

    • Added a User Guide and extended some examples. :issue:295 by Guillaume Lemaitre_.

    Bug fixes

    • Fixed a bug in :func:utils.check_ratio such that an error is raised when the number of samples required is negative. :issue:312 by Guillaume Lemaitre_.

    • Fixed a bug in :class:under_sampling.NearMiss version 3. The indices returned were wrong. :issue:312 by Guillaume Lemaitre_.

    • Fixed bug for :class:ensemble.BalanceCascade and :class:combine.SMOTEENN and :class:SMOTETomek. :issue:295 by Guillaume Lemaitre_.

    • Fixed bug for check_ratio to be able to pass arguments when ratio is a callable. :issue:307 by Guillaume Lemaitre_.

    New features

    • Turn off steps in :class:pipeline.Pipeline using the None object. By Christos Aridas_.

    • Add a fetching function :func:datasets.fetch_datasets in order to get some imbalanced datasets useful for benchmarking. :issue:249 by Guillaume Lemaitre_.

    Enhancement

    • All samplers accept sparse matrices, defaulting to the CSR type. :issue:316 by Guillaume Lemaitre_.

    • :func:datasets.make_imbalance takes a ratio similarly to other samplers. It supports multiclass. :issue:312 by Guillaume Lemaitre_.

    • All the unit tests have been factorized and a :func:utils.check_estimators has been derived from scikit-learn. By Guillaume Lemaitre_.

    • Script for automatic build of conda packages and uploading. :issue:242 by Guillaume Lemaitre_

    • Remove seaborn dependence and improve the examples. :issue:264 by Guillaume Lemaitre_.

    • Adapt all classes to multi-class resampling. :issue:290 by Guillaume Lemaitre_.

    API changes summary

    • __init__ has been removed from the :class:base.SamplerMixin to create a real mixin class. :issue:242 by Guillaume Lemaitre_.

    • Creation of a module :mod:exceptions to handle consistent raising of errors. :issue:242 by Guillaume Lemaitre_.

    • Creation of a module utils.validation to check recurrent patterns. :issue:242 by Guillaume Lemaitre_.

    • Move the under-sampling methods into the prototype_selection and prototype_generation submodules to make a clearer distinction. :issue:277 by Guillaume Lemaitre_.

    • Change ratio such that it can adapt to multiple class problems. :issue:290 by Guillaume Lemaitre_.

    Deprecation

    • Deprecation of the use of min_c_ in :func:datasets.make_imbalance. :issue:312 by Guillaume Lemaitre_.

    • Deprecation of the use of float in :func:datasets.make_imbalance for the ratio parameter. :issue:290 by Guillaume Lemaitre_.

    • Deprecate the use of float as ratio in favor of dictionary, string, or callable. :issue:290 by Guillaume Lemaitre_.

  • 0.2.0(Dec 31, 2016)

  • 0.2.0.dev0(Sep 1, 2016)

  • 0.1.6(Aug 9, 2016)

  • 0.1.5(Jul 31, 2016)

  • 0.1.4(Jul 31, 2016)

  • 0.1.3(Jul 19, 2016)

  • 0.1.2(Jul 19, 2016)

Owner
scikit-learn compatible projects