scikit-learn cross validators for iterative stratification of multilabel data

Overview

iterative-stratification

iterative-stratification is a project that provides scikit-learn compatible cross validators with stratification for multilabel data.

scikit-learn currently provides several cross validators with stratification, but none of them can stratify multilabel data. This iterative-stratification project offers implementations of MultilabelStratifiedKFold, RepeatedMultilabelStratifiedKFold, and MultilabelStratifiedShuffleSplit, built on a base algorithm for stratifying multilabel data described in the following paper:

Sechidis K., Tsoumakas G., Vlahavas I. (2011) On the Stratification of Multi-Label Data. In: Gunopulos D., Hofmann T., Malerba D., Vazirgiannis M. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2011. Lecture Notes in Computer Science, vol 6913. Springer, Berlin, Heidelberg.
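
As a rough illustration of the idea in that paper, the sketch below assigns samples to folds one label at a time, starting with the rarest label and always choosing the fold that still needs the most positives for it. It is a simplified NumPy-only sketch, not this package's implementation: the function name iterative_stratification_folds, the equal fold ratios, and the tie-breaking rules are assumptions made for the example.

```python
import numpy as np


def iterative_stratification_folds(y, n_splits, seed=0):
    """Assign each row of a binary label matrix y to one of n_splits folds."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=bool)
    n_samples, n_labels = y.shape

    ratios = np.full(n_splits, 1.0 / n_splits)    # assumption: equal-sized folds
    want_total = n_samples * ratios                # samples each fold still wants
    want_label = np.outer(y.sum(axis=0), ratios)   # positives each fold still wants, per label
    folds = np.full(n_samples, -1)

    while True:
        unassigned = folds == -1
        remaining = y[unassigned].sum(axis=0)      # unassigned positives per label
        if not remaining.any():
            break
        # Rarest label that still has unassigned positives
        # (ties go to the lowest label index, a simplification).
        label = int(np.argmin(np.where(remaining > 0, remaining, np.inf)))
        for i in np.flatnonzero(unassigned & y[:, label]):
            # Fold that most needs this label; break ties by the fold that
            # needs the most samples overall, then at random.
            best = np.flatnonzero(want_label[label] == want_label[label].max())
            if best.size > 1:
                need = want_total[best]
                best = best[need == need.max()]
            fold = rng.choice(best)
            folds[i] = fold
            want_label[y[i], fold] -= 1            # update every label this sample carries
            want_total[fold] -= 1

    # Rows without any positive label fill up whichever fold is emptiest.
    for i in np.flatnonzero(folds == -1):
        best = np.flatnonzero(want_total == want_total.max())
        fold = rng.choice(best)
        folds[i] = fold
        want_total[fold] -= 1
    return folds


y = np.array([[0, 0], [0, 0], [0, 1], [0, 1], [1, 1], [1, 1], [1, 0], [1, 0]])
folds = iterative_stratification_folds(y, n_splits=2, seed=0)
for f in range(2):
    print("fold", f, "size:", int((folds == f).sum()),
          "positives per label:", y[folds == f].sum(axis=0))
```

On this small label matrix every fold ends up with four samples and two positives per label, whichever way the random ties fall.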

Requirements

iterative-stratification has been tested under Python 3.4 through 3.8 with the following dependencies:

  • scipy(>=0.13.3)
  • numpy(>=1.8.2)
  • scikit-learn(>=0.19.0)

Installation

iterative-stratification is currently available on the PyPI repository and can be installed via pip:

pip install iterative-stratification


The package is also installable from the Anaconda Cloud platform:

conda install -c trent-b iterative-stratification

Toy Examples

The multilabel cross validators that this package provides may be used with the scikit-learn API in the same manner as any other cross validator. For example, these cross validators may be passed to cross_val_score or cross_val_predict. Below are some toy examples of the direct use of the multilabel cross validators.

MultilabelStratifiedKFold

from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
import numpy as np

X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
y = np.array([[0,0], [0,0], [0,1], [0,1], [1,1], [1,1], [1,0], [1,0]])

mskf = MultilabelStratifiedKFold(n_splits=2, shuffle=True, random_state=0)

for train_index, test_index in mskf.split(X, y):
   print("TRAIN:", train_index, "TEST:", test_index)
   X_train, X_test = X[train_index], X[test_index]
   y_train, y_test = y[train_index], y[test_index]

Output:

TRAIN: [0 3 4 6] TEST: [1 2 5 7]
TRAIN: [1 2 5 7] TEST: [0 3 4 6]
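
To see what "stratified" means for this output, the positives per label can be counted on each side of the first printed split. The snippet below is a small NumPy-only check; the helper name positives_per_label is an assumption for illustration and is not part of the package.

```python
import numpy as np

y = np.array([[0, 0], [0, 0], [0, 1], [0, 1], [1, 1], [1, 1], [1, 0], [1, 0]])

# Indices copied from the first split in the output above.
train_index = np.array([0, 3, 4, 6])
test_index = np.array([1, 2, 5, 7])


def positives_per_label(y, index):
    """Number of positive samples per label column among the selected rows."""
    return y[index].sum(axis=0)


train_counts = positives_per_label(y, train_index)
test_counts = positives_per_label(y, test_index)
print("train:", train_counts, "test:", test_counts)
```

Both sides hold two positives for each of the two labels, so each label's 50/50 share is preserved in the split.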

RepeatedMultilabelStratifiedKFold

from iterstrat.ml_stratifiers import RepeatedMultilabelStratifiedKFold
import numpy as np

X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
y = np.array([[0,0], [0,0], [0,1], [0,1], [1,1], [1,1], [1,0], [1,0]])

rmskf = RepeatedMultilabelStratifiedKFold(n_splits=2, n_repeats=2, random_state=0)

for train_index, test_index in rmskf.split(X, y):
   print("TRAIN:", train_index, "TEST:", test_index)
   X_train, X_test = X[train_index], X[test_index]
   y_train, y_test = y[train_index], y[test_index]

Output:

TRAIN: [0 3 4 6] TEST: [1 2 5 7]
TRAIN: [1 2 5 7] TEST: [0 3 4 6]
TRAIN: [0 1 4 5] TEST: [2 3 6 7]
TRAIN: [2 3 6 7] TEST: [0 1 4 5]

MultilabelStratifiedShuffleSplit

from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
import numpy as np

X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
y = np.array([[0,0], [0,0], [0,1], [0,1], [1,1], [1,1], [1,0], [1,0]])

msss = MultilabelStratifiedShuffleSplit(n_splits=3, test_size=0.5, random_state=0)

for train_index, test_index in msss.split(X, y):
   print("TRAIN:", train_index, "TEST:", test_index)
   X_train, X_test = X[train_index], X[test_index]
   y_train, y_test = y[train_index], y[test_index]

Output:

TRAIN: [1 2 5 7] TEST: [0 3 4 6]
TRAIN: [2 3 6 7] TEST: [0 1 4 5]
TRAIN: [1 2 5 6] TEST: [0 3 4 7]
Comments
  • Adjusting test_size doesn't actually change test_size

    Hello! I'm trying to use this code for a project, however, I don't want my test size to be 0.5. When I try and adjust it, I don't get a change:

    from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
    import numpy as np
    
    X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
    y = np.array([[0,0], [0,0], [0,1], [0,1], [1,1], [1,1], [1,0], [1,0]])
    msss = MultilabelStratifiedShuffleSplit(n_splits=3, test_size=0.25, random_state=42)
    
    for train_index, test_index in msss.split(X, y):
        print("TRAIN:", train_index, "TEST:", test_index)
        print(len(train_index))
        print(len(test_index))
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
    

    outputs:

    ('TRAIN:', array([1, 2, 4, 7]), 'TEST:', array([0, 3, 5, 6]))
    4
    4
    ('TRAIN:', array([2, 3, 6, 7]), 'TEST:', array([0, 1, 4, 5]))
    4
    4
    ('TRAIN:', array([0, 2, 4, 6]), 'TEST:', array([1, 3, 5, 7]))
    4
    4
    

    Kudos on putting this out there!

    opened by tyler-lanigan-hs 9
  • [MOD] Bug Fix for sklearn 1.0~

    scikit-learn has been updated to 1.0.0. As a result, some functions no longer work properly and raise errors like the one below:

    TypeError: __init__() takes from 1 to 2 positional arguments but 5 were given.
    

    To fix this problem, I added * to the __init__ parameters so that the following arguments are keyword-only, as described in PEP 3102 (https://www.python.org/dev/peps/pep-3102/).
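
The failure and a keyword-based fix can be reproduced in miniature with stand-in classes. Base, OldChild, and FixedChild below are hypothetical names invented for this sketch; they are not scikit-learn or iterstrat classes, and the actual patch may differ in detail.

```python
class Base:
    """Hypothetical stand-in for a base splitter whose options became keyword-only."""

    def __init__(self, n_splits=10, *, test_size=None, train_size=None, random_state=None):
        self.n_splits = n_splits
        self.test_size = test_size
        self.train_size = train_size
        self.random_state = random_state


class OldChild(Base):
    def __init__(self, n_splits=10, test_size=None, train_size=None, random_state=None):
        # Positional pass-through: fails once Base places its options after '*'.
        super().__init__(n_splits, test_size, train_size, random_state)


class FixedChild(Base):
    def __init__(self, n_splits=10, *, test_size=None, train_size=None, random_state=None):
        # Keyword pass-through works whether or not Base is keyword-only.
        super().__init__(n_splits=n_splits, test_size=test_size,
                         train_size=train_size, random_state=random_state)


try:
    OldChild(n_splits=3, test_size=0.5)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

fixed = FixedChild(n_splits=3, test_size=0.5, random_state=0)
print(error_message)
print(fixed.n_splits, fixed.test_size)
```

The positional call fails with the same kind of message as above (one to two positional arguments accepted, five given), while the keyword call constructs the object normally.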

    opened by CryptoSalamander 4
  • Incompatibility with scikit-learn 1.0 in latest release

    As of scikit-learn 1.0 the deprecation warning fixed in 0a108bc2062fd32f98c9a6305508ea213292ba08 has become a hard error. Could a new release be pushed to pypi in order to remain compatible with the latest scikit-learn?

    For other users experiencing this issue, the error will look something like this:

    , in __init__
        super(MultilabelStratifiedShuffleSplit, self).__init__(
    TypeError: __init__() takes from 1 to 2 positional arguments but 5 were given
    

    The workaround is to use the latest master of this package.

    opened by lunik1 4
  • Error using MultilabelStratifiedKFold

    Hi Trent! First, thanks for this repository, it has helped me a lot.

    I have a question. I use MultilabelStratifiedKFold for a machine learning model, but since last week it has been giving me an error. I haven't changed anything in my code, so I don't know what could be happening.

    The error I'm having is in this line of code:

    mskf = MultilabelStratifiedKFold(n_splits=3, shuffle=True, random_state=42)

    And the error it throws is this:

    Input In [13], in <cell line: 6>()
          3 oof_preds["fold_idx"] = -1
          4 oof_preds["oof_pred"] = -1
    ----> 6 mskf = MultilabelStratifiedKFold(n_splits=3, shuffle=True, random_state=42)
          7 mskf_split = mskf.split(dataset, dataset[["rvm_tipo_enc","rvm_marca_enc","rvm_antiguedad","converted"]])
          9 for fold,(train_idx,valid_idx) in enumerate(mskf_split):
    
    File ~\Anaconda3\envs\JARVIS\lib\site-packages\iterstrat\ml_stratifiers.py:157, in MultilabelStratifiedKFold.__init__(self, n_splits, shuffle, random_state)
        156 def __init__(self, n_splits=3, shuffle=False, random_state=None):
    --> 157     super(MultilabelStratifiedKFold, self).__init__(n_splits, shuffle, random_state)
    
    TypeError: __init__() takes 2 positional arguments but 4 were given
    
    
    
    What could be happening here? Thanks a lot!
    opened by robertogarces 3
  • Ability to set custom fold proportions for MultilabelStratifiedKFold (pass "r" to IterativeStratification)

    For us it's useful to be able to set custom fold proportions when using MultilabelStratifiedKFold (essentially passing custom r to IterativeStratification). It's easy enough to extend outside of the lib (only _make_test_folds needs to be copied), but I wonder if such a feature could be useful in the library itself, what do you think?

    And thanks for a great library!

    opened by lopuhin 3
  • Balanced sample with low number of one of the classes

    I'm working on an extremely large multilabel problem with some rare classes. I was trying to use your package to balance my train/test split and noticed that it does not guarantee at least one sample of each class in each set. The following example shows the problem:

    >>> import numpy as np
    >>> from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
    >>> 
    >>> X = np.arange(10)
    >>> X
    array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    >>> 
    >>> y = np.array([[1,1,0],[0,1,0],[1,0,0],[1,0,0],[0,1,0],[0,1,0],[0,1,0],[1,1,0],[0,1,1],[1,0,1]])
    >>> y
    array([[1, 1, 0],
           [0, 1, 0],
           [1, 0, 0],
           [1, 0, 0],
           [0, 1, 0],
           [0, 1, 0],
           [0, 1, 0],
           [1, 1, 0],
           [0, 1, 1],
           [1, 0, 1]])
    >>> 
    >>> temp = MultilabelStratifiedShuffleSplit(n_splits = 1,test_size =.2,random_state = 0)
    >>> train, test  = list(temp.split(X, y))[0]
    >>> 
    >>> train
    array([1, 2, 3, 4, 5, 6, 7, 8, 9])
    >>> 
    >>> 
    >>> test
    array([0])
    

    The train set contains both samples 8 and 9, which are the only ones that have the class with index 2. How can I make sure that all splits have at least one sample per class?
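
A quick way to detect this situation is to list the label columns that have no positive sample on one side of the split. The snippet below is a NumPy-only check using the indices shown above; labels_missing is a hypothetical helper name, and the check only reports the gap, it does not change how the split is made.

```python
import numpy as np

y = np.array([[1, 1, 0], [0, 1, 0], [1, 0, 0], [1, 0, 0], [0, 1, 0],
              [0, 1, 0], [0, 1, 0], [1, 1, 0], [0, 1, 1], [1, 0, 1]])

# Indices copied from the session above.
train = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
test = np.array([0])


def labels_missing(y, index):
    """Label columns with no positive sample among the selected rows."""
    return np.flatnonzero(y[index].sum(axis=0) == 0)


missing_in_train = labels_missing(y, train)
missing_in_test = labels_missing(y, test)
print("missing in train:", missing_in_train, "missing in test:", missing_in_test)

# Proportional target for label 2 in a 20% test set: 2 positives * 0.2 = 0.4 samples.
print("target positives of label 2 in test:", y[:, 2].sum() * 0.2)
```

With only two positives for label 2 and a 20% test share, the proportional target is 0.4 of a sample, so some rounding is unavoidable for that label.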

    opened by miguelwon 3
  • Getting started help

    Hello and thank you for this project.

    I am new to machine learning and have a little bit of trouble getting started with this.

    If I understood correctly, this method is used when I have an unevenly distributed multilabel dataset, in order to get an evenly distributed one.

    To test this I used one of the toy examples and changed it a little, so that I have an uneven distribution over 3 classes.

    from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
    import numpy as np
    from matplotlib import pyplot as plt
    
    
    AMOUNT_OF_CLASSES = 3
    X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
    y = np.array([[1,0,1], [1,1,0], [1,0,1], [0,0,1], [1,1,0], [0,0,1], [1,0,0], [1,0,0]])
    

    If I take a look at the distribution at the beginning it will look like the following:

    dis = np.zeros(shape=(AMOUNT_OF_CLASSES,))
    for i in range(0,AMOUNT_OF_CLASSES):
        dis[i] = y[:,i].sum()
    
    # Show original distribution
    plt.figure(0)
    plt.bar([i for i in range(0,AMOUNT_OF_CLASSES)],dis)
    

    image

    If I now do the stratification like this:

    # now go for stratification
    msss = MultilabelStratifiedShuffleSplit(n_splits=10, test_size=0.5, random_state=0)
    
    cnt = 1
    # distribution over all iterations
    all_dis = np.zeros(shape=(AMOUNT_OF_CLASSES,))
    for train_index, test_index in msss.split(X, y):
        iter_dis = np.zeros(shape=(AMOUNT_OF_CLASSES,))
        print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        
        
        for i in range(0,AMOUNT_OF_CLASSES):
            iter_dis[i] = y_train[:,i].sum()
            
        all_dis += iter_dis
        # Show new distribution (for the latest one at first)
        plt.figure(cnt)
        plt.bar([i for i in range(0,AMOUNT_OF_CLASSES)],iter_dis)
        
        
        
        cnt += 1
    

    and look at the distribution at the end:

    
    plt.figure(cnt+1)
    plt.bar([i for i in range(0,AMOUNT_OF_CLASSES)],all_dis)    
    plt.bar([i for i in range(0,AMOUNT_OF_CLASSES)],dis)
    plt.title("Distribution after Stratification")
    plt.legend(['Distribution after stratification','original distribution'])
    

    I will get the following:

    image

    So it still looks like I do not have an even distribution among the classes.

    Is this not what this is used for? How could I achieve that every class is evenly distributed over the data? Thank you very much.

    opened by kevinkit 3
  • Possibility to do stratification with multi-output multi-class (multi-target) data

    Hi, I have a multi-output multi-class (multi-target) dataset and would like to stratify the data before applying a learning algorithm. Using iterative_train_test_split from the skmultilearn library:

    from skmultilearn.model_selection import iterative_train_test_split
    x_train, y_train, x_test, y_test = iterative_train_test_split(x, y, test_size = 0.1)

    Thank you.
    opened by bundit786 2
  • Do we need X for `split`

    Forgive me if this is a dumb question, but if I understand this library correctly, the main aim is to look at correlations between the y's and somehow accommodate that when stratifying. Is there any reason why we need the X variable when splitting? E.g. to get the indices we always have to do: train_index, test_index = next(iter(msss.split(X, y))).

    Thanks in advance.

    opened by sachinruk 2
  • Is it possible to calculate the total number of all possible splits?

    Love this repo, it spares me a lot of effort.

    Here is my question (or concern).

    When we don't enforce any constraint when generating KFold, the number of all possible splits is the largest and simple to calculate.

    When we only have one label and enforce the splits to be stratified, i.e. StratifiedKFold, this number drops, but normally will still be large enough to generate a diverse set of splits. Again, this number can be calculated with some simple combinatorics.
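
For a concrete sense of scale, those two counts can be worked out for a tiny case of 8 samples and 2 folds. The sketch below assumes equal-sized, labelled folds and, for the stratified case, a single binary label with 4 positives and 4 negatives split exactly evenly; equal_partitions is a hypothetical helper name.

```python
from math import factorial


def equal_partitions(n, k):
    """Ways to deal n distinct samples into k labelled folds of equal size.

    Assumes n is divisible by k; divide by factorial(k) if fold order is irrelevant.
    """
    size = n // k
    return factorial(n) // factorial(size) ** k


# No constraint: 8 samples into 2 folds of 4.
unconstrained = equal_partitions(8, 2)

# One binary label, perfectly stratified: each class (4 samples) is split 2/2 independently.
stratified = equal_partitions(4, 2) * equal_partitions(4, 2)

print(unconstrained, stratified)  # 70 36
```

Even in this tiny case the stratification constraint roughly halves the number of admissible splits, from 70 to 36.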

    However, when stratification on multiple labels is enforced (the goal of this repo), things become more complicated, and I am worried that if there are too many labels, say hundreds of them, there won't be many possible splits that can satisfy the stratification constraint😟.

    So my question is,

    • Does my concern make sense?
    • Can we calculate the total number of possibilities?

    Looking forward to your reply.

    opened by whatever60 2
  • Different percentage of samples for each label after using MultilabelStratifiedKFold

    Hi trent-b:

    Thanks for this nice repository. I hope you can answer the questions below:

    import numpy as np
    from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

    def multi2single_labels(y):
        # Count how often each label combination (row of y) occurs.
        d = {}
        for yy in y:
            d[str(yy)] = d.get(str(yy), 0) + 1
        return d
    yy = np.array([[0,0,0,0]]*318+[[1,0,0,0]]*264+[[0,0,1,0]]*58+[[0,1,0,1]]*51+\
                  [[1,0,0,1]]*81+[[0,1,0,0]]*151+[[0,1,1,0]]*33+[[0,0,1,1]]*27+\
                  [[0,0,0,1]]*54+[[0,1,1,1]]*21+[[1,1,0,0]]*11+[[1,1,0,1]]*7+[[1,0,1,0]]*2)
    xx = np.zeros((yy.shape[0],))
    kfold = MultilabelStratifiedKFold(n_splits=2, random_state=42, shuffle=True)
    for idx_fold, (idx_train, idx_valid) in enumerate(kfold.split(xx, yy)):
        print(f'Now in {idx_fold}th fold')
        y_valid = yy[idx_valid]
        d_y = multi2single_labels(y_valid)
        print(f'labels of y: {d_y}')
    

    Running the code above (the simplest 2-fold case) gives this result:

    Now in 0th fold
    labels of y: {'[0 0 0 0]': 155, '[1 0 0 0]': 136, '[0 0 1 0]': 28, '[0 1 0 1]': 25, '[1 0 0 1]': 37, '[0 1 0 0]': 76, '[0 1 1 0]': 18, '[0 0 1 1]': 15, '[0 0 0 1]': 31, '[0 1 1 1]': 9, '[1 1 0 0]': 5, '[1 1 0 1]': 4}
    Now in 1th fold
    labels of y: {'[0 0 0 0]': 163, '[1 0 0 0]': 128, '[0 0 1 0]': 30, '[0 1 0 1]': 26, '[1 0 0 1]': 44, '[0 1 0 0]': 75, '[0 1 1 0]': 15, '[0 0 1 1]': 12, '[0 0 0 1]': 23, '[0 1 1 1]': 12, '[1 1 0 0]': 6, '[1 1 0 1]': 3, '[1 0 1 0]': 2}

    Q1: Why are the two '[1 0 1 0]' samples not split one per fold, instead of both landing in the 1th fold?
    Q2: Why do the counts of some label combinations differ so much between the folds (e.g. '[0 0 0 0]', '[1 0 0 0]')?
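
One way to read these numbers: iterative stratification as described in the paper balances the number of positives per label, not the number of samples per label combination. The sketch below recomputes the per-label totals from the two printed dictionaries; per_label_counts is a hypothetical helper, and the dictionaries are copied from the output above on the assumption that they are complete.

```python
import numpy as np

fold0 = {'[0 0 0 0]': 155, '[1 0 0 0]': 136, '[0 0 1 0]': 28, '[0 1 0 1]': 25,
         '[1 0 0 1]': 37, '[0 1 0 0]': 76, '[0 1 1 0]': 18, '[0 0 1 1]': 15,
         '[0 0 0 1]': 31, '[0 1 1 1]': 9, '[1 1 0 0]': 5, '[1 1 0 1]': 4}
fold1 = {'[0 0 0 0]': 163, '[1 0 0 0]': 128, '[0 0 1 0]': 30, '[0 1 0 1]': 26,
         '[1 0 0 1]': 44, '[0 1 0 0]': 75, '[0 1 1 0]': 15, '[0 0 1 1]': 12,
         '[0 0 0 1]': 23, '[0 1 1 1]': 12, '[1 1 0 0]': 6, '[1 1 0 1]': 3,
         '[1 0 1 0]': 2}


def per_label_counts(combo_counts, n_labels=4):
    """Sum the positives of each label over all label combinations."""
    total = np.zeros(n_labels, dtype=int)
    for combo, n in combo_counts.items():
        bits = np.array([int(b) for b in combo.strip('[]').split()])
        total += bits * n
    return total


counts0 = per_label_counts(fold0)
counts1 = per_label_counts(fold1)
print("fold 0:", counts0, "of", sum(fold0.values()), "samples")
print("fold 1:", counts1, "of", sum(fold1.values()), "samples")
```

Per label, the two folds differ by at most one positive (182 vs 183, 137 vs 137, 70 vs 71, 121 vs 120), even though individual label combinations such as '[1 0 1 0]' are not split evenly.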

    Thanks!

    opened by Lance0218 2
  • Unable to create a small sample of 1000 train and 100 using MultilabelStratifiedShuffleSplit

    Hi trent-b:

    Thanks for this repository. I hope you can help with my issue. I have a large JSON data set from which I want to create a smaller sample set using MultilabelStratifiedShuffleSplit.

    import warnings

    import numpy as np
    from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit

    def mlb_train_test_split(labels, test_size, train_size, random_state=0):
        # Silence FutureWarning raised while constructing the splitter.
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", category=FutureWarning)
            msss = MultilabelStratifiedShuffleSplit(
                test_size=test_size, train_size=train_size, random_state=random_state
            )
        # Only the labels matter for the split, so X is a placeholder array.
        train_idx, test_idx = next(msss.split(np.ones_like(labels), labels))
        return train_idx, test_idx
    

    I then call the function as:

    train_idx, test_idx = mlb_train_test_split(labels, test_size=1000, train_size=200, random_state=0)

    When I look at the numbers I'm seeing way more than 200 rows. Is there a limitation? The labels length is approximately 500,000 in the dataset.

    opened by meltedhead 1
Releases(0.1.7)