A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

Overview

Package information: Python 3.7, License: LGPL v3

TPOT stands for Tree-based Pipeline Optimization Tool. Consider TPOT your Data Science Assistant. TPOT is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

TPOT Demo

TPOT will automate the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data.
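
To make the idea of searching over candidate pipelines more concrete, here is a toy sketch of an evolutionary search loop. It is not TPOT's implementation: the "pipelines" are plain (preprocessor, model, depth) tuples, the operator names are hypothetical, and `toy_score` is a made-up stand-in for a cross-validation score.

```python
import random

# Hypothetical building blocks; a real search would draw on scikit-learn operators.
PREPROCESSORS = ["none", "scale", "poly"]
MODELS = ["tree", "forest", "logistic"]
DEPTHS = [1, 2, 4, 8]


def random_pipeline(rng):
    """Draw a random candidate: (preprocessor, model, depth)."""
    return (rng.choice(PREPROCESSORS), rng.choice(MODELS), rng.choice(DEPTHS))


def toy_score(pipeline):
    """Made-up fitness standing in for a cross-validation score."""
    prep, model, depth = pipeline
    score = {"none": 0.0, "scale": 0.1, "poly": 0.2}[prep]
    score += {"tree": 0.5, "forest": 0.7, "logistic": 0.6}[model]
    score += 0.01 * depth
    return score


def mutate(pipeline, rng):
    """Replace one randomly chosen component of the pipeline."""
    parts = list(pipeline)
    index = rng.randrange(3)
    parts[index] = rng.choice([PREPROCESSORS, MODELS, DEPTHS][index])
    return tuple(parts)


def evolve(generations=5, population_size=10, seed=42):
    """Keep the better half each generation and refill with mutated copies."""
    rng = random.Random(seed)
    population = [random_pipeline(rng) for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=toy_score, reverse=True)
        survivors = population[: population_size // 2]
        children = [mutate(rng.choice(survivors), rng) for _ in survivors]
        population = survivors + children
    return max(population, key=toy_score)


best = evolve()
print(best, round(toy_score(best), 2))
```

The sketch uses mutation only and a fixed seed so that repeated runs return the same candidate. A real genetic programming search also recombines candidates and evaluates them on data.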

An example Machine Learning pipeline

Once TPOT is finished searching (or you get tired of waiting), it provides you with the Python code for the best pipeline it found so you can tinker with the pipeline from there.

An example TPOT pipeline

TPOT is built on top of scikit-learn, so all of the code it generates should look familiar... if you're familiar with scikit-learn, anyway.

TPOT is still under active development and we encourage you to check back on this repository regularly for updates.

For further information about TPOT, please see the project documentation.

License

Please see the repository license for the licensing and usage information for TPOT.

Generally, we have licensed TPOT to make it as widely usable as possible.

Installation

We maintain the TPOT installation instructions in the documentation. TPOT requires a working installation of Python.

Usage

TPOT can be used on the command line or with Python code.

See the documentation for more information on both ways of using TPOT.

Examples

Classification

Below is a minimal working example with the optical recognition of handwritten digits dataset.

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25, random_state=42)

tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_digits_pipeline.py')

Running this code should discover a pipeline that achieves about 98% testing accuracy, and the corresponding Python code should be exported to the tpot_digits_pipeline.py file and look similar to the following:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import PolynomialFeatures
from tpot.builtins import StackingEstimator
from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=42)

# Average CV score on the training set was: 0.9799428471757372
exported_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
    StackingEstimator(estimator=LogisticRegression(C=0.1, dual=False, penalty="l1")),
    RandomForestClassifier(bootstrap=True, criterion="entropy", max_features=0.35000000000000003, min_samples_leaf=20, min_samples_split=19, n_estimators=100)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 42)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

Regression

Similarly, TPOT can optimize pipelines for regression problems. Below is a minimal working example with the practice Boston housing prices data set.

from tpot import TPOTRegressor
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

housing = load_boston()
X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target,
                                                    train_size=0.75, test_size=0.25, random_state=42)

tpot = TPOTRegressor(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_boston_pipeline.py')

which should result in a pipeline that achieves about 12.77 mean squared error (MSE), and the Python code in tpot_boston_pipeline.py should look similar to:

import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=42)

# Average CV score on the training set was: -10.812040755234403
exported_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
    ExtraTreesRegressor(bootstrap=False, max_features=0.5, min_samples_leaf=2, min_samples_split=3, n_estimators=100)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 42)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
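
For reference, the MSE is the mean of the squared differences between the true targets and the predictions. The comment in the exported file above reports the cross-validation score as a negative number, which is consistent with a negated-MSE convention in which higher scores are better. A minimal pure-Python sketch with made-up numbers (not Boston housing data):

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up values for illustration only.
y_true = [24.0, 21.5, 30.0, 18.0]
y_pred = [22.0, 23.5, 27.0, 18.0]

mse = mean_squared_error(y_true, y_pred)
neg_mse = -mse  # negated so that "higher is better" when comparing pipelines
print(mse, neg_mse)
```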

Check the documentation for more examples and tutorials.

Contributing to TPOT

We welcome you to check the existing issues for bugs or enhancements to work on. If you have an idea for an extension to TPOT, please file a new issue so we can discuss it.

Before submitting any contributions, please review our contribution guidelines.

Having problems or have questions about TPOT?

Please check the existing open and closed issues to see if your issue has already been attended to. If it hasn't, file a new issue on this repository so we can review your issue.

Citing TPOT

If you use TPOT in a scientific publication, please consider citing at least one of the following papers:

Trang T. Le, Weixuan Fu and Jason H. Moore (2020). Scaling tree-based automated machine learning to biomedical big data with a feature set selector. Bioinformatics 36(1): 250-256.

BibTeX entry:

@article{le2020scaling,
  title={Scaling tree-based automated machine learning to biomedical big data with a feature set selector},
  author={Le, Trang T and Fu, Weixuan and Moore, Jason H},
  journal={Bioinformatics},
  volume={36},
  number={1},
  pages={250--256},
  year={2020},
  publisher={Oxford University Press}
}

Randal S. Olson, Ryan J. Urbanowicz, Peter C. Andrews, Nicole A. Lavender, La Creis Kidd, and Jason H. Moore (2016). Automating biomedical data science through tree-based pipeline optimization. Applications of Evolutionary Computation, pages 123-137.

BibTeX entry:

@inbook{Olson2016EvoBio,
    author={Olson, Randal S. and Urbanowicz, Ryan J. and Andrews, Peter C. and Lavender, Nicole A. and Kidd, La Creis and Moore, Jason H.},
    editor={Squillero, Giovanni and Burelli, Paolo},
    chapter={Automating Biomedical Data Science Through Tree-Based Pipeline Optimization},
    title={Applications of Evolutionary Computation: 19th European Conference, EvoApplications 2016, Porto, Portugal, March 30 -- April 1, 2016, Proceedings, Part I},
    year={2016},
    publisher={Springer International Publishing},
    pages={123--137},
    isbn={978-3-319-31204-0},
    doi={10.1007/978-3-319-31204-0_9},
    url={http://dx.doi.org/10.1007/978-3-319-31204-0_9}
}

Randal S. Olson, Nathan Bartley, Ryan J. Urbanowicz, and Jason H. Moore (2016). Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science. Proceedings of GECCO 2016, pages 485-492.

BibTeX entry:

@inproceedings{OlsonGECCO2016,
    author = {Olson, Randal S. and Bartley, Nathan and Urbanowicz, Ryan J. and Moore, Jason H.},
    title = {Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science},
    booktitle = {Proceedings of the Genetic and Evolutionary Computation Conference 2016},
    series = {GECCO '16},
    year = {2016},
    isbn = {978-1-4503-4206-3},
    location = {Denver, Colorado, USA},
    pages = {485--492},
    numpages = {8},
    url = {http://doi.acm.org/10.1145/2908812.2908918},
    doi = {10.1145/2908812.2908918},
    acmid = {2908918},
    publisher = {ACM},
    address = {New York, NY, USA},
}

Alternatively, you can cite the repository directly with the following DOI:

DOI

Support for TPOT

TPOT was developed in the Computational Genetics Lab at the University of Pennsylvania with funding from the NIH under grant R01 AI117694. We are incredibly grateful for the support of the NIH and the University of Pennsylvania during the development of this project.

The TPOT logo was designed by Todd Newmuis, who generously donated his time to the project.

Comments
  • [WIP] Use dask.delayed within fit

    I thought I submitted this already, but couldn't find it. My apologies if this is a double post.

    This should not be merged, it likely breaks existing behavior

    This addresses #304. It sprinkles dask.delayed in a couple of places in the current codebase. To improve things we should do the following:

    1. Break apart or reimplement the _fit_and_score function to use dask.delayed on every step of a pipeline. This would help to improve the sharing of intermediate results and would also improve diagnostics. dask_ml/model_selection/_search.py::do_fit_and_score does some of this, but it was heavily optimized for efficiency. It would be good to do the same thing, but with dask.delayed here, which would probably be nicer for external devs even if it adds a millisecond or two of overhead. cc @jcrist
    2. It might also be useful to delay the cross-validation process so that we don't keep passing around numpy arrays. I suspect that we're being a bit unclean with how we're handling things there. This may not be performance-critical near-term though.

    If anyone wants to take a shot at task 1 I suspect that this would be interesting work.

    opened by mrocklin 58
  • Add support for progress bar during GP if verbosity is set to 2

    What does this PR do?

    Adds a progress bar (from the tqdm module) that displays during calls to fit() if verbosity is set to 2

    Where should the reviewer start?

    The reviewer should check the fit() and _gp_new_generation functions

    How should this PR be tested?

    Running TPOT with verbosity set to 2

    Any background context you want to provide?

    What are the relevant issues?

    #140

    Screenshots (if appropriate)

    As you can see below there's a bit of a conflict between the GP min/max/avg stats and the tqdm progress bar. Typically tqdm will smoothly update the progress bar as it progresses, but the stat output forces tqdm to redraw the bar. So it may be desirable to either remove the stats or change the conditions under which the stats and tqdm show up.

    Console-shot:

    GP Progress:  22%|██████▌                      | 9/40 [00:01<00:06,  4.98pipeline/s]
    gen nevals  Minimum score   Average score   Maximum score
    0   10      0.5             0.662568        0.972352                                                         
    GP Progress:  50%|██████████████              | 20/40 [00:06<00:09,  2.14pipeline/s]
    1   10      0.5             0.837614        0.984079                                                                
    GP Progress:  70%|███████████████████▌        | 28/40 [00:11<00:07,  1.63pipeline/s]
    2   8       0.97224         0.979522        0.984079                                                            
    GP Progress:  95%|██████████████████████████▌ | 38/40 [00:19<00:01,  1.15pipeline/s]
    3   8       0.881496        0.968862        0.984079  
    

    Edit:

    Above issue is no longer present in the branch

    Questions:

    • Do the docs need to be updated?

    There does not seem to be any information in the docs about what the verbosity setting does at the moment, so I suppose not unless it is desirable to add that now.

    • Does this PR add new (Python) dependencies?

    Yes, tqdm

    enhancement 
    opened by danthedaniel 34
  • Switch estimator operator logical checks to be interface based rather than inheritance based

    What does this PR do?

    This PR:

    • Switches several inheritance based operator/estimator checks to be duck typing based (verifying based on the estimator interface). The primary use of this is in evaluating whether an operator can be the root of a pipeline and setting the optype correctly. Specifically, logical checks for whether an operator is an estimator of a certain category are done by checking if it inherits from one of several scikit-learn Mixin classes. This PR switches these checks to evaluate whether the interface of the operator is consistent with the scikit-learn estimators, rather than an explicit subclassing.

    • Adds a new configuration, "TPOT cuML". With this configuration, TPOT will search over a restricted configuration using the GPU-accelerated estimators in RAPIDS cuML and DMLC XGBoost. This configuration requires an NVIDIA Pascal architecture or better GPU with compute capability 6.0+, and that the library cuML is installed. With this configuration, all model training and predicting will be GPU-accelerated. This configuration is particularly useful for medium-sized and larger datasets on which CPU-based estimators are a common bottleneck, and works for both the TPOTClassifier and TPOTRegressor.

    Where should the reviewer start?

    The reviewer should start in stacking_estimator.py, and then move to operator_utils.py. Next, they should look at base.py and then the new configuration options.

    How should this PR be tested?

    ~Currently, this PR should probably be tested with existing tests, as it does not introduce any new public-interface behavior or dependencies. If it's desirable, I'm happy to add tests for the private helper methods in operator_utils.py (_is_selector and _is_transformer).~

    This PR adds new tests to the general testing suite, as it now adds new public-interface options. It can be tested with the standard tests.

    • [x] Passes existing + new tests with nosetests -s -v (on my local machine)
    Ran 256 tests in 81.943s
    OK (SKIP=1)
    

    Any background context you want to provide?

    Currently, TPOT requires that estimators explicitly inherit from Scikit-learn Mixin classes in order to determine the nature of an estimator operator within the TPOTOperatorClassFactory and StackingEstimator. This Scikit-learn inheritance based programming model provides consistency, but also limits flexibility. Switching these checks to be duck-typing based rather than inheritance based would still preserve consistency but also allow users to use other libraries with TPOT, such as cuML, a GPU-based machine learning library. Using cuML with TPOT can provide significant speedups, as shown in the issue linked below.
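
The difference between the two kinds of check can be sketched in plain Python. The classes and helper names below are hypothetical stand-ins, not TPOT's or scikit-learn's actual code; the sketch only illustrates why an interface-based check accepts estimators that an inheritance-based check rejects.

```python
class ClassifierMixin:
    """Stand-in for a scikit-learn style mixin (hypothetical, for illustration)."""


class InheritingClassifier(ClassifierMixin):
    def fit(self, X, y):
        return self

    def predict(self, X):
        return [0 for _ in X]


class ThirdPartyClassifier:
    """Implements the same interface without inheriting from the mixin."""

    def fit(self, X, y):
        return self

    def predict(self, X):
        return [1 for _ in X]


def is_classifier_by_inheritance(estimator):
    """Inheritance-based check: requires an explicit mixin subclass."""
    return isinstance(estimator, ClassifierMixin)


def is_estimator_by_interface(estimator):
    """Duck-typing check: accept anything with callable fit and predict."""
    return all(callable(getattr(estimator, name, None)) for name in ("fit", "predict"))


for est in (InheritingClassifier(), ThirdPartyClassifier()):
    print(type(est).__name__, is_classifier_by_inheritance(est), is_estimator_by_interface(est))
```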

    What are the relevant issues?

    This closes https://github.com/EpistasisLab/tpot/issues/1106

    Screenshots (if appropriate)

    From the linked issue, with the specified configuration and key parameters:

    tpot-cuml-speedup (2)

    Questions:

    • [x] Do the docs need to be updated? Yes.
    • [x] Does this PR add new (Python) dependencies? No. Only optional dependencies that a user can control independently.
    opened by beckernick 30
  • Add compatibility for coming version (0.20) of scikit-learn

    Context of the issue

    The upcoming scikit-learn 0.20 will remove RandomizedPCA and the cross_validation module and replace them with new modules, so we need to add compatibility for that version.

    Release history of scikit-learn

    Process to reproduce the issue

    Running a tpot example with the latest scikit-learn (version 0.18)

    Expected result

    The result would be the same as with the previous version of scikit-learn, but with the 2 warning messages below when running pipeline evaluation.

    Current result

    2 warning messages :

    ..\Anaconda3\lib\site-packages\sklearn\cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
      "This module will be removed in 0.20.", DeprecationWarning)
    ..\Anaconda3\lib\site-packages\sklearn\utils\deprecation.py:52: DeprecationWarning: Class RandomizedPCA is deprecated; RandomizedPCA was deprecated in 0.18 and will be removed in 0.20. Use PCA(svd_solver='randomized') instead. The new implementation DOES NOT store whiten components_. Apply transform to get them.
      warnings.warn(msg, category=DeprecationWarning)

    Possible fix

    Add separate options for cross-validation and PCA based on the installed version of scikit-learn
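
One common way to support both module layouts is to try the newer import path first and fall back to the older one. The sketch below is an assumption about how that could look, not the change that was merged; `import_first_available` is a hypothetical helper, and the outer try/except only exists so the snippet also runs where scikit-learn is not installed.

```python
import importlib


def import_first_available(module_names):
    """Return the first module in module_names that can be imported."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError("none of these modules could be imported: %s" % ", ".join(module_names))


# Prefer the newer module path and fall back to the deprecated one.
try:
    cv_module = import_first_available(["sklearn.model_selection", "sklearn.cross_validation"])
    train_test_split = cv_module.train_test_split
except ImportError:
    train_test_split = None  # scikit-learn is not available in this environment
```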

    enhancement need contributor 
    opened by weixuanfu 30
  • Operator.inheritors() returns a list rather than a set.

    What does this PR do?

    In the current version of TPOT, different pipelines are generated, even if the same random state is given. This PR aims to remedy that.

    When picking a new primitive when generating a new individual, this is done by calling np.random.choice(pset.terminals[type_]). While numpy properly had its seed set, pset.terminals was not ordered - resulting in a different primitive picked anyway.

    The reason pset.terminals was unordered is that it is constructed in the order in which Operator.inheritors() returns the available operators. However, this function internally stored the operators in a set, which is unordered in Python. Hence, multiple calls of Operator.inheritors() could result in differently ordered sets, meaning the random choice, while always picking the same index, would pick different operators.

    This problem is fixed by storing the operators in a list, since that remembers the order elements were added to it.

    Edit: Apparently on my Windows system it was not enough to just use a list; the list had to be explicitly sorted as well. This change has been made in this PR. On my Linux (Ubuntu) systems it worked fine without sorting.
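
The effect can be illustrated with a small standalone sketch. The `Operator` class and its subclasses here are hypothetical stand-ins rather than TPOT's real classes, and the seeded `random.Random` replaces the `np.random.choice` call mentioned above so that the example needs only the standard library.

```python
import random


class Operator:
    """Hypothetical base class standing in for TPOT's operator base."""


class PCAOp(Operator):
    pass


class ScalerOp(Operator):
    pass


class ForestOp(Operator):
    pass


def inheritors_sorted():
    """Collect subclasses in a set, then return them in a stable, sorted order."""
    found = set(Operator.__subclasses__())
    return sorted(found, key=lambda cls: cls.__name__)


def pick_operator(seed):
    """With a fixed seed and a fixed ordering, the same operator is chosen."""
    rng = random.Random(seed)
    return rng.choice(inheritors_sorted())


print([cls.__name__ for cls in inheritors_sorted()])
print(pick_operator(42).__name__)
```

Sorting by class name removes the dependence on set iteration order, which for classes can differ from one interpreter run to the next.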

    Where should the reviewer start?

    All actual code changes are here.

    The new unit test is here.

    How should this PR be tested?

    To test if Operator.inheritors() now always returns the same list, simply call it a few times and see if it indeed works now. This is also done in the unit test. Unfortunately I don't know of a way to better test this change (note that the old code could also return the same order).

    Any background context you want to provide?

    I feel enough background was given.

    What are the relevant issues?

    #349

    Questions:

    I feel that the docs do not need to be updated: to the best of my knowledge, results are intended to always be reproducible, so this is a bugfix.

    It adds no new dependencies.

    need contributor 
    opened by PGijsbers 29
  • Unit testing

    What does this PR do?

    Added 6 unit tests and updated 1 unit test.

    Where should the reviewer start?

    tests.py file

    How should this PR be tested?

    Travis-CI should automatically test the tests added

    Any background context you want to provide?

    What are the relevant issues?

    #41

    Screenshots (if appropriate)

    Questions:

    • Do the docs need to be updated? No
    • Does this PR add new (Python) dependencies? No
    enhancement 
    opened by GJena 29
  • Tpot examples do not seem to differentiate/evolve?

    When running the examples, the initial optimization values found do not change.

    Context of the issue

    For instance, I ran the Iris classifier example and got the following:

    /home/tom/anaconda3/envs/py36n/lib/python3.6/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
      "This module will be removed in 0.20.", DeprecationWarning)
    Optimization Progress: 31%|███ | 92/300 [00:23<00:32, 6.41pipeline/s]
    Generation 1 - Current best internal CV score: 0.9746376811594203
    Optimization Progress: 47%|████▋ | 141/300 [00:39<00:30, 5.23pipeline/s]
    Generation 2 - Current best internal CV score: 0.9746376811594203
    Optimization Progress: 63%|██████▎ | 190/300 [00:50<00:14, 7.69pipeline/s]
    Generation 3 - Current best internal CV score: 0.9746376811594203
    Optimization Progress: 77%|███████▋ | 231/300 [00:57<00:07, 8.68pipeline/s]
    Generation 4 - Current best internal CV score: 0.9746376811594203
    Generation 5 - Current best internal CV score: 0.9746376811594203

    Best pipeline: GaussianNB(input_matrix)
    0.921052631579

    So the best internal CV score never changes.

    I am seeing the same when running the MNIST example, i.e. no change in scores.

    One would expect to see some search activity with differing values?

    MNIST output:

    Optimization Progress: 31%|███▏ | 94/300 [05:29<04:56, 1.44s/pipeline]
    Generation 1 - Current best internal CV score: 0.9859273244941604
    Optimization Progress: 48%|████▊ | 143/300 [10:57<12:22, 4.73s/pipeline]
    Generation 2 - Current best internal CV score: 0.9859273244941604
    Optimization Progress: 62%|██████▏ | 187/300 [13:09<04:19, 2.30s/pipeline]
    Generation 3 - Current best internal CV score: 0.9859273244941604
    Optimization Progress: 77%|███████▋ | 231/300 [15:04<03:09, 2.74s/pipeline]
    Generation 4 - Current best internal CV score: 0.9859273244941604
    Generation 5 - Current best internal CV score: 0.9859273244941604

    Best pipeline: LinearSVC(PolynomialFeatures(input_matrix, PolynomialFeatures__degree=2, PolynomialFeatures__include_bias=DEFAULT, PolynomialFeatures__interaction_only=DEFAULT), LinearSVC__C=0.0001, LinearSVC__dual=True, LinearSVC__loss=DEFAULT, LinearSVC__penalty=l2, LinearSVC__tol=0.1)
    0.986666666667

    question 
    opened by dartdog 28
  • No module named model_selection?how fix it?

    Something is wrong with "from sklearn.model_selection import train_test_split": sklearn can't find the model_selection module. By the way, train_test_split can be imported with "from sklearn.cross_validation import train_test_split".

    question 
    opened by MrLevo520 28
  • Adding new unit tests

    What does this PR do?

    Adds some unit tests to TPOT

    Where should the reviewer start?

    tests.py

    What are the relevant issues?

    #41


    Should probably just sit on this as I add tests

    opened by danthedaniel 28
  • pip installed xgboost causing cpu to spike and crash

    Hello. First of all, I am very happy with your tool, but I am having issues getting it to finish. I would like to avoid using TPOT light if possible, since I have the requisite processing speed. Also, I am new to posting bugs on this medium, so apologies if I am missing anything.

    Context of the issue

    I am using Spyder 3.2.6 on Python 3.6 on a Windows machine with the specs below. I also have a slower laptop running Spyder 3.2.8, also on Python 3.6, that experiences the same issue.

    Currently, when I fit a pipeline in Spyder, the evolutionary algorithm starts but constantly fails. What I have noticed quite a few times is that the CPU spikes to 100% usage and then the kernel dies.

    image

    Process to reproduce the issue

    PLEASE USE MY SCRIPT HERE to recreate - data and packages preloaded

    https://github.com/GinoWoz1/AdvancedHousePrices/blob/master/TPOT_issue_fix.py

    1. Create dataframe with data
    2. Call the TPOT regressor object
    3. Call the TPOT fit() function with training data where n_jobs = -1 (I have a 6-core PC). I use 50 generations, a population size of 50, and an offspring size of 50.
    4. The kernel crashes at 4%-50% progress on any given run. I haven't completed a run in days.

    Expected result

    I would expect, with my specs, a full completion.

    Current result

    Crashes between 4-50% progress

    Possible fix

    Are there certain models in the pipeline that cause CPU to spike?

    image

    question 
    opened by GinoWoz1 27
  • Convenience function: Detect if there are non-numerical features and encode them as numerical features

    (As discussed in #60)

    Since many sklearn tools only work on numerical data, one limitation of TPOT is that it cannot work with non-numerical features. We should look into adding a convenience function that:

    1. detects whether there exist non-numerical features in the feature set

    2. sends a warning to the user that they should preprocess the non-numerical features into numerical features

    3. ... but also tell the user that TPOT is automatically encoding the non-numerical features as numerical features, do so, and pass the new preprocessed feature set to the optimization process.
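
The three steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, the data is a list of rows rather than a NumPy array or DataFrame, and ordinal (integer) codes are just one simple encoding choice; one-hot encoding would be another.

```python
import warnings


def is_numeric(value):
    """True for ints and floats; bools are excluded on purpose."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def encode_non_numeric(rows):
    """Ordinal-encode any column containing non-numeric values.

    rows is a list of equal-length lists. Returns (encoded_rows, encoded_columns).
    """
    if not rows:
        return [], []
    n_cols = len(rows[0])
    non_numeric = [c for c in range(n_cols) if not all(is_numeric(r[c]) for r in rows)]
    if non_numeric:
        warnings.warn(
            "Non-numerical features found in columns %s; encoding them as integers. "
            "Consider preprocessing them yourself." % non_numeric
        )
    encoded = [list(r) for r in rows]
    for c in non_numeric:
        # Map each distinct value to an integer code, in sorted order for stability.
        codes = {v: i for i, v in enumerate(sorted({str(r[c]) for r in rows}))}
        for r in encoded:
            r[c] = codes[str(r[c])]
    return encoded, non_numeric


features = [[5.1, "red"], [4.9, "blue"], [6.2, "red"]]
encoded, columns = encode_non_numeric(features)
print(encoded, columns)
```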

    enhancement being worked on 
    opened by rhiever 25
  • Update NumPy version requirement

    Update NumPy version requirement to be < 1.24.0 to avoid an AttributeError from importing TPOT in that and newer versions (#1281). This PR may be superseded by #1280, or may need to be reverted in #1280, depending on review/merge order.

    Newer releases of NumPy (and by extension, fresh installs of TPOT) are not compatible with the latest TPOT code and requirements.txt. Several deprecated NumPy aliases were removed in NumPy 1.24.0, and at least one (numpy.float) is still used in the TPOT code.

    Before this fix:

    pip install -r requirements.txt
    python
    >>> import tpot
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/mark.forrer/Documents/code/tpot/tpot/__init__.py", line 27, in <module>
        from .tpot import TPOTClassifier, TPOTRegressor
      File "/Users/mark.forrer/Documents/code/tpot/tpot/tpot.py", line 31, in <module>
        from .base import TPOTBase
      File "/Users/mark.forrer/Documents/code/tpot/tpot/base.py", line 71, in <module>
        from .builtins import CombineDFs, StackingEstimator
      File "/Users/mark.forrer/Documents/code/tpot/tpot/builtins/__init__.py", line 29, in <module>
        from .one_hot_encoder import OneHotEncoder, auto_select_categorical_features, _transform_selected
      File "/Users/mark.forrer/Documents/code/tpot/tpot/builtins/one_hot_encoder.py", line 136, in <module>
        class OneHotEncoder(BaseEstimator, TransformerMixin):
      File "/Users/mark.forrer/Documents/code/tpot/tpot/builtins/one_hot_encoder.py", line 216, in OneHotEncoder
        def __init__(self, categorical_features='auto', dtype=np.float,
      File "/Users/mark.forrer/.pyenv/versions/tpot/lib/python3.8/site-packages/numpy/__init__.py", line 284, in __getattr__
        raise AttributeError("module {!r} has no attribute "
    AttributeError: module 'numpy' has no attribute 'float'
    
    opened by chimaerase 0
  • Error using TPOT with numpy >= 1.24.0 (or fresh TPOT installs)

    Context of the issue

    Newer releases of NumPy (and by extension, fresh installs of TPOT) are not compatible with the latest TPOT code and requirements.txt. Several deprecated NumPy aliases were removed in NumPy 1.24.0, and at least one (numpy.float) is still used in the TPOT code.

    Process to reproduce the issue

    1. Install TPOT in a fresh Python environment (virtualenv): pip install -r requirements.txt
    2. Run the Python shell and import tpot
      python
      >>> import tpot
      
    3. TPOT raises AttributeError
        Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/mark.forrer/Documents/code/tpot/tpot/__init__.py", line 27, in <module>
        from .tpot import TPOTClassifier, TPOTRegressor
      File "/Users/mark.forrer/Documents/code/tpot/tpot/tpot.py", line 31, in <module>
        from .base import TPOTBase
      File "/Users/mark.forrer/Documents/code/tpot/tpot/base.py", line 71, in <module>
        from .builtins import CombineDFs, StackingEstimator
      File "/Users/mark.forrer/Documents/code/tpot/tpot/builtins/__init__.py", line 29, in <module>
        from .one_hot_encoder import OneHotEncoder, auto_select_categorical_features, _transform_selected
      File "/Users/mark.forrer/Documents/code/tpot/tpot/builtins/one_hot_encoder.py", line 136, in <module>
        class OneHotEncoder(BaseEstimator, TransformerMixin):
      File "/Users/mark.forrer/Documents/code/tpot/tpot/builtins/one_hot_encoder.py", line 216, in OneHotEncoder
        def __init__(self, categorical_features='auto', dtype=np.float,
      File "/Users/mark.forrer/.pyenv/versions/tpot/lib/python3.8/site-packages/numpy/__init__.py", line 284, in __getattr__
        raise AttributeError("module {!r} has no attribute "
    AttributeError: module 'numpy' has no attribute 'float'
    

    Expected result

    Successful TPOT import

    Current result

    AttributeError on TPOT import

    Possible fix

    Document the requirement for numpy <1.24.0 until the TPOT code can be updated
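
Until the code is updated, a guard along the following lines could detect an incompatible NumPy version string up front. This is a sketch using only the standard library; `parse_version` and `numpy_supports_float_alias` are hypothetical helper names, not part of TPOT or NumPy, and the 1.24.0 cutoff is taken from the description above.

```python
def parse_version(text):
    """Turn '1.23.5' into (1, 23, 5); ignores any non-numeric suffix like 'rc1'."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)  # pad so that '1.24' compares like '1.24.0'
    return tuple(parts)


def numpy_supports_float_alias(version_text):
    """The np.float alias is gone as of NumPy 1.24.0, per the issue above."""
    return parse_version(version_text) < (1, 24, 0)


for version in ("1.23.5", "1.24.0", "1.26.4"):
    print(version, numpy_supports_float_alias(version))
```

In practice the version string would come from `numpy.__version__`; it is passed in explicitly here so the sketch runs without NumPy installed.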

    opened by chimaerase 2
  • Replace `np.float` with `np.float64` in `one_hot_encoder`

    Any background context you want to provide?

    NumPy 1.24.0 removed the np.float alias, so we have to replace np.float with np.float64 in one_hot_encoder.

    Another option is to replace np.float with Python's float, although the NumPy version may be useful for consistency with NumPy arrays (source).

    opened by sarahyurick 0
  • How to apply groupKfold in classification

    Hi. I am trying to use GroupKFold cross-validation with TPOTClassifier. This is the data:

    import numpy as np

    X_train=np.random.rand(100,5)
    y_train=np.concatenate((np.ones(50),np.zeros(50)))
    groups=np.concatenate((np.zeros(25),np.ones(25),np.ones(50)*2,np.ones(50)*3))
    

    This is how I applied the classifier:

    from sklearn.model_selection import GroupKFold
    from tpot import TPOTClassifier

    pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=GroupKFold(5),
                                        random_state=42, verbosity=2)
    pipeline_optimizer.fit(X_train, y_train,groups=groups)
    

    But this gives the error:

    RuntimeError: A pipeline has not yet been optimized. Please call fit() first.
    
    opened by talhaanwarch 0
  • How does the feature construction work please? (easy question)

    How does the feature construction work please? (easy question)

    Dear authors,

    First and foremost, thank you for this excellent AutoML system! It is very cool to see that, compared to BO-based AutoML systems, other approaches like GP here are comparable or even easily superior at some point, that brings diversed powerful optimiser/approach to the field. However, could you kindly explain how feature construction works or point me to your code so I can see how this works broadly, please?

    Have a lovely evening, Best wishes

    opened by simonprovost 1
Releases(v0.11.7)
  • v0.11.7(Jan 6, 2021)

    • Fix compatibility issue with scikit-learn 0.24 and xgboost 1.3.0
    • Fix a bug that prevented TPOT from working when classifying more than 50 classes
    • Add initial support for Resampler from imblearn
    • Fix minor bugs
  • 0.11.6.post3(Dec 14, 2020)

  • v0.11.6.post2(Nov 30, 2020)

  • v0.11.6.post1(Nov 5, 2020)

  • 0.11.6(Oct 26, 2020)

    • Fix a bug that prevented the point mutation function from working properly with the template option
    • Add a new built-in configuration called "TPOT cuML", in which TPOT searches over a restricted configuration using the GPU-accelerated estimators in RAPIDS cuML and DMLC XGBoost. This configuration requires an NVIDIA Pascal architecture or better GPU with compute capability 6.0+, and requires that the cuML library is installed.
    • Add string path support for log/log_file parameter
    • Fix a bug in version 0.11.5 causing no update in stdout after each generation
    • Fix minor bugs
  • v0.11.1-resAdj(Sep 2, 2020)

  • v0.11.5(Jun 1, 2020)

  • v0.11.4(May 29, 2020)

    • Add a new built-in configuration, "TPOT NN", which includes all operators in "Default TPOT" plus additional neural network estimators written in PyTorch (currently tpot.builtins.PytorchLRClassifier and tpot.builtins.PytorchMLPClassifier for classification tasks only)
    • Refine log_file parameter's behavior
  • v0.11.3(May 14, 2020)

  • v0.11.2(May 13, 2020)

    • Fix the early_stop parameter not working properly
    • TPOT's built-in OneHotEncoder can now refit to different datasets
    • Fix the issue that the attribute evaluated_individuals_ could not record correct generation info
    • Add a new parameter log_file to output logs to a file instead of sys.stdout
    • Fix some code quality issues and mistakes in the documentation
    • Fix minor bugs
  • v0.11.1(Jan 3, 2020)

    • Fix compatibility issue with scikit-learn v0.22
    • warm_start now saves both Primitive Sets and evaluated_pipelines_ from previous runs
    • Fix the error that TPOT assigned wrong fitness scores to non-evaluated pipelines (interrupted by max_time_mins or KeyboardInterrupt)
    • Fix the bug that the mutation operator could not generate a new pipeline when template is not the default value and warm_start is True
    • Fix the bug that max_time_mins could not stop the optimization process when the search space is limited
    • Fix a bug in exported code when the exported pipeline is only 1 estimator
    • Fix spelling mistakes in the documentation
    • Fix some code quality issues
  • v0.11.0(Nov 5, 2019)

    • Support for Python 3.4 and below has been officially dropped. Also support for scikit-learn 0.20 or below has been dropped.
    • The support of a metric function with the signature score_func(y_true, y_pred) for scoring parameter has been dropped.
    • Refine StackingEstimator so that it does not stack NaN/Infinity prediction probabilities.
    • Fix a bug where the population did not persist, even with warm_start=True, when max_time_mins is not the default value.
    • The random_state parameter in TPOT is now used for pipeline evaluation instead of the fixed random seed of 42 used before. The set_param_recursive function has been moved to export_utils.py and can be used in exported code for setting random_state recursively in a scikit-learn Pipeline. It is used to set random_state in the fitted_pipeline_ attribute and in exported pipelines.
    • TPOT can use generations and max_time_mins independently to limit the optimization process, through either one of the parameters or both.
    • The .export() function now returns the exported pipeline as a string if no output filename is specified.
    • Add SGDClassifier and SGDRegressor into TPOT default configs.
    • Documentation has been updated.
    • Fix minor bugs.
  • v0.10.2(Jul 16, 2019)

    • TPOT v0.10.2 is the last version to support Python 2.7 and Python 3.4.
    • Minor updates for fixing compatibility issues with the latest version of scikit-learn (version > 0.21) and xgboost (v0.90)
    • Default value of template parameter is changed to None instead.
    • Fix errors in documentation
  • v0.10.1(Apr 19, 2019)

    • Add a data_file_path option to the export function for replacing 'PATH/TO/DATA/FILE' with a customized dataset path in exported scripts. (Related issue #838)
    • Change python version in CI tests to 3.7
    • Add CI tests for macOS.
  • v0.10.0(Apr 12, 2019)

    • Add a new template option to specify a desired structure for the machine learning pipeline in TPOT. Check the TPOT API (it will be updated once it is merged to the master branch).
    • Add the FeatureSetSelector operator to TPOT for feature selection based on prior expert knowledge. Please check our preprint paper for more details (Note: it was named DatasetSelector in the 1st version of the paper, but we will rename it to FeatureSetSelector in the next version of the paper)
    • Refine n_jobs parameter to accept value below -1. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used. It is related to the issue #846.
    • Now memory parameter can create memory cache directory if it does not exist. It is related to the issue #837.
    • Fix minor bugs.
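The n_jobs rule in this release can be sketched as a small function. This is a minimal illustration of the stated formula (n_cpus + 1 + n_jobs), not TPOT's implementation: the helper name `effective_n_jobs`, the clamp to at least one worker, and the rejection of 0 are assumptions made for this example.

```python
import os


def effective_n_jobs(n_jobs, n_cpus=None):
    """Hypothetical helper: map an n_jobs value to a number of worker processes."""
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    if n_jobs == 0:
        # Assumption for this sketch: 0 is treated as invalid.
        raise ValueError("n_jobs must not be 0")
    if n_jobs < 0:
        # -1 -> all CPUs, -2 -> all CPUs but one, and so on.
        # Clamping to 1 is an assumption so the result is never below one worker.
        return max(1, n_cpus + 1 + n_jobs)
    return n_jobs
```

On an 8-CPU machine, `effective_n_jobs(-2, n_cpus=8)` gives 7, matching the "all CPUs but one" behavior described above.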
  • v0.9.6(Mar 1, 2019)

    • Fix a bug in TPOT 0.9.5 that prevented the max_time_mins parameter from working when use_dask=True
    • TPOT now saves the best Pareto values and best Pareto pipelines in the checkpoint folder
    • TPOT raises ImportError if operators in the TPOT configuration are not available when verbosity>2
    • Thanks to @PGijsbers for the suggestions. TPOT can now save the scores of individuals already evaluated in any generation, even if the evaluation process of that generation is interrupted or stopped. Note that in this case TPOT raises the warning message: WARNING: TPOT may not provide a good pipeline if TPOT is stopped/interrupted in a early generation., because the pipelines in an early generation, e.g. the 1st generation, have been evolved/modified only a limited number of times by the evolutionary algorithm.
    • Fix bugs in the configuration of TPOTRegressor
    • Fix errors in the documentation
  • v0.9.5(Sep 4, 2018)

    • TPOT now supports integration with Dask for parallelization + smart caching. Big thanks to the Dask dev team for making this happen!

    • TPOT now supports imputation and sparse matrices in the predict and predict_proba functions.

    • TPOTClassifier and TPOTRegressor now follow the scikit-learn estimator API.

    • We refined scoring parameter in TPOT API for accepting Scorer object.

    • We refined parameters in VarianceThreshold and FeatureAgglomeration.

    • TPOT now supports using memory caching within a Pipeline via an optional memory parameter.

    • We improved documentation of TPOT.

  • v0.9(Sep 27, 2017)

    • TPOT now supports sparse matrices with a new built-in TPOT configuration, "TPOT sparse". We are using a custom OneHotEncoder implementation that supports missing values and continuous features.

    • We have added an "early stopping" option for stopping the optimization process if no improvement is made within a set number of generations. Look up the early_stop parameter to access this functionality.

    • TPOT now reduces the number of duplicated pipelines between generations, which saves you time during the optimization process.

    • TPOT now supports custom scoring functions via the command-line mode.

    • We have added a new optional argument, periodic_checkpoint_folder, that allows TPOT to periodically save the best pipeline so far to a local folder during optimization process.

    • TPOT no longer uses sklearn.externals.joblib when n_jobs=1 to avoid the potential freezing issue that scikit-learn suffers from.

    • We have added pandas as a dependency to read input datasets instead of numpy.recfromcsv. NumPy's recfromcsv function is unable to parse datasets with complex data types.

    • Fixed a bug that DEFAULT in the parameter(s) of nested estimator raises KeyError when exporting pipelines.

    • Fixed a bug related to setting random_state in nested estimators. The issue would happen with pipeline with SelectFromModel (ExtraTreesClassifier as nested estimator) or StackingEstimator if nested estimator has random_state parameter.

    • Fixed a bug in the missing value imputation function in TPOT so that it imputes along columns instead of rows.

    • Refined input checking for sparse matrices in TPOT.
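The early stopping option mentioned in these notes can be sketched in a few lines. This is a simplified illustration of the general idea (stop after a set number of generations without improvement), not TPOT's internal logic; the function name `generations_run` and its inputs are hypothetical.

```python
def generations_run(best_scores, early_stop):
    """Return how many generations run before early stopping triggers.

    best_scores: best score observed in each generation (higher is better).
    early_stop: number of consecutive generations without improvement to tolerate.
    """
    best = float("-inf")
    stale = 0  # consecutive generations without a new best score
    for generation, score in enumerate(best_scores, start=1):
        if score > best:
            best = score
            stale = 0
        else:
            stale += 1
        if stale >= early_stop:
            return generation
    return len(best_scores)
```

With scores `[0.80, 0.85, 0.85, 0.85, 0.90]` and `early_stop=2`, the loop stops after generation 4, so the later improvement to 0.90 is never reached. That is the usual trade-off of early stopping.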

  • v0.8(Jun 1, 2017)

    • TPOT now detects whether there are missing values in your dataset and replaces them with the median value of the column.

    • TPOT now allows you to set a groups parameter in the fit function so you can use the GroupKFold cross-validation strategy.

    • TPOT now allows you to set a subsample ratio of the training instance with the subsample parameter. For example, setting subsample=0.5 tells TPOT to create a fixed subsample of half of the training data for the pipeline optimization process. This parameter can be useful for speeding up the pipeline optimization process, but may give less accurate performance estimates from cross-validation.

    • TPOT now has more built-in configurations, including TPOT MDR and TPOT light, for both classification and regression problems.

    • TPOTClassifier and TPOTRegressor now expose three useful internal attributes, fitted_pipeline_, pareto_front_fitted_pipelines_, and evaluated_individuals_. These attributes are described in the API documentation.

    • Oh, TPOT now has thorough API documentation. Check it out!

    • Fixed a reproducibility issue where setting random_seed didn't necessarily result in the same results every time. This bug was present since TPOT v0.7.

    • Refined input checking in TPOT.

    • Removed Python 2 uncompliant code.
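The median imputation described in these notes can be illustrated with the standard library alone. This is a sketch of the general technique (replace each missing value with the median of its column), not TPOT's code; `impute_median` is a hypothetical helper, and it assumes every column has at least one observed value.

```python
import math
from statistics import median


def impute_median(rows):
    """Replace NaN entries with the median of the observed values in the same column."""
    n_cols = len(rows[0])
    medians = []
    for col in range(n_cols):
        observed = [row[col] for row in rows if not math.isnan(row[col])]
        medians.append(median(observed))
    return [
        [medians[col] if math.isnan(value) else value for col, value in enumerate(row)]
        for row in rows
    ]


data = [[1.0, 10.0], [float("nan"), 20.0], [3.0, float("nan")], [5.0, 60.0]]
filled = impute_median(data)
```

Here the first column's median is 3.0 and the second column's is 20.0, so both missing entries are filled column-wise rather than row-wise.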

  • 0.7(Mar 22, 2017)

    TPOT 0.7 is now out, featuring multiprocessing support for Linux and macOS, customizable operator configurations, and more.

    • TPOT now has multiprocessing support (Linux and macOS only). Use the n_jobs parameter in both TPOTClassifier and TPOTRegressor to run pipeline optimization across multiple processes.

    • TPOT now allows you to customize the list of operators and parameters explored during the optimization process with the config_dict parameter. The format of this customized dictionary can be found in the online documentation.

    • TPOT now allows you to specify a time limit for evaluating a single pipeline (default limit is 5 minutes) in optimization process with the max_eval_time_mins parameter, so TPOT won't spend hours evaluating overly-complex pipelines.

    • We tweaked TPOT's underlying evolutionary optimization algorithm to work even better, including using the mu+lambda algorithm. This algorithm gives you more control of how many pipelines are generated every iteration with the offspring_size parameter.

    • Fixed a reproducibility issue where setting random_seed didn't necessarily result in the same results every time. This bug was present since version 0.6.

    • Refined the default operators and parameters in TPOT, so TPOT 0.7 should work even better than 0.6.

    • TPOT now supports sample weights in the fitness function if some of your samples are more important to classify correctly than others. The sample weights option works the same as in scikit-learn, e.g., tpot.fit(x_train, y_train, sample_weights=sample_weights).

    • The default scoring metric in TPOT has been changed from balanced accuracy to accuracy, the same default metric for classification algorithms in scikit-learn. Balanced accuracy can still be used by setting scoring='balanced_accuracy' when creating a TPOT instance.
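The mu+lambda scheme mentioned in these notes can be sketched on a toy problem. This is a deliberately simplified, single-objective illustration that uses only mutation; it is not TPOT's implementation, and the function names and the toy fitness function are assumptions made for this example. The `offspring_size` argument plays the role of lambda.

```python
import random


def mu_plus_lambda_step(population, fitness, mutate, offspring_size, rng):
    """One (mu + lambda) generation: parents and offspring compete for mu slots."""
    mu = len(population)
    offspring = [mutate(rng.choice(population), rng) for _ in range(offspring_size)]
    combined = population + offspring
    # Keep the mu fittest individuals from the union of parents and offspring.
    return sorted(combined, key=fitness, reverse=True)[:mu]


def fitness(x):
    # Toy objective: maximize -(x - 3)^2, so the optimum is x = 3.
    return -(x - 3.0) ** 2


def mutate(x, rng):
    # Gaussian perturbation of a single real-valued individual.
    return x + rng.gauss(0.0, 0.5)


rng = random.Random(0)
population = [rng.uniform(-10.0, 10.0) for _ in range(5)]
initial_best = max(fitness(x) for x in population)
for _ in range(20):
    population = mu_plus_lambda_step(population, fitness, mutate, offspring_size=10, rng=rng)
final_best = max(fitness(x) for x in population)
```

Because parents stay in the selection pool, the best fitness can never get worse from one generation to the next.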

  • v0.6(Sep 2, 2016)

    • TPOT now supports regression problems! We have created two separate TPOTClassifier and TPOTRegressor classes to support classification and regression problems, respectively. The command-line interface also supports this feature through the -mode parameter.
    • TPOT now allows you to specify a time limit for the optimization process with the max_time_mins parameter, so you don't need to guess how long TPOT will take any more to recommend a pipeline to you.
    • Added a new operator that performs feature selection using ExtraTrees feature importance scores.
    • XGBoost has been added as an optional dependency to TPOT. If you have XGBoost installed, TPOT will automatically detect your installation and use the XGBoostClassifier and XGBoostRegressor in its pipelines.
    • TPOT now offers a verbosity level of 3 ("science mode"), which outputs the entire Pareto front instead of only the current best score. This feature may be useful for users looking to make a trade-off between pipeline complexity and score.
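To make the complexity/score trade-off concrete, here is a minimal sketch of a Pareto front over (complexity, score) pairs. It illustrates the concept only and is not how TPOT computes or prints its front; `pareto_front` and the candidate values are made up for this example.

```python
def pareto_front(pipelines):
    """Return the (complexity, score) pairs not dominated by any other pair.

    A pair dominates another if it is no more complex and scores at least as well,
    and is strictly better on at least one of the two.
    """
    front = []
    for c, s in pipelines:
        dominated = any(
            (c2 <= c and s2 >= s) and (c2 < c or s2 > s)
            for c2, s2 in pipelines
        )
        if not dominated:
            front.append((c, s))
    return sorted(front)


# Each tuple is (number of operators in the pipeline, cross-validation score).
candidates = [(1, 0.90), (2, 0.95), (3, 0.94), (4, 0.97), (2, 0.91)]
front = pareto_front(candidates)
```

In this example the three-operator pipeline is dropped because a two-operator pipeline scores higher, which leaves one best option at each useful complexity level.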
  • v0.5(Aug 20, 2016)

    After a couple months hiatus in refactor land, we're excited to release the latest and greatest version of TPOT v0.5. For the past couple months, we worked on heavily refactoring TPOT's code base from a hacky research demo into a more elegant code base that will be easier to maintain in the long run. As an added bonus, TPOT now directly optimizes over and exports to scikit-learn Pipeline objects, so your auto-generated code should be much more readable.

    Major changes in v0.5:

    • Major refactor: Each operator is defined in a separate class file. Hooray for easier-to-maintain code!
    • TPOT now exports directly to scikit-learn Pipelines instead of hacky code.
    • Internal representation of individuals now uses scikit-learn pipelines.
    • Parameters for each operator have been optimized so TPOT spends less time exploring useless parameters.
    • We have removed pandas as a dependency and instead use numpy matrices to store the data.
    • TPOT now uses k-fold cross-validation when evaluating pipelines, with a default k = 3. This k parameter can be tuned when creating a new TPOT instance.
    • Improved scoring function support: Even though TPOT uses balanced accuracy by default, you can now have TPOT use any of the scoring functions that cross_val_score supports.
    • Added the scikit-learn Normalizer preprocessor.
    • Minor text fixes.
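The k-fold cross-validation mentioned in these notes splits the data into k folds and holds each one out in turn. The sketch below shows index generation for contiguous folds with the k = 3 default from the notes; it is an illustration of the technique, not TPOT's or scikit-learn's code, and `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_samples, k=3):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    # The first (n_samples % k) folds get one extra sample each.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


folds = list(k_fold_indices(7, k=3))
```

Every sample lands in exactly one test fold, so each pipeline is scored on data it was not trained on in that fold.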
  • 0.4(Jun 23, 2016)

    In TPOT 0.4, we've made some major changes to the internals of TPOT and added some convenience functions. We've summarized the changes below.

    • Added new sklearn models and preprocessors
      • AdaBoostClassifier
      • BernoulliNB
      • ExtraTreesClassifier
      • GaussianNB
      • MultinomialNB
      • LinearSVC
      • PassiveAggressiveClassifier
      • GradientBoostingClassifier
      • RBFSampler
      • FastICA
      • FeatureAgglomeration
      • Nystroem
    • Added operator that inserts virtual features for the count of features with values of zero
    • Reworked parameterization of TPOT operators
      • Reduced parameter search space with information from a scikit-learn benchmark
      • TPOT no longer generates arbitrary parameter values, but uses a fixed parameter set instead
    • Removed XGBoost as a dependency
      • Too many users were having install issues with XGBoost
      • Replaced with scikit-learn's GradientBoostingClassifier
    • Improved descriptiveness of TPOT command line parameter documentation
    • Removed min/max/avg details during fit() when verbosity > 1
      • Replaced with tqdm progress bar
      • Added tqdm as a dependency
    • Added fit_predict() convenience function
    • Added get_params() function so TPOT can operate in scikit-learn's cross_val_score & related functions
  • v0.2.8(Mar 6, 2016)

  • 0.2.1(Feb 3, 2016)

    This is the version of TPOT that was used in the GECCO 2016 paper, "Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science."

  • v0.2.0(Dec 7, 2015)

    New in v0.2.0:

    • TPOT now has the ability to export the optimized pipelines to sklearn code. See the documentation for more information.
    • Logistic regression, SVM, and k-nearest neighbors classifiers were added as pipeline operators. Previously, TPOT only included decision tree and random forest classifiers.
    • TPOT can now use arbitrary scoring functions for the optimization process. See the scoring function documentation for more information.
Owner
Epistasis Lab at UPenn
Prof. Jason H. Moore's research lab at the University of Pennsylvania