a delightful machine learning tool that allows you to train, test and use models without writing code

Overview

igel



A delightful machine learning tool that allows you to train/fit, test and use models without writing code

Note

I'm also working on a GUI desktop app for igel based on people's requests. You can find it under Igel-UI. Please consider supporting the project!



Motivation & Goal

The goal of the project is to provide machine learning for everyone, both technical and non-technical users.

I sometimes needed a tool that I could use to quickly create a machine learning prototype, whether to build a proof of concept or to create a fast draft model to prove a point. I often found myself stuck writing boilerplate code and/or thinking too much about how to start.

Therefore, I decided to create igel. Hopefully, it will make it easier for technical and non-technical users to build machine learning models.

Features

  • Usage from GUI
  • Supports most dataset types (csv, txt, excel, json, html)
  • Supports all state of the art machine learning models (even preview models)
  • Supports different data preprocessing methods
  • Provides flexibility and data control while writing configurations
  • Supports cross validation
  • Supports hyperparameter search, both grid search and random search (version >= 0.2.8)
  • Supports yaml and json format
  • Supports different sklearn metrics for regression, classification and clustering
  • Supports multi-output/multi-target regression and classification
  • Supports multi-processing for parallel model construction

Intro

igel is built on top of scikit-learn. It provides a simple way to use machine learning without writing a single line of code.

All you need is a yaml (or json) file, where you need to describe what you are trying to do. That's it!

Igel supports all of sklearn's machine learning functionality, whether regression, classification or clustering. Precisely, you can use 63 different machine learning models in igel.

Igel supports the most commonly used dataset types in the data science field. For instance, your input dataset can be a csv, txt, excel sheet, json or even an html file that you want to fetch. All these types are supported by igel. In the background, igel uses pandas to read and convert your input dataset to a dataframe.

Unlike other ML tools, igel is lightweight in the sense that it has minimal dependencies. Precisely, igel uses pandas in the background for data manipulation/preprocessing and sklearn for the machine learning part. Hence, it depends only on these two famous packages.

Installation

  • The easiest way is to install igel using pip
$ pip install -U igel
  • Check the docs for other ways to install igel from source

Running with Docker

  • Use the official image (recommended):

You can pull the image first from docker hub

$ docker pull nidhaloff/igel

Then use it:

$ docker run -it --rm -v $(pwd):/data nidhaloff/igel fit -yml 'your_file.yaml' -dp 'your_dataset.csv'
  • Alternatively, you can create your own image locally if you want:

You can run igel inside of docker by first building the image:

$ docker build -t igel .

And then running it and attaching your current directory (does not need to be the igel directory) as /data (the workdir) inside of the container:

$ docker run -it --rm -v $(pwd):/data igel fit -yml 'your_file.yaml' -dp 'your_dataset.csv'

Models

Igel's supported models:

+--------------------+----------------------------+-------------------------+
|      regression    |        classification      |        clustering       |
+--------------------+----------------------------+-------------------------+
|   LinearRegression |         LogisticRegression |                  KMeans |
|              Lasso |                      Ridge |     AffinityPropagation |
|          LassoLars |               DecisionTree |                   Birch |
| BayesianRegression |                  ExtraTree | AgglomerativeClustering |
|    HuberRegression |               RandomForest |    FeatureAgglomeration |
|              Ridge |                 ExtraTrees |                  DBSCAN |
|  PoissonRegression |                        SVM |         MiniBatchKMeans |
|      ARDRegression |                  LinearSVM |    SpectralBiclustering |
|  TweedieRegression |                      NuSVM |    SpectralCoclustering |
| TheilSenRegression |            NearestNeighbor |      SpectralClustering |
|    GammaRegression |              NeuralNetwork |               MeanShift |
|   RANSACRegression | PassiveAgressiveClassifier |                  OPTICS |
|       DecisionTree |                 Perceptron |                    ---- |
|          ExtraTree |               BernoulliRBM |                    ---- |
|       RandomForest |           BoltzmannMachine |                    ---- |
|         ExtraTrees |       CalibratedClassifier |                    ---- |
|                SVM |                   Adaboost |                    ---- |
|          LinearSVM |                    Bagging |                    ---- |
|              NuSVM |           GradientBoosting |                    ---- |
|    NearestNeighbor |        BernoulliNaiveBayes |                    ---- |
|      NeuralNetwork |      CategoricalNaiveBayes |                    ---- |
|         ElasticNet |       ComplementNaiveBayes |                    ---- |
|       BernoulliRBM |         GaussianNaiveBayes |                    ---- |
|   BoltzmannMachine |      MultinomialNaiveBayes |                    ---- |
|           Adaboost |                       ---- |                    ---- |
|            Bagging |                       ---- |                    ---- |
|   GradientBoosting |                       ---- |                    ---- |
+--------------------+----------------------------+-------------------------+

Quick Start

Run igel version to check the version.

Run igel info to get metadata about the project.

You can run the help command to get instructions:

$ igel --help

# or just

$ igel -h
"""
Take some time and read the output of the help command. You'll save time later if you understand how to use igel.
"""
  • Demo:

assets/igel-help.gif


The first step is to provide a yaml file (you can also use json if you want).

You can do this manually by creating a .yaml file (called igel.yaml by convention, but you can name it whatever you want) and editing it yourself. However, if you are lazy (and you probably are, like me :D), you can use the igel init command to get started fast, which will create a basic config file for you on the fly.

"""
igel init <args>
possible optional args are: (notice that these args are optional, so you can also just run igel init if you want)
-type: regression, classification or clustering
-model: model you want to use
-target: target you want to predict


Example:
If I want to use neural networks to classify whether someone is sick or not using the indian-diabetes dataset,
then I would use this command to initialize a yaml file (n.b. you may need to rename the outcome column in the .csv to sick):
$ igel init -type "classification" -model "NeuralNetwork" -target "sick"
"""
$ igel init

After running the command, an igel.yaml file will be created for you in the current working directory. You can check it out and modify it if you want to, otherwise you can also create everything from scratch.

  • Demo:

assets/igel-init.gif


# model definition
model:
    # in the type field, you can write the type of problem you want to solve. Whether regression, classification or clustering
    # Then, provide the algorithm you want to use on the data. Here I'm using the random forest algorithm
    type: classification
    algorithm: RandomForest     # make sure you write the name of the algorithm in pascal case
    arguments:
        n_estimators: 100   # here, I set the number of estimators (or trees) to 100
        max_depth: 30       # set the max_depth of the tree

# target you want to predict
# Here, as an example, I'm using the famous indian-diabetes dataset, where I want to predict whether someone has diabetes or not.
# Depending on your data, you need to provide the target(s) you want to predict here
target:
    - sick

In the example above, I'm using random forest to classify whether someone has diabetes or not, depending on some features in the dataset. I used the famous indian-diabetes dataset in this example.

Notice that I passed n_estimators and max_depth as additional arguments to the model. If you don't provide arguments, then the defaults will be used. You don't have to memorize the arguments for each model. You can always run igel models in your terminal, which will put you in an interactive mode, where you will be prompted to enter the model you want to use and the type of the problem you want to solve. Igel will then show you information about the model and a link that you can follow to see a list of available arguments and how to use them.

  • The expected way to use igel is from the terminal (igel CLI):

Run this command in the terminal to fit/train a model, where you provide the path to your dataset and the path to the yaml file:

$ igel fit --data_path 'path_to_your_csv_dataset.csv' --yaml_file 'path_to_your_yaml_file.yaml'

# or shorter

$ igel fit -dp 'path_to_your_csv_dataset.csv' -yml 'path_to_your_yaml_file.yaml'

"""
That's it. Your "trained" model can now be found in the model_results folder
(automatically created for you in your current working directory).
Furthermore, a description can be found in the description.json file inside the model_results folder.
"""
  • Demo:

assets/igel-fit.gif


You can then evaluate the trained/pre-fitted model:

$ igel evaluate -dp 'path_to_your_evaluation_dataset.csv'
"""
This will automatically generate an evaluation.json file in the current directory, where all evaluation results are stored
"""
  • Demo:

assets/igel-eval.gif


Finally, you can use the trained/pre-fitted model to make predictions if you are happy with the evaluation results:

$ igel predict -dp 'path_to_your_test_dataset.csv'
"""
This will generate a predictions.csv file in your current directory, where all predictions are stored
"""
  • Demo:

assets/igel-pred.gif

assets/igel-predict.gif


You can combine the train, evaluate and predict phases using a single command called experiment:

$ igel experiment -DP "path_to_train_data path_to_eval_data path_to_test_data" -yml "path_to_yaml_file"

"""
This will run fit using train_data, evaluate using eval_data and further generate predictions using the test_data
"""
  • Demo:

assets/igel-experiment.gif

  • Alternatively, you can also write code if you want to:
from igel import Igel

# provide the arguments in a dictionary
params = {
        'cmd': 'fit',    # provide the command you want to use. whether fit, evaluate or predict
        'data_path': 'path_to_your_dataset',
        'yaml_path': 'path_to_your_yaml_file'
}

Igel(**params)
"""
check the examples folder for more
"""

Interactive Mode

Interactive mode is new in >= v0.2.6

This mode basically offers you the freedom to provide arguments in your own way. You are not forced to pass the arguments directly when typing the command.

In practice, this means that you can use the commands (fit, evaluate, predict, experiment, etc.) without specifying any additional arguments. For example:

igel fit

if you just write this and hit enter, you will be prompted to provide the additional mandatory arguments. Any version <= 0.2.5 will throw an error in this case, which is why you need to make sure that you have version >= 0.2.6.

  • Demo (init command):

assets/igel-init-interactive.gif

  • Demo (fit command):

assets/igel-fit-interactive.gif

As you can see, you don't need to memorize the arguments; you can just let igel ask you to enter them. Igel will provide you with a nice message explaining which argument you need to enter.

The value between brackets represents the default value. This means if you provide no value and hit return, then the value between brackets will be taken as the default value.

Overview

The main goal of igel is to provide you with a way to train/fit, evaluate and use models without writing code. Instead, all you need is to provide/describe what you want to do in a simple yaml file.

Basically, you provide descriptions, or rather configurations, in the yaml file as key-value pairs. Here is an overview of all supported configurations (for now):

# dataset operations
dataset:
    type: csv  # [str] -> type of your dataset
    read_data_options: # options you want to supply for reading your data (See the detailed overview about this in the next section)
        sep:  # [str] -> Delimiter to use.
        delimiter:  # [str] -> Alias for sep.
        header:     # [int, list of int] -> Row number(s) to use as the column names, and the start of the data.
        names:  # [list] -> List of column names to use
        index_col: # [int, str, list of int, list of str, False] -> Column(s) to use as the row labels of the DataFrame,
        usecols:    # [list, callable] -> Return a subset of the columns
        squeeze:    # [bool] -> If the parsed data only contains one column then return a Series.
        prefix:     # [str] -> Prefix to add to column numbers when no header, e.g. ‘X’ for X0, X1, …
        mangle_dupe_cols:   # [bool] -> Duplicate columns will be specified as ‘X’, ‘X.1’, …’X.N’, rather than ‘X’…’X’. Passing in False will cause data to be overwritten if there are duplicate names in the columns.
        dtype:  # [Type name, dict mapping column name to type] -> Data type for data or columns
        engine:     # [str] -> Parser engine to use. The C engine is faster while the python engine is currently more feature-complete.
        converters: # [dict] -> Dict of functions for converting values in certain columns. Keys can either be integers or column labels.
        true_values: # [list] -> Values to consider as True.
        false_values: # [list] -> Values to consider as False.
        skipinitialspace: # [bool] -> Skip spaces after delimiter.
        skiprows: # [list-like] -> Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.
        skipfooter: # [int] -> Number of lines at bottom of file to skip
        nrows: # [int] -> Number of rows of file to read. Useful for reading pieces of large files.
        na_values: # [scalar, str, list, dict] ->  Additional strings to recognize as NA/NaN.
        keep_default_na: # [bool] ->  Whether or not to include the default NaN values when parsing the data.
        na_filter: # [bool] -> Detect missing value markers (empty strings and the value of na_values). In data without any NAs, passing na_filter=False can improve the performance of reading a large file.
        verbose: # [bool] -> Indicate number of NA values placed in non-numeric columns.
        skip_blank_lines: # [bool] -> If True, skip over blank lines rather than interpreting as NaN values.
        parse_dates: # [bool, list of int, list of str, list of lists, dict] ->  try parsing the dates
        infer_datetime_format: # [bool] -> If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them.
        keep_date_col: # [bool] -> If True and parse_dates specifies combining multiple columns then keep the original columns.
        dayfirst: # [bool] -> DD/MM format dates, international and European format.
        cache_dates: # [bool] -> If True, use a cache of unique, converted dates to apply the datetime conversion.
        thousands: # [str] -> the thousands separator
        decimal: # [str] -> Character to recognize as decimal point (e.g. use ‘,’ for European data).
        lineterminator: # [str] -> Character to break file into lines.
        escapechar: # [str] ->  One-character string used to escape other characters.
        comment: # [str] -> Indicates remainder of line should not be parsed. If found at the beginning of a line, the line will be ignored altogether. This parameter must be a single character.
        encoding: # [str] -> Encoding to use for UTF when reading/writing (ex. ‘utf-8’).
        dialect: # [str, csv.Dialect] -> If provided, this parameter will override values (default or not) for the following parameters: delimiter, doublequote, escapechar, skipinitialspace, quotechar, and quoting
        delim_whitespace: # [bool] -> Specifies whether or not whitespace (e.g. ' ' or '    ') will be used as the sep
        low_memory: # [bool] -> Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference.
        memory_map: # [bool] -> If a filepath is provided for filepath_or_buffer, map the file object directly onto memory and access the data directly from there. Using this option can improve performance because there is no longer any I/O overhead.

    random_numbers: # random numbers options in case you wanted to generate the same random numbers on each run
        generate_reproducible:  # [bool] -> set this to true to generate reproducible results
        seed:   # [int] -> the seed number is optional. A seed will be set up for you if you didn't provide any

    split:  # split options
        test_size: 0.2  #[float] -> 0.2 means 20% for the test data, so 80% are automatically for training
        shuffle: true   # [bool] -> whether to shuffle the data before/while splitting
        stratify: None  # [list, None] -> If not None, data is split in a stratified fashion, using this as the class labels.

    preprocess: # preprocessing options
        missing_values: mean    # [str] -> other possible values: [drop, median, most_frequent, constant] check the docs for more
        encoding:
            type: oneHotEncoding  # [str] -> other possible values: [labelEncoding]
        scale:  # scaling options
            method: standard    # [str] -> standardization will scale values to have a 0 mean and 1 standard deviation  | you can also try minmax
            target: inputs  # [str] -> scale inputs. | other possible values: [outputs, all] # if you choose all then all values in the dataset will be scaled


# model definition
model:
    type: classification    # [str] -> type of the problem you want to solve. | possible values: [regression, classification, clustering]
    algorithm: NeuralNetwork    # [str (notice the pascal case)] -> which algorithm you want to use. | type igel algorithms in the Terminal to know more
    arguments:          # model arguments: you can check the available arguments for each model by running igel help in your terminal
    use_cv_estimator: false     # [bool] -> if this is true, the CV class of the specific model will be used if it is supported
    cross_validate:
        cv: # [int] -> number of kfold (default 5)
        n_jobs:   # [signed int] -> The number of CPUs to use to do the computation (default None)
        verbose: # [int] -> The verbosity level. (default 0)
    hyperparameter_search:
        method: grid_search   # method you want to use: grid_search and random_search are supported
        parameter_grid:     # put your parameters grid here that you want to use, an example is provided below
            param1: [val1, val2]
            param2: [val1, val2]
        arguments:  # additional arguments you want to provide for the hyperparameter search
            cv: 5   # number of folds
            refit: true   # whether to refit the model after the search
            return_train_score: false   # whether to return the train score
            verbose: 0      # verbosity level

# target you want to predict
target:  # list of strings: basically put here the column(s), you want to predict that exist in your csv dataset
    - put the target you want to predict here
    - you can list multiple targets if you are making a multi-output prediction
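
For a multi-output setup, you simply list several columns under target. The column names below are placeholders for whatever columns exist in your own dataset:

target:
    - first_target_column
    - second_target_column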

Read Data Options

Note

igel uses pandas under the hood to read & parse the data. Hence, you can also find these optional data parameters in the official pandas documentation.

A detailed overview of the configurations you can provide in the yaml (or json) file is given below. Notice that you will certainly not need all the configuration values for the dataset. They are optional. Generally, igel will figure out how to read your dataset.

However, you can help it by providing extra fields using this read_data_options section. For example, one of the helpful values in my opinion is the "sep", which defines how your columns in the csv dataset are separated. Generally, csv datasets are separated by commas, which is also the default value here. However, it may be separated by a semicolon in your case.

Hence, you can provide this in the read_data_options. Just add the sep: ";" under read_data_options.
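
For example, a minimal dataset section for a semicolon-separated csv would look like this (only the sep option is set; the remaining options keep their defaults):

dataset:
    type: csv
    read_data_options:
        sep: ";"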

Supported Read Data Options
Parameter Type Explanation
sep str, default ‘,’ Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'.
delimiter default None Alias for sep.
header int, list of int, default ‘infer’ Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names: if no names are passed the behavior is identical to header=0 and column names are inferred from the first line of the file, if column names are passed explicitly then the behavior is identical to header=None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.
names array-like, optional List of column names to use. If the file contains a header row, then you should explicitly pass header=0 to override the column names. Duplicates in this list are not allowed.
index_col int, str, sequence of int / str, or False, default None Column(s) to use as the row labels of the DataFrame, either given as string name or column index. If a sequence of int / str is given, a MultiIndex is used. Note: index_col=False can be used to force pandas to not use the first column as the index, e.g. when you have a malformed file with delimiters at the end of each line.
usecols list-like or callable, optional Return a subset of the columns. If list-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s). For example, a valid list-like usecols parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Element order is ignored, so usecols=[0, 1] is the same as [1, 0]. To instantiate a DataFrame from data with element order preserved use pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']] for columns in ['foo', 'bar'] order or pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']] for ['bar', 'foo'] order. If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True. An example of a valid callable argument would be lambda x: x.upper() in ['AAA', 'BBB', 'DDD']. Using this parameter results in much faster parsing time and lower memory usage.
squeeze bool, default False If the parsed data only contains one column then return a Series.
prefix str, optional Prefix to add to column numbers when no header, e.g. ‘X’ for X0, X1, …
mangle_dupe_cols bool, default True Duplicate columns will be specified as ‘X’, ‘X.1’, …’X.N’, rather than ‘X’…’X’. Passing in False will cause data to be overwritten if there are duplicate names in the columns.
dtype Type name or dict of column -> type, optional Data type for data or columns.
engine {‘c’, ‘python’}, optional Parser engine to use. The C engine is faster while the python engine is currently more feature-complete.
converters dict, optional Dict of functions for converting values in certain columns. Keys can either be integers or column labels.
true_values list, optional Values to consider as True.
false_values list, optional Values to consider as False.
skipinitialspace bool, default False Skip spaces after delimiter.
skiprows list-like, int or callable, optional Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file. If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].
skipfooter int, default 0 Number of lines at bottom of file to skip (Unsupported with engine=’c’).
nrows int, optional Number of rows of file to read. Useful for reading pieces of large files.
na_values scalar, str, list-like, or dict, optional Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.
keep_default_na bool, default True Whether or not to include the default NaN values when parsing the data. Depending on whether na_values is passed in, the behavior is as follows: If keep_default_na is True, and na_values are specified, na_values is appended to the default NaN values used for parsing. If keep_default_na is True, and na_values are not specified, only the default NaN values are used for parsing. If keep_default_na is False, and na_values are specified, only the NaN values specified na_values are used for parsing. If keep_default_na is False, and na_values are not specified, no strings will be parsed as NaN. Note that if na_filter is passed in as False, the keep_default_na and na_values parameters will be ignored.
na_filter bool, default True Detect missing value markers (empty strings and the value of na_values). In data without any NAs, passing na_filter=False can improve the performance of reading a large file.
verbose bool, default False Indicate number of NA values placed in non-numeric columns.
skip_blank_lines bool, default True If True, skip over blank lines rather than interpreting as NaN values.
parse_dates bool or list of int or names or list of lists or dict, default False The behavior is as follows: boolean. If True -> try parsing the index. list of int or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column. list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column. dict, e.g. {‘foo’ : [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’ If a column or index cannot be represented as an array of datetimes, say because of an unparseable value or a mixture of timezones, the column or index will be returned unaltered as an object data type.
infer_datetime_format bool, default False If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by 5-10x.
keep_date_col bool, default False If True and parse_dates specifies combining multiple columns then keep the original columns.
date_parser function, optional Function to use for converting a sequence of string columns to an array of datetime instances. The default uses dateutil.parser.parser to do the conversion. Pandas will try to call date_parser in three different ways, advancing to the next if an exception occurs: 1) Pass one or more arrays (as defined by parse_dates) as arguments; 2) concatenate (row-wise) the string values from the columns defined by parse_dates into a single array and pass that; and 3) call date_parser once for each row using one or more strings (corresponding to the columns defined by parse_dates) as arguments.
dayfirst bool, default False DD/MM format dates, international and European format.
cache_dates bool, default True If True, use a cache of unique, converted dates to apply the datetime conversion. May produce significant speed-up when parsing duplicate date strings, especially ones with timezone offsets.
thousands str, optional Thousands separator.
decimal str, default ‘.’ Character to recognize as decimal point (e.g. use ‘,’ for European data).
lineterminator str (length 1), optional Character to break file into lines. Only valid with C parser.
escapechar str (length 1), optional One-character string used to escape other characters.
comment str, optional Indicates remainder of line should not be parsed. If found at the beginning of a line, the line will be ignored altogether.
encoding str, optional Encoding to use for UTF when reading/writing (ex. ‘utf-8’).
dialect str or csv.Dialect, optional If provided, this parameter will override values (default or not) for the following parameters: delimiter, doublequote, escapechar, skipinitialspace, quotechar, and quoting
low_memory bool, default True Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed types either set False, or specify the type with the dtype parameter. Note that the entire file is read into a single DataFrame regardless,
memory_map bool, default False map the file object directly onto memory and access the data directly from there. Using this option can improve performance because there is no longer any I/O overhead.

E2E Example

A complete end to end solution is provided in this section to demonstrate the capabilities of igel. As explained previously, you need to create a yaml configuration file. Here is an end to end example for predicting whether someone has diabetes or not using the decision tree algorithm. The dataset can be found in the examples folder.

  • Fit/Train a model:
model:
    type: classification
    algorithm: DecisionTree

target:
    - sick
$ igel fit -dp path_to_the_dataset -yml path_to_the_yaml_file

That's it, igel will now fit the model for you and save it in a model_results folder in your current directory.

  • Evaluate the model:

Evaluate the pre-fitted model. Igel will load the pre-fitted model from the model_results directory and evaluate it for you. You just need to run the evaluate command and provide the path to your evaluation data.

$ igel evaluate -dp path_to_the_evaluation_dataset

That's it! Igel will evaluate the model and store statistics/results in an evaluation.json file inside the model_results folder

  • Predict:

Use the pre-fitted model to predict on new data. This is done automatically by igel; you just need to provide the path to the data you want to run predictions on.

$ igel predict -dp path_to_the_new_dataset

That's it! Igel will use the pre-fitted model to make predictions and save them in a predictions.csv file inside the model_results folder

Advanced Usage

You can also carry out some preprocessing methods or other operations by providing them in the yaml file. Here is an example, where the data is split into 80% for training and 20% for validation/testing. Also, the data is shuffled while splitting.

Furthermore, the data is preprocessed by replacing missing values with the mean (you can also use median, mode, etc.). Check the docs for more information.

# dataset operations
dataset:
    split:
        test_size: 0.2
        shuffle: True
        stratify: default

    preprocess: # preprocessing options
        missing_values: mean    # other possible values: [drop, median, most_frequent, constant] check the docs for more
        encoding:
            type: oneHotEncoding  # other possible values: [labelEncoding]
        scale:  # scaling options
            method: standard    # standardization will scale values to have a 0 mean and 1 standard deviation  | you can also try minmax
            target: inputs  # scale inputs. | other possible values: [outputs, all] # if you choose all then all values in the dataset will be scaled

# model definition
model:
    type: classification
    algorithm: RandomForest
    arguments:
        # notice that this is the available args for the random forest model. check different available args for all supported models by running igel help
        n_estimators: 100
        max_depth: 20

# target you want to predict
target:
    - sick

Then, you can fit the model by running the igel command as shown in the other examples

$ igel fit -dp path_to_the_dataset -yml path_to_the_yaml_file

For evaluation

$ igel evaluate -dp path_to_the_evaluation_dataset

For production

$ igel predict -dp path_to_the_new_dataset
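
The same yaml file can also drive a hyperparameter search, using the fields shown in the configuration overview above. The sketch below is illustrative only: the grid values are made up and RandomForest is just an example model.

model:
    type: classification
    algorithm: RandomForest
    hyperparameter_search:
        method: grid_search   # random_search is also supported
        parameter_grid:
            n_estimators: [100, 200]
            max_depth: [10, 20]
        arguments:
            cv: 5
            refit: true

target:
    - sick

A plain cross validation run (without a search) can be configured in the same way by filling in the cross_validate section instead, as shown in the overview.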

Examples

In the examples folder in the repository, you will find a data folder, where the famous indian-diabetes, iris and linnerud (from sklearn) datasets are stored. Furthermore, there are end to end examples inside each folder, where there are scripts and yaml files that will help you get started.

The indian-diabetes-example folder contains two examples to help you get started:

  • The first example is using a neural network, where the configurations are stored in the neural-network.yaml file
  • The second example is using a random forest, where the configurations are stored in the random-forest.yaml file

The iris-example folder contains a logistic regression example, where some preprocessing (one hot encoding) is conducted on the target column to show you more of the capabilities of igel.

Furthermore, the multioutput-example contains a multi-output regression example. Finally, the cv-example contains an example of using the Ridge classifier with cross validation.

You can also find cross validation and hyperparameter search examples in the folder.

I suggest you play around with the examples and igel cli. However, you can also directly execute the fit.py, evaluate.py and predict.py if you want to.

GUI

You can also run the igel UI if you are not familiar with the terminal. Just install igel on your machine as mentioned above. Then run this single command in your terminal

$ igel gui

This will open up the gui, which is very simple to use. Check out examples of how the gui looks and how to use it here: https://github.com/nidhaloff/igel-ui


Contributions

You think this project is useful and you want to bring new ideas, new features, bug fixes, extend the docs?

Contributions are always welcome. Make sure you read the guidelines first

License

MIT license

Copyright (c) 2020-present, Nidhal Baccouri

Comments
  • provide a basic GUI for users who prefer to use a GUI


    Description

    Users should have the flexibility of using a simple GUI if they don't want to use the CLI. A simple GUI can be made with Tkinter maybe? It's important to not add any dependency for this. It would be better to implement this using an existing python module.

    enhancement good first issue Epic contribution feature hacktoberfest-accepted 
    opened by nidhaloff 47
  • Support for ONNX export


    @nidhaloff This PR is to add ONNX export of sklearn models using the command igel export -dp "path_to_pre-fitted_sklearn_model" Please kindly review and do let me know your suggestions. This solves issue #72

    enhancement 
    opened by VishnuVardhanSaiLanka 15
  • Let's discuss how are you using igel and which updates/features would you rather see in the future


    Description

    Hi, I'm opening this to have a closer discussion with people who are using igel. Let's discuss together which advantages/disadvantages igel has from your point of view and, more importantly, which things you would change and which features you would want to see implemented in the future.

    the discussion moved to https://github.com/nidhaloff/igel/discussions/71

    question contribution feature discussion feedback 
    opened by nidhaloff 13
  • K medoids support


    Pull request fixes ISSUE #92

    • Updated requirements document with the version for scikit-learn-extra
    • Imported and added KMedoids algorithm into the model dictionary (attributes still need to be specified as per sklearn_extra docs) Here, for this clustering algorithm, tolerance and n_init must be removed from yaml and replaced with metric (which can be euclidean, cosine, manhattan, etc.) and method ('alternate' is the default method optimized for speed, 'pam' is more accurate in clustering but slower)
    • Tested with existing dataset
    • Added a conditional statement, where if the model class name is KMedoids, since KMedoids does not have a score function, the underlying inertia_ attribute is used (negative of inertia gives score, as per docs?)

    ---EDIT---

    • Removed external dependency
    • Implemented KMedoids class in extras folder based on original implementation, with minor changes for ease of documentation
    • Only PAM algorithm has been implemented with two initialization methods: random, and heuristic
    • Tested with yaml file on both fit and predict options
    • Added required copyright from original source code
    enhancement 
    opened by anjali-rgpt 12
  • re-write the cli using click (or maybe typer?)


    Description

    I'm the creator and only maintainer of the project at the moment. I'm working on adding new features and thus I would like to let this issue open for newcomers who want to contribute to the project.

    Basically, I wrote the cli using argparse since it is already part of the standard library. However, I'm starting to rethink this choice because it has some issues that the click library has already overcome.

    With that said, it would be great to re-write the cli in click or even in typer, which also uses click under the hood but adds more features.

    If someone wants to work on this, please feel free to start directly, you don't need to ask for permission.

    PS: Feel free to suggest other libraries. I just suggested click since I'm familiar with it

    I hope not, but if this issue stays open for a long time, then I will start working on it myself.

    enhancement help wanted good first issue easy contribution feature discussion 
    opened by nidhaloff 11
  • Add docker support


    Description

    create a docker image so that people can just docker pull the igel container and therefore they can use igel without installing anything.

    If someone wants to contribute and work on this, then feel free to create a docker folder in the repo and work on this feature there

    enhancement good first issue contribution feature 
    opened by nidhaloff 9
  • Reformat igel -h output


    As the title says, I think igel -h should print less than 80 characters per line. As a result, users can read it more easily.

    I have to use the terminal in a maximized state to read the page.

    opened by dinhanhx 7
  • provide a way to use deep neural networks


    Description

    Igel is built on top of sklearn at the moment. Therefore, All sklearn models can be used. This includes of course the neural network models integrated in sklearn (MLP classifier and MLP regressor). However, sklearn is not powerful enough when it comes to deep neural networks. Therefore, this issue aims to include support for using deep neural networks in igel. Maybe Keras API?

    Example:

    model:
         type: classification # or regression
        algorithm: neural network   # this is already implemented. However, it is using the sklearn NN implementation
        arguments: default   # this will use the default argument of the NN class
    

    As you can see, the user can provide these configs in the yaml file and igel will train a neural network. However, the NN model in sklearn is not as powerful as those in other frameworks like keras, tensorflow, torch etc.

    What I mean with this issue and want to implement in the future is maybe something like this (feel free to bring new ideas):

    model:
         deep: true
         type: classification # or regression
        algorithm: neural network   # this will now use the Keras NN model since the user added deep: true 
        arguments: default 
    

    OR Maybe even this can be implemented as a VISION ( This will probably take a long time to implement):

    model:
         uses: keras  # the user can here provide keras, tensorflow or torch 
         type: classification # or regression
        algorithm: neural network   # this will now use the Keras NN model since the user provided that he wants to use keras
        arguments: default 
    
    enhancement good first issue Epic contribution 
    opened by nidhaloff 7
  • Can't get it working like in Quick-Start


    • igel version: 0.3.1 (latest pip)
    • Python version: 3.8.5
    • Operating System: docker on top of Ubuntu 16.04.6 LTS (4.4.0)

    Description

    Very new to ML, don't know what and how to do something with the Igel. I followed the Quick-Start Demo to get an Idea.

    • Installed Igel
    • Downloaded the archive.zip from https://www.kaggle.com/uciml/pima-indians-diabetes-database and put diabetes.csv in working Folder
    • Followed Quick-Start

    Resulted in this igel.yaml:

    dataset:
      preprocess:
        missing_values: mean
        scale:
          method: standard
          target: inputs
      split:
        shuffle: true
        test_size: 0.1
      type: csv
    # model definition
    model:
        # in the type field, you can write the type of problem you want to solve. Whether regression, classification or clustering
        # Then, provide the algorithm you want to use on the data. Here I'm using the random forest algorithm
        type: classification
        algorithm: RandomForest     # make sure you write the name of the algorithm in pascal case
        arguments:
            n_estimators: 100   # here, I set the number of estimators (or trees) to 100
            max_depth: 30       # set the max_depth of the tree
    
    # target you want to predict
    # Here, as an example, I'm using the famous indians-diabetes dataset, where I want to predict whether someone have diabetes or not.
    # Depending on your data, you need to provide the target(s) you want to predict here
    target:
        - sick
    

    What I Did

    ... with having a big question mark above my head:

    $ igel fit -dp 'diabetes.csv' -yml 'igel.yaml' 
    
             _____          _       _
            |_   _| __ __ _(_)_ __ (_)_ __   __ _
              | || '__/ _` | | '_ \| | '_ \ / _` |
              | || | | (_| | | | | | | | | | (_| |
              |_||_|  \__,_|_|_| |_|_|_| |_|\__, |
                                            |___/
            
    INFO - Entered CLI args: {'data_path': 'diabetes.csv', 'yaml_path': 'igel.yaml', 'cmd': 'fit'}
    INFO - Executing command: fit ...
    INFO - reading data from diabetes.csv
    INFO - You passed the configurations as a yaml file.
    INFO - your chosen configuration: {'dataset': {'preprocess': {'missing_values': 'mean', 'scale': {'method': 'standard', 'target': 'inputs'}}, 'split': {'shuffle': True, 'test_size': 0.1}, 'type': 'csv'}, 'model': {'type': 'classification', 'algorithm': 'RandomForest', 'arguments': {'n_estimators': 100, 'max_depth': 30}}, 'target': ['sick']}
    INFO - dataset_props: {'preprocess': {'missing_values': 'mean', 'scale': {'method': 'standard', 'target': 'inputs'}}, 'split': {'shuffle': True, 'test_size': 0.1}, 'type': 'csv'} 
    model_props: {'type': 'classification', 'algorithm': 'RandomForest', 'arguments': {'n_estimators': 100, 'max_depth': 30}} 
     target: ['sick'] 
    
    INFO - dataset shape: (768, 9)
    INFO - dataset attributes: ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
    INFO - Check for missing values in the dataset ...  
    Pregnancies                 0
    Glucose                     0
    BloodPressure               0
    SkinThickness               0
    Insulin                     0
    BMI                         0
    DiabetesPedigreeFunction    0
    Age                         0
    Outcome                     0
    dtype: int64  
     ----------------------------------------------------------------------------------------------------
    INFO - shape of the dataset after handling missing values => (768, 9)
    ERROR - error occured while preparing the data: ('chosen target(s) to predict must exist in the dataset',)
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.8/site-packages/igel/igel.py", line 245, in _process_data
        raise Exception("chosen target(s) to predict must exist in the dataset")
    Exception: chosen target(s) to predict must exist in the dataset
    Traceback (most recent call last):
      File "/opt/conda/bin/igel", line 8, in <module>
        sys.exit(main())
      File "/opt/conda/lib/python3.8/site-packages/igel/cli.py", line 508, in main
        CLI()
      File "/opt/conda/lib/python3.8/site-packages/igel/cli.py", line 166, in __init__
        getattr(self, self.cmd.command)()
      File "/opt/conda/lib/python3.8/site-packages/igel/cli.py", line 297, in fit
        Igel(**self.dict_args)
      File "/opt/conda/lib/python3.8/site-packages/igel/igel.py", line 102, in __init__
        getattr(self, self.command)()
      File "/opt/conda/lib/python3.8/site-packages/igel/igel.py", line 336, in fit
        x_train, y_train, x_test, y_test = self._prepare_fit_data()
    TypeError: cannot unpack non-iterable NoneType object
    

    If I understand right, igel wants to have a column named sick in dataset.csv. So there is a missing link and I have no idea how to close this.

    Can you provide test-data, maybe as part of this repo, to get something to work? Or help me finding the missing part?

    Please help

    opened by clausnizer-ondics 6
  • Bug fix in __main.py__ and addition of important commands


    Bug

    It should be noted that when an optional argument is required to hold multiple values, as in the case of -DP/--data_paths, it should be handled with the help of the nargs feature. For example, on executing the following we hit this issue.

    $ igel experiment -DP 3.csv 3.csv e.csv -yml igel.yaml
    Usage: igel experiment [OPTIONS]
    Try 'igel experiment --help' for help.
    
    Error: Got unexpected extra arguments (3.csv e.csv)
    

    Note 3.csv, e.csv, igel.yaml files are present in the same working directory.

    Hence this method of separating the paths for the required files will not work as mentioned below:

    def experiment(data_paths: str, yaml_path: str) -> None:
        .
        .
    
        train_data_path, eval_data_path, pred_data_path = data_paths.strip().split(" ")
    

    The above bug can be solved by introducing nargs=3 and it will return data_paths as a tuple which is resolved in this PR.

    Addition of important commands

    Since igel is a package and as per the documentation the following missing commands have been added.

      info        get info & metadata about igel
      help        get help about how to use igel
      version     get the version of igel installed on your machine
    

    Shortcut for help option in every command

    This enhances igel and makes it easier to fetch the help of any command. Previously, to fetch the help list we had to specify the complete name --help, but now the user can specify the short flag -h to fetch the help quickly.

    In case of any query or improvement kindly mention, I will be happy to do it.

    invalid 
    opened by kuspia 5
  • does it always use the preprocessing, i.e. "scale:"


    • igel version: pip
    • Python version: 3.8.6
    • Operating System: windows

    Description

    I am just messing around, loaded a CSV with two columns, number (1-12,000,000) and prime (0 or 1) the prime column was factual, a 1 means prime, a zero means composite. I tried NeuralNetwork and CalibratedClassifier, but i think the preprocessing "scale:" messed it up. that is, the predictions.csv output for input csv single column numbers 3000-4000 i'd expect values other than 0.0, but all that is in there is 3000,0.0, 3001,0.0, and so on, every prediction is 0.0.

    It takes a long time to generate the csvs (on my side) and to run the igel -DP suite, so i wanted to clear this up. P.S. I really appreciate all the work, i just kinda don't really understand, i think.

    If i can help or ever can submit a PR or anything i'd be happy to!

    opened by genewitch 5
  • Installation Error importlib_metadata


    • igel version: 1.0.0
    • Python version: 3.8.10
    • Operating System: Ubuntu 20.04

    Description

    Installation fails with the following error: markdown 3.4.1 has requirement importlib-metadata>=4.4; python_version < "3.10", but you'll have importlib-metadata 1.7.0 which is incompatible.

    What I Did

    I tried to install importlib_metadata 4.4 but I get the following error: igel 1.0.0 has requirement importlib_metadata<2.0.0,>=1.6.0; python_version >= "3.8", but you'll have importlib-metadata 4.4.0 which is incompatible.

    Paste the command(s) you ran and the output. If there was a crash, please include the traceback here.
    pip install -U igel
    pip install -U importlib_metadata==4.4

    opened by gironymo 0
  • CatBoost Classification and Regression feature


    • igel version: 1.0.0
    • Python version: 3.10
    • Operating System: MacOS

    Description

    I tried to add the CatBoost Regression and Classification models in addition to the other models imported from the sklearn library. It is my first pull request, so forgive me if I made any mistakes.

    The pull request has been created for the CatBoost algorithm which includes the installation as well.

    opened by 0sparsh2 0
  • Origin/feature/catboost


    Installed CatBoost Library which can be used for both Regression and Classification models. Added them to data.py as options that can be selected later. Included installation in the requirements file.

    opened by 0sparsh2 0
  • CNN Support Added


    Solves issue #77. The model is configured via the YAML file. Have provided a yaml file example. Tested on Colab. Utilising Keras ensured that we could support a variety of layers besides just the standard Conv2D. Have added docstrings as per the contributing guidelines. @nidhaloff Is this adequate? Kindly give me a review.

    opened by Prasanna28Devadiga 2
  • CNN in pytorch


    Hi @nidhaloff, I have added the feature for CNN in pytorch, where the user enters a random model. I also created an igel_cnn.yaml file as an example. I have also tested it on my local machine and it is able to clear all the tests. I have also tested the evaluate function locally.

    opened by GouravWadhwa 4
  • Inviting maintainers/contributors to the project


    Hello everyone,

    First of all, I want to take a moment to thank all contributors and people who supported this project in any way ;) you are awesome!

    If you like the project and have any interest in contributing/maintaining it, you can contact me here or send me a msg privately:

    PS: You need to be familiar with python and machine learning

    help wanted good first issue contribution feature discussion feedback Hacktoberfest 
    opened by nidhaloff 6