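AutoX is an efficient automated machine learning (AutoML) tool aimed mainly at data-mining competitions on tabular data. Its key traits are strong results, ease of use, generality, automation, and flexibility.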

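What is AutoX?

AutoX is an efficient automated machine learning tool aimed mainly at data-mining competitions on tabular data. Its features include:

  • Strong results: on several Kaggle datasets, AutoX performs significantly better than other solutions (see the performance comparison below).
  • Easy to use: AutoX's interface is similar to sklearn's, so it is easy to pick up.
  • General: it works for both classification and regression problems.
  • Automated: data cleaning, feature engineering, model tuning, and the other steps run fully automatically, with no manual intervention.
  • Flexible: the components are decoupled and can be used on their own. Where the automated results fall short, AutoX's flexible interfaces let you bring in expert knowledge.
  • Competition tips: a curated, public summary of what improved scores in past competitions.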

Installation

1. git clone https://github.com/4paradigm/autox.git
2. cd autox
3. python setup.py install

Architecture

├── autox
│   ├── ensemble
│   ├── feature_engineer
│   ├── feature_selection
│   ├── file_io
│   ├── join_tables
│   ├── metrics
│   ├── models
│   ├── process_data
│   ├── util.py
│   ├── CONST.py
│   └── autox.py
├── run_oneclick.py
├── demo
├── test
├── setup.py
└── README.md

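Quick start

  • Fully automatic: for users who want a decent result quickly. With only minimal information about the data, AutoX builds the whole machine learning pipeline.
from autox import AutoX
path = "./data"  # placeholder: directory that contains train.csv and test.csv
# target: label column; id: list of key columns that uniquely identify a row
autox = AutoX(target='loss', train_name='train.csv', test_name='test.csv',
              id=['id'], path=path)
sub = autox.get_submit()  # runs the full pipeline and returns the predictions
sub.to_csv("submission.csv", index=False)
  • Semi-automatic: run_demo.ipynb
For users who want better predictions. AutoX provides a rich, easy-to-use set of interfaces, so you can configure the pipeline for your actual data and obtain better results.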

Performance comparison:

| index | data_type | data_name | AutoX | AutoGluon | H2o |
|---|---|---|---|---|---|
| 1 | regression | zhidemai | 1.1231 | 1.9466 | 1.1927 |
| 2 | regression | Tabular Playground Series - Aug 2021 | 7.87731 | 10.3944 | 7.8895 |
| 3 | binary classification | Titanic | x | 0.78229 | 0.79186 |

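Data types

  • cat: Categorical, an unordered categorical variable
  • ord: Ordinal, an ordered categorical variable
  • num: Numeric, a continuous variable
  • datetime: a time variable of Datetime type
  • timestamp: a time variable of Timestamp type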
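
The README does not spell out how AutoX assigns these type labels. As a rough illustration only, the sketch below maps pandas dtypes onto three of the labels above. The function name `infer_column_types` and the dtype-based rule are assumptions made for this example, not AutoX's actual logic; `ord` and `timestamp` are left out because they cannot be told apart from `cat` or `num` by dtype alone.

```python
import pandas as pd


def infer_column_types(df: pd.DataFrame) -> dict:
    """Map each column to 'datetime', 'num' or 'cat' (hypothetical heuristic)."""
    types = {}
    for col in df.columns:
        if pd.api.types.is_datetime64_any_dtype(df[col]):
            types[col] = "datetime"
        elif pd.api.types.is_numeric_dtype(df[col]):
            types[col] = "num"
        else:
            # Everything else (strings, objects) is treated as categorical.
            types[col] = "cat"
    return types


example = pd.DataFrame({
    "price": [1.5, 2.0],
    "city": ["a", "b"],
    "created": pd.to_datetime(["2021-01-01", "2021-01-02"]),
})
print(infer_column_types(example))
```

In practice a heuristic like this usually needs manual overrides, for instance for integer-coded categories that the dtype check would label as `num`.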

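Pipeline logic

  • 1. Initialize the AutoX class
1.1 Read the data
1.2 Concatenate train and test
1.3 Identify the type of each column in the tables
1.4 Preprocess the data
  • 2. Feature engineering
Feature engineering covers single-table features and multi-table features.
Every feature engineering class provides the following functions:
    1) automatically select the features that the current operation applies to;
    2) inspect the selected features;
    3) modify the features that the current operation applies to;
    4) compute the feature values and return features with the same number of rows, in the same order, as the main table.
  • 3. Feature merging
Merge the constructed features: the row count stays the same and the column count grows, giving one wide table.
  • 4. Train/test split
Split the wide table into the training set and the test set.
  • 5. Feature filtering
Filter the constructed features by comparing how each feature column is distributed in train and in test, to avoid overfitting.
  • 6. Model training
Train the model on the filtered wide-table features.
The model classes provide the following functions:
    1) view the model's default parameters;
    2) train the model;
    3) tune the hyperparameters;
    4) view the model's feature importances;
    5) predict with the model.
  • 7. Model prediction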
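
Step 5 does not say how the train and test distributions are compared. As one possible reading, the sketch below drops a feature when too many test values fall outside what was seen in train (outside the train min/max range for numeric columns, or unseen categories otherwise). The function `filter_features` and the `max_unseen_ratio` threshold are assumptions made for illustration, not AutoX's actual filter.

```python
import pandas as pd


def filter_features(train: pd.DataFrame, test: pd.DataFrame,
                    features: list, max_unseen_ratio: float = 0.5) -> list:
    """Keep features whose test values are mostly covered by the train values."""
    kept = []
    for col in features:
        if pd.api.types.is_numeric_dtype(train[col]):
            # Numeric: a test value is "unseen" if it lies outside the train range.
            unseen = ~test[col].between(train[col].min(), train[col].max())
        else:
            # Categorical: a test value is "unseen" if it never occurs in train.
            unseen = ~test[col].isin(train[col].unique())
        if unseen.mean() <= max_unseen_ratio:
            kept.append(col)
    return kept


train = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 11, 12, 13]})
test = pd.DataFrame({"a": [2, 3], "b": [100, 200]})
print(filter_features(train, test, ["a", "b"]))  # "b" is dropped: all test values are out of range
```

A model trained on a feature like `b` would be extrapolating on every test row, which is the kind of train/test mismatch that filtering is meant to catch.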

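The AutoX class

The AutoX class manages the datasets and their metadata for the user.
Initializing the AutoX class performs the following steps:
1. read the data;
2. concatenate train and test;
3. identify the type of each column in the tables;
4. preprocess the data.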

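Attributes

info_: the info_ attribute stores information about the dataset.

  • info_['id']: List, the key columns that uniquely identify a row of the table
  • info_['target']: String, the label column of the table
  • info_['shape_of_train']: Int, the number of samples in the train dataset
  • info_['shape_of_test']: Int, the number of samples in the test dataset
  • info_['feature_type']: Dict of Dict, the data type of each feature column in the tables
  • info_['train_name']: String, the name of the main training table
  • info_['test_name']: String, the name of the main test table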
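
To make the shape of `info_` concrete, here is a hand-written example of what such a dictionary might look like for the quick-start dataset. Every value is made up for illustration, and the nesting of `feature_type` (table name, then column name, then type label) is an assumption based only on the "Dict of Dict" description above.

```python
# Hypothetical contents of info_; all values are invented for illustration.
info_ = {
    "id": ["id"],
    "target": "loss",
    "shape_of_train": 1000,  # made-up row counts
    "shape_of_test": 400,
    "feature_type": {        # assumed layout: table name -> column name -> type label
        "train.csv": {"id": "num", "f0": "num", "city": "cat", "loss": "num"},
        "test.csv": {"id": "num", "f0": "num", "city": "cat"},
    },
    "train_name": "train.csv",
    "test_name": "test.csv",
}

# Example lookup: the categorical columns of the main training table.
train_types = info_["feature_type"][info_["train_name"]]
cat_cols = [col for col, kind in train_types.items() if kind == "cat"]
print(cat_cols)
```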

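dfs_: the dfs_ attribute stores all DataFrames, both the original tables and the constructed ones.

  • dfs_['train_test']: the train and test data after concatenation
  • dfs_['FE_feature_name']: data constructed by feature engineering, for example FE_count and FE_groupby
  • dfs_['FE_all']: the dataset that merges the original features with all engineered features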

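Methods

  • concat_train_test: concatenates the training set and the test set; usually run right after reading the data
  • split_train_test: separates the training set from the test set; usually run after feature engineering is finished
  • get_submit: returns the predictions (internally it runs the complete machine learning pipeline, including data preprocessing, feature engineering, model training, model tuning, model ensembling, and model prediction)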
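
The concatenate-then-split pattern behind the first two methods can be sketched with plain pandas. The standalone functions below borrow the method names for readability, but they are not AutoX's implementation; splitting by the stored training row count is an assumption that fits the `shape_of_train` attribute described above.

```python
import pandas as pd


def concat_train_test(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    """Stack train on top of test so features are computed on both at once."""
    return pd.concat([train, test], axis=0, ignore_index=True)


def split_train_test(train_test: pd.DataFrame, n_train: int):
    """Undo the concatenation: the first n_train rows are the training set."""
    train = train_test.iloc[:n_train].reset_index(drop=True)
    test = train_test.iloc[n_train:].reset_index(drop=True)
    return train, test


train = pd.DataFrame({"id": [1, 2, 3], "loss": [0.1, 0.2, 0.3]})
test = pd.DataFrame({"id": [4, 5]})  # no label column in test

both = concat_train_test(train, test)  # test rows get NaN in "loss"
train_again, test_again = split_train_test(both, n_train=len(train))
```

Computing features on the combined table keeps encodings and counts consistent between train and test, which is why the split is deferred until feature engineering is done.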

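Details of the operations in the AutoX pipeline:

Reading data

AutoX reads all files under the given path. By default, the main training table and the main test table are concatenated first,
then data preprocessing, feature engineering, and the other steps run on the combined table, and the training and test sets are split apart again before the modeling stage (step 4 of the pipeline).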

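Data preprocessing

- Parse year, month, day, hour, day of week, and similar information from time columns
- Before each training run, drop invalid features (nunique equal to 1) from the data fed to the model
- Remove abnormal samples: drop samples whose label is NaN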
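
The three preprocessing rules above could look roughly like this in pandas. The helper names and the `<column>_<part>` naming of the derived columns are assumptions for this sketch, not AutoX's own code.

```python
import pandas as pd


def expand_datetime(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Add year / month / day / hour / day-of-week columns parsed from a time column."""
    out = df.copy()
    ts = pd.to_datetime(out[col])
    out[f"{col}_year"] = ts.dt.year
    out[f"{col}_month"] = ts.dt.month
    out[f"{col}_day"] = ts.dt.day
    out[f"{col}_hour"] = ts.dt.hour
    out[f"{col}_dayofweek"] = ts.dt.dayofweek  # Monday=0 ... Sunday=6
    return out


def drop_constant_features(df: pd.DataFrame, features: list) -> list:
    """Drop features with exactly one distinct value (nunique == 1)."""
    return [c for c in features if df[c].nunique() != 1]


def drop_missing_label(train: pd.DataFrame, target: str) -> pd.DataFrame:
    """Remove training rows whose label is NaN."""
    return train[train[target].notna()].reset_index(drop=True)
```

Note that the label filter should only be applied to training rows: after train and test are concatenated, every test row has a NaN label by construction.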

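Feature engineering

  • 1-1 join features
  • 1-M join features
- time diff features
- aggregate statistics features
  • count features
For each feature column being processed, the number of samples in the whole dataset that share the current sample's value is used as the feature.
  • target encoding features

  • statistical features

Statistical features are extracted with two nested for loops.
The outer loop iterates over all selected grouping features (group_col),
and the inner loop iterates over all selected aggregation features (agg_col).
Inside the inner loop,
a categorical feature yields the statistic nunique,
and a numeric feature yields the statistics [median, std, sum, max, min, mean].
  • shift features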
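
The count feature and the two-loop statistical features described in the list above can be sketched with pandas as follows. This is a minimal illustration under assumptions: the function names, the `cat_cols` argument, and the output column names are invented here and may differ from AutoX's feature classes.

```python
import pandas as pd


def count_feature(df: pd.DataFrame, col: str) -> pd.Series:
    """For each row, count how many rows in the whole dataset share its value in col."""
    return df[col].map(df[col].value_counts()).rename(f"{col}_count")


def groupby_stats(df: pd.DataFrame, group_cols: list, agg_cols: list,
                  cat_cols: list) -> pd.DataFrame:
    """Two nested loops over grouping and aggregation columns."""
    feats = pd.DataFrame(index=df.index)
    for group_col in group_cols:          # outer loop: grouping features
        for agg_col in agg_cols:          # inner loop: aggregation features
            if agg_col == group_col:
                continue
            if agg_col in cat_cols:
                stats = ["nunique"]       # categorical: number of distinct values
            else:
                stats = ["median", "std", "sum", "max", "min", "mean"]
            for stat in stats:
                # transform() keeps one value per original row, in the original order
                feats[f"{group_col}__{agg_col}__{stat}"] = (
                    df.groupby(group_col)[agg_col].transform(stat)
                )
    return feats


df = pd.DataFrame({
    "user": ["u1", "u1", "u2"],
    "item": ["a", "b", "a"],
    "price": [1.0, 3.0, 5.0],
})
user_count = count_feature(df, "user")
stats = groupby_stats(df, ["user"], ["item", "price"], cat_cols=["item"])
```

Using `transform` rather than `agg` is what makes the result line up row for row with the main table, matching the requirement that every feature class returns features with the same row count and order.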

Model training

AutoX currently supports the following models; by default it trains a LightGBM model:
1. LightGBM;
2. AutoX deep neural network.

Model ensembling

AutoX supports the following two ensembling methods; by default no ensembling is performed.
1. Stacking;
2. Bagging.
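
For readers unfamiliar with bagging, here is a generic, self-contained sketch of the idea: train several copies of a model on bootstrap resamples and average their predictions. It uses an ordinary least squares fit as a stand-in base learner so that it runs with NumPy alone. Whether AutoX's Bagging option works exactly this way is not stated in this README, so treat it as a textbook illustration rather than AutoX's implementation.

```python
import numpy as np


def bagging_predict(X_train, y_train, X_test, n_models=10, seed=0):
    """Average the predictions of n_models linear fits on bootstrap samples."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    test_design = np.column_stack([X_test, np.ones(len(X_test))])
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)  # bootstrap: sample rows with replacement
        design = np.column_stack([X_train[idx], np.ones(n)])  # add intercept column
        coef, *_ = np.linalg.lstsq(design, y_train[idx], rcond=None)
        preds.append(test_design @ coef)
    return np.mean(preds, axis=0)         # aggregate by simple averaging


X = np.arange(20, dtype=float).reshape(-1, 1)
y = 2.0 * X[:, 0] + 1.0
pred = bagging_predict(X, y, np.array([[30.0]]))
```

With a tree model such as LightGBM as the base learner, the same loop would fit one booster per resample (or per random seed) and average the outputs.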

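Competition tips:

kaggle criteo: bucket feature columns that have a very large nunique. For example, for features with nunique above 10000, hash the values, truncate the hash to 4 digits, and then apply label_encode.
zhidemai: article_id implicitly carries time information, so add a rank feature on article_id. For example, groupby(['date'])['article_id'].rank().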
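
Both tips can be sketched in a few lines of pandas. The choice of MD5 and of the first 4 hexadecimal characters is an assumption, since the tip names neither the hash function nor which digits are kept; `pd.factorize` stands in for label encoding. The function name `hash_bucket_encode` is hypothetical.

```python
import hashlib

import pandas as pd


def hash_bucket_encode(s: pd.Series, nunique_threshold: int = 10000,
                       keep_chars: int = 4) -> pd.Series:
    """Bucket a high-cardinality column: hash, truncate, then label-encode."""
    if s.nunique() <= nunique_threshold:
        return s  # low cardinality: leave the column unchanged
    hashed = s.astype(str).map(
        lambda v: hashlib.md5(v.encode("utf-8")).hexdigest()[:keep_chars]
    )
    codes, _ = pd.factorize(hashed)  # integer label per distinct bucket
    return pd.Series(codes, index=s.index, name=s.name)


# Rank of article_id within each date, as in the zhidemai tip.
df = pd.DataFrame({"date": ["d1", "d1", "d2"], "article_id": [20, 10, 5]})
df["article_id_rank"] = df.groupby(["date"])["article_id"].rank()
```

Keeping 4 hexadecimal characters caps the number of buckets at 16**4 = 65536, so distinct raw values may share a bucket; that collision is the intended trade-off for a smaller, denser feature.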

Troubleshooting

Error message | Solution
Comments
  • AutoX_Recommend, dataset processing: kdd cup 2020

    Raw data: https://tianchi.aliyun.com/competition/entrance/231785/introduction
    Reference for the processing method: https://github.com/4paradigm/AutoX/blob/master/autox/autox_recommend/data_process/MovieLens_data_process.ipynb and https://github.com/4paradigm/AutoX/blob/master/autox/autox_recommend/data_process/Netflix-data-process.ipynb

    call-for-contributions AutoX_Recommend 
    opened by poteman 1
  • AutoX_Recommend, dataset processing: Amazon product data

    Raw data: http://jmcauley.ucsd.edu/data/amazon/
    Reference for the processing method: https://github.com/4paradigm/AutoX/blob/master/autox/autox_recommend/data_process/MovieLens_data_process.ipynb and https://github.com/4paradigm/AutoX/blob/master/autox/autox_recommend/data_process/Netflix-data-process.ipynb

    call-for-contributions AutoX_Recommend 
    opened by poteman 1
  • AutoX_Recommend, dataset processing: Amazon electronic product recommendation

    Raw data: https://www.kaggle.com/datasets/prokaggler/amazon-electronic-product-recommendation
    Reference for the processing method: https://github.com/4paradigm/AutoX/blob/master/autox/autox_recommend/data_process/MovieLens_data_process.ipynb and https://github.com/4paradigm/AutoX/blob/master/autox/autox_recommend/data_process/Netflix-data-process.ipynb

    call-for-contributions AutoX_Recommend 
    opened by poteman 1
  • ModuleNotFoundError: No module named 'autox.autox_server'

    Steps to reproduce:

    git clone https://github.com/4paradigm/AutoX.git
    pip install pytorch_tabnet
    pip install ./AutoX
    python
    from autox import AutoX

    ModuleNotFoundError: No module named 'autox.autox_server'

    opened by utopianet 1
  • lightgbm.train bug (lightgbm==3.3.2.99)

    On Mac with lightgbm==3.3.2.99, lightgbm.train no longer accepts the verbose_eval and early_stopping_rounds arguments; they have been replaced by the callbacks interface, so calling the lgb model raises an error.

    File ~/miniforge3/envs/lx/lib/python3.9/site-packages/autox/autox_competition/models/regressor_ts.py:231, in LgbRegressionTs.fit(self, train, test, used_features, target, time_col, ts_unit, Early_Stopping_Rounds, N_round, Verbose, log1p, custom_metric, weight_for_mae)
        226     model = lgb.train(self.params_, trn_data, num_boost_round=self.N_round, valid_sets=[trn_data, val_data],
        227                       verbose_eval=self.Verbose,
        228                       early_stopping_rounds=self.Early_Stopping_Rounds,
        229                       feval=weighted_mae_lgb(weight=weight_for_mae))
        230 else:
    --> 231     model = lgb.train(self.params_, trn_data, num_boost_round=self.N_round, valid_sets=[trn_data, val_data],
    ...
        233                     early_stopping_rounds=self.Early_Stopping_Rounds)
        234 val = model.predict(train.iloc[valid_idx][used_features])
        235 if log1p:
    
    TypeError: train() got an unexpected keyword argument 'verbose_eval'
    
    opened by LXlearning 0
  • AutoX_NLP/ nlp_feature.py, glove environment adaptation

    opened by DHengW 0
  • AutoX_NLP/ nlp_feature.py, OOV handling optimization

    opened by DHengW 0
  • AutoX_NLP/ nlp_feature.py, fasttext processing efficiency optimization

    opened by DHengW 0
Releases: v5.2.0
Owner: 4Paradigm (4Paradigm Open Source Community)