当前位置：网站首页>Introduction and use of automatic machine learning framework (flaml, H2O)

Automatic machine learning （Automl）, It is a way to turn traditional machine learning into Automation ,start—end Fully automatic . At present, the popular automatic machine learning frameworks in the market include ：Flaml、H20 wait . This chapter records how these two frameworks are used .

Two 、 Data is introduced

The bank transaction flow created by the data refers to . share 23 Column data , Among them is 18 Column characteristic data ,1 Column label data ,4 Column user information data .

3、 ... and 、flaml frame

3.1、flaml brief introduction

Flaml It is an automatic machine learning framework launched by Microsoft , Support custom learners and parameters , It also provides a fast automatic adjustment tool .flaml You can find the accuracy with low computational resources in the customized learner ML Model . It frees users from choosing learners and super parameters . It is very convenient to use .

3.2、 Use flaml

3.2.1、 download flaml library

pip install flaml

3.2.2、 Import related libraries

from flaml import AutoML
from sklearn.datasets import load_iris
from sklearn.datasets import load_boston
import pandas as pd
import sys,logging
from sklearn.metrics import confusion_matrix,classification_report,recall_score,accuracy_score,f1_score,precision_score
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import RandomOverSampler
import time

3.2.3、 Data processing

#  Data path 
data_path = r"source/data_jianhang.csv"
input_data_all = pd.read_csv(data_path,encoding="gbk",index_col=0)
#  Extract forecast customer information   【 Customer name 、 Customer number 、 Customer account number 】
customer_info = input_data_all.iloc[:,:3]
print(customer_info)

#  features 
input_data_target = input_data_all[" label "]
input_data_feature = input_data_all.iloc[:, 3:-1]
#  selection input_data_all All of the line , From the third column to the last column （ The head is not the tail ）
input_data = input_data_all.iloc[:, 3:]
#  Fill the blank value with 0
input_data.fillna(0, inplace=True)
#  Output the first five lines to view 
print(input_data.head())


#  Random oversampling for the imbalance of positive and negative samples 
f = RandomOverSampler(random_state=0)
data, target = f.fit_resample(input_data.iloc[:,:-1], input_data.iloc[:,-1])
#  Data maximum and minimum normalization 
data = MinMaxScaler().fit_transform(data)
#  Output sample quantity 
print(target.to_frame().value_counts())

#  Segmentation data   X： Feature set   y： label 
X,y = input_data.iloc[:,:-1],input_data.iloc[:,-1]
X = MinMaxScaler().fit_transform(X)

3.2.4、 call flaml

Looking at the log printed by the program, we can find , Through parameters estimator_list, Calling flaml In the process of ,flaml Automatically compare for us lgbm、rf、xgboost The effect of classifier , Finally, print and use the optimal classifier and parameters to train the model . The whole process is automated , There is no need for artificial operation comparison . This is also a feature of automated machine learning .

t1 = time.time()
#  initialization flaml Automated modeling framework 
flaml_automl = AutoML()
#  Pass in training data x and y Conduct fit Training 
flaml_automl.fit(data,target,task='classification',log_file_name="xxx.log",metric="f1",estimator_list = ['lgbm', 'rf', 'xgboost'])
# fit Introduction to common parameters 
'''
    #  X_train=None,    Training data feature set 
    #  y_train=None,    Training data tag set 
    #  estimator_list = ['lgbm', 'rf', 'xgboost', 'extra_tree', 'xgb_limitdepth', 'lrl1']
    #  metric: 'accuracy', 'roc_auc', 'roc_auc_ovr', 'roc_auc_ovo','f1', 'micro_f1', 'macro_f1', 'log_loss', 'mae', 'mse', 'r2','mape'.
    #  n_jobs: Pass in an integer , Enable multithreading 
    #  n_splits： Pass in an integer , Number of folds cross verified 
    #  log_file_name: Log output , If you don't want to output logs , Pass in an empty string  ’’  that will do 
    #  estimator_list： Model list , Optional 【‘lgbm’,’xgboot’,’xgb_limitdepth’,’catboost’,’rf’,’extra_tree’】, It will eventually output best Model .
    #  time_budget： The time limit , In seconds . If restricted 10s, Then the optimal model is output in ten seconds . There is no time limit for incoming -1
    #  sample： Boolean value , Default False. Whether to sample the incoming data .
    #  early_stop: Boolean value , Default False. If the model search converges , Stop ahead of time .
'''


# flaml Print the optimal model and parameters 
'''
[flaml.automl: 03-09 14:52:24] {2694} INFO - retrain lgbm for 1.3s
[flaml.automl: 03-09 14:52:24] {2699} INFO - retrained model: LGBMClassifier(colsample_bytree=0.5716563773446997,
               global_max_steps=9223372036854775807,
               learning_rate=0.7886932330930241, max_bin=511,
               min_child_samples=7, n_estimators=181, num_leaves=1006,
               reg_alpha=0.007095760722363662, reg_lambda=0.3005614400342159,
               verbose=-1)
[flaml.automl: 03-09 14:52:24] {2077} INFO - fit succeeded
[flaml.automl: 03-09 14:52:24] {2079} INFO - Time taken to find the best model: 23.60042953491211
'''


#  Print some results 
print(" Elapsed time ： ",time.time()-t1)
print(flaml_automl.estimator_list)
print(" The optimal model ",flaml_automl.model)
print(" Optimal parameters ",flaml_automl.best_config)
print(" Training time ",flaml_automl.best_config_train_time)
print(" classifier ",flaml_automl.best_estimator)
print(" Loss ",flaml_automl.best_loss)

#  call predict forecast X
y_pred = flaml_automl.predict(X)
#  Output forecast results 
print(y_pred)

#  Print indicators 、 Classification effect 
print(" Confusion matrix :\n",confusion_matrix(y,y_pred))
print(" Classified reports :\n",classification_report(y,y_pred))
print(" Recall rate :",recall_score(y,y_pred))
print(" Accuracy rate :",accuracy_score(y,y_pred))
print("f1 The score is :",f1_score(y,y_pred))
print(" Accuracy :",precision_score(y,y_pred))
print(" Total time consumed ： ",time.time()-t1)

Four 、h2o frame

4.1、h2o brief introduction

h2o The framework is an open source , Distributed based on java The framework of machine learning .h2o By （h2o.AI） The company developed and released . The website of their company ：H2O.ai | AI Cloud Platform.h20 It also supports visual analysis of user tasks .

4.2、h2o Use

4.2.1、 download h2o

pip install h20

4.2.2、 Import related libraries

import h2o
from h2o.automl import H2OAutoML
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator
from h2o.grid.grid_search import H2OGridSearch
import numpy as np
import pandas as pd
import time
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import confusion_matrix,classification_report,recall_score,accuracy_score,f1_score,precision_score
from imblearn.over_sampling import RandomOverSampler

4.2.3、 Data processing

Same as 3.2.3 The chapters are exactly the same , I'm not going to repeat it here , Direct copy 3.2.3 Chapter data processing code .

4.2.4、 start-up h2o Of jar package

1） Download free ： Automatic machine learning h2o start-up jar package - Machine learning document class

2） open cmd, Input java -jar h2o.jar start-up

3） visit web page ： http://localhost:54321

4.2.5、 call h2o



##``"DRF"``,``"GLM"``,``"XGBoost"``,``"GBM"``,``"DeepLearning"``,``"StackedEnsemble"``.
#  initialization 
automl_estimator = H2OAutoML(max_runtime_secs=50,balance_classes=True,exclude_algos=["DeepLearning"],stopping_metric="auc",sort_metric="auc")
#  Training 
automl_estimator.train(x=train_data_h2o.names[0:-1],y="target",training_frame=train_data_h2o)
print(" Time 2：",time.time()-t1)
# predict To make predictions , And output the prediction results 
h2o_result = automl_estimator.predict(test_data_h2o[:-1])[:,0]
print(h2o_result)

# Print indicators 
print(" Confusion matrix :\n",confusion_matrix(test_data_h2o[:,-1].as_data_frame(),h2o_result.as_data_frame()))
print(" Classified reports :\n",classification_report(test_data_h2o[:,-1].as_data_frame(),h2o_result.as_data_frame()))
print(" Recall rate :",recall_score(test_data_h2o[:,-1].as_data_frame(),h2o_result.as_data_frame()))
print(" Accuracy rate :",accuracy_score(test_data_h2o[:,-1].as_data_frame(),h2o_result.as_data_frame()))
print("f1 The score is :",f1_score(test_data_h2o[:,-1].as_data_frame(),h2o_result.as_data_frame()))
print(" Accuracy :",precision_score(test_data_h2o[:,-1].as_data_frame(),h2o_result.as_data_frame()))
print(" Time 3：",time.time()-t1)

5、 ... and 、 summary

Because of h2o Less understanding , And so on h2o The code of the framework is not explained too much . Personally, I prefer to use flaml Do automated machine learning . As for the effect, it's still very good , No screenshots are posted here . Those who are interested can practice offline .

原网站

版权声明
本文为[zkkkkkkkkkkkkk]所创，转载请带上原文链接，感谢
https://yzsam.com/2022/187/202207060912332585.html