当前位置：网站首页>Deep feature synthesis and genetic feature generation, comparison of two automatic feature generation strategies

Deep feature synthesis and genetic feature generation, comparison of two automatic feature generation strategies

2022-06-12 19:37:00 【deephub】

Feature engineering is the process of creating new features from existing features , Through feature engineering, we can capture the additional relationship with the target column that the original feature does not have . This process is very important to improve the performance of machine learning algorithm . Although when data scientists apply specific domain knowledge to specific transformations , Feature engineering works best , But there are some methods that can be done in an automated way , Without prior domain knowledge .

In this paper , We will show you how to use... Through an example ATOM Package to quickly compare two automatic feature generation algorithms ： Depth feature synthesis (Deep feature Synthesis, DFS) And genetic characteristics (Genetic feature generation, GFG).ATOM It's open source Python package , It can help data scientists speed up the exploration of machine learning pipeline .

Baseline model

For comparison , As a baseline for comparison, only the initial features are used to train the model . The data used here is from Kaggle A variant of the Australian weather dataset . The goal of this data set is to predict whether it will rain tomorrow , In target column RainTomorrow Train a binary classifier on .

import pandas as pd
from atom import ATOMClassifier

# Load the data and have a look
X = pd.read_csv("./datasets/weatherAUS.csv")
X.head()

Initialize the instance and prepare the modeling data . Only a subset of the dataset is used here （1000 That's ok ） demonstrate . The following code estimates the missing values and encodes the classification features .

atom = ATOMClassifier(X, y="RainTomorrow", n_rows=1e3, verbose=2)
atom.impute()
atom.encode()

The output is as follows .

have access to dataset Property to quickly check what the data looks like after conversion .

atom.dataset.head()

The data is now ready . This article will use LightGBM The model predicts . Use atom The training and evaluation model is very simple ：

atom.run(models="LGB", metric="accuracy")

You can see that the test set has reached 0.8471 The accuracy of . Let's see if automatic feature generation can improve this .

DFS

DFS Put the standard mathematical operator （ Add 、 Subtraction 、 Multiplication, etc ） Apply to existing features , And combine these features . for example , On our dataset ,DFS You can create new features MinTemp + MaxTemp or WindDir9am x WindDir3pm.

In order to be able to compare models , Need to be for DFS The pipe creates a new branch . If you're not familiar with it ATOM Branch system of , Please check the official documents .

atom.branch = "dfs"

Use atom Of feature_generation Method runs on the new branch DFS. For the sake of , Here, only addition and multiplication are used to create new features （ Use div、log or sqrt The operator may return with inf or nan Characteristics of value , So it needs to be processed again ）.

atom.feature_generation(
    strategy="dfs",
    n_features=10,
    operators=["add", "mul"],
)

ATOM It's using featuretools Package to run DFS Of . It's used here n_features=10, Therefore, ten features randomly selected from all possible combinations are added to the data set .

atom.dataset.head()

Train the model again ：

atom.run(models="LGB_dfs")

It should be noted that

Add a label after the acronym of the model _dfs To not cover the baseline model .
It is no longer necessary to specify indicators for validation .atom The instance will automatically use the same metrics as any previous model training . In our case for accuracy.

look DFS Did not improve the model . The result is even worse . Let's see GFG How do you behave .

GFG

GFG Using genetic programming （ A branch of evolutionary programming ） To determine which features are valid and create new features based on these features . And DFS The blind attempt feature combination is different ,GFG Try to improve its characteristics in each generation of algorithm .GFG Use with DFS The same operator , But not just apply the transformation once , But to further develop them , Create a nested structure of feature combinations . Using operators add (+) and mul (x), The combination of features may be ：

add(add(mul(MinTemp, WindDir3pm), Pressure3pm), mul(MaxTemp, MinTemp))

When used with DFS equally , First create a new branch （ From primitive master Branch will DFS exclude ）, Then train and evaluate the model . Again , So here we create 10 A new feature .

Be careful ：ATOM Use... At the bottom gplearn Package to run GFG.

atom.branch = "gfg_from_master"
atom.feature_generation(
    strategy="GFG",
    n_features=10,
    operators=["add", "mul"],
)

Can pass

generic_features

Property to access the newly generated feature 、 Their name and fitness （ Score obtained during genetic algorithm ） Overview .

atom.genetic_features

What needs to be noted here is , Because the description of features may become very long （ Look at the picture above ）, Therefore, the new feature will be numbered and named, for example feature n, among n Represents the... In the dataset n Features .

atom.dataset.head()

Run the model again ：

atom.run(models="LGB_gfg")

This time I got 0.8824 The accuracy of , Than the baseline model 0.8471 Much better ！

Result analysis

All three models have been trained and can analyze the results . Use results Attribute to view the scores of all models on the training set and test set .

atom.results

Use atom Of plot The method can further compare the characteristics and performance of the model .

atom.plot_roc()

Use atom You can draw multiple adjacent graphs , See which features contribute most to the prediction of the model

with atom.canvas(1, 3, figsize=(20, 8)):
    atom.lgb.plot_feature_importance(show=10, title="LGB")
    atom.lgb_dfs.plot_feature_importance(show=10, title="LGB + DFS")
    atom.lgb_gfg.plot_feature_importance(show=10, title="LGB + GFG")

For two non baseline models , The generated features seem to be the most important features , This indicates that this is relevant to the new target column , And they make a great contribution to the prediction of the model .

Use decision diagrams , You can also view the impact of features on individual rows in the dataset .

atom.lgb_dfs.decision_plot(index=0, show=15)

summary

This paper compares the performance of new features generated by two automatic feature generation techniques for model prediction . The results show that using these techniques can significantly improve the performance of the model . In this article, I used ATOM The package simplifies the process of processing training and modeling , of ATOM For more information , Please check the documentation of the package .

https://www.overfit.cn/post/389b1d229dae4d0fb7b584cb37a350de

author ：Marco vd Boom

原网站

版权声明
本文为[deephub]所创，转载请带上原文链接，感谢
https://yzsam.com/2022/02/202202281545333363.html