Prediction and regression of stacking integrated model
2022-06-30 21:22:00 【Black willow smoke】
Preface
Many articles already give detailed introductions to the principles of the various ensemble models, so this article will not repeat the theory of stacking. Instead, it works through a concrete case: using a stacking ensemble to predict a regression problem.
The code is adapted from an article on predicting classification problems with stacking ensemble learning, adjusted here to solve a regression problem.
Code and analysis
Imports
KFold cross-validation is used.
The stacking base layer contains four models (GBDT, ET, RF, ADA).
The meta-model is LinearRegression.
The evaluation metric for the regression model is r2_score.
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor as GBDT
from sklearn.ensemble import ExtraTreesRegressor as ET
from sklearn.ensemble import RandomForestRegressor as RF
from sklearn.ensemble import AdaBoostRegressor as ADA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import numpy as np
import pandas as pd
On why KFold is used, the original article mentions:
KFold is suitable for splitting regression data.
StratifiedKFold is suitable for splitting classification data.
The experiment confirmed this: StratifiedKFold.split(X_train, y_train) fails because y_train is continuous rather than categorical, so only KFold can be used.
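The difference can be reproduced with a minimal standalone sketch (synthetic data, not the article's 500.csv): StratifiedKFold rejects a continuous target, while KFold splits on row indices only.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)  # continuous regression target

# KFold ignores y entirely, so it works for regression
kf = KFold(n_splits=5)
print(sum(1 for _ in kf.split(X)))  # 5

# StratifiedKFold needs class labels and raises on a continuous y
skf = StratifiedKFold(n_splits=5)
try:
    list(skf.split(X, y))
except ValueError as e:
    print("StratifiedKFold failed:", e)
```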
Data loading
Read the file and split the dataset with train_test_split. The split result is of type DataFrame; since arrays are more convenient for the later steps, the data is converted after splitting. The DataFrame column names are also recorded, as they will be used later.
df = pd.read_csv("500.csv")
X = df.iloc[:, :6]
y = df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_cols = X_train.columns
X_train = X_train.values
y_train = y_train.values
X_test = X_test.values
The raw data (500.csv) contains 500 samples; the first six columns are features and the last two are outputs. For simplicity, this case only studies how to predict the last output column from the first six features. After the split, X_train has shape (400, 6).
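For readers without the 500.csv file, a comparable dataset can be generated synthetically; this stand-in (shapes and column layout only) is an assumption for reproduction, not the author's data.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for 500.csv: 500 samples, 6 features, 1 target column
X_syn, y_syn = make_regression(n_samples=500, n_features=6, noise=10.0, random_state=42)
df = pd.DataFrame(X_syn, columns=[f"f{i}" for i in range(6)])
df["target"] = y_syn

X = df.iloc[:, :6]
y = df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape)  # (400, 6)
```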
The first layer model
models = [GBDT(n_estimators=100),
RF(n_estimators=100),
ET(n_estimators=100),
ADA(n_estimators=100)]
X_train_stack = np.zeros((X_train.shape[0], len(models)))
X_test_stack = np.zeros((X_test.shape[0], len(models)))
Two new ndarrays are created here, with sizes (400, 4) and (100, 4).
To understand them in terms of stacking:
For each model, cross-validation is run on X_train; when it finishes, out-of-fold predictions of size (400, 1) are obtained. Across the 4 models this gives (400, 4), which becomes the training input of the second-layer model.
During each fold, the model also predicts X_test; averaging over the folds gives predictions of size (100, 1) per model, and (100, 4) across the 4 models, which becomes the test input of the second-layer model.
First-layer training: producing the data required by the second layer
# 10-fold stacking
n_folds = 10
kf = KFold(n_splits=n_folds)
for i, model in enumerate(models):
    X_stack_test_n = np.zeros((X_test.shape[0], n_folds))
    for j, (train_index, test_index) in enumerate(kf.split(X_train)):
        tr_x = X_train[train_index]
        tr_y = y_train[train_index]
        model.fit(tr_x, tr_y)
        # out-of-fold predictions form the stacking training set
        X_train_stack[test_index, i] = model.predict(X_train[test_index])
        X_stack_test_n[:, j] = model.predict(X_test)
    # average the per-fold test predictions to form the stacking test set
    X_test_stack[:, i] = X_stack_test_n.mean(axis=1)
First, the outer loop over i iterates over the 4 models.
X_stack_test_n is defined to store, for each of the 10 cross-validation folds, the prediction on X_test (100, 6), a (100, 1) result per fold; after 10 folds its size is (100, 10). The last line of code averages these 10 columns to obtain a (100, 1) result. Across the 4 models this yields (100, 4), the test input of the second-layer model.
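The axis=1 averaging described above can be checked in isolation: averaging a (100, 10) matrix across its columns yields one value per test sample.

```python
import numpy as np

# 10 fold-predictions per sample, like X_stack_test_n in the loop above
preds = np.arange(1000, dtype=float).reshape(100, 10)
avg = preds.mean(axis=1)
print(avg.shape)  # (100,)
print(avg[0])     # row 0 is 0..9, so its mean is 4.5
```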
Then look at the inner loop over j, which performs the 10-fold cross-validation.
train_index and test_index record the training and test row indices of each fold. Without shuffle=True, the 400 rows (0-399) are split sequentially into blocks of 40 rather than sampled randomly. Since the original data here is already randomly ordered, shuffling makes no difference in this case.
Because X_train was converted to an array earlier, tr_x = X_train[train_index] can index out the corresponding rows directly.
The rows selected by train_index are used to train the model; the rows selected by test_index are used for prediction.
The last few lines of the loop write the predicted results into the arrays created earlier.
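The shuffle behaviour described above can be verified directly (a standalone sketch, independent of the article's data):

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.zeros((400, 6))

# Without shuffle, folds are contiguous blocks: the first test fold is rows 0-39
kf_seq = KFold(n_splits=10)
_, first_test = next(iter(kf_seq.split(X_demo)))
print(first_test[:5])  # [0 1 2 3 4]

# With shuffle=True, each test fold is a random subset of the 400 rows
kf_rand = KFold(n_splits=10, shuffle=True, random_state=0)
_, first_test_rand = next(iter(kf_rand.split(X_demo)))
print(len(first_test_rand))  # still 40 rows per fold
```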
Layer 2 model training
model_second = LinearRegression()
model_second.fit(X_train_stack, y_train)
pred = model_second.predict(X_test_stack)
print("R2:", r2_score(y_test, pred))

As is clear here, the second-layer training is simple.
Training uses X_train_stack (400, 4) as input and y_train (400,) as the target.
The trained model is evaluated on X_test_stack (100, 4), giving predictions pred (100,).
The r2_score metric is then computed against y_test.
Base model metrics
# GBDT
model_1 = models[0]
model_1.fit(X_train, y_train)
pred_1 = model_1.predict(X_test)
print("R2:", r2_score(y_test, pred_1))
# RF
model_2 = models[1]
model_2.fit(X_train, y_train)
pred_2 = model_2.predict(X_test)
print("R2:", r2_score(y_test, pred_2))
# ET
model_3 = models[2]
model_3.fit(X_train, y_train)
pred_3 = model_3.predict(X_test)
print("R2:", r2_score(y_test, pred_3))
# ADA
model_4 = models[3]
model_4.fit(X_train, y_train)
pred_4 = model_4.predict(X_test)
print("R2:", r2_score(y_test, pred_4))

Conclusion
The final ensemble's score is clearly better than that of the 4 base models.
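As a side note, scikit-learn ships this whole pipeline as sklearn.ensemble.StackingRegressor, which performs the same cross-validated first-layer/meta-model procedure internally. A minimal sketch on synthetic data (the dataset here is an assumption, not the article's 500.csv):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (StackingRegressor, GradientBoostingRegressor,
                              ExtraTreesRegressor, RandomForestRegressor,
                              AdaBoostRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=500, n_features=6, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

stack = StackingRegressor(
    estimators=[("gbdt", GradientBoostingRegressor(n_estimators=100)),
                ("rf", RandomForestRegressor(n_estimators=100)),
                ("et", ExtraTreesRegressor(n_estimators=100)),
                ("ada", AdaBoostRegressor(n_estimators=100))],
    final_estimator=LinearRegression(),
    cv=10,  # 10-fold cross-validation, matching the manual loop in this article
)
stack.fit(X_train, y_train)
r2 = r2_score(y_test, stack.predict(X_test))
print("R2:", r2)
```

The hand-rolled loop is still useful for understanding the mechanics, but the built-in class avoids the bookkeeping around out-of-fold predictions.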