Prediction and regression of stacking integrated model
2022-06-30 21:22:00 【Black willow smoke】
Preface
Many articles already give detailed introductions to the principles of the various ensemble models, so this one will not repeat the theory behind stacking. Instead, it works directly through a case study that uses a stacking ensemble to predict a regression problem.
The code is adapted from an article that used stacking ensemble learning for a classification problem, adjusted here to handle regression.
Code and analysis
Imports
KFold cross validation is used.
The stacking ensemble contains 4 kinds of base models (GBDT, ET, RF, ADA).
The meta-model is LinearRegression.
The evaluation metric for the regression models is r2_score.
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor as GBDT
from sklearn.ensemble import ExtraTreesRegressor as ET
from sklearn.ensemble import RandomForestRegressor as RF
from sklearn.ensemble import AdaBoostRegressor as ADA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import numpy as np
import pandas as pd
On why KFold is used, note that:
KFold is suitable for splitting regression-type data.
StratifiedKFold is suitable for splitting classification data.
Experiments confirm this: StratifiedKFold.split(X_train, y_train) cannot be used here because y_train is continuous, so only KFold works.
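A minimal sketch of the difference, on synthetic data (not the author's 500.csv): KFold never looks at the target, while StratifiedKFold refuses a continuous one.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)  # continuous regression target

# KFold splits by position only and ignores y, so it works for regression
kf = KFold(n_splits=5)
n_kf_folds = sum(1 for _ in kf.split(X))

# StratifiedKFold tries to balance class labels per fold;
# a continuous target has no classes, so it raises ValueError
skf = StratifiedKFold(n_splits=5)
try:
    list(skf.split(X, y))
    stratified_ok = True
except ValueError:
    stratified_ok = False

print(n_kf_folds, stratified_ok)
```

This is why the article's experiment found StratifiedKFold unusable for this dataset.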
Data loading
Read the file and split the dataset with train_test_split. The split results are DataFrames; since arrays are more convenient to work with later, the data is converted after splitting. The DataFrame column names are also recorded for later use.
df = pd.read_csv("500.csv")
X = df.iloc[:, :6]
y = df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_cols = X_train.columns
X_train = X_train.values
y_train = y_train.values
X_test = X_test.values
The original data (500.csv) has 500 samples. The first six columns are features and the last two are outputs. For simplicity, this case only studies how to predict the last output column from the first six features. After splitting, X_train has shape (400, 6).
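Since 500.csv is not included with the article, the following stand-in synthesizes a frame with the same layout (hypothetical column names f1..f6, out1, out2) so the rest of the code can be run end to end:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500
# six feature columns plus two output columns, mirroring the described layout
X_feat = rng.normal(size=(n, 6))
y1 = X_feat @ rng.normal(size=6) + rng.normal(scale=0.1, size=n)
y2 = X_feat @ rng.normal(size=6) + rng.normal(scale=0.1, size=n)
df = pd.DataFrame(np.column_stack([X_feat, y1, y2]),
                  columns=[f"f{i}" for i in range(1, 7)] + ["out1", "out2"])

# same split as the article: first six columns as X, last column as y
X = df.iloc[:, :6]
y = df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
print(X_train.shape, X_test.shape)  # (400, 6) (100, 6)
```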
The first layer model
models = [GBDT(n_estimators=100),
RF(n_estimators=100),
ET(n_estimators=100),
ADA(n_estimators=100)]
X_train_stack = np.zeros((X_train.shape[0], len(models)))
X_test_stack = np.zeros((X_test.shape[0], len(models)))
Here two new ndarrays are created; their sizes are (400, 4) and (100, 4).
Understanding this in terms of stacking:
For each base model, cross validation on X_train yields out-of-fold predictions of size (400, 1); across the 4 models this gives (400, 4), the training input for the second-layer model.
During each fold of the cross validation, the model also predicts X_test; averaging over folds gives predictions of size (100, 1), and across the 4 models (100, 4), the test input for the second-layer model.
First-layer training produces the data required by the second layer
# 10-fold stacking
n_folds = 10
kf = KFold(n_splits=n_folds)
for i, model in enumerate(models):
    X_stack_test_n = np.zeros((X_test.shape[0], n_folds))
    for j, (train_index, test_index) in enumerate(kf.split(X_train)):
        tr_x = X_train[train_index]
        tr_y = y_train[train_index]
        model.fit(tr_x, tr_y)
        # generate the stacking training dataset
        X_train_stack[test_index, i] = model.predict(X_train[test_index])
        X_stack_test_n[:, j] = model.predict(X_test)
    # generate the stacking test dataset
    X_test_stack[:, i] = X_stack_test_n.mean(axis=1)
The outer i loop iterates over the 4 base models.
X_stack_test_n is defined to store, for each of the 10 cross-validation rounds, the (100, 1) prediction on X_test (shape (100, 6)); after 10 rounds its size is (100, 10). The last line of the loop averages these 10 columns to obtain a (100, 1) vector. Across the 4 models this produces (100, 4), the test input for the second-layer model.
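A tiny illustration of that averaging step, with toy numbers in place of the real predictions:

```python
import numpy as np

# 3 test samples, 4 folds of predictions (toy values)
fold_preds = np.arange(12, dtype=float).reshape(3, 4)
# average across the folds (axis=1) to get one prediction per sample
avg = fold_preds.mean(axis=1)
print(avg)  # [1.5 5.5 9.5]
```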
The inner j loop performs the 10-fold cross validation.
train_index and test_index record the training and test indices for each fold. Without shuffle=True, KFold takes the 0-399 samples sequentially, 40 at a time, as the test fold rather than sampling them randomly. Because the original data here is already randomly ordered, shuffling makes no practical difference.
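The sequential-versus-shuffled behavior is easy to see on a toy array (10 samples, 5 folds):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)

# Without shuffling, each test fold is a contiguous block taken in order
_, first_test = next(KFold(n_splits=5).split(X))
print(first_test)  # [0 1]

# With shuffle=True (and a seed for reproducibility),
# fold membership is randomized before splitting
_, shuffled_test = next(KFold(n_splits=5, shuffle=True,
                              random_state=0).split(X))
print(shuffled_test)
```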
Since X_train was converted to an array earlier, tr_x = X_train[train_index] extracts the rows at the corresponding positions directly.
The rows selected by train_index are used to train the model; those selected by test_index are used for prediction.
The last few lines of code store the predicted results into the arrays created earlier.
Layer 2 model training
model_second = LinearRegression()
model_second.fit(X_train_stack, y_train)
pred = model_second.predict(X_test_stack)
print("R2:", r2_score(y_test, pred))

The second-layer training is straightforward:
The training input is X_train_stack (400, 4) with y_train (400,).
The trained model predicts on X_test_stack (100, 4), giving pred (100,).
Finally, compute the r2_score evaluation metric.
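For comparison, scikit-learn ships a StackingRegressor that implements the same idea: out-of-fold predictions feed the meta-model. One difference worth noting is that it then refits the base models on the full training set, rather than averaging per-fold test predictions as the manual code above does. A sketch on synthetic data (make_regression stands in for 500.csv):

```python
from sklearn.ensemble import (StackingRegressor, GradientBoostingRegressor,
                              ExtraTreesRegressor, RandomForestRegressor,
                              AdaBoostRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=500, n_features=6, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

stack = StackingRegressor(
    estimators=[("gbdt", GradientBoostingRegressor(n_estimators=100)),
                ("rf", RandomForestRegressor(n_estimators=100)),
                ("et", ExtraTreesRegressor(n_estimators=100)),
                ("ada", AdaBoostRegressor(n_estimators=100))],
    final_estimator=LinearRegression(),
    cv=10)  # 10-fold out-of-fold predictions for the meta-model
stack.fit(X_train, y_train)
r2 = r2_score(y_test, stack.predict(X_test))
print("R2:", r2)
```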
Base model metrics
# GBDT
model_1 = models[0]
model_1.fit(X_train,y_train)
pred_1 = model_1.predict(X_test)
print("R2:", r2_score(y_test, pred_1))
# RF
model_2 = models[1]
model_2.fit(X_train, y_train)
pred_2 = model_2.predict(X_test)
print("R2:", r2_score(y_test, pred_2))
# ET
model_3 = models[2]
model_3.fit(X_train, y_train)
pred_3 = model_3.predict(X_test)
print("R2:", r2_score(y_test, pred_3))
# ADA
model_4 = models[3]
model_4.fit(X_train, y_train)
pred_4 = model_4.predict(X_test)
print("R2:", r2_score(y_test, pred_4))

Conclusion
The final stacked model's result is clearly better than that of the 4 base models.