Prediction and regression of stacking integrated model
2022-06-30 21:22:00 【Black willow smoke】
Preface
Many articles already give detailed introductions to the principles of the various ensemble models, so this article will not repeat the theory of stacking. Instead, it works through a concrete case: using a stacking ensemble to predict a regression problem.
The code is adapted from an article on predicting classification problems with stacking ensemble learning, adjusted here to solve a regression problem.
Code and analysis
Imports
KFold cross-validation is used.
The stacking base layer contains four models (GBDT, ET, RF, ADA).
The meta-model is LinearRegression.
The evaluation metric for the regression model is r2_score.
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor as GBDT
from sklearn.ensemble import ExtraTreesRegressor as ET
from sklearn.ensemble import RandomForestRegressor as RF
from sklearn.ensemble import AdaBoostRegressor as ADA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import numpy as np
import pandas as pd
On why KFold is used, the original article mentions:
KFold is suitable for splitting regression data.
StratifiedKFold is suitable for splitting classification data.
The experiment confirmed this: StratifiedKFold.split(X_train, y_train) fails because y_train is continuous rather than categorical, so only KFold can be used.
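The difference can be reproduced with a minimal standalone sketch (synthetic data, not the article's 500.csv): StratifiedKFold rejects a continuous target, while KFold splits on row indices only.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)  # continuous regression target

# KFold ignores y entirely, so it works for regression
kf = KFold(n_splits=5)
print(sum(1 for _ in kf.split(X)))  # 5

# StratifiedKFold needs class labels and raises on a continuous y
skf = StratifiedKFold(n_splits=5)
try:
    list(skf.split(X, y))
except ValueError as e:
    print("StratifiedKFold failed:", e)
```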
Data loading
Read the file and split the dataset with train_test_split. The split result is of type DataFrame; since arrays are more convenient for the later steps, the data is converted after splitting. The DataFrame column names are also recorded, as they will be used later.
df = pd.read_csv("500.csv")
X = df.iloc[:, :6]
y = df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_cols = X_train.columns
X_train = X_train.values
y_train = y_train.values
X_test = X_test.values
The raw data (500.csv) contains 500 samples; the first six columns are features and the last two are outputs. For simplicity, this case only studies how to predict the last output column from the first six features. After the split, X_train has shape (400, 6).
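For readers without the 500.csv file, a comparable dataset can be generated synthetically; this stand-in (shapes and column layout only) is an assumption for reproduction, not the author's data.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for 500.csv: 500 samples, 6 features, 1 target column
X_syn, y_syn = make_regression(n_samples=500, n_features=6, noise=10.0, random_state=42)
df = pd.DataFrame(X_syn, columns=[f"f{i}" for i in range(6)])
df["target"] = y_syn

X = df.iloc[:, :6]
y = df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape)  # (400, 6)
```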
The first layer model
models = [GBDT(n_estimators=100),
RF(n_estimators=100),
ET(n_estimators=100),
ADA(n_estimators=100)]
X_train_stack = np.zeros((X_train.shape[0], len(models)))
X_test_stack = np.zeros((X_test.shape[0], len(models)))
Two new ndarrays are created here, with sizes (400, 4) and (100, 4).
To understand them in terms of stacking:
For each model, cross-validation is run on X_train; when it finishes, out-of-fold predictions of size (400, 1) are obtained. Across the 4 models this gives (400, 4), which becomes the training input of the second-layer model.
During each fold, the model also predicts X_test; averaging over the folds gives predictions of size (100, 1) per model, and (100, 4) across the 4 models, which becomes the test input of the second-layer model.
First-layer training: producing the data required by the second layer
# 10-fold stacking
n_folds = 10
kf = KFold(n_splits=n_folds)
for i, model in enumerate(models):
    X_stack_test_n = np.zeros((X_test.shape[0], n_folds))
    for j, (train_index, test_index) in enumerate(kf.split(X_train)):
        tr_x = X_train[train_index]
        tr_y = y_train[train_index]
        model.fit(tr_x, tr_y)
        # out-of-fold predictions form the stacking training set
        X_train_stack[test_index, i] = model.predict(X_train[test_index])
        X_stack_test_n[:, j] = model.predict(X_test)
    # average the per-fold test predictions to form the stacking test set
    X_test_stack[:, i] = X_stack_test_n.mean(axis=1)
First, the outer loop over i iterates over the 4 models.
X_stack_test_n is defined to store, for each of the 10 cross-validation folds, the prediction on X_test (100, 6), a (100, 1) result per fold; after 10 folds its size is (100, 10). The last line of code averages these 10 columns to obtain a (100, 1) result. Across the 4 models this yields (100, 4), the test input of the second-layer model.
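The axis=1 averaging described above can be checked in isolation: averaging a (100, 10) matrix across its columns yields one value per test sample.

```python
import numpy as np

# 10 fold-predictions per sample, like X_stack_test_n in the loop above
preds = np.arange(1000, dtype=float).reshape(100, 10)
avg = preds.mean(axis=1)
print(avg.shape)  # (100,)
print(avg[0])     # row 0 is 0..9, so its mean is 4.5
```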
Then look at the inner loop over j, which performs the 10-fold cross-validation.
train_index and test_index record the training and test row indices of each fold. Without shuffle=True, the 400 rows (0-399) are split sequentially into blocks of 40 rather than sampled randomly. Since the original data here is already randomly ordered, shuffling makes no difference in this case.
Because X_train was converted to an array earlier, tr_x = X_train[train_index] can index out the corresponding rows directly.
The rows selected by train_index are used to train the model; the rows selected by test_index are used for prediction.
The last few lines of the loop write the predicted results into the arrays created earlier.
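The shuffle behaviour described above can be verified directly (a standalone sketch, independent of the article's data):

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.zeros((400, 6))

# Without shuffle, folds are contiguous blocks: the first test fold is rows 0-39
kf_seq = KFold(n_splits=10)
_, first_test = next(iter(kf_seq.split(X_demo)))
print(first_test[:5])  # [0 1 2 3 4]

# With shuffle=True, each test fold is a random subset of the 400 rows
kf_rand = KFold(n_splits=10, shuffle=True, random_state=0)
_, first_test_rand = next(iter(kf_rand.split(X_demo)))
print(len(first_test_rand))  # still 40 rows per fold
```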
Layer 2 model training
model_second = LinearRegression()
model_second.fit(X_train_stack, y_train)
pred = model_second.predict(X_test_stack)
print("R2:", r2_score(y_test, pred))

As is clear here, the second-layer training is simple.
Training uses X_train_stack (400, 4) as input and y_train (400,) as the target.
The trained model is evaluated on X_test_stack (100, 4), giving predictions pred (100,).
The r2_score metric is then computed against y_test.
Base model metrics
# GBDT
model_1 = models[0]
model_1.fit(X_train, y_train)
pred_1 = model_1.predict(X_test)
print("R2:", r2_score(y_test, pred_1))
# RF
model_2 = models[1]
model_2.fit(X_train, y_train)
pred_2 = model_2.predict(X_test)
print("R2:", r2_score(y_test, pred_2))
# ET
model_3 = models[2]
model_3.fit(X_train, y_train)
pred_3 = model_3.predict(X_test)
print("R2:", r2_score(y_test, pred_3))
# ADA
model_4 = models[3]
model_4.fit(X_train, y_train)
pred_4 = model_4.predict(X_test)
print("R2:", r2_score(y_test, pred_4))

Conclusion
The final ensemble's score is clearly better than that of the 4 base models.
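As a side note, scikit-learn ships this whole pipeline as sklearn.ensemble.StackingRegressor, which performs the same cross-validated first-layer/meta-model procedure internally. A minimal sketch on synthetic data (the dataset here is an assumption, not the article's 500.csv):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (StackingRegressor, GradientBoostingRegressor,
                              ExtraTreesRegressor, RandomForestRegressor,
                              AdaBoostRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=500, n_features=6, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

stack = StackingRegressor(
    estimators=[("gbdt", GradientBoostingRegressor(n_estimators=100)),
                ("rf", RandomForestRegressor(n_estimators=100)),
                ("et", ExtraTreesRegressor(n_estimators=100)),
                ("ada", AdaBoostRegressor(n_estimators=100))],
    final_estimator=LinearRegression(),
    cv=10,  # 10-fold cross-validation, matching the manual loop in this article
)
stack.fit(X_train, y_train)
r2 = r2_score(y_test, stack.predict(X_test))
print("R2:", r2)
```

The hand-rolled loop is still useful for understanding the mechanics, but the built-in class avoids the bookkeeping around out-of-fold predictions.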