Notes on Random Forests and Ensemble Methods
2022-07-28 02:20:00 【羊咩咩咩咩咩】
The previous article covered voting classifiers, bagging, pasting, and random forests. Those ensembles all reuse the same kind of weak learner, which makes the ensemble homogeneous: if that base model is a poor fit, the whole ensemble performs poorly. This motivates boosting, an ensemble method that combines several weak learners into one strong learner.
The general idea is to train predictors sequentially, each one making some correction to its predecessor.
AdaBoost: after each round it increases the weights of the misclassified instances and then trains the next classifier on the reweighted data. Because of this weight change, the new model focuses on the instances with larger weights, and the cycle repeats until the best result is reached.
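That reweighting step can be sketched in a few lines of NumPy; the array names and the {-1, +1} label encoding below are illustrative choices, not anything taken from sklearn:

```python
import numpy as np

# One round of AdaBoost-style reweighting (binary labels encoded as -1/+1)
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, 1, 1])   # one weak learner's predictions
w = np.full(5, 1 / 5)                  # start from uniform instance weights

err = w[y_pred != y_true].sum() / w.sum()   # weighted error rate of this learner
alpha = 0.5 * np.log((1 - err) / err)       # the learner's say in the final vote
w = w * np.exp(-alpha * y_true * y_pred)    # misclassified instances grow
w = w / w.sum()                             # renormalize to a distribution
print(w)
```

The next weak learner is then trained with these weights, so it concentrates on the instances the previous one got wrong.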
sklearn's AdaBoostClassifier (in sklearn.ensemble) has an algorithm hyperparameter for choosing the variant: 'SAMME' is the stagewise additive model based on the multi-class exponential loss, while 'SAMME.R' is the version based on class probabilities.
## AdaBoost
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=200,
                             algorithm='SAMME.R', learning_rate=0.5)
ada_clf.fit(x_train, y_train)

Another ensemble method is gradient boosting, which resembles steepest descent: each new predictor is trained on the residual errors made by the previous one. The procedure goes as follows.
## Gradient boosting, by hand (regression trees fit the residuals)
from sklearn.tree import DecisionTreeRegressor
tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(x, y)
y2 = y - tree_reg1.predict(x)      # residuals left by the first tree
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(x, y2)               # fit the next tree to those residuals
## ...and so on, until the error falls below a threshold

## The shortcut form of the method above
from sklearn.ensemble import GradientBoostingRegressor
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(x, y)

As with steepest descent, gradient boosting can overshoot the minimum: only after searching all the way do you discover that the best point was earlier. Early stopping was designed for exactly this.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

x_train, x_test, y_train, y_test = train_test_split(x, y)
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt.fit(x_train, y_train)
## staged_predict yields the ensemble's prediction after each boosting stage
errors = [mean_squared_error(y_test, y_pred) for y_pred in gbrt.staged_predict(x_test)]
best_n_estimators = np.argmin(errors) + 1
gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=best_n_estimators)
gbrt_best.fit(x_train, y_train)
## Alternative: stop training as soon as the validation error keeps rising,
## using warm_start so fit() keeps the trees already built
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)
min_test_error = float('inf')
error_going_up = 0
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(x_train, y_train)
    y_pred = gbrt.predict(x_test)
    test_error = mean_squared_error(y_test, y_pred)
    if test_error < min_test_error:
        min_test_error = test_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:    # five rises in a row: stop
            break

The third method: stacking, also known as stacked generalization.
First split the data into a training set, a validation set, and a test set.
Train several predictors on the training set and check that they perform reasonably. Then run the validation set through each predictor and collect the predicted values. These predictions, paired with the validation labels, become the training data for another model (the blender). Finally, feed the test set through the whole stack to evaluate the result.
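For comparison, sklearn packages this whole procedure as StackingClassifier, which trains the blender on out-of-fold predictions instead of a separate hold-out set. A minimal sketch on the iris data (the dataset and estimator choices here are just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(random_state=42)),
                ('svm', SVC(random_state=42))],
    final_estimator=LogisticRegression(max_iter=1000),  # the blender
    cv=5)  # out-of-fold predictions become the blender's training features
stack.fit(X_train, y_train)
acc = stack.score(X_test, y_test)
print(acc)
```

Because the blender is trained on out-of-fold predictions, no data has to be held back just for blending, unlike the manual split used below.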
The fourth method: XGBoost (covered in the next article).
Worked example:
(1) Load the MNIST dataset, split it into training, validation, and test sets, train several classifiers, and then compare a voting classifier against the individual classifiers.
(2) Apply stacking to those same classifiers and compare with the voting classifier to see the ensemble effect.
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1)
x, y = mnist['data'], mnist['target']
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=20000)
x_test, x_val, y_test, y_val = train_test_split(x_test, y_test, test_size=0.5)
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.svm import SVC
rf_clf = RandomForestClassifier()
svm = SVC(probability=True)    # soft voting needs predict_proba
ex_clf = ExtraTreesClassifier()
voting_clf_hard = VotingClassifier(estimators=[('rf_clf', rf_clf), ('svm', svm), ('ex_clf', ex_clf)], voting='hard')
voting_clf_soft = VotingClassifier(estimators=[('rf_clf', rf_clf), ('svm', svm), ('ex_clf', ex_clf)], voting='soft')
from sklearn.metrics import accuracy_score
for model in (rf_clf, svm, ex_clf, voting_clf_hard, voting_clf_soft):
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    print(model.__class__.__name__, accuracy_score(y_test, y_pred))

## Stacking: use each classifier's validation-set predictions as new features
y_pred_rf = rf_clf.predict(x_val)
y_pred_svm = svm.predict(x_val)
y_pred_ex = ex_clf.predict(x_val)
x_val_new = np.empty((len(x_val), 3))
## MNIST labels are digit strings, which numpy casts to float on assignment
for index, preds in enumerate([y_pred_rf, y_pred_svm, y_pred_ex]):
    x_val_new[:, index] = preds
rf_clf_new = RandomForestClassifier(n_estimators=500, oob_score=True)
rf_clf_new.fit(x_val_new, y_val)
print(rf_clf_new.oob_score_)
## Evaluate the stack: build the same three-column features from the test set
y_pred_rf = rf_clf.predict(x_test)
y_pred_svm = svm.predict(x_test)
y_pred_ex = ex_clf.predict(x_test)
x_test_new = np.empty((len(x_test), 3))
for index, preds in enumerate([y_pred_rf, y_pred_svm, y_pred_ex]):
    x_test_new[:, index] = preds
print(accuracy_score(y_test, rf_clf_new.predict(x_test_new)))