Decision tree and random forest learning notes (1)
2022-07-28 03:11:00 【Sheep Baa Baa Baa Baa Baa】
1. Decision tree
A decision tree is a machine learning model used for regression or classification tasks, and it is also an important building block of the random forest. There are three classic decision tree algorithms: ID3, C4.5, and CART. ID3 chooses its splits by information gain, C4.5 uses the information gain ratio, and CART builds the tree by evaluating the Gini index.
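To make the three splitting criteria concrete, here is a minimal sketch (the helper functions are illustrative, not taken from any library) that computes the empirical entropy and Gini index of a label array, plus the information gain of a candidate split:
import numpy as np

def entropy(y):
    # empirical entropy, the quantity behind ID3's information gain
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(y):
    # Gini index, the splitting criterion used by CART
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(y, mask):
    # entropy reduction from splitting y by a boolean mask;
    # assumes the mask splits y into two non-empty parts
    n = len(y)
    n_left = mask.sum()
    weighted = (n_left / n) * entropy(y[mask]) + ((n - n_left) / n) * entropy(y[~mask])
    return entropy(y) - weighted
C4.5's gain ratio then divides this information gain by the entropy of the split itself, which penalizes attributes with many distinct values.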
Decision trees also face a pruning problem, whose main purpose is to combat overfitting. The first approach is ordinary decision tree pruning: compute the empirical entropy of each node, then compare the loss function of a set of leaf nodes with the loss function of their parent node after the leaves are removed; if the loss decreases, prune, otherwise keep them. The second is CART pruning: compute the loss function of the tree collapsed to a single root node and the loss function of the complete tree. Because the loss carries a penalty term α, setting these two expressions equal yields an optimal α; then compute this critical value for each internal node, pruning the node when its value is larger and keeping it when smaller.
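In symbols (a standard formulation of the penalized loss both procedures compare; the notation is ours, not the author's), with C(T) the empirical loss of tree T, |T| its number of leaf nodes, and \alpha \ge 0 the penalty weight:

C_\alpha(T) = C(T) + \alpha \lvert T \rvert

Setting the loss of an internal node t collapsed to a single leaf equal to the loss of its full subtree T_t and solving for \alpha gives the critical penalty at which pruning t becomes worthwhile:

g(t) = \frac{C(t) - C(T_t)}{\lvert T_t \rvert - 1}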
To build a decision tree, use sklearn.tree.DecisionTreeClassifier and test it on the iris dataset:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]  # petal length and petal width
y = iris.target
tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X, y)
The decision tree can be visualized with the following code.
import os
from sklearn.tree import export_graphviz

IMAGES_PATH = "."  # adjust to the directory where the .dot file should be written

export_graphviz(tree_clf,
                out_file=os.path.join(IMAGES_PATH, "iris_tree.dot"),
                feature_names=iris.feature_names[2:],
                class_names=iris.target_names,
                rounded=True,
                filled=True)
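The exported .dot file is plain text; assuming the Graphviz command-line tools are installed, it can be converted to an image with:
dot -Tpng iris_tree.dot -o iris_tree.png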
The accuracy can be judged with accuracy_score, and the mse value can be computed to check the size of the loss.
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error

# x_test and y_test assume an earlier train_test_split (not shown above)
y_pred = tree_clf.predict(x_test)
accuracy_score(y_test, y_pred)
mean_squared_error(y_test, y_pred)
The class probabilities for a given instance can also be computed through predict_proba.
tree_clf.predict_proba([[5, 1.5]])
Example 1: train and fine-tune a decision tree model on the moons dataset.
## read the data
from sklearn.datasets import make_moons
data = make_moons(n_samples=10000, noise=0.4)
## split into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data[0], data[1], random_state=42, test_size=0.3)
## optimize the hyperparameters through grid search
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
params = {'max_leaf_nodes': [2, 3, 4, 5, 6, 7, 8, 9]}
gridsearchcv = GridSearchCV(model, params, cv=3)
gridsearchcv.fit(x_train, y_train)
gridsearchcv.best_params_
(By default GridSearchCV refits the best model on the whole training set, so gridsearchcv can be used directly for prediction below.)
## judge the accuracy of the decision tree model
from sklearn.metrics import accuracy_score
y_pred = gridsearchcv.predict(x_test)
accuracy_score(y_test, y_pred)

2. Random forest and ensemble learning
Ensemble learning aggregates the predictions of a group of predictors; the aggregated prediction is generally better than that of any individual predictor.
We can demonstrate this by building a voting classifier. Voting comes in two forms, hard and soft: hard voting takes a majority vote over the classes predicted by the individual predictors, while soft voting averages the class probabilities of the individual predictors and picks the class with the highest average probability.
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()
voting_clf = VotingClassifier(
    estimators=[('log_clf', log_clf), ('rnd_clf', rnd_clf), ('svm_clf', svm_clf)],
    voting='hard')
voting_clf.fit(x_train, y_train)

from sklearn.metrics import accuracy_score
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
As the output shows, the ensemble generally performs at least as well as any single predictor.
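The loop above uses hard voting. For soft voting, every estimator must implement predict_proba; for SVC that means passing probability=True (note: this slows training, since SVC then calibrates probabilities with internal cross-validation). A minimal sketch:
voting_clf_soft = VotingClassifier(
    estimators=[('log_clf', LogisticRegression()),
                ('rnd_clf', RandomForestClassifier()),
                ('svm_clf', SVC(probability=True))],
    voting='soft')
voting_clf_soft.fit(x_train, y_train)
accuracy_score(y_test, voting_clf_soft.predict(x_test))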
Ensemble learning is divided into the bagging algorithm and the pasting algorithm; the difference is that bagging samples with replacement, while pasting samples without replacement.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                            max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(x_train, y_train)
y_pred = bag_clf.predict(x_test)
accuracy_score(y_test, y_pred)
When bootstrap=False, this becomes the pasting algorithm.
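As a minimal sketch, a pasting ensemble differs from the bagging one above only in that flag:
past_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                             max_samples=100, bootstrap=False, n_jobs=-1)
past_clf.fit(x_train, y_train)
accuracy_score(y_test, past_clf.predict(x_test))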
Here we notice a subtlety: when sampling with replacement, some training instances are inevitably never drawn during training. These instances are called out-of-bag (oob) instances. By setting oob_score=True, the model can be evaluated on these instances that took no part in training, and oob_score_ returns the resulting evaluation score. Finally, oob_decision_function_ shows the class probabilities estimated for each training instance from its out-of-bag evaluations.
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500, max_samples=100,
    oob_score=True, n_jobs=-1, bootstrap=True)
bag_clf.fit(x_train, y_train)
bag_clf.oob_score_  ## evaluate the model on the out-of-bag instances
from sklearn.metrics import accuracy_score
y_pred = bag_clf.predict(x_test)
accuracy_score(y_test, y_pred)
bag_clf.oob_decision_function_
A random forest is a kind of ensemble learning; it is essentially an ensemble of decision trees. It can be emulated through bagging, or built directly with RandomForestClassifier.
from sklearn.ensemble import RandomForestClassifier
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(x_train, y_train)
y_pred_rf = rnd_clf.predict(x_test)
accuracy_score(y_test, y_pred_rf)

bag_clf = BaggingClassifier(DecisionTreeClassifier(splitter='random', max_leaf_nodes=16),
                            n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)
bag_clf.fit(x_train, y_train)
y_pred_bag = bag_clf.predict(x_test)
accuracy_score(y_test, y_pred_bag)
Another important use of random forests is measuring feature importance: the importance of each feature can be read from the feature_importances_ attribute.
from sklearn.datasets import load_iris
iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris['data'], iris['target'])
for name, score in zip(iris['feature_names'], rnd_clf.feature_importances_):
    print(name, score)
From the printed scores, sepal length and sepal width turn out to be relatively unimportant.