Decision tree and random forest learning notes (1)
2022-07-28 03:11:00 【Sheep Baa Baa Baa Baa Baa】
One. Decision tree
A decision tree is a machine learning model for solving regression or classification problems, and it is also an important building block of random forests. There are three classic decision tree algorithms: ID3, C4.5, and CART. ID3 chooses splits by information gain, C4.5 by information gain ratio, and CART builds the tree using the Gini index.
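To make these criteria concrete, here is a minimal sketch (my own illustration, not code from the original post) of entropy, Gini impurity, and information gain for a label vector, using NumPy:

import numpy as np

def entropy(y):
    # Shannon entropy of a label vector; ID3 splits on information gain,
    # C4.5 on information gain divided by the split's intrinsic entropy
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(y):
    # Gini impurity, the criterion CART uses for classification
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(y, left_mask):
    # Reduction in entropy achieved by splitting y into two subsets
    left, right = y[left_mask], y[~left_mask]
    weight_left = len(left) / len(y)
    return entropy(y) - (weight_left * entropy(left) + (1 - weight_left) * entropy(right))

y = np.array([0, 0, 0, 1, 1, 1])
split = np.array([True, True, True, True, False, False])
print(entropy(y), gini(y), information_gain(y, split))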
Decision trees also involve pruning, whose main purpose is to combat model overfitting. The first approach prunes a fitted tree directly: compute the empirical entropy of each node, then compare the loss function with a group of leaves kept against the loss after collapsing them back into their parent node; if the loss decreases, prune, otherwise keep them. The second approach is CART cost-complexity pruning. Because the loss function carries a complexity penalty α, for each internal node there is a critical α at which the loss of keeping its subtree equals the loss of collapsing it to a single node; computing this value for every node yields a sequence of candidate α's and nested subtrees, from which the best pruned tree can be selected, cutting subtrees whose loss is larger and keeping those whose loss is smaller.
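scikit-learn exposes CART cost-complexity pruning directly through the ccp_alpha parameter (available since scikit-learn 0.22). The following sketch is my own addition rather than code from the original post: it sweeps the critical values of α and keeps the pruned tree that scores best on a held-out set.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The pruning path gives the critical alpha values at which subtrees collapse
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)

# Refit one pruned tree per alpha and keep the best-scoring one
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    clf = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha).fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    if score > best_score:
        best_alpha, best_score = alpha, score
print(best_alpha, best_score)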
Let's build a decision tree with sklearn.tree.DecisionTreeClassifier and test it on the iris dataset.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target
tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X, y)

You can visualize the trained tree with the following code.
import os
from sklearn.tree import export_graphviz

# IMAGES_PATH must be defined as an existing directory before running this
export_graphviz(tree_clf,
                out_file=os.path.join(IMAGES_PATH, "iris_tree.dot"),
                feature_names=iris.feature_names[2:],
                class_names=iris.target_names,
                rounded=True,
                filled=True)

You can judge the accuracy with accuracy_score, or compute the MSE to look at the size of the loss.
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error

# Assumes a held-out split (x_test, y_test), such as the one created in example 1 below
y_pred = tree_clf.predict(x_test)
accuracy_score(y_test, y_pred)
mean_squared_error(y_test, y_pred)

You can also use predict_proba to get class probabilities:

tree_clf.predict_proba([[5, 1.5]])
Example 1: train and fine-tune a decision tree model on the moons dataset.
# Load the data
from sklearn.datasets import make_moons
data = make_moons(n_samples=10000, noise=0.4)

# Split into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data[0], data[1], random_state=42, test_size=0.3)

# Tune max_leaf_nodes with a grid search
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
params = {'max_leaf_nodes': [2, 3, 4, 5, 6, 7, 8, 9]}
gridsearchcv = GridSearchCV(model, params, cv=3)
gridsearchcv.fit(x_train, y_train)
gridsearchcv.best_params_

# Evaluate the tuned decision tree on the test set
from sklearn.metrics import accuracy_score
y_pred = gridsearchcv.predict(x_test)
accuracy_score(y_test, y_pred)

Two. Random forest and ensemble learning
Ensemble learning aggregates the predictions of a group of predictors, and the aggregated prediction is usually better than that of any individual predictor.
We can demonstrate this by building a voting classifier. Voting comes in two flavors: hard voting, where the majority class among the predictors wins, and soft voting, where the class probabilities of the predictors are averaged and the class with the highest average probability wins.
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()
voting_clf = VotingClassifier(estimators=[('log_clf', log_clf), ('rnd_clf', rnd_clf), ('svm_clf', svm_clf)], voting='hard')
voting_clf.fit(x_train, y_train)

from sklearn.metrics import accuracy_score
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
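For soft voting, every classifier in the ensemble must expose predict_proba; in particular, SVC only does so when probability=True. A minimal soft-voting variant (my own sketch, reusing the x_train/x_test split from example 1):

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

# probability=True makes SVC expose predict_proba (at extra training cost)
soft_voting_clf = VotingClassifier(
    estimators=[('log_clf', LogisticRegression()),
                ('rnd_clf', RandomForestClassifier()),
                ('svm_clf', SVC(probability=True))],
    voting='soft')
soft_voting_clf.fit(x_train, y_train)
print(accuracy_score(y_test, soft_voting_clf.predict(x_test)))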
As you can see, the ensemble generally performs better than any single predictor.
Ensemble learning methods include the bagging algorithm and the pasting algorithm. The difference between the two is that bagging samples the training set with replacement, while pasting samples without replacement.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500, max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(x_train, y_train)
y_pred = bag_clf.predict(x_test)
accuracy_score(y_test, y_pred)

When bootstrap=False, BaggingClassifier performs pasting instead.
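For completeness, a minimal pasting sketch (my own addition, using the same split as above):

from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# bootstrap=False -> each predictor trains on samples drawn without replacement
paste_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                              max_samples=100, bootstrap=False, n_jobs=-1)
paste_clf.fit(x_train, y_train)
print(accuracy_score(y_test, paste_clf.predict(x_test)))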
Here we can notice something: when sampling with replacement, some training instances are never drawn during training. These are called out-of-bag (OOB) instances. By setting oob_score=True, the ensemble evaluates itself on these unused instances, and oob_score_ then returns that evaluation score. Finally, oob_decision_function_ shows the class probability estimates for each training instance.
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500, max_samples=100, oob_score=True, n_jobs=-1, bootstrap=True)
bag_clf.fit(x_train, y_train)
bag_clf.oob_score_  # evaluate the model on the out-of-bag instances

from sklearn.metrics import accuracy_score
y_pred = bag_clf.predict(x_test)
accuracy_score(y_test, y_pred)
bag_clf.oob_decision_function_

A random forest is itself a form of ensemble learning: it is essentially an ensemble of decision trees. It can be emulated with BaggingClassifier, or built directly with RandomForestClassifier.
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(x_train, y_train)
y_pred_rf = rnd_clf.predict(x_test)
accuracy_score(y_test, y_pred_rf)

# A roughly equivalent bagging ensemble of randomized trees
bag_clf = BaggingClassifier(DecisionTreeClassifier(splitter='random', max_leaf_nodes=16),
                            n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)
bag_clf.fit(x_train, y_train)
y_pred_bag = bag_clf.predict(x_test)
accuracy_score(y_test, y_pred_bag)

Another important use of random forests is measuring feature importance: feature_importances_ reports the importance of each feature.
from sklearn.datasets import load_iris

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris['data'], iris['target'])
for name, score in zip(iris['feature_names'], rnd_clf.feature_importances_):
    print(name, score)
From this output you can see that sepal length and width are not that important.