Decision tree and random forest learning notes (1)
2022-07-28 03:11:00 【Sheep Baa Baa Baa Baa Baa】
1. Decision tree
A decision tree is a machine learning model used for regression or classification tasks, and it is also an important building block of the random forest. There are three classic decision tree algorithms: ID3, C4.5, and CART. ID3 chooses its splits by information gain, C4.5 uses the information gain ratio, and CART builds the tree by evaluating the Gini index.
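To make the three splitting criteria concrete, here is a minimal sketch (the helper functions are illustrative, not taken from any library) that computes the empirical entropy and Gini index of a label array, plus the information gain of a candidate split:
import numpy as np

def entropy(y):
    # empirical entropy, the quantity behind ID3's information gain
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(y):
    # Gini index, the splitting criterion used by CART
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(y, mask):
    # entropy reduction from splitting y by a boolean mask;
    # assumes the mask splits y into two non-empty parts
    n = len(y)
    n_left = mask.sum()
    weighted = (n_left / n) * entropy(y[mask]) + ((n - n_left) / n) * entropy(y[~mask])
    return entropy(y) - weighted
C4.5's gain ratio then divides this information gain by the entropy of the split itself, which penalizes attributes with many distinct values.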
Decision trees also face a pruning problem, whose main purpose is to combat overfitting. The first approach is ordinary decision tree pruning: compute the empirical entropy of each node, then compare the loss function of a set of leaf nodes with the loss function of their parent node after the leaves are removed; if the loss decreases, prune, otherwise keep them. The second is CART pruning: compute the loss function of the tree collapsed to a single root node and the loss function of the complete tree. Because the loss carries a penalty term α, setting these two expressions equal yields an optimal α; then compute this critical value for each internal node, pruning the node when its value is larger and keeping it when smaller.
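In symbols (a standard formulation of the penalized loss both procedures compare; the notation is ours, not the author's), with C(T) the empirical loss of tree T, |T| its number of leaf nodes, and \alpha \ge 0 the penalty weight:

C_\alpha(T) = C(T) + \alpha \lvert T \rvert

Setting the loss of an internal node t collapsed to a single leaf equal to the loss of its full subtree T_t and solving for \alpha gives the critical penalty at which pruning t becomes worthwhile:

g(t) = \frac{C(t) - C(T_t)}{\lvert T_t \rvert - 1}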
To build a decision tree, use sklearn.tree.DecisionTreeClassifier and test it on the iris dataset:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]  # petal length and petal width
y = iris.target
tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X, y)
The decision tree can be visualized with the following code.
import os
from sklearn.tree import export_graphviz

IMAGES_PATH = "."  # adjust to the directory where the .dot file should be written

export_graphviz(tree_clf,
                out_file=os.path.join(IMAGES_PATH, "iris_tree.dot"),
                feature_names=iris.feature_names[2:],
                class_names=iris.target_names,
                rounded=True,
                filled=True)
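The exported .dot file is plain text; assuming the Graphviz command-line tools are installed, it can be converted to an image with:
dot -Tpng iris_tree.dot -o iris_tree.png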
The accuracy can be judged with accuracy_score, and the mse value can be computed to check the size of the loss.
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error

# x_test and y_test assume an earlier train_test_split (not shown above)
y_pred = tree_clf.predict(x_test)
accuracy_score(y_test, y_pred)
mean_squared_error(y_test, y_pred)
The class probabilities for a given instance can also be computed through predict_proba.
tree_clf.predict_proba([[5, 1.5]])
Example 1: train and fine-tune a decision tree model on the moons dataset.
## read the data
from sklearn.datasets import make_moons
data = make_moons(n_samples=10000, noise=0.4)
## split into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data[0], data[1], random_state=42, test_size=0.3)
## optimize the hyperparameters through grid search
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
params = {'max_leaf_nodes': [2, 3, 4, 5, 6, 7, 8, 9]}
gridsearchcv = GridSearchCV(model, params, cv=3)
gridsearchcv.fit(x_train, y_train)
gridsearchcv.best_params_
(By default GridSearchCV refits the best model on the whole training set, so gridsearchcv can be used directly for prediction below.)
## judge the accuracy of the decision tree model
from sklearn.metrics import accuracy_score
y_pred = gridsearchcv.predict(x_test)
accuracy_score(y_test, y_pred)

2. Random forest and ensemble learning
Ensemble learning aggregates the predictions of a group of predictors; the aggregated prediction is generally better than that of any individual predictor.
We can demonstrate this by building a voting classifier. Voting comes in two forms, hard and soft: hard voting takes a majority vote over the classes predicted by the individual predictors, while soft voting averages the class probabilities of the individual predictors and picks the class with the highest average probability.
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()
voting_clf = VotingClassifier(
    estimators=[('log_clf', log_clf), ('rnd_clf', rnd_clf), ('svm_clf', svm_clf)],
    voting='hard')
voting_clf.fit(x_train, y_train)

from sklearn.metrics import accuracy_score
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
As the output shows, the ensemble generally performs at least as well as any single predictor.
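The loop above uses hard voting. For soft voting, every estimator must implement predict_proba; for SVC that means passing probability=True (note: this slows training, since SVC then calibrates probabilities with internal cross-validation). A minimal sketch:
voting_clf_soft = VotingClassifier(
    estimators=[('log_clf', LogisticRegression()),
                ('rnd_clf', RandomForestClassifier()),
                ('svm_clf', SVC(probability=True))],
    voting='soft')
voting_clf_soft.fit(x_train, y_train)
accuracy_score(y_test, voting_clf_soft.predict(x_test))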
Ensemble learning is divided into the bagging algorithm and the pasting algorithm; the difference is that bagging samples with replacement, while pasting samples without replacement.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                            max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(x_train, y_train)
y_pred = bag_clf.predict(x_test)
accuracy_score(y_test, y_pred)
When bootstrap=False, this becomes the pasting algorithm.
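As a minimal sketch, a pasting ensemble differs from the bagging one above only in that flag:
past_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                             max_samples=100, bootstrap=False, n_jobs=-1)
past_clf.fit(x_train, y_train)
accuracy_score(y_test, past_clf.predict(x_test))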
Here we notice a subtlety: when sampling with replacement, some training instances are inevitably never drawn during training. These instances are called out-of-bag (oob) instances. By setting oob_score=True, the model can be evaluated on these instances that took no part in training, and oob_score_ returns the resulting evaluation score. Finally, oob_decision_function_ shows the class probabilities estimated for each training instance from its out-of-bag evaluations.
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500, max_samples=100,
    oob_score=True, n_jobs=-1, bootstrap=True)
bag_clf.fit(x_train, y_train)
bag_clf.oob_score_  ## evaluate the model on the out-of-bag instances
from sklearn.metrics import accuracy_score
y_pred = bag_clf.predict(x_test)
accuracy_score(y_test, y_pred)
bag_clf.oob_decision_function_
A random forest is a kind of ensemble learning; it is essentially an ensemble of decision trees. It can be emulated through bagging, or built directly with RandomForestClassifier.
from sklearn.ensemble import RandomForestClassifier
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(x_train, y_train)
y_pred_rf = rnd_clf.predict(x_test)
accuracy_score(y_test, y_pred_rf)

bag_clf = BaggingClassifier(DecisionTreeClassifier(splitter='random', max_leaf_nodes=16),
                            n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)
bag_clf.fit(x_train, y_train)
y_pred_bag = bag_clf.predict(x_test)
accuracy_score(y_test, y_pred_bag)
Another important use of random forests is measuring feature importance: the importance of each feature can be read from the feature_importances_ attribute.
from sklearn.datasets import load_iris
iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris['data'], iris['target'])
for name, score in zip(iris['feature_names'], rnd_clf.feature_importances_):
    print(name, score)
From the printed scores, sepal length and sepal width turn out to be relatively unimportant.