
Decision tree (Hunt, ID3, C4.5, CART)


1. Decision tree

   The decision tree is one of the ten classical data mining algorithms. It is a flowchart-like tree structure that encodes if…then… rules and can be used both to predict numerical dependent variables and to classify discrete dependent variables. Decision tree algorithms are simple, intuitive, and easy to interpret, and in practice they offer a speed advantage that few other algorithms can match.

   Decision tree methods are widely used for classification, prediction, and rule extraction. In the late 1970s and early 1980s, the machine learning researcher J. Ross Quinlan developed a decision tree algorithm called the Iterative Dichotomiser (ID3), which greatly advanced decision trees in the machine learning field. Quinlan later proposed C4.5, a successor to ID3 that became a performance benchmark for new supervised learning algorithms. In 1984, several statisticians proposed the CART classification algorithm.

   A classification decision tree model is a tree structure that describes how instances are classified. A decision tree consists of nodes and directed edges. There are two types of nodes: internal nodes and leaf nodes. An internal node represents a feature (attribute), and a leaf node represents a class.

   Decision tree classification process: **starting from the root node, test one feature of the instance and, according to the result, send the instance to the corresponding child node; each child node corresponds to one value of that feature. Test and route the instance recursively until a leaf node is reached.** The instance is then assigned to the class of that leaf. The figure below is a schematic decision tree, where circles denote internal nodes and triangles denote leaf nodes.

[Figure: schematic decision tree; circles denote internal nodes and triangles denote leaf nodes]

Characteristics of decision trees:

  • A tree structure that classifies data effectively;
  • Each path from the root node to a leaf node forms a rule;
  • The rules are mutually exclusive and exhaustive, i.e. every sample is covered by exactly one path;
  • Given a sufficiently large amount of data, a decision tree can be constructed by mining patterns in the data.

2. Decision tree algorithms

2.1 Hunt Algorithm

  The Hunt algorithm builds a decision tree using a locally optimal (greedy) strategy and is the basis of many decision tree algorithms, including ID3, C4.5, and CART.

   The Hunt algorithm builds the decision tree recursively by successively partitioning the training records into purer and purer subsets.

   Let $D_t$ denote the set of training records associated with node t. The algorithm works as follows:

  • If all records in $D_t$ belong to the same class $y_t$, then t is a leaf node labeled $y_t$.
  • If $D_t$ contains records of more than one class, an attribute test condition is selected to partition the records into smaller subsets. A child node is created for each outcome of the test condition, and the records in $D_t$ are distributed to the children according to the outcomes; the algorithm is then applied recursively to each child node (a minimal code sketch follows this list).
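A minimal sketch of the Hunt recursion, assuming each record is a pair (attributes_dict, label) and that choose_test is a hypothetical helper supplied by the caller, which picks an attribute test and returns the test function together with its possible outcomes; it is illustrative rather than a full implementation.

# A minimal sketch of the Hunt recursion. `records` is a list of
# (attributes_dict, label) pairs; `choose_test(records)` is assumed to return
# (test_fn, outcomes), where test_fn maps an attributes_dict to one outcome.
def hunt(records, choose_test):
    labels = {label for _, label in records}
    if len(labels) == 1:                          # D_t is pure: make a leaf
        return {"leaf": labels.pop()}
    test_fn, outcomes = choose_test(records)      # pick an attribute test condition
    node = {"test": test_fn, "children": {}}
    for outcome in outcomes:                      # one child per test outcome
        subset = [r for r in records if test_fn(r[0]) == outcome]
        if not subset:                            # empty child: label with the majority class
            majority = max(labels, key=lambda c: sum(l == c for _, l in records))
            node["children"][outcome] = {"leaf": majority}
        else:
            node["children"][outcome] = hunt(subset, choose_test)
    return node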

   We illustrate the algorithm with a credit-card default example: we want to predict whether a cardholder will repay on time or default (cheat). A training data set can be built from the records of previous borrowers. In the table below, each record contains a borrower's personal information and whether the borrower defaulted.

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

   As shown above, the initial decision tree for this problem has a single node whose class label is "Cheat = No" (see the figure below).

[Figure: initial single-node tree labeled "Cheat = No"]

   This reflects the fact that most cardholders do not default. The tree must be refined, however, because the root node contains records of both classes, "Cheat = No" and "Cheat = Yes". Using the test condition on "Refund", the records are partitioned into smaller subsets, as shown in the figure below.

[Figure: the records split into subsets by the Refund test condition]

   Next, the Hunt algorithm is called recursively on each child of the root node. The training data show that every borrower with Refund = Yes repaid on time, so the left child of the root becomes a leaf labeled "Cheat = No" (see the figure below).

[Figure: tree with the left child (Refund = Yes) as a leaf labeled "Cheat = No"]

   For the right child, the Hunt algorithm must be called recursively until every subset contains records of a single class. The trees produced by the successive recursive calls are shown below.

[Figures: decision trees produced by the successive recursive calls of the Hunt algorithm]

2.2 Issues in building a decision tree

2.2.1 How to specify test conditions for different types of attributes

Splitting based on nominal attributes

  • Multiway split: the number of partitions (outputs) equals the number of distinct values of the attribute

[Figure: multiway split on a nominal attribute]

  • Binary split: the number of partitions is 2; for an attribute with k values there are $2^{k-1}-1$ ways to group the values into two non-empty subsets (see the enumeration sketch after the figures below)

[Figures: the possible binary groupings of a nominal attribute's values]
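A small sketch that enumerates all binary groupings of a k-valued nominal attribute and confirms the $2^{k-1}-1$ count; binary_partitions is just an illustrative helper, applied here to the Marital Status values from the table above.

from itertools import combinations

def binary_partitions(values):
    """All ways to split a set of nominal attribute values into two non-empty groups."""
    values = list(values)
    rest = values[1:]
    parts = []
    # Fix the first value on the left side to avoid counting mirror-image splits twice.
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {values[0], *combo}
            right = set(values) - left
            if right:                       # both sides must be non-empty
                parts.append((left, right))
    return parts

splits = binary_partitions(["Single", "Married", "Divorced"])
print(len(splits))   # 3 == 2**(3-1) - 1
for left, right in splits:
    print(left, "|", right)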

Splitting based on ordinal attributes

  • Multiway split: the number of partitions (outputs) equals the number of distinct values of the attribute

[Figure: multiway split on an ordinal attribute]

  • Binary split: the number of partitions is 2; the attribute values may be grouped into two subsets, provided the grouping does not violate their order

[Figure: binary grouping of ordinal attribute values]

Splitting based on continuous attributes

  • Binary split: $(A < v)$ or $(A \ge v)$. All possible split points are considered and an optimal split point v is chosen (a small sketch of generating candidate thresholds follows the figures below)

[Figure: binary split on a continuous attribute, A < v versus A ≥ v]

  • Multiway split: the range of A is discretized into intervals $v_i \le A < v_{i+1}$, $i = 1, \dots, k$

[Figure: multiway split on a continuous attribute using value ranges]
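A brief sketch of how candidate thresholds for a binary split on a continuous attribute can be generated: take the midpoints between consecutive distinct sorted values. The income values used here are the Taxable Income column (in K) from the credit-card table above.

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]  # Taxable Income (K) from the table above

distinct = sorted(set(incomes))
# Candidate thresholds v for a binary split (A < v) vs (A >= v):
# the midpoints between consecutive distinct values.
candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
print(candidates)   # [65.0, 72.5, 80.0, 87.5, 92.5, 97.5, 110.0, 122.5, 172.5]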

2.2.2 How to choose the best split

   As the splitting proceeds, we want the samples in each branch node to belong to the same class as far as possible; in other words, we want node purity to keep increasing. The higher the purity (the more skewed the class distribution), the better the split.

[Figure: example nodes with skewed (high-purity) and uniform (low-purity) class distributions]

   As shown in the figure above, when one class, say $C_0$, accounts for 90% of the records in a node, the node has high purity. To evaluate a split on a given attribute, we therefore compare the impurity before the split (the parent node) with the impurity after the split (all the child nodes); the larger the reduction, the better the split. This requires a mathematical measure of impurity.

   Decision trees commonly use three criteria to choose a split: information gain (ID3), information gain ratio (C4.5), and the Gini index (CART).

2.3 Information gain algorithm

2.3.1 Principle of information gain algorithm

   The table below contains customer information from a computer store. It can be treated as a data set whose attributes are age, income, hobby, and credit rating, and whose class label is whether the customer tends to buy a computer. Our goal is to build a model that classifies customers into two categories, those likely to buy a computer and those not, which is exactly the kind of problem a decision tree can solve.

id   Age           Income   Hobby   Credit      Buys computer
1    youth         high     no      fair        no
2    youth         high     no      excellent   no
3    middle-aged   high     no      fair        yes
4    senior        medium   no      fair        yes
5    senior        low      yes     fair        yes
6    senior        low      yes     excellent   no
7    middle-aged   low      yes     excellent   yes
8    youth         medium   no      fair        no
9    youth         low      yes     fair        yes
10   senior        medium   yes     fair        yes
11   youth         medium   yes     excellent   yes
12   middle-aged   medium   no      excellent   yes
13   middle-aged   high     yes     fair        yes
14   senior        medium   no      excellent   no

   In 1948, Shannon introduced the concept of information entropy. The amount of information a message carries is directly related to its uncertainty: resolving something highly uncertain requires a great deal of information. Entropy measures the uncertainty of a random variable; the larger the entropy, the greater the uncertainty.

   Suppose the random variable S has n possible outcomes $S_i$ ($i = 1, 2, \dots, n$), and let $p_i$ denote the probability of the i-th outcome. The entropy of S is defined as:
Entropy(S) = -\sum_{i=1}^{n} p_i \log_{2} p_i
   The unit of entropy is the bit; when entropy is maximal, the uncertainty of the variable is maximal. Returning to the computer-purchase example, the data set D contains 9 "yes" and 5 "no" records for the label "buys computer", so its entropy is:

Info(D) = -\frac{9}{14}\log_{2}\frac{9}{14} - \frac{5}{14}\log_{2}\frac{5}{14} \approx 0.940 \text{ bit}
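The same value can be checked with a couple of lines of Python (a small sketch; the counts 9 "yes" and 5 "no" are read off the table above).

from math import log2

# Entropy of the "buys computer" label: 9 "yes" and 5 "no" records
p_yes, p_no = 9 / 14, 5 / 14
info_D = -(p_yes * log2(p_yes) + p_no * log2(p_no))
print(round(info_D, 3))   # 0.94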

   The conditional entropy of a random variable A given S, written Entropy(A|S), is defined as:
Entropy(A \mid S) = \sum_{i=1}^{n} p_i \cdot Entropy(A \mid S = s_i)
   Information gain measures how much knowing a feature reduces the uncertainty about the class Y. A feature with a large information gain has a strong influence on the result. The information gain of feature A on data set S is expressed as:

Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)

   where $S_v$ is the subset of S in which attribute A takes the value v.

[Figure: the computer-purchase records grouped by age, with class counts per group]

   Take the computer-purchase data set and compute the information gain of age. Of the 14 records, 5 are youths, 4 are middle-aged, and 5 are seniors. Computing the entropy of the class label within each group and taking the weighted sum gives $Info_{age}(D)$:
Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(0,4) + \frac{5}{14} I(3,2) = 0.694
   Therefore, the information gain of age is:

Gain(age) = Info(D) - Info_{age}(D) = 0.940 - 0.694 = 0.246 \text{ bit}
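A short sketch reproducing the two numbers above, written with the same I(·,·) notation; the per-group (yes, no) counts are read off the table.

from math import log2

def I(yes, no):
    """Entropy of a node containing `yes` positive and `no` negative records."""
    n = yes + no
    return -sum(c / n * log2(c / n) for c in (yes, no) if c > 0)

info_D = I(9, 5)                                             # ≈ 0.940
info_age = 5/14 * I(2, 3) + 4/14 * I(0, 4) + 5/14 * I(3, 2)  # ≈ 0.694
gain_age = info_D - info_age
print(round(info_age, 3), round(gain_age, 3))
# 0.694 0.247  (0.940 - 0.694 = 0.246 when the rounded intermediate values are subtracted)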

   The core of ID3 is to apply the information gain criterion for feature selection at every node of the tree. Starting from the root node, compute the information gain of every candidate feature, choose the feature with the largest gain as the splitting feature of that node, and create one child node per value of that feature; then apply the same procedure recursively to the children, until the information gain of every remaining feature is very small or no features are left.

   Using the same procedure, the information gain of the other features can be computed:

[Figures: information gain calculations for the remaining features]

   Age has the largest information gain, 0.246 bit, so it is chosen as the root split. Within each resulting subset, the split is again chosen by information gain, and the decision tree is built recursively on each partition, as shown in the figure below.

[Figure: decision tree built on the computer-purchase data, with age at the root]
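For completeness, a compact ID3-style sketch, assuming every attribute is categorical, each record is a dict, and target names the class-label key; it returns the tree as a nested dict. A call such as id3(rows, ['age', 'income', 'hobby', 'credit'], 'buy'), with rows built from the table above, would put age at the root, as in the figure.

from math import log2
from collections import Counter

def entropy_of(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    base = entropy_of([r[target] for r in rows])
    remainder = 0.0
    for v in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == v]
        remainder += len(subset) / len(rows) * entropy_of(subset)
    return base - remainder

def id3(rows, attrs, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1 or not attrs:            # pure node, or no attributes left
        return Counter(labels).most_common(1)[0][0]   # leaf: (majority) class label
    best = max(attrs, key=lambda a: info_gain(rows, a, target))
    tree = {best: {}}
    for v in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == v]
        tree[best][v] = id3(subset, [a for a in attrs if a != best], target)
    return tree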

2.3.2 Other node impurity measures

   As we have seen, ID3 selects features by information gain, preferring features with large gain. ID3's entropy model from information theory involves many logarithm computations. Can the model be simplified without losing the advantages of the entropy model? To address this, the CART classification tree uses the Gini index instead of information gain (or gain ratio). For a classification problem, the Gini index is:
GINI(t) = 1 - \sum_{j}[p(j \mid t)]^2
   where p(j | t) is the relative frequency of class j at node t. When the classes are evenly distributed, the Gini index reaches its maximum value $(1 - 1/n_c)$, where $n_c$ is the number of classes; conversely, when all records belong to one class, it reaches its minimum value 0, corresponding to the highest purity.

[Figure: example nodes with different class counts and their Gini values]

As pictured above, when node t contains 0 records of class $C_1$ and 6 records of class $C_2$,

P(C_1) = 0/6 = 0, \quad P(C_2) = 6/6 = 1, \quad GINI = 1 - 0^2 - 1^2 = 0

and when node t contains 3 records of class $C_1$ and 3 records of class $C_2$,

P(C_1) = 3/6, \quad P(C_2) = 3/6, \quad GINI = 1 - (3/6)^2 - (3/6)^2 = 0.5

   Besides entropy and the Gini index, the classification error is another common measure of node impurity. It is defined as:

Error(t) = 1 - \max_{j}\, p(j \mid t)

  • When the class distribution is even, the classification error reaches its maximum value $(1 - 1/n_c)$
  • Conversely, when all records belong to a single class, it reaches its minimum value 0, the highest purity

[Figures: classification error computed for example nodes]
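Both measures can be written as short functions over the class counts; this sketch reproduces the example nodes above.

def gini(counts):
    """Gini index of a node given its class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def classification_error(counts):
    """Classification error of a node given its class counts."""
    total = sum(counts)
    return 1 - max(counts) / total

print(gini([0, 6]), classification_error([0, 6]))   # 0.0 0.0  (pure node)
print(gini([3, 3]), classification_error([3, 3]))   # 0.5 0.5  (maximum impurity for two classes)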

2.3.3 Comparison between purity measures

Binary classification problem :

[Figure: comparison of the impurity measures for a two-class problem as the class proportion varies]

2.4 C4.5 Algorithm

  C4.5 is another classical algorithm for generating decision trees; it is an extension and optimization of ID3. C4.5 improves on ID3 in the following ways:

  • It selects the splitting attribute by the information gain ratio, overcoming ID3's tendency (via information gain) to favor attributes with many values
  • It can handle both discrete and continuous attributes, i.e. continuous attributes can be discretized
  • It prunes the tree after construction
  • It can handle training data with missing attribute values

   For discrete features, C4.5 does not use information gain directly; instead it selects the optimal split with the gain ratio, defined as:

GainRatio(T) = \frac{Gain(T)}{SplitINFO(T)}

where

SplitINFO(T) = -\sum_{i=1}^{k}\frac{|S_i|}{|S|}\log_{2}\frac{|S_i|}{|S|}

is called the intrinsic value of the splitting attribute T; here $S_1, \dots, S_k$ are the subsets produced when S is split on T. The more distinct values the attribute takes, the larger SplitINFO tends to be.

   Note that the gain ratio criterion tends to prefer attributes with few values, so C4.5 does not simply pick the attribute with the largest gain ratio. Instead, it first selects the candidate attributes whose information gain is above average and then, among these, chooses the one with the highest gain ratio.

   For the computer-purchase data above, the split information of the income attribute is:
SplitINFO(income) = -\frac{4}{14}\log_{2}\frac{4}{14} - \frac{6}{14}\log_{2}\frac{6}{14} - \frac{4}{14}\log_{2}\frac{4}{14} = 1.557
[Figure: gain-ratio calculation for the income attribute]
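A small check of the value above; the counts 4, 6, and 4 are the numbers of high-, medium-, and low-income records in the table. The gain ratio itself divides the information gain of income (computed as in the ID3 example) by this value.

from math import log2

def split_info(counts):
    """Split information (intrinsic value) of a candidate splitting attribute."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

income_counts = [4, 6, 4]                    # high / medium / low income records in the table
print(round(split_info(income_counts), 3))   # 1.557
# GainRatio(income) = Gain(income) / SplitINFO(income)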

2.5 CART Algorithm

   The classification and regression tree (CART) model was proposed by Breiman et al. in 1984 and is a widely used decision tree learning method. CART likewise consists of feature selection, tree generation, and pruning, and it can be used for both classification and regression.

  CART learns the conditional probability distribution of the output random variable Y given the input random variable X. It assumes the decision tree is binary: each internal node tests a condition whose answers are "yes" and "no", with the left branch taken for "yes" and the right branch for "no". Such a tree recursively bisects the input (feature) space into a finite number of cells and defines a predictive probability distribution on them, i.e. the conditional distribution of the output given the input. The CART algorithm consists of the following steps:

  • Decision tree generation: grow a decision tree from the training data set, making the tree as large as possible.
  • Decision tree pruning: prune the generated tree using a validation data set and select the optimal subtree, taking the minimum loss function as the pruning criterion.

   As shown in the table and figure below, a riding-mower manufacturer wants a way to classify households in a city into those likely to buy a riding mower and those that are not. A random sample of 24 households (owners and non-owners) was taken from the city; part of the sample is reproduced below. The independent variables are income ($x_1$) and lawn area ($x_2$), and the classes are owner and non-owner.

id   Income   Lawn area   Owner
1    60.0     18.4        yes
2    85.5     16.8        yes
3    64.8     21.6        yes
4    61.5     20.8        yes
5    87.0     23.6        yes
6    110.1    19.2        yes
7    108.0    17.6        yes
17   84.0     17.6        no
18   49.2     17.6        no
19   59.4     16.0        no
20   66.0     18.4        no
21   47.4     16.4        no
22   33.0     18.8        no
23   51.0     14.0        no
24   63.0     14.8        no

[Figure: scatter plot of the households by income (x1) and lawn area (x2), owners versus non-owners]

   As shown in the figure below, the CART algorithm first splits on $x_2 = 19$. The plot shows that the two resulting rectangles are each more homogeneous, i.e. points of the same class are clustered together.

[Figure: the first split at x2 = 19 divides the plane into two more homogeneous rectangles]

   By recursively bisecting the feature space, the partition after several splits and the resulting decision tree are shown below. Observe that every rectangle is homogeneous, containing points of a single class, and that the algorithm always splits a node into exactly two children. A sketch of how the first split can be found follows the figures.

[Figures: the feature space after repeated binary splits and the corresponding decision tree]
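A sketch of how the first CART split can be found with a greedy search over candidate thresholds, minimizing the weighted Gini index of the two resulting groups. The rows are the 15 households reproduced in the table above; because this is only part of the 24-household sample, the split chosen on these rows may differ from the x2 = 19 split shown in the figure.

# (income, lawn_area, owner) rows from the table above
rows = [
    (60.0, 18.4, 1), (85.5, 16.8, 1), (64.8, 21.6, 1), (61.5, 20.8, 1),
    (87.0, 23.6, 1), (110.1, 19.2, 1), (108.0, 17.6, 1),
    (84.0, 17.6, 0), (49.2, 17.6, 0), (59.4, 16.0, 0), (66.0, 18.4, 0),
    (47.4, 16.4, 0), (33.0, 18.8, 0), (51.0, 14.0, 0), (63.0, 14.8, 0),
]

def gini(labels):
    if not labels:
        return 0.0
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows):
    best = (None, None, float("inf"))            # (feature index, threshold, weighted Gini)
    for j in (0, 1):                             # 0 = income, 1 = lawn area
        values = sorted({r[j] for r in rows})
        for a, b in zip(values, values[1:]):
            v = (a + b) / 2                      # candidate threshold between adjacent values
            left = [r[2] for r in rows if r[j] < v]
            right = [r[2] for r in rows if r[j] >= v]
            w = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if w < best[2]:
                best = (j, v, w)
    return best

print(best_split(rows))   # best first split on these 15 rows: (feature, threshold, weighted Gini)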

  Like ID3, CART can keep splitting into ever smaller segments and thus overfit; to counter this, an overgrown tree should be pruned.

   The purpose of pruning is to avoid overfitting. Because a decision tree learner tries to classify the training samples as correctly as possible, it keeps splitting nodes, which produces too many branches and leads to overfitting. There are two pruning strategies for decision trees: pre-pruning and post-pruning.

  • Pre-pruning: while the tree is being constructed, each node is evaluated before it is split; if splitting the current node would not improve the generalization ability of the model, the node is not split and is marked as a leaf.
  • Post-pruning: the complete tree is built first, and the non-leaf nodes are then examined bottom-up; if replacing a node's subtree with a leaf would improve generalization, the subtree is replaced by a leaf (a scikit-learn sketch follows below).
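As a concrete illustration of post-pruning, here is a sketch using scikit-learn's minimal cost-complexity pruning (the ccp_alpha parameter of DecisionTreeClassifier); it reuses the iris data from Section 3 and simply keeps the alpha with the best held-out accuracy. The split sizes and random_state values are arbitrary choices.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate alphas come from the cost-complexity pruning path of the full tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)   # guard against tiny negative values from floating point
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = clf.score(X_val, y_val)  # accuracy of the pruned tree on the held-out set
    if score > best_score:
        best_alpha, best_score = alpha, score

print(best_alpha, best_score)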

3. Classifying the Iris data set with a decision tree

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn import tree
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.20, random_state=30, shuffle=True)

# criterion defaults to 'gini'; use 'entropy' (information gain) instead
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(X_train, y_train)

# feature_names / class_names set the labels shown in the plotted tree
plt.figure(dpi=150)
tree.plot_tree(clf, feature_names=iris.feature_names, class_names=iris.target_names)
plt.show()

[Figure: the decision tree learned from the iris training set, plotted with tree.plot_tree]

# Predict the classes of two new samples and evaluate on the test set
print('Predicted class of [6, 5, 5, 2]:', clf.predict([[6, 5, 5, 2]]))
print('Predicted class of [5.1, 3.5, 1.4, 0.2]:', clf.predict([[5.1, 3.5, 1.4, 0.2]]))
print('Test-set labels:\n', y_test)
y_pre = clf.predict(X_test)
print('Predicted test-set labels:\n', y_pre)
print('Model accuracy:', clf.score(X_test, y_test))

[Output: predicted classes, test-set labels, and model accuracy]
