
Decision tree (Hunt, ID3, C4.5, CART)


1. Decision tree

   The decision tree is one of the ten classical data mining algorithms. It is a flowchart-like tree structure that encodes if…then… rules and can be used both to predict numerical dependent variables and to classify discrete dependent variables. Decision tree algorithms are simple, intuitive, and easy to interpret, and in practice they offer a speed advantage that few other algorithms can match.

   Decision tree methods are widely used for classification, prediction, and rule extraction. In the late 1970s and early 1980s, the machine learning researcher J. Ross Quinlan developed a decision tree algorithm called the Iterative Dichotomiser (ID3), which greatly advanced decision trees in the machine learning field. Quinlan later proposed C4.5, a successor to ID3 that became a performance benchmark for new supervised learning algorithms. In 1984, several statisticians proposed the CART classification algorithm.

   A classification decision tree model is a tree structure that describes how instances are classified. A decision tree consists of nodes and directed edges. There are two types of nodes: internal nodes and leaf nodes. An internal node represents a feature (attribute), and a leaf node represents a class.

   Decision tree classification process: **starting from the root node, test one feature of the instance and, according to the result, send the instance to the corresponding child node; each child node corresponds to one value of that feature. Test and route the instance recursively until a leaf node is reached.** The instance is then assigned to the class of that leaf. The figure below is a schematic decision tree, where circles denote internal nodes and triangles denote leaf nodes.

[Figure: schematic decision tree; circles denote internal nodes and triangles denote leaf nodes]

Characteristics of decision trees:

  • A tree structure that classifies data effectively;
  • Each path from the root node to a leaf node forms a rule;
  • The rules are mutually exclusive and exhaustive, i.e. every sample is covered by exactly one path;
  • Given a sufficiently large amount of data, a decision tree can be constructed by mining patterns in the data.

2. Decision tree algorithms

2.1 Hunt Algorithm

  The Hunt algorithm builds a decision tree using a locally optimal (greedy) strategy and is the basis of many decision tree algorithms, including ID3, C4.5, and CART.

   The Hunt algorithm builds the decision tree recursively by successively partitioning the training records into purer and purer subsets.

   Let $D_t$ denote the set of training records associated with node t. The algorithm works as follows:

  • If all records in $D_t$ belong to the same class $y_t$, then t is a leaf node labeled $y_t$.
  • If $D_t$ contains records of more than one class, an attribute test condition is selected to partition the records into smaller subsets. A child node is created for each outcome of the test condition, and the records in $D_t$ are distributed to the children according to the outcomes; the algorithm is then applied recursively to each child node (a minimal code sketch follows this list).
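A minimal sketch of the Hunt recursion, assuming each record is a pair (attributes_dict, label) and that choose_test is a hypothetical helper supplied by the caller, which picks an attribute test and returns the test function together with its possible outcomes; it is illustrative rather than a full implementation.

# A minimal sketch of the Hunt recursion. `records` is a list of
# (attributes_dict, label) pairs; `choose_test(records)` is assumed to return
# (test_fn, outcomes), where test_fn maps an attributes_dict to one outcome.
def hunt(records, choose_test):
    labels = {label for _, label in records}
    if len(labels) == 1:                          # D_t is pure: make a leaf
        return {"leaf": labels.pop()}
    test_fn, outcomes = choose_test(records)      # pick an attribute test condition
    node = {"test": test_fn, "children": {}}
    for outcome in outcomes:                      # one child per test outcome
        subset = [r for r in records if test_fn(r[0]) == outcome]
        if not subset:                            # empty child: label with the majority class
            majority = max(labels, key=lambda c: sum(l == c for _, l in records))
            node["children"][outcome] = {"leaf": majority}
        else:
            node["children"][outcome] = hunt(subset, choose_test)
    return node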

   We illustrate the algorithm with a credit-card default example: we want to predict whether a cardholder will repay on time or default (cheat). A training data set can be built from the records of previous borrowers. In the table below, each record contains a borrower's personal information and whether the borrower defaulted.

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

   As shown above, the initial decision tree for this problem has a single node whose class label is "Cheat = No" (see the figure below).

[Figure: initial single-node tree labeled "Cheat = No"]

   This reflects the fact that most cardholders do not default. The tree must be refined, however, because the root node contains records of both classes, "Cheat = No" and "Cheat = Yes". Using the test condition on "Refund", the records are partitioned into smaller subsets, as shown in the figure below.

[Figure: the records split into subsets by the Refund test condition]

   Next, the Hunt algorithm is called recursively on each child of the root node. The training data show that every borrower with Refund = Yes repaid on time, so the left child of the root becomes a leaf labeled "Cheat = No" (see the figure below).

[Figure: tree with the left child (Refund = Yes) as a leaf labeled "Cheat = No"]

   For the right child, the Hunt algorithm must be called recursively until every subset contains records of a single class. The trees produced by the successive recursive calls are shown below.

[Figures: decision trees produced by the successive recursive calls of the Hunt algorithm]

2.2 Issues in building a decision tree

2.2.1 How to specify test conditions for different types of attributes

Splitting based on nominal attributes

  • Multiway split: the number of partitions (outputs) equals the number of distinct values of the attribute

[Figure: multiway split on a nominal attribute]

  • Binary split: the number of partitions is 2; for an attribute with k values there are $2^{k-1}-1$ ways to group the values into two non-empty subsets (see the enumeration sketch after the figures below)

[Figures: the possible binary groupings of a nominal attribute's values]
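A small sketch that enumerates all binary groupings of a k-valued nominal attribute and confirms the $2^{k-1}-1$ count; binary_partitions is just an illustrative helper, applied here to the Marital Status values from the table above.

from itertools import combinations

def binary_partitions(values):
    """All ways to split a set of nominal attribute values into two non-empty groups."""
    values = list(values)
    rest = values[1:]
    parts = []
    # Fix the first value on the left side to avoid counting mirror-image splits twice.
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = {values[0], *combo}
            right = set(values) - left
            if right:                       # both sides must be non-empty
                parts.append((left, right))
    return parts

splits = binary_partitions(["Single", "Married", "Divorced"])
print(len(splits))   # 3 == 2**(3-1) - 1
for left, right in splits:
    print(left, "|", right)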

Splitting based on ordinal attributes

  • Multiway split: the number of partitions (outputs) equals the number of distinct values of the attribute

[Figure: multiway split on an ordinal attribute]

  • Binary split: the number of partitions is 2; the attribute values may be grouped into two subsets, provided the grouping does not violate their order

[Figure: binary grouping of ordinal attribute values]

Splitting based on continuous attributes

  • Binary split: $(A < v)$ or $(A \ge v)$. All possible split points are considered and an optimal split point v is chosen (a small sketch of generating candidate thresholds follows the figures below)

[Figure: binary split on a continuous attribute, A < v versus A ≥ v]

  • Multiway split: the range of A is discretized into intervals $v_i \le A < v_{i+1}$, $i = 1, \dots, k$

[Figure: multiway split on a continuous attribute using value ranges]
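A brief sketch of how candidate thresholds for a binary split on a continuous attribute can be generated: take the midpoints between consecutive distinct sorted values. The income values used here are the Taxable Income column (in K) from the credit-card table above.

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]  # Taxable Income (K) from the table above

distinct = sorted(set(incomes))
# Candidate thresholds v for a binary split (A < v) vs (A >= v):
# the midpoints between consecutive distinct values.
candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
print(candidates)   # [65.0, 72.5, 80.0, 87.5, 92.5, 97.5, 110.0, 122.5, 172.5]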

2.2.2 How to choose the best split

   As the splitting proceeds, we want the samples in each branch node to belong to the same class as far as possible; in other words, we want node purity to keep increasing. The higher the purity (the more skewed the class distribution), the better the split.

[Figure: example nodes with skewed (high-purity) and uniform (low-purity) class distributions]

   As shown in the figure above, when one class, say $C_0$, accounts for 90% of the records in a node, the node has high purity. To evaluate a split on a given attribute, we therefore compare the impurity before the split (the parent node) with the impurity after the split (all the child nodes); the larger the reduction, the better the split. This requires a mathematical measure of impurity.

   Decision trees commonly use three criteria to choose a split: information gain (ID3), information gain ratio (C4.5), and the Gini index (CART).

2.3 Information gain algorithm

2.3.1 Principle of information gain algorithm

   The table below contains customer information from a computer store. It can be treated as a data set whose attributes are age, income, hobby, and credit rating, and whose class label is whether the customer tends to buy a computer. Our goal is to build a model that classifies customers into two categories, those likely to buy a computer and those not, which is exactly the kind of problem a decision tree can solve.

id   Age           Income   Hobby   Credit      Buys computer
1    youth         high     no      fair        no
2    youth         high     no      excellent   no
3    middle-aged   high     no      fair        yes
4    senior        medium   no      fair        yes
5    senior        low      yes     fair        yes
6    senior        low      yes     excellent   no
7    middle-aged   low      yes     excellent   yes
8    youth         medium   no      fair        no
9    youth         low      yes     fair        yes
10   senior        medium   yes     fair        yes
11   youth         medium   yes     excellent   yes
12   middle-aged   medium   no      excellent   yes
13   middle-aged   high     yes     fair        yes
14   senior        medium   no      excellent   no

   In 1948, Shannon introduced the concept of information entropy. The amount of information a message carries is directly related to its uncertainty: resolving something highly uncertain requires a great deal of information. Entropy measures the uncertainty of a random variable; the larger the entropy, the greater the uncertainty.

   Suppose the random variable S has n possible outcomes $S_i$ ($i = 1, 2, \dots, n$), and let $p_i$ denote the probability of the i-th outcome. The entropy of S is defined as:
Entropy(S) = -\sum_{i=1}^{n} p_i \log_{2} p_i
   The unit of entropy is the bit; when entropy is maximal, the uncertainty of the variable is maximal. Returning to the computer-purchase example, the data set D contains 9 "yes" and 5 "no" records for the label "buys computer", so its entropy is:

Info(D) = -\frac{9}{14}\log_{2}\frac{9}{14} - \frac{5}{14}\log_{2}\frac{5}{14} \approx 0.940 \text{ bit}
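The same value can be checked with a couple of lines of Python (a small sketch; the counts 9 "yes" and 5 "no" are read off the table above).

from math import log2

# Entropy of the "buys computer" label: 9 "yes" and 5 "no" records
p_yes, p_no = 9 / 14, 5 / 14
info_D = -(p_yes * log2(p_yes) + p_no * log2(p_no))
print(round(info_D, 3))   # 0.94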

   The conditional entropy of a random variable A given S, written Entropy(A|S), is defined as:
Entropy(A \mid S) = \sum_{i=1}^{n} p_i \cdot Entropy(A \mid S = s_i)
   Information gain measures how much knowing a feature reduces the uncertainty about the class Y. A feature with a large information gain has a strong influence on the result. The information gain of feature A on data set S is expressed as:

Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)

   where $S_v$ is the subset of S in which attribute A takes the value v.

[Figure: the computer-purchase records grouped by age, with class counts per group]

   Take the computer-purchase data set and compute the information gain of age. Of the 14 records, 5 are youths, 4 are middle-aged, and 5 are seniors. Computing the entropy of the class label within each group and taking the weighted sum gives $Info_{age}(D)$:
Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(0,4) + \frac{5}{14} I(3,2) = 0.694
   Therefore, the information gain of age is:

Gain(age) = Info(D) - Info_{age}(D) = 0.940 - 0.694 = 0.246 \text{ bit}
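A short sketch reproducing the two numbers above, written with the same I(·,·) notation; the per-group (yes, no) counts are read off the table.

from math import log2

def I(yes, no):
    """Entropy of a node containing `yes` positive and `no` negative records."""
    n = yes + no
    return -sum(c / n * log2(c / n) for c in (yes, no) if c > 0)

info_D = I(9, 5)                                             # ≈ 0.940
info_age = 5/14 * I(2, 3) + 4/14 * I(0, 4) + 5/14 * I(3, 2)  # ≈ 0.694
gain_age = info_D - info_age
print(round(info_age, 3), round(gain_age, 3))
# 0.694 0.247  (0.940 - 0.694 = 0.246 when the rounded intermediate values are subtracted)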

   The core of ID3 is to apply the information gain criterion for feature selection at every node of the tree. Starting from the root node, compute the information gain of every candidate feature, choose the feature with the largest gain as the splitting feature of that node, and create one child node per value of that feature; then apply the same procedure recursively to the children, until the information gain of every remaining feature is very small or no features are left.

   Using the same procedure, the information gain of the other features can be computed:

[Figures: information gain calculations for the remaining features]

   Age has the largest information gain, 0.246 bit, so it is chosen as the root split. Within each resulting subset, the split is again chosen by information gain, and the decision tree is built recursively on each partition, as shown in the figure below.

[Figure: decision tree built on the computer-purchase data, with age at the root]
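For completeness, a compact ID3-style sketch, assuming every attribute is categorical, each record is a dict, and target names the class-label key; it returns the tree as a nested dict. A call such as id3(rows, ['age', 'income', 'hobby', 'credit'], 'buy'), with rows built from the table above, would put age at the root, as in the figure.

from math import log2
from collections import Counter

def entropy_of(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    base = entropy_of([r[target] for r in rows])
    remainder = 0.0
    for v in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == v]
        remainder += len(subset) / len(rows) * entropy_of(subset)
    return base - remainder

def id3(rows, attrs, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1 or not attrs:            # pure node, or no attributes left
        return Counter(labels).most_common(1)[0][0]   # leaf: (majority) class label
    best = max(attrs, key=lambda a: info_gain(rows, a, target))
    tree = {best: {}}
    for v in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == v]
        tree[best][v] = id3(subset, [a for a in attrs if a != best], target)
    return tree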

2.3.2 Other node impurity measures

   As we have seen, ID3 selects features by information gain, preferring features with large gain. ID3's entropy model from information theory involves many logarithm computations. Can the model be simplified without losing the advantages of the entropy model? To address this, the CART classification tree uses the Gini index instead of information gain (or gain ratio). For a classification problem, the Gini index is:
GINI(t) = 1 - \sum_{j}[p(j \mid t)]^2
   where p(j | t) is the relative frequency of class j at node t. When the classes are evenly distributed, the Gini index reaches its maximum value $(1 - 1/n_c)$, where $n_c$ is the number of classes; conversely, when all records belong to one class, it reaches its minimum value 0, corresponding to the highest purity.

[Figure: example nodes with different class counts and their Gini values]

As pictured above, when node t contains 0 records of class $C_1$ and 6 records of class $C_2$,

P(C_1) = 0/6 = 0, \quad P(C_2) = 6/6 = 1, \quad GINI = 1 - 0^2 - 1^2 = 0

and when node t contains 3 records of class $C_1$ and 3 records of class $C_2$,

P(C_1) = 3/6, \quad P(C_2) = 3/6, \quad GINI = 1 - (3/6)^2 - (3/6)^2 = 0.5

   Besides entropy and the Gini index, the classification error is another common measure of node impurity. It is defined as:

Error(t) = 1 - \max_{j}\, p(j \mid t)

  • When the class distribution is even, the classification error reaches its maximum value $(1 - 1/n_c)$
  • Conversely, when all records belong to a single class, it reaches its minimum value 0, the highest purity

[Figures: classification error computed for example nodes]
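Both measures can be written as short functions over the class counts; this sketch reproduces the example nodes above.

def gini(counts):
    """Gini index of a node given its class counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def classification_error(counts):
    """Classification error of a node given its class counts."""
    total = sum(counts)
    return 1 - max(counts) / total

print(gini([0, 6]), classification_error([0, 6]))   # 0.0 0.0  (pure node)
print(gini([3, 3]), classification_error([3, 3]))   # 0.5 0.5  (maximum impurity for two classes)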

2.3.3 Comparison between purity measures

Binary classification problem :

[Figure: comparison of the impurity measures for a two-class problem as the class proportion varies]

2.4 C4.5 Algorithm

  C4.5 is another classical algorithm for generating decision trees; it is an extension and optimization of ID3. C4.5 improves on ID3 in the following ways:

  • It selects the splitting attribute by the information gain ratio, overcoming ID3's tendency (via information gain) to favor attributes with many values
  • It can handle both discrete and continuous attributes, i.e. continuous attributes can be discretized
  • It prunes the tree after construction
  • It can handle training data with missing attribute values

   For discrete features, C4.5 does not use information gain directly; instead it selects the optimal split with the gain ratio, defined as:

GainRatio(T) = \frac{Gain(T)}{SplitINFO(T)}

where

SplitINFO(T) = -\sum_{i=1}^{k}\frac{|S_i|}{|S|}\log_{2}\frac{|S_i|}{|S|}

is called the intrinsic value of the splitting attribute T; here $S_1, \dots, S_k$ are the subsets produced when S is split on T. The more distinct values the attribute takes, the larger SplitINFO tends to be.

   Note that the gain ratio criterion tends to prefer attributes with few values, so C4.5 does not simply pick the attribute with the largest gain ratio. Instead, it first selects the candidate attributes whose information gain is above average and then, among these, chooses the one with the highest gain ratio.

   For the computer-purchase data above, the split information of the income attribute is:
SplitINFO(income) = -\frac{4}{14}\log_{2}\frac{4}{14} - \frac{6}{14}\log_{2}\frac{6}{14} - \frac{4}{14}\log_{2}\frac{4}{14} = 1.557
[Figure: gain-ratio calculation for the income attribute]
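A small check of the value above; the counts 4, 6, and 4 are the numbers of high-, medium-, and low-income records in the table. The gain ratio itself divides the information gain of income (computed as in the ID3 example) by this value.

from math import log2

def split_info(counts):
    """Split information (intrinsic value) of a candidate splitting attribute."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

income_counts = [4, 6, 4]                    # high / medium / low income records in the table
print(round(split_info(income_counts), 3))   # 1.557
# GainRatio(income) = Gain(income) / SplitINFO(income)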

2.5 CART Algorithm

   The classification and regression tree (CART) model was proposed by Breiman et al. in 1984 and is a widely used decision tree learning method. CART likewise consists of feature selection, tree generation, and pruning, and it can be used for both classification and regression.

  CART learns the conditional probability distribution of the output random variable Y given the input random variable X. It assumes the decision tree is binary: each internal node tests a condition whose answers are "yes" and "no", with the left branch taken for "yes" and the right branch for "no". Such a tree recursively bisects the input (feature) space into a finite number of cells and defines a predictive probability distribution on them, i.e. the conditional distribution of the output given the input. The CART algorithm consists of the following steps:

  • Decision tree generation: grow a decision tree from the training data set, making the tree as large as possible.
  • Decision tree pruning: prune the generated tree using a validation data set and select the optimal subtree, taking the minimum loss function as the pruning criterion.

   As shown in the table and figure below, a riding-mower manufacturer wants a way to classify households in a city into those likely to buy a riding mower and those that are not. A random sample of 24 households (owners and non-owners) was taken from the city; part of the sample is reproduced below. The independent variables are income ($x_1$) and lawn area ($x_2$), and the classes are owner and non-owner.

id   Income   Lawn area   Owner
1    60.0     18.4        yes
2    85.5     16.8        yes
3    64.8     21.6        yes
4    61.5     20.8        yes
5    87.0     23.6        yes
6    110.1    19.2        yes
7    108.0    17.6        yes
17   84.0     17.6        no
18   49.2     17.6        no
19   59.4     16.0        no
20   66.0     18.4        no
21   47.4     16.4        no
22   33.0     18.8        no
23   51.0     14.0        no
24   63.0     14.8        no

[Figure: scatter plot of the households by income (x1) and lawn area (x2), owners versus non-owners]

   As shown in the figure below, the CART algorithm first splits on $x_2 = 19$. The plot shows that the two resulting rectangles are each more homogeneous, i.e. points of the same class are clustered together.

[Figure: the first split at x2 = 19 divides the plane into two more homogeneous rectangles]

   By recursively bisecting the feature space, the partition after several splits and the resulting decision tree are shown below. Observe that every rectangle is homogeneous, containing points of a single class, and that the algorithm always splits a node into exactly two children. A sketch of how the first split can be found follows the figures.

[Figures: the feature space after repeated binary splits and the corresponding decision tree]
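A sketch of how the first CART split can be found with a greedy search over candidate thresholds, minimizing the weighted Gini index of the two resulting groups. The rows are the 15 households reproduced in the table above; because this is only part of the 24-household sample, the split chosen on these rows may differ from the x2 = 19 split shown in the figure.

# (income, lawn_area, owner) rows from the table above
rows = [
    (60.0, 18.4, 1), (85.5, 16.8, 1), (64.8, 21.6, 1), (61.5, 20.8, 1),
    (87.0, 23.6, 1), (110.1, 19.2, 1), (108.0, 17.6, 1),
    (84.0, 17.6, 0), (49.2, 17.6, 0), (59.4, 16.0, 0), (66.0, 18.4, 0),
    (47.4, 16.4, 0), (33.0, 18.8, 0), (51.0, 14.0, 0), (63.0, 14.8, 0),
]

def gini(labels):
    if not labels:
        return 0.0
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows):
    best = (None, None, float("inf"))            # (feature index, threshold, weighted Gini)
    for j in (0, 1):                             # 0 = income, 1 = lawn area
        values = sorted({r[j] for r in rows})
        for a, b in zip(values, values[1:]):
            v = (a + b) / 2                      # candidate threshold between adjacent values
            left = [r[2] for r in rows if r[j] < v]
            right = [r[2] for r in rows if r[j] >= v]
            w = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if w < best[2]:
                best = (j, v, w)
    return best

print(best_split(rows))   # best first split on these 15 rows: (feature, threshold, weighted Gini)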

  Like ID3, CART can keep splitting into ever smaller segments and thus overfit; to counter this, an overgrown tree should be pruned.

   The purpose of pruning is to avoid overfitting. Because a decision tree learner tries to classify the training samples as correctly as possible, it keeps splitting nodes, which produces too many branches and leads to overfitting. There are two pruning strategies for decision trees: pre-pruning and post-pruning.

  • Pre-pruning: while the tree is being constructed, each node is evaluated before it is split; if splitting the current node would not improve the generalization ability of the model, the node is not split and is marked as a leaf.
  • Post-pruning: the complete tree is built first, and the non-leaf nodes are then examined bottom-up; if replacing a node's subtree with a leaf would improve generalization, the subtree is replaced by a leaf (a scikit-learn sketch follows below).
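As a concrete illustration of post-pruning, here is a sketch using scikit-learn's minimal cost-complexity pruning (the ccp_alpha parameter of DecisionTreeClassifier); it reuses the iris data from Section 3 and simply keeps the alpha with the best held-out accuracy. The split sizes and random_state values are arbitrary choices.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Candidate alphas come from the cost-complexity pruning path of the full tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)   # guard against tiny negative values from floating point
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = clf.score(X_val, y_val)  # accuracy of the pruned tree on the held-out set
    if score > best_score:
        best_alpha, best_score = alpha, score

print(best_alpha, best_score)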

3. Classifying the Iris data set with a decision tree

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn import tree
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.20, random_state=30, shuffle=True)

# criterion defaults to 'gini'; use 'entropy' (information gain) instead
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(X_train, y_train)

# feature_names / class_names set the labels shown in the plotted tree
plt.figure(dpi=150)
tree.plot_tree(clf, feature_names=iris.feature_names, class_names=iris.target_names)
plt.show()

[Figure: the decision tree learned from the iris training set, plotted with tree.plot_tree]

# Predict the classes of two new samples and evaluate on the test set
print('Predicted class of [6, 5, 5, 2]:', clf.predict([[6, 5, 5, 2]]))
print('Predicted class of [5.1, 3.5, 1.4, 0.2]:', clf.predict([[5.1, 3.5, 1.4, 0.2]]))
print('Test-set labels:\n', y_test)
y_pre = clf.predict(X_test)
print('Predicted test-set labels:\n', y_pre)
print('Model accuracy:', clf.score(X_test, y_test))

[Output: predicted classes, test-set labels, and model accuracy]
