Decision Tree (Hunt, ID3, C4.5, CART)
2022-06-11 04:33:00 【Don't wait for brother shy to develop】
1、 Decision tree
Decision trees are one of the top ten classical data mining algorithms. A decision tree is a tree structure similar to a flow chart that encodes if...then... rules; it can be used to predict numerical dependent variables as well as to classify discrete dependent variables. The algorithm is simple and intuitive, easy to interpret, and in practice it offers a speed advantage that many other algorithms cannot match.
Decision tree methods are widely used for classification, prediction, and rule extraction. In the late 1970s and early 1980s, the machine learning researcher J. Ross Quinlan developed the decision tree algorithm known as the Iterative Dichotomiser (ID3), which greatly advanced decision trees in the field of machine learning. Quinlan later proposed C4.5, the successor to ID3, which became a performance benchmark for new supervised learning algorithms. In 1984, a group of statisticians proposed the CART (Classification And Regression Trees) algorithm.
A classification decision tree model is a tree structure that describes how instances are classified. A decision tree consists of nodes (node) and directed edges (directed edge). There are two types of nodes: internal nodes and leaf nodes. An internal node represents a feature or attribute, and a leaf node represents a class.
Decision tree classification process: **starting from the root node, test one feature of the instance and, according to the test result, send the instance to the corresponding child node; each child node corresponds to one value of that feature. Test and route the instance recursively until a leaf node is reached.** The instance is then assigned to the class of that leaf node. The figure below is a schematic diagram of a decision tree, in which circles and triangles represent internal nodes and leaf nodes, respectively.

Characteristics of decision trees:
- The tree structure can classify data well;
- Each path from the root node to a leaf node of the decision tree forms a rule;
- The rules are mutually exclusive and exhaustive, i.e. every sample is covered by exactly one path;
- Given a sufficiently large amount of data, a decision tree can be constructed by mining patterns from the data.
2、 Decision tree algorithm
2.1 Hunt Algorithm
Hunt's algorithm builds a decision tree with a locally optimal (greedy) strategy, and it is the basis of many decision tree algorithms, including ID3, C4.5, and CART.
In Hunt's algorithm, the decision tree is built recursively by successively partitioning the training records into purer subsets.
Let $D_t$ be the set of training records associated with node $t$. The algorithm proceeds as follows (a minimal sketch is given after the steps):
- If all records in $D_t$ belong to the same class $y_t$, then $t$ is a leaf node labeled with $y_t$.
- If $D_t$ contains records belonging to more than one class, select an attribute test condition to partition the records into smaller subsets. Create a child node for each outcome of the test condition, distribute the records of $D_t$ to the child nodes according to the test results, and then apply the algorithm recursively to each child node.
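The following is a minimal, non-optimized Python sketch of the recursive skeleton described above. It assumes categorical attributes and falls back to a majority-class leaf when no attribute is left; the attribute-selection step (`choose_attribute`) is deliberately left as a trivial placeholder, because Hunt's algorithm only requires that *some* test condition be chosen at each step (ID3, C4.5, and CART differ precisely in how they make this choice).

```python
from collections import Counter

def hunt(records, attributes):
    """Recursive skeleton of Hunt's algorithm.
    `records` is a list of (feature_dict, label) pairs, `attributes` a list of attribute names."""
    labels = [label for _, label in records]
    # Case 1: all records belong to the same class -> leaf node labeled with that class
    if len(set(labels)) == 1:
        return {"label": labels[0]}
    # Fallback: no attribute left to test -> leaf labeled with the majority class
    if not attributes:
        return {"label": Counter(labels).most_common(1)[0][0]}
    # Case 2: choose an attribute test condition and split the records into subsets
    attr = choose_attribute(records, attributes)
    children = {}
    for value in {feats[attr] for feats, _ in records}:
        subset = [(f, y) for f, y in records if f[attr] == value]
        remaining = [a for a in attributes if a != attr]
        children[value] = hunt(subset, remaining)      # recurse on each child node
    return {"feature": attr, "children": children}

def choose_attribute(records, attributes):
    # Simplest possible placeholder: take the first attribute.
    # Real algorithms plug a purity criterion in here (information gain, gain ratio, Gini).
    return attributes[0]
```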
We illustrate the algorithm with a credit card default example: the task is to predict whether a credit card user will repay on time or default (cheat). A training data set for this problem can be built from the credit card records of previous borrowers. In the table below, each record contains a borrower's personal information together with whether the borrower defaulted on the loan.
| Tid | Refund | Marital Status | Taxable Income | Cheat |
|---|---|---|---|---|
| 1 | Yes | Single | 125K | No |
| 2 | No | Married | 100K | No |
| 3 | No | Single | 70K | No |
| 4 | Yes | Married | 120K | No |
| 5 | No | Divorced | 95K | Yes |
| 6 | No | Married | 60K | No |
| 7 | Yes | Divorced | 220K | No |
| 8 | No | Single | 85K | Yes |
| 9 | No | Married | 75K | No |
| 10 | No | Single | 90K | Yes |
As shown in the table above, the initial decision tree for this classification problem has only one node, with class label "Cheat = No" (figure below).

This reflects the fact that most users do not commit credit card fraud. The tree needs further refinement, however, because the root node contains records of both classes, "Cheat = No" and "Cheat = Yes". The records are therefore divided into smaller subsets according to the "Refund" test condition, as shown in the figure below.

Next, Hunt's algorithm is called recursively on each child of the root node. From the training data set we can see that every user with Refund = "Yes" repaid on time, so the left child of the root node is a leaf node labeled "Cheat = No" (figure below).

For the right child, we keep calling Hunt's algorithm recursively until all records in a node belong to the same class. The decision tree produced by each recursive call is shown in the figures below.


2.2 Issues in building a decision tree
2.2.1 How to specify test conditions for different attribute types
Splitting based on nominal attributes
- Multi-way split: the number of partitions (outputs) equals the number of distinct values of the attribute

- Binary split: the number of partitions is 2; such a split must consider all $2^{k-1}-1$ ways of forming a binary partition of the $k$ attribute values



Splitting based on ordinal attributes
Multi-way split: the number of partitions (outputs) equals the number of distinct values of the attribute

Binary split: the number of partitions is 2; such a split must consider all $2^{k-1}-1$ ways of forming a binary partition of the $k$ attribute values

Splitting based on continuous attributes
Binary split: $(A < v)$ or $(A \ge v)$. All possible split points are considered and an optimal split point $v$ is chosen (a sketch of this search is given at the end of this subsection).

Multi-way split: $v_i \le A < v_{i+1}$, $i = 1, \dots, k$

2.2.2 How to choose the best split
As the splitting proceeds, we want the samples contained in each branch node of the decision tree to belong, as far as possible, to the same class; that is, we want the purity of the nodes to become higher and higher. The higher the purity (the more skewed the class distribution), the better the split.

As shown in the figure above, when class $C_0$ accounts for 90% of the records in a node, the purity of that node is high. To evaluate the effect of splitting on a given attribute, we compare the impurity before the split (the parent node) with the impurity after the split (weighted over all child nodes); the larger the reduction, the better the split. We therefore need to quantify impurity mathematically.
Decision trees commonly use three splitting criteria: information gain (ID3), information gain ratio (C4.5), and the Gini index (CART).
2.3 Information gain algorithm
2.3.1 Principle of the information gain algorithm
The table below shows some customer information collected by a computer store. It can be viewed as a data set in which age, income, hobby, and credit are the attributes and "whether the customer tends to buy a computer" is the class label. Our goal is to build a model that divides these customers into two classes, those inclined to buy a computer and those not; a decision tree can be used to solve this problem.
| id | Age | Income | Hobby | Credit | Buy |
|---|---|---|---|---|---|
| 1 | youth | high | no | fair | no |
| 2 | youth | high | no | excellent | no |
| 3 | middle-aged | high | no | fair | yes |
| 4 | senior | medium | no | fair | yes |
| 5 | senior | low | yes | fair | yes |
| 6 | senior | low | yes | excellent | no |
| 7 | middle-aged | low | yes | excellent | yes |
| 8 | youth | medium | no | fair | no |
| 9 | youth | low | yes | fair | yes |
| 10 | senior | medium | yes | fair | yes |
| 11 | youth | medium | yes | excellent | yes |
| 12 | middle-aged | medium | no | excellent | yes |
| 13 | middle-aged | high | yes | fair | yes |
| 14 | senior | medium | no | excellent | no |
In 1948, Shannon proposed the concept of "information entropy". The amount of information carried by a message is directly related to its uncertainty: resolving something highly uncertain requires a lot of information. Entropy is a measure of the uncertainty of a random variable; the larger the entropy, the greater the uncertainty.
Suppose a random variable $S$ has $n$ possible outcomes $S_i\,(i=1,2,\dots,n)$, and let $p_i$ denote the probability of the $i$-th outcome. The entropy of $S$ is defined as:
$$Entropy(S)=-\sum_{i=1}^{n}p_i\log_{2}{p_i}$$
The unit of entropy is the bit. When the entropy is at its maximum, the uncertainty of the variable is also at its maximum. Returning to the computer purchase example: among the outcomes of "buy a computer", the data set $D$ contains 9 "yes" and 5 "no" records, so the entropy of $D$ with respect to "buy a computer" is:

Given a feature $A$, the conditional entropy of $S$ given $A$, written $Entropy(S|A)$, is defined as:
$$Entropy(S|A)=\sum_{i=1}^{n}p_i\cdot Entropy(S|A=a_i)$$
where $p_i = P(A=a_i)$.
Information gain measures how much knowing a feature $X$ reduces the uncertainty of the class $Y$. If a feature has a large information gain, it has a large influence on the classification result. The information gain of feature $A$ on data set $D$ is expressed as:

where $S_v$ is the subset of $S$ in which attribute $A$ takes the value $v$.

Taking the computer purchase data set as an example, let us compute the information gain of age. Among the 14 records, 5 are youth, 4 are middle-aged, and 5 are senior. We compute the entropy of each of the three subsets and take their weighted sum to obtain $Info_{age}(D)$:
$$Info_{age}(D)=\frac{5}{14}I(2,3)+\frac{4}{14}I(0,4)+\frac{5}{14}I(3,2)=0.694$$
Therefore, the information gain of age, Gain(age), is:

The core of the ID3 algorithm is to apply the information gain criterion for feature selection at each node of the decision tree. Concretely: starting from the root node, compute the information gain of every candidate feature for that node, select the feature with the maximum information gain as the splitting feature of the node, and create child nodes according to the different values of that feature; then apply the same procedure recursively to the child nodes to build the decision tree, until the information gain of every feature is very small or no feature is left to choose.
Following the calculation above, the information gains of the other features can be obtained:


The information gain of age is the largest, 0.246 bits, so age is selected as the root node for the first split; then, within each subset, each further split is chosen according to the information gains of the remaining features, and the decision tree is formed recursively on each partition, as shown in the figure below.

2.3.2 Other measures of node purity
We have seen that ID3 uses information gain for feature selection, preferring features with large information gain. ID3 is based on the entropy model of information theory, which involves many logarithm computations. Can the model be simplified without losing the advantages of the entropy model? To this end, the CART classification tree algorithm uses the Gini index instead of the information gain (ratio). Specifically, for a classification problem, the Gini index is:
$$GINI(t)=1-\sum_{j}[p(j|t)]^2$$
where $p(j|t)$ is the probability of class $j$ at node $t$. When the class distribution is perfectly balanced, the Gini index reaches its maximum value $(1-1/n_c)$, where $n_c$ is the number of classes; conversely, when the node contains only one class, the Gini index reaches its minimum value 0, which corresponds to the highest purity.

As shown above, when node $t$ contains 0 records of class $C_1$ and 6 records of class $C_2$,


When node $t$ contains 3 records of class $C_1$ and 3 records of class $C_2$,

In addition, the classification error (Classification Error) is another common measure of node purity. It is defined as:

- When the class distribution is balanced, the classification error reaches its maximum value $(1 - 1/n_c)$
- Conversely, when the node contains only one class, the classification error reaches its minimum value 0, which corresponds to the highest purity


2.3.3 Comparison of the purity measures
For a binary classification problem:

2.4 C4.5 Algorithm
C4.5 is another classical decision tree algorithm; it is an extension and optimization of ID3. C4.5 improves on ID3 in the following ways:
- It selects split attributes by the information gain ratio, which overcomes ID3's tendency (when using information gain) to favor attributes with many distinct values as split attributes
- It can handle both discrete and continuous attribute types, i.e. continuous attributes can be discretized
- It prunes the tree after construction
- It can handle training data with missing attribute values
For discrete features, C4.5 does not use information gain directly; instead, it uses the gain ratio to select the optimal splitting criterion. The gain ratio is defined as follows:

where

is called the intrinsic value (split information) of the splitting criterion $T$. The more values an attribute used as a splitting criterion can take, the larger its SplitINFO becomes.
Note that the gain ratio criterion may in turn prefer attributes with a small number of values, so C4.5 does not simply choose the attribute with the largest gain ratio as the splitting criterion; instead, it first identifies, among the candidate attributes, those whose information gain is above average, and then selects from these the one with the highest gain ratio.
For the computer purchase data above, the split information of the income attribute is:
$$SplitINFO=-\frac{4}{14}\log_{2}{\frac{4}{14}}-\frac{6}{14}\log_{2}{\frac{6}{14}}-\frac{4}{14}\log_{2}{\frac{4}{14}}=1.557$$
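A minimal sketch that reproduces this value and the resulting gain ratio for income, using the class counts per income value taken from the table above (high: 2 yes / 2 no, medium: 4 yes / 2 no, low: 3 yes / 1 no):

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# Class counts (yes, no) per income value, and for the whole data set
income_groups = {"high": [2, 2], "medium": [4, 2], "low": [3, 1]}
total_counts = [9, 5]
n = sum(total_counts)

# Information gain of income
info_income = sum(sum(g) / n * entropy(g) for g in income_groups.values())
gain_income = entropy(total_counts) - info_income

# Split information (intrinsic value): 4, 6, and 4 records per branch
split_info = -sum(sum(g) / n * log2(sum(g) / n) for g in income_groups.values())

print(round(split_info, 3))                  # ~1.557
print(round(gain_income / split_info, 3))    # gain ratio of income
```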
2.5 CART Algorithm
The Classification And Regression Tree (CART) model was proposed by Breiman et al. in 1984 and is a widely used decision tree learning method. Like the other algorithms, CART consists of feature selection, tree generation, and pruning, and it can be used for both classification and regression (a minimal regression sketch is given after the steps below).
CART learns the conditional probability distribution of the output random variable $Y$ given the input random variable $X$. CART assumes that the decision tree is a binary tree whose internal node tests take the values "yes" and "no": the left branch corresponds to "yes" and the right branch to "no". Such a tree recursively bisects each feature, partitioning the input (feature) space into a finite number of cells and defining a predicted probability distribution on each cell, i.e. the conditional distribution of the output given the input. The CART algorithm consists of the following steps:
- Decision tree generation: generate a decision tree from the training data set, growing the tree as large as possible.
- Decision tree pruning: prune the generated tree using a validation data set and select the optimal subtree, taking the minimum loss function as the pruning criterion.
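Since CART handles regression as well as classification, here is a minimal sketch (not part of the original example) using scikit-learn's DecisionTreeRegressor on a small synthetic one-dimensional data set; it only illustrates that the same recursive binary-splitting machinery fits a piecewise-constant regression function, where each leaf predicts the mean target value of its region.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data: a noisy sine curve
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 5, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

# CART regression tree: recursive binary splits, piecewise-constant predictions
reg = DecisionTreeRegressor(max_depth=3, random_state=0)
reg.fit(X, y)

print(reg.predict([[1.0], [4.0]]))
```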
As shown in the table and figure below, a lawn mower manufacturer wants a way to divide the households of a city into those willing to buy a riding mower and those unwilling to buy one. A random sample of 24 households (owners and non-owners) was taken. The predictor variables are household income ($x_1$) and lawn area ($x_2$), and the classes are "owner" and "non-owner".
| id | Income | Lawn area | Owner |
|---|---|---|---|
| 1 | 60 | 18.4 | yes |
| 2 | 85.5 | 16.8 | yes |
| 3 | 64.8 | 21.6 | yes |
| 4 | 61.5 | 20.8 | yes |
| 5 | 87 | 23.6 | yes |
| 6 | 110.1 | 19.2 | yes |
| 7 | 108 | 17.6 | yes |
| 17 | 84 | 17.6 | no |
| 18 | 49.2 | 17.6 | no |
| 19 | 59.4 | 16 | no |
| 20 | 66 | 18.4 | no |
| 21 | 47.4 | 16.4 | no |
| 22 | 33 | 18.8 | no |
| 23 | 51 | 14 | no |
| 24 | 63 | 14.8 | no |

As shown in the figure below, when the CART algorithm is applied, the first split is made at $x_2 = 19$. From the figure we can see directly that the two resulting rectangular regions are more homogeneous (i.e. points of the same class are more clustered together).

By recursively bisecting each feature, after several splits we obtain the partition diagram shown below, which corresponds to the decision tree that follows. Observe that every rectangle is now homogeneous, containing points of only one class. At each step the algorithm splits a node into two child nodes.


Like ID3, CART tends to keep splitting into ever smaller segments, i.e. to overfit; to solve this problem, the fully grown tree must be pruned.
The purpose of pruning is to prevent the decision tree model from overfitting. Because the learning process always tries to classify the training samples as correctly as possible, the algorithm keeps splitting nodes, producing too many branches and thus overfitting. There are two pruning strategies for decision trees: pre-pruning and post-pruning (a minimal post-pruning sketch follows the list below).
- Pre-pruning: during the construction of the decision tree, each node is evaluated before it is split; if splitting the current node cannot improve the generalization ability of the model, the node is not split and is marked as a leaf node.
- Post-pruning: the complete decision tree is constructed first, and then the non-leaf nodes are examined from the bottom up; if replacing the subtree rooted at a node with a leaf node improves generalization ability, the node is replaced with a leaf node.
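As a concrete illustration, scikit-learn supports both strategies: pre-pruning through parameters such as max_depth or min_samples_leaf, and post-pruning through minimal cost-complexity pruning (the ccp_alpha parameter). The sketch below, on the Iris data used in the next section, computes the pruning path of a fully grown tree and then picks a ccp_alpha value; for brevity, the choice of alpha is made on the held-out split directly rather than by proper cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=30)

# Post-pruning: compute the cost-complexity pruning path of the full tree,
# then refit with each candidate ccp_alpha and keep the best-scoring tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    if score >= best_score:                    # ties prefer the larger alpha (simpler tree)
        best_alpha, best_score = alpha, score

print("chosen ccp_alpha:", best_alpha, "accuracy:", best_score)
```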
3、 Classifying the Iris data set with a decision tree
```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn import tree
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.20, random_state=30, shuffle=True)

# criterion defaults to 'gini'; here we use entropy (information gain)
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(X_train, y_train)

plt.figure(dpi=150)
# feature_names / class_names set the labels shown in the plotted tree
tree.plot_tree(clf, feature_names=iris.feature_names, class_names=iris.target_names)
plt.show()
```


```python
# Predict the class of two new samples
print('Class of [6, 5, 5, 2]:', clf.predict([[6, 5, 5, 2]]))
print('Class of [5.1, 3.5, 1.4, 0.2]:', clf.predict([[5.1, 3.5, 1.4, 0.2]]))
print('Test set labels:\n', y_test)
y_pre = clf.predict(X_test)
print('Predicted test set labels:\n', y_pre)
print('Model accuracy:', clf.score(X_test, y_test))
```
