Summary of Machine Learning + Pattern Recognition Learning (IV): Decision Trees
2022-06-12 07:28:00 【Nidiya is trying】
I. Definitions and Introduction
1. A classification decision tree model is a tree structure that describes how instances are classified; it is composed of nodes and directed edges. Nodes are divided into internal nodes (each representing a feature/attribute) and leaf nodes (each representing a class). [The overall structure of a tree includes: root node, non-leaf nodes, leaf nodes, and branches.]
2. If-then rules: each path from the root node to a leaf node constitutes a rule; the features of the internal nodes along the path correspond to the rule's conditions, and the class of the leaf node corresponds to the rule's conclusion. An important property of the if-then rule set is that it is mutually exclusive and complete: starting from the root node, every instance follows exactly one path to a specific leaf node, so no instance is left unclassified.
3. Objective of decision tree learning: build a decision tree model from the training data set that classifies the training data correctly and also generalizes well.
4. Essence of decision tree learning: induce classification rules from the training data set, i.e., find the rules that map features to labels.
5. Loss function of a decision tree: usually a regularized maximum likelihood function. In practice a heuristic method is used, i.e., a greedy, locally optimal search that selects the optimal feature under the current conditions as the splitting rule.
6. The two stages of a decision tree:
(1) Training phase: construct the decision tree from a given data set.
(2) Testing phase: start at the root node and follow the attribute tests of the decision tree downward, layer by layer, until a leaf node is reached; the leaf gives the classification result (see the sketch after this list).
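To make the testing phase concrete, here is a minimal sketch of that traversal. It assumes a hypothetical nested-dict representation of a trained tree (not the output of any particular library):

```python
# Hypothetical representation: an internal node is {feature: {value: subtree}},
# a leaf node is just the predicted class label.
tree = {
    "outlook": {
        "sunny": {"humidity": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rain": {"wind": {"strong": "no", "weak": "yes"}},
    }
}

def classify(node, sample):
    """Testing phase: start at the root, follow the branch matching the sample's
    value for the tested feature, and repeat until a leaf (class label) is hit."""
    while isinstance(node, dict):
        feature = next(iter(node))                 # feature tested at this node
        node = node[feature][sample[feature]]      # descend along the matching branch
    return node

print(classify(tree, {"outlook": "sunny", "humidity": "normal", "wind": "weak"}))  # -> "yes"
```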
7. Key concepts:
(1) Entropy: a measure of the uncertainty of a random variable; the larger the entropy, the greater the uncertainty. For a random variable $X$ with distribution $P(X = x_i) = p_i$, $i = 1, \dots, n$, the entropy is
$$H(X) = -\sum_{i=1}^{n} p_i \log p_i .$$
As the formula shows, the entropy of $X$ depends only on its distribution, i.e., only on the probabilities $p_i$, and not on the values that $X$ takes.
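A quick numerical check of this property (a minimal sketch; the distributions below are arbitrary examples):

```python
import numpy as np

def entropy(p):
    """H(X) = -sum_i p_i * log2(p_i) for a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # terms with p_i = 0 contribute nothing
    return -(p * np.log2(p)).sum()

# The more uniform the distribution, the larger the entropy (uncertainty);
# the values taken by X never enter the computation.
print(entropy([0.5, 0.5]))             # 1.0 bit: most uncertain for two outcomes
print(entropy([0.9, 0.1]))             # ~0.47 bits
print(entropy([1.0, 0.0]))             # 0.0 bits: no uncertainty at all
```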
(2) Conditional entropy: a measure of the uncertainty of a random variable $Y$ given a random variable $X$:
$$H(Y \mid X) = \sum_{i=1}^{n} p_i \, H(Y \mid X = x_i), \qquad p_i = P(X = x_i).$$
When the probabilities in the entropy and conditional entropy are estimated from data, the corresponding quantities are called the empirical entropy and empirical conditional entropy.
(3) Information gain: the degree to which learning feature $A$ reduces the uncertainty about the class of data set $D$:
$$g(D, A) = H(D) - H(D \mid A).$$
It can be understood as the amount by which the uncertainty of classifying the training data set $D$ decreases once feature $A$ is known; in other words, information gain is the amount by which entropy is reduced.
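A minimal sketch of all three quantities computed from empirical estimates; the labels and the single feature column below are hypothetical toy data:

```python
import numpy as np

def entropy(labels):
    """Empirical entropy H(D) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def conditional_entropy(feature, labels):
    """Empirical conditional entropy H(D|A): split D by the values of A and
    average the subset entropies, weighted by subset size."""
    values, counts = np.unique(feature, return_counts=True)
    return sum((c / len(labels)) * entropy(labels[feature == v])
               for v, c in zip(values, counts))

def information_gain(feature, labels):
    """g(D, A) = H(D) - H(D|A): reduction in class uncertainty due to knowing A."""
    return entropy(labels) - conditional_entropy(feature, labels)

y = np.array(["yes", "yes", "no", "no", "yes", "no"])
a = np.array(["sunny", "sunny", "rain", "rain", "overcast", "rain"])
print(entropy(y), conditional_entropy(a, y), information_gain(a, y))
```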
II. Building a Decision Tree (constructing a decision tree is the process of recursively selecting the optimal feature, which is also the process of partitioning the feature space)
The learning algorithm of a decision tree consists of three parts: feature selection, decision tree generation, and decision tree pruning. They are described below.
(I) Feature selection and decision tree generation: decide which feature to use to partition the feature space, then generate the decision tree based on that feature-selection criterion. Different generation algorithms use different feature-selection criteria; the common ones are information gain and information gain ratio.
1. ID3 algorithm: selects features by the information gain criterion. The principle is to choose the feature whose split reduces the entropy of the remaining data the most, i.e., leaves the remaining data sets as pure as possible (the three steps below are sketched in code after this list).
Step A: compute the empirical entropy of $D$, $H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}$, where $C_k$ is the set of samples belonging to class $k$.
Step B: compute the empirical conditional entropy of $D$ given feature $A$, $H(D \mid A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i)$, where $D_i$ is the subset of $D$ on which $A$ takes its $i$-th value.
Step C: compute the information gain as the empirical entropy minus the empirical conditional entropy, $g(D, A) = H(D) - H(D \mid A)$.
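Putting Steps A-C together, ID3's feature-selection step picks the candidate feature with the largest information gain. A minimal, self-contained sketch (the feature columns and labels are hypothetical toy data):

```python
import numpy as np

def entropy(labels):
    """Step A: empirical entropy H(D) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(feature, labels):
    """Steps B and C: g(D, A) = H(D) - H(D|A) for one candidate feature column."""
    values, counts = np.unique(feature, return_counts=True)
    h_cond = sum((c / len(labels)) * entropy(labels[feature == v])
                 for v, c in zip(values, counts))
    return entropy(labels) - h_cond

# ID3 splits on the feature whose subsets end up purest (largest gain).
X = {
    "outlook": np.array(["sunny", "sunny", "overcast", "rain", "rain", "rain"]),
    "windy":   np.array(["no", "yes", "no", "no", "yes", "yes"]),
}
y = np.array(["no", "no", "yes", "yes", "yes", "no"])

gains = {name: information_gain(col, y) for name, col in X.items()}
print(gains, "-> split on:", max(gains, key=gains.get))
```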
2. C4.5 algorithm: selects features by the information gain ratio criterion. If plain information gain were used to choose the splitting feature, features with many distinct values would be favoured: by the information gain principle, such a feature contributes heavily to the "disorder" of the data, so splitting on it sharply reduces the confusion of the data set and maximizes the information gain, leaving very pure subsets, even when the feature is not genuinely informative. The information gain ratio corrects this bias:
$$g_R(D, A) = \frac{g(D, A)}{H_A(D)}, \qquad H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|},$$
where $n$ is the number of distinct values of feature $A$.
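A minimal sketch of the gain ratio, contrasting an "ID-like" feature (a distinct value for every sample) with a genuinely useful one; both columns are hypothetical toy data:

```python
import numpy as np

def entropy_from_counts(counts):
    """Entropy of a discrete distribution given raw counts."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return -(p * np.log2(p)).sum()

def gain_ratio(feature, labels):
    """C4.5 criterion: g_R(D, A) = g(D, A) / H_A(D), where the split information
    H_A(D) penalizes features that take many distinct values."""
    h_d = entropy_from_counts(np.unique(labels, return_counts=True)[1])
    values, value_counts = np.unique(feature, return_counts=True)
    h_d_given_a = sum(
        (c / len(labels)) *
        entropy_from_counts(np.unique(labels[feature == v], return_counts=True)[1])
        for v, c in zip(values, value_counts))
    h_a = entropy_from_counts(value_counts)   # H_A(D)
    return (h_d - h_d_given_a) / h_a

y = np.array(["yes", "yes", "no", "no"])
id_like = np.array(["s1", "s2", "s3", "s4"])   # unique value per sample
useful = np.array(["a", "a", "b", "b"])
print(gain_ratio(id_like, y), gain_ratio(useful, y))   # the useful feature wins
```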
3. CART decision tree: selects features by the Gini index criterion for classification (and by the least-squares error criterion when used for regression). The decision tree is assumed to be binary, with each internal node's test taking the values "yes" and "no". For a class distribution $p = (p_1, \dots, p_K)$, the Gini impurity is
$$\mathrm{Gini}(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2,$$
and the Gini index of data set $D$ under the binary split induced by feature $A$ is
$$\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2).$$
Intuitively, $\mathrm{Gini}(p)$ is the probability that two samples drawn at random from $D$ have different class labels; therefore, the smaller $\mathrm{Gini}(p)$ is, the purer the data set $D$ (a sketch follows below).
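A minimal sketch of the Gini impurity and the Gini index of a binary split, on hypothetical toy data:

```python
import numpy as np

def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2: probability that two samples drawn at random
    from D carry different class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def gini_index(feature, labels, value):
    """Gini(D, A): weighted Gini after the binary split A == value vs. A != value,
    as CART uses for classification."""
    mask = feature == value
    d1, d2 = labels[mask], labels[~mask]
    n = len(labels)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

y = np.array(["no", "no", "yes", "yes", "yes"])
outlook = np.array(["sunny", "sunny", "rain", "overcast", "rain"])
print(gini(y), gini_index(outlook, y, "sunny"))   # this split yields pure halves
```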
(II) Decision tree pruning
1. Pruning: if the decision tree focuses too much on classifying the training data correctly, an overly complex tree is constructed, which leads to overfitting. The tree should then be pruned: cut some subtrees or leaf nodes from the generated tree and turn their root node or parent node into a new leaf node. The goal of pruning is not to fit the training data as well as possible, but to achieve better generalization ability.
2. Loss function: pruning is usually achieved by minimizing the overall loss function of the decision tree,
$$C_\alpha(T) = C(T) + \alpha\,|T|, \qquad C(T) = \sum_{t=1}^{|T|} N_t\,H_t(T),$$
where $|T|$ is the number of leaf nodes of tree $T$, $N_t$ is the number of samples contained in leaf node $t$ (with $N_{tk}$ of them belonging to class $k$), and the empirical entropy of leaf $t$ is
$$H_t(T) = -\sum_{k} \frac{N_{tk}}{N_t} \log \frac{N_{tk}}{N_t}.$$
Understanding the formula: the empirical entropy $H_t(T)$ measures how mixed (impure) leaf node $t$ is. Summing over all leaf nodes of the tree, each weighted by its own number of samples, $C(T)$ measures the model's overall classification error on the training set. If $C(T)$ alone were taken as the objective, the model would split every branch as finely as possible, so that the empirical entropy of every leaf node becomes 0 and $C(T)$ becomes 0, which ultimately overfits. Reflection: a tree split that finely is very complex, containing many branches and subtrees. If a penalty term related to model complexity is added to the loss function, the complexity of the model is taken into account and this overfitting problem can be controlled. For a tree, the more leaf nodes, the more complex it is, so complexity can be measured by the number of leaf nodes $|T|$; a penalty factor $\alpha$ then balances complexity against the overall training error, giving $C_\alpha(T) = C(T) + \alpha\,|T|$. A larger $\alpha$ penalizes trees with many leaf nodes more heavily and yields a simpler model; conversely, a smaller $\alpha$ yields a more complex model. (A small numeric sketch follows.)
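A small numeric sketch of $C_\alpha(T)$, comparing a tree with three fairly pure leaves against the same region collapsed into one mixed leaf; all leaf counts and the $\alpha$ values are made-up numbers for illustration:

```python
import numpy as np

def leaf_entropy(class_counts):
    """H_t(T) = -sum_k (N_tk / N_t) * log2(N_tk / N_t) for one leaf node."""
    counts = np.asarray(class_counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return -(p * np.log2(p)).sum()

def cost_complexity(leaves, alpha):
    """C_alpha(T) = sum_t N_t * H_t(T) + alpha * |T|."""
    c_t = sum(sum(counts) * leaf_entropy(counts) for counts in leaves)
    return c_t + alpha * len(leaves)

big_tree   = [[8, 1], [1, 7], [5, 0]]   # 3 leaves, each fairly pure
small_tree = [[14, 8]]                  # the same 22 samples in 1 mixed leaf
for alpha in (0.0, 10.0):
    print(alpha, cost_complexity(big_tree, alpha), cost_complexity(small_tree, alpha))
# With alpha = 0 the larger tree has the lower C; with a large alpha the penalty
# on |T| makes the pruned, simpler tree the better choice.
```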
3. Two pruning strategies
(1) Pre-pruning:
During construction of the decision tree, each node is evaluated before it is split; if splitting the current node would not improve the generalization performance of the decision tree model, the node is not split and is marked as a leaf node.
(2) Post-pruning:
First generate a complete decision tree from the training set, then examine the non-leaf nodes from bottom to top; if replacing the subtree rooted at a node with a leaf node improves generalization performance, the subtree is replaced by a leaf node.
A comparison of the two strategies (a scikit-learn sketch follows after this list):
① A post-pruned decision tree usually retains more branches than a pre-pruned one;
② The risk of underfitting with post-pruning is small, and its generalization performance is often better than that of a pre-pruned tree;
③ The training time cost of post-pruning is much larger than that of an unpruned or a pre-pruned tree.
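As a concrete, hedged illustration using scikit-learn (a library not mentioned in the original post): pre-pruning corresponds to growth limits such as `max_depth` and `min_samples_leaf`, while post-pruning is approximated here by cost-complexity pruning via `ccp_alpha`; this is not the bottom-up replacement procedure described above, but it plays the same role:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growth early with depth / leaf-size limits.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X_tr, y_tr)
# No pruning: grow the tree until the leaves are pure.
unpruned = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Post-pruning (cost-complexity): grow fully, then prune with penalty ccp_alpha.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X_tr, y_tr)

for name, clf in [("pre-pruned", pre_pruned), ("unpruned", unpruned),
                  ("post-pruned", post_pruned)]:
    print(name, "leaves:", clf.get_n_leaves(), "test acc:", clf.score(X_te, y_te))
```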