Summary of Machine Learning + Pattern Recognition Learning (IV): Decision Trees
2022-06-12 07:28:00 【Nidiya is trying】
I. Definitions and Introduction
1. A classification decision tree model is a tree structure that describes how instances are classified; it is composed of nodes and directed edges. Nodes are divided into internal nodes (each representing a feature/attribute) and leaf nodes (each representing a class). [The overall structure of a tree includes: root node, non-leaf nodes, leaf nodes, and branches.]
2. If-then rules: each path from the root node to a leaf node constitutes a rule; the features of the internal nodes along the path correspond to the rule's conditions, and the class of the leaf node corresponds to the rule's conclusion. An important property of the if-then rule set is that it is mutually exclusive and complete: starting from the root node, every instance follows exactly one path to a specific leaf node, so no instance is left unclassified.
3. Objective of decision tree learning: build a decision tree model from the training data set that classifies the training data correctly and also generalizes well.
4. Essence of decision tree learning: induce classification rules from the training data set, i.e., find the rules that map features to labels.
5. Loss function of a decision tree: usually a regularized maximum likelihood function. In practice a heuristic method is used, i.e., a greedy, locally optimal search that selects the optimal feature under the current conditions as the splitting rule.
6. The two stages of a decision tree:
(1) Training phase: construct the decision tree from a given data set.
(2) Testing phase: start at the root node and follow the attribute tests of the decision tree downward, layer by layer, until a leaf node is reached; the leaf gives the classification result (see the sketch after this list).
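To make the testing phase concrete, here is a minimal sketch of that traversal. It assumes a hypothetical nested-dict representation of a trained tree (not the output of any particular library):

```python
# Hypothetical representation: an internal node is {feature: {value: subtree}},
# a leaf node is just the predicted class label.
tree = {
    "outlook": {
        "sunny": {"humidity": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rain": {"wind": {"strong": "no", "weak": "yes"}},
    }
}

def classify(node, sample):
    """Testing phase: start at the root, follow the branch matching the sample's
    value for the tested feature, and repeat until a leaf (class label) is hit."""
    while isinstance(node, dict):
        feature = next(iter(node))                 # feature tested at this node
        node = node[feature][sample[feature]]      # descend along the matching branch
    return node

print(classify(tree, {"outlook": "sunny", "humidity": "normal", "wind": "weak"}))  # -> "yes"
```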
7. Key concepts:
(1) Entropy: a measure of the uncertainty of a random variable; the larger the entropy, the greater the uncertainty. For a random variable $X$ with distribution $P(X = x_i) = p_i$, $i = 1, \dots, n$, the entropy is
$$H(X) = -\sum_{i=1}^{n} p_i \log p_i .$$
As the formula shows, the entropy of $X$ depends only on its distribution, i.e., only on the probabilities $p_i$, and not on the values that $X$ takes.
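A quick numerical check of this property (a minimal sketch; the distributions below are arbitrary examples):

```python
import numpy as np

def entropy(p):
    """H(X) = -sum_i p_i * log2(p_i) for a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # terms with p_i = 0 contribute nothing
    return -(p * np.log2(p)).sum()

# The more uniform the distribution, the larger the entropy (uncertainty);
# the values taken by X never enter the computation.
print(entropy([0.5, 0.5]))             # 1.0 bit: most uncertain for two outcomes
print(entropy([0.9, 0.1]))             # ~0.47 bits
print(entropy([1.0, 0.0]))             # 0.0 bits: no uncertainty at all
```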
(2) Conditional entropy: a measure of the uncertainty of a random variable $Y$ given a random variable $X$:
$$H(Y \mid X) = \sum_{i=1}^{n} p_i \, H(Y \mid X = x_i), \qquad p_i = P(X = x_i).$$
When the probabilities in the entropy and conditional entropy are estimated from data, the corresponding quantities are called the empirical entropy and empirical conditional entropy.
(3) Information gain: the degree to which learning feature $A$ reduces the uncertainty about the class of data set $D$:
$$g(D, A) = H(D) - H(D \mid A).$$
It can be understood as the amount by which the uncertainty of classifying the training data set $D$ decreases once feature $A$ is known; in other words, information gain is the amount by which entropy is reduced.
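A minimal sketch of all three quantities computed from empirical estimates; the labels and the single feature column below are hypothetical toy data:

```python
import numpy as np

def entropy(labels):
    """Empirical entropy H(D) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def conditional_entropy(feature, labels):
    """Empirical conditional entropy H(D|A): split D by the values of A and
    average the subset entropies, weighted by subset size."""
    values, counts = np.unique(feature, return_counts=True)
    return sum((c / len(labels)) * entropy(labels[feature == v])
               for v, c in zip(values, counts))

def information_gain(feature, labels):
    """g(D, A) = H(D) - H(D|A): reduction in class uncertainty due to knowing A."""
    return entropy(labels) - conditional_entropy(feature, labels)

y = np.array(["yes", "yes", "no", "no", "yes", "no"])
a = np.array(["sunny", "sunny", "rain", "rain", "overcast", "rain"])
print(entropy(y), conditional_entropy(a, y), information_gain(a, y))
```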
II. Building a Decision Tree (constructing a decision tree is the process of recursively selecting the optimal feature, which is also the process of partitioning the feature space)
The learning algorithm of a decision tree consists of three parts: feature selection, decision tree generation, and decision tree pruning. They are described below.
(I) Feature selection and decision tree generation: decide which feature to use to partition the feature space, then generate the decision tree based on that feature-selection criterion. Different generation algorithms use different feature-selection criteria; the common ones are information gain and information gain ratio.
1. ID3 algorithm: selects features by the information gain criterion. The principle is to choose the feature whose split reduces the entropy of the remaining data the most, i.e., leaves the remaining data sets as pure as possible (the three steps below are sketched in code after this list).
Step A: compute the empirical entropy of $D$, $H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}$, where $C_k$ is the set of samples belonging to class $k$.
Step B: compute the empirical conditional entropy of $D$ given feature $A$, $H(D \mid A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i)$, where $D_i$ is the subset of $D$ on which $A$ takes its $i$-th value.
Step C: compute the information gain as the empirical entropy minus the empirical conditional entropy, $g(D, A) = H(D) - H(D \mid A)$.
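Putting Steps A-C together, ID3's feature-selection step picks the candidate feature with the largest information gain. A minimal, self-contained sketch (the feature columns and labels are hypothetical toy data):

```python
import numpy as np

def entropy(labels):
    """Step A: empirical entropy H(D) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(feature, labels):
    """Steps B and C: g(D, A) = H(D) - H(D|A) for one candidate feature column."""
    values, counts = np.unique(feature, return_counts=True)
    h_cond = sum((c / len(labels)) * entropy(labels[feature == v])
                 for v, c in zip(values, counts))
    return entropy(labels) - h_cond

# ID3 splits on the feature whose subsets end up purest (largest gain).
X = {
    "outlook": np.array(["sunny", "sunny", "overcast", "rain", "rain", "rain"]),
    "windy":   np.array(["no", "yes", "no", "no", "yes", "yes"]),
}
y = np.array(["no", "no", "yes", "yes", "yes", "no"])

gains = {name: information_gain(col, y) for name, col in X.items()}
print(gains, "-> split on:", max(gains, key=gains.get))
```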
2. C4.5 algorithm: selects features by the information gain ratio criterion. If plain information gain were used to choose the splitting feature, features with many distinct values would be favoured: by the information gain principle, such a feature contributes heavily to the "disorder" of the data, so splitting on it sharply reduces the confusion of the data set and maximizes the information gain, leaving very pure subsets, even when the feature is not genuinely informative. The information gain ratio corrects this bias:
$$g_R(D, A) = \frac{g(D, A)}{H_A(D)}, \qquad H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|},$$
where $n$ is the number of distinct values of feature $A$.
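A minimal sketch of the gain ratio, contrasting an "ID-like" feature (a distinct value for every sample) with a genuinely useful one; both columns are hypothetical toy data:

```python
import numpy as np

def entropy_from_counts(counts):
    """Entropy of a discrete distribution given raw counts."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return -(p * np.log2(p)).sum()

def gain_ratio(feature, labels):
    """C4.5 criterion: g_R(D, A) = g(D, A) / H_A(D), where the split information
    H_A(D) penalizes features that take many distinct values."""
    h_d = entropy_from_counts(np.unique(labels, return_counts=True)[1])
    values, value_counts = np.unique(feature, return_counts=True)
    h_d_given_a = sum(
        (c / len(labels)) *
        entropy_from_counts(np.unique(labels[feature == v], return_counts=True)[1])
        for v, c in zip(values, value_counts))
    h_a = entropy_from_counts(value_counts)   # H_A(D)
    return (h_d - h_d_given_a) / h_a

y = np.array(["yes", "yes", "no", "no"])
id_like = np.array(["s1", "s2", "s3", "s4"])   # unique value per sample
useful = np.array(["a", "a", "b", "b"])
print(gain_ratio(id_like, y), gain_ratio(useful, y))   # the useful feature wins
```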
3. CART decision tree: selects features by the Gini index criterion for classification (and by the least-squares error criterion when used for regression). The decision tree is assumed to be binary, with each internal node's test taking the values "yes" and "no". For a class distribution $p = (p_1, \dots, p_K)$, the Gini impurity is
$$\mathrm{Gini}(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2,$$
and the Gini index of data set $D$ under the binary split induced by feature $A$ is
$$\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2).$$
Intuitively, $\mathrm{Gini}(p)$ is the probability that two samples drawn at random from $D$ have different class labels; therefore, the smaller $\mathrm{Gini}(p)$ is, the purer the data set $D$ (a sketch follows below).
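A minimal sketch of the Gini impurity and the Gini index of a binary split, on hypothetical toy data:

```python
import numpy as np

def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2: probability that two samples drawn at random
    from D carry different class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def gini_index(feature, labels, value):
    """Gini(D, A): weighted Gini after the binary split A == value vs. A != value,
    as CART uses for classification."""
    mask = feature == value
    d1, d2 = labels[mask], labels[~mask]
    n = len(labels)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

y = np.array(["no", "no", "yes", "yes", "yes"])
outlook = np.array(["sunny", "sunny", "rain", "overcast", "rain"])
print(gini(y), gini_index(outlook, y, "sunny"))   # this split yields pure halves
```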
(II) Decision tree pruning
1. Pruning: if the decision tree focuses too much on classifying the training data correctly, an overly complex tree is constructed, which leads to overfitting. The tree should then be pruned: cut some subtrees or leaf nodes from the generated tree and turn their root node or parent node into a new leaf node. The goal of pruning is not to fit the training data as well as possible, but to achieve better generalization ability.
2. Loss function: pruning is usually achieved by minimizing the overall loss function of the decision tree,
$$C_\alpha(T) = C(T) + \alpha\,|T|, \qquad C(T) = \sum_{t=1}^{|T|} N_t\,H_t(T),$$
where $|T|$ is the number of leaf nodes of tree $T$, $N_t$ is the number of samples contained in leaf node $t$ (with $N_{tk}$ of them belonging to class $k$), and the empirical entropy of leaf $t$ is
$$H_t(T) = -\sum_{k} \frac{N_{tk}}{N_t} \log \frac{N_{tk}}{N_t}.$$
Understanding the formula: the empirical entropy $H_t(T)$ measures how mixed (impure) leaf node $t$ is. Summing over all leaf nodes of the tree, each weighted by its own number of samples, $C(T)$ measures the model's overall classification error on the training set. If $C(T)$ alone were taken as the objective, the model would split every branch as finely as possible, so that the empirical entropy of every leaf node becomes 0 and $C(T)$ becomes 0, which ultimately overfits. Reflection: a tree split that finely is very complex, containing many branches and subtrees. If a penalty term related to model complexity is added to the loss function, the complexity of the model is taken into account and this overfitting problem can be controlled. For a tree, the more leaf nodes, the more complex it is, so complexity can be measured by the number of leaf nodes $|T|$; a penalty factor $\alpha$ then balances complexity against the overall training error, giving $C_\alpha(T) = C(T) + \alpha\,|T|$. A larger $\alpha$ penalizes trees with many leaf nodes more heavily and yields a simpler model; conversely, a smaller $\alpha$ yields a more complex model. (A small numeric sketch follows.)
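A small numeric sketch of $C_\alpha(T)$, comparing a tree with three fairly pure leaves against the same region collapsed into one mixed leaf; all leaf counts and the $\alpha$ values are made-up numbers for illustration:

```python
import numpy as np

def leaf_entropy(class_counts):
    """H_t(T) = -sum_k (N_tk / N_t) * log2(N_tk / N_t) for one leaf node."""
    counts = np.asarray(class_counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return -(p * np.log2(p)).sum()

def cost_complexity(leaves, alpha):
    """C_alpha(T) = sum_t N_t * H_t(T) + alpha * |T|."""
    c_t = sum(sum(counts) * leaf_entropy(counts) for counts in leaves)
    return c_t + alpha * len(leaves)

big_tree   = [[8, 1], [1, 7], [5, 0]]   # 3 leaves, each fairly pure
small_tree = [[14, 8]]                  # the same 22 samples in 1 mixed leaf
for alpha in (0.0, 10.0):
    print(alpha, cost_complexity(big_tree, alpha), cost_complexity(small_tree, alpha))
# With alpha = 0 the larger tree has the lower C; with a large alpha the penalty
# on |T| makes the pruned, simpler tree the better choice.
```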
3. Two pruning strategies
(1) Pre-pruning:
During construction of the decision tree, each node is evaluated before it is split; if splitting the current node would not improve the generalization performance of the decision tree model, the node is not split and is marked as a leaf node.
(2) Post-pruning:
First generate a complete decision tree from the training set, then examine the non-leaf nodes from bottom to top; if replacing the subtree rooted at a node with a leaf node improves generalization performance, the subtree is replaced by a leaf node.
A comparison of the two strategies (a scikit-learn sketch follows after this list):
① A post-pruned decision tree usually retains more branches than a pre-pruned one;
② The risk of underfitting with post-pruning is small, and its generalization performance is often better than that of a pre-pruned tree;
③ The training time cost of post-pruning is much larger than that of an unpruned or a pre-pruned tree.
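As a concrete, hedged illustration using scikit-learn (a library not mentioned in the original post): pre-pruning corresponds to growth limits such as `max_depth` and `min_samples_leaf`, while post-pruning is approximated here by cost-complexity pruning via `ccp_alpha`; this is not the bottom-up replacement procedure described above, but it plays the same role:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growth early with depth / leaf-size limits.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X_tr, y_tr)
# No pruning: grow the tree until the leaves are pure.
unpruned = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Post-pruning (cost-complexity): grow fully, then prune with penalty ccp_alpha.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X_tr, y_tr)

for name, clf in [("pre-pruned", pre_pruned), ("unpruned", unpruned),
                  ("post-pruned", post_pruned)]:
    print(name, "leaves:", clf.get_n_leaves(), "test acc:", clf.score(X_te, y_te))
```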