Chapter 4: Decision Trees
2022-07-08 01:07:00 【Intelligent control and optimization decision Laboratory of Cen】
Which parts does a decision tree learning algorithm include? What are the commonly used algorithms?
The learning algorithm of a decision tree mainly consists of three parts:
1. Feature selection 2. Tree generation 3. Tree pruning
The commonly used algorithms are:
1. ID3 2. C4.5 3. CART
What do the root node, internal nodes, and leaf nodes of a decision tree represent?
Each leaf node corresponds to a decision result, and every other node corresponds to an attribute test; the sample set contained in each node is partitioned into the child nodes according to the outcome of the attribute test, and the root node contains the complete set of samples. The path from the root node to each leaf node corresponds to a sequence of decision tests.
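To make this structure concrete, here is a minimal illustrative sketch in Python; the class name `DecisionNode` and the toy attribute values are hypothetical, not taken from any library.

```python
# Minimal illustrative decision-tree node: internal nodes hold an attribute
# test, leaves hold a decision result; names and data here are hypothetical.
class DecisionNode:
    def __init__(self, attribute=None, children=None, decision=None):
        self.attribute = attribute      # attribute tested at an internal node
        self.children = children or {}  # attribute value -> child node
        self.decision = decision        # decision result at a leaf node

    def predict(self, sample):
        """Follow the decision test sequence from this node down to a leaf."""
        if self.decision is not None:   # leaf node: return the decision
            return self.decision
        return self.children[sample[self.attribute]].predict(sample)

# The root implicitly covers the full sample set; each root-to-leaf path
# is one decision test sequence.
root = DecisionNode(attribute="texture", children={
    "clear":  DecisionNode(decision="good melon"),
    "blurry": DecisionNode(decision="bad melon"),
})
print(root.predict({"texture": "clear"}))  # -> good melon
```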
What are the criteria for feature selection (i.e., how to choose the best splitting attribute)?
Feature selection is mainly based on information gain and the information gain ratio.
Information gain
First, information entropy is defined as: $Ent(D)=-\sum_{k=1}^{|\mathcal{Y}|}p_k\log_2 p_k$, where $p_k$ is the proportion of samples of the $k$-th class in $D$.
The information gain is: $Gain(D,a)=Ent(D)-\sum_{v=1}^{V}\frac{|D^v|}{|D|}Ent(D^v)$, where $D^v$ is the subset of samples in $D$ that take the $v$-th value of attribute $a$.
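As a concrete illustration, here is a minimal Python sketch of both formulas; the toy data at the end is hypothetical.

```python
# Minimal sketch of information entropy and information gain;
# the toy data below is hypothetical.
import math
from collections import Counter

def entropy(labels):
    """Ent(D) = -sum_k p_k * log2(p_k) over the class proportions p_k."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain(D, a) = Ent(D) - sum_v |D^v|/|D| * Ent(D^v),
    where values[i] is sample i's value on attribute a."""
    n = len(labels)
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    return entropy(labels) - sum(len(ys) / n * entropy(ys)
                                 for ys in subsets.values())

values = ["green", "green", "dark", "dark", "light"]   # attribute a
labels = ["good", "good", "good", "bad", "bad"]        # class labels
print(information_gain(values, labels))                # ~0.571
```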
Information gain ratio
The gain ratio is defined as: $Gain\_ratio(D,a)=\frac{Gain(D,a)}{IV(a)}$,
where: $IV(a)=-\sum_{v=1}^{V}\frac{|D^v|}{|D|}\log_2\frac{|D^v|}{|D|}$
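Building on the sketch above, the gain ratio can be computed as follows (again only a sketch; `information_gain` is the helper defined previously).

```python
# Minimal sketch of the gain ratio; IV(a) penalizes attributes with
# many distinct values. Reuses information_gain from the sketch above.
import math
from collections import Counter

def intrinsic_value(values):
    """IV(a) = -sum_v |D^v|/|D| * log2(|D^v|/|D|)."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(values, labels):
    """Gain_ratio(D, a) = Gain(D, a) / IV(a)."""
    return information_gain(values, labels) / intrinsic_value(values)
```

Note that C4.5 does not simply pick the attribute with the largest gain ratio: it first restricts attention to attributes whose information gain is above average, and then chooses the one with the highest gain ratio among them.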
How to prevent overfitting in decision trees?
Pruning is the main strategy by which decision tree learning algorithms deal with overfitting. To prevent the tree from fitting the training samples too closely, the risk of overfitting can be reduced by actively removing some branches. The basic pruning strategies are pre-pruning and post-pruning.
Pre-pruning
During the generation of the decision tree, each node is evaluated before it is split; if splitting the current node cannot improve the generalization performance of the tree, the split is stopped and the current node is marked as a leaf node.
Post-pruning
Post-pruning first grows a complete decision tree from the training set and then examines the non-leaf nodes from the bottom up; if replacing the subtree rooted at a node with a leaf node improves the generalization ability of the tree, the subtree is replaced with a leaf node.
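As an illustration of both strategies, the sketch below uses scikit-learn's CART implementation (an assumption about tooling; the text itself describes pruning generically): growth limits such as `max_depth` and `min_samples_leaf` act as a form of pre-pruning, while cost-complexity pruning via `ccp_alpha` is a post-pruning procedure.

```python
# Sketch of pre-pruning vs. post-pruning with scikit-learn's CART trees;
# the dataset and parameter values are illustrative only.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Pre-pruning: stop splitting early via limits checked during tree growth.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X_tr, y_tr)

# Post-pruning: grow the full tree, then cut subtrees back using
# cost-complexity pruning (ccp_alpha chosen here for illustration).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
post = DecisionTreeClassifier(random_state=0,
                              ccp_alpha=path.ccp_alphas[-2]).fit(X_tr, y_tr)

print(pre.score(X_te, y_te), post.score(X_te, y_te))
```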
How to deal with continuous values and missing values?
1. Handling continuous values
Because a continuous attribute no longer has a finite number of values, nodes cannot be split directly on each of its values. In this case, continuous-attribute discretization techniques can be applied. The simplest strategy is bi-partition, i.e., splitting on a single threshold, which is exactly the mechanism used by the C4.5 decision tree algorithm.
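A minimal sketch of this bi-partition strategy, reusing the `entropy` helper from the information-gain sketch above: candidate thresholds are the midpoints between adjacent sorted values, and the threshold with the largest information gain is chosen.

```python
# Minimal sketch of bi-partition for a continuous attribute; `entropy`
# is the helper defined in the information-gain sketch above.
def best_split(values, labels):
    """Try midpoints of adjacent sorted values; return (threshold, gain)."""
    pts = sorted(set(values))
    n = len(labels)
    best_t, best_gain = None, -1.0
    for a, b in zip(pts, pts[1:]):
        t = (a + b) / 2
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        gain = entropy(labels) - (len(left) / n * entropy(left) +
                                  len(right) / n * entropy(right))
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

print(best_split([0.1, 0.3, 0.5, 0.7], ["bad", "bad", "good", "good"]))
# -> (0.4, 1.0): the midpoint 0.4 separates the two classes perfectly
```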
2. Handling missing values
Incomplete samples are often encountered in real tasks, i.e., some attribute values of a sample are missing. For example, due to diagnostic and testing costs, privacy protection, and other factors, the values of some attributes in a patient's medical record (such as an HIV test result) may be unknown. Especially when the number of attributes is large, there are often many samples with missing values. Simply discarding incomplete samples and learning only from samples without missing values would clearly be a great waste of data.
1. How to choose the splitting attribute when attribute values are missing?
Let $\rho$ denote the proportion of samples without missing values. Given a training set $D$ and an attribute $a$, let $\hat{D}$ denote the subset of samples in $D$ whose value on attribute $a$ is not missing. Then
$Gain(D,a)=\rho \times Gain(\hat{D},a)$
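A minimal sketch of this $\rho$-scaled gain, reusing `information_gain` from the earlier sketch and assuming unit sample weights, with `None` marking a missing value:

```python
# Minimal sketch of Gain(D, a) = rho * Gain(D_hat, a); assumes unit sample
# weights, with None marking a missing attribute value.
def gain_with_missing(values, labels):
    known = [(v, y) for v, y in zip(values, labels) if v is not None]
    rho = len(known) / len(values)      # proportion with no missing value
    sub_values, sub_labels = zip(*known)
    return rho * information_gain(list(sub_values), list(sub_labels))
```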
2. Given the splitting attribute, how are samples whose value of that attribute is missing partitioned?
If the value of sample $x$ on the splitting attribute $a$ is known, $x$ is placed into the child node corresponding to that value, and its sample weight stays $w_x$. If the value of $x$ on $a$ is unknown, $x$ is placed into all child nodes simultaneously, and its weight in the child node corresponding to value $a^v$ is adjusted to $\hat{r}_v \cdot w_x$, where $\hat{r}_v$ denotes the proportion of samples without missing values that take value $a^v$ on attribute $a$.
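The sketch below shows this weighted routing (the function name `distribute` is illustrative); `None` marks a missing value of the splitting attribute, and each sample carries a weight $w_x$.

```python
# Minimal sketch of routing weighted samples at a split; a sample with a
# missing value enters every child node with weight r_v * w_x.
def distribute(samples, weights, attr_values):
    known = [(v, w) for v, w in zip(attr_values, weights) if v is not None]
    total = sum(w for _, w in known)
    r = {}                              # r[v] plays the role of \hat{r}_v
    for v, w in known:
        r[v] = r.get(v, 0.0) + w / total
    children = {v: [] for v in r}
    for x, w, v in zip(samples, weights, attr_values):
        if v is not None:
            children[v].append((x, w))           # weight kept as w_x
        else:
            for u, r_u in r.items():
                children[u].append((x, r_u * w)) # adjusted to r_v * w_x
    return children
```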