ML4 Self-Study Notes
2022-07-29 06:16:00 【19-year-old flower girl】
Decision tree algorithm
- Place the conditions with the strongest discriminating power at the nodes nearest the root. A tree model can do both classification and regression.

- Composition of a tree: root node, internal (non-leaf) nodes, and leaf nodes.

- Decision tree training (building the tree from data) and testing (running a sample down the built tree).

- How to split on features: which feature is selected at each node, and where to cut. This is decided by a measure.

- The measure: entropy. Look at the entropy after a candidate split; we want the entropy of each branch to drop as much as possible after the split.

- Information gain: used to select features. A feature whose split lowers the entropy the most has the largest information gain and is the better choice, as sketched below.
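A minimal Python sketch of the entropy computation described above (the `entropy` helper is my own naming, assuming numpy is available):

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array: H = -sum(p_k * log2(p_k))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# A 50/50 mix has maximal entropy (1 bit); a pure node has entropy 0.
print(entropy(np.array(["yes", "yes", "no", "no"])))  # 1.0
print(entropy(np.array(["yes", "yes", "yes"])))       # -0.0, i.e. zero
```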

Decision tree construction example
- A dataset consists of samples with their features and a target label.

- First select the root node, using the information-gain calculation.

- Compute the entropy of the raw data (before any split).

- For each of the 4 features, compute the entropy after splitting on it.

The entropy after a split is the weighted sum of the branch entropies, weighted by the fraction of samples that fall into each branch.
The remaining features' entropies are computed the same way; the feature with the highest information gain becomes the root node, and each subsequent node is then selected in the same way, as in the sketch below.
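A sketch of this feature-selection step, reusing the entropy helper from above (the toy data and function names are illustrative, not from the original notes):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    """Entropy before the split minus the weighted entropy after
    splitting on each distinct value of the feature."""
    h_after = 0.0
    for v in np.unique(feature):
        mask = feature == v
        h_after += mask.mean() * entropy(labels[mask])
    return entropy(labels) - h_after

# Pick the column with the largest gain as the root node.
X = np.array([["sunny", "hot"], ["sunny", "mild"],
              ["rain", "mild"], ["rain", "hot"]])
y = np.array(["no", "no", "yes", "yes"])
print([information_gain(X[:, j], y) for j in range(X.shape[1])])
# -> [1.0, 0.0]: column 0 separates the labels perfectly.
```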
ID3, C4.5, CART, and the Gini coefficient

In ID3, if the data has an id column (1, 2, 3, …, n) and we split on id, every sample lands in its own group: each branch is fully determined, entropy is zero, and by information gain this looks like the best possible split. But splitting on id is meaningless; an id number cannot decide the branch for new data.
C4.5 therefore proposes the information gain ratio, which takes the feature's own entropy into account. An id column has so many distinct values that its own (intrinsic) entropy is very large; dividing the information gain by this intrinsic entropy makes the gain ratio very small, which fixes ID3's problem. C4.5 is commonly used today.
CART uses the Gini coefficient as its measure; see the formula and sketch below.
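The formula in question is standard: for class proportions $p_k$ in a node, $\text{Gini} = 1 - \sum_k p_k^2$. A minimal sketch:

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2); 0 for a pure node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini(np.array([0, 0, 1, 1])))  # 0.5, the two-class maximum
print(gini(np.array([0, 0, 0])))     # 0.0, pure
```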
Continuous attribute values
For a binary split, the cut point can lie between any two adjacent sorted values. Compute the entropy for each candidate cut and choose the split with the lowest entropy; this process is a form of data discretization. A sketch follows.
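A sketch of this search over candidate cut points (midpoints of adjacent sorted values; the helper names are mine):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_threshold(values, labels):
    """Try the midpoint between each adjacent pair of sorted values
    as a binary cut and keep the one with the lowest weighted entropy."""
    order = np.argsort(values)
    v, y = values[order], labels[order]
    best_t, best_h = None, np.inf
    for t in np.unique((v[:-1] + v[1:]) / 2.0):
        left, right = y[v <= t], y[v > t]
        if len(left) == 0 or len(right) == 0:
            continue
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if h < best_h:
            best_t, best_h = t, h
    return best_t, best_h

x = np.array([1.4, 1.3, 4.7, 4.5, 5.1])
y = np.array([0, 0, 1, 1, 1])
print(best_threshold(x, y))  # (2.95, 0.0): a perfect cut between the classes
```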
Decision tree pruning strategy
Reason: to prevent overfitting (in the extreme, every leaf holds a single sample and training accuracy reaches 100%).
Pruning strategies: pre-pruning and post-pruning.
Reading a plotted tree node such as `X[2] <= 2.45`: the node splits on feature 2 at threshold 2.45; its Gini coefficient is computed by the formula above; `samples` is the total number of samples reaching the node, and `value` gives the count of each class among them.
- Pre-pruning: prune while building (limit the maximum depth of the tree, the number of leaf nodes, the minimum number of samples per leaf, and the minimum information gain). These thresholds are found experimentally: how deep the tree should be, how many leaves it should have. See the sketch after this list.
- Post-pruning: prune after the decision tree has been fully built.
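As an illustration of the pre-pruning knobs (a sketch assuming scikit-learn and its bundled iris dataset; the specific hyperparameter values are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Pre-pruning: constrain the tree while it is being built.
clf = DecisionTreeClassifier(
    max_depth=3,          # limit the maximum depth of the tree
    max_leaf_nodes=5,     # limit the number of leaf nodes
    min_samples_leaf=10,  # limit the number of samples per leaf
    random_state=0,
)
clf.fit(X, y)
print(export_text(clf))  # each node prints as "feature_j <= threshold"
```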
In the formula, C(T) is the current node's Gini coefficient multiplied by its sample count, and |T_leaf| is the number of leaves under the current node. Taking the green-circled node in the figure as an example: first compute C_a(T) as if this node did not branch further, then compute C_a(T) for the subtree by summing over its two leaf nodes, and compare with the unbranched value; the larger, the worse. The meaning of α is whether more weight goes to impurity or to the number of leaves: the larger α is, the more the leaf count is penalized, i.e. the more we guard against overfitting; the smaller α is, the more weight goes to accuracy on the training data and the less we worry about overfitting.
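Written out (my notation, but this matches the standard cost-complexity criterion the paragraph describes):

$$C_\alpha(T) = C(T) + \alpha\,|T_{\text{leaf}}|, \qquad C(T) = \sum_{t\,\in\,\text{leaves}(T)} \text{Gini}(t)\cdot n_t$$

For each internal node, compare $C_\alpha$ of the node collapsed into a single leaf with $C_\alpha$ of the subtree it roots; if collapsing is no worse, prune the subtree.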
How does a decision tree solve classification and regression problems?
- Classification is simple: follow the branches from the root down, and the leaf you land in gives the class.
- Regression: the measure is variance; we want the variance within each branch to be as small as possible, and the prediction is the mean of the sample labels in the leaf (e.g. age). A sketch follows this list.
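A sketch of the variance criterion for regression trees (toy data; the helper names are mine):

```python
import numpy as np

def variance_reduction(values, targets, threshold):
    """Score a binary cut for a regression tree: the drop in
    sample-weighted variance of the target after the split."""
    left = targets[values <= threshold]
    right = targets[values > threshold]
    after = (len(left) * left.var() + len(right) * right.var()) / len(targets)
    return targets.var() - after

# Predicting age: the leaf's prediction is the mean target value there.
x = np.array([1.0, 2.0, 8.0, 9.0])
age = np.array([10.0, 12.0, 40.0, 42.0])
print(variance_reduction(x, age, threshold=5.0))  # 225.0: a large reduction
print(age[x <= 5.0].mean())                       # leaf prediction: 11.0
```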