当前位置：网站首页>AI zhetianchuan ml novice decision tree

AI zhetianchuan ml novice decision tree

2022-07-08 00:55:00 【Teacher, I forgot my homework】

Decision tree learning is one of the first machine learning methods proposed , Because it is easy to use and has strong interpretability , It is still widely used .

One 、 Decision tree learning foundation

（ Relatively simple , Pass by with one ）

example ： Enjoy a sport

For a new day , Whether you can enjoy sports ？

Classical objective problem for decision tree learning

Classification problems with non numerical characteristics
Discrete features
There is no similarity
Characteristic disorder
Another example ： Fruits
Color ： Red 、 green 、 yellow
size ： Small 、 in 、 Big
shape ： spherical 、 Long and thin
taste ： sweet 、 acid

The sample shows

List of attributes instead of numeric vectors
For example, enjoy sports ：
6 Value attribute ： The weather 、 temperature 、 humidity 、 wind 、 The water temperature 、 Forecast the weather
A weather example of a day ：{ Fine 、 warm 、 commonly 、 strong 、 warm 、 unchanged }
For example, fruit ：
4 Value tuples ： Color 、 size 、 shape 、 taste
An example of a fruit ： { red 、 spherical 、 Small 、 sweet }

The training sample

Decision tree concept

Development history of decision tree - Milepost

• 1966, from Hunt First put forward

• 1970’s~1980’s

• CART from Friedman, Breiman Put forward

• ID3 from Quinlan Put forward

• since 1990’s since

• contrastive study 、 Algorithm improvements (Mingers, Dietterich, Quinlan, etc.)

• The most widely used decision tree algorithm : C4.5 from Quinlan stay 1993 The year of the year

Two 、 Classic decision tree algorithm – ID3

The top-down , Greedy search

A recursive algorithm

Core cycle :

A : Find the next step The best Decision attributes
take A As the current node decision attribute
Pair attribute A (vi ) Each value of , Create a new child node corresponding to it
The training samples are allocated to each node according to the attribute value
If The training samples are perfectly classified , Then exit the loop , Otherwise, enter the recursive state and continue to probe down to split the new leaf node

Such as ：outlook Is the best decision attribute we found , Let's take it as the current root node , according to outlook That is, we will have different branches and different child nodes according to weather conditions , such as sunny、overcast、rain, If it comes to overcast It's over / There are labels , It means that it has been divided , If it hasn't been divided , We divide a batch of data into sunny、rain On the corresponding branch ,humidity、wind, Enter recursion and continue execution . At this time humidity Is the best attribute we choose .

problem ：

What kind of familiarity is the best decision attribute ？
What is the perfect classification of training samples ？

Q1: Which attribute is the best ?

For example, there are two properties A1( humidity ) and A2( wind )（64 Data ,29 A good example 35 A negative example ）

A1 by true（ The humidity is high ） There are 21 just 5 negative by false（ Low humidity ） There are 8 just 30 negative
A2 by true（ Strong wind ） There are 18 just 33 negative by false（ The wind is small ） There are 11 just 2 negative
notes ： The positive example is to exercise , The negative example is not going .

At this point in the end A1 and A2（ Humidity and wind ） Which is the root node / The best decision attribute is good ？

The basic principle of the current best node selection ： concise
---- We prefer to use simple trees with fewer nodes

Pictured above , Suppose there are two kinds of fruits , Watermelon and lemon . We use color color As the root node, it only needs to be based on green still yellow Then we can judge whether it is watermelon or lemon . And if we choose size size , The big one must be watermelon , And the small one may be a small watermelon , So we should judge according to the color .

Attribute selection and node promiscuity (Impurity)

Basic principles of attribute selection ： concise
—— We prefer to use simple trees with fewer nodes
At each node N On , We choose an attribute T, Make the data reaching the current derived node as much as possible “ pure ”
The purity (purity) – Promiscuity (impurity)

As shown in the first example above , After Color , Then the green on the left is watermelon, and the yellow on the right is lemon

And in the second Size Small Node , There are watermelon and lemon , It's not enough pure .

How to measure promiscuity ？

1. entropy (Entropy) ( Widely used )

gif.latex?Entropy%28N%29%20%3D%20-%5Csum_%7Bj%7D%5E%7B%7D%28p%28w_%7Bj%7D%29*%7Blog_%7B2%7D%7D%5E%7Bp%28w_%7Bj%7D%29%7D%29

node N The entropy of is expressed as different values under this node gif.latex?p%28w_%7Bj%7D%29 （w: With size Size, for example w1 It's big w2 It's small , gif.latex?p%28w_%7B1%7D%29 by w1 Probability , gif.latex?p%28w_%7B2%7D%29 by w2 Probability .）, every last gif.latex?p*log%7B_%7B2%7D%7D%5E%7Bp%7D The opposite of and .

Definition : 0log0=0
In information theory , Entropy measures the purity of information / Promiscuity , Or the uncertainty of information
It can be seen from the above figure , When entropy is maximum, the probability is half , That is, the most uncertain at this time .
Normal distribution – Has the largest entropy

Calculate and judge the following example ：

You can see A1 A2 All are 29+,35-, Close to half , Entropy should be close to 1 Of .

$gif.latex?Entropy%28S%29%3D-%5Cfrac%7B29%7D%7B64%7D*%7Blog_%7B2%7D%7D%5E%7B%5Cfrac%7B29%7D%7B64%7D%7D-%5Cfrac%7B35%7D%7B64%7D*%7Blog_%7B2%7D%7D%5E%7B%5Cfrac%7B35%7D%7B64%7D%7D%20%3D%200.993$

The blue line ：1 The probability of class time is 1,2 The probability of class time is 0.5,16 The probability of class time is 1/16

Red thread ：1 The maximum entropy is 0,2 The maximum entropy is 1, Unlimited growth .（ The maximum is uniformly distributed ）

2. Gini Promiscuity ( Duda Tend to use Gini Promiscuity )

In economics and sociology, people use The gini coefficient To measure the balance of a country's development .

gif.latex?i%28N%29%3D%5Csum_%7Bi%5Cneq%20j%7D%5E%7B%7D%28p%28w_%7Bi%7D%29*p%28w_%7Bj%7D%29%29%20%3D%201-%5Csum_%7Bj%7D%5E%7B%7Dp%5E%7B2%7D%28w_%7Bj%7D%29

Represents the product of all different cases （ If it is A、B In both cases p(A)*p(B), If it is A、B、C In three cases p(A)*p(B)+p(A)*p(C)+p(B)*p(C)）

perhaps 1- The product of the same case （ gif.latex?1-p%5E%7B2%7D%28A%29-p%5E%7B2%7D%28B%29-p%5E%7B2%7D%28C%29 ）

n = number of class（ It is also because it is evenly distributed , Each part is 1/n）

There is an upper limit , Cap of 1.

3. Misclassification confounding

gif.latex?i%28N%29%20%3D%201-max%20%5C%2C%20P%28w_%7Bj%7D%29

1- The probability of the largest class , namely 1- Most of the those who enter ( It's right ,1- It's wrong. )

Measure the change of promiscuity -- Information gain (IG)

Due to A The expected reduction of entropy caused by sorting

$gif.latex?Gain%28S%2CA%29%5Cequiv%20Entropy%28S%29-%5Csum_%7Bv%5Cepsilon%20Values%28A%29%7D%5E%7B%7D%5Cfrac%7B%5Cleft%20%7C%20S_%7Bv%7D%20%5Cright%20%7C%7D%7B%5Cleft%20%7C%20S%20%5Cright%20%7C%7DEntropy%28S_%7Br%7D%29$

original S The entropy of - Pass attribute A Expected entropy after classification

Calculated , so A1（ humidity ） Of IG( Information gain ) Bigger , That means we get more information ( The reduced entropy is more ), We choose A1 As root node .

example ： Choose which attribute to use as the root node according to the following table

Outlook Of Gain Maximum , Choose it as the root node .

Q2: When to return ( Stop splitting nodes )？

“ If the training samples are perfectly classified ”

• situation 1: If all data in the current subset There are exactly the same output categories , So stop

• situation 2: If all data in the current subset Have exactly the same input characteristics , So stop

such as ： a sunny day - No wind - The humidity is normal - Appropriate temperature , Finally, some went and some didn't . At this point, even if it does not end, there is no way , Because the available information has been used up . It means ：1、 The data is noisy noise. Need to be cleaned , If there is too much noise, the data quality is not good enough .2、 Missing important Feature, For example, I missed whether there were classes on that day , There is no way to go out to play when there is a class .

Possible situation 3: If The information gain of all attribute splitting is 0, So stop Is this a good idea ？（No）

As shown in the figure above, the first node of the tree cannot be found at this time (IG All are 0), If 3 That's right , At this time, the decision tree cannot be built .

If we don't want this condition , It's all the same ,IG All are 0, When there are multiple maxima, a random relationship is constructed （ At this time, it is also perfectly classified ）

That is to say ID3 Only the above two situations will stop splitting , If IG=0, Just choose one at random .

ID3 The hypothesis space of algorithm search

Suppose the space is complete （ It can handle both disjunction and conjunction of attributes ）

The objective function must be in the hypothesis space

Output a single hypothesis （ Walk down a path of trees ）

No more than 20 A question ( Based on experience , commonly feature No more than 20 individual , If the tree is too complex, it is easy to produce over fitting )

No backtracking （ With A1 Be the root node , There is no way to return to see A2 How about being a root node ）

Local optimum

Use all the data of the subset in each step ( For example, the weight update strategy in the gradient descent algorithm is to update each data once , That is to use only one piece of data at a time )

Data driven search options

Robust to noise data

ID3 Inductive bias in （Inductive Bias）

Hypothetical space H It acts on the sample set X Upper

There is no restriction on the hypothetical space

Prefer trees with greater information gain for attributes close to the root node

Try to find the shortest tree

The bias of the algorithm lies in some preferences for certain assumptions ( Search bias ), Instead of hypothetical space H Make restrictions ( Describe the offset ).

Okam razor （Occam’s Razor）*： Prefer the shortest hypothesis consistent with the data

CART ( Classification and regression trees )

A general framework :

Build a decision tree based on the training data
The decision tree will gradually divide the training set into smaller and smaller subsets
When the subset is pure, it will no longer split
Or accept an imperfect decision

Many decision tree algorithms are based on this framework , Include ID3、C4.5 wait .

3、 ... and 、 Over fitting problem

Pictured above ,b Than a Better , But if it looks like c Like every point is perfectly fitted , Error rate for 0, But if there is a new point ：

here c The error of the algorithm is too large , It matches the training data too much , So its generalization ability on more unseen instances of the test decreases .

An extreme example of decision tree over fitting ：

Each leaf node corresponds to a single training sample —— Each training sample is perfectly classified
The whole tree is equivalent to a simple implementation of a data lookup algorithm

namely At this time, the tree is a data lookup table , It can be compared with hash table in data structure . This means that only the data in the table can be searched , No data can be found , There is no generalization ability .

Four 、 How to avoid over fitting

There are two ways to avoid over fitting for decision trees

When data fragmentation is not statistically significant ( For example, there are few examples ) when , Stop growing ： pre-pruning
Build a complete tree , Then do back pruning

in application , Generally, pre pruning is faster , Then the accuracy of trees obtained by pruning is higher .

For pre pruning

It is difficult to estimate the size of the tree

pre-pruning : Based on the number of samples

Usually a node does not continue to split , When ：
The number of training samples arriving at a node is less than a specific proportion of the training set ( for example 5%)
No matter how promiscuous or error rate
reason ： Decisions based on too few data samples will bring large errors and generalization errors ( elephant )

Such as the 100 Data , To the current node, only 5 It's data , Don't continue to divide downward , No matter how positive or negative , It shows that the score is too detailed .

pre-pruning : Threshold based on information gain

Set a small threshold , Stop splitting if the following conditions are met ：
advantage : 1. All training data are used 2. Leaf nodes can be at any level in the tree
shortcoming : It's hard to set a good threshold

choose IG The big ones are nodes , If IG The biggest one is still a little small , gif.latex?%5CDelta%20i%28s%29%5Cleq%20%5Cbeta Then stop .

For post pruning :

• How to choose “ The best ” The tree of ？
Test the effect on another validation set
• MDL (Minimize Description Length Minimum description length ):
minimize ( size(tree) + size(misclassifications(tree)) ) The sum of tree size and error rate is the smallest

After pruning : Wrong lower pruning

• Divide the data set into training set and verification set
        • Verification set :
                • Known labels
                • The test results
                • Do not do model updates on the validation set !
• Pruning until it is cut again will damage the performance :
        • Test on the validation set and cut out every possible node ( And the subtree with it as its root ) Influence
        • Greedily remove a node that can improve the accuracy of the verification set
After cutting a node , How to define a new leaf node ( father ) The label of ？

Label assignment strategy of new leaf nodes after pruning

• Assign values to the most common categories (60 individual data,50 individual +10 individual -, It would be +)
• Label this node with multiple categories
Each category has a support (5/6:1/6) ( According to the number of each tag in the training set )
When testing : According to probability (5/6:1/6) Select a category or select multiple labels
• If it is a regression tree ( Numerical labels ), Can do average or weighted average
• ……

Errors reduce the effect of pruning

From back to front , A little pruning on the validation set .

After pruning : Prune after rules

1. Transform the tree into an equivalent set of rules
• e.g. if (outlook=sunny)^ (humidity=high) then playTennis = no
2. Prune each rule , Remove the rule antecedents that can improve the accuracy of the rule
• i.e. (outlook=sunny)60%, (humidity=high)85%,(outlook=sunny)^ (humidity=high)80%
3. Sort the rules into a sequence ( Verification set ： Sort from high to low according to the accuracy of the rules )
4. Use the final rule in the sequence to classify the samples （ Check whether the rule sequence is satisfied in turn , That is, go down one by one on the rule set until you can judge ）
( notes ： After the rules are pruned , It may no longer be able to recover into a tree )
A widely used method , for example C4.5

Why convert the decision tree into rules before pruning ?

• Context independent
        • otherwise , If the subtree is pruned , There are two choices :
                • Delete the node completely
                • Keep it
• Do not distinguish between root nodes and leaf nodes
• Improve readability

5、 ... and 、 Expand : Decision tree learning in real scenes

1. Continuous attribute values

• Establish some discrete attribute values , Interval , Easy to establish branches

• Optional strategies :

• I. Choose the intermediate value of values that are adjacent but have different decisions

Upper figure 48 And 60,80 And 90 Between yes no Changed , We can take the middle value

(Fayyad stay 1991 It was proved in that the threshold meeting such conditions can gain information IG Maximize )

• II. Consider probability gif.latex?x_%7Bs%7D%3D%281-P%29x_%7Bl%7D+Px_%7Bu%7D

Such as normal distribution probability density function image , The length of the interval can be different , But its area / The quantity is the same .