
AI zhetianchuan ml novice decision tree

2022-07-08 00:55:00 Teacher, I forgot my homework

Decision tree learning was one of the first machine learning methods to be proposed. Because it is easy to use and highly interpretable, it is still widely used today.

1. Decision tree learning basics

(Relatively simple; we will go through it quickly.)

Example: EnjoySport

[Figure: the EnjoySport example]

Given a new day, can we predict whether the sport will be enjoyed?

Classic target problems for decision tree learning:

  • Classification problems with non-numerical features
  • Discrete features
  • No natural similarity measure
  • Features are unordered

Another example: fruits

  • Color: red, green, yellow
  • Size: small, medium, large
  • Shape: spherical, elongated
  • Taste: sweet, sour

Sample representation

A list of attributes instead of a numeric vector.

For example, EnjoySport:

  • Six attributes: weather, temperature, humidity, wind, water temperature, forecast
  • An example day: {sunny, warm, normal, strong, warm, same}

For example, a fruit:

  • A 4-tuple of attributes: color, size, shape, taste
  • An example fruit: {red, spherical, small, sweet}

Training samples

[Figure: the EnjoySport training samples]

Decision tree concept

[Figure: a decision tree for EnjoySport, with outlook at the root and humidity / wind below it]

Milestones in the development of decision trees

• 1966: first proposed by Hunt

• 1970s to 1980s

        • CART proposed by Friedman and Breiman

        • ID3 proposed by Quinlan

• Since the 1990s

        • Comparative studies and algorithm improvements (Mingers, Dietterich, Quinlan, etc.)

        • The most widely used decision tree algorithm: C4.5, proposed by Quinlan in 1993

2. A classic decision tree algorithm: ID3

Top-down, greedy search.

A recursive algorithm

Core loop:

  • A: find the best decision attribute for the next step
  • Use A as the decision attribute for the current node
  • For each value vi of attribute A, create a corresponding child node
  • Distribute the training samples to the child nodes according to their attribute values
  • If the training samples are perfectly classified, stop; otherwise recurse and continue splitting the new leaf nodes


For example: suppose outlook is the best decision attribute we found. We make it the current root node and create one branch and child node per weather condition, e.g. sunny, overcast, rain. If a branch such as overcast already has a single label, it is done being split. If a branch is not pure yet, we send its share of the data down to the corresponding branches (sunny, rain), consider the remaining attributes such as humidity and wind, and recurse. At that point humidity may be the best attribute we choose.
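To make the loop concrete, here is a minimal Python sketch of the ID3 skeleton. It assumes the training data is a list of (feature-dict, label) pairs; the helper names (`entropy_of`, `information_gain`, `id3`) and the toy data are illustrative, not a reference implementation.

```python
import math
from collections import Counter

def entropy_of(labels):
    """Entropy of a list of class labels (defined formally in the Q1 section below)."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def information_gain(samples, attribute):
    """Entropy reduction obtained by splitting `samples` on `attribute`."""
    labels = [label for _, label in samples]
    gain = entropy_of(labels)
    for value in {features[attribute] for features, _ in samples}:
        subset = [l for f, l in samples if f[attribute] == value]
        gain -= len(subset) / len(samples) * entropy_of(subset)
    return gain

def id3(samples, attributes):
    """samples: list of (feature_dict, label); attributes: names still available for splitting."""
    labels = [label for _, label in samples]
    if len(set(labels)) == 1:                        # perfectly classified: pure leaf
        return labels[0]
    if not attributes:                               # no attributes left: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(samples, a))
    tree = {best: {}}
    for value in {features[best] for features, _ in samples}:
        subset = [(f, l) for f, l in samples if f[best] == value]
        tree[best][value] = id3(subset, [a for a in attributes if a != best])
    return tree

# Toy data in the spirit of the example above: outlook ends up as the root.
data = [({"outlook": "sunny",    "humidity": "high"},   "no"),
        ({"outlook": "sunny",    "humidity": "normal"}, "yes"),
        ({"outlook": "overcast", "humidity": "high"},   "yes"),
        ({"outlook": "rain",     "humidity": "high"},   "no")]
print(id3(data, ["outlook", "humidity"]))
# {'outlook': {'sunny': {'humidity': {'high': 'no', 'normal': 'yes'}}, 'overcast': 'yes', 'rain': 'no'}}
```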

Two questions:

  1. Which attribute is the best decision attribute?
  2. When are the training samples perfectly classified?

Q1: Which attribute is the best?

For example, suppose there are two attributes A1 (humidity) and A2 (wind), with 64 samples: 29 positive and 35 negative.

[Figure: splitting the 64 samples on A1 (humidity) and on A2 (wind)]

A1 = true (high humidity): 21 positive, 5 negative; A1 = false (low humidity): 8 positive, 30 negative.

A2 = true (strong wind): 18 positive, 33 negative; A2 = false (weak wind): 11 positive, 2 negative.

Note: a positive example means going to exercise, a negative example means not going.

So which of A1 and A2 (humidity or wind) should be the root node, i.e. the best decision attribute?

The basic principle for choosing the current best node: simplicity

                        ---- we prefer simple trees with fewer nodes

[Figure: classifying watermelons and lemons by color vs. by size]

As pictured above, suppose there are two kinds of fruit, watermelons and lemons. If we use color as the root node, we only need to check green vs. yellow to tell watermelon from lemon. If we instead split on size, a big fruit must be a watermelon, but a small one might still be a small watermelon, so we would have to check the color anyway. Hence we should split on color.

Attribute selection and node impurity

Basic principle of attribute selection: simplicity

                ---- we prefer simple trees with fewer nodes

At each node N, we choose an attribute T that makes the data reaching the derived child nodes as "pure" as possible.

Purity vs. impurity

As in the first example above, after splitting on color, the green branch on the left contains only watermelons and the yellow branch on the right contains only lemons.

In the second split, the Size = Small node still contains both watermelons and lemons, so it is not pure enough.

How do we measure impurity?

1. Entropy (widely used)

[Figure: entropy as a function of the class probability]

$$Entropy(N) = -\sum_{j} p(w_j)\log_2 p(w_j)$$

The entropy of node N is determined by the probabilities $p(w_j)$ of the different classes at that node (w: taking size as an example, w1 is large and w2 is small, $p(w_1)$ is the probability of w1 and $p(w_2)$ is the probability of w2). It is the negative of the sum of $p\log_2 p$ over all classes.

  • Convention: 0·log 0 = 0
  • In information theory, entropy measures the purity/impurity of information, i.e. its uncertainty
  • As the figure above shows, entropy is largest when the probability is one half, i.e. when the outcome is most uncertain
  • The normal distribution has the largest entropy (among continuous distributions with a given variance)

Let's compute the following example:

[Figure: the 64 samples (29+, 35-) split on A1 and A2]

Both A1 and A2 start from the same set S with 29+ and 35-, close to an even split, so the entropy should be close to 1.

$$Entropy(S) = -\frac{29}{64}\log_2\frac{29}{64} - \frac{35}{64}\log_2\frac{35}{64} \approx 0.993$$

[Figure: per-class probability (blue) and maximum entropy (red) as the number of classes grows]

Blue line: with 1 class the per-class probability is 1, with 2 classes it is 0.5, with 16 classes it is 1/16.

Red line: with 1 class the maximum entropy is 0, with 2 classes it is 1, and it keeps growing without bound as the number of classes increases (the maximum is attained by the uniform distribution).
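A short sketch of the calculation, translating the formula directly (with the 0·log 0 = 0 convention); the helper name `entropy` and the printed cases are just for illustration:

```python
import math

def entropy(counts):
    """Entropy of a node, given a list of class counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([29, 35]))   # 0.9936..., the ~0.993 computed above
print(entropy([32, 32]))   # 1.0: a 50/50 split has maximum entropy for two classes
print(entropy([64, 0]))    # 0.0: a pure node has zero entropy (0*log0 handled by the c > 0 filter)
```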

2. Gini impurity (Duda tends to use Gini impurity)

In economics and sociology, the Gini coefficient is used to measure how balanced a country's development is.

$$i(N) = \sum_{i \neq j} p(w_i)\,p(w_j) = 1 - \sum_{j} p^{2}(w_j)$$

This is the sum of the products of the probabilities of distinct classes (with two classes A and B it is p(A)·p(B); with three classes A, B, C it is p(A)·p(B) + p(A)·p(C) + p(B)·p(C)),

or equivalently 1 minus the sum of the squared probabilities of each class: $1 - p^{2}(A) - p^{2}(B) - p^{2}(C)$.

[Figure: maximum Gini impurity as a function of the number of classes n]

n = number of classes (the maximum is again attained by the uniform distribution, where each class has probability 1/n).

So Gini impurity has an upper bound, capped at 1.

3. Misclassification impurity

$$i(N) = 1 - \max_{j} P(w_j)$$

This is 1 minus the probability of the largest class, i.e. 1 minus the fraction you get right by always predicting the majority class (what remains is the error rate of that prediction).
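The other two measures translate just as directly; a small sketch using the same class-count convention as the `entropy` helper above (the function names are illustrative):

```python
def gini(counts):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def misclassification(counts):
    """Misclassification impurity: 1 minus the probability of the majority class."""
    total = sum(counts)
    return 1.0 - max(counts) / total

print(gini([32, 32]))               # 0.5: the two-class maximum is 1 - 1/2
print(gini([10, 10, 10]))           # 0.666...: the n-class maximum is 1 - 1/n
print(misclassification([50, 10]))  # 0.166...: the error rate of always predicting the majority
```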


Measuring the change in impurity: information gain (IG)

The expected reduction in entropy caused by sorting the data on attribute A:

$$Gain(S, A) \equiv Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|}\, Entropy(S_v)$$

  = the original entropy of S minus the expected entropy after partitioning on attribute A.

[Figure: the information gain computed for A1 and A2]

Working this out, A1 (humidity) has the larger IG (information gain), i.e. it gives us more information (a larger reduction in entropy), so we choose A1 as the root node.
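The comparison can be reproduced from the counts in the figure (A1: 21+/5- vs 8+/30-; A2: 18+/33- vs 11+/2-), reusing the `entropy(counts)` helper from the sketch above; the printed gain values are my own calculation, shown rounded:

```python
def expected_entropy(children):
    """Weighted average entropy of the child nodes produced by a split."""
    total = sum(sum(child) for child in children)
    return sum(sum(child) / total * entropy(child) for child in children)

parent = [29, 35]                                                  # S: 29 positive, 35 negative
gain_a1 = entropy(parent) - expected_entropy([[21, 5], [8, 30]])   # A1 = humidity
gain_a2 = entropy(parent) - expected_entropy([[18, 33], [11, 2]])  # A2 = wind
print(round(gain_a1, 3), round(gain_a2, 3))  # ~0.266 vs ~0.121, so A1 is chosen
```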

Example: choose which attribute to use as the root node according to the following table.

[Figure: the training data table]

[Figure: the information gain of each attribute]

Outlook has the largest Gain, so we choose it as the root node.

Q2: When do we return (stop splitting nodes)?

"When the training samples are perfectly classified."

Case 1: if all data in the current subset have exactly the same output category, stop.


Case 2: if all data in the current subset have exactly the same input features, stop.

         For example: a sunny day, no wind, normal humidity, moderate temperature, and yet some people went and some did not. Even if we do not stop here there is nothing more we can do, because the available information has been used up. This usually means: 1. the data is noisy and needs cleaning (too much noise means the data quality is not good enough); 2. an important feature is missing, e.g. whether there were classes that day, since you cannot go out to play when you have class.

Possible case 3: if the information gain of splitting on every attribute is 0, stop. Is this a good idea? (No)

[Figure: a dataset in which every single-attribute split has zero information gain]

As shown in the figure above, the first node of the tree cannot even be chosen (every attribute has IG = 0); if rule 3 were adopted, the decision tree could not be built at all.

So we do not adopt this condition: when every IG is 0 and all attributes look equally useless (multiple maxima), we just pick one at random and keep building, and the data still ends up perfectly classified.

[Figure: continuing to split on an arbitrarily chosen attribute still classifies the data perfectly]

In other words, ID3 stops splitting only in the two cases above; if IG = 0 for every attribute, it simply picks one at random.

The hypothesis space searched by ID3

[Figure: the hypothesis space searched by ID3]

  • The hypothesis space is complete (it can represent both disjunctions and conjunctions of attributes)

         The target function is guaranteed to be in the hypothesis space

  • It maintains a single output hypothesis (it walks down a single path in the space of trees)

         No more than 20 questions (a rule of thumb: usually no more than 20 features; a tree that is too complex easily overfits)

  • No backtracking (once A1 is the root node, there is no way to go back and see how A2 would have done as the root)

         Locally optimal

  • It uses all the data of the current subset at each step (unlike, say, a gradient-descent weight-update strategy that uses only one data point at a time)

         Statistically based, data-driven search decisions

         Robust to noisy data

The inductive bias (Inductive Bias) of ID3

  • The hypothesis space H acts on the sample set X

         No restriction is placed on the hypothesis space

  • It prefers trees in which attributes with larger information gain sit closer to the root

         It tries to find the shortest tree

         The bias of the algorithm lies in a preference for certain hypotheses (a search bias), rather than a restriction on the hypothesis space H (a representation bias).

         Occam's Razor*: prefer the shortest hypothesis consistent with the data

CART (Classification And Regression Trees)

A general framework:

  • Build a decision tree from the training data
  • The decision tree gradually partitions the training set into smaller and smaller subsets
  • Stop splitting a subset when it is pure
  • Or accept an imperfect decision

Many decision tree algorithms follow this framework, including ID3, C4.5, and so on.

3. The overfitting problem

[Figure: three fits (a), (b), (c) of increasing complexity]

As pictured above, (b) is better than (a). But if, like (c), we fit every point perfectly, the training error is 0; then a new point arrives:

[Figure: a new data point that the over-fitted curve (c) gets wrong]

and now (c)'s error is large. It matches the training data too closely, so its generalization ability on unseen test instances drops.

An extreme example of decision tree overfitting:

  • Each leaf node corresponds to a single training sample, so every training sample is perfectly classified
  • The whole tree is then just a simple implementation of a data lookup algorithm

In other words, the tree has become a lookup table, comparable to a hash table in data structures. It can only look up data that is already in the table; it cannot handle anything else and has no generalization ability.


4. How to avoid overfitting

There are two ways to avoid overfitting in decision trees:

  • Stop growing when a data split is not statistically significant (e.g. too few examples): pre-pruning
  • Grow a complete tree first, then post-prune it

In practice, pre-pruning is generally faster, while post-pruning usually yields a more accurate tree.

For pre-pruning:

it is difficult to estimate the final size of the tree in advance.

Pre-pruning: based on the number of samples

Usually a node is not split further when:

  • the number of training samples reaching the node is less than a specific fraction of the training set (e.g. 5%),
  • regardless of its impurity or error rate
  • Reason: decisions based on too few data samples carry a large error (generalization error)

For example, with 100 training samples, if only 5 reach the current node, we do not split it further, no matter how the positives and negatives are distributed; the split has become too fine. See the sketch below.
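As a sketch, this check is just a guard at the top of the recursive build; the 5% figure is the example threshold from the text, not a fixed rule (the comparison is written with <= so that the 5-out-of-100 example above triggers the stop):

```python
def should_stop(node_sample_count, training_set_size, min_fraction=0.05):
    """Pre-pruning: stop splitting when too few samples reach this node."""
    return node_sample_count <= min_fraction * training_set_size

print(should_stop(5, 100))   # True: 5 of 100 samples, make this node a leaf
print(should_stop(40, 100))  # False: enough samples, keep splitting
```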

Pre-pruning: based on an information gain threshold

  • Set a small threshold and stop splitting when:

$$\Delta i(s) \leq \beta$$

  • Advantages: 1. all training data are used; 2. leaf nodes can sit at any level of the tree
  • Disadvantage: it is hard to set a good threshold

We split on the attribute with the largest IG; if even that largest IG is still small, i.e. $\Delta i(s) \leq \beta$, we stop.

For post-pruning:

• How do we choose the "best" tree?

         Measure its performance on a separate validation set

• MDL (Minimum Description Length):

        minimize( size(tree) + size(misclassifications(tree)) ), i.e. make the combined tree size and error smallest

Post-pruning: reduced-error pruning

• Split the data into a training set and a validation set

        • Validation set:

                • Labels are known

                • Used only to measure performance

                • Never update the model on the validation set!

• Prune until further pruning would hurt performance:

        • On the validation set, test the effect of cutting out each possible node (and the subtree rooted at it)

        • Greedily remove the node whose removal most improves accuracy on the validation set

After a node is cut, how do we define the label of the new leaf (the former parent)?

Label assignment strategies for the new leaf node after pruning

• Assign the most common category (with 60 data points, 50 positive and 10 negative, the leaf becomes +)

• Label the node with multiple categories

         Each category gets a support (e.g. 5/6 : 1/6), according to the count of each label in the training set

         At test time: pick a category according to the probabilities (5/6 : 1/6), or return multiple labels

• For a regression tree (numeric labels), take the average or a weighted average

• ……

The effect of reduced-error pruning

[Figure: the effect of reduced-error pruning]

Pruning proceeds from the bottom of the tree upward, guided by the validation set. A sketch of the procedure follows.
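Here is a sketch of reduced-error pruning over the dict-based trees produced by the ID3 sketch earlier, assuming (feature-dict, label) samples; the greedy loop keeps replacing a subtree with a majority-label leaf as long as validation accuracy does not drop. This is an illustration of the idea, not C4.5's actual procedure.

```python
import copy
from collections import Counter

def classify(tree, features):
    """Follow the dict tree (as in the ID3 sketch) down to a leaf label."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))
        tree = tree[attribute].get(features.get(attribute))
        if tree is None:                       # unseen value: no prediction
            return None
    return tree

def accuracy(tree, samples):
    return sum(classify(tree, f) == l for f, l in samples) / len(samples)

def internal_paths(tree, path=()):
    """Yield the (attribute, value, ...) path to every internal node."""
    if isinstance(tree, dict):
        yield path
        attribute = next(iter(tree))
        for value, child in tree[attribute].items():
            yield from internal_paths(child, path + (attribute, value))

def replaced_with_leaf(tree, path, label):
    """Copy of `tree` with the subtree at `path` collapsed into a leaf."""
    if not path:
        return label
    new_tree = copy.deepcopy(tree)
    node = new_tree
    *parents, (last_attr, last_val) = list(zip(path[0::2], path[1::2]))
    for attribute, value in parents:
        node = node[attribute][value]
    node[last_attr][last_val] = label
    return new_tree

def reduced_error_pruning(tree, train, validation):
    """Greedily prune while validation accuracy does not get worse."""
    improved = True
    while improved:
        improved = False
        best_acc, best_tree = accuracy(tree, validation), tree
        for path in internal_paths(tree):
            # Majority label of the training samples that reach this node.
            pairs = list(zip(path[0::2], path[1::2]))
            reaching = [l for f, l in train if all(f.get(a) == v for a, v in pairs)]
            if not reaching:
                continue
            label = Counter(reaching).most_common(1)[0][0]
            candidate = replaced_with_leaf(tree, path, label)
            cand_acc = accuracy(candidate, validation)
            if cand_acc >= best_acc:
                best_acc, best_tree, improved = cand_acc, candidate, True
        tree = best_tree
    return tree
```

Using `>=` rather than `>` means that, at equal validation accuracy, the smaller tree is preferred; every accepted prune strictly shrinks the tree, so the loop terminates.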

Post-pruning: rule post-pruning

1. Convert the tree into an equivalent set of rules

        • e.g. if (outlook=sunny) ^ (humidity=high) then playTennis = no

2. Prune each rule: remove any antecedent whose removal improves the rule's accuracy

        • e.g. (outlook=sunny) 60%, (humidity=high) 85%, (outlook=sunny) ^ (humidity=high) 80%

3. Sort the rules into a sequence (by their accuracy on the validation set, from high to low)

4. Classify new samples with this final rule sequence (check the rules one by one, going down the list until one of them applies)

(Note: once the rules have been pruned, they may no longer be recoverable as a tree.)

This is a widely used approach, e.g. in C4.5. A toy sketch of step 2 is shown below.
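The sketch below makes simple assumptions: a rule is a (list of (attribute, value) antecedents, label) pair, and an antecedent is dropped when doing so does not lower the rule's accuracy on a validation set (one could require a strict improvement instead). The data is made up for illustration.

```python
def rule_accuracy(antecedents, label, samples):
    """Accuracy of the rule on the samples its antecedents cover."""
    covered = [(f, l) for f, l in samples
               if all(f.get(a) == v for a, v in antecedents)]
    if not covered:
        return 0.0
    return sum(l == label for _, l in covered) / len(covered)

def prune_rule(antecedents, label, validation):
    """Greedily drop antecedents while validation accuracy does not drop."""
    improved = True
    while improved and antecedents:
        improved = False
        base = rule_accuracy(antecedents, label, validation)
        for i in range(len(antecedents)):
            shorter = antecedents[:i] + antecedents[i + 1:]
            if rule_accuracy(shorter, label, validation) >= base:
                antecedents, improved = shorter, True
                break
    return antecedents, label

# Rule extracted from the tree: if outlook=sunny and humidity=high then play = no
rule_body, rule_label = [("outlook", "sunny"), ("humidity", "high")], "no"
validation = [({"outlook": "sunny", "humidity": "high"},   "no"),
              ({"outlook": "sunny", "humidity": "normal"}, "no"),
              ({"outlook": "rain",  "humidity": "high"},   "yes")]
print(prune_rule(rule_body, rule_label, validation))
# ([('outlook', 'sunny')], 'no'): the humidity test adds nothing on this toy set
```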

Why convert the decision tree into rules before pruning?

• The pruning decision becomes context independent

        • Otherwise, when a subtree is considered for pruning, there are only two choices:

                • Delete the node (and its subtree) completely

                • Keep it as is

• No need to distinguish between the root node and leaf nodes

• Improves readability

5. Extension: decision tree learning in real-world settings

1. Continuous attribute values

[Figure: a continuous attribute (temperature) whose decision changes between 48 and 60 and between 80 and 90]

• Create discrete attribute values (intervals) so that branches can be built

• Possible strategies:

        • I. Choose the midpoint between adjacent values that lead to different decisions

             In the figure above, the decision changes between 48 and 60 and between 80 and 90, so we can take the midpoints

            (Fayyad showed in 1991 that a threshold satisfying this condition maximizes the information gain IG)

        • II. Take probability into account: $x_s = (1-P)\,x_l + P\,x_u$

              As with the probability density of a normal distribution, the interval lengths can differ while each interval covers the same area / number of samples. A sketch of strategy I follows.
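A sketch of strategy I: sort the values, and take the midpoint wherever two adjacent values carry different decisions. The 48/60 and 80/90 boundaries are the ones mentioned above; the remaining numbers and labels are filler for illustration.

```python
def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values whose decisions differ."""
    pairs = sorted(zip(values, labels))
    return [(v1 + v2) / 2
            for (v1, l1), (v2, l2) in zip(pairs, pairs[1:])
            if l1 != l2 and v1 != v2]

temperature = [40, 48, 60, 72, 80, 90]
decision    = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(candidate_thresholds(temperature, decision))  # [54.0, 85.0]
```

Each candidate threshold then defines a Boolean attribute (e.g. temperature > 54) whose information gain can be evaluated like any other attribute.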

2. Attributes with too many values

Problem:

• Bias: if an attribute has many values, selection by information gain IG will tend to pick it first

        • e.g. in the EnjoySport example, taking the date (each day of the year) as an attribute (which again turns the tree into a lookup table)

• One possible solution: use GainRatio (gain ratio) instead

[Figure: the GainRatio definition]
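For reference, the standard definition used by C4.5 (presumably what the figure above shows): the gain is divided by the "split information", i.e. the entropy of S with respect to the values of A itself, which is large for attributes with many nearly unique values such as the date.

$$SplitInformation(S, A) = -\sum_{i=1}^{c} \frac{|S_i|}{|S|}\log_2\frac{|S_i|}{|S|}, \qquad GainRatio(S, A) = \frac{Gain(S, A)}{SplitInformation(S, A)}$$

Here $S_1, \dots, S_c$ are the subsets of S corresponding to the c values of attribute A.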

3. Unknown attribute values

[Figure: a training set with missing attribute values]

When some feature values are unknown, we can fill in the most common value, or assign a value according to the observed probabilities. A minimal sketch of the first option follows.
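This sketch fills missing values with the most common observed value; the probabilistic variant would instead distribute the sample across values in proportion to their frequencies. The representation (None for a missing value) is my own choice for the example:

```python
from collections import Counter

def fill_missing(samples, attribute):
    """Replace missing values of `attribute` (None) with the most common observed value."""
    observed = [f[attribute] for f in samples if f.get(attribute) is not None]
    most_common = Counter(observed).most_common(1)[0][0]
    for f in samples:
        if f.get(attribute) is None:
            f[attribute] = most_common
    return samples

days = [{"outlook": "sunny"}, {"outlook": "rain"}, {"outlook": None}, {"outlook": "sunny"}]
print(fill_missing(days, "outlook"))  # the missing entry becomes "sunny"
```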

4. Attributes with costs

[Figure: cost-sensitive attribute selection measures]

That is, when there are many features, the IG of an attribute can be divided by a penalty term, e.g. divided by the attribute's measurement cost...

6. Other information

• Decision trees are probably the simplest and most frequently used algorithm

        • Easy to understand

        • Easy to implement

        • Easy to use

        • Low computational cost

• Decision forests:

        • Generate many decision trees with C4.5

• A newer algorithm: C5.0  http://www.rulequest.com/see5-info.html

• Ross Quinlan's home page: http://www.rulequest.com/Personal/

Copyright notice: this article was written by [Teacher, I forgot my homework]. Please include a link to the original when reposting: https://yzsam.com/2022/189/202207072310193695.html