Chapter 4 decision tree and random forest
2022-07-27 03:58:00 【Sang zhiweiluo 0208】
Contents
1 Information entropy
2 Decision tree learning algorithm
3 Information gain ratio and Gini index
1 Information entropy
1.1 Entropy
Entropy can be understood as the expected amount of uncertainty in a probability distribution. The larger the entropy, the more uncertain the distribution, the less "information" it gives us, and the harder it is to make a correct judgment from it. Conversely, the more concentrated the probability (the more certain the outcome), the smaller the entropy.
Expression:
$$H(X) = -\sum_{x} p(x)\log p(x)$$
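As a quick illustration (not from the original post), here is a minimal Python sketch of this formula; the two example distributions are made up.

```python
import math

def entropy(probs):
    """Entropy H(X) = -sum_x p(x) * log2 p(x) of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A uniform distribution is maximally uncertain; a peaked one is nearly certain.
print(entropy([0.5, 0.5]))    # 1.0 bit
print(entropy([0.99, 0.01]))  # ~0.08 bits: almost no uncertainty
```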
1.2 Joint entropy
The total entropy contained in the pair (X, Y).
Expression:
$$H(X,Y) = -\sum_{x,y} p(x,y)\log p(x,y)$$
1.3 Conditional entropy
The entropy contained in (X, Y) minus the entropy of X alone, i.e. the "new" entropy contributed by Y once X is known.
Expression:
$$H(Y|X) = H(X,Y) - H(X)$$
or
$$H(Y|X) = -\sum_{x,y} p(x,y)\log p(y|x)$$
Derivation:
$$\begin{aligned} H(X,Y) - H(X) &= -\sum_{x,y} p(x,y)\log p(x,y) + \sum_{x} p(x)\log p(x) \\ &= -\sum_{x,y} p(x,y)\log p(x,y) + \sum_{x}\Big(\sum_{y} p(x,y)\Big)\log p(x) \\ &= -\sum_{x,y} p(x,y)\big(\log p(x,y) - \log p(x)\big) \\ &= -\sum_{x,y} p(x,y)\log p(y|x) = H(Y|X) \end{aligned}$$
1.4 Relative entropy
Relative entropy is also called discrimination information, Kullback entropy, or Kullback-Leibler (KL) divergence. (It is sometimes loosely referred to as cross entropy, but strictly speaking cross entropy differs from it by the entropy of p: H(p, q) = H(p) + D(p‖q).)
Let p(x) and q(x) be two probability distributions over the values of X. The relative entropy of p with respect to q is
$$D(p\|q) = \sum_{x} p(x)\log\frac{p(x)}{q(x)}$$
Note: relative entropy can measure the "distance" between two distributions. Suppose we want to minimize the relative entropy $D(\tilde p\|q)$ between the empirical distribution $\tilde p$ of the samples and a model distribution $q$:
$$D(\tilde p\|q) = \sum_{x}\tilde p(x)\log\tilde p(x) - \sum_{x}\tilde p(x)\log q(x)$$
The first term does not depend on q, so it suffices to make $\sum_{x}\tilde p(x)\log q(x)$ as large as possible. Writing the samples as $x_1,\dots,x_N$, this term is $\frac{1}{N}\sum_{i=1}^{N}\log q(x_i)$, so maximizing it is exactly maximum likelihood estimation.
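To make this concrete, here is a minimal Python sketch (added for illustration, with made-up distributions) of the relative entropy formula above; note that D(p‖q) is not symmetric, so it is not a true distance.

```python
import math

def kl_divergence(p, q):
    """Relative entropy D(p || q) = sum_x p(x) * log(p(x) / q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, p))  # 0.0: the two distributions coincide
print(kl_divergence(p, q))  # > 0
print(kl_divergence(q, p))  # generally != D(p || q): not symmetric
```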
1.5 Mutual information
The mutual information of two random variables X and Y is defined as the relative entropy between their joint distribution and the product of their marginal (independent) distributions.
It can also be viewed as the difference between H(Y) and H(Y|X).
Expression:
$$I(X,Y) = D\big(p(x,y)\,\|\,p(x)p(y)\big) = \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)p(y)}$$
or:
$$I(X,Y) = H(Y) - H(Y|X)$$
or:
$$I(X,Y) = H(X) + H(Y) - H(X,Y)$$
It measures the "distance" between the joint distribution and the product of the marginal distributions; if X and Y are independent, the mutual information is 0.
Derivation of the second formula:
$$I(X,Y) = \sum_{x,y} p(x,y)\log\frac{p(y|x)}{p(y)} = -\sum_{x,y} p(x,y)\log p(y) + \sum_{x,y} p(x,y)\log p(y|x) = H(Y) - H(Y|X)$$
Derivation of the third formula:
$$I(X,Y) = \sum_{x,y} p(x,y)\big(\log p(x,y) - \log p(x) - \log p(y)\big) = -H(X,Y) + H(X) + H(Y)$$
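A small Python sketch (added for illustration; the joint distribution is hypothetical) that computes I(X, Y) directly from the first formula and checks it against the second one:

```python
import math

def h(probs):
    """Entropy of a discrete distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint distribution p(x, y) over X in {0, 1} and Y in {0, 1}.
pxy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginal distributions p(x) and p(y).
px = {x: sum(p for (xx, _), p in pxy.items() if xx == x) for x in (0, 1)}
py = {y: sum(p for (_, yy), p in pxy.items() if yy == y) for y in (0, 1)}

# First formula: I(X,Y) = sum p(x,y) * log( p(x,y) / (p(x) p(y)) )
mi = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in pxy.items() if p > 0)

# Second formula: I(X,Y) = H(Y) - H(Y|X), with H(Y|X) = H(X,Y) - H(X)
h_y_given_x = h(pxy.values()) - h(px.values())
print(mi, h(py.values()) - h_y_given_x)  # the two values agree (~0.1245 bits)
```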

1.6 Venn diagram
(The Venn diagram here illustrated the relationships among H(X), H(Y), H(X,Y), H(X|Y), H(Y|X), and the mutual information I(X,Y).)
2 Decision tree learning algorithm
2.1 Information gain
When the probabilities in entropy and conditional entropy are estimated from data (in particular by maximum likelihood estimation), the resulting quantities are called the empirical entropy and the empirical conditional entropy.
Information gain represents the degree to which knowing feature A reduces the uncertainty about the class of the data.
Definition: the information gain g(D,A) of feature A with respect to training dataset D is defined as the difference between the empirical entropy H(D) of the set D and the empirical conditional entropy H(D|A) of D given feature A, i.e.
$$g(D,A) = H(D) - H(D|A)$$
Obviously, this is the mutual information of the training dataset D and the feature A.
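A minimal Python sketch of this computation (the tiny dataset and function names are made up for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    """Empirical entropy H(D) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """g(D, A) = H(D) - H(D|A) for one discrete feature A."""
    n = len(labels)
    h_d_given_a = 0.0
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        h_d_given_a += len(subset) / n * entropy(subset)
    return entropy(labels) - h_d_given_a

# Toy data: one feature A and the class labels (both made up).
A = ['sunny', 'sunny', 'rain', 'rain', 'rain', 'sunny']
y = ['no',    'no',    'yes',  'yes',  'no',   'no']
print(information_gain(A, y))  # ~0.459 bits
```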
2.2 ID3, C4.5, CART
ID3: uses the decrease in information entropy (i.e. the information gain) as the criterion for selecting the test attribute: at each node, the attribute with the highest information gain among those not yet used for splitting is chosen as the splitting criterion, and this process is repeated until the generated decision tree classifies the training samples perfectly.
C4.5: the C4.5 algorithm is an extension of ID3. It uses the information gain ratio instead of the information gain.
CART: the CART algorithm is also an extension of the ID3 idea. It uses the Gini index as its splitting criterion.
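For reference, scikit-learn exposes these impurity criteria directly (note that scikit-learn always builds CART-style binary trees and only the criterion changes). The sketch below assumes scikit-learn is installed and uses its bundled iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Entropy criterion (information gain, as in ID3/C4.5)
tree_entropy = DecisionTreeClassifier(criterion="entropy").fit(X, y)
# Gini criterion (as in CART) -- scikit-learn's default
tree_gini = DecisionTreeClassifier(criterion="gini").fit(X, y)
# Random forest: an ensemble of randomized trees
forest = RandomForestClassifier(n_estimators=100).fit(X, y)

print(tree_entropy.score(X, y), tree_gini.score(X, y), forest.score(X, y))
```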
3 Information gain ratio and Gini index
3.1 Definition
The information gain ratio of feature A with respect to dataset D is the information gain divided by the entropy of D with respect to the values of A:
$$g_R(D,A) = \frac{g(D,A)}{H_A(D)},\qquad H_A(D) = -\sum_{i=1}^{n}\frac{|D_i|}{|D|}\log_2\frac{|D_i|}{|D|}$$
where D_1, ..., D_n are the subsets of D determined by the n distinct values of feature A.
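A minimal Python sketch of the gain ratio (reusing the same toy data as the information-gain sketch above; helper names are illustrative). Dividing by H_A(D) penalizes features with many distinct values:

```python
import math
from collections import Counter

def entropy(values):
    """Entropy of the empirical distribution of a list of values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(feature_values, labels):
    """g_R(D, A) = g(D, A) / H_A(D)."""
    n = len(labels)
    h_d_given_a = 0.0
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        h_d_given_a += len(subset) / n * entropy(subset)
    gain = entropy(labels) - h_d_given_a   # g(D, A), as in section 2.1
    split_info = entropy(feature_values)   # H_A(D): entropy of the feature itself
    return gain / split_info

A = ['sunny', 'sunny', 'rain', 'rain', 'rain', 'sunny']
y = ['no',    'no',    'yes',  'yes',  'no',   'no']
print(gain_ratio(A, y))  # ~0.459 / 1.0
```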

3.2 Discussion of the Gini index
First definition (for a probability distribution over K classes):
$$\mathrm{Gini}(p) = \sum_{k=1}^{K} p_k(1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2$$
Second definition (for a sample set D, where C_k is the subset of D belonging to class k):
$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K}\left(\frac{|C_k|}{|D|}\right)^2$$
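A minimal Python sketch of the sample-set form (the label list is made up):

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_k (|C_k| / |D|)^2 for a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

labels = ['yes', 'yes', 'no', 'no', 'no']
print(gini(labels))       # 1 - (0.4^2 + 0.6^2) = 0.48
print(gini(['yes'] * 5))  # 0.0: a pure node has zero Gini impurity
```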
