Machine learning 3.2 decision tree model learning notes (to be supplemented)
2022-07-03 11:27:00 【HeartFireY】
3.2.1 Model structure
The decision tree model abstracts the reasoning process into a series of judgments and decisions on known data attributes. A tree structure expresses the logic of these judgments and the relationships among them, and the results of the judgments or decisions are expressed by leaf nodes.

As shown in the figure above, the leaf nodes of the tree represent final results, while the non-leaf nodes represent decision points, i.e., judgments on some attribute of the sample. It is not hard to see that a decision tree is an out-tree (arborescence) in which each internal node directly or indirectly affects the final decision. The path from the root node to a leaf node is called a test sequence.
We can describe the construction of a decision tree model as the following process (a minimal sketch of the recursion follows the list):
- According to some splitting criterion, select the optimal splitting feature, create a partition node for that feature, and divide the data set into several sub-datasets according to the partition node;
- Apply the splitting criterion again on each sub-dataset to construct a new node as a new branch of the tree;
- When generating the sub-datasets, check whether all the data belong to the same class or whether no attribute is left for further division. If the data can still be divided, recursively repeat the above process; otherwise, make the current node a leaf node and end the construction of that branch.
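A minimal sketch of this recursive procedure, assuming each sample is a (features: dict, label) pair and that `choose_best_attribute` implements one of the splitting criteria described in Section 3.2.2 (both the data representation and the helper name are assumptions for illustration):

```python
from collections import Counter

def build_tree(dataset, attributes, choose_best_attribute):
    """Recursively build a decision tree; `dataset` is a list of (features: dict, label) pairs."""
    labels = [label for _, label in dataset]
    # Stop 1: all samples share the same label -> leaf node.
    if len(set(labels)) == 1:
        return {"leaf": True, "label": labels[0]}
    # Stop 2: no attribute left to divide on -> leaf node with the majority label.
    if not attributes:
        return {"leaf": True, "label": Counter(labels).most_common(1)[0][0]}

    best = choose_best_attribute(dataset, attributes)        # attribute of the partition node
    node = {"leaf": False, "attribute": best, "children": {}}

    # Divide the data set into sub-datasets by the chosen attribute and recurse on each branch.
    remaining = [a for a in attributes if a != best]
    for value in {features[best] for features, _ in dataset}:
        subset = [(f, y) for f, y in dataset if f[best] == value]
        node["children"][value] = build_tree(subset, remaining, choose_best_attribute)
    return node
```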
3.2.2 Criteria
The key to constructing a decision tree is to reasonably choose the sample attribute associated with each of its nodes, so that the samples in the subset corresponding to the node belong to the same class as much as possible, i.e., so that the subset has the highest possible purity.
The splitting criteria of a decision tree model need to satisfy the following two points:
- If the samples in the data subset of a node essentially belong to the same class, there is no need to divide that subset further; otherwise, the subset must be divided further and a new criterion generated;
- If the new criterion can essentially separate the different classes of data at the node, so that each child node contains data of a single class, then it is a good criterion; otherwise, a new criterion must be selected.
How can the criterion for selecting a decision tree's splitting attributes be quantified?
We generally use information entropy to quantitatively analyze the purity of a sample set. Information entropy is a measure from information theory that quantifies the uncertainty of a random variable; it is mainly used to describe the disorder of data values, and the larger the entropy, the more disordered the values.
Suppose $\xi$ is a discrete random variable with $n$ possible values $\{s_1, s_2, \dots, s_n\}$ and probability distribution $P(\xi = s_i) = p_i$. Its information entropy is then defined as:
$$H(\xi) = -\sum_{i=1}^{n} p_i \log_2 p_i$$
The larger the value of $H(\xi)$, the greater the uncertainty of the random variable $\xi$.
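As a small illustration (a sketch, not tied to any particular library), the entropy can be computed directly from a probability vector:

```python
import math

def entropy(probabilities):
    """Information entropy H = -sum(p_i * log2 p_i); terms with p_i = 0 contribute nothing."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))   # 1.0   (maximum uncertainty for two outcomes)
print(entropy([0.9, 0.1]))   # ~0.47 (more ordered, lower entropy)
print(entropy([1.0]))        # 0.0   (a certain outcome carries no information)
```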
For any two given discrete random variables $\xi$ and $\eta$, let their joint probability distribution be:
$$P(\xi = s_i, \eta = t_j) = p_{ij}, \quad i = 1, 2, \dots, n;\ j = 1, 2, \dots, m$$
Then the conditional entropy $H(\eta \mid \xi)$ of $\eta$ with respect to $\xi$ quantifies the uncertainty of the value of the random variable $\eta$ given the value of the random variable $\xi$. It is calculated as the mathematical expectation, over the random variable $\xi$, of the entropy of the conditional probability distribution of $\eta$, that is:
$$H(\eta \mid \xi) = \sum_{i=1}^{n} p_i H(\eta \mid \xi = s_i)$$
where $p_i = P(\xi = s_i),\ i = 1, 2, \dots, n$.
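A minimal sketch of this expectation, computed from a joint distribution table (the joint probabilities below are made up for the example):

```python
import math

def entropy(probs):
    return sum(-p * math.log2(p) for p in probs if p > 0)

def conditional_entropy(joint):
    """H(eta | xi) = sum_i P(xi = s_i) * H(eta | xi = s_i), with joint[i][j] = p_ij."""
    h = 0.0
    for row in joint:                          # one row per value s_i of xi
        p_i = sum(row)                         # marginal P(xi = s_i)
        if p_i > 0:
            h += p_i * entropy([p / p_i for p in row])   # entropy of eta's conditional distribution
    return h

# Hypothetical 2x2 joint distribution (rows: values of xi, columns: values of eta).
joint = [[0.4, 0.1],
         [0.1, 0.4]]
print(conditional_entropy(joint))              # ~0.72, versus H(eta) = 1.0 when xi is unknown
```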
For any given training sample set $D$, $D$ can be regarded as a random variable over the value states of the sample labels. Following the essential meaning of entropy, a quantitative index $H(D)$ can therefore be defined to measure the purity of the sample classes in $D$; $H(D)$ is usually called the empirical entropy. The larger the value of $H(D)$, the more disordered the label values of the samples in $D$; conversely, the smaller the value, the purer the label values. The formula for $H(D)$ is:
$$H(D) = -\sum_{k=1}^{n} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}$$
where $n$ denotes the number of value states of the sample label, $C_k$ denotes the set of training samples in $D$ whose label takes the $k$-th value, and $|D|$ and $|C_k|$ denote the cardinalities of the sets $D$ and $C_k$, respectively.
For any attribute $A$ on the training sample set $D$, a further quantitative index $H(D|A)$ can be defined on top of the empirical entropy $H(D)$ to measure the purity of the samples in $D$ after they are divided according to attribute $A$; $H(D|A)$ is usually called the empirical conditional entropy of the set $D$ with respect to the attribute $A$. It is calculated as follows:
$$H(D|A) = \sum_{i=1}^{m} \frac{|D_i|}{|D|} H(D_i)$$
where $m$ denotes the number of value states of attribute $A$, and $D_i$ denotes the subset generated by dividing $D$ according to attribute $A$, i.e., the set of samples in $D$ whose attribute $A$ takes its $i$-th value.
From the definition of the empirical conditional entropy $H(D|A)$, the smaller its value, the higher the purity of the sample set $D$ after division. On the basis of the empirical entropy, we can further quantify how much using each attribute as the partition criterion changes the empirical entropy of the data set; this quantity is the information gain. The information gain measures the degree to which knowing the random variable $\xi$ reduces the uncertainty about the random variable $\eta$.
For any given training sample set $D$ and an attribute $A$ on it, the information gain $G(D, A)$ of attribute $A$ with respect to the set $D$ is defined as the difference between the empirical entropy $H(D)$ and the conditional empirical entropy $H(D|A)$, that is:
$$G(D, A) = H(D) - H(D|A)$$
Obviously, the larger the information gain $G(D, A)$ of attribute $A$ with respect to the set $D$, the purer the sample subsets obtained by dividing on $A$, and the stronger the classification ability of the resulting decision tree model. Therefore, $G(D, A)$ can be used as the standard for selecting suitable splitting attributes, and the decision tree can be constructed by recursively computing the information gain of the sample attributes.
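Putting these definitions together, here is a sketch that computes $H(D)$, $H(D|A)$ and $G(D, A)$ for a toy data set (the samples and attribute names are made up for illustration):

```python
import math
from collections import Counter

def empirical_entropy(labels):
    """H(D) computed from the label column of a data set."""
    total = len(labels)
    return sum(-c / total * math.log2(c / total) for c in Counter(labels).values())

def information_gain(dataset, attribute):
    """G(D, A) = H(D) - H(D|A), where each sample is a (features: dict, label) pair."""
    labels = [y for _, y in dataset]
    h_d = empirical_entropy(labels)
    h_d_a = 0.0                                      # conditional empirical entropy H(D|A)
    for value in {f[attribute] for f, _ in dataset}:
        subset = [y for f, y in dataset if f[attribute] == value]
        h_d_a += len(subset) / len(dataset) * empirical_entropy(subset)
    return h_d - h_d_a

# Hypothetical toy data: the label coincides with the weather but is independent of the budget.
data = [
    ({"weather": "sunny", "budget": "high"}, "yes"),
    ({"weather": "sunny", "budget": "low"},  "yes"),
    ({"weather": "rainy", "budget": "high"}, "no"),
    ({"weather": "rainy", "budget": "low"},  "no"),
]
print(information_gain(data, "weather"))   # 1.0 -> "weather" separates the classes perfectly
print(information_gain(data, "budget"))    # 0.0 -> "budget" carries no information about the label
```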
When the information gain is used as the standard for selecting splitting attributes, the selection tends to be biased towards attributes with more value states. The simplest way to address this problem is to normalize the information gain; the information gain ratio can be regarded as such a normalized measure. It introduces a correction factor on top of the information gain to eliminate the interference that differences in the number of attribute values cause in the computed results. Specifically, for any given training sample set $D$ and an attribute $A$ on it, the information gain ratio $G_r(D, A)$ of attribute $A$ with respect to the set $D$ is defined as:
$$G_r(D, A) = \frac{G(D, A)}{Q(A)}$$
where $Q(A)$ is the correction factor, calculated by the following formula:
$$Q(A) = -\sum_{i=1}^{m} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}$$
Obviously, the more value states $m$ attribute $A$ has, the larger $Q(A)$ tends to be, which reduces the influence of the information gain's bias toward such attributes on the structure of the decision tree.
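The correction factor $Q(A)$ is simply the entropy of attribute $A$'s own value distribution, so the gain ratio can be sketched as follows (self-contained, with hypothetical data; the ID-like column is the classic case this correction is meant to penalize):

```python
import math
from collections import Counter

def column_entropy(values):
    """Entropy of the empirical distribution of a list of discrete values."""
    total = len(values)
    return sum(-c / total * math.log2(c / total) for c in Counter(values).values())

def gain_ratio(dataset, attribute):
    """G_r(D, A) = G(D, A) / Q(A); each sample is a (features: dict, label) pair."""
    labels = [y for _, y in dataset]
    h_d = column_entropy(labels)                                   # H(D)
    h_d_a = 0.0                                                    # H(D|A)
    for value in {f[attribute] for f, _ in dataset}:
        subset = [y for f, y in dataset if f[attribute] == value]
        h_d_a += len(subset) / len(dataset) * column_entropy(subset)
    q_a = column_entropy([f[attribute] for f, _ in dataset])       # Q(A)
    return (h_d - h_d_a) / q_a if q_a > 0 else 0.0

# Hypothetical data: "id" is unique per sample, "color" has only two states.
data = [({"id": i, "color": "red" if i < 2 else "blue"}, "yes" if i < 2 else "no")
        for i in range(4)]
print(gain_ratio(data, "id"))      # 0.5 (gain 1.0 tempered by Q = 2.0)
print(gain_ratio(data, "color"))   # 1.0 (gain 1.0 with Q = 1.0)
```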
For the decision tree model, the Gini index can also be used to select the optimal attribute as the partition standard. Similar to information entropy, the Gini index measures the purity of a data set. For a classification problem with $m$ classes, suppose the probability that a sample point belongs to the $k$-th class is $p_k$. Then the Gini index of this probability distribution $p$ can be defined as
$$Gini(p) = \sum_{k=1}^{m} p_k (1 - p_k)$$
that is:
$$Gini(p) = 1 - \sum_{k=1}^{m} p_k^2$$
Correspondingly, for any given sample set $D$, its Gini index can be defined as:
$$Gini(D) = 1 - \sum_{k=1}^{m} \left( \frac{|C_k|}{|D|} \right)^2$$
where $C_k$ is the subset of samples in $D$ that belong to the $k$-th class, and $m$ is the number of classes.
The Gini index of the sample set $D$ expresses the probability that a sample drawn at random from $D$ is misclassified when it is labeled at random according to the class distribution of $D$. Obviously, the smaller the value of $Gini(D)$, the higher the purity of the samples in the data set $D$, or in other words, the more consistent the sample classes in $D$.
If the sample set $D$ is divided into two parts $D_1$ and $D_2$ according to whether attribute $A$ takes a possible value $a$, namely
$$D_1 = \{(X, y) \in D \mid A(X) = a\}, \quad D_2 = D - D_1$$
then, under the condition of splitting on attribute $A$, the Gini index of the set $D$ can be defined as:
$$Gini(D, A) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2)$$
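A short sketch of these two quantities, using the same hypothetical (features, label) representation as the earlier examples:

```python
from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum_k (|C_k| / |D|)^2."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_index(dataset, attribute, value):
    """Gini(D, A): split D into D1 (A == value) and D2 (the rest), then take the weighted sum."""
    d1 = [y for f, y in dataset if f[attribute] == value]
    d2 = [y for f, y in dataset if f[attribute] != value]
    n = len(dataset)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

# Hypothetical toy data: the label is "yes" exactly when the weather is sunny.
data = [
    ({"weather": "sunny"},  "yes"),
    ({"weather": "sunny"},  "yes"),
    ({"weather": "rainy"},  "no"),
    ({"weather": "cloudy"}, "no"),
]
print(gini([y for _, y in data]))            # 0.5 (a maximally impure two-class set)
print(gini_index(data, "weather", "sunny"))  # 0.0 (the split produces two pure subsets)
```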
3.2.3 Model construction (placeholder)
Based on the above splitting criteria, there are three corresponding decision-tree generation algorithms:
- Based on information gain: the ID3 algorithm
- Based on the information gain ratio: the C4.5 algorithm
- Based on the Gini index: the CART algorithm
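In practice these criteria are available off the shelf. For example, scikit-learn's `DecisionTreeClassifier` accepts `criterion="entropy"` (information-gain-style impurity) and `criterion="gini"`, although its underlying tree builder is a CART-style implementation rather than a literal ID3 or C4.5. A minimal usage sketch, assuming scikit-learn is installed:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="entropy" splits by information gain; criterion="gini" splits by the Gini index.
for criterion in ("entropy", "gini"):
    clf = DecisionTreeClassifier(criterion=criterion, max_depth=3, random_state=0)
    clf.fit(X_train, y_train)
    print(criterion, clf.score(X_test, y_test))
```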
To be supplemented.