
[Statistical Learning Methods] Learning Notes - Boosting Methods

2022-07-07 12:26:00 Sickle leek


Boosting is a common, widely used, and effective statistical learning method. In classification problems, it changes the weights of the training samples, learns multiple classifiers, and combines them linearly to improve classification performance.

(1) Introduce the idea of the boosting method and the representative boosting algorithm, AdaBoost;
(2) Explain, through an analysis of the training error, why AdaBoost can improve learning accuracy;
(3) Interpret AdaBoost from the perspective of the additive model and the forward stagewise algorithm;
(4) Finally, discuss a more concrete example of a boosting method: the boosting tree.

1. Boosting and the AdaBoost Algorithm

The basic idea of boosting: **for a complex task, combining the judgments of multiple experts gives a better judgment than that of any single expert alone.** This is essentially the idea behind the saying "three cobblers with their wits combined equal Zhuge Liang" (many heads are better than one).

Strongly learnable: in the probably approximately correct (PAC) learning framework, a concept is said to be strongly learnable if there exists a polynomial-time learning algorithm that can learn it with high accuracy.

Weakly learnable: a concept is said to be weakly learnable if there exists a polynomial-time learning algorithm that can learn it with accuracy only slightly better than random guessing.

Strong learnability and weak learnability: Schapire proved that strong learnability and weak learnability are equivalent; that is, in the PAC learning framework, a concept is strongly learnable if and only if it is weakly learnable.

From weak learning to strong learning: a weak learner can be "boosted" into a strong learner. Weak learning algorithms are usually much easier to obtain than strong ones, so how to carry out this boosting is the problem a boosting method must solve. Many boosting algorithms have been proposed; the most representative is AdaBoost.

**A boosting method starts from a weak learning algorithm and learns repeatedly, obtaining a sequence of weak classifiers (also called base classifiers), and then combines these weak classifiers into a strong classifier.** Most boosting methods change the probability distribution of the training data (the weight distribution of the training data) and call the weak learning algorithm on these different distributions to learn a sequence of weak classifiers. This raises two questions:
(1) How to change the weights or probability distribution of the training data in each round?
(2) How to combine the weak classifiers into a strong classifier?

AdaBoost answers these two questions as follows:
(1) Increase the weights of the samples misclassified by the previous round's weak classifier and decrease the weights of the correctly classified samples. The misclassified data, with their increased weights, thus receive more attention from the weak classifiers in later rounds.
(2) Use a weighted majority vote: increase the weight of a weak classifier with a small classification error so that it plays a larger role in the vote, and decrease the weight of a weak classifier with a large classification error so that it plays a smaller role.


Algorithm: AdaBoost
Input: training data set $T=\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$, where $x_i\in\mathcal{X}\subseteq R^n$, $y_i\in\{-1,+1\}$; a weak learning algorithm.
Output: the final classifier $G(x)$.
(1) Initialize the weight distribution of the training data:
$D_1=(w_{11},\dots,w_{1i},\dots,w_{1N}),\quad w_{1i}=\frac{1}{N},\quad i=1,2,\dots,N$
Here the data are assumed to have a uniform weight distribution, so every sample plays the same role when the first base classifier is learned.
(2) For $m=1,2,\dots,M$:
(a) Learn from the training data with weight distribution $D_m$, obtaining the base classifier $G_m(x):\mathcal{X}\to\{+1,-1\}$.
(b) Compute the classification error rate of $G_m(x)$ on the training set:
$e_m=\sum_{i=1}^N P(G_m(x_i)\ne y_i)=\sum_{i=1}^N w_{mi} I(G_m(x_i)\ne y_i)=\sum_{G_m(x_i)\ne y_i}w_{mi}$
Here $w_{mi}$ is the weight of the $i$-th instance in round $m$, and $\sum_{i=1}^N w_{mi}=1$. This shows that the classification error rate of $G_m(x)$ on the weighted training data set is the sum of the weights of the samples misclassified by $G_m(x)$.
(c) Compute the coefficient of $G_m(x)$: $\alpha_m=\frac{1}{2}\log\frac{1-e_m}{e_m}$, where the logarithm is the natural logarithm. $\alpha_m$ expresses the importance of $G_m(x)$ in the final classifier. When $e_m\le 0.5$ we have $\alpha_m\ge 0$, and $\alpha_m$ increases as $e_m$ decreases, so a base classifier with a smaller classification error plays a larger role in the final classifier.
(d) Update the weight distribution of the training data in preparation for the next round:
$D_{m+1}=(w_{m+1,1},\dots,w_{m+1,i},\dots,w_{m+1,N})$
$w_{m+1,i}=\begin{cases}\frac{w_{mi}}{Z_m}e^{-\alpha_m}, & G_m(x_i)=y_i\\ \frac{w_{mi}}{Z_m}e^{\alpha_m}, & G_m(x_i)\ne y_i\end{cases}$
$Z_m=\sum_{i=1}^N w_{mi}\exp(-\alpha_m y_i G_m(x_i))$
where $Z_m$ is the normalization factor.
The weights of misclassified samples grow and the weights of correctly classified samples shrink; relative to each other they are scaled by a factor of $e^{2\alpha_m}=\frac{1-e_m}{e_m}$.
(3) Construct a linear combination of the base classifiers:
$f(x)=\sum_{m=1}^M \alpha_m G_m(x)$
The final classifier is
$G(x)=\mathrm{sign}(f(x))$
Note that the $\alpha_m$ do not sum to 1.
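
The following is a minimal sketch of the algorithm above, assuming numpy and scikit-learn are available and using a depth-1 decision tree (a stump) as the weak learner; the function and variable names (adaboost_fit, stumps, alphas) are illustrative, not from the book.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=20):
    """Minimal AdaBoost sketch; X, y are numpy arrays and y takes values in {-1, +1}."""
    N = len(y)
    w = np.full(N, 1.0 / N)              # (1) uniform initial weights D_1
    stumps, alphas = [], []
    for m in range(M):
        # (a) learn base classifier G_m on the weighted data
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        # (b) weighted classification error e_m
        e = np.sum(w[pred != y])
        e = np.clip(e, 1e-10, 1 - 1e-10)  # guard against division by zero / log(0)
        # (c) coefficient alpha_m = 0.5 * ln((1 - e_m) / e_m)
        alpha = 0.5 * np.log((1 - e) / e)
        # (d) re-weight the samples and normalize by Z_m
        w = w * np.exp(-alpha * y * pred)
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(X, stumps, alphas):
    # (3) sign of the linear combination f(x) = sum_m alpha_m G_m(x)
    f = sum(a * s.predict(X) for s, a in zip(stumps, alphas))
    return np.sign(f)
```

Step (d) is written here in the compact form $w_{m+1,i}\propto w_{mi}\exp(-\alpha_m y_i G_m(x_i))$, which is equivalent to the two-case update above.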


2. Training Error Analysis of the AdaBoost Algorithm

The most basic property of AdaBoost is that it keeps reducing the training error during learning.
Theorem 8.1 (training error bound of AdaBoost): the training error of the final classifier of AdaBoost is bounded by
$\frac{1}{N}\sum_{i=1}^{N} I(G(x_i)\ne y_i)\le\frac{1}{N}\sum_i \exp(-y_i f(x_i))=\prod_m Z_m$

Theorem 8.2 (training error bound of AdaBoost for binary classification):
$\prod_{m=1}^M Z_m=\prod_{m=1}^{M}\left[2\sqrt{e_m(1-e_m)}\right]=\prod_{m=1}^{M}\sqrt{1-4\gamma_m^2}\le\exp\left(-2\sum_{m=1}^{M}\gamma_m^2\right)$
where $\gamma_m=\frac{1}{2}-e_m$.

Corollary 8.1: if there exists $\gamma>0$ such that $\gamma_m\ge\gamma$ for all $m$, then
$\frac{1}{N}\sum_{i=1}^{N} I(G(x_i)\ne y_i)\le\exp(-2M\gamma^2)$
This shows that under these conditions the training error of AdaBoost decreases at an exponential rate, which is certainly an attractive property.
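
As a quick numerical illustration of the corollary (the values here are chosen only for illustration): if every base classifier beats random guessing by a margin of $\gamma=0.1$ and $M=50$ rounds are run, the bound gives
$\frac{1}{N}\sum_{i=1}^{N} I(G(x_i)\ne y_i)\le\exp(-2\cdot 50\cdot 0.1^2)=\exp(-1)\approx 0.368$
and every additional 35 rounds or so roughly halves the bound again, since $\exp(-2\cdot 35\cdot 0.01)\approx 0.5$.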

Unlike some earlier boosting methods, AdaBoost is adaptive: it adapts to the training error rates of the individual weak classifiers. This is also the origin of its name; Ada is short for Adaptive.

3. Interpretation of the AdaBoost Algorithm

AdaBoost can be interpreted as a binary classification learning method whose model is an additive model, whose loss function is the exponential loss, and whose learning algorithm is the forward stagewise algorithm.

3.1 Forward Stagewise Algorithm

Consider the additive model
$f(x)=\sum_{m=1}^M \beta_m b(x;\gamma_m)$
where $b(x;\gamma_m)$ is a basis function, $\gamma_m$ is the parameter of the basis function, and $\beta_m$ is the coefficient of the basis function.
Given a training set and a loss function $L(y,f(x))$, learning the additive model $f(x)$ becomes a problem of empirical risk minimization, i.e., minimization of the loss function:
$\min_{\beta_m,\gamma_m}\sum_{i=1}^N L\left(y_i,\sum_{m=1}^M \beta_m b(x_i;\gamma_m)\right)$
This is usually a complex optimization problem. The idea of the forward stagewise algorithm for solving it is: moving from front to back, learn only one basis function and its coefficient at each step, gradually approaching the optimization objective; this simplifies the optimization. Each step only needs to optimize the following loss:
$\min_{\beta,\gamma}\sum_{i=1}^N L(y_i,\beta b(x_i;\gamma))$

Given the training data set $T=\{(x_1,y_1),(x_2,y_2),\dots,(x_N,y_N)\}$, where $x_i\in\mathcal{X}\subseteq R^n$, $y_i\in\{-1,+1\}$, a loss function $L(y,f(x))$, and a set of basis functions $\{b(x;\gamma)\}$, the forward stagewise algorithm for learning the additive model $f(x)$ is as follows:
[Figure: the forward stagewise algorithm]
In this way, the forward stagewise algorithm reduces the problem of simultaneously solving for all parameters $\beta_m,\gamma_m$, $m=1,\dots,M$, to the problem of solving for each $\beta_m,\gamma_m$ step by step.
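
A minimal sketch of the forward stagewise idea under squared loss, using step functions $b(x;\gamma)=I(x>\gamma)$ as an illustrative (assumed) choice of basis function; each step fits only one new $\beta,\gamma$ pair and leaves the earlier ones fixed.

```python
import numpy as np

def stagewise_fit(x, y, M=10, thresholds=None):
    """Forward stagewise additive modelling with squared loss and
    step basis functions b(x; gamma) = 1{x > gamma} (illustrative choice)."""
    if thresholds is None:
        thresholds = np.unique(x)
    f = np.zeros_like(y, dtype=float)        # current additive model f_{m-1}(x)
    components = []                          # list of (beta_m, gamma_m)
    for m in range(M):
        best = None
        for gamma in thresholds:
            b = (x > gamma).astype(float)              # basis function values
            denom = b.sum()
            if denom == 0:
                continue
            beta = np.dot(b, y - f) / denom            # least-squares coefficient for this gamma
            loss = np.sum((y - f - beta * b) ** 2)     # loss of this single stagewise step
            if best is None or loss < best[0]:
                best = (loss, beta, gamma)
        _, beta, gamma = best
        f += beta * (x > gamma)                        # f_m = f_{m-1} + beta_m * b(x; gamma_m)
        components.append((beta, gamma))
    return components
```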

3.2 Forward Stagewise Algorithm and AdaBoost

AdaBoost can be derived from the forward stagewise algorithm; the relationship is stated as a theorem.
Theorem 8.3: the AdaBoost algorithm is a special case of the forward stagewise additive algorithm, in which the model is an additive model of base classifiers and the loss function is the exponential loss.

4. Boosting Tree

The boosting tree is a boosting method that uses classification or regression trees as base learners. It is considered one of the best-performing methods in statistical learning.

4.1 Boosting Tree Model

The boosting method is in fact an additive model (i.e., a linear combination of basis functions) combined with the forward stagewise algorithm. A boosting method that uses decision trees as basis functions is called a boosting tree. For classification problems the decision tree is a binary classification tree; for regression problems it is a binary regression tree.
The boosting tree model can be expressed as an additive model of decision trees:
$f_M(x)=\sum_{m=1}^M T(x;\theta_m)$
where $T(x;\theta_m)$ denotes a decision tree, $\theta_m$ is the parameter of the tree, and $M$ is the number of trees.

4.2 Boosting Tree Algorithm

The boosting tree algorithm uses the forward stagewise algorithm. First set the initial boosting tree $f_0(x)=0$; the model at step $m$ is
$f_m(x)=f_{m-1}(x)+T(x;\theta_m)$
where $f_{m-1}(x)$ is the current model. The parameters $\theta_m$ of the next decision tree are determined by empirical risk minimization:
$\hat{\theta}_m=\arg\min_{\theta_m}\sum_{i=1}^{N} L\left(y_i,f_{m-1}(x_i)+T(x_i;\theta_m)\right)$
Because a linear combination of trees can fit the training data well even when the relationship between input and output is complex, the boosting tree is a very capable learning algorithm.
Learning algorithms of the boosting tree: below we discuss boosting tree learning algorithms for different problems; the main difference between them is the loss function used. These include the classification problem, which uses the exponential loss, the regression problem, which uses the squared loss, and general decision problems, which use a general loss function.

For binary classification, the boosting tree algorithm simply restricts the base classifier in AdaBoost to be a binary classification tree; in this case the boosting tree algorithm is a special case of the AdaBoost algorithm.
For the regression problem, a boosting tree partitions the input space $\mathcal{X}$ into $J$ disjoint regions $R_1,R_2,\dots,R_J$ and assigns a constant output $c_j$ to each region, so a tree can be written as
$T(x;\theta)=\sum_{j=1}^J c_j I(x\in R_j)$
where the parameter $\theta=\{(R_1,c_1),(R_2,c_2),\dots,(R_J,c_J)\}$ denotes the region partition of the tree and the constants on the regions, and $J$ is the complexity of the regression tree, i.e., the number of leaf nodes.
The regression boosting tree uses the following forward stagewise algorithm:
$f_0(x)=0$
$f_m(x)=f_{m-1}(x)+T(x;\theta_m),\quad m=1,2,\dots,M$
$f_M(x)=\sum_{m=1}^{M} T(x;\theta_m)$
At step $m$ of the forward stagewise algorithm, given the current model $f_{m-1}(x)$, we solve
$\hat{\theta}_m=\arg\min_{\theta_m}\sum_{i=1}^{N} L\left(y_i,f_{m-1}(x_i)+T(x_i;\theta_m)\right)$
to obtain $\hat{\theta}_m$, the parameters of the $m$-th tree.
When the squared loss is used, $L(y,f(x))=(y-f(x))^2$, and the loss becomes
$L(y,f_{m-1}(x)+T(x;\theta_m))=[y-f_{m-1}(x)-T(x;\theta_m)]^2=[r-T(x;\theta_m)]^2$
where $r=y-f_{m-1}(x)$ is the residual of the current model on the data. Therefore, the boosting tree algorithm for the regression problem simply fits the residuals of the current model.

Algorithm: the boosting tree algorithm for regression
[Figure: the boosting tree algorithm for regression]
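
A minimal sketch of this algorithm under squared loss, assuming scikit-learn's DecisionTreeRegressor plays the role of $T(x;\theta_m)$ with the number of leaf regions $J$ controlled by max_leaf_nodes; the names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosting_tree_fit(X, y, M=50, max_leaf_nodes=4):
    """Regression boosting tree with squared loss: each tree fits the
    residuals r = y - f_{m-1}(x) of the current model."""
    f = np.zeros_like(y, dtype=float)   # f_0(x) = 0
    trees = []
    for m in range(M):
        r = y - f                                                     # residuals of current model
        tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes)   # J leaf regions
        tree.fit(X, r)                                                # T(x; theta_m) fitted to residuals
        f += tree.predict(X)                                          # f_m = f_{m-1} + T(x; theta_m)
        trees.append(tree)
    return trees

def boosting_tree_predict(X, trees):
    return sum(tree.predict(X) for tree in trees)                     # f_M(x) = sum_m T(x; theta_m)
```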

4.3 Gradient Boosting

The boosting tree uses the additive model and the forward stagewise algorithm to carry out the optimization in learning. When the loss function is the squared loss or the exponential loss, each optimization step is simple; for a general loss function, however, each step is often not so easy. To address this, Friedman proposed the gradient boosting algorithm. It uses an approximation of steepest descent; the key is to use the negative gradient of the loss function evaluated at the current model, $-\left[\frac{\partial L(y_i,f(x_i))}{\partial f(x_i)}\right]_{f(x)=f_{m-1}(x)}$, as an approximation of the residual in the regression boosting tree algorithm, and to fit a regression tree to it.

Algorithm: the gradient boosting algorithm
[Figure: the gradient boosting algorithm]
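
A minimal sketch of gradient boosting for a general differentiable loss, using the absolute loss $L(y,f)=|y-f|$ as the example (its negative gradient is $\mathrm{sign}(y-f)$). The learning rate, the constant initialization, and the simplification of keeping the tree's least-squares leaf values instead of re-optimizing each leaf are all assumptions of this sketch.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boosting_fit(X, y, M=100, lr=0.1, max_depth=3):
    """Gradient boosting sketch with absolute loss L(y, f) = |y - f|.
    Each round fits a regression tree to the negative gradient
    -dL/df = sign(y - f), evaluated at the current model."""
    init = np.median(y)                    # constant initial model
    f = np.full(len(y), init)
    trees = []
    for m in range(M):
        neg_grad = np.sign(y - f)          # negative gradient (pseudo-residuals)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, neg_grad)              # fit a regression tree to the pseudo-residuals
        f += lr * tree.predict(X)          # shrink each step by the learning rate
        trees.append(tree)
    return init, trees

def gradient_boosting_predict(X, init, trees, lr=0.1):
    # lr must match the value used in gradient_boosting_fit
    return init + lr * sum(tree.predict(X) for tree in trees)
```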

Summary

  1. Boosting is a statistical learning method that promotes a weak learning algorithm to a strong learning algorithm.
  2. The AdaBoost model is a linear combination of weak classifiers.
  3. The AdaBoost algorithm learns one base classifier in each iteration.
  4. Each AdaBoost iteration can reduce the classification error rate on the training data set.
  5. The boosting tree is a boosting method that uses classification or regression trees as base learners.

Supplement

Gradient Boosting

Gradient Boosting Machine (GBM for short), proposed by Friedman, is an algorithm based on gradient boosting.

The classical AdaBoost algorithm can only handle binary classification tasks with the exponential loss function (although AdaBoost variants for multi-class classification and regression now exist), whereas Gradient Boosting can handle a variety of learning tasks (classification, regression, ranking, etc.) by choosing different differentiable loss functions, which greatly widens its range of application.
AdaBoost is rather sensitive to outliers, while Gradient Boosting can effectively resist noise in the training data through methods such as introducing the bagging idea and adding regularization terms, giving better robustness.

The basic idea of Gradient Boosting is to take the negative gradient as a measure of the mistakes made in the previous round, and to correct those mistakes in the next round by fitting the negative gradient. The key question is: why does fitting the negative gradient correct the previous round's mistakes? Friedman's answer: it is gradient descent in function space.
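
A one-line check makes the link to residual fitting concrete: writing the squared loss with the conventional $\frac{1}{2}$ factor, the negative gradient at the current model is exactly the residual,
$-\frac{\partial L(y_i,f(x_i))}{\partial f(x_i)}=-\frac{\partial}{\partial f(x_i)}\frac{1}{2}\bigl(y_i-f(x_i)\bigr)^2=y_i-f(x_i)=r_i$
so for squared loss gradient boosting reduces to the residual-fitting boosting tree of Section 4.2, and other differentiable losses simply substitute a different negative gradient.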

GBDT

The gradient boosting tree is usually referred to as GBDT (Gradient Boosting Decision Tree), and also as GBT (Gradient Boosting Tree), GTB (Gradient Tree Boosting), GBRT (Gradient Boosting Regression Tree), or MART (Multiple Additive Regression Tree). GBDT is widely used in practical business and plays an important role among machine learning algorithms. GBDT uses the regression tree from CART.

For classification problems, the loss functions commonly used with GBDT are the exponential loss (which recovers AdaBoost) and the negative log loss. The exponential loss is easily affected by outliers and can only be used for binary classification, so the default loss function of GradientBoostingClassifier in sklearn is the latter.
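
A minimal scikit-learn sketch (parameter values are illustrative; note that the name of the default log-loss option has varied across sklearn releases, e.g. 'deviance' in older versions and 'log_loss' in newer ones, while 'exponential' selects the AdaBoost-style loss):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# default loss is the negative log loss; loss="exponential" would recover AdaBoost
clf = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```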

For regression problems, the loss functions commonly used with GBDT are the squared error loss, the absolute loss, the Huber loss, and the quantile loss. The Huber loss and the quantile loss are often used for robust regression, as they reduce the influence of outliers on the loss function; by contrast, the squared error loss is easily affected by outliers.
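
A minimal sketch of robust regression with GradientBoostingRegressor; the synthetic data and parameter values are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.randn(500)
y[::50] += 5.0                       # inject a few outliers

# Huber loss limits the influence of the injected outliers
robust = GradientBoostingRegressor(loss="huber", alpha=0.9, n_estimators=200)
robust.fit(X, y)

# quantile loss estimates a conditional quantile instead of the mean
q90 = GradientBoostingRegressor(loss="quantile", alpha=0.9, n_estimators=200)
q90.fit(X, y)
```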

XGBoost

XGBoost is an extended implementation of GBDT. It is not only faster but also improves the algorithmic derivation. The main differences between the two are as follows (a minimal usage sketch follows the list):

  1. GBDT uses CART regression trees as base learners, while XGBoost supports three kinds of base learners: gbtree, dart, and gblinear. The first two are tree-based models, with dart additionally incorporating the dropout idea; gblinear is a linear model, in which case XGBoost is equivalent to logistic regression (for classification) or linear regression (for regression) with L1 and L2 regularization terms.
  2. GBDT uses only first-order derivative information in its optimization, while XGBoost performs a second-order Taylor expansion of the cost function and uses both first- and second-order derivatives. Incidentally, the XGBoost tool supports custom cost functions, as long as they are first- and second-order differentiable.
    More fundamentally, GBDT takes the gradient of the loss $L(y,F)$ with respect to $F$ and fits a regression tree to the negative gradient; XGBoost performs a second-order Taylor expansion of the loss $L(y,F)$, derives an analytical solution, and uses the analytical objective $obj$ as the criterion, greedily searching for splits that improve $obj$, which is more direct.
  3. XGBoost adds regularization terms to the cost function to control model complexity. The regularization term contains the number of leaf nodes of the tree and the sum of the squared L2 norms of the output scores on the leaf nodes. From the bias-variance tradeoff perspective, the regularization term reduces the variance of the model, making the learned model simpler and preventing overfitting.
  4. XGBoost borrows from random forests and supports column subsampling, i.e., randomly selecting a subset of features for learning, which both reduces overfitting and reduces computation.
  5. For samples with missing feature values, XGBoost can automatically learn their split direction.
  6. The XGBoost tool supports parallelism. Note that this is not parallelism at the granularity of trees: an XGBoost iteration can only start after the previous one finishes (the cost function of the t-th iteration contains the predictions of the previous t-1 iterations). XGBoost's parallelism is at the feature granularity.
    As we know, one of the most time-consuming steps in learning a decision tree is sorting the feature values (to determine the optimal split point). Before training, XGBoost pre-sorts the data and stores it in a block structure that is reused in later iterations, greatly reducing computation. This block structure also makes parallelism possible: when splitting a node, the gain of each feature must be computed and the feature with the largest gain is chosen for the split; the gain computations for different features can then run in multiple threads, achieving parallelism.
  7. When splitting a tree node, the gain of every candidate split point of every feature must be computed, i.e., all possible split points are enumerated greedily. When the data cannot be loaded into memory at once, or in a distributed setting, the greedy algorithm becomes inefficient, so XGBoost also proposes a parallel approximate histogram algorithm for efficiently generating candidate split points.
  8. The regression tree in XGBoost differs from the CART regression tree used in GBDT: the leaf output of the latter is the mean of all sample points assigned to the leaf, while the leaf output of the former is computed from a scoring formula.
  9. Boosting is mainly concerned with reducing bias, so it can build a strong ensemble from learners with weak generalization performance; Bagging is mainly concerned with reducing variance, so it is more effective with unpruned decision trees, neural networks, and other low-bias learners.
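
A minimal sketch using the xgboost Python package, touching several of the points above (the second-order optimization of point 2 happens internally; regularization, column subsampling, missing-value handling, parallelism, and histogram split finding appear as parameters). Parameter values are illustrative.

```python
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
X[::17, 0] = np.nan                      # missing values are handled natively (point 5)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = xgb.XGBClassifier(
    booster="gbtree",        # or "dart" / "gblinear" (point 1)
    n_estimators=300,
    learning_rate=0.1,
    max_depth=4,
    reg_alpha=0.0,           # L1 regularization (point 3)
    reg_lambda=1.0,          # L2 regularization on leaf scores (point 3)
    colsample_bytree=0.8,    # column subsampling (point 4)
    n_jobs=4,                # feature-granularity parallelism (point 6)
    tree_method="hist",      # histogram-based split finding (point 7)
)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```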

LightGBM is an open-source library released by Microsoft in recent years. Compared with XGBoost, it is faster and uses less memory. LightGBM's main improvements lie in the following aspects (a minimal usage sketch follows the list):

  • Uses a histogram algorithm to find split points, reducing memory usage and speeding up training
  • Grows trees leaf-wise, improving accuracy
  • Optimizes parallel learning
  • Reduces the training set via Gradient-based One-Side Sampling (GOSS) and reduces the features via Exclusive Feature Bundling (EFB)
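
A minimal sketch using the lightgbm Python package; parameter values are illustrative, with num_leaves controlling the leaf-wise growth and max_bin the histogram used for split finding.

```python
import lightgbm as lgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = lgb.LGBMClassifier(
    n_estimators=300,
    learning_rate=0.1,
    num_leaves=31,        # leaf-wise growth is bounded by the leaf count, not the depth
    max_bin=255,          # number of histogram bins used for split finding
)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```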

