In-Depth Analysis of Ensemble Learning: XGBoost (Continued)
2022-07-28 23:45:00 【「 25' h 」】
For XGBoost, the really difficult part is not laying out the algorithm flow above, but proving that this flow actually drives the model in the direction of minimizing the objective function. That proof has to answer the following obvious questions:

1. What is the quantity $r_{ik} = -\frac{g_{ik}}{h_{ik}}$ that we fit when building a tree, and why is fitting it meaningful?
2. How are the structure score and the structure score gain derived? Why does building trees this way improve the model?
3. Why is the leaf output value $w_j$ equal to $-\frac{\sum_{i \in j} g_{ik}}{\sum_{i \in j} h_{ik} + \lambda}$? What does this output mean?
4. The first part of the course says that XGBoost also fits residuals; where do the residuals appear?
First, recall the earlier definition: the objective function in XGBoost is defined for a single tree, not for a single sample or for the whole algorithm. The objective of any tree consists of three parts: the loss $l$, the number of leaves $T$, and the regularization terms. Specifically:

Suppose a single tree $f_k$ has objective $O_k$ and a total of $T$ leaves; the loss of any sample $i$ on the tree is $l(y_i, H(x_i))$, where $H(x_i)$ is the ensemble's prediction for sample $i$. There are $M$ samples on the tree, the objective uses L2 regularization ($\lambda \neq 0$, $\alpha = 0$), and $\gamma \neq 0$. Then the objective of this tree is:
$$O_k = \sum_{i=1}^M l(y_i, H_k(x_i)) + \gamma T + \frac{1}{2}\lambda\sum_{j=1}^T w_j^2$$
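As a concrete illustration, here is a minimal sketch (my own toy code, not XGBoost's internals) that evaluates this per-tree objective, assuming a squared-error loss and hypothetical leaf assignments and leaf weights:

```python
import numpy as np

def tree_objective(y, pred_prev, leaf_index, w, gamma, lam):
    """Evaluate O_k = sum_i l(y_i, H_k(x_i)) + gamma*T + 0.5*lambda*sum_j w_j^2
    for one tree, assuming the loss l(y, H) = (y - H)^2.

    y          : true labels, shape (M,)
    pred_prev  : previous ensemble predictions H_{k-1}(x_i), shape (M,)
    leaf_index : index of the leaf each sample falls into, shape (M,), values in [0, T)
    w          : output value of each leaf, shape (T,)
    """
    pred_new = pred_prev + w[leaf_index]          # H_k(x_i) = H_{k-1}(x_i) + f_k(x_i)
    loss = np.sum((y - pred_new) ** 2)            # sum_i l(y_i, H_k(x_i))
    T = len(w)                                    # number of leaves
    penalty = gamma * T + 0.5 * lam * np.sum(w ** 2)
    return loss + penalty

# toy usage with made-up numbers
y = np.array([1.0, 0.5, 2.0, 1.5])
pred_prev = np.array([0.8, 0.6, 1.5, 1.2])
leaf_index = np.array([0, 0, 1, 1])               # a tree with two leaves
w = np.array([0.05, 0.4])
print(tree_objective(y, pred_prev, leaf_index, w, gamma=1.0, lam=1.0))
```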
Our goal is to minimize this objective and to find the argument that minimizes it. For a boosting algorithm with an ordinary loss function, the ensemble output $H(x)$ keeps changing during iteration, and the loss $l(y, H(x))$ keeps shrinking across iterations:
$$H_k(x_i) = H_{k-1}(x_i) + f_k(x_i)$$

$$l_k = l(y_i, H_{k-1}(x_i) + f_k(x_i))$$
At iteration $k$, both $y_i$ and $H_{k-1}(x_i)$ in the loss are constants; only $f_k(x_i)$ is a variable. So we only need to differentiate with respect to $f_k(x_i)$ and find the prediction $f_k(x_i)$ that minimizes the overall loss. As we noted for GBDT, it does not matter what structure the weak learner $f_k$ has, what its splitting rules are, how it is built, or how it is fit: as long as its final output $f_k(x_i)$ is the value that minimizes the overall loss $L$, the loss is guaranteed to keep shrinking as the algorithm iterates. Therefore a suitable $f_k(x_i)$ not only guarantees continuously decreasing loss, it can also guide the construction of each individual learner.

In XGBoost we can likewise differentiate the objective and find the argument that minimizes it, but the problem is that XGBoost's objective contains multiple independent variables:
$$\begin{aligned} O_k &= \sum_{i=1}^M l(y_i,H_k(x_i)) + \gamma T + \frac{1}{2}\lambda\sum_{j=1}^T w_j^2 \\ &= \sum_{i=1}^M l\left( y_i, H_{k-1}(x_i) + \boldsymbol{\color{red}{f_k(x_i)}} \right) + \gamma \boldsymbol{\color{red}{T}} + \frac{1}{2}\lambda\sum_{j=1}^T \boldsymbol{\color{red}{w_j}}^2 \end{aligned}$$
Here $T$ is the total number of leaves on tree $k$, and $f_k(x_i)$ and $w_j$ are both predicted output values of the model (the values output on the leaves), just in different forms: for any sample $i$ on leaf $j$, numerically $f_k(x_i) = w_j$. For XGBoost only one variable can be chosen as the argument to optimize. Since $f_k(x_i)$ relates only to the accuracy on individual samples, while $T$ relates only to the tree structure, the XGBoost paper ultimately chooses $w_j$, the variable related to both accuracy and tree structure. Moreover, knowing the optimal leaf output $w_j$ can guide the tree toward growing a reasonable structure, whereas knowing only the total number of leaves $T$ cannot guide tree construction.
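To make the relationship $f_k(x_i) = w_j$ concrete, here is a small illustrative sketch (assumed toy names, not library code): a tree's prediction for a sample is simply a lookup of the weight stored on the leaf that sample falls into.

```python
import numpy as np

# hypothetical leaf weights for a tree with T = 3 leaves
w = np.array([-0.2, 0.1, 0.35])

# hypothetical assignment of samples to leaves (as produced by the tree's split rules)
leaf_index = np.array([2, 0, 0, 1, 2])

# f_k(x_i) is just the weight of the leaf containing x_i: f_k(x_i) = w_j
f_k = w[leaf_index]
print(f_k)   # [ 0.35 -0.2  -0.2   0.1   0.35]
```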
Therefore, the first step in solving XGBoost's objective is to rewrite the objective as an expression in $w_j$.

In our objective $O_k$, the part that can be Taylor-expanded is the first part, the loss:
$$O_k = \sum_{i=1}^M l\left( y_i, H_{k-1}(x_i) + f_k(x_i) \right) + \gamma T + \frac{1}{2}\lambda\sum_{j=1}^T w_j^2$$
Since the loss $l$ contains only one variable, $H_{k-1}(x_i) + f_k(x_i)$, the function can be abbreviated as $l(H_{k-1}(x_i) + f_k(x_i))$.

From the second-order Taylor expansion we know:
$$\begin{aligned} f(x) &\approx \sum_{n=0}^{2}\frac{f^{(n)}(a)}{n!}(x-a)^n \\ &\approx f(a) + \frac{f'(a)}{1!}(x-a) + \frac{f''(a)}{2!}(x-a)^2 \end{aligned}$$
In the Taylor expansion let $x = H_{k-1}(x_i) + f_k(x_i)$ and $a = H_{k-1}(x_i)$, so that $(x - a) = f_k(x_i)$. Accordingly, the loss $l(H_{k-1}(x_i) + f_k(x_i))$ can be written as:
$$l(H_{k-1}(x_i) + f_k(x_i)) \approx l(H_{k-1}(x_i)) + \frac{\partial l(H_{k-1}(x_i))}{\partial H_{k-1}(x_i)} \, f_k(x_i) + \frac{1}{2}\frac{\partial^2 l(H_{k-1}(x_i))}{\partial H_{k-1}^2(x_i)} \, f_k^2(x_i)$$
In XGBoost we have already defined the first and second derivatives of the loss:
$$g_{ik} = \frac{\partial l(y_i, H_{k-1}(x_i))}{\partial H_{k-1}(x_i)}$$

$$h_{ik} = \frac{\partial^2 l(y_i, H_{k-1}(x_i))}{\partial H_{k-1}^2(x_i)}$$
In the original XGBoost paper, $g_i$ and $h_i$ carry no subscript $k$ for brevity, even though we know that $g$ and $h$ must be recomputed at every iteration. We follow the paper and drop the subscript $k$ here as well. The Taylor-expanded formula can therefore be simplified to:
$$\begin{aligned} l(H_{k-1}(x_i) + f_k(x_i)) &\approx l(H_{k-1}(x_i)) + g_i f_k(x_i) + \frac{1}{2} h_i f_k^2(x_i) \\ &\approx \text{constant} + g_i f_k(x_i) + \frac{1}{2} h_i f_k^2(x_i) \end{aligned}$$
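As a sanity check of this second-order approximation, the sketch below (my own illustration) assumes a squared-error loss, for which $g_i = 2(H_{k-1}(x_i) - y_i)$ and $h_i = 2$, and compares the true loss $l(H_{k-1}(x_i) + f_k(x_i))$ with the expansion $l(H_{k-1}(x_i)) + g_i f_k(x_i) + \frac{1}{2} h_i f_k^2(x_i)$:

```python
def loss(y, pred):                       # l(y, H) = (y - H)^2
    return (y - pred) ** 2

y, H_prev, f = 1.0, 0.3, 0.5             # one sample: label, previous prediction, new tree output

g = 2.0 * (H_prev - y)                   # first derivative of l w.r.t. H_{k-1}(x_i)
h = 2.0                                  # second derivative of l w.r.t. H_{k-1}(x_i)

exact  = loss(y, H_prev + f)                          # l(H_{k-1} + f_k)
taylor = loss(y, H_prev) + g * f + 0.5 * h * f ** 2   # l(H_{k-1}) + g*f + 0.5*h*f^2

print(exact, taylor)   # identical here, because squared error is itself quadratic
```

For losses that are not quadratic (e.g. log loss), the two values are close rather than equal, which is exactly what the approximation sign above expresses.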
It is easy to see that $H_{k-1}(x_i)$ is a constant in this formula, so the first term $l(H_{k-1}(x_i))$ is also a constant. Constants cannot be minimized, so we can drop them from the objective. After the Taylor expansion, the objective becomes:
$$\begin{aligned} \tilde{O}_k &= \sum_{i=1}^M\left(g_i f_k(x_i) + \frac{1}{2}h_i f_k^2(x_i)\right) + \gamma T + \frac{1}{2}\lambda\sum_{j=1}^T w_j^2 \\ &= \sum_{i=1}^M g_i f_k(x_i) + \frac{1}{2}\sum_{i=1}^M h_i f_k^2(x_i) + \gamma T + \frac{1}{2}\lambda\sum_{j=1}^T w_j^2\end{aligned}$$
Now the first two terms of the objective are, respectively, the sum of $g_i f_k(x_i)$ over all samples and half the sum of $h_i f_k^2(x_i)$ over all samples. Don't forget that the only independent variable we chose is $w_j$, so we want to convert $f_k$ into $w_j$ somehow. As mentioned several times already, for any sample $i$ on leaf $j$ we have $f_k(x_i) = w_j$ numerically, so we can start the transformation from a single sample:

For a single sample $i$, suppose it sits on leaf $j$; then:
$$g_i f_k(x_i) = g_i w_j$$
For one leaf $j$, we can compute the sum of $g_i w_j$ over all samples on that leaf:
$$\sum_{i \in j} g_i w_j$$
Since $w_j$ is identical for all samples on the same leaf, the sum of $g_i w_j$ on one leaf can be rewritten as:
$$\begin{aligned}\sum_{i \in j} g_i w_j &= g_1 w_j + g_2 w_j + \cdots + g_n w_j, \quad \text{where } 1, 2, \ldots, n \text{ are the samples on leaf } j \\ &= w_j \sum_{i \in j} g_i\end{aligned}$$
Suppose there are $T$ leaves; then over the entire tree the sum of $g_i w_j$ across all samples is:
$$\sum_{j=1}^T \left( w_j\sum_{i \in j} g_i \right)$$

Therefore:

$$\sum_{i=1}^M g_i f_k(x_i) = \sum_{j=1}^T \left( w_j\sum_{i \in j} g_i \right)$$
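This identity is easy to check numerically. A minimal sketch, using hypothetical gradients, leaf assignments, and leaf weights:

```python
import numpy as np

g = np.array([0.4, -0.1, 0.2, -0.3, 0.5])     # hypothetical gradients g_i
leaf_index = np.array([0, 0, 1, 1, 1])        # which leaf each sample falls into
w = np.array([-0.25, 0.1])                    # hypothetical leaf weights w_j

# left-hand side: sum over samples of g_i * f_k(x_i), with f_k(x_i) = w[leaf of i]
lhs = np.sum(g * w[leaf_index])

# right-hand side: sum over leaves of w_j * (sum of g_i on that leaf)
G = np.bincount(leaf_index, weights=g)        # per-leaf gradient sums
rhs = np.sum(w * G)

print(lhs, rhs)    # the two values agree
```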
Similarly, $h_i f_k^2(x_i)$ for a single sample $i$ can be transformed in the same way. For a single sample:

$$h_i f_k^2(x_i) = h_i w_j^2$$

For one leaf:

$$\begin{aligned}\sum_{i \in j} h_i w_j^2 &= h_1 w_j^2 + h_2 w_j^2 + \cdots + h_n w_j^2, \quad \text{where } 1, 2, \ldots, n \text{ are the samples on leaf } j \\ &= w_j^2 \sum_{i \in j} h_i \end{aligned}$$

For the whole tree:
$$\sum_{i=1}^M h_i f_k^2(x_i) = \sum_{j=1}^T \left( w_j^2\sum_{i \in j} h_i \right)$$

So for the whole objective:
$$\begin{aligned} \tilde{O}_k &= \sum_{i=1}^M g_i f_k(x_i) + \frac{1}{2}\sum_{i=1}^M h_i f_k^2(x_i) + \gamma T + \frac{1}{2}\lambda\sum_{j=1}^T w_j^2 \\ &=\sum_{j=1}^T \left( w_j\sum_{i \in j} g_i + \frac{1}{2}w_j^2\sum_{i \in j} h_i \right) + \gamma T + \frac{1}{2}\lambda\sum_{j=1}^T w_j^2\end{aligned}$$
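Putting the two regrouped sums together, the sketch below (an illustration with assumed arrays and names, not library code) evaluates the Taylor-expanded objective in this per-leaf form:

```python
import numpy as np

def objective_per_leaf(g, h, leaf_index, w, gamma, lam):
    """Evaluate O~_k = sum_j ( w_j*G_j + 0.5*w_j^2*(H_j + lambda) ) + gamma*T,
    where G_j and H_j are the per-leaf sums of gradients and hessians."""
    G = np.bincount(leaf_index, weights=g)            # G_j = sum_{i in j} g_i
    H = np.bincount(leaf_index, weights=h)            # H_j = sum_{i in j} h_i
    T = len(w)
    return np.sum(w * G + 0.5 * w ** 2 * (H + lam)) + gamma * T

g = np.array([0.4, -0.1, 0.2, -0.3, 0.5])
h = np.array([1.0, 1.0, 0.8, 0.9, 1.1])
leaf_index = np.array([0, 0, 1, 1, 1])
w = np.array([-0.25, 0.1])
print(objective_per_leaf(g, h, leaf_index, w, gamma=1.0, lam=1.0))
```

(The regularization term is already merged into the per-leaf bracket here, anticipating the next step of the derivation.)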
It is easy to see that the regularization term can now be merged with the loss part:
$$\begin{aligned} &= \sum_{j=1}^T \left( w_j\sum_{i \in j} g_i + \frac{1}{2}w_j^2\sum_{i \in j} h_i + \frac{1}{2}\lambda w_j^2 \right) + \gamma T \\ &= \sum_{j=1}^T \left( w_j\sum_{i \in j} g_i + \frac{1}{2}w_j^2\left(\sum_{i \in j} h_i + \lambda\right) \right) + \gamma T\end{aligned}$$
After merging, the whole objective contains two terms: the sum of (loss + regularization) over all leaves, and the total number of leaves. Now we can start solving for the minimum of the objective and the corresponding optimal argument $w_j$.

First, we cannot simply minimize the total number of leaves in the objective: cutting the number of leaves too far would badly damage the model's ability to learn. So we can only try to make the sum of (loss + regularization) over all leaves as small as possible.

Second, once the tree is built, the leaves are independent of one another, so the (loss + regularization) on each leaf is also independent. Making (loss + regularization) smallest on every individual leaf guarantees that the sum over all leaves is smallest. Hence we need to minimize the part marked in red in the formula:
$$\tilde{O}_k = \sum_{j=1}^T \left( \boldsymbol{\color{red}{w_j\sum_{i \in j} g_i + \frac{1}{2}w_j^2\left(\sum_{i \in j} h_i + \lambda\right)}} \right) + \gamma T$$
Leaf weight $w_j$
Call the part marked in red $\mu_j$; it is the loss + regularization on leaf $j$. Then:
$$\mu_j = w_j\sum_{i \in j} g_i + \frac{1}{2}w_j^2\left(\sum_{i \in j} h_i + \lambda\right)$$
Now, for leaf $j$, differentiate $\mu_j$ with respect to its only argument $w_j$:
$$\begin{aligned}\frac{\partial \mu_j}{\partial w_j} &= \frac{\partial \left( w_j\sum_{i \in j} g_i + \frac{1}{2}w_j^2\left(\sum_{i \in j} h_i + \lambda\right) \right)}{\partial w_j} \\ &= \sum_{i \in j} g_i + w_j\left(\sum_{i \in j} h_i + \lambda\right)\end{aligned}$$
Setting the first derivative to 0:
$$\begin{aligned} \sum_{i \in j} g_i + w_j\left(\sum_{i \in j} h_i + \lambda\right) &= 0 \\ w_j\left(\sum_{i \in j} h_i + \lambda\right) &= -\sum_{i \in j} g_i \\ w_j &= -\frac{\sum_{i \in j} g_i}{\sum_{i \in j} h_i + \lambda}\end{aligned}$$
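A minimal sketch of this closed-form leaf weight (my own toy helper, not xgboost's API), computed from the gradients and hessians of the samples on one leaf:

```python
import numpy as np

def optimal_leaf_weight(g_leaf, h_leaf, lam):
    """w_j* = - sum(g_i) / (sum(h_i) + lambda) for the samples on one leaf."""
    return -np.sum(g_leaf) / (np.sum(h_leaf) + lam)

g_leaf = np.array([0.4, -0.1, 0.5])     # gradients of the samples on this leaf
h_leaf = np.array([1.0, 0.9, 1.1])      # hessians of the samples on this leaf
print(optimal_leaf_weight(g_leaf, h_leaf, lam=1.0))   # -0.8 / 4.0 = -0.2
```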
You should recognize this: for a single leaf, the $w_j$ that minimizes the objective is exactly the leaf weight we mentioned before, i.e. the leaf output value in XGBoost's mathematical procedure. If we want the leaf outputs to come out close to this leaf-weight formula, how should each individual sample be fit?

For any sample $i$ on leaf $j$:
$$\mu_i = w_j g_i + \frac{1}{2}w_j^2 h_i$$
When converting the per-leaf $\mu_j$ into a per-sample $\mu_i$, in principle every term of $\mu_j$ should be translated into its single-sample counterpart, but the regularization term causes trouble: unlike $\sum_{i \in j} g_i$, which decomposes directly into per-sample terms, $\lambda$ is a value set for a whole leaf. To turn $\lambda$ into a per-sample regularization term you would need to know how many samples sit on the current leaf. However, fitting happens before the tree is built, so the number of samples on a leaf is not yet known at that point. Therefore, in xgboost's actual implementation, the per-sample fitting target involves no regularization term; regularization is used only when computing structure scores and leaf output values.

Differentiating $\mu_i$ with respect to its only argument $w_j$:
$$\begin{aligned}\frac{\partial \mu_i}{\partial w_j} &= \frac{\partial \left( w_j g_i + \frac{1}{2}w_j^2 h_i \right)}{\partial w_j} \\ &= g_i + w_j h_i\end{aligned}$$
Setting the first derivative to 0:
$$\begin{aligned} g_i + w_j h_i &= 0 \\ w_j h_i &= -g_i \\ w_j &= -\frac{g_i}{h_i} \end{aligned}$$
So for any single sample $i$, the optimal $w_j$ that minimizes the objective is exactly our pseudo-residual $r_i$, i.e. the fitting target used in XGBoost's mathematical procedure.
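A quick sketch of this per-sample fitting target (my own illustration), again assuming a squared-error loss, where $g_i = 2(H_{k-1}(x_i) - y_i)$ and $h_i = 2$, so that $-g_i/h_i$ reduces to the ordinary residual $y_i - H_{k-1}(x_i)$, which answers the question of where the residuals appear:

```python
import numpy as np

y = np.array([1.0, 0.5, 2.0])
H_prev = np.array([0.8, 0.6, 1.5])       # predictions of the previous ensemble

g = 2.0 * (H_prev - y)                   # first derivatives for squared error
h = np.full_like(y, 2.0)                 # second derivatives for squared error

r = -g / h                               # pseudo-residuals fitted by the new tree
print(r)                                 # [ 0.2 -0.1  0.5] == y - H_prev
```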
Now substitute the optimal $w_j$ that minimizes the objective back into $\mu_j$ and see what $\mu_j$ becomes:
$$\begin{aligned} \mu_j &= w_j\sum_{i \in j} g_i + \frac{1}{2}w_j^2\left(\sum_{i \in j} h_i + \lambda\right) \\ &= -\frac{\sum_{i \in j} g_i}{\sum_{i \in j} h_i + \lambda} \cdot \sum_{i \in j} g_i + \frac{1}{2}\left(-\frac{\sum_{i \in j} g_i}{\sum_{i \in j} h_i + \lambda}\right)^2 \cdot \left(\sum_{i \in j} h_i + \lambda\right)\\ &= -\frac{(\sum_{i \in j} g_i)^2}{\sum_{i \in j} h_i + \lambda} + \frac{1}{2}\frac{(\sum_{i \in j} g_i)^2}{\sum_{i \in j} h_i + \lambda} \\ &= -\frac{1}{2}\frac{(\sum_{i \in j} g_i)^2}{\sum_{i \in j} h_i + \lambda} \end{aligned}$$
Therefore the objective (the loss over all leaves) becomes:
$$\begin{aligned} \tilde{O}_k &= \sum_{j=1}^T \left( \boldsymbol{\color{red}{w_j\sum_{i \in j} g_i + \frac{1}{2}w_j^2\left(\sum_{i \in j} h_i + \lambda\right)}} \right) + \gamma T \\ &= \sum_{j=1}^T \left( -\frac{1}{2}\frac{(\sum_{i \in j} g_i)^2}{\sum_{i \in j} h_i + \lambda} \right) + \gamma T \end{aligned}$$
So the objective on a single leaf is:
$$O_j = -\frac{1}{2}\frac{(\sum_{i \in j} g_i)^2}{\sum_{i \in j} h_i + \lambda} + \gamma$$
For any leaf, this objective measures the quality of the leaf. Since $\gamma$ is a hyperparameter we set and $\frac{1}{2}$ is a constant, for any leaf we want the part marked in red to be as small as possible:
$$O_j = \frac{1}{2}\left( \boldsymbol{\color{red}{-\frac{(\sum_{i \in j} g_i)^2}{\sum_{i \in j} h_i + \lambda}}} \right) + \gamma$$
Equivalently, we want the following quantity to be as large as possible:
$$\frac{(\sum_{i \in j} g_i)^2}{\sum_{i \in j} h_i + \lambda}$$
This quantity is XGBoost's criterion for splitting: the "structure score" (Structure Score).
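A minimal sketch of the structure score for one leaf (assumed helper name, not xgboost's API):

```python
import numpy as np

def structure_score(g_leaf, h_leaf, lam):
    """Structure score of one leaf: (sum g_i)^2 / (sum h_i + lambda).
    Larger is better when comparing candidate tree structures."""
    return np.sum(g_leaf) ** 2 / (np.sum(h_leaf) + lam)

g_leaf = np.array([0.4, -0.1, 0.5])
h_leaf = np.array([1.0, 0.9, 1.1])
print(structure_score(g_leaf, h_leaf, lam=1.0))   # 0.8**2 / 4.0 = 0.16
```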
When splitting, we want the objective to be as small as possible, so a split is good when the parent node's objective is larger than the children's. We can therefore use (parent objective minus the sum of the child objectives) to measure the quality of a split:
$$\begin{aligned} Gain &= O_p - (O_l + O_r) \\ &= -\frac{1}{2}\frac{(\sum_{i \in P} g_i)^2}{\sum_{i \in P} h_i + \lambda} + \gamma - \left(-\frac{1}{2}\frac{(\sum_{i \in L} g_i)^2}{\sum_{i \in L} h_i + \lambda} + \gamma - \frac{1}{2}\frac{(\sum_{i \in R} g_i)^2}{\sum_{i \in R} h_i + \lambda} + \gamma\right) \\ &= -\frac{1}{2}\frac{(\sum_{i \in P} g_i)^2}{\sum_{i \in P} h_i + \lambda} + \gamma + \frac{1}{2}\frac{(\sum_{i \in L} g_i)^2}{\sum_{i \in L} h_i + \lambda} - \gamma + \frac{1}{2}\frac{(\sum_{i \in R} g_i)^2}{\sum_{i \in R} h_i + \lambda} - \gamma \\ &= \frac{1}{2}\left( \frac{(\sum_{i \in L} g_i)^2}{\sum_{i \in L} h_i + \lambda} + \frac{(\sum_{i \in R} g_i)^2}{\sum_{i \in R} h_i + \lambda} - \frac{(\sum_{i \in P} g_i)^2}{\sum_{i \in P} h_i + \lambda} \right) - \gamma \\ &= \frac{1}{2} (Score_L + Score_R - Score_P) - \gamma \end{aligned}$$
among , γ \gamma γ It is a super parameter that can be set , 1 2 \frac{1}{2} 21 Constant , therefore :
$$Gain = Score_L + Score_R - Score_P$$
This is the structure score gain we use when splitting.
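Combining the pieces, the sketch below scores a candidate split the way the derivation suggests: compute the structure score of the parent and of the two children and take the difference (the simplified form above, without the $\frac{1}{2}$ and $\gamma$ constants; xgboost's own implementation keeps them). All names and numbers here are my own toy illustration:

```python
import numpy as np

def structure_score(g, h, lam):
    return np.sum(g) ** 2 / (np.sum(h) + lam)

def split_gain(g, h, left_mask, lam):
    """Gain = Score_L + Score_R - Score_P for a candidate split of one node."""
    score_parent = structure_score(g, h, lam)
    score_left = structure_score(g[left_mask], h[left_mask], lam)
    score_right = structure_score(g[~left_mask], h[~left_mask], lam)
    return score_left + score_right - score_parent

# hypothetical node with 5 samples; the split sends the first two samples to the left child
g = np.array([0.4, 0.5, -0.1, -0.3, -0.2])
h = np.array([1.0, 1.1, 0.9, 1.0, 0.8])
left_mask = np.array([True, True, False, False, False])
print(split_gain(g, h, left_mask, lam=1.0))   # positive gain: the split separates opposite-sign gradients
```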
You can now see that all of the new formulas XGBoost uses in its procedure (its distinctive fitting target, its distinctive splitting criterion, and its distinctive leaf output value) are obtained by minimizing the objective function. The whole XGBoost procedure therefore guarantees that the objective iterates in the direction of minimization, and the output value $w_j$ on every newly grown leaf is the value that minimizes the objective. Now you can answer the first question posed at the beginning.