In-Depth Analysis of Ensemble Learning: XGBoost (Continued)
2022-07-28 23:45:00 【「 25' h 」】
For XGBoost, the really difficult part is not laying out the algorithm flow above, but proving that this flow actually drives the model in the direction of minimizing the objective function. That proof has to answer the following obvious questions:

1. What is the quantity $r_{ik} = -\frac{g_{ik}}{h_{ik}}$ that we fit when building a tree, and why is fitting it meaningful?
2. How are the structure score and the structure score gain derived? Why does building trees this way improve the model?
3. Why is the leaf output value $w_j$ equal to $-\frac{\sum_{i \in j} g_{ik}}{\sum_{i \in j} h_{ik} + \lambda}$? What does this output mean?
4. The first part of the course says that XGBoost also fits residuals; where do the residuals appear?
First, recall the earlier definition: the objective function in XGBoost is defined for a single tree, not for a single sample or for the whole algorithm. The objective of any tree consists of three parts: the loss $l$, the number of leaves $T$, and the regularization terms. Specifically:

Suppose a single tree $f_k$ has objective $O_k$ and a total of $T$ leaves; the loss of any sample $i$ on the tree is $l(y_i, H(x_i))$, where $H(x_i)$ is the ensemble's prediction for sample $i$. There are $M$ samples on the tree, the objective uses L2 regularization ($\lambda \neq 0$, $\alpha = 0$), and $\gamma \neq 0$. Then the objective of this tree is:
$$O_k = \sum_{i=1}^M l(y_i, H_k(x_i)) + \gamma T + \frac{1}{2}\lambda\sum_{j=1}^T w_j^2$$
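As a concrete illustration, here is a minimal sketch (my own toy code, not XGBoost's internals) that evaluates this per-tree objective, assuming a squared-error loss and hypothetical leaf assignments and leaf weights:

```python
import numpy as np

def tree_objective(y, pred_prev, leaf_index, w, gamma, lam):
    """Evaluate O_k = sum_i l(y_i, H_k(x_i)) + gamma*T + 0.5*lambda*sum_j w_j^2
    for one tree, assuming the loss l(y, H) = (y - H)^2.

    y          : true labels, shape (M,)
    pred_prev  : previous ensemble predictions H_{k-1}(x_i), shape (M,)
    leaf_index : index of the leaf each sample falls into, shape (M,), values in [0, T)
    w          : output value of each leaf, shape (T,)
    """
    pred_new = pred_prev + w[leaf_index]          # H_k(x_i) = H_{k-1}(x_i) + f_k(x_i)
    loss = np.sum((y - pred_new) ** 2)            # sum_i l(y_i, H_k(x_i))
    T = len(w)                                    # number of leaves
    penalty = gamma * T + 0.5 * lam * np.sum(w ** 2)
    return loss + penalty

# toy usage with made-up numbers
y = np.array([1.0, 0.5, 2.0, 1.5])
pred_prev = np.array([0.8, 0.6, 1.5, 1.2])
leaf_index = np.array([0, 0, 1, 1])               # a tree with two leaves
w = np.array([0.05, 0.4])
print(tree_objective(y, pred_prev, leaf_index, w, gamma=1.0, lam=1.0))
```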
Our goal is to minimize this objective and to find the argument that minimizes it. For a boosting algorithm with an ordinary loss function, the ensemble output $H(x)$ keeps changing during iteration, and the loss $l(y, H(x))$ keeps shrinking across iterations:
$$H_k(x_i) = H_{k-1}(x_i) + f_k(x_i)$$

$$l_k = l(y_i, H_{k-1}(x_i) + f_k(x_i))$$
At iteration $k$, both $y_i$ and $H_{k-1}(x_i)$ in the loss are constants; only $f_k(x_i)$ is a variable. So we only need to differentiate with respect to $f_k(x_i)$ and find the prediction $f_k(x_i)$ that minimizes the overall loss. As we noted for GBDT, it does not matter what structure the weak learner $f_k$ has, what its splitting rules are, how it is built, or how it is fit: as long as its final output $f_k(x_i)$ is the value that minimizes the overall loss $L$, the loss is guaranteed to keep shrinking as the algorithm iterates. Therefore a suitable $f_k(x_i)$ not only guarantees continuously decreasing loss, it can also guide the construction of each individual learner.

In XGBoost we can likewise differentiate the objective and find the argument that minimizes it, but the problem is that XGBoost's objective contains multiple independent variables:
$$\begin{aligned} O_k &= \sum_{i=1}^M l(y_i,H_k(x_i)) + \gamma T + \frac{1}{2}\lambda\sum_{j=1}^T w_j^2 \\ &= \sum_{i=1}^M l\left( y_i, H_{k-1}(x_i) + \boldsymbol{\color{red}{f_k(x_i)}} \right) + \gamma \boldsymbol{\color{red}{T}} + \frac{1}{2}\lambda\sum_{j=1}^T \boldsymbol{\color{red}{w_j}}^2 \end{aligned}$$
Here $T$ is the total number of leaves on tree $k$, and $f_k(x_i)$ and $w_j$ are both predicted output values of the model (the values output on the leaves), just in different forms: for any sample $i$ on leaf $j$, numerically $f_k(x_i) = w_j$. For XGBoost only one variable can be chosen as the argument to optimize. Since $f_k(x_i)$ relates only to the accuracy on individual samples, while $T$ relates only to the tree structure, the XGBoost paper ultimately chooses $w_j$, the variable related to both accuracy and tree structure. Moreover, knowing the optimal leaf output $w_j$ can guide the tree toward growing a reasonable structure, whereas knowing only the total number of leaves $T$ cannot guide tree construction.
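To make the relationship $f_k(x_i) = w_j$ concrete, here is a small illustrative sketch (assumed toy names, not library code): a tree's prediction for a sample is simply a lookup of the weight stored on the leaf that sample falls into.

```python
import numpy as np

# hypothetical leaf weights for a tree with T = 3 leaves
w = np.array([-0.2, 0.1, 0.35])

# hypothetical assignment of samples to leaves (as produced by the tree's split rules)
leaf_index = np.array([2, 0, 0, 1, 2])

# f_k(x_i) is just the weight of the leaf containing x_i: f_k(x_i) = w_j
f_k = w[leaf_index]
print(f_k)   # [ 0.35 -0.2  -0.2   0.1   0.35]
```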
Therefore, the first step in solving XGBoost's objective is to rewrite the objective as an expression in $w_j$.

In our objective $O_k$, the part that can be Taylor-expanded is the first part, the loss:
$$O_k = \sum_{i=1}^M l\left( y_i, H_{k-1}(x_i) + f_k(x_i) \right) + \gamma T + \frac{1}{2}\lambda\sum_{j=1}^T w_j^2$$
Since the loss $l$ contains only one variable, $H_{k-1}(x_i) + f_k(x_i)$, the function can be abbreviated as $l(H_{k-1}(x_i) + f_k(x_i))$.

From the second-order Taylor expansion we know:
$$\begin{aligned} f(x) &\approx \sum_{n=0}^{2}\frac{f^{(n)}(a)}{n!}(x-a)^n \\ &\approx f(a) + \frac{f'(a)}{1!}(x-a) + \frac{f''(a)}{2!}(x-a)^2 \end{aligned}$$
In the Taylor expansion let $x = H_{k-1}(x_i) + f_k(x_i)$ and $a = H_{k-1}(x_i)$, so that $(x - a) = f_k(x_i)$. Accordingly, the loss $l(H_{k-1}(x_i) + f_k(x_i))$ can be written as:
$$l(H_{k-1}(x_i) + f_k(x_i)) \approx l(H_{k-1}(x_i)) + \frac{\partial l(H_{k-1}(x_i))}{\partial H_{k-1}(x_i)} \, f_k(x_i) + \frac{1}{2}\frac{\partial^2 l(H_{k-1}(x_i))}{\partial H_{k-1}^2(x_i)} \, f_k^2(x_i)$$
In XGBoost we have already defined the first and second derivatives of the loss:
$$g_{ik} = \frac{\partial l(y_i, H_{k-1}(x_i))}{\partial H_{k-1}(x_i)}$$

$$h_{ik} = \frac{\partial^2 l(y_i, H_{k-1}(x_i))}{\partial H_{k-1}^2(x_i)}$$
In the original XGBoost paper, $g_i$ and $h_i$ carry no subscript $k$ for brevity, even though we know that $g$ and $h$ must be recomputed at every iteration. We follow the paper and drop the subscript $k$ here as well. The Taylor-expanded formula can therefore be simplified to:
$$\begin{aligned} l(H_{k-1}(x_i) + f_k(x_i)) &\approx l(H_{k-1}(x_i)) + g_i f_k(x_i) + \frac{1}{2} h_i f_k^2(x_i) \\ &\approx \text{constant} + g_i f_k(x_i) + \frac{1}{2} h_i f_k^2(x_i) \end{aligned}$$
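As a sanity check of this second-order approximation, the sketch below (my own illustration) assumes a squared-error loss, for which $g_i = 2(H_{k-1}(x_i) - y_i)$ and $h_i = 2$, and compares the true loss $l(H_{k-1}(x_i) + f_k(x_i))$ with the expansion $l(H_{k-1}(x_i)) + g_i f_k(x_i) + \frac{1}{2} h_i f_k^2(x_i)$:

```python
def loss(y, pred):                       # l(y, H) = (y - H)^2
    return (y - pred) ** 2

y, H_prev, f = 1.0, 0.3, 0.5             # one sample: label, previous prediction, new tree output

g = 2.0 * (H_prev - y)                   # first derivative of l w.r.t. H_{k-1}(x_i)
h = 2.0                                  # second derivative of l w.r.t. H_{k-1}(x_i)

exact  = loss(y, H_prev + f)                          # l(H_{k-1} + f_k)
taylor = loss(y, H_prev) + g * f + 0.5 * h * f ** 2   # l(H_{k-1}) + g*f + 0.5*h*f^2

print(exact, taylor)   # identical here, because squared error is itself quadratic
```

For losses that are not quadratic (e.g. log loss), the two values are close rather than equal, which is exactly what the approximation sign above expresses.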
It is easy to see that $H_{k-1}(x_i)$ is a constant in this formula, so the first term $l(H_{k-1}(x_i))$ is also a constant. Constants cannot be minimized, so we can drop them from the objective. After the Taylor expansion, the objective becomes:
$$\begin{aligned} \tilde{O}_k &= \sum_{i=1}^M\left(g_i f_k(x_i) + \frac{1}{2}h_i f_k^2(x_i)\right) + \gamma T + \frac{1}{2}\lambda\sum_{j=1}^T w_j^2 \\ &= \sum_{i=1}^M g_i f_k(x_i) + \frac{1}{2}\sum_{i=1}^M h_i f_k^2(x_i) + \gamma T + \frac{1}{2}\lambda\sum_{j=1}^T w_j^2\end{aligned}$$
Now the first two terms of the objective are, respectively, the sum of $g_i f_k(x_i)$ over all samples and half the sum of $h_i f_k^2(x_i)$ over all samples. Don't forget that the only independent variable we chose is $w_j$, so we want to convert $f_k$ into $w_j$ somehow. As mentioned several times already, for any sample $i$ on leaf $j$ we have $f_k(x_i) = w_j$ numerically, so we can start the transformation from a single sample:

For a single sample $i$, suppose it sits on leaf $j$; then:
$$g_i f_k(x_i) = g_i w_j$$
For one leaf $j$, we can compute the sum of $g_i w_j$ over all samples on that leaf:
$$\sum_{i \in j} g_i w_j$$
Since $w_j$ is identical for all samples on the same leaf, the sum of $g_i w_j$ on one leaf can be rewritten as:
$$\begin{aligned}\sum_{i \in j} g_i w_j &= g_1 w_j + g_2 w_j + \cdots + g_n w_j, \quad \text{where } 1, 2, \ldots, n \text{ are the samples on leaf } j \\ &= w_j \sum_{i \in j} g_i\end{aligned}$$
Suppose there are $T$ leaves; then over the entire tree the sum of $g_i w_j$ across all samples is:
$$\sum_{j=1}^T \left( w_j\sum_{i \in j} g_i \right)$$

Therefore:

$$\sum_{i=1}^M g_i f_k(x_i) = \sum_{j=1}^T \left( w_j\sum_{i \in j} g_i \right)$$
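This identity is easy to check numerically. A minimal sketch, using hypothetical gradients, leaf assignments, and leaf weights:

```python
import numpy as np

g = np.array([0.4, -0.1, 0.2, -0.3, 0.5])     # hypothetical gradients g_i
leaf_index = np.array([0, 0, 1, 1, 1])        # which leaf each sample falls into
w = np.array([-0.25, 0.1])                    # hypothetical leaf weights w_j

# left-hand side: sum over samples of g_i * f_k(x_i), with f_k(x_i) = w[leaf of i]
lhs = np.sum(g * w[leaf_index])

# right-hand side: sum over leaves of w_j * (sum of g_i on that leaf)
G = np.bincount(leaf_index, weights=g)        # per-leaf gradient sums
rhs = np.sum(w * G)

print(lhs, rhs)    # the two values agree
```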
Similarly, $h_i f_k^2(x_i)$ for a single sample $i$ can be transformed in the same way. For a single sample:

$$h_i f_k^2(x_i) = h_i w_j^2$$

For one leaf:

$$\begin{aligned}\sum_{i \in j} h_i w_j^2 &= h_1 w_j^2 + h_2 w_j^2 + \cdots + h_n w_j^2, \quad \text{where } 1, 2, \ldots, n \text{ are the samples on leaf } j \\ &= w_j^2 \sum_{i \in j} h_i \end{aligned}$$

For the whole tree:
$$\sum_{i=1}^M h_i f_k^2(x_i) = \sum_{j=1}^T \left( w_j^2\sum_{i \in j} h_i \right)$$

So for the whole objective:
$$\begin{aligned} \tilde{O}_k &= \sum_{i=1}^M g_i f_k(x_i) + \frac{1}{2}\sum_{i=1}^M h_i f_k^2(x_i) + \gamma T + \frac{1}{2}\lambda\sum_{j=1}^T w_j^2 \\ &=\sum_{j=1}^T \left( w_j\sum_{i \in j} g_i + \frac{1}{2}w_j^2\sum_{i \in j} h_i \right) + \gamma T + \frac{1}{2}\lambda\sum_{j=1}^T w_j^2\end{aligned}$$
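Putting the two regrouped sums together, the sketch below (an illustration with assumed arrays and names, not library code) evaluates the Taylor-expanded objective in this per-leaf form:

```python
import numpy as np

def objective_per_leaf(g, h, leaf_index, w, gamma, lam):
    """Evaluate O~_k = sum_j ( w_j*G_j + 0.5*w_j^2*(H_j + lambda) ) + gamma*T,
    where G_j and H_j are the per-leaf sums of gradients and hessians."""
    G = np.bincount(leaf_index, weights=g)            # G_j = sum_{i in j} g_i
    H = np.bincount(leaf_index, weights=h)            # H_j = sum_{i in j} h_i
    T = len(w)
    return np.sum(w * G + 0.5 * w ** 2 * (H + lam)) + gamma * T

g = np.array([0.4, -0.1, 0.2, -0.3, 0.5])
h = np.array([1.0, 1.0, 0.8, 0.9, 1.1])
leaf_index = np.array([0, 0, 1, 1, 1])
w = np.array([-0.25, 0.1])
print(objective_per_leaf(g, h, leaf_index, w, gamma=1.0, lam=1.0))
```

(The regularization term is already merged into the per-leaf bracket here, anticipating the next step of the derivation.)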
It is easy to see that the regularization term can now be merged with the loss part:
$$\begin{aligned} &= \sum_{j=1}^T \left( w_j\sum_{i \in j} g_i + \frac{1}{2}w_j^2\sum_{i \in j} h_i + \frac{1}{2}\lambda w_j^2 \right) + \gamma T \\ &= \sum_{j=1}^T \left( w_j\sum_{i \in j} g_i + \frac{1}{2}w_j^2\left(\sum_{i \in j} h_i + \lambda\right) \right) + \gamma T\end{aligned}$$
After merging, the whole objective contains two terms: the sum of (loss + regularization) over all leaves, and the total number of leaves. Now we can start solving for the minimum of the objective and the corresponding optimal argument $w_j$.

First, we cannot simply minimize the total number of leaves in the objective: cutting the number of leaves too far would badly damage the model's ability to learn. So we can only try to make the sum of (loss + regularization) over all leaves as small as possible.

Second, once the tree is built, the leaves are independent of one another, so the (loss + regularization) on each leaf is also independent. Making (loss + regularization) smallest on every individual leaf guarantees that the sum over all leaves is smallest. Hence we need to minimize the part marked in red in the formula:
$$\tilde{O}_k = \sum_{j=1}^T \left( \boldsymbol{\color{red}{w_j\sum_{i \in j} g_i + \frac{1}{2}w_j^2\left(\sum_{i \in j} h_i + \lambda\right)}} \right) + \gamma T$$
Leaf weight $w_j$
Call the part marked in red $\mu_j$; it is the loss + regularization on leaf $j$. Then:
$$\mu_j = w_j\sum_{i \in j} g_i + \frac{1}{2}w_j^2\left(\sum_{i \in j} h_i + \lambda\right)$$
Now, for leaf $j$, differentiate $\mu_j$ with respect to its only argument $w_j$:
$$\begin{aligned}\frac{\partial \mu_j}{\partial w_j} &= \frac{\partial \left( w_j\sum_{i \in j} g_i + \frac{1}{2}w_j^2\left(\sum_{i \in j} h_i + \lambda\right) \right)}{\partial w_j} \\ &= \sum_{i \in j} g_i + w_j\left(\sum_{i \in j} h_i + \lambda\right)\end{aligned}$$
Setting the first derivative to 0:
$$\begin{aligned} \sum_{i \in j} g_i + w_j\left(\sum_{i \in j} h_i + \lambda\right) &= 0 \\ w_j\left(\sum_{i \in j} h_i + \lambda\right) &= -\sum_{i \in j} g_i \\ w_j &= -\frac{\sum_{i \in j} g_i}{\sum_{i \in j} h_i + \lambda}\end{aligned}$$
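A minimal sketch of this closed-form leaf weight (my own toy helper, not xgboost's API), computed from the gradients and hessians of the samples on one leaf:

```python
import numpy as np

def optimal_leaf_weight(g_leaf, h_leaf, lam):
    """w_j* = - sum(g_i) / (sum(h_i) + lambda) for the samples on one leaf."""
    return -np.sum(g_leaf) / (np.sum(h_leaf) + lam)

g_leaf = np.array([0.4, -0.1, 0.5])     # gradients of the samples on this leaf
h_leaf = np.array([1.0, 0.9, 1.1])      # hessians of the samples on this leaf
print(optimal_leaf_weight(g_leaf, h_leaf, lam=1.0))   # -0.8 / 4.0 = -0.2
```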
You should recognize this: for a single leaf, the $w_j$ that minimizes the objective is exactly the leaf weight we mentioned before, i.e. the leaf output value in XGBoost's mathematical procedure. If we want the leaf outputs to come out close to this leaf-weight formula, how should each individual sample be fit?

For any sample $i$ on leaf $j$:
$$\mu_i = w_j g_i + \frac{1}{2}w_j^2 h_i$$
When converting the per-leaf $\mu_j$ into a per-sample $\mu_i$, in principle every term of $\mu_j$ should be translated into its single-sample counterpart, but the regularization term causes trouble: unlike $\sum_{i \in j} g_i$, which decomposes directly into per-sample terms, $\lambda$ is a value set for a whole leaf. To turn $\lambda$ into a per-sample regularization term you would need to know how many samples sit on the current leaf. However, fitting happens before the tree is built, so the number of samples on a leaf is not yet known at that point. Therefore, in xgboost's actual implementation, the per-sample fitting target involves no regularization term; regularization is used only when computing structure scores and leaf output values.

Differentiating $\mu_i$ with respect to its only argument $w_j$:
$$\begin{aligned}\frac{\partial \mu_i}{\partial w_j} &= \frac{\partial \left( w_j g_i + \frac{1}{2}w_j^2 h_i \right)}{\partial w_j} \\ &= g_i + w_j h_i\end{aligned}$$
Setting the first derivative to 0:
$$\begin{aligned} g_i + w_j h_i &= 0 \\ w_j h_i &= -g_i \\ w_j &= -\frac{g_i}{h_i} \end{aligned}$$
So for any single sample $i$, the optimal $w_j$ that minimizes the objective is exactly our pseudo-residual $r_i$, i.e. the fitting target used in XGBoost's mathematical procedure.
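A quick sketch of this per-sample fitting target (my own illustration), again assuming a squared-error loss, where $g_i = 2(H_{k-1}(x_i) - y_i)$ and $h_i = 2$, so that $-g_i/h_i$ reduces to the ordinary residual $y_i - H_{k-1}(x_i)$, which answers the question of where the residuals appear:

```python
import numpy as np

y = np.array([1.0, 0.5, 2.0])
H_prev = np.array([0.8, 0.6, 1.5])       # predictions of the previous ensemble

g = 2.0 * (H_prev - y)                   # first derivatives for squared error
h = np.full_like(y, 2.0)                 # second derivatives for squared error

r = -g / h                               # pseudo-residuals fitted by the new tree
print(r)                                 # [ 0.2 -0.1  0.5] == y - H_prev
```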
Now substitute the optimal $w_j$ that minimizes the objective back into $\mu_j$ and see what $\mu_j$ becomes:
$$\begin{aligned} \mu_j &= w_j\sum_{i \in j} g_i + \frac{1}{2}w_j^2\left(\sum_{i \in j} h_i + \lambda\right) \\ &= -\frac{\sum_{i \in j} g_i}{\sum_{i \in j} h_i + \lambda} \cdot \sum_{i \in j} g_i + \frac{1}{2}\left(-\frac{\sum_{i \in j} g_i}{\sum_{i \in j} h_i + \lambda}\right)^2 \cdot \left(\sum_{i \in j} h_i + \lambda\right)\\ &= -\frac{(\sum_{i \in j} g_i)^2}{\sum_{i \in j} h_i + \lambda} + \frac{1}{2}\frac{(\sum_{i \in j} g_i)^2}{\sum_{i \in j} h_i + \lambda} \\ &= -\frac{1}{2}\frac{(\sum_{i \in j} g_i)^2}{\sum_{i \in j} h_i + \lambda} \end{aligned}$$
Therefore the objective (the loss over all leaves) becomes:
$$\begin{aligned} \tilde{O}_k &= \sum_{j=1}^T \left( \boldsymbol{\color{red}{w_j\sum_{i \in j} g_i + \frac{1}{2}w_j^2\left(\sum_{i \in j} h_i + \lambda\right)}} \right) + \gamma T \\ &= \sum_{j=1}^T \left( -\frac{1}{2}\frac{(\sum_{i \in j} g_i)^2}{\sum_{i \in j} h_i + \lambda} \right) + \gamma T \end{aligned}$$
So the objective on a single leaf is:
$$O_j = -\frac{1}{2}\frac{(\sum_{i \in j} g_i)^2}{\sum_{i \in j} h_i + \lambda} + \gamma$$
For any leaf, this objective measures the quality of the leaf. Since $\gamma$ is a hyperparameter we set and $\frac{1}{2}$ is a constant, for any leaf we want the part marked in red to be as small as possible:
$$O_j = \frac{1}{2}\left( \boldsymbol{\color{red}{-\frac{(\sum_{i \in j} g_i)^2}{\sum_{i \in j} h_i + \lambda}}} \right) + \gamma$$
Equivalently, we want the following quantity to be as large as possible:
$$\frac{(\sum_{i \in j} g_i)^2}{\sum_{i \in j} h_i + \lambda}$$
This quantity is XGBoost's criterion for splitting: the "structure score" (Structure Score).
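A minimal sketch of the structure score for one leaf (assumed helper name, not xgboost's API):

```python
import numpy as np

def structure_score(g_leaf, h_leaf, lam):
    """Structure score of one leaf: (sum g_i)^2 / (sum h_i + lambda).
    Larger is better when comparing candidate tree structures."""
    return np.sum(g_leaf) ** 2 / (np.sum(h_leaf) + lam)

g_leaf = np.array([0.4, -0.1, 0.5])
h_leaf = np.array([1.0, 0.9, 1.1])
print(structure_score(g_leaf, h_leaf, lam=1.0))   # 0.8**2 / 4.0 = 0.16
```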
When splitting, we want the objective to be as small as possible, so a split is good when the parent node's objective is larger than the children's. We can therefore use (parent objective minus the sum of the child objectives) to measure the quality of a split:
$$\begin{aligned} Gain &= O_p - (O_l + O_r) \\ &= -\frac{1}{2}\frac{(\sum_{i \in P} g_i)^2}{\sum_{i \in P} h_i + \lambda} + \gamma - \left(-\frac{1}{2}\frac{(\sum_{i \in L} g_i)^2}{\sum_{i \in L} h_i + \lambda} + \gamma - \frac{1}{2}\frac{(\sum_{i \in R} g_i)^2}{\sum_{i \in R} h_i + \lambda} + \gamma\right) \\ &= -\frac{1}{2}\frac{(\sum_{i \in P} g_i)^2}{\sum_{i \in P} h_i + \lambda} + \gamma + \frac{1}{2}\frac{(\sum_{i \in L} g_i)^2}{\sum_{i \in L} h_i + \lambda} - \gamma + \frac{1}{2}\frac{(\sum_{i \in R} g_i)^2}{\sum_{i \in R} h_i + \lambda} - \gamma \\ &= \frac{1}{2}\left( \frac{(\sum_{i \in L} g_i)^2}{\sum_{i \in L} h_i + \lambda} + \frac{(\sum_{i \in R} g_i)^2}{\sum_{i \in R} h_i + \lambda} - \frac{(\sum_{i \in P} g_i)^2}{\sum_{i \in P} h_i + \lambda} \right) - \gamma \\ &= \frac{1}{2} (Score_L + Score_R - Score_P) - \gamma \end{aligned}$$
among , γ \gamma γ It is a super parameter that can be set , 1 2 \frac{1}{2} 21 Constant , therefore :
$$Gain = Score_L + Score_R - Score_P$$
This is the structure score gain we use when splitting.
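Combining the pieces, the sketch below scores a candidate split the way the derivation suggests: compute the structure score of the parent and of the two children and take the difference (the simplified form above, without the $\frac{1}{2}$ and $\gamma$ constants; xgboost's own implementation keeps them). All names and numbers here are my own toy illustration:

```python
import numpy as np

def structure_score(g, h, lam):
    return np.sum(g) ** 2 / (np.sum(h) + lam)

def split_gain(g, h, left_mask, lam):
    """Gain = Score_L + Score_R - Score_P for a candidate split of one node."""
    score_parent = structure_score(g, h, lam)
    score_left = structure_score(g[left_mask], h[left_mask], lam)
    score_right = structure_score(g[~left_mask], h[~left_mask], lam)
    return score_left + score_right - score_parent

# hypothetical node with 5 samples; the split sends the first two samples to the left child
g = np.array([0.4, 0.5, -0.1, -0.3, -0.2])
h = np.array([1.0, 1.1, 0.9, 1.0, 0.8])
left_mask = np.array([True, True, False, False, False])
print(split_gain(g, h, left_mask, lam=1.0))   # positive gain: the split separates opposite-sign gradients
```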
You can now see that all of the new formulas XGBoost uses in its procedure (its distinctive fitting target, its distinctive splitting criterion, and its distinctive leaf output value) are obtained by minimizing the objective function. The whole XGBoost procedure therefore guarantees that the objective iterates in the direction of minimization, and the output value $w_j$ on every newly grown leaf is the value that minimizes the objective. Now you can answer the first question posed at the beginning.