In-Depth Analysis of Ensemble Learning: XGBoost (Continued)
2022-07-28 23:45:00 【「 25' h 」】
For XGBoost, the truly difficult part is not laying out the algorithm procedure above, but proving that this procedure drives the model in the direction that minimizes the objective function. This proof involves the following obvious questions:
1. When building a tree, what is the quantity being fitted, $r_{ik} = -\frac{g_{ik}}{h_{ik}}$? What is the significance of fitting it?
2. How are the formulas for the structure score and the structure score gain derived? Why does building trees this way improve the model?
3. Why is the output value of a leaf node $w_j$ equal to $-\frac{\sum_{i \in j} g_{ik}}{\sum_{i \in j} h_{ik} + \lambda}$? What does this output mean?
4. The first part of this course said that XGBoost also fits residuals. Where are the residuals?
First, according to the earlier definition, the objective function in XGBoost is defined for a single tree, not for a single sample or for the whole ensemble. Moreover, the objective function of any tree consists of three parts: the loss function $l$, the number of leaves $T$, and the regularization term. Specifically:
Suppose a single tree $f_k$ has objective function $O_k$ and a total of $T$ leaves, and the loss for any sample $i$ on the tree is $l(y_i, H(x_i))$, where $H(x_i)$ is the ensemble's prediction for sample $i$. Suppose the tree carries $M$ samples, L2 regularization is used in the objective ($\lambda \neq 0$, $\alpha = 0$), and $\gamma \neq 0$. Then the objective function of this tree is:
$$O_k = \sum_{i=1}^M l(y_i, H_k(x_i)) + \gamma T + \frac{1}{2}\lambda\sum_{j=1}^T w_j^2$$
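To make the three pieces of this objective concrete, below is a toy sketch (my own illustration, not from the original text) that evaluates $O_k$ for a small tree under squared-error loss; the numbers and the names `tree_objective`, `gamma`, `reg_lambda` are made up for this example:

```python
import numpy as np

def tree_objective(y, pred, leaf_values, gamma=0.1, reg_lambda=1.0):
    """O_k = sum of per-sample losses + gamma * T + 1/2 * lambda * sum(w_j^2)."""
    loss = np.sum(0.5 * (y - pred) ** 2)               # squared-error loss, summed over the M samples
    T = len(leaf_values)                               # number of leaves on the tree
    reg = 0.5 * reg_lambda * np.sum(leaf_values ** 2)  # L2 penalty on the leaf outputs w_j
    return loss + gamma * T + reg

y = np.array([3.0, -1.0, 2.5, 0.0])
pred = np.array([2.8, -0.6, 3.0, 0.2])    # H_k(x_i): predictions including the k-th tree
leaf_values = np.array([0.4, -0.7])       # w_j for the T = 2 leaves of tree k
O_k = tree_objective(y, pred, leaf_values)
```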
Our goal is to minimize the objective function and to find the value of the free variable that achieves this minimum. For a Boosting algorithm with an ordinary loss function, the output $H(x)$ keeps changing as the algorithm iterates, and the loss $l(y, H(x))$ keeps shrinking across iterations:
$$H_k(x_i) = H_{k-1}(x_i) + f_k(x_i)$$

$$l_k = l(y_i, H_{k-1}(x_i) + f_k(x_i))$$
At iteration $k$, both $y_i$ and $H_{k-1}(x_i)$ in the loss are constants; only $f_k(x_i)$ is a variable. We therefore only need to differentiate with respect to $f_k(x_i)$ and find the prediction $f_k(x_i)$ that minimizes the overall loss. As mentioned for GBDT, no matter what structure the weak learner $f_k$ has, what rules it follows, or how it is built and fitted, as long as its final output $f_k(x_i)$ is the value that minimizes the overall loss $L$, the loss is guaranteed to keep shrinking as the algorithm iterates. A suitable $f_k(x_i)$ therefore not only ensures that the loss keeps decreasing, it can also guide the construction of the individual learner.
In XGBoost we can likewise differentiate the objective function and look for the value of a variable that minimizes it. The problem is that the XGBoost objective contains several free variables:
$$\begin{aligned} O_k &= \sum_{i=1}^M l(y_i, H_k(x_i)) + \gamma T + \frac{1}{2}\lambda\sum_{j=1}^T w_j^2 \\ &= \sum_{i=1}^M l\left(y_i, H_{k-1}(x_i) + \boldsymbol{\color{red}{f_k(x_i)}}\right) + \gamma \boldsymbol{\color{red}T} + \frac{1}{2}\lambda\sum_{j=1}^T \boldsymbol{\color{red}{w_j}}^2 \end{aligned}$$
Here $T$ is the total number of leaves on the $k$-th tree, while $f_k(x_i)$ and $w_j$ are both predicted values output by the model (the values on the leaves), just written in different forms: for any sample $i$ on leaf $j$, numerically $f_k(x_i) = w_j$. For XGBoost, only one of them can be chosen as the free variable. Since $f_k(x_i)$ relates only to the accuracy on a single sample and $T$ relates only to the tree structure, the XGBoost paper ultimately chooses $w_j$, the variable related both to accuracy and to the tree structure. Moreover, knowing the optimal leaf output $w_j$ can guide the tree to grow into a reasonable structure, whereas knowing only the total number of leaves $T$ cannot guide tree construction.
Therefore, the first step in solving the XGBoost objective is to rearrange it into an expression in terms of $w_j$.
In our objective $O_k$, the part that can be Taylor-expanded is the loss term:
$$O_k = \sum_{i=1}^M l\left(y_i, H_{k-1}(x_i) + f_k(x_i)\right) + \gamma T + \frac{1}{2}\lambda\sum_{j=1}^T w_j^2$$
Since the loss $l$ here depends on only one variable, $H_{k-1}(x_i) + f_k(x_i)$, it can be abbreviated as $l(H_{k-1}(x_i) + f_k(x_i))$. From the second-order Taylor expansion we know:
$$\begin{aligned} f(x) &\approx \sum_{n=0}^{2}\frac{f^{(n)}(a)}{n!}(x-a)^n \\ &= f(a) + \frac{f'(a)}{1!}(x-a) + \frac{f''(a)}{2!}(x-a)^2 \end{aligned}$$
In the Taylor expansion, let $x = H_{k-1}(x_i) + f_k(x_i)$ and $a = H_{k-1}(x_i)$, so that $(x - a) = f_k(x_i)$. Accordingly, the loss $l(H_{k-1}(x_i) + f_k(x_i))$ can be written as:
$$l(H_{k-1}(x_i) + f_k(x_i)) \approx l(H_{k-1}(x_i)) + \frac{\partial l(H_{k-1}(x_i))}{\partial H_{k-1}(x_i)} \, f_k(x_i) + \frac{1}{2}\frac{\partial^2 l(H_{k-1}(x_i))}{\partial H_{k-1}^2(x_i)} \, f_k^2(x_i)$$
In XGBoost we have already defined the first and second derivatives of the loss function:
$$g_{ik} = \frac{\partial l(y_i, H_{k-1}(x_i))}{\partial H_{k-1}(x_i)}$$

$$h_{ik} = \frac{\partial^2 l(y_i, H_{k-1}(x_i))}{\partial H_{k-1}^2(x_i)}$$
In the original XGBoost paper, for brevity, $g_i$ and $h_i$ carry no subscript $k$, but remember that $g$ and $h$ must be recomputed at every iteration. Here we follow the paper and drop the subscript $k$ as well. The Taylor-expanded formula therefore simplifies to:
$$\begin{aligned} l(H_{k-1}(x_i) + f_k(x_i)) &\approx l(H_{k-1}(x_i)) + g_i f_k(x_i) + \frac{1}{2}h_i f_k^2(x_i) \\ &= \text{constant} + g_i f_k(x_i) + \frac{1}{2}h_i f_k^2(x_i) \end{aligned}$$
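To make $g_i$ and $h_i$ concrete, here is a minimal NumPy sketch of my own (not part of the original derivation) showing the per-sample first and second derivatives for two common losses, evaluated at the current prediction $H_{k-1}(x_i)$:

```python
import numpy as np

def grad_hess_squared(y, pred):
    """g_i and h_i for squared-error loss l = 1/2 * (y - pred)^2."""
    g = pred - y              # first derivative w.r.t. the prediction
    h = np.ones_like(y)       # second derivative is the constant 1
    return g, h

def grad_hess_logistic(y, raw_pred):
    """g_i and h_i for logistic loss; raw_pred is the margin before the sigmoid."""
    p = 1.0 / (1.0 + np.exp(-raw_pred))
    g = p - y                 # first derivative
    h = p * (1.0 - p)         # second derivative
    return g, h

y = np.array([1.0, 0.0, 1.0])
H_prev = np.array([0.3, -0.2, 0.8])   # H_{k-1}(x_i): predictions before the k-th tree
g, h = grad_hess_logistic(y, H_prev)
```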
Notice that $H_{k-1}(x_i)$ is a constant, so the first term $l(H_{k-1}(x_i))$ is also a constant. Constants cannot be minimized, so they can be dropped from the objective. After Taylor expansion, the objective becomes:
$$\begin{aligned} \tilde{O}_k &= \sum_{i=1}^M\left(g_i f_k(x_i) + \frac{1}{2}h_i f_k^2(x_i)\right) + \gamma T + \frac{1}{2}\lambda\sum_{j=1}^T w_j^2 \\ &= \sum_{i=1}^M g_i f_k(x_i) + \frac{1}{2}\sum_{i=1}^M h_i f_k^2(x_i) + \gamma T + \frac{1}{2}\lambda\sum_{j=1}^T w_j^2\end{aligned}$$
The first two terms of the objective are now the sum over all samples of $g_i f_k(x_i)$, and half the sum over all samples of $h_i f_k^2(x_i)$. Remember that the free variable we chose is $w_j$, so we would like to convert $f_k$ into $w_j$ somehow. As mentioned several times, for any sample $i$ on leaf $j$, numerically $f_k(x_i) = w_j$, so we can start the transformation from a single sample:
For a single sample $i$, suppose it lies on leaf $j$. Then:
$$g_i f_k(x_i) = g_i w_j$$
For a leaf $j$, we can sum $g_i w_j$ over all samples on that leaf:
$$\sum_{i \in j} g_i w_j$$
Since $w_j$ is the same for every sample on a leaf, the sum of $g_i w_j$ over a leaf can be rewritten as:
$$\begin{aligned}\sum_{i \in j} g_i w_j &= g_1 w_j + g_2 w_j + \dots + g_n w_j, \quad \text{where } 1, 2, \dots, n \text{ are the samples on leaf } j \\ &= w_j\sum_{i \in j} g_i\end{aligned}$$
Suppose there are $T$ leaves in total. Then the sum of $g_i w_j$ over all samples on the whole tree is:
$$\sum_{j=1}^T \left( w_j\sum_{i \in j} g_i \right)$$
Therefore:

$$\sum_{i=1}^M g_i f_k(x_i) = \sum_{j=1}^T \left( w_j\sum_{i \in j} g_i \right)$$
Similarly, $h_i f_k^2(x_i)$ for a single sample $i$ can be transformed in the same way. For a single sample:
$$h_i f_k^2(x_i) = h_i w_j^2$$
On a leaf:
$$\begin{aligned}\sum_{i \in j} h_i w_j^2 &= h_1 w_j^2 + h_2 w_j^2 + \dots + h_n w_j^2, \quad \text{where } 1, 2, \dots, n \text{ are the samples on leaf } j \\ &= w_j^2\sum_{i \in j} h_i \end{aligned}$$
On the whole tree:

$$\sum_{i=1}^M h_i f_k^2(x_i) = \sum_{j=1}^T \left( w_j^2\sum_{i \in j} h_i \right)$$
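As a quick sanity check of this regrouping (a toy example of my own, with made-up gradients and leaf assignments): summing $g_i f_k(x_i)$ and $h_i f_k^2(x_i)$ over samples gives exactly the same numbers as summing $w_j \sum_{i \in j} g_i$ and $w_j^2 \sum_{i \in j} h_i$ over leaves.

```python
import numpy as np

# hypothetical tiny tree: 5 samples assigned to T = 2 leaves, with made-up g, h, w
leaf_of_sample = np.array([0, 0, 1, 1, 1])   # which leaf each sample falls into
w = np.array([0.4, -0.7])                    # leaf output values w_j
g = np.array([0.2, -0.1, 0.5, 0.3, -0.4])
h = np.array([1.0, 1.0, 0.8, 1.2, 0.9])

f_k = w[leaf_of_sample]                      # f_k(x_i) = w_j for every sample on leaf j

# sample-wise sums (left-hand sides)
sum_g_samples = np.sum(g * f_k)
sum_h_samples = np.sum(h * f_k ** 2)

# leaf-wise sums (right-hand sides)
sum_g_leaves = sum(w[j] * g[leaf_of_sample == j].sum() for j in range(len(w)))
sum_h_leaves = sum(w[j] ** 2 * h[leaf_of_sample == j].sum() for j in range(len(w)))

assert np.isclose(sum_g_samples, sum_g_leaves)
assert np.isclose(sum_h_samples, sum_h_leaves)
```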
So, for the whole objective function:
$$\begin{aligned} \tilde{O}_k &= \sum_{i=1}^M g_i f_k(x_i) + \frac{1}{2}\sum_{i=1}^M h_i f_k^2(x_i) + \gamma T + \frac{1}{2}\lambda\sum_{j=1}^T w_j^2 \\ &= \sum_{j=1}^T \left( w_j\sum_{i \in j} g_i + \frac{1}{2}w_j^2\sum_{i \in j} h_i \right) + \gamma T + \frac{1}{2}\lambda\sum_{j=1}^T w_j^2\end{aligned}$$
It is easy to see that the regularization term can now be merged into the loss part:
$$\begin{aligned} \tilde{O}_k &= \sum_{j=1}^T \left( w_j\sum_{i \in j} g_i + \frac{1}{2}w_j^2\sum_{i \in j} h_i + \frac{1}{2}\lambda w_j^2 \right) + \gamma T \\ &= \sum_{j=1}^T \left( w_j\sum_{i \in j} g_i + \frac{1}{2}w_j^2\left(\sum_{i \in j} h_i + \lambda\right) \right) + \gamma T\end{aligned}$$
After merging, the objective consists of two terms: the sum over all leaves of (loss + regularization), and the total number of leaves. Now we can solve for the minimum of the objective and the corresponding optimal variable $w_j$.
First, we cannot minimize the objective by minimizing the total number of leaves: cutting the leaf count excessively would severely damage the model's learning ability. So we can only try to make the sum of (loss + regularization) over all leaves as small as possible.
Second, once the tree is built, the leaves are independent of one another, so the (loss + regularization) terms on different leaves are independent as well. Making each leaf's (loss + regularization) as small as possible guarantees that the sum over all leaves is as small as possible. Hence we need to minimize the part marked in red in the formula:
$$\tilde{O}_k = \sum_{j=1}^T \left( \boldsymbol{\color{red}{w_j\sum_{i \in j} g_i + \frac{1}{2}w_j^2\left(\sum_{i \in j} h_i + \lambda\right)}} \right) + \gamma T$$
Leaf weight $w_j$
Call the part marked in red $\mu_j$; it denotes the loss + regularization on leaf $j$. Then:
$$\mu_j = w_j\sum_{i \in j} g_i + \frac{1}{2}w_j^2\left(\sum_{i \in j} h_i + \lambda\right)$$
Now, for leaf $j$, differentiate $\mu_j$ with respect to its only variable $w_j$:
$$\begin{aligned}\frac{\partial \mu_j}{\partial w_j} &= \frac{\partial}{\partial w_j}\left( w_j\sum_{i \in j} g_i + \frac{1}{2}w_j^2\left(\sum_{i \in j} h_i + \lambda\right) \right) \\ &= \sum_{i \in j} g_i + w_j\left(\sum_{i \in j} h_i + \lambda\right)\end{aligned}$$
Setting the first derivative to zero gives:
$$\begin{aligned} \sum_{i \in j} g_i + w_j\left(\sum_{i \in j} h_i + \lambda\right) &= 0 \\ w_j\left(\sum_{i \in j} h_i + \lambda\right) &= -\sum_{i \in j} g_i \\ w_j &= -\frac{\sum_{i \in j} g_i}{\sum_{i \in j} h_i + \lambda}\end{aligned}$$
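As a minimal sketch (my own; `leaf_weight` and `reg_lambda` are names invented for the illustration), the optimal leaf output can be computed directly from the $g$ and $h$ values of the samples that fall on the leaf:

```python
import numpy as np

def leaf_weight(g_leaf, h_leaf, reg_lambda=1.0):
    """Optimal leaf output w_j = -sum(g_i) / (sum(h_i) + lambda)."""
    return -np.sum(g_leaf) / (np.sum(h_leaf) + reg_lambda)

# g and h of the samples that fell into one leaf (made-up numbers)
g_leaf = np.array([0.2, -0.1, 0.5])
h_leaf = np.array([1.0, 0.8, 1.2])
w_j = leaf_weight(g_leaf, h_leaf, reg_lambda=1.0)
```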
You should recognize that, for a leaf, the $w_j$ that minimizes the objective is exactly the leaf weight mentioned earlier, i.e., the leaf output value in the XGBoost derivation. If we want the leaf outputs to come out very close to this leaf weight formula, what should each sample fit?
For any sample $i$ on leaf $j$:
$$\mu_i = w_j g_i + \frac{1}{2}w_j^2 h_i$$
When breaking the per-leaf $\mu_j$ down into a per-sample $\mu_i$, each term of $\mu_j$ should in principle be converted into its per-sample counterpart. The regularization term, however, is a problem: while $\sum_{i \in j} g_i$ maps directly onto per-sample terms, $\lambda$ is a value set per leaf, and converting it into a per-sample regularization term would require knowing how many samples the current leaf contains. But fitting happens before the tree is built, so the number of samples on a leaf is unknown at that point. Consequently, in the actual implementation of xgboost, the per-sample fitting target does not involve the regularization term; the regularization term is only used when computing structure scores and leaf output values.
Differentiating $\mu_i$ with respect to its only variable $w_j$:
$$\begin{aligned}\frac{\partial \mu_i}{\partial w_j} &= \frac{\partial}{\partial w_j}\left( w_j g_i + \frac{1}{2}w_j^2 h_i \right) \\ &= g_i + w_j h_i\end{aligned}$$
Setting the first derivative to zero gives:
$$\begin{aligned} g_i + w_j h_i &= 0 \\ w_j h_i &= -g_i \\ w_j &= -\frac{g_i}{h_i} \end{aligned}$$
For any sample $i$, the optimal $w_j$ that minimizes the objective is exactly our pseudo-residual $r_i$, i.e., the fitting target used in the XGBoost derivation.
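Here is a short sketch of my own that computes this per-sample fitting target $r_i = -g_i / h_i$. Note that for squared-error loss ($g_i = H_{k-1}(x_i) - y_i$, $h_i = 1$) it reduces to the ordinary residual $y_i - H_{k-1}(x_i)$, which is where the "residual" asked about in question 4 reappears:

```python
import numpy as np

y = np.array([3.0, -1.0, 2.5])
H_prev = np.array([2.2, -0.4, 3.1])    # H_{k-1}(x_i): predictions before the k-th tree

# squared-error loss: g_i = H_prev - y, h_i = 1
g = H_prev - y
h = np.ones_like(y)

r = -g / h                             # pseudo-residual r_i = -g_i / h_i
assert np.allclose(r, y - H_prev)      # equals the plain residual for squared-error loss
```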
Now, substitute the optimal $w_j$ that minimizes the objective back into $\mu_j$ and see what $\mu_j$ becomes:
$$\begin{aligned} \mu_j &= w_j\sum_{i \in j} g_i + \frac{1}{2}w_j^2\left(\sum_{i \in j} h_i + \lambda\right) \\ &= -\frac{\sum_{i \in j} g_i}{\sum_{i \in j} h_i + \lambda} \cdot \sum_{i \in j} g_i + \frac{1}{2}\left(-\frac{\sum_{i \in j} g_i}{\sum_{i \in j} h_i + \lambda}\right)^2 \cdot \left(\sum_{i \in j} h_i + \lambda\right) \\ &= -\frac{(\sum_{i \in j} g_i)^2}{\sum_{i \in j} h_i + \lambda} + \frac{1}{2}\frac{(\sum_{i \in j} g_i)^2}{\sum_{i \in j} h_i + \lambda} \\ &= -\frac{1}{2}\frac{(\sum_{i \in j} g_i)^2}{\sum_{i \in j} h_i + \lambda} \end{aligned}$$
Therefore, the objective function (the loss over all leaves) becomes:
$$\begin{aligned} \tilde{O}_k &= \sum_{j=1}^T \left( \boldsymbol{\color{red}{w_j\sum_{i \in j} g_i + \frac{1}{2}w_j^2\left(\sum_{i \in j} h_i + \lambda\right)}} \right) + \gamma T \\ &= \sum_{j=1}^T \left( -\frac{1}{2}\frac{(\sum_{i \in j} g_i)^2}{\sum_{i \in j} h_i + \lambda} \right) + \gamma T \end{aligned}$$
And the objective function on a single leaf is:
$$O_j = -\frac{1}{2}\frac{(\sum_{i \in j} g_i)^2}{\sum_{i \in j} h_i + \lambda} + \gamma$$
For any leaf, this objective measures how good the leaf is. Since $\gamma$ is a hyperparameter we set and $\frac{1}{2}$ is a constant, for any leaf we want the part marked in red to be as small as possible:
$$O_j = \frac{1}{2}\left( \boldsymbol{\color{red}{-\frac{(\sum_{i \in j} g_i)^2}{\sum_{i \in j} h_i + \lambda}}} \right) + \gamma$$
Equivalently, we want the following quantity to be as large as possible:
$$\frac{(\sum_{i \in j} g_i)^2}{\sum_{i \in j} h_i + \lambda}$$
This quantity is the criterion XGBoost uses when splitting: the structure score (Structure Score).
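A minimal sketch of my own for the structure score of a node (again with `reg_lambda` as an assumed parameter name); the larger the score, the better the samples on that node agree with a single output value:

```python
import numpy as np

def structure_score(g_node, h_node, reg_lambda=1.0):
    """Structure score (sum g_i)^2 / (sum h_i + lambda) over the samples on one node."""
    G = np.sum(g_node)
    H = np.sum(h_node)
    return G * G / (H + reg_lambda)
```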
When splitting, we want the objective to become smaller, so during a split the parent node's objective should be larger than the sum of the child nodes' objectives. We can therefore measure the quality of a split with (parent objective − sum of child objectives):
$$\begin{aligned} Gain &= O_p - (O_l + O_r) \\ &= -\frac{1}{2}\frac{(\sum_{i \in P} g_i)^2}{\sum_{i \in P} h_i + \lambda} + \gamma - \left(-\frac{1}{2}\frac{(\sum_{i \in L} g_i)^2}{\sum_{i \in L} h_i + \lambda} + \gamma - \frac{1}{2}\frac{(\sum_{i \in R} g_i)^2}{\sum_{i \in R} h_i + \lambda} + \gamma\right) \\ &= -\frac{1}{2}\frac{(\sum_{i \in P} g_i)^2}{\sum_{i \in P} h_i + \lambda} + \gamma + \frac{1}{2}\frac{(\sum_{i \in L} g_i)^2}{\sum_{i \in L} h_i + \lambda} - \gamma + \frac{1}{2}\frac{(\sum_{i \in R} g_i)^2}{\sum_{i \in R} h_i + \lambda} - \gamma \\ &= \frac{1}{2}\left( \frac{(\sum_{i \in L} g_i)^2}{\sum_{i \in L} h_i + \lambda} + \frac{(\sum_{i \in R} g_i)^2}{\sum_{i \in R} h_i + \lambda} - \frac{(\sum_{i \in P} g_i)^2}{\sum_{i \in P} h_i + \lambda} \right) - \gamma \\ &= \frac{1}{2}\left(Score_L + Score_R - Score_P\right) - \gamma \end{aligned}$$
Since $\gamma$ is a hyperparameter we set and $\frac{1}{2}$ is a constant, we can simply use:
$$Gain = Score_L + Score_R - Score_P$$
This is the structure score gain used when splitting.
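Putting the pieces together, here is a small sketch of my own that evaluates one candidate split with the simplified gain $Score_L + Score_R - Score_P$ (the actual xgboost library additionally applies the $\frac{1}{2}$ factor and subtracts $\gamma$, as in the longer formula above):

```python
import numpy as np

def structure_score(g_node, h_node, reg_lambda=1.0):
    """Structure score (sum g_i)^2 / (sum h_i + lambda) of a set of samples."""
    G, H = np.sum(g_node), np.sum(h_node)
    return G * G / (H + reg_lambda)

def split_gain(g, h, left_mask, reg_lambda=1.0):
    """Simplified gain Score_L + Score_R - Score_P for one candidate split."""
    return (structure_score(g[left_mask], h[left_mask], reg_lambda)
            + structure_score(g[~left_mask], h[~left_mask], reg_lambda)
            - structure_score(g, h, reg_lambda))

# made-up g, h for the samples in the parent node, and one candidate split
g = np.array([0.2, -0.1, 0.5, 0.3, -0.4])
h = np.array([1.0, 1.0, 0.8, 1.2, 0.9])
left_mask = np.array([True, True, False, False, False])
gain = split_gain(g, h, left_mask)    # larger gain means a better candidate split
```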
You can now see that all the new formulas used by XGBoost (the unique fitting target, the unique splitting criterion, and the unique output value) are obtained by minimizing the objective function. The whole XGBoost procedure therefore guarantees that the objective iterates toward its minimum, and the output value $w_j$ on every newly grown leaf is the value that minimizes the objective. Now you can answer the first question.