Paper notes: generalized random forests
2022-06-25 16:36:00 【#Super Pig】
These notes skip the asymptotic proofs for now and focus on GRF's prediction and splitting methods.
ref:
- Zhihu blog: https://zhuanlan.zhihu.com/p/448524822
- S. Athey, J. Tibshirani, and S. Wager, “Generalized random forests,” Ann. Statist., vol. 47, no. 2, Apr. 2019, doi: 10.1214/18-AOS1709.
Motivation
This paper aims at a general forest-based estimation method, i.e., a generalization of the random forest; this is also the work's main contribution. Concretely, the general objective proposed is:
$$\mathbb{E}\big[\Psi_{\theta(x),\nu(x)}(O_i)\mid X_i=x\big]=0 \tag{1}$$
where $\Psi(\cdot)$ is the scoring function (it can be understood as the loss function or the optimization objective), $\theta(x)$ is the quantity we wish to estimate, $\nu(x)$ is an optional nuisance parameter, and $O_i$ is the observation relevant to $\theta(x)$. The goal is to build a forest such that Eq. (1) holds.
To achieve this goal (Eq. (1)), we need to solve the following optimization problem:
$$(\hat\theta(x),\hat\nu(x))=\arg\min_{\theta,\nu}\Big\|\sum_{i=1}^{n}\alpha_i(x)\cdot\Psi_{\theta,\nu}(O_i)\Big\|_2 \tag{2}$$
The optimal solution $\hat\theta(x)$ of this problem is our estimate.
As for $\alpha_i(x)$, it represents the similarity between training sample $i$ and the test point (denoted by $x$) and acts as a weight; it is computed as follows:
$$\alpha_i(x)=\frac{1}{B}\cdot\sum_{b=1}^{B}\alpha_{bi}(x) \tag{3}$$
$$\alpha_{bi}(x)=\frac{\mathbf{1}(\{X_i\in L_b(x)\})}{|L_b(x)|} \tag{4}$$
where $B$ is the number of trees, $b$ indexes the $b$-th tree, and $L_b(x)$ is the set of training samples that fall in the same leaf as the test point $x$ in tree $b$. Thus $\alpha_{bi}(x)$ measures how often sample $i$ falls in the same leaf as $x$ in tree $b$, normalized by the leaf size (this frequency reflects similarity). Note that $\sum_{i=1}^{n}\alpha_i(x)=1$.
In summary, Eq. (1) and Eq. (2) are essentially equivalent. After the forest is built, predictions can be made from Eq. (1) or Eq. (2); these two formulas are the core of GRF, and many statistical problems (e.g., least squares, maximum likelihood, quantile regression) can be seen as special cases of Eq. (1).
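To make Eqs. (2)–(4) concrete, here is a minimal sketch (my own illustration, not the paper's code) of extracting the weights $\alpha_i(x)$ from a fitted forest via leaf co-membership. scikit-learn's `RandomForestRegressor` and its `apply` method are used only as a stand-in for the honest, subsampled trees GRF actually grows, and `forest_weights` is a hypothetical helper name:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def forest_weights(forest, X_train, x_test):
    """Compute alpha_i(x) of Eqs. (3)-(4) from leaf co-membership."""
    leaves_train = forest.apply(X_train)                  # (n, B): leaf id of each training sample in each tree
    leaves_test = forest.apply(x_test.reshape(1, -1))[0]  # (B,):  leaf id of x in each tree
    same_leaf = (leaves_train == leaves_test)             # indicator 1({X_i in L_b(x)})
    alpha_b = same_leaf / same_leaf.sum(axis=0)           # Eq. (4): each tree's weights sum to 1
    return alpha_b.mean(axis=1)                           # Eq. (3): average over the B trees

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 3)), rng.normal(size=200)
forest = RandomForestRegressor(n_estimators=50, bootstrap=False).fit(X, y)
alpha = forest_weights(forest, X, X[0])
assert np.isclose(alpha.sum(), 1.0)                       # weights sum to 1
```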
Case of Regression
Take the regression problem as an example and show that the random forest is a special case of GRF.
For the regression problem, the quantity we care about is $\mu(x)=\mathbb{E}[Y_i\mid X_i=x]$ (here $\mu(x)$ plays the role of $\theta(x)$), and the corresponding scoring function is $\Psi_{\mu(x)}(O_i)=Y_i-\mu(x)$.
Meanwhile, we know that once a random forest is built, the prediction for a test point $x$ is the average of $Y$ over the training samples in the leaf containing $x$, averaged over trees; formally:
$$\hat\mu(x)=\frac{1}{B}\cdot\sum_{b=1}^{B}\hat\mu_b(x),\qquad \hat\mu_b(x)=\frac{\sum_{\{i:X_i\in L_b(x)\}}Y_i}{|L_b(x)|} \tag{5}$$
Now we only need to show that, when the scoring function is $\Psi_{\mu(x)}(O_i)=Y_i-\mu(x)$, Eq. (5) holds if and only if Eq. (1) holds. The argument is as follows:
With this scoring function, Eq. (1) (in its forest-weighted empirical form, Eq. (2)) is equivalent to Eq. (6):
$$\sum_{i=1}^{n}\alpha_i(x)\cdot\big(Y_i-\hat\mu(x)\big)=0 \tag{6}$$
Since $\sum_{i=1}^{n}\alpha_i(x)=1$, Eq. (6) can be rearranged into Eq. (7):
$$\begin{aligned} \hat\mu(x) &=\sum_{i=1}^{n}\alpha_i(x)\cdot Y_i \\ &=\sum_{i=1}^{n}\frac{1}{B}\cdot\sum_{b=1}^{B}\alpha_{bi}(x)\cdot Y_i \\ &=\frac{1}{B}\cdot\sum_{b=1}^{B}\sum_{i=1}^{n}\frac{\mathbf{1}(\{X_i\in L_b(x)\})}{|L_b(x)|}\cdot Y_i \\ &=\frac{1}{B}\cdot\sum_{b=1}^{B}\hat\mu_b(x) \end{aligned} \tag{7}$$
From this we see that when Eq. (1) holds, Eq. (5) follows, and therefore the random forest is a special case of GRF.
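A quick numerical check of this equivalence (my own sketch): with scikit-learn's `RandomForestRegressor` and `bootstrap=False`, each tree's leaf value is the full-sample leaf mean, so the $\alpha$-weighted mean of $Y$ should match the forest's prediction up to floating-point error.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = X[:, 0] + 0.1 * rng.normal(size=300)

forest = RandomForestRegressor(n_estimators=100, bootstrap=False).fit(X, y)

x0 = X[5]
leaves = forest.apply(X)                                  # (n, B) leaf ids of training points
leaves_x0 = forest.apply(x0.reshape(1, -1))[0]            # (B,)  leaf ids of x0
same_leaf = (leaves == leaves_x0)
alpha = (same_leaf / same_leaf.sum(axis=0)).mean(axis=1)  # Eqs. (3)-(4)

mu_weighted = alpha @ y                                   # solution of Eq. (6)
mu_forest = forest.predict(x0.reshape(1, -1))[0]          # Eq. (5)
assert np.isclose(mu_weighted, mu_forest)
```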
split criterion
The original idea is to minimize the error between the child nodes' estimates and the true value, i.e., to minimize $err(C_1,C_2)$:
$$err(C_1,C_2)=\sum_{j=1}^{2}\mathbb{P}[X\in C_j\mid X\in P]\cdot\mathbb{E}\big[(\hat\theta_{C_j}(\mathcal{J})-\theta(x))^2\mid X\in C_j\big] \tag{8}$$
However, since the true value $\theta(x)$ is unknown, after some derivation the task of minimizing $err(C_1,C_2)$ is turned into maximizing $\Delta(C_1,C_2)$:
$$\Delta(C_1,C_2)=\frac{n_{c_1}\cdot n_{c_2}}{n_p^{2}}\cdot\big(\hat\theta_{C_1}(\mathcal{J})-\hat\theta_{C_2}(\mathcal{J})\big)^2 \tag{9}$$
After this transformation, one can see that maximizing Eq. (9) amounts to maximizing the heterogeneity between the two child nodes.
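For intuition, here is a minimal sketch (my own; regression case, where $\hat\theta_{C_j}$ is simply the child mean of $Y$) of evaluating Eq. (9) for a candidate split. In general, re-solving Eq. (2) inside each candidate child is what makes this exact criterion expensive:

```python
import numpy as np

def delta_exact(y_parent, left_mask):
    """Eq. (9) for regression: theta_hat_{C_j} is the child mean of Y."""
    y_left, y_right = y_parent[left_mask], y_parent[~left_mask]
    n_l, n_r, n_p = len(y_left), len(y_right), len(y_parent)
    return (n_l * n_r) / n_p**2 * (y_left.mean() - y_right.mean()) ** 2

rng = np.random.default_rng(0)
x = rng.uniform(size=100)
y = (x > 0.5).astype(float) + 0.1 * rng.normal(size=100)
print(delta_exact(y, x <= 0.5))   # large: split separates heterogeneous means
print(delta_exact(y, x <= 0.05))  # small: unbalanced split, little heterogeneity
```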
We now know the node-splitting criterion. In practice, however, computing $\hat\theta_{C_j}(\mathcal{J})$ for every candidate split is computationally expensive, so the authors propose a gradient-based approximation:
gradient tree algorithm
First, Proposition 1 states that $\hat\theta_{C}$ admits the following approximation $\tilde\theta_{C}$:
$$\tilde\theta_{C}=\hat\theta_p-\frac{1}{|\{i:X_i\in C\}|}\cdot\sum_{\{i:X_i\in C\}}\xi^{T}\cdot A_p^{-1}\Psi_{\hat\theta_p,\hat\nu_p}(O_i) \tag{10}$$
where $\hat\theta_p$ is the estimate of $\theta$ at the parent node $P$, obtained from Eq. (1) or Eq. (2). As for $\xi$, the paper describes it as the vector that picks out the $\theta$-coordinate from the $(\theta,\nu)$ vector, although other papers often simply omit it. $A_p$ is the expected gradient of $\Psi_{\hat\theta_p,\hat\nu_p}(O_i)$, computed as:
$$A_p=\nabla\mathbb{E}\big[\Psi_{\hat\theta_p,\hat\nu_p}(O_i)\mid X_i\in P\big]=\frac{1}{|\{i:X_i\in P\}|}\cdot\sum_{\{i:X_i\in P\}}\nabla\Psi_{\hat\theta_p,\hat\nu_p}(O_i) \tag{11}$$
However, I do not fully understand with respect to which variable the gradient is taken here (presumably with respect to the parameters $(\theta,\nu)$, evaluated at $(\hat\theta_p,\hat\nu_p)$).
Once $\hat\theta_{C}$ has the approximation $\tilde\theta_{C}$, it follows that $\Delta(C_1,C_2)$ also has a corresponding approximation $\tilde\Delta(C_1,C_2)$ (the derivation of this step is omitted for now):
$$\tilde\Delta(C_1,C_2)=\sum_{j=1}^{2}\frac{1}{|\{i:X_i\in C_j\}|}\cdot\Big(\sum_{\{i:X_i\in C_j\}}\rho_i\Big)^2 \tag{12}$$
where $\rho_i=-\xi^{T}\cdot A_p^{-1}\cdot\Psi_{\hat\theta_p,\hat\nu_p}(O_i)$ is a pseudo-outcome that measures the influence of sample $i$ on the parent estimate $\hat\theta_p$.
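To see what these quantities look like in a simple case, here is the regression example worked out from the definitions above (my own derivation, not spelled out in these notes):

$$\Psi_{\mu}(O_i)=Y_i-\mu,\qquad \nabla_{\mu}\Psi_{\mu}(O_i)=-1\ \Rightarrow\ A_p=-1,\ \ \xi=1,$$
$$\rho_i=-\xi^{T}A_p^{-1}\Psi_{\hat\mu_p}(O_i)=Y_i-\bar Y_P,$$
$$\tilde\Delta(C_1,C_2)=\sum_{j=1}^{2}\frac{1}{|\{i:X_i\in C_j\}|}\Big(\sum_{\{i:X_i\in C_j\}}(Y_i-\bar Y_P)\Big)^2,$$

which is equivalent to the standard CART variance-reduction split.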
To summarize, node splitting can be organized into the following two steps:
1. labeling step
In this step, first compute $\hat\theta_p$ and $A_p$, then compute $\rho_i$; note that at each split the $\rho_i$ only need to be computed once (because the parent node is already fixed), no matter how many candidate splits are evaluated.
2. regression step
Find the child nodes that maximize $\tilde\Delta(C_1,C_2)$. This step can be carried out by a standard CART regression split on the pseudo-outcomes $\rho_i$; see the sketch below.
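Putting the two steps together, here is a minimal sketch (my own; regression case with a single split feature, so $\rho_i=Y_i-\bar Y_P$ as derived above) of the gradient-based split scan that maximizes Eq. (12):

```python
import numpy as np

def best_split_gradient(x, y):
    """Labeling step: compute rho_i; regression step: scan thresholds maximizing Eq. (12)."""
    rho = y - y.mean()                  # rho_i for the regression score (A_p = -1, xi = 1)
    order = np.argsort(x)
    x_sorted, rho_sorted = x[order], rho[order]
    best_delta, best_threshold = -np.inf, None
    for k in range(1, len(x)):          # candidate split between positions k-1 and k
        left, right = rho_sorted[:k], rho_sorted[k:]
        delta = left.sum() ** 2 / len(left) + right.sum() ** 2 / len(right)   # Eq. (12)
        if delta > best_delta:
            best_delta, best_threshold = delta, (x_sorted[k - 1] + x_sorted[k]) / 2
    return best_threshold, best_delta

rng = np.random.default_rng(2)
x = rng.uniform(size=200)
y = np.where(x > 0.6, 2.0, 0.0) + 0.1 * rng.normal(size=200)
print(best_split_gradient(x, y))        # threshold should be close to 0.6
```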
GRF for CATE
Next, let us look at how GRF is applied to CATE estimation.
In this application, the authors base $\Psi(\cdot)$ on the partially linear model. The partially linear model assumes the data follow the structure:
$$Y=\theta(x)\cdot T+g(x)+\epsilon,\qquad T=f(x)+\eta \tag{13}$$
The "partial linearity" is mainly reflected in the structure of $Y$: it is linear in $T$ but may depend on $x$ in an arbitrary way.
In the CATE estimation problem, $\theta(x)$ is the treatment effect at $x$, formalized as $\theta(x)=\mathbb{E}[Y(T=1)-Y(T=0)\mid X=x]$.
Based on the partially linear model, the authors construct the scoring function $\Psi_{\theta(x),\nu(x)}(O_i)=Y_i-\theta(x)\cdot T_i-\nu(x)$; it can be understood as seeking a pair $(\hat\theta(x),\hat\nu(x))$ that brings $Y_i$ and $\theta(x)\cdot T_i+\nu(x)$ as close as possible (essentially a fitting problem).
Under this setting, the relevant quantities are:
$$\hat\theta(x)=\xi^{T}\cdot\frac{\mathrm{Cov}(T_i,Y_i\mid X_i=x)}{\mathrm{Var}(T_i\mid X_i=x)} \tag{14}$$
$$A_p=\frac{1}{|\{i:X_i\in P\}|}\cdot\sum_{\{i:X_i\in P\}}(T_i-\bar T_p)^{\otimes 2} \tag{15}$$
$$\rho_i=\xi^{T}\cdot A_p^{-1}\cdot\big(Y_i-\bar Y_p-(T_i-\bar T_p)\cdot\hat\theta_p\big) \tag{16}$$
Of these derivations, I currently only understand where Eq. (14) comes from:
Consider $Y=\theta(x)\cdot T+g(x)$. The optimal $\theta(x)$ can be viewed as the slope $a$ of a one-dimensional linear fit $y=ax+b$, and this slope can be expressed in terms of the covariance and the variance, $a=\mathrm{Cov}(T,Y)/\mathrm{Var}(T)$ [reference].
Note that the means, variance, and covariance here are all weighted, with weights $\alpha_i$.
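A minimal numeric sketch of Eq. (14) (my own; scalar treatment, so $\xi^{T}A_p^{-1}$ reduces to division by the treatment variance), using precomputed forest weights `alpha` such as those from the earlier weight sketch:

```python
import numpy as np

def cate_estimate(alpha, t, y):
    """Eq. (14): alpha-weighted Cov(T, Y | X = x) / Var(T | X = x)."""
    t_bar, y_bar = alpha @ t, alpha @ y            # weighted means (weights sum to 1)
    cov_ty = alpha @ ((t - t_bar) * (y - y_bar))
    var_t = alpha @ ((t - t_bar) ** 2)
    return cov_ty / var_t

# toy check: a constant treatment effect of 2.0 should be recovered
rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
t = rng.binomial(1, 0.5, size=n).astype(float)
y = 2.0 * t + x + 0.1 * rng.normal(size=n)
alpha = np.full(n, 1.0 / n)        # uniform weights as a stand-in for forest weights alpha_i(x)
print(cate_estimate(alpha, t, y))  # close to 2.0
```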
CausalForestDML
As the name suggests, CausalForestDML is a fusion of Causal Forest and DML. The core idea of DML for CATE estimation is based on the following equation:
$$Y-\mathbb{E}[Y\mid X]=\theta(x)\cdot\big(T-\mathbb{E}[T\mid X]\big),\quad\text{i.e.}\quad \tilde Y=\theta(x)\cdot\tilde T \tag{17}$$
In other words, the CATE estimation problem is converted into regressing the residual of $Y$ on the residual of $T$, and the regression coefficient is the CATE. (Computing the residuals is in fact an orthogonalization step.)
Based on the DML idea, CausalForestDML uses the following scoring function: $\Psi_{\theta(x),\nu(x)}(O_i)=Y_i-\mathbb{E}[Y_i\mid X]-\theta(x)\cdot(T_i-\mathbb{E}[T_i\mid X])-\nu(x)$.
The corresponding optimal $\theta(x)$ becomes $\hat\theta(x)=\xi^{T}\cdot\frac{\mathrm{Cov}(Y_i-\mathbb{E}[Y_i\mid X_i],\,T_i-\mathbb{E}[T_i\mid X_i]\mid X_i=x)}{\mathrm{Var}(T_i-\mathbb{E}[T_i\mid X_i]\mid X_i=x)}$.
Note: the original paper calls this method centered GRF, and its Table 1 shows that GRF with centering performs better than GRF without centering.
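To make the orthogonalization concrete, here is a minimal residualization sketch (my own; cross-fitted nuisance models via scikit-learn's `cross_val_predict` stand in for whatever first-stage learners are actually used, and the final stage is shown as a single residual-on-residual regression coefficient rather than the forest that CausalForestDML would fit on these residuals):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(4)
n = 2000
X = rng.normal(size=(n, 5))
T = X[:, 0] + rng.normal(size=n)               # treatment depends on X (confounding)
Y = 1.5 * T + np.sin(X[:, 1]) + rng.normal(size=n)

# cross-fitted nuisance estimates of E[Y|X] and E[T|X]
y_hat = cross_val_predict(RandomForestRegressor(), X, Y, cv=5)
t_hat = cross_val_predict(RandomForestRegressor(), X, T, cv=5)
Y_res, T_res = Y - y_hat, T - t_hat            # centered ("orthogonalized") outcomes

# final stage: a constant-effect residual-on-residual regression;
# CausalForestDML would instead feed these residuals into a GRF to get theta(x)
theta_hat = (T_res @ Y_res) / (T_res @ T_res)
print(theta_hat)                               # close to 1.5
```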
GRF vs Causal Forest
Both papers are by Susan Athey's group. The biggest differences between GRF and the earlier Causal Forest are:
- Causal Forest splits using the exact loss criterion (i.e., Eq. (9)) rather than the gradient-based loss criterion (i.e., Eq. (12));
- Causal Forest does not use the weighted average when computing the treatment effect;