Multiple Linear Regression
The content of this article is based on teacher Qingfeng's lectures.
Understanding linearity
Suppose $x$ is the independent variable and $y$ the dependent variable, and they satisfy the linear relation $y_i=\beta_0+\beta_1 x_i+\mu_i$.
The linearity assumption does not require the model to be strictly linear in the original variables: any model that can be turned into a linear one by a change of variables qualifies, for example:
$$y_i=\beta_0+\beta_1\ln x_i+\mu_i$$
$$\ln y_i=\beta_0+\beta_1\ln x_i+\mu_i$$
$$y_i=\beta_0+\beta_1 x_i+\mu_i$$
$$y_i=\beta_0+\beta_1 x_{1i}+\beta_2 x_{2i}+\delta x_{1i}x_{2i}+\mu_i$$
Models of this kind require preprocessing the data (creating the transformed variables) before fitting, as in the sketch below.
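As a minimal illustration (the variable pairing is hypothetical, chosen only for demonstration), the transformed specifications above can be fitted in Stata by generating the new variables first:

```stata
* Hedged sketch: a log-log fit on Stata's built-in auto data,
* using price as y and weight as x purely for illustration.
sysuse auto, clear
gen ln_price  = ln(price)
gen ln_weight = ln(weight)
regress ln_price ln_weight    // linear in the transformed variables
```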
Exploring endogeneity
A motivating example
- Suppose $x$ is a product's quality score (between 1 and 10) and $y$ is the product's sales. A univariate linear regression gives $\hat y=3.4+2.3x$.
  - 3.4: when the score is 0, average sales are 3.4 units.
  - 2.3: each additional point of score raises average sales by 2.3 units.
- Now suppose there are two regressors: $x_1$ is the quality score and $x_2$ is the product's price. A multiple linear regression gives $\hat y=5.3+0.19x_1-1.74x_2$.
  - 5.3: when the score and the price are both 0, average sales are 5.3 units (not meaningful, so no interpretation is needed).
  - 0.19: holding the other variable fixed, each additional point of score raises average sales by 0.19 units.
  - -1.74: holding the other variable fixed, each additional unit of price lowers average sales by 1.74 units.
- Notice that introducing the new regressor, price, changes the coefficient on the score dramatically!
- Reason: endogeneity caused by an omitted variable (the simulation sketch below reproduces the effect).
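Below is a small simulated sketch (all numbers are hypothetical, not the article's data) that reproduces the omitted-variable pattern: leaving out price, which is correlated with the quality score, shifts the score's coefficient.

```stata
* Simulate quality (x1), a price (x2) correlated with quality, and sales (y).
clear
set seed 42
set obs 1000
gen x1 = 1 + 9*runiform()        // quality score in (1,10)
gen x2 = 0.5*x1 + rnormal()      // price, correlated with quality
gen y  = 5 + 0.2*x1 - 1.7*x2 + rnormal()
regress y x1      // price omitted: the x1 coefficient absorbs its effect
regress y x1 x2   // price included: both coefficients near their true values
```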
Endogeneity
Suppose our model is
$$y=\beta_0+\beta_1x_1+\beta_2x_2+\dots+\beta_kx_k+\mu$$
where $\mu$ is an unobservable disturbance term assumed to satisfy certain conditions. If the error term $\mu$ is uncorrelated with all of the regressors $x$, the model is said to be exogenous; if it is correlated with any of them, the model suffers from endogeneity, which makes the coefficient estimates inaccurate: they are neither unbiased nor consistent. In the univariate regression of the example above, the error term contains price, and price is correlated with the quality score, which is what produces the endogeneity.
Core explanatory variables and control variables
- Ruling out endogeneity requires that every explanatory variable be uncorrelated with the disturbance term. This assumption is usually too strong, because there are typically many explanatory variables.
- To weaken the condition, split the explanatory variables into core explanatory variables and control variables, and require only that the core explanatory variables be uncorrelated with $\mu$.
- Core explanatory variables: the variables we are most interested in, whose coefficients we especially want to estimate consistently.
- Control variables: variables of little interest in themselves, included only to "control for" omitted factors that affect the dependent variable. In other words, put every variable correlated with the core explanatory variables into the regression.
Interpretation of regression coefficients
The estimated regression equation is
$$\hat y=\hat\beta_0+\hat\beta_1x_1+\hat\beta_2x_2+\dots+\hat\beta_kx_k$$
- $\hat\beta_0$: the intercept is generally not interpreted numerically, because the regressors are never all 0 at once.
- $\hat\beta_m$ is the change in $y$ caused by a one-unit increase in $x_m$, holding the other regressors fixed, i.e. $\hat\beta_m=\frac{\partial y}{\partial x_m}$. For this reason the coefficients of a multiple linear regression are also called partial regression coefficients.
When to take logarithms?

- A logged specification measures the elasticity of the dependent variable with respect to the explanatory variable, i.e. changes in percentages rather than in units.
- There is currently no fixed rule for when to take logarithms, but there are some rules of thumb:
  - variables measured in money, such as prices, sales, and wages, can be logged;
  - variables measured in years, such as years of education or work experience, are usually not logged;
  - proportion variables, such as the unemployment rate or a participation rate, can go either way;
  - the variable must be non-negative, and if it can equal 0, use $\ln(1+y)$ instead of $\ln y$.
- Advantages of taking logarithms:
  - it weakens heteroscedasticity in the data;
  - a variable that does not follow a normal distribution may be approximately normal after logging;
  - it can match the functional form required for the economic model to make sense.
Interpreting the coefficients of the four model types

- Level-level (univariate linear) regression: $y=a+bx+\mu$; when $x$ increases by 1 unit, $y$ changes by $b$ units on average;
- Double-log model: $\ln y=a+b\ln x+\mu$; when $x$ increases by 1%, $y$ changes by $b$% on average;
- Semilogarithmic (lin-log) model: $y=a+b\ln x+\mu$; when $x$ increases by 1%, $y$ changes by $b/100$ units on average;
- Semilogarithmic (log-lin) model: $\ln y=a+bx+\mu$; when $x$ increases by 1 unit, $y$ changes by $(100b)$% on average. (A sketch of all four follows.)
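A hedged Stata sketch of all four specifications on the built-in auto data (the choice of price and mpg is illustrative, not a claim about the correct model):

```stata
sysuse auto, clear
gen ln_price = ln(price)
gen ln_mpg   = ln(mpg)
regress price mpg          // level-level: +1 unit of mpg -> b units of price
regress ln_price ln_mpg    // log-log:     +1% of mpg     -> b% of price
regress price ln_mpg       // lin-log:     +1% of mpg     -> b/100 units of price
regress ln_price mpg       // log-lin:     +1 unit of mpg -> (100b)% of price
```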
Dummy variables
- Regression handles quantitative data, so how do we deal with qualitative data? By encoding it as dummy variables.
- Stata handles dummy variables very conveniently, so it is a good choice of software for this analysis.
A single binary category
Suppose we want to study the effect of gender on wages:
$$y=\beta_0+\delta_0 Female+\beta_1 x_1+\beta_2x_2+\dots+\beta_k x_k+\mu$$
- $Female_i=1$ means the $i$-th observation is a woman;
- $Female_i=0$ means the $i$-th observation is a man;
- core explanatory variable: $Female$;
- control variables: the $x_m$ (variables correlated with $Female$);
- $E(y\mid Female=1,\ \text{other regressors fixed})=\delta_0\times1+C$;
- $E(y\mid Female=0,\ \text{other regressors fixed})=\delta_0\times0+C$;
- the difference between the two is $\delta_0$ (meaningful when $\delta_0$ is significantly different from 0);
- $\delta_0$ can therefore be interpreted as the difference between the average wage of women and that of men, given the other regressors (men are the reference group).
Multiple categories

- For a categorical variable with more than two levels, one level serves as the reference (control) group and dummies are created for the remaining levels; this avoids perfect multicollinearity (the dummy variable trap), so the number of dummies is generally the number of categories minus 1. Stata's factor-variable syntax, sketched below, does this automatically.
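A hedged sketch on the built-in auto data, with `foreign` as a binary category and `rep78` as a multi-level one; Stata builds the dummies and drops one reference level automatically:

```stata
sysuse auto, clear
regress price i.foreign weight    // one dummy; the omitted level is the base group
regress price i.rep78 weight      // 5 levels -> 4 dummies; level 1 is the base
regress price ib3.rep78 weight    // pick level 3 as the base group instead
```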
When the goodness of fit is low
- Regressions divide into explanatory regressions and predictive regressions:
  - predictive regressions care more about $R^2$;
  - explanatory regressions care more about the overall significance of the model and the statistical and economic significance of the individual regressors.
- The model form can be adjusted, for example by taking logs or adding squared terms before regressing (see the sketch below).
- There may also be outliers in the data, or the data may be distributed very unevenly.
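A minimal sketch of such an adjustment (illustrative only): `c.weight##c.weight` is Stata's factor-variable way of adding a squared term in one step.

```stata
sysuse auto, clear
regress price weight                 // plain linear fit
regress price c.weight##c.weight     // adds weight^2; compare e(r2) and e(r2_a)
```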
Goodness of fit and adjusted goodness of fit
- The more regressors we introduce, the higher the goodness of fit, which is clearly not what we want. We therefore prefer the adjusted goodness of fit: if a newly introduced regressor reduces the residual sum of squares (SSE) only slightly, the adjusted goodness of fit falls. Both measures are reported after estimation, as sketched below.

$$R^2=1-\frac{SSE}{SST}\qquad R^2_{adjusted}=1-\frac{SSE/(n-k-1)}{SST/(n-1)}$$
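After `regress`, both measures are stored in the estimation results; a quick sketch:

```stata
sysuse auto, clear
quietly regress price mpg weight length
display "R2 = " e(r2) "    adjusted R2 = " e(r2_a)
```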
Standardized regression coefficients
- To study more accurately which factors matter most for the dependent variable (removing the effect of units of measurement), we can use standardized regression coefficients.
- Standardizing the data means subtracting each variable's mean and dividing by its standard deviation; regressing the standardized variables yields the standardized coefficients (see the sketch below).
- The larger the absolute value of a standardized coefficient, the larger that variable's influence on the dependent variable (consider only significant coefficients).
- Standardizing the data does not change the t-statistics of the coefficients, so it does not affect their significance.
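Stata reports standardized (beta) coefficients via the `beta` option of `regress`, which is equivalent to z-scoring every variable before regressing; a hedged sketch:

```stata
sysuse auto, clear
regress price mpg weight, beta    // adds a "Beta" column of unit-free coefficients
* Equivalent by hand:
egen z_price = std(price)
egen z_mpg   = std(mpg)
egen z_wt    = std(weight)
regress z_price z_mpg z_wt        // coefficients match the Beta column
```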
Heteroscedasticity
- In the regression analysis so far we have assumed that the disturbance term $\mu_i$ is spherical: it satisfies homoscedasticity ($E(\mu_i^2)=\sigma^2$) and no autocorrelation ($E(\mu_i\mu_j)=0$ for $i\neq j$).
- Cross-sectional data are prone to heteroscedasticity; time-series data are prone to autocorrelation.
Consequences of heteroscedasticity
- The OLS coefficient estimates remain unbiased and consistent.
- Hypothesis tests become unusable (the usual standard errors are wrong, so the constructed test statistics are invalid).
- The OLS estimator is no longer the best linear unbiased estimator (BLUE).
Testing for heteroscedasticity

- Plot the residuals against the fitted values (or against a regressor): an evenly spread scatter, with no fan shape, suggests no heteroscedasticity.
- The BP test and the White test. The White test additionally includes squared and cross terms, so the BP test can be regarded as a special case of the White test. Stata command for the BP test: `estat hettest, rhs iid`; for the White test: `estat imtest, white`. Both appear in the sketch below.
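A hedged sketch combining the visual check and both tests on the built-in auto data (small p-values reject homoscedasticity):

```stata
sysuse auto, clear
regress price mpg weight
rvfplot                    // residual-vs-fitted scatter; look for a fan shape
estat hettest, rhs iid     // Breusch-Pagan test using all regressors
estat imtest, white        // White test (adds squares and cross terms)
```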
Dealing with heteroscedasticity

- OLS with robust standard errors (the most widely used fix)
  - Keep the OLS regression but use heteroscedasticity-robust standard errors. This is the simplest and currently the most common method. As long as the sample is large, even under heteroscedasticity, all parameter estimates and hypothesis tests can proceed as usual once robust standard errors are used.
  - Stata command: `regress y x_1 x_2 … x_k, robust`
- Generalized least squares (GLS)
  - Principle: observations whose disturbances have larger variance carry less information, so give more weight to the more informative observations (those with smaller variance).
  - Drawback: the true covariance matrix of the disturbances is unknown and can only be estimated from the sample, so the result is not robust and involves some chance. A feasible-GLS sketch follows.
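The textbook feasible-GLS recipe can be hand-rolled in Stata: estimate the skedastic function from the log squared residuals, then rerun the regression with inverse-variance weights. A hedged sketch (one common variance model among several):

```stata
sysuse auto, clear
regress price mpg weight
predict e, residuals
gen lne2 = ln(e^2)                       // model the log variance of the disturbance
quietly regress lne2 mpg weight
predict lnvar, xb
gen w = 1/exp(lnvar)                     // weight = inverse of estimated variance
regress price mpg weight [aweight=w]     // feasible GLS as weighted least squares
```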
Multicollinearity
- If the data matrix $X$ is not of full column rank, i.e. some explanatory variable can be written as a linear combination of the others, the model suffers from strict (perfect) multicollinearity.
- If regressing the $i$-th explanatory variable $x_i$ on the remaining explanatory variables $\{x_1,\dots,x_{i-1},x_{i+1},\dots,x_k\}$ yields a high coefficient of determination, the model suffers from approximate multicollinearity.
Symptoms
- The overall $R^2$ of the equation is high and the F test is significant, yet the t tests of individual coefficients are insignificant, or the estimated coefficients are implausible, sometimes even with signs opposite to theoretical expectations.
- Adding or removing an explanatory variable changes the estimated coefficients substantially.
Testing for multicollinearity
- Variance inflation factor (VIF): suppose there are $k$ regressors; the VIF of the $m$-th regressor is
$$VIF_m=\frac{1}{1-R_{1\sim k/m}^2}$$
where $R_{1\sim k/m}^2$ is the goodness of fit obtained by regressing the $m$-th regressor, taken as the dependent variable, on the remaining $k-1$ regressors.
- The larger $VIF_m$ is, the stronger the correlation between the $m$-th regressor and the others.
- If $VIF=\max\{VIF_1,\dots,VIF_k\}>10$, the regression equation is considered to suffer from serious multicollinearity. Stata command: `estat vif` (sketch below).
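`estat vif` runs after `regress` and reports one VIF per regressor; a sketch with deliberately overlapping size variables:

```stata
sysuse auto, clear
regress price weight length displacement   // these size measures move together
estat vif                                  // large values flag collinear regressors
```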
Handling multicollinearity
- If the goal is only prediction, i.e. we do not care about the individual regression coefficients, multicollinearity does no harm (assuming the overall equation is significant). This is because the main consequence of multicollinearity is that the contribution of any single variable is estimated imprecisely, while the joint effect of all the variables can still be estimated fairly accurately.
- If you care about specific regression coefficients but multicollinearity does not affect the significance of the variables of interest, you can ignore it: even with inflated variances those coefficients are significant, and without multicollinearity they would only be more significant.
- If multicollinearity does affect the significance of the variables of interest (the core explanatory variables), you can enlarge the sample, drop the variables causing the severe collinearity (do not drop variables casually, since that can introduce endogeneity), or modify the model specification.
Solving multicollinearity: stepwise regression
- Forward stepwise regression: introduce the regressors into the model one at a time, testing each as it enters and adding it to the regression model only when it is significant. Drawback: as later regressors enter, a previously significant regressor may become insignificant, yet it is not removed from the equation in time.
- Backward stepwise regression: the opposite of forward stepwise regression. Start with all variables in the model, then try removing regressors one at a time, checking whether the model's explanatory power for the dependent variable changes significantly, and eliminating the regressor with the least explanatory power; iterate until no regressor meets the elimination criterion. Drawback: it begins by putting every variable into the regression equation, so the computation is relatively heavy. Both directions are sketched below.
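Stata's `stepwise` prefix implements both directions: `pe()` sets the significance threshold for a variable to enter (forward), `pr()` the threshold for removal (backward). A sketch, not an endorsement of stepwise selection:

```stata
sysuse auto, clear
stepwise, pe(.05): regress price mpg weight length turn    // forward selection
stepwise, pr(.05): regress price mpg weight length turn    // backward elimination
```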