Multiple Linear Regression
The content of this article is based on teacher Qingfeng's lectures.
Understanding of linearity
Assume $x$ is an independent variable and $y$ is a dependent variable satisfying the linear relation: $y_i=\beta_0+\beta_1 x_i+\mu_i$
The linearity assumption does not require the initial model to take the strict linear form above; the independent and dependent variables can often be brought into a linear relationship through variable substitution, for example:
$$
y_i=\beta_0+\beta_1\ln x_i+\mu_i\\
\ln y_i=\beta_0+\beta_1\ln x_i+\mu_i\\
y_i=\beta_0+\beta_1 x_i+\mu_i\\
y_i=\beta_0+\beta_1 x_{1i}+\beta_2 x_{2i}+\delta x_{1i}x_{2i}+\mu_i
$$
Such relationships require preprocessing the data before modeling.
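As a minimal sketch of such preprocessing in Stata (assuming a dataset with variables `y` and `x` is already in memory; all names here are hypothetical), the transformed variables are generated first and an ordinary linear regression is then fit on them:

```stata
* Generate the transformed variables, then run an ordinary linear regression.
gen lnx = ln(x)      // for the model y = b0 + b1*ln(x) + u
gen lny = ln(y)      // for the model ln(y) = b0 + b1*ln(x) + u
regress y lnx        // log on the right-hand side only
regress lny lnx      // log on both sides
```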
Explore endogeneity
A motivating example
- Suppose $x$ is a product quality score (between 1 and 10) and $y$ is the product's sales volume. Building a univariate linear regression model gives $\hat y=3.4+2.3x$
  - 3.4: when the score is 0, the average sales volume of this product is 3.4
  - 2.3: each additional unit of score raises the average sales volume of this product by 2.3
- Now suppose there are two independent variables: $x_1$ is the quality score and $x_2$ is the product's price. Building a multiple linear regression model gives $\hat y=5.3+0.19x_1-1.74x_2$
  - 5.3: when the score is 0 and the price is 0, the average sales volume of this product is 5.3 (this has no practical meaning and needs no analysis)
  - 0.19: with the other variable held fixed, each additional unit of score raises the average sales volume by 0.19
  - -1.74: with the other variable held fixed, each additional unit of price lowers the average sales volume by 1.74
- Notice that introducing the new independent variable, price, changes the regression coefficient on quality dramatically!
- Reason: endogeneity caused by an omitted variable. A short simulation of this effect follows below.
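A small simulation makes the omitted-variable story concrete. The sketch below (hypothetical numbers, loosely mimicking the example) generates a price that is correlated with the quality score and compares the short and long regressions:

```stata
* Simulated omitted-variable bias: price is correlated with score,
* so leaving it out distorts the coefficient on score.
clear
set seed 123
set obs 1000
gen score = runiform(1, 10)                        // quality score in [1, 10]
gen price = 2 + 0.5*score + rnormal()              // price depends on score
gen sales = 5 + 0.2*score - 1.7*price + rnormal()
regress sales score          // short regression: price hides in the error term
regress sales score price    // long regression: coefficient on score shifts sharply
```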
Endogeneity
Suppose our model is:
$$y=\beta_0+\beta_1x_1+\beta_2x_2+\dots+\beta_kx_k+\mu$$
where $\mu$ is an unobservable disturbance term assumed to satisfy certain conditions. If the error term $\mu$ is uncorrelated with all of the independent variables $x$, the regression model is said to be exogenous; if it is correlated with any of them, endogeneity is present. Endogeneity makes the estimates of the regression coefficients inaccurate: they are neither unbiased nor consistent. In the univariate regression of the example above, the error term contains price, and price is correlated with the quality score, which is what causes the endogeneity.
Core explanatory variables and control variables
- Exogeneity requires that every explanatory variable be uncorrelated with the disturbance term. This assumption is usually too strong, because a model typically contains many explanatory variables.
- To weaken the condition, the explanatory variables can be divided into core explanatory variables and control variables; only the core explanatory variables need to be uncorrelated with $\mu$.
- Core explanatory variables: the variables we are most interested in, whose coefficients we particularly want to estimate consistently.
- Control variables: variables we are not especially interested in themselves; they are included only to "control for" omitted factors that affect the explained variable. In other words, put every variable correlated with the core explanatory variables into the regression.
Interpretation of regression coefficients
The estimated regression equation:
$$\hat y=\hat\beta_0+\hat\beta_1x_1+\hat\beta_2x_2+\dots+\hat\beta_kx_k$$
- $\hat\beta_0$: we generally do not interpret its numerical value, because the independent variables will never all be 0.
- $\hat\beta_m$ is the change in $y$ caused by a one-unit increase in $x_m$ while the other independent variables are held fixed, i.e. $\hat\beta_m=\frac{\partial y}{\partial x_m}$. For this reason the coefficients of a multiple linear regression model are also called partial regression coefficients.
When to take logarithms?
- Taking logarithms gives the elasticity of the explained variable with respect to the explanatory variable, i.e. changes in percentages rather than in values.
- At present there is no fixed rule for when to take logarithms, but there are some rules of thumb:
  - Variables related to market value, such as prices, sales, and wages, can be logged;
  - Variables measured in years, such as years of education or work experience, are usually not logged;
  - Proportion variables, such as the unemployment rate or the participation rate, can go either way;
  - The variable must be non-negative; if it contains 0, you can transform $y$ with $\ln(1+y)$, as in the sketch below.
- Advantages of taking logarithms:
  - It weakens heteroscedasticity in the data;
  - If a variable is not itself normally distributed, it may be approximately normal after the log transform;
  - It may be required by the functional form of the model, so that the economic model makes sense.
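A minimal sketch of the non-negativity rule above (variable names hypothetical): a strictly positive variable can be logged directly, while a variable containing zeros should use the $\ln(1+y)$ form:

```stata
gen lnwage  = ln(wage)        // wage is strictly positive: plain log
gen lnsales = ln(1 + sales)   // sales can be 0: ln(1+y) keeps those observations
```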
Interpreting the regression coefficients of four model types
- Univariate linear regression: $y=a+bx+\mu$; when $x$ increases by 1 unit, $y$ changes by $b$ units on average;
- Double logarithmic (log-log) model: $\ln y=a+b\ln x+\mu$; when $x$ increases by 1%, $y$ changes by $b\%$ on average;
- Semilogarithmic (level-log) model: $y=a+b\ln x+\mu$; when $x$ increases by 1%, $y$ changes by $b/100$ units on average;
- Semilogarithmic (log-level) model: $\ln y=a+bx+\mu$; when $x$ increases by 1 unit, $y$ changes by $(100b)\%$ on average.
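The four forms in this list map directly onto four regressions. A hedged sketch using hypothetical variables `wage` and `educ`:

```stata
gen lnwage = ln(wage)
gen lneduc = ln(educ)
regress wage   educ     // level-level: +1 unit of educ -> b units of wage
regress lnwage lneduc   // log-log:     +1% educ -> b% wage
regress wage   lneduc   // level-log:   +1% educ -> b/100 units of wage
regress lnwage educ     // log-level:   +1 unit of educ -> (100b)% wage
```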
Dummy variables
- Regression handles quantitative data, so how do we deal with qualitative data?
- Stata handles dummy variables very conveniently, so this software can be used for the analysis.
Single category
We want to study the impact of gender on wages:
$$y=\beta_0+\delta_0 Female+\beta_1 x_1+\beta_2x_2+\dots+\beta_k x_k+\mu$$
- $Female_i=1$ means the $i$-th sample is female;
- $Female_i=0$ means the $i$-th sample is male;
- Core explanatory variable: $Female$;
- Control variables: the $x_m$ (variables correlated with $Female$)
- $E(y\mid Female=1,\ \text{other covariates fixed})=\delta_0\times 1+C$
- $E(y\mid Female=0,\ \text{other covariates fixed})=\delta_0\times 0+C$
- $E(y\mid Female=1,\ \cdot)-E(y\mid Female=0,\ \cdot)=\delta_0$ (this is meaningful when $\delta_0$ is significantly different from 0)
- $\delta_0$ can therefore be interpreted as: given the other independent variables, the difference between the average wage of women and that of men (men are the control group).
Multiple categories
- For a categorical variable with multiple levels, one level serves as the control group and the rest enter as dummy variables. This avoids perfect multicollinearity, so the number of dummy variables is generally the number of categories minus 1, as in the sketch below.
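A sketch of two ways to build the dummies in Stata (variables hypothetical); with $k$ categories, only $k-1$ dummies enter the regression and the omitted one is the control group:

```stata
* Option 1: factor-variable notation; Stata omits the base category itself.
regress wage i.industry educ exper
* Option 2: build the dummies by hand with tabulate, then leave one out.
tabulate industry, generate(ind_)            // creates ind_1, ind_2, ...
regress wage ind_2 ind_3 ind_4 educ exper    // ind_1 is the control group
```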
What if the goodness of fit is low?
- Regression divides into explanatory regression and predictive regression:
  - Predictive regression generally cares more about $R^2$;
  - Explanatory regression pays more attention to the overall significance of the model and to the statistical and economic significance of the individual independent variables.
- The model can be adjusted, for example by taking logarithms or squares of the data before regressing.
- There may be outliers in the data, or the data may be unevenly distributed across different quarters.
Goodness of fit and adjusted goodness of fit
- Introducing more independent variables always increases the goodness of fit, which is obviously not what we want. We therefore prefer the adjusted goodness of fit: if a newly introduced independent variable reduces the residual sum of squares SSE only slightly, the adjusted goodness of fit decreases.
$$R^2=1-\frac{SSE}{SST}\qquad R^2_{adjusted}=1-\frac{SSE/(n-k-1)}{SST/(n-1)}$$
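Both quantities are reported by `regress` and stored afterwards; a quick check using the `auto` dataset that ships with Stata:

```stata
sysuse auto, clear
regress price weight length foreign
display "R2 = " e(r2) "    adjusted R2 = " e(r2_a)
```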
Standardized regression coefficients
- To identify more accurately the important factors influencing the dependent variable (and remove the effect of units of measurement), we may consider using standardized regression coefficients.
- Standardizing the data means subtracting each variable's mean from the raw data and then dividing by the variable's standard deviation; running the regression on the standardized data yields the standardized regression coefficients.
- The larger the absolute value of a standardized regression coefficient, the greater the variable's influence on the dependent variable (consider only the significant regression coefficients).
- Standardizing the data does not affect the significance of the regression coefficients: the $t$-statistics are unchanged.
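In Stata the standardized coefficients do not require transforming the data by hand: the `beta` option of `regress` reports them alongside the usual estimates (illustrated on the built-in `auto` dataset):

```stata
sysuse auto, clear
regress price weight length, beta   // adds a "Beta" column of standardized coefficients
```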
Heteroscedasticity
- In the preceding regression analysis we assumed by default that the disturbance term $\mu_i$ is a spherical disturbance: it satisfies both homoscedasticity ($E(\mu_i^2)=\sigma^2$) and no autocorrelation ($E(\mu_i\mu_j)=0$ for $i\neq j$).
- Cross-sectional data are prone to heteroscedasticity; time-series data are prone to autocorrelation.
Consequences of heteroscedasticity
- The OLS estimates of the regression coefficients remain unbiased and consistent.
- Hypothesis tests become unusable (the constructed statistics are invalid).
- The OLS estimator is no longer the best linear unbiased estimator (BLUE).
Testing for heteroscedasticity
- You can plot the residuals against the fitted values (or against an independent variable); an even spread of points suggests no heteroscedasticity.
- The BP test and the White test. The latter also includes square and cross terms, so the BP test can be regarded as a special case of the White test. Stata command for the BP test: `estat hettest, rhs iid`; for the White test: `estat imtest, white`. Both are illustrated in the sketch below.
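Putting the visual check and both tests together (illustrated on the built-in `auto` dataset; the regression itself is arbitrary):

```stata
sysuse auto, clear
regress price weight length
rvfplot               // residuals vs fitted values: look for a fan shape
estat hettest         // Breusch-Pagan test
estat imtest, white   // White test, with squares and cross terms
```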
Dealing with heteroscedasticity
- Use OLS + robust standard errors (widely used)
  - Still run the OLS regression, but use robust standard errors. This is the simplest and currently the most common method. As long as the sample size is large, even in the presence of heteroscedasticity, all parameter estimation and hypothesis testing can proceed as usual once robust standard errors are used.
  - Stata command: `regress y x_1 x_2 … x_k, robust`
- Generalized least squares (GLS)
  - Principle: data with larger variance contain less information, so we give greater weight to observations carrying more information (i.e. those with smaller variance). A feasible-GLS sketch follows below.
  - Drawback: the true covariance matrix of the disturbance term is unknown and can only be estimated from the sample data, so the result is not robust and may be coincidental.
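One common textbook recipe for feasible GLS (a sketch, not the only approach) estimates the variance function from the logged squared residuals and then reweights; illustrated on the built-in `auto` dataset:

```stata
sysuse auto, clear
regress price weight length
predict ehat, residuals
gen lne2 = ln(ehat^2)               // log of squared residuals
regress lne2 weight length          // model the variance function
predict lnvar, xb
gen w = 1/exp(lnvar)                // weight = inverse of estimated variance
regress price weight length [aweight = w]   // weighted (FGLS) regression
```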
Multicollinearity
- If the data matrix $X$ does not have full column rank, i.e. some explanatory variable can be written as a linear combination of the other explanatory variables, there is "strict multicollinearity" (perfect multicollinearity).
- If regressing the $i$-th explanatory variable $x_i$ on the remaining explanatory variables $\{x_1,\dots,x_{i-1},x_{i+1},\dots,x_k\}$ yields a high coefficient of determination, there is approximate multicollinearity.
Symptoms
- Although the $R^2$ of the whole regression equation is high and the $F$ test is significant, the $t$ tests of individual coefficients are insignificant, or the estimated coefficients are unreasonable, even carrying signs contrary to theoretical expectations.
- Adding or removing an explanatory variable makes the estimated coefficients change greatly.
How to test multicollinearity
- Variance inflation factor (VIF): suppose there are $k$ independent variables; then the VIF of the $m$-th one is
$$VIF_m=\frac{1}{1-R_{1\sim k/m}^2}$$
where $R_{1\sim k/m}^2$ is the goodness of fit obtained by taking the $m$-th independent variable as the dependent variable and regressing it on the remaining $k-1$ independent variables.
- The larger $VIF_m$ is, the stronger the correlation between the $m$-th variable and the other variables.
- If $VIF=\max\{VIF_1,\dots,VIF_k\}>10$, the regression equation is considered to have serious multicollinearity. Stata command: `estat vif` (see the sketch below).
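The command runs directly after the regression; a quick illustration on the built-in `auto` dataset:

```stata
sysuse auto, clear
regress price weight length displacement
estat vif    // lists each variable's VIF; values above 10 signal trouble
```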
Handling multicollinearity
- If the goal is only prediction, i.e. we do not care about the specific regression coefficients, multicollinearity has no effect (assuming the whole equation is significant). This is because the main consequence of multicollinearity is that the contribution of each individual variable is estimated inaccurately, while the overall effect of all the variables together can still be estimated fairly accurately.
- If you do care about specific regression coefficients but multicollinearity does not affect the significance of the variables of interest, it can be ignored: even with variance inflation these coefficients remain significant, and without the multicollinearity they would only be more significant.
- If multicollinearity does affect the significance of the variables of interest (the core explanatory variables), you need to increase the sample size, drop the variables causing the severe collinearity (do not delete variables lightly, because doing so may introduce endogeneity), or modify the model specification.
Solving multicollinearity: stepwise regression
- Forward stepwise regression: introduce the independent variables into the model one at a time, testing each newly introduced variable and adding it to the regression model only when it is significant. Drawback: as further independent variables are introduced later, a previously significant independent variable may become insignificant, yet it is not removed from the regression equation in time.
- Backward stepwise regression: the opposite of forward stepwise regression. Start with all variables in the model, then try removing the independent variables one at a time, checking whether the share of the variation in the dependent variable explained by the whole model changes significantly, and eliminate the variable with the least explanatory power. The process iterates until no independent variable meets the elimination criterion. Drawback: all variables are introduced into the regression equation at the start, so the computation is relatively heavy. Both directions are sketched below.
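Stata implements both directions through the `stepwise` prefix; a hedged sketch with hypothetical variables, where `pe()` sets the significance level for entry and `pr()` the level for removal:

```stata
stepwise, pe(0.05): regress y x1 x2 x3 x4   // forward selection
stepwise, pr(0.10): regress y x1 x2 x3 x4   // backward elimination
```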