
Multiple Linear Regression


The content of this article is based on teacher Qingfeng's explanation.

Understanding of linearity

  • Assume $x$ is the independent variable and $y$ is the dependent variable, satisfying the linear relation $y_i=\beta_0+\beta_1 x_i+\mu_i$.

  • The linearity assumption does not require the model to start out in the strict linear form above; the independent and dependent variables can be transformed into a linear model through variable substitution, for example:
    $$y_i=\beta_0+\beta_1\ln x_i+\mu_i$$
    $$\ln y_i=\beta_0+\beta_1\ln x_i+\mu_i$$
    $$y_i=\beta_0+\beta_1 x_i+\mu_i$$
    $$y_i=\beta_0+\beta_1 x_{1i}+\beta_2 x_{2i}+\delta x_{1i}x_{2i}+\mu_i$$

  • Such relationships require preprocessing the data (applying the transformation) before modeling, as sketched below.
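A minimal sketch of this preprocessing in Python with statsmodels; the data, seed, and true parameter values are simulated purely for illustration:

```python
# Minimal sketch: estimating y = b0 + b1*ln(x) + u by substituting z = ln(x).
# Data and true parameters are simulated for illustration only.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(1, 100, size=200)
y = 3.0 + 2.0 * np.log(x) + rng.normal(0, 0.5, size=200)

z = np.log(x)                        # the variable substitution
model = sm.OLS(y, sm.add_constant(z)).fit()
print(model.params)                  # estimates should be close to [3.0, 2.0]
```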

Explore endogeneity

A motivating example

  1. Suppose $x$ is a product quality score (between 1 and 10) and $y$ is the product's sales volume. We build a univariate linear regression model and obtain $\hat y=3.4+2.3x$.
    1. 3.4: when the score is 0, the average sales volume of this product is 3.4;
    2. 2.3: for each additional unit of score, the average sales volume of this product increases by 2.3.
  2. Now suppose there are two independent variables: $x_1$ is the quality score and $x_2$ is the product's price. We build a multiple linear regression model and obtain $\hat y=5.3+0.19x_1-1.74x_2$.
    1. 5.3: when the score is 0 and the price is 0, the average sales volume of this product is 5.3 (this has no practical meaning and need not be interpreted);
    2. 0.19: holding the other variables fixed, each additional unit of score increases the average sales volume by 0.19;
    3. -1.74: holding the other variables fixed, each unit increase in price decreases the average sales volume by 1.74.
  • As you can see, introducing the new independent variable (price) greatly changes the regression coefficient of the score!
  • Reason: endogeneity caused by an omitted variable.

Endogeneity

  • Suppose our model is:
    $$y=\beta_0+\beta_1x_1+\beta_2x_2+\dots+\beta_kx_k+\mu$$
    where $\mu$ is an unobservable disturbance term assumed to satisfy certain conditions. If the error term $\mu$ is uncorrelated with all the independent variables $x$, the regression model is said to be exogenous; if it is correlated with them, endogeneity exists. Endogeneity makes the estimated regression coefficients unreliable: they satisfy neither unbiasedness nor consistency.

  • In the univariate regression model of the example above, the error term contains the price, and the price is correlated with the quality score, which causes endogeneity. (The simulation sketch below reproduces this omitted-variable bias.)
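A hedged simulation sketch of this mechanism; the coefficients and variable names roughly echo the example but are otherwise invented:

```python
# Sketch of omitted-variable endogeneity: price x2 is correlated with quality x1.
# Omitting x2 pushes its effect into the error term and biases the x1 coefficient.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x1 = rng.uniform(1, 10, 500)                  # quality score
x2 = 0.5 * x1 + rng.normal(0, 1, 500)         # price, correlated with quality
y = 5.0 + 0.2 * x1 - 1.7 * x2 + rng.normal(0, 1, 500)

full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
short = sm.OLS(y, sm.add_constant(x1)).fit()
print(full.params)    # close to the true values [5.0, 0.2, -1.7]
print(short.params)   # the x1 coefficient absorbs part of the omitted price effect
```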

Core explanatory variables and control variables

  • Exogeneity requires that every explanatory variable be uncorrelated with the disturbance term. This assumption is usually too strong, because there are typically many explanatory variables.
  • To weaken this condition, the explanatory variables can be divided into core explanatory variables and control variables; we then only require that the core explanatory variables be uncorrelated with $\mu$.
  • Core explanatory variables: the variables we are most interested in, whose coefficients we especially want to estimate consistently.
  • Control variables: we are not particularly interested in these variables themselves; they are included only to "control for" omitted factors that affect the explained variable. That is, put all variables correlated with the core explanatory variables into the regression.

Interpretation of regression coefficients

  • Regression estimation equation:
    $$\hat y=\hat\beta_0+\hat\beta_1x_1+\hat\beta_2x_2+\dots+\hat\beta_kx_k$$

    1. We generally do not interpret the numerical meaning of $\hat\beta_0$, because the independent variables are rarely all 0.
    2. $\hat\beta_m$ is the change in $y$ caused by a one-unit increase in $x_m$, holding the other independent variables fixed, i.e. $\hat\beta_m=\frac{\partial y}{\partial x_m}$. For this reason the regression coefficients of a multiple linear regression model are also called partial regression coefficients. (The sketch below verifies this numerically.)
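A small sketch (simulated data, arbitrary true coefficients) confirming that an estimated coefficient equals the change in the fitted value when one regressor increases by one unit and the others are held fixed:

```python
# Sketch: the partial regression coefficient as a ceteris-paribus effect.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0, 0.1, 100)
fit = sm.OLS(y, sm.add_constant(X)).fit()

a = np.array([[1.0, 0.5, 0.5]])             # [const, x1, x2]
b = np.array([[1.0, 1.5, 0.5]])             # x1 one unit higher, x2 held fixed
print(fit.predict(b) - fit.predict(a))      # equals the estimated beta_1 ...
print(fit.params[1])                        # ... shown here
```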

When to take logarithms?

  • A coefficient on a logged variable measures the elasticity of the explained variable with respect to the explanatory variable, i.e. a change in percentage terms rather than in levels.
  • At present there are no fixed rules for when to take logarithms, but there are some rules of thumb:
    1. variables measured in money, e.g. prices, sales, and wages, can be logged;
    2. variables measured in years, such as years of education or work experience, are usually not logged;
    3. proportion variables, such as the unemployment rate or the participation rate, can go either way;
    4. the variable must be non-negative; if it includes 0, you can log $y$ as $\ln(1+y)$.
  • Advantages of taking logarithms:
    1. it weakens heteroscedasticity in the data;
    2. if the variable itself is not normally distributed, it may become approximately normal after taking logarithms;
    3. it may be required by the model's functional form, so that the economic model makes sense. (See the sketch below.)
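A brief sketch of rule 4 and advantage 2 above, under simulated data: `np.log1p` computes $\ln(1+y)$ and is defined at zero, and logging typically pulls a right-skewed variable toward symmetry:

```python
# Sketch: ln(1+y) via np.log1p handles zeros; logging reduces right skew.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
y = rng.lognormal(mean=3.0, sigma=1.0, size=1000)
y[:50] = 0.0                          # some observations are exactly 0

log_y = np.log1p(y)                   # well-defined even where y == 0
print(stats.skew(y))                  # strongly right-skewed
print(stats.skew(log_y))              # much closer to symmetric
```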

Interpretation of regression coefficients of four types of models

  1. Univariate linear regression: $y=a+bx+\mu$; when $x$ increases by 1 unit, $y$ changes by $b$ units on average;
  2. Double-log model: $\ln y=a+b\ln x+\mu$; when $x$ increases by 1%, $y$ changes by $b\%$ on average;
  3. Semi-log model: $y=a+b\ln x+\mu$; when $x$ increases by 1%, $y$ changes by $b/100$ units on average;
  4. Semi-log model: $\ln y=a+bx+\mu$; when $x$ increases by 1 unit, $y$ changes by $(100b)\%$ on average. (A short derivation follows.)
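A short derivation (added here for clarity, using the approximation $\mathrm{d}\ln y \approx \Delta y/y$) of where the factors of 100 come from:

```latex
\[
\ln y = a + bx + \mu \;\Rightarrow\; \frac{\Delta y}{y} \approx b\,\Delta x,
\qquad \Delta x = 1 \;\Rightarrow\; y \text{ changes by about } (100b)\%.
\]
\[
y = a + b\ln x + \mu \;\Rightarrow\; \Delta y \approx b\,\frac{\Delta x}{x},
\qquad \frac{\Delta x}{x} = 1\% \;\Rightarrow\; \Delta y \approx \frac{b}{100} \text{ units}.
\]
```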

Dummy variable

  • Regression handles quantitative data, so how do we deal with qualitative data?
  • Stata handles dummy variables very conveniently, so it can be used for this analysis.

Single category

  • We want to study the impact of gender on wages:
    $$y=\beta_0+\delta_0 Female+\beta_1 x_1+\beta_2x_2+\dots+\beta_k x_k+\mu$$

    1. $Female_i=1$ means the $i$-th sample is female;
    2. $Female_i=0$ means the $i$-th sample is male;
    3. Core explanatory variable: $Female$;
    4. Control variables: the $x_m$ (variables correlated with $Female$).
  1. $E(y \mid Female=1,\ \text{other covariates fixed})=\delta_0\times1+C$
  2. $E(y \mid Female=0,\ \text{other covariates fixed})=\delta_0\times0+C$
  3. $E(y \mid Female=1,\ \text{other covariates fixed})-E(y \mid Female=0,\ \text{other covariates fixed})=\delta_0$ ($\delta_0$ is meaningful when it is significantly different from 0)
  4. $\delta_0$ can be interpreted as the difference between the average wage of women and that of men, given the other independent variables. (The average wage of men serves as the control group.)

Multiple categories

  • For a multi-category variable, one category serves as the control group and the remaining categories each get a dummy variable; this avoids perfect multicollinearity. Hence the number of dummy variables is generally the number of categories minus 1. (A sketch follows below.)
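A minimal sketch of the single-dummy regression and the categories-minus-one rule, in Python with pandas and statsmodels; the wage numbers are invented for illustration:

```python
# Sketch: a Female dummy with males as the control group, as in the text.
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "wage":   [3.2, 4.1, 2.8, 5.0, 3.9, 4.4],        # illustrative numbers
    "gender": ["F", "M", "F", "M", "M", "F"],
})
df["Female"] = (df["gender"] == "F").astype(float)   # 1 = female, 0 = male (control)
fit = sm.OLS(df["wage"], sm.add_constant(df["Female"])).fit()
print(fit.params["Female"])    # average wage difference, female minus male

# For a variable with many categories, keep one as the base group:
# pd.get_dummies(s, drop_first=True) creates (number of categories - 1) dummies.
```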

What to do when the goodness of fit is low

  1. Regression divides into explanatory regression and predictive regression:
    1. predictive regression generally cares more about $R^2$;
    2. explanatory regression pays more attention to the overall significance of the model and to the statistical and economic significance of the independent variables.
  2. The model can be adjusted, for example by taking logarithms or squares of the data before regressing.
  3. There may be outliers in the data, or the data may be distributed unevenly (for example across quarters).

Goodness of fit and adjusted goodness of fit

  • The more independent variables we introduce, the higher the goodness of fit, which is clearly not what we want. We prefer the adjusted goodness of fit: if a newly introduced independent variable reduces the residual sum of squares SSE only slightly, the adjusted goodness of fit decreases. (Both formulas are sketched in code below.)
    $$R^2=1-\frac{SSE}{SST},\qquad R^2_{adjusted}=1-\frac{SSE/(n-k-1)}{SST/(n-1)}$$
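The two formulas translate directly into code; a minimal sketch with a hypothetical helper (here $k$ counts the regressors, excluding the constant):

```python
# Sketch: R^2 and adjusted R^2 computed from SSE and SST.
import numpy as np

def r_squared(y, y_hat, k):
    """k = number of regressors, excluding the constant."""
    n = len(y)
    sse = np.sum((y - y_hat) ** 2)           # residual sum of squares
    sst = np.sum((y - np.mean(y)) ** 2)      # total sum of squares
    r2 = 1 - sse / sst
    r2_adj = 1 - (sse / (n - k - 1)) / (sst / (n - 1))
    return r2, r2_adj
```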

Standardized regression coefficient

  • To study more accurately which factors have the greatest influence on the explained variable (removing the effect of units), we may consider standardized regression coefficients.
  • To standardize the data, subtract each variable's mean from the original data and then divide by that variable's standard deviation; running the regression on the standardized data yields the standardized regression coefficients.
  • The larger the absolute value of a standardized regression coefficient, the greater the influence on the dependent variable (we only look at significant regression coefficients).
  • Standardizing the data does not affect the significance of the regression coefficients: the t-statistics and p-values are unchanged. (See the sketch below.)
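A sketch of standardized (beta) coefficients under simulated data where the two regressors live on very different scales:

```python
# Sketch: standardized regression coefficients via z-scoring y and the x's.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2)) * [1.0, 100.0]   # x2 on a much larger scale
y = 0.5 * X[:, 0] + 0.002 * X[:, 1] + rng.normal(0, 0.3, 200)

Xz, yz = stats.zscore(X, axis=0), stats.zscore(y)
fit = sm.OLS(yz, sm.add_constant(Xz)).fit()
print(fit.params[1:])   # unit-free; larger |value| means larger influence
print(fit.pvalues[1:])  # significance matches the unstandardized regression
```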

Heteroscedasticity

  • In the previous regression analysis we assumed by default that the disturbance term $\mu_i$ is spherical, i.e. it satisfies both "homoscedasticity" ($E(\mu_i^2)=\sigma^2$) and "no autocorrelation" ($E(\mu_i\mu_j)=0$ for $i\neq j$).
  • Cross-sectional data are prone to heteroscedasticity; time-series data are prone to autocorrelation.

Consequences of heteroscedasticity

  1. The OLS estimates of the regression coefficients remain unbiased and consistent.
  2. The usual hypothesis tests cannot be used (the constructed test statistics are invalid).
  3. The OLS estimator is no longer the best linear unbiased estimator (BLUE).

Test heteroscedasticity

  1. Plot the residuals against the fitted values (or against an independent variable); an even scatter suggests no heteroscedasticity.
  2. The BP test and the White test. The latter also includes squared and cross terms, so the BP test can be regarded as a special case of the White test. Stata command for the BP test: estat hettest, rhs iid; for the White test: estat imtest, white. (A statsmodels sketch of both tests follows.)
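A hedged Python analogue of those Stata commands, using statsmodels diagnostics on simulated heteroskedastic data:

```python
# Sketch: Breusch-Pagan and White tests on data whose error variance grows with x.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan, het_white

rng = np.random.default_rng(5)
x = rng.uniform(1, 10, 300)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5 * x)    # heteroskedastic disturbance
X = sm.add_constant(x)
resid = sm.OLS(y, X).fit().resid

bp_stat, bp_pval, _, _ = het_breuschpagan(resid, X)
w_stat, w_pval, _, _ = het_white(resid, X)
print(bp_pval, w_pval)    # small p-values reject homoscedasticity
```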

Solve heteroscedasticity

  1. Use OLS with robust standard errors (very widely used)
    1. Still run OLS, but use robust standard errors. This is the simplest and currently the most common method. As long as the sample size is large, even when heteroscedasticity is present, using robust standard errors allows all parameter estimation and hypothesis testing to proceed as usual.
    2. Stata command: regress y x_1 x_2 … x_k, robust
  2. Generalized least squares (GLS)
    1. Principle: data with large variance contain less information, so we give greater weight to observations with more information (i.e. those with smaller variance).
    2. Drawback: we do not know the true covariance matrix of the disturbance term and can only estimate it from the sample data, so the result is not robust and may be accidental. (Both approaches are sketched below.)
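A sketch of both remedies in statsmodels; the weight choice in the WLS step is an assumption made for this simulated data, not a general rule:

```python
# Sketch: OLS with heteroscedasticity-robust standard errors, then WLS.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.uniform(1, 10, 300)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5 * x)     # Var(u|x) proportional to x^2
X = sm.add_constant(x)

robust = sm.OLS(y, X).fit(cov_type="HC1")      # same coefficients, robust SEs
print(robust.bse)

wls = sm.WLS(y, X, weights=1.0 / x**2).fit()   # weight = 1/variance (assumed known here)
print(wls.params)
```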

Multicollinearity

  • If the data matrix $X$ does not have full column rank, i.e. some explanatory variable can be written as a linear combination of the other explanatory variables, there is "strict multicollinearity" (perfect multicollinearity).
  • If regressing the $i$-th explanatory variable $x_i$ on the remaining explanatory variables $\{x_1,\dots,x_{i-1},x_{i+1},\dots,x_k\}$ yields a high coefficient of determination, then approximate multicollinearity exists.

Symptoms

  1. Although the overall $R^2$ of the regression equation is high and the $F$-test is significant, the $t$-tests of individual coefficients are insignificant, or the estimated coefficients are unreasonable, with signs even contrary to theoretical expectations.
  2. Adding or removing explanatory variables changes the estimated coefficients greatly.

How to test multicollinearity

  • Variance inflation factor (VIF): suppose there are $k$ independent variables; then the VIF of the $m$-th independent variable is
    $$VIF_m=\frac{1}{1-R_{1\sim k/m}^2}$$
    where $R_{1\sim k/m}^2$ is the goodness of fit obtained by regressing the $m$-th independent variable, as the dependent variable, on the remaining $k-1$ independent variables.

  • The larger $VIF_m$ is, the stronger the correlation between the $m$-th variable and the other variables.
  • If $VIF=\max\{VIF_1,\dots,VIF_k\}>10$, the regression equation is considered to have severe multicollinearity. Stata command: estat vif. (A statsmodels sketch follows.)
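A hedged Python analogue of estat vif, on simulated data with one nearly collinear pair:

```python
# Sketch: variance inflation factors with statsmodels.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(7)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(0, 0.1, 200)              # nearly collinear with x1
x3 = rng.normal(size=200)
X = sm.add_constant(np.column_stack([x1, x2, x3]))

for m in range(1, X.shape[1]):                 # skip the constant
    print(m, variance_inflation_factor(X, m))  # x1, x2 far above 10; x3 near 1
```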

Multicollinearity processing

  1. If the goal is only prediction, i.e. we do not care about the specific regression coefficients, then multicollinearity has no effect (assuming the overall equation is significant). This is because the main consequence of multicollinearity is that the contribution attributed to any single variable is inaccurate, while the overall effect of all the variables can still be estimated fairly accurately.
  2. If you do care about specific regression coefficients, but multicollinearity does not affect the significance of the variables of interest, then you can ignore it. Even with variance inflation, those coefficients are still significant; without the multicollinearity they would only be more significant.
  3. If multicollinearity affects the significance of the variables of interest (the core explanatory variables), you need to increase the sample size, eliminate the variables causing severe collinearity (do not delete them lightly, since that may introduce endogeneity), or modify the model specification.

Solve multicollinearity

  • Forward stepwise regression: introduce the independent variables into the model one by one, testing each newly introduced variable and adding it to the regression model only when it is significant. Drawback: as further independent variables are introduced later, an originally significant independent variable may become insignificant, yet it is not removed from the regression equation in time.
  • Backward stepwise regression: the opposite of forward stepwise regression. First put all variables into the model, then try removing one independent variable at a time and check whether the proportion of variation in the dependent variable explained by the whole model changes significantly; eliminate the independent variable with the least explanatory power. Repeat this process until no independent variable meets the elimination criterion. Drawback: all variables enter the regression equation at the start, so the computation is relatively heavy. (A minimal sketch follows.)
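A minimal backward-elimination sketch (hypothetical helper; the p-value threshold and the drop-the-worst rule are one common convention, not the only one):

```python
# Sketch: backward stepwise elimination by p-value.
import statsmodels.api as sm

def backward_eliminate(y, X, alpha=0.05):
    """X: pandas DataFrame of regressors (no constant); returns retained columns."""
    cols = list(X.columns)
    while cols:
        fit = sm.OLS(y, sm.add_constant(X[cols])).fit()
        pvals = fit.pvalues.drop("const")      # ignore the intercept
        worst = pvals.idxmax()                 # least significant variable
        if pvals[worst] <= alpha:
            break                              # everything remaining is significant
        cols.remove(worst)                     # eliminate it and refit
    return cols
```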