Multiple Linear Regression
The content of this article is based on teacher Qingfeng's lectures.
Understanding of linearity
Assume $x$ is an independent variable and $y$ is a dependent variable satisfying the linear relation: $y_i=\beta_0+\beta_1 x_i+\mu_i$
The linearity assumption does not require the initial model to take the strict linear form above; the independent and dependent variables can often be brought into a linear relationship through variable substitution, for example:
$$
y_i=\beta_0+\beta_1\ln x_i+\mu_i\\
\ln y_i=\beta_0+\beta_1\ln x_i+\mu_i\\
y_i=\beta_0+\beta_1 x_i+\mu_i\\
y_i=\beta_0+\beta_1 x_{1i}+\beta_2 x_{2i}+\delta x_{1i}x_{2i}+\mu_i
$$
Such relationships require preprocessing the data before modeling.
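As a minimal sketch of such preprocessing in Stata (assuming a dataset with variables `y` and `x` is already in memory; all names here are hypothetical), the transformed variables are generated first and an ordinary linear regression is then fit on them:

```stata
* Generate the transformed variables, then run an ordinary linear regression.
gen lnx = ln(x)      // for the model y = b0 + b1*ln(x) + u
gen lny = ln(y)      // for the model ln(y) = b0 + b1*ln(x) + u
regress y lnx        // log on the right-hand side only
regress lny lnx      // log on both sides
```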
Explore endogeneity
A motivating example
- Suppose $x$ is a product quality score (between 1 and 10) and $y$ is the product's sales volume. Building a univariate linear regression model gives $\hat y=3.4+2.3x$
  - 3.4: when the score is 0, the average sales volume of this product is 3.4
  - 2.3: each additional unit of score raises the average sales volume of this product by 2.3
- Now suppose there are two independent variables: $x_1$ is the quality score and $x_2$ is the product's price. Building a multiple linear regression model gives $\hat y=5.3+0.19x_1-1.74x_2$
  - 5.3: when the score is 0 and the price is 0, the average sales volume of this product is 5.3 (this has no practical meaning and needs no analysis)
  - 0.19: with the other variable held fixed, each additional unit of score raises the average sales volume by 0.19
  - -1.74: with the other variable held fixed, each additional unit of price lowers the average sales volume by 1.74
- Notice that introducing the new independent variable, price, changes the regression coefficient on quality dramatically!
- Reason: endogeneity caused by an omitted variable. A short simulation of this effect follows below.
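A small simulation makes the omitted-variable story concrete. The sketch below (hypothetical numbers, loosely mimicking the example) generates a price that is correlated with the quality score and compares the short and long regressions:

```stata
* Simulated omitted-variable bias: price is correlated with score,
* so leaving it out distorts the coefficient on score.
clear
set seed 123
set obs 1000
gen score = runiform(1, 10)                        // quality score in [1, 10]
gen price = 2 + 0.5*score + rnormal()              // price depends on score
gen sales = 5 + 0.2*score - 1.7*price + rnormal()
regress sales score          // short regression: price hides in the error term
regress sales score price    // long regression: coefficient on score shifts sharply
```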
Endogeneity
Suppose our model is:
$$y=\beta_0+\beta_1x_1+\beta_2x_2+\dots+\beta_kx_k+\mu$$
where $\mu$ is an unobservable disturbance term assumed to satisfy certain conditions. If the error term $\mu$ is uncorrelated with all of the independent variables $x$, the regression model is said to be exogenous; if it is correlated with any of them, endogeneity is present. Endogeneity makes the estimates of the regression coefficients inaccurate: they are neither unbiased nor consistent. In the univariate regression of the example above, the error term contains price, and price is correlated with the quality score, which is what causes the endogeneity.
Core explanatory variables and control variables
- Exogeneity requires that every explanatory variable be uncorrelated with the disturbance term. This assumption is usually too strong, because a model typically contains many explanatory variables.
- To weaken the condition, the explanatory variables can be divided into core explanatory variables and control variables; only the core explanatory variables need to be uncorrelated with $\mu$.
- Core explanatory variables: the variables we are most interested in, whose coefficients we particularly want to estimate consistently.
- Control variables: variables we are not especially interested in themselves; they are included only to "control for" omitted factors that affect the explained variable. In other words, put every variable correlated with the core explanatory variables into the regression.
Interpretation of regression coefficients
The estimated regression equation:
$$\hat y=\hat\beta_0+\hat\beta_1x_1+\hat\beta_2x_2+\dots+\hat\beta_kx_k$$
- $\hat\beta_0$: we generally do not interpret its numerical value, because the independent variables will never all be 0.
- $\hat\beta_m$ is the change in $y$ caused by a one-unit increase in $x_m$ while the other independent variables are held fixed, i.e. $\hat\beta_m=\frac{\partial y}{\partial x_m}$. For this reason the coefficients of a multiple linear regression model are also called partial regression coefficients.
When to take logarithms?
- Taking logarithms gives the elasticity of the explained variable with respect to the explanatory variable, i.e. changes in percentages rather than in values.
- At present there is no fixed rule for when to take logarithms, but there are some rules of thumb:
  - Variables related to market value, such as prices, sales, and wages, can be logged;
  - Variables measured in years, such as years of education or work experience, are usually not logged;
  - Proportion variables, such as the unemployment rate or the participation rate, can go either way;
  - The variable must be non-negative; if it contains 0, you can transform $y$ with $\ln(1+y)$, as in the sketch below.
- Advantages of taking logarithms:
  - It weakens heteroscedasticity in the data;
  - If a variable is not itself normally distributed, it may be approximately normal after the log transform;
  - It may be required by the functional form of the model, so that the economic model makes sense.
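A minimal sketch of the non-negativity rule above (variable names hypothetical): a strictly positive variable can be logged directly, while a variable containing zeros should use the $\ln(1+y)$ form:

```stata
gen lnwage  = ln(wage)        // wage is strictly positive: plain log
gen lnsales = ln(1 + sales)   // sales can be 0: ln(1+y) keeps those observations
```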
Interpreting the regression coefficients of four model types
- Univariate linear regression: $y=a+bx+\mu$; when $x$ increases by 1 unit, $y$ changes by $b$ units on average;
- Double logarithmic (log-log) model: $\ln y=a+b\ln x+\mu$; when $x$ increases by 1%, $y$ changes by $b\%$ on average;
- Semilogarithmic (level-log) model: $y=a+b\ln x+\mu$; when $x$ increases by 1%, $y$ changes by $b/100$ units on average;
- Semilogarithmic (log-level) model: $\ln y=a+bx+\mu$; when $x$ increases by 1 unit, $y$ changes by $(100b)\%$ on average.
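The four forms in this list map directly onto four regressions. A hedged sketch using hypothetical variables `wage` and `educ`:

```stata
gen lnwage = ln(wage)
gen lneduc = ln(educ)
regress wage   educ     // level-level: +1 unit of educ -> b units of wage
regress lnwage lneduc   // log-log:     +1% educ -> b% wage
regress wage   lneduc   // level-log:   +1% educ -> b/100 units of wage
regress lnwage educ     // log-level:   +1 unit of educ -> (100b)% wage
```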
Dummy variables
- Regression handles quantitative data, so how do we deal with qualitative data?
- Stata handles dummy variables very conveniently, so this software can be used for the analysis.
Single category
We want to study the impact of gender on wages:
$$y=\beta_0+\delta_0 Female+\beta_1 x_1+\beta_2x_2+\dots+\beta_k x_k+\mu$$
- $Female_i=1$ means the $i$-th sample is female;
- $Female_i=0$ means the $i$-th sample is male;
- Core explanatory variable: $Female$;
- Control variables: the $x_m$ (variables correlated with $Female$)
- $E(y\mid Female=1,\ \text{other covariates fixed})=\delta_0\times 1+C$
- $E(y\mid Female=0,\ \text{other covariates fixed})=\delta_0\times 0+C$
- $E(y\mid Female=1,\ \cdot)-E(y\mid Female=0,\ \cdot)=\delta_0$ (this is meaningful when $\delta_0$ is significantly different from 0)
- $\delta_0$ can therefore be interpreted as: given the other independent variables, the difference between the average wage of women and that of men (men are the control group).
Multiple categories
- For a categorical variable with multiple levels, one level serves as the control group and the rest enter as dummy variables. This avoids perfect multicollinearity, so the number of dummy variables is generally the number of categories minus 1, as in the sketch below.
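A sketch of two ways to build the dummies in Stata (variables hypothetical); with $k$ categories, only $k-1$ dummies enter the regression and the omitted one is the control group:

```stata
* Option 1: factor-variable notation; Stata omits the base category itself.
regress wage i.industry educ exper
* Option 2: build the dummies by hand with tabulate, then leave one out.
tabulate industry, generate(ind_)            // creates ind_1, ind_2, ...
regress wage ind_2 ind_3 ind_4 educ exper    // ind_1 is the control group
```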
What if the goodness of fit is low?
- Regression divides into explanatory regression and predictive regression:
  - Predictive regression generally cares more about $R^2$;
  - Explanatory regression pays more attention to the overall significance of the model and to the statistical and economic significance of the individual independent variables.
- The model can be adjusted, for example by taking logarithms or squares of the data before regressing.
- There may be outliers in the data, or the data may be unevenly distributed across different quarters.
Goodness of fit and adjusted goodness of fit
- Introducing more independent variables always increases the goodness of fit, which is obviously not what we want. We therefore prefer the adjusted goodness of fit: if a newly introduced independent variable reduces the residual sum of squares SSE only slightly, the adjusted goodness of fit decreases.
$$R^2=1-\frac{SSE}{SST}\qquad R^2_{adjusted}=1-\frac{SSE/(n-k-1)}{SST/(n-1)}$$
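Both quantities are reported by `regress` and stored afterwards; a quick check using the `auto` dataset that ships with Stata:

```stata
sysuse auto, clear
regress price weight length foreign
display "R2 = " e(r2) "    adjusted R2 = " e(r2_a)
```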
Standardized regression coefficients
- To identify more accurately the important factors influencing the dependent variable (and remove the effect of units of measurement), we may consider using standardized regression coefficients.
- Standardizing the data means subtracting each variable's mean from the raw data and then dividing by the variable's standard deviation; running the regression on the standardized data yields the standardized regression coefficients.
- The larger the absolute value of a standardized regression coefficient, the greater the variable's influence on the dependent variable (consider only the significant regression coefficients).
- Standardizing the data does not affect the significance of the regression coefficients: the $t$-statistics are unchanged.
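In Stata the standardized coefficients do not require transforming the data by hand: the `beta` option of `regress` reports them alongside the usual estimates (illustrated on the built-in `auto` dataset):

```stata
sysuse auto, clear
regress price weight length, beta   // adds a "Beta" column of standardized coefficients
```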
Heteroscedasticity
- In the preceding regression analysis we assumed by default that the disturbance term $\mu_i$ is a spherical disturbance: it satisfies both homoscedasticity ($E(\mu_i^2)=\sigma^2$) and no autocorrelation ($E(\mu_i\mu_j)=0$ for $i\neq j$).
- Cross-sectional data are prone to heteroscedasticity; time-series data are prone to autocorrelation.
Consequences of heteroscedasticity
- The OLS estimates of the regression coefficients remain unbiased and consistent.
- Hypothesis tests become unusable (the constructed statistics are invalid).
- The OLS estimator is no longer the best linear unbiased estimator (BLUE).
Testing for heteroscedasticity
- You can plot the residuals against the fitted values (or against an independent variable); an even spread of points suggests no heteroscedasticity.
- The BP test and the White test. The latter also includes square and cross terms, so the BP test can be regarded as a special case of the White test. Stata command for the BP test: `estat hettest, rhs iid`; for the White test: `estat imtest, white`. Both are illustrated in the sketch below.
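Putting the visual check and both tests together (illustrated on the built-in `auto` dataset; the regression itself is arbitrary):

```stata
sysuse auto, clear
regress price weight length
rvfplot               // residuals vs fitted values: look for a fan shape
estat hettest         // Breusch-Pagan test
estat imtest, white   // White test, with squares and cross terms
```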
Dealing with heteroscedasticity
- Use OLS + robust standard errors (widely used)
  - Still run the OLS regression, but use robust standard errors. This is the simplest and currently the most common method. As long as the sample size is large, even in the presence of heteroscedasticity, all parameter estimation and hypothesis testing can proceed as usual once robust standard errors are used.
  - Stata command: `regress y x_1 x_2 … x_k, robust`
- Generalized least squares (GLS)
  - Principle: data with larger variance contain less information, so we give greater weight to observations carrying more information (i.e. those with smaller variance). A feasible-GLS sketch follows below.
  - Drawback: the true covariance matrix of the disturbance term is unknown and can only be estimated from the sample data, so the result is not robust and may be coincidental.
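One common textbook recipe for feasible GLS (a sketch, not the only approach) estimates the variance function from the logged squared residuals and then reweights; illustrated on the built-in `auto` dataset:

```stata
sysuse auto, clear
regress price weight length
predict ehat, residuals
gen lne2 = ln(ehat^2)               // log of squared residuals
regress lne2 weight length          // model the variance function
predict lnvar, xb
gen w = 1/exp(lnvar)                // weight = inverse of estimated variance
regress price weight length [aweight = w]   // weighted (FGLS) regression
```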
Multicollinearity
- If the data matrix $X$ does not have full column rank, i.e. some explanatory variable can be written as a linear combination of the other explanatory variables, there is "strict multicollinearity" (perfect multicollinearity).
- If regressing the $i$-th explanatory variable $x_i$ on the remaining explanatory variables $\{x_1,\dots,x_{i-1},x_{i+1},\dots,x_k\}$ yields a high coefficient of determination, there is approximate multicollinearity.
Symptoms
- Although the $R^2$ of the whole regression equation is high and the $F$ test is significant, the $t$ tests of individual coefficients are insignificant, or the estimated coefficients are unreasonable, even carrying signs contrary to theoretical expectations.
- Adding or removing an explanatory variable makes the estimated coefficients change greatly.
How to test multicollinearity
- Variance inflation factor (VIF): suppose there are $k$ independent variables; then the VIF of the $m$-th one is
$$VIF_m=\frac{1}{1-R_{1\sim k/m}^2}$$
where $R_{1\sim k/m}^2$ is the goodness of fit obtained by taking the $m$-th independent variable as the dependent variable and regressing it on the remaining $k-1$ independent variables.
- The larger $VIF_m$ is, the stronger the correlation between the $m$-th variable and the other variables.
- If $VIF=\max\{VIF_1,\dots,VIF_k\}>10$, the regression equation is considered to have serious multicollinearity. Stata command: `estat vif` (see the sketch below).
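The command runs directly after the regression; a quick illustration on the built-in `auto` dataset:

```stata
sysuse auto, clear
regress price weight length displacement
estat vif    // lists each variable's VIF; values above 10 signal trouble
```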
Handling multicollinearity
- If the goal is only prediction, i.e. we do not care about the specific regression coefficients, multicollinearity has no effect (assuming the whole equation is significant). This is because the main consequence of multicollinearity is that the contribution of each individual variable is estimated inaccurately, while the overall effect of all the variables together can still be estimated fairly accurately.
- If you do care about specific regression coefficients but multicollinearity does not affect the significance of the variables of interest, it can be ignored: even with variance inflation these coefficients remain significant, and without the multicollinearity they would only be more significant.
- If multicollinearity does affect the significance of the variables of interest (the core explanatory variables), you need to increase the sample size, drop the variables causing the severe collinearity (do not delete variables lightly, because doing so may introduce endogeneity), or modify the model specification.
Solving multicollinearity: stepwise regression
- Forward stepwise regression: introduce the independent variables into the model one at a time, testing each newly introduced variable and adding it to the regression model only when it is significant. Drawback: as further independent variables are introduced later, a previously significant independent variable may become insignificant, yet it is not removed from the regression equation in time.
- Backward stepwise regression: the opposite of forward stepwise regression. Start with all variables in the model, then try removing the independent variables one at a time, checking whether the share of the variation in the dependent variable explained by the whole model changes significantly, and eliminate the variable with the least explanatory power. The process iterates until no independent variable meets the elimination criterion. Drawback: all variables are introduced into the regression equation at the start, so the computation is relatively heavy. Both directions are sketched below.
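Stata implements both directions through the `stepwise` prefix; a hedged sketch with hypothetical variables, where `pe()` sets the significance level for entry and `pr()` the level for removal:

```stata
stepwise, pe(0.05): regress y x1 x2 x3 x4   // forward selection
stepwise, pr(0.10): regress y x1 x2 x3 x4   // backward elimination
```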