Notes on Teacher Wu Enda (Andrew Ng)'s Machine Learning Course 04: Multiple Linear Regression
2022-07-29 06:53:00
4 Multiple linear regression
4.1 Multiple features
The linear regression problems studied so far had only a single feature, but many problems involve several features; linear regression with multiple features is called multiple linear regression.
Notation conventions
Use $n$ to denote the number of features, and $x^{(i)}$ to denote the input feature vector of the $i$-th training sample; this vector is $n$-dimensional. (Note that the feature vector here is not the eigenvector of a matrix from linear algebra.)
Use $x^{(i)}_j$ to denote the value of the $j$-th feature of the $i$-th sample.
Hypothesis function
The hypothesis function $h$ of multiple linear regression is written as $h_{\theta}(x)=\theta_{0}+\theta_{1} x_{1}+\theta_{2} x_{2}+\ldots+\theta_{n} x_{n}$.
To simplify the notation, introduce $x_{0}=1$, so that $h_{\theta}(x)=\theta_{0} x_{0}+\theta_{1} x_{1}+\theta_{2} x_{2}+\ldots+\theta_{n} x_{n}$. The feature vector then becomes $(n+1)$-dimensional.
Writing the features as a vector $X$ and the parameters as a vector $\Theta$, the hypothesis becomes $h_{\theta}(x)=\Theta^{T}X$.
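As an illustration (not part of the original notes), here is a minimal NumPy sketch of the vectorized hypothesis; the function name `predict` and the sample values are made up for the example.

```python
import numpy as np

def predict(X, theta):
    """Vectorized hypothesis h_theta(x) = Theta^T x for all m samples at once.

    X     : (m, n+1) design matrix whose first column is all ones (x0 = 1)
    theta : (n+1,) parameter vector
    """
    return X @ theta

# Example: m = 3 samples, n = 2 features (values are illustrative only)
X_raw = np.array([[2104.0, 3.0],
                  [1416.0, 2.0],
                  [ 852.0, 1.0]])
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])  # prepend x0 = 1
theta = np.array([0.5, 0.1, 20.0])
print(predict(X, theta))  # one prediction per sample
```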
Summary
With the convention $x_{0}=1$, the hypothesis of multiple linear regression can be written compactly in vectorized form as $h_{\theta}(x)=\Theta^{T}X$.
4.2 Gradient descent for multiple variables
Parameter update rule of multivariate gradient descent
$\theta_{0}:=\theta_{0}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{0}^{(i)}$
$\theta_{1}:=\theta_{1}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{1}^{(i)}$
$\theta_{2}:=\theta_{2}-\alpha \frac{1}{m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right) x_{2}^{(i)}$
Summary
The parameter update rule of multivariate gradient descent has the same form as in the univariate case; it is simply applied to every parameter $\theta_{j}$ simultaneously.
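The update rule translates directly into a vectorized implementation. The following NumPy sketch is an illustrative addition (the function name and default values are assumptions, not from the course); note that all parameters are updated simultaneously in each iteration.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for multiple linear regression.

    Implements theta_j := theta_j - alpha * (1/m) * sum_i (h(x^(i)) - y^(i)) * x_j^(i)
    for every j at once.

    X : (m, n+1) design matrix with a leading column of ones
    y : (m,) target vector
    """
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        errors = X @ theta - y                # h_theta(x^(i)) - y^(i) for every sample
        theta -= alpha * (X.T @ errors) / m   # simultaneous update of all theta_j
    return theta
```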
4.3-4.4 Gradient descent in practice
Trick 1: Feature scaling
For problems with several features, gradient descent converges faster when the features are on similar scales, that is, when the values of the different features lie in comparable ranges.
The solution is to scale every feature to roughly the range $-1$ to $1$ (values close to this range are fine). A simple way is to divide each feature by its maximum value. Mean normalization can also be used: for $i \ge 1$, let $x_{i}=\frac{x_{i}-\mu_{i}}{s_{i}}$, where $\mu_{i}$ is the mean of the feature and $s_{i}$ is its range (maximum minus minimum); $s_{i}$ can also be set to the standard deviation. Using the range puts each feature $x_{i}$ roughly between $-0.5$ and $0.5$.
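For illustration (not from the original notes), a small NumPy sketch of mean normalization; the function name is assumed, and $s_i$ is taken as the range, as described above. The constant $x_{0}=1$ column should be added only after scaling.

```python
import numpy as np

def mean_normalize(X):
    """Mean normalization: x_i := (x_i - mu_i) / s_i for each raw feature column.

    s_i is taken as the feature's range (max - min); the standard deviation
    could be used instead. Apply this to the raw features only, before
    prepending the x0 = 1 column (whose range would be zero).
    """
    mu = X.mean(axis=0)
    s = X.max(axis=0) - X.min(axis=0)
    return (X - mu) / s, mu, s
```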
Trick 2: Learning rate
The number of iterations gradient descent needs can vary greatly from problem to problem. Plotting the cost function against the number of iterations makes it easy to judge whether gradient descent has converged. An automatic convergence test can also be used, comparing the decrease of the cost function in one iteration against a threshold (for example $0.001$); however, choosing a suitable threshold is usually difficult, so judging convergence from the plot is generally more practical.
The same plot also shows whether the algorithm is working properly: if the curve is rising, or rising over part of its course, the learning rate $\alpha$ is most likely too large.
Every iteration of gradient descent is affected by the learning rate. If the learning rate is too small, a very large number of iterations is needed to reach convergence; if it is too large, an iteration may fail to decrease the cost function and may overshoot the local minimum, so the algorithm does not converge.
In practice, try learning rates such as $\alpha=0.001, 0.003, 0.01, 0.03, 0.1, 1$, with each candidate roughly three times the previous one, pick out the values that make the cost function decrease quickly, and choose the largest of these as the final $\alpha$.
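As an illustrative sketch (not from the notes), the code below records the cost after every iteration so the cost-versus-iterations curve can be plotted for several candidate learning rates; `compute_cost` and `run_with_history` are hypothetical helper names.

```python
import numpy as np

def compute_cost(X, y, theta):
    """Squared-error cost J(theta) = (1/(2m)) * sum_i (h(x^(i)) - y^(i))^2."""
    m = X.shape[0]
    errors = X @ theta - y
    return (errors @ errors) / (2 * m)

def run_with_history(X, y, alpha, num_iters=400):
    """Gradient descent that records J(theta) after each iteration,
    so the cost-vs-iterations curve can be inspected for convergence."""
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(num_iters):
        theta -= alpha * (X.T @ (X @ theta - y)) / m
        history.append(compute_cost(X, y, theta))
    return theta, history

# Candidate learning rates spaced roughly 3x apart; keep the largest one
# whose cost history still decreases steadily, e.g.:
# for alpha in [0.001, 0.003, 0.01, 0.03, 0.1, 1]:
#     _, history = run_with_history(X, y, alpha)
```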
Summary
Feature scaling makes gradient descent converge faster, so fewer iterations are needed.
Observing the curve of the cost function against the number of iterations helps in choosing a suitable learning rate.
4.5 Features and polynomial regression
Feature selection
For example, two features such as the length and width of a rectangular plot can sometimes be replaced by a single feature (their product, the area); defining new features in this way can lead to a better model.
Linear regression does not fit every data set; sometimes a quadratic or cubic model is needed. Such models can still be cast as linear regression. For $h_{\theta}(x)=\theta_{0}+\theta_{1} x+\theta_{2} x^{2}+\theta_{3} x^{3}$, let $x_{1}=x$, $x_{2}=x^{2}$, $x_{3}=x^{3}$, which turns the model into $h_{\theta}(x)=\theta_{0}+\theta_{1} x_{1}+\theta_{2} x_{2}+\theta_{3} x_{3}$. Polynomial regression is therefore closely tied to feature selection, and when features such as $x_{2}=x^{2}$ and $x_{3}=x^{3}$ are chosen, feature scaling becomes very important.
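A short sketch, added for illustration, of how the new features $x_{1}=x$, $x_{2}=x^{2}$, $x_{3}=x^{3}$ can be constructed so that the cubic model is fitted with ordinary linear regression (the helper name `polynomial_features` is an assumption):

```python
import numpy as np

def polynomial_features(x, degree=3):
    """Build the design matrix [1, x, x^2, ..., x^degree] from a single feature x,
    so a degree-`degree` polynomial can be fitted by linear regression.

    Because x, x^2, x^3 have very different ranges, feature scaling of the
    resulting columns is important before running gradient descent."""
    x = np.asarray(x, dtype=float).reshape(-1)
    return np.column_stack([x ** d for d in range(degree + 1)])
```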
Summary
For multiple linear regression problems, choosing appropriate features makes the learning algorithm more effective.
Through feature selection, polynomial regression can be fitted with linear regression.
4.6 Normal equation
Normal equation method
To minimize the cost function, gradient descent converges to the minimum through many iterations, whereas the normal equation method gives an analytic solution for $\theta$: the optimal $\theta$ is obtained in a single computation.
Following calculus, take the partial derivatives of the cost function and set them to zero, $\frac{\partial}{\partial \theta_{j}} J(\theta)=0$; solving the resulting system of linear equations yields the $\theta$ that minimizes the cost function.
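A brief sketch of where the closed form comes from (this derivation is an addition to the notes, assuming the usual squared-error cost written in vectorized form):

$$J(\theta)=\frac{1}{2m}(X\theta-y)^{T}(X\theta-y), \qquad \nabla_{\theta}J(\theta)=\frac{1}{m}X^{T}(X\theta-y)=0 \;\Longrightarrow\; X^{T}X\,\theta=X^{T}y \;\Longrightarrow\; \theta=\left(X^{T}X\right)^{-1}X^{T}y.$$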
Let $X$ be the design (feature) matrix of the training set, including $x^{(i)}_{0}=1$, and let $y$ be the vector of training-set targets. Then the $\theta$ that minimizes the cost function is given by the normal equation $\theta=\left(X^{T} X\right)^{-1} X^{T} y$.
Note that when the normal equation method is used, feature scaling is not required.
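For illustration (not part of the original notes), a one-line NumPy sketch of the normal equation; `np.linalg.pinv` is used here as a defensive choice in case $X^{T}X$ is singular, for example when features are redundant.

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form solution theta = (X^T X)^(-1) X^T y.

    X : (m, n+1) design matrix with a leading column of ones
    y : (m,) target vector
    """
    return np.linalg.pinv(X.T @ X) @ X.T @ y
```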
Comparison of gradient descent and the normal equation
| Gradient descent | Normal equation |
|---|---|
| Requires trying out and choosing a learning rate | No learning rate to choose |
| Needs many iterations; computation may be slower | Solved in a single computation; no extra steps needed to check convergence |
| Still works well when the number of features $n$ is large | Must compute $\left(X^{T} X\right)^{-1}$, which is expensive for large $n$ because matrix inversion costs roughly $O(n^{3})$; generally acceptable while $n$ is below about $10{,}000$ |
| Applicable to many kinds of models | Applies only to linear models; not suitable for logistic regression, classification models, and the like |
Summary
For the linear regression model, the normal equation method is preferable when the number of features $n$ is small ($n$ below roughly $10{,}000$); when $n$ is large, gradient descent is preferable.
For more complex models, the normal equation method does not apply.