1.21 study gradient descent and normal equation

2022-06-26 08:47:00 【Thick Cub with thorns】

1.20 Multivariate linear regression

Contents
1.20 Multivariate linear regression
4. Multivariate linear regression (Linear Regression with Multiple Variables)
4.1 Multiple features
4.2 Multivariable gradient descent
4.3 Gradient descent in practice 1: Feature scaling
4.4 Gradient descent in practice 2: Learning rate
4.5 Features and polynomial regression
4.6 Normal equation
4.7 Normal equation and non-invertibility (optional)

4. Multivariate linear regression (Linear Regression with Multiple Variables)

4.1 Multiple features

Reference video : 4 - 1 - Multiple Features (8 min).mkv

So far we have looked at regression with a single variable (feature). Now we add more features to the house-price model, such as the number of rooms and the number of floors, to build a model with multiple variables. The features of the model are $\left( {x_{1}},{x_{2}},...,{x_{n}} \right)$.

[Figure: image unavailable in this copy]

After adding more features, we introduce some new notation:

$n$ denotes the number of features.

${x^{\left( i \right)}}$ denotes the $i$-th training example, i.e. the $i$-th row of the feature matrix. It is a vector.

For example, in the figure above,

${x}^{(2)}=\begin{bmatrix} 1416\\ 3\\ 2\\ 40 \end{bmatrix}$,

${x}_{j}^{\left( i \right)}$ denotes the $j$-th feature in the $i$-th row of the feature matrix, i.e. the $j$-th feature of the $i$-th training example.

For example, in the figure above, $x_{2}^{\left( 2 \right)}=3,x_{3}^{\left( 2 \right)}=2$.

The hypothesis $h$ for multiple variables is: $h_{\theta}\left( x \right)={\theta_{0}}+{\theta_{1}}{x_{1}}+{\theta_{2}}{x_{2}}+...+{\theta_{n}}{x_{n}}$.

This formula has $n + 1$ parameters and $n$ variables. To simplify it, we introduce $x_{0}=1$, so that the formula becomes: $h_{\theta} \left( x \right)={\theta_{0}}{x_{0}}+{\theta_{1}}{x_{1}}+{\theta_{2}}{x_{2}}+...+{\theta_{n}}{x_{n}}$

The parameters of the model now form an $(n + 1)$-dimensional vector, and every training example is also an $(n + 1)$-dimensional vector, so the feature matrix $X$ has dimension $m \times (n + 1)$. The formula can therefore be written compactly as $h_{\theta} \left( x \right)={\theta^{T}}x$, where the superscript $T$ denotes the transpose.
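As a minimal sketch of the vectorized hypothesis, the snippet below prepends $x_{0}=1$ to a small feature matrix and evaluates $X\theta$ for all examples at once. The second row reuses the example $x^{(2)}$ from the text; the other feature rows, the parameter vector `theta`, and the helper names `add_intercept` and `predict` are made up purely for illustration.

```python
import numpy as np

# Raw features, one row per training example (n = 4 features).
# Row 2 is x^(2) from the text; the other rows are hypothetical.
features = np.array([
    [2000.0, 4.0, 1.0, 10.0],
    [1416.0, 3.0, 2.0, 40.0],
    [900.0, 2.0, 1.0, 30.0],
])

def add_intercept(features):
    """Prepend the constant feature x_0 = 1 to every example."""
    m = features.shape[0]
    return np.hstack([np.ones((m, 1)), features])

def predict(X, theta):
    """Evaluate h_theta(x) = theta^T x for every row of X at once."""
    return X @ theta

X = add_intercept(features)                      # shape (m, n + 1)
theta = np.array([50.0, 0.1, 20.0, 5.0, -1.0])   # hypothetical parameters, shape (n + 1,)
predictions = predict(X, theta)                  # shape (m,)
```

Each entry of `predictions` is the dot product of `theta` with one row of `X`, which is exactly $\theta^{T}x^{(i)}$ for that example.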

4.2 Multivariable gradient descent

Reference video : 4 - 2 - Gradient Descent for Multiple Variables (5 min).mkv

As in univariate linear regression, we construct a cost function for multivariate linear regression. It is the sum of the squared modeling errors over all training examples, scaled by $\frac{1}{2m}$: $J\left( {\theta_{0}},{\theta_{1}}...{\theta_{n}} \right)=\frac{1}{2m}\sum\limits_{i=1}^{m}{ { {\left( h_{\theta} \left({x}^{\left( i \right)} \right)-{y}^{\left( i \right)} \right)}^{2}}}$,

where: $h_{\theta}\left( x \right)=\theta^{T}x={\theta_{0}}+{\theta_{1}}{x_{1}}+{\theta_{2}}{x_{2}}+...+{\theta_{n}}{x_{n}}$.

Our goal is the same as in the univariate problem: find the parameters that minimize the cost function.
The batch gradient descent algorithm for multivariate linear regression is:

Repeat {
$\theta_{j}:=\theta_{j}-\alpha\frac{\partial}{\partial\theta_{j}}J\left( {\theta_{0}},{\theta_{1}}...{\theta_{n}} \right)$
}

That is:

Repeat {
$\theta_{j}:=\theta_{j}-\alpha\frac{\partial}{\partial\theta_{j}}\frac{1}{2m}\sum\limits_{i=1}^{m}{ {\left( h_{\theta}\left( {x^{(i)}} \right)-{y^{(i)}} \right)}^{2}}$
}

Taking the derivative gives:

Repeat {
$\theta_{j}:=\theta_{j}-\alpha\frac{1}{m}\sum\limits_{i=1}^{m}{\left( h_{\theta}\left( {x^{(i)}} \right)-{y^{(i)}} \right)}x_{j}^{(i)}$ (update all $\theta_{j}$, $j=0,1,...,n$, simultaneously)
}

When $n \geq 1$:
${\theta_{0}}:={\theta_{0}}-\alpha\frac{1}{m}\sum\limits_{i=1}^{m}{\left( {h_{\theta}}\left( {x^{(i)}} \right)-{y^{(i)}} \right)}x_{0}^{(i)}$

${\theta_{1}}:={\theta_{1}}-\alpha\frac{1}{m}\sum\limits_{i=1}^{m}{\left( {h_{\theta}}\left( {x^{(i)}} \right)-{y^{(i)}} \right)}x_{1}^{(i)}$

${\theta_{2}}:={\theta_{2}}-\alpha\frac{1}{m}\sum\limits_{i=1}^{m}{\left( {h_{\theta}}\left( {x^{(i)}} \right)-{y^{(i)}} \right)}x_{2}^{(i)}$

We start from an arbitrary (for example random) set of parameter values, compute all the predictions, then give every parameter a new value at the same time, and repeat until convergence.

Code example:

Computing the cost function
$J\left( \theta \right)=\frac{1}{2m}\sum\limits_{i=1}^{m}{ { {\left( {h_{\theta}}\left( {x^{(i)}} \right)-{y^{(i)}} \right)}^{2}}}$
where: ${h_{\theta}}\left( x \right)={\theta^{T}}x={\theta_{0}}{x_{0}}+{\theta_{1}}{x_{1}}+{\theta_{2}}{x_{2}}+...+{\theta_{n}}{x_{n}}$

Python code:

import numpy as np

def computeCost(X, y, theta):
    # X: (m, n+1) feature matrix, y: (m, 1) targets, theta: (1, n+1) row vector
    inner = np.power((X @ theta.T) - y, 2)
    return np.sum(inner) / (2 * len(X))
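The update rule of this section can be sketched in vectorized form as follows. This is a minimal illustration rather than the course's reference code: it assumes 1-D NumPy arrays for `y` and `theta`, starts from zeros instead of random values, and uses hypothetical toy data. The names `compute_cost` and `gradient_descent` are chosen here for illustration.

```python
import numpy as np

def compute_cost(X, y, theta):
    """J(theta) = 1/(2m) * sum((X theta - y)^2), with 1-D y and theta."""
    residual = X @ theta - y
    return residual @ residual / (2 * len(y))

def gradient_descent(X, y, theta, alpha, num_iters):
    """Batch gradient descent with a simultaneous update of all theta_j."""
    m = len(y)
    theta = theta.copy()
    cost_history = []
    for _ in range(num_iters):
        # Component j of the gradient is (1/m) * sum_i (h(x_i) - y_i) * x_ij.
        gradient = X.T @ (X @ theta - y) / m
        theta = theta - alpha * gradient
        cost_history.append(compute_cost(X, y, theta))
    return theta, cost_history

# Hypothetical toy data generated from y = 1 + 2 * x_1 (x_0 = 1 is the first column).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta, cost_history = gradient_descent(X, y, np.zeros(2), alpha=0.1, num_iters=2000)
```

Because the whole gradient is computed from the old `theta` before any component is overwritten, the update is simultaneous, as the algorithm requires.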

4.3 Gradient descent in practice 1: Feature scaling

Reference video : 4 - 3 - Gradient Descent in Practice I - Feature Scaling (9 min).mkv

When dealing with multiple features, we should make sure the features are on similar scales. This helps the gradient descent algorithm converge faster.

Take house prices as an example. Suppose we use two features, the size of the house and the number of rooms, where the size ranges from 0 to 2000 square feet and the number of rooms from 0 to 5. If we plot the contours of the cost function with the two parameters on the horizontal and vertical axes, the contours are very elongated ellipses, and gradient descent needs many iterations to converge.

[Figure: image unavailable in this copy]

The solution is to scale all features to roughly the range -1 to 1, as shown:

[Figure: image unavailable in this copy]

The simplest way is to set ${x_{n}}=\frac{ {x_{n}}-{\mu_{n}}}{ {s_{n}}}$, where ${\mu_{n}}$ is the mean and ${s_{n}}$ is the standard deviation.
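The scaling formula above can be sketched as follows. The data values and the function name `scale_features` are hypothetical, and the sketch assumes every feature column has a nonzero standard deviation. Note that `np.std` computes the population standard deviation by default.

```python
import numpy as np

def scale_features(features):
    """Return (features - mean) / std per column, plus the mean and std used."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    return (features - mu) / sigma, mu, sigma

# Hypothetical data: house size (square feet) and number of rooms.
features = np.array([
    [2000.0, 5.0],
    [1200.0, 3.0],
    [800.0, 2.0],
    [1600.0, 4.0],
])
scaled, mu, sigma = scale_features(features)
```

The returned `mu` and `sigma` should be kept, since any new example has to be scaled with the same values before it is passed to the trained model.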

4.4 Gradient descent in practice 2: Learning rate

Reference video : 4 - 4 - Gradient Descent in Practice II - Learning Rate (9 min).mkv

The number of iterations gradient descent needs to converge varies from model to model and cannot be predicted in advance. We can plot the cost function against the number of iterations to see when the algorithm is converging.

[Figure: image unavailable in this copy]

There are also ways to test for convergence automatically, for example by comparing the change in the cost function between iterations with a threshold (such as 0.001), but it is usually better to look at a plot like the one above.

Each iteration of gradient descent is affected by the learning rate. If the learning rate $\alpha$ is too small, the number of iterations needed to converge will be very high. If the learning rate $\alpha$ is too large, an iteration may fail to reduce the cost function and may overshoot the minimum, so that the algorithm does not converge.

It is usually worth trying learning rates such as:

$\alpha=0.01,0.03,0.1,0.3,1,3,10$
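One way to compare candidate learning rates is to record the cost per iteration for each of them. The sketch below does this on hypothetical toy data, starting from $\theta=0$; the function name `cost_history_for` and the iteration count are assumptions made for illustration.

```python
import numpy as np

def cost_history_for(X, y, alpha, num_iters=50):
    """Run batch gradient descent from theta = 0 and record J(theta) per iteration."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(num_iters):
        theta = theta - alpha * (X.T @ (X @ theta - y)) / m
        residual = X @ theta - y
        history.append(residual @ residual / (2 * m))
    return history

# Hypothetical toy data generated from y = 1 + 2 * x_1.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

for alpha in (0.01, 0.03, 0.1, 0.3, 1.0):
    history = cost_history_for(X, y, alpha)
    decreasing = all(b <= a for a, b in zip(history, history[1:]))
    print(f"alpha={alpha}: final cost {history[-1]:.3g}, always decreasing: {decreasing}")
```

On this particular data set the small rates decrease the cost slowly, the moderate ones decrease it quickly, and the largest one makes the cost grow. Which rates work depends on the data and on feature scaling, so the thresholds here should not be read as general.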

4.5 Features and polynomial regression

Reference video : 4 - 5 - Features and Polynomial Regression (8 min).mkv

Consider house price prediction:

[Figure: image unavailable in this copy]

$h_{\theta}\left( x \right)={\theta_{0}}+{\theta_{1}}\times\text{frontage}+{\theta_{2}}\times\text{depth}$

${x_{1}}=\text{frontage}$ (the width of the lot along the street), ${x_{2}}=\text{depth}$ (the depth of the lot), and $x=\text{frontage}\times\text{depth}=\text{area}$, so that ${h_{\theta}}\left( x \right)={\theta_{0}}+{\theta_{1}}x$.
A straight line does not fit all data. Sometimes we need a curve, for example a quadratic model: $h_{\theta}\left( x \right)={\theta_{0}}+{\theta_{1}}x+{\theta_{2}}{x^{2}}$
or a cubic model: $h_{\theta}\left( x \right)={\theta_{0}}+{\theta_{1}}x+{\theta_{2}}{x^{2}}+{\theta_{3}}{x^{3}}$

[Figure: image unavailable in this copy]

Usually we need to look at the data first and then decide which model to try. In addition, we can set:

${x_{1}}=x,{x_{2}}=x^{2},{x_{3}}=x^{3}$, which turns the model into a linear regression model.

Depending on the shape of the function, we can also use:

${h_{\theta}}(x)={\theta_{0}}+{\theta_{1}}(\text{size})+{\theta_{2}}{(\text{size})^{2}}$

or:

${h_{\theta}}(x)={\theta_{0}}+{\theta_{1}}(\text{size})+{\theta_{2}}\sqrt{\text{size}}$

Note: when using a polynomial regression model, feature scaling is essential before running gradient descent.
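To illustrate why scaling matters here, the sketch below builds the polynomial features $\text{size}$, $\text{size}^{2}$, $\text{size}^{3}$ and then standardizes each column. The house sizes and the helper name `polynomial_features` are hypothetical.

```python
import numpy as np

def polynomial_features(size, degree):
    """Return columns [size, size^2, ..., size^degree] for a 1-D array of sizes."""
    return np.column_stack([size ** d for d in range(1, degree + 1)])

size = np.array([500.0, 1000.0, 1500.0, 2000.0])   # hypothetical house sizes
poly = polynomial_features(size, degree=3)          # columns: size, size^2, size^3

# The raw columns differ by orders of magnitude, so scale before gradient descent.
scaled = (poly - poly.mean(axis=0)) / poly.std(axis=0)
X = np.hstack([np.ones((len(size), 1)), scaled])    # prepend x_0 = 1
```

With sizes in the thousands, the cubic column reaches values around $10^{9}$ while the linear column stays around $10^{3}$. After standardizing, all three columns have mean 0 and standard deviation 1.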

4.6 Normal equation

Reference video : 4 - 6 - Normal Equation (16 min).mkv

So far we have always used the gradient descent algorithm, but for some linear regression problems the normal equation is a better solution. For example:

[Figure: image unavailable in this copy]

The normal equation finds the parameters that minimize the cost function by solving the following equations analytically: $\frac{\partial}{\partial{\theta_{j}}}J\left( \theta \right)=0$ for every $j$.
Suppose the feature matrix of our training set is $X$ (including ${x_{0}}=1$) and the targets of the training set form the vector $y$. The normal equation then gives the parameter vector $\theta ={ {\left( {X^T}X \right)}^{-1}}{X^{T}}y$.
The superscript $T$ denotes the matrix transpose and the superscript $-1$ denotes the matrix inverse. If we set $A={X^{T}}X$, then ${ {\left( {X^T}X \right)}^{-1}}={A^{-1}}$.
The following data serves as an example:

[Figure: image unavailable in this copy]

That is:

[Figure: image unavailable in this copy]

Solving for the parameters with the normal equation:

[Figure: image unavailable in this copy]

Note: when ${X^{T}}X$ is not invertible, the inverse in the formula does not exist and the normal equation cannot be applied as written. This usually happens because the features are not independent, for example when the data contains the same size both in feet and in meters, or because the number of features is greater than the number of training examples.
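The feet-and-meters case can be checked numerically. In the sketch below the sizes are hypothetical; it relies on the fact that ${X^{T}}X$ is invertible exactly when $X$ has full column rank, and it passes an explicit tolerance to `np.linalg.matrix_rank` (the value `1e-8` is an assumption suited to this small example).

```python
import numpy as np

size_feet = np.array([10.0, 20.0, 30.0, 40.0])   # hypothetical sizes in feet
size_meters = size_feet * 0.3048                 # the same quantity in meters

# Columns: x_0 = 1, size in feet, size in meters (a multiple of the feet column).
X = np.column_stack([np.ones_like(size_feet), size_feet, size_meters])

# X^T X is invertible only if X has as many independent columns as it has columns.
rank = np.linalg.matrix_rank(X, tol=1e-8)
is_invertible = rank == X.shape[1]

# Dropping the redundant feature restores full column rank.
X_reduced = X[:, :2]
reduced_is_invertible = np.linalg.matrix_rank(X_reduced, tol=1e-8) == X_reduced.shape[1]
```

Here `X` has three columns but only two independent ones, so removing one of the two size columns is enough to make the normal equation applicable again.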

Gradient descent versus the normal equation:

| Gradient descent | Normal equation |
| --- | --- |
| Requires choosing the learning rate $\alpha$ | No learning rate needed |
| Requires many iterations | Computed in a single step |
| Works well even when the number of features $n$ is large | Requires computing ${ {\left( {X^{T}}X \right)}^{-1}}$, which is expensive when $n$ is large because matrix inversion has time complexity $O\left( {n^{3}} \right)$; generally still acceptable when $n$ is below 10000 |
| Applies to many kinds of models | Applies only to linear models; not suitable for other models such as logistic regression |

To sum up, as long as the number of features is not too large, the normal equation is a good alternative for computing the parameters $\theta$. Specifically, as long as the number of features is below 10000, I usually use the normal equation rather than gradient descent.

As we move on to more complex learning algorithms, for example classification algorithms such as logistic regression, we will see that the normal equation cannot be used for them, and we will still have to use gradient descent. Gradient descent is therefore a very useful algorithm: it works for linear regression with a large number of features, and it works for algorithms covered later in the course for which the normal equation is unsuitable or unavailable. For this particular linear regression model, however, the normal equation is a faster alternative to gradient descent. Depending on the specific problem and the number of features, both algorithms are worth learning.

Python implementation of the normal equation:

import numpy as np

def normalEqn(X, y):
    # Closed-form least-squares solution: theta = (X^T X)^{-1} X^T y
    theta = np.linalg.inv(X.T @ X) @ X.T @ y  # X.T @ X is equivalent to X.T.dot(X)
    return theta
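As a usage sketch, the normal equation can be checked on data whose true parameters are known. The variant below is not the function above: it swaps the explicit inverse for `np.linalg.solve`, which solves $\left( {X^{T}}X \right)\theta={X^{T}}y$ directly. The toy data and the name `normal_equation` are hypothetical.

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y for theta without forming the inverse explicitly."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical toy data generated from y = 1 + 2 * x_1 (x_0 = 1 is the first column).
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = normal_equation(X, y)
```

For this data ${X^{T}}X=\begin{bmatrix} 4 & 6\\ 6 & 14 \end{bmatrix}$ and ${X^{T}}y=\begin{bmatrix} 16\\ 34 \end{bmatrix}$, so the solution is $\theta_{0}=1$, $\theta_{1}=2$, matching the line the data was generated from.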

4.7 Normal equation and non-invertibility (optional)

Reference video : 4 - 7 - Normal Equation Noninvertibility (Optional) (6 min).mkv

This video discusses the normal equation and non-invertibility.
This is a more advanced concept, and people often ask me about it, so I want to discuss it here. Because it is more advanced, treat it as optional material. You may want to explore it further and may find that understanding it is very useful, but even if you do not follow the relationship between the normal equation and linear regression, that is fine.

The formula in question is: $\theta ={ {\left( {X^{T}}X \right)}^{-1}}{X^{T}}y$

Note: the derivation is written out at the end of this section.

Supplementary content:

Derivation of $\theta ={ {\left( {X^{T}}X \right)}^{-1}}{X^{T}}y$:

$J\left( \theta \right)=\frac{1}{2m}\sum\limits_{i=1}^{m}{ { {\left( {h_{\theta}}\left( {x^{(i)}} \right)-{y^{(i)}} \right)}^{2}}}$
where: ${h_{\theta}}\left( x \right)={\theta^{T}}x={\theta_{0}}{x_{0}}+{\theta_{1}}{x_{1}}+{\theta_{2}}{x_{2}}+...+{\theta_{n}}{x_{n}}$

Converting the sum to matrix form gives $J(\theta )=\frac{1}{2}{ {\left( X\theta -y\right)}^{T}}\left( X\theta -y \right)$ (the constant factor $\frac{1}{m}$ is dropped because it does not change the minimizer), where $X$ is an $m \times (n+1)$ matrix ($m$ is the number of training examples and $n$ the number of features, plus the column ${x_{0}}=1$), $\theta$ is an $(n+1) \times 1$ matrix, and $y$ is an $m \times 1$ matrix. We transform $J(\theta )$ as follows: