
1.21 study gradient descent and normal equation

2022-06-26 08:47:00 Thick Cub with thorns

1.20 Multivariate linear regression


4. Linear Regression with Multiple Variables

4.1 Multiple Features

Reference video : 4 - 1 - Multiple Features (8 min).mkv

So far we have looked at regression models with a single variable (feature). Now we add more features to the house-price model, such as the number of rooms and the number of floors, to build a model with multiple variables. The features of the model are $\left( x_{1}, x_{2}, \ldots, x_{n} \right)$.

[Figure not available: table of training examples for the house-price model]

With more features, we introduce some new notation:

$n$ denotes the number of features.

$x^{(i)}$ denotes the $i$-th training example, that is, the $i$-th row of the feature matrix. It is a vector.

For example, in the training set above,

$$x^{(2)} = \begin{bmatrix} 1416 \\ 3 \\ 2 \\ 40 \end{bmatrix},$$

$x_{j}^{(i)}$ denotes the $j$-th feature in the $i$-th row of the feature matrix, that is, the $j$-th feature of the $i$-th training example.

In the example above, $x_{2}^{(2)} = 3$ and $x_{3}^{(2)} = 2$.

The multivariable hypothesis $h$ is written as: $h_{\theta}(x) = \theta_{0} + \theta_{1} x_{1} + \theta_{2} x_{2} + \cdots + \theta_{n} x_{n}$.

This formula has $n+1$ parameters and $n$ variables. To simplify it, we introduce $x_{0} = 1$, so that the formula becomes: $h_{\theta}(x) = \theta_{0} x_{0} + \theta_{1} x_{1} + \theta_{2} x_{2} + \cdots + \theta_{n} x_{n}$

The parameters of the model now form an $(n+1)$-dimensional vector, every training example is also an $(n+1)$-dimensional vector, and the feature matrix $X$ has dimensions $m \times (n+1)$. The formula therefore simplifies to $h_{\theta}(x) = \theta^{T} x$ for a single example $x$ (or $X\theta$ for all $m$ examples at once), where the superscript $T$ denotes the matrix transpose.
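The vectorized form of the hypothesis can be sketched in NumPy as follows. The data values and the parameter vector are hypothetical and serve only to show the shapes involved. The term-by-term sum is included to confirm that it agrees with the matrix product.

```python
import numpy as np

# Hypothetical training set: m = 3 examples, n = 2 features (size, rooms).
X_raw = np.array([[2104.0, 5.0],
                  [1416.0, 3.0],
                  [1534.0, 3.0]])
m, n = X_raw.shape

# Prepend the constant feature x_0 = 1 so that X has shape (m, n + 1).
X = np.column_stack([np.ones(m), X_raw])

# Hypothetical parameter vector (theta_0, theta_1, theta_2).
theta = np.array([1.0, 0.5, 2.0])

# Vectorized hypothesis for all examples at once: h = X @ theta.
predictions = X @ theta

# The same values computed term by term, for comparison.
loop_predictions = np.array([
    sum(theta[j] * X[i, j] for j in range(n + 1)) for i in range(m)
])
```

Each entry of `predictions` is $\theta^{T} x^{(i)}$ for one row of `X`.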

4.2 Gradient Descent for Multiple Variables

Reference video : 4 - 2 - Gradient Descent for Multiple Variables (5 min).mkv

As in univariate linear regression, we construct a cost function for multivariate linear regression. It is the sum of the squared modeling errors, scaled by $\frac{1}{2m}$:

$$J\left( \theta_{0}, \theta_{1}, \ldots, \theta_{n} \right) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_{\theta}\left( x^{(i)} \right) - y^{(i)} \right)^{2},$$

where $h_{\theta}(x) = \theta^{T} x = \theta_{0} + \theta_{1} x_{1} + \theta_{2} x_{2} + \cdots + \theta_{n} x_{n}$.

As in the univariate case, our goal is to find the parameters that minimize the cost function.
The batch gradient descent algorithm for multivariate linear regression is:

$$\text{Repeat } \left\{\ \theta_{j} := \theta_{j} - \alpha \frac{\partial}{\partial \theta_{j}} J\left( \theta_{0}, \theta_{1}, \ldots, \theta_{n} \right)\ \right\}$$

that is:

$$\text{Repeat } \left\{\ \theta_{j} := \theta_{j} - \alpha \frac{\partial}{\partial \theta_{j}} \frac{1}{2m} \sum_{i=1}^{m} \left( h_{\theta}\left( x^{(i)} \right) - y^{(i)} \right)^{2}\ \right\}$$

Taking the derivative gives:

$$\text{Repeat } \left\{\ \theta_{j} := \theta_{j} - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_{\theta}\left( x^{(i)} \right) - y^{(i)} \right) x_{j}^{(i)}\ \right\} \quad \text{(update all } \theta_{j},\ j = 0, 1, \ldots, n, \text{ simultaneously)}$$

For $n \ge 1$, the updates are:

$$\theta_{0} := \theta_{0} - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_{\theta}\left( x^{(i)} \right) - y^{(i)} \right) x_{0}^{(i)}$$

$$\theta_{1} := \theta_{1} - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_{\theta}\left( x^{(i)} \right) - y^{(i)} \right) x_{1}^{(i)}$$

$$\theta_{2} := \theta_{2} - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_{\theta}\left( x^{(i)} \right) - y^{(i)} \right) x_{2}^{(i)}$$

We start from a randomly chosen set of parameter values, compute all the predictions, then give every parameter its new value at the same time, and repeat until convergence.
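The update rule above can be sketched in vectorized form. This is a minimal illustration, not a reference implementation: the function name `gradient_descent`, the toy data, and the choice to start from zeros instead of random values are all assumptions made for the example.

```python
import numpy as np

def gradient_descent(X, y, theta, alpha, num_iters):
    """Batch gradient descent for linear regression.

    X: (m, n+1) feature matrix whose first column is x_0 = 1
    y: (m,) target vector
    theta: (n+1,) initial parameters
    """
    m = len(y)
    theta = theta.copy()
    cost_history = []
    for _ in range(num_iters):
        error = X @ theta - y               # h_theta(x^(i)) - y^(i) for every i
        gradient = (X.T @ error) / m        # one partial derivative per theta_j
        theta = theta - alpha * gradient    # all theta_j updated simultaneously
        cost_history.append(np.sum((X @ theta - y) ** 2) / (2 * m))
    return theta, cost_history

# Toy data generated from y = 1 + 2x (hypothetical).
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

theta, cost_history = gradient_descent(X, y, np.zeros(2), alpha=0.1, num_iters=2000)
```

Computing `gradient` from the old `theta` before assigning the new one is what makes the update simultaneous for every $\theta_{j}$.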

Code example:

Computing the cost function:
$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_{\theta}\left( x^{(i)} \right) - y^{(i)} \right)^{2}$$
where $h_{\theta}(x) = \theta^{T} x = \theta_{0} x_{0} + \theta_{1} x_{1} + \theta_{2} x_{2} + \cdots + \theta_{n} x_{n}$

Python code:

import numpy as np

def computeCost(X, y, theta):
    # X: (m, n+1) features, y: (m, 1) targets, theta: (1, n+1) row vector
    inner = np.power((X @ theta.T) - y, 2)  # squared error of each example
    return np.sum(inner) / (2 * len(X))

4.3 Gradient Descent in Practice I: Feature Scaling

Reference video : 4 - 3 - Gradient Descent in Practice I - Feature Scaling (9 min).mkv

When a problem has many features, we want them to be on similar scales, which helps gradient descent converge faster.

Take house prices as an example. Suppose we use two features, the size of the house and the number of rooms, where the size ranges from 0 to 2000 square feet and the number of rooms from 0 to 5. If we draw the contours of the cost function with the two parameters on the horizontal and vertical axes, the contours are very elongated, and gradient descent needs many iterations to converge.

The remedy is to scale every feature to roughly the range $-1$ to $1$.

The simplest way is to set $x_{n} = \frac{x_{n} - \mu_{n}}{s_{n}}$, where $\mu_{n}$ is the mean of the feature and $s_{n}$ is its standard deviation.
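The scaling formula can be sketched as below. The function name `standardize` and the data are hypothetical. The sketch also assumes that no feature column is constant, since a zero standard deviation would cause a division by zero.

```python
import numpy as np

def standardize(X_raw):
    """Scale each column to zero mean and unit standard deviation."""
    mu = X_raw.mean(axis=0)      # per-feature mean
    sigma = X_raw.std(axis=0)    # per-feature standard deviation
    return (X_raw - mu) / sigma, mu, sigma

# Hypothetical data: house size in square feet and number of rooms.
X_raw = np.array([[2104.0, 5.0],
                  [1416.0, 3.0],
                  [1534.0, 3.0],
                  [852.0, 2.0]])

X_scaled, mu, sigma = standardize(X_raw)
```

Returning `mu` and `sigma` lets the same transformation be applied to new examples before prediction.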

4.4 Gradient Descent in Practice II: Learning Rate

Reference video : 4 - 4 - Gradient Descent in Practice II - Learning Rate (9 min).mkv

The number of iterations gradient descent needs to converge varies from model to model and cannot be predicted in advance. We can plot the cost function against the number of iterations to see when the algorithm is converging.

There are also ways to test for convergence automatically, for example by comparing the change in the cost function between iterations with a threshold (such as 0.001), but inspecting a plot of the cost against the number of iterations is usually more reliable.
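An automatic convergence test of this kind might look like the sketch below. The function name, the toy data, and the iteration cap are assumptions. The threshold 0.001 is the example value mentioned above.

```python
import numpy as np

def descend_until_converged(X, y, alpha, tol=1e-3, max_iters=10000):
    """Run gradient descent until the cost drops by less than tol in one step."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    prev_cost = np.sum((X @ theta - y) ** 2) / (2 * m)
    for iteration in range(1, max_iters + 1):
        theta = theta - alpha * (X.T @ (X @ theta - y)) / m
        cost = np.sum((X @ theta - y) ** 2) / (2 * m)
        # Stop when the decrease is below the threshold. Note that this
        # also stops if the cost went up, which signals alpha is too large.
        if prev_cost - cost < tol:
            return theta, cost, iteration
        prev_cost = cost
    return theta, prev_cost, max_iters

# Hypothetical, already-scaled data generated from y = 3 + 2x.
x = np.array([-1.5, -0.5, 0.5, 1.5])
X = np.column_stack([np.ones_like(x), x])
y = 3.0 + 2.0 * x

theta, final_cost, iterations = descend_until_converged(X, y, alpha=0.1)
```

A fixed threshold is hard to choose in general, which is one reason a plot is often preferred.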

Each iteration of gradient descent is affected by the learning rate. If the learning rate $\alpha$ is too small, the number of iterations needed to converge will be very large. If $\alpha$ is too large, an iteration may fail to decrease the cost function, and the algorithm may overshoot the minimum and fail to converge.

Learning rates worth trying are typically:

$\alpha = 0.01, 0.03, 0.1, 0.3, 1, 3, 10$
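To illustrate how the learning rate affects convergence, the sketch below runs the same number of iterations for each candidate value and reports whether the cost went down. The helper `cost_curve` and the toy data are hypothetical. In practice the cost curves would be plotted instead of printed.

```python
import numpy as np

def cost_curve(X, y, alpha, num_iters=50):
    """Run batch gradient descent from theta = 0 and return the cost after each step."""
    m = len(y)
    theta = np.zeros(X.shape[1])
    costs = []
    for _ in range(num_iters):
        theta = theta - alpha * (X.T @ (X @ theta - y)) / m
        costs.append(np.sum((X @ theta - y) ** 2) / (2 * m))
    return costs

# Hypothetical, already-scaled data generated from y = 3 + 2x.
x = np.array([-1.5, -0.5, 0.5, 1.5])
X = np.column_stack([np.ones_like(x), x])
y = 3.0 + 2.0 * x

curves = {alpha: cost_curve(X, y, alpha)
          for alpha in [0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0]}

for alpha, costs in curves.items():
    trend = "decreased" if costs[-1] < costs[0] else "did not decrease"
    print(f"alpha = {alpha}: cost {trend} (final cost {costs[-1]:.3g})")
```

On this toy problem the smaller rates reduce the cost, while the two largest rates make it grow at every step.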

4.5 Features and Polynomial Regression

Reference video : 4 - 5 - Features and Polynomial Regression (8 min).mkv

Take house-price prediction as an example:

$$h_{\theta}(x) = \theta_{0} + \theta_{1} \times \text{frontage} + \theta_{2} \times \text{depth}$$

Here $x_{1} = \text{frontage}$ (the width of the lot along the street) and $x_{2} = \text{depth}$ (how far back the lot extends). Defining a single feature $x = \text{frontage} \times \text{depth} = \text{area}$ gives $h_{\theta}(x) = \theta_{0} + \theta_{1} x$.
A straight line does not fit every data set. Sometimes we need a curve, for example a quadratic model: $h_{\theta}(x) = \theta_{0} + \theta_{1} x + \theta_{2} x^{2}$
or a cubic model: $h_{\theta}(x) = \theta_{0} + \theta_{1} x + \theta_{2} x^{2} + \theta_{3} x^{3}$

We usually need to look at the data first and then decide which model to try. In addition, we can set:

$x_{1} = x,\ x_{2} = x^{2},\ x_{3} = x^{3}$, which turns the model back into a linear regression model.

Depending on the shape of the data, we could also use:

$$h_{\theta}(x) = \theta_{0} + \theta_{1} (\text{size}) + \theta_{2} (\text{size})^{2}$$

or:

$$h_{\theta}(x) = \theta_{0} + \theta_{1} (\text{size}) + \theta_{2} \sqrt{\text{size}}$$

Note: when using a polynomial regression model, feature scaling is essential before running gradient descent, because the powers of a feature span very different ranges.
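The sketch below builds polynomial features from a single hypothetical `size` feature and scales them, which shows why scaling matters here: the raw cubic column is several orders of magnitude larger than the linear one.

```python
import numpy as np

# Hypothetical house sizes in square feet.
size = np.array([500.0, 1000.0, 1500.0, 2000.0])

# Polynomial features x_1 = size, x_2 = size^2, x_3 = size^3.
poly = np.column_stack([size, size ** 2, size ** 3])

# Scale every column to zero mean and unit standard deviation.
mu = poly.mean(axis=0)
sigma = poly.std(axis=0)
poly_scaled = (poly - mu) / sigma

# Add x_0 = 1; X can now be used by an ordinary linear regression routine.
X = np.column_stack([np.ones(len(size)), poly_scaled])

print("largest raw value per column:", poly.max(axis=0))
print("largest scaled value per column:", poly_scaled.max(axis=0))
```

After this transformation the model is linear in the new features, so the gradient descent and normal equation methods apply unchanged.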

4.6 Normal Equation

Reference video : 4 - 6 - Normal Equation (16 min).mkv

So far we have always used gradient descent, but for some linear regression problems the normal equation is a better solution.

The normal equation finds the parameters that minimize the cost function by solving $\frac{\partial}{\partial \theta_{j}} J(\theta) = 0$ for every $j$.
Suppose the feature matrix of the training set is $X$ (including the column $x_{0} = 1$) and the targets of the training set form the vector $y$. The normal equation then gives $\theta = \left( X^{T} X \right)^{-1} X^{T} y$.
The superscript $T$ denotes the matrix transpose and the superscript $-1$ denotes the matrix inverse. Writing $A = X^{T} X$, we have $\left( X^{T} X \right)^{-1} = A^{-1}$.
[Figures not available: a worked example showing a training set, the corresponding matrix $X$ and vector $y$, and the parameters $\theta$ obtained from the normal equation]

Note: when $X^{T} X$ is not invertible (usually because the features are not independent, for example when the data contains the same size in both feet and meters, or because the number of features exceeds the number of training examples), the normal equation cannot be used as written.
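The feet-and-meters case can be reproduced with the sketch below. The data are hypothetical. Using the Moore-Penrose pseudo-inverse (`np.linalg.pinv`) is a workaround introduced here for illustration and is not part of the notes above.

```python
import numpy as np

# Hypothetical data: the same size recorded in feet and in meters.
size_ft = np.array([10.0, 20.0, 30.0, 40.0])
size_m = 0.3048 * size_ft
X = np.column_stack([np.ones(4), size_ft, size_m])
y = 5.0 + size_ft

# X has 3 columns but only 2 independent ones, so X.T @ X (which has
# the same rank as X) is singular and has no ordinary inverse.
rank = np.linalg.matrix_rank(X)

# The pseudo-inverse still yields a least-squares solution.
theta = np.linalg.pinv(X) @ y
predictions = X @ theta
```

Here `theta` is only one of infinitely many parameter vectors that give the same predictions. Removing the redundant feature is the more direct fix.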

Gradient descent versus the normal equation:

| Gradient descent | Normal equation |
| --- | --- |
| Requires choosing a learning rate $\alpha$ | No learning rate needed |
| Requires many iterations | Computed in a single step |
| Works well even when the number of features $n$ is large | Requires computing $\left( X^{T} X \right)^{-1}$, which is expensive when $n$ is large because matrix inversion costs $O\left( n^{3} \right)$; generally still acceptable when $n$ is below 10000 |
| Applies to many kinds of models | Applies only to linear models, not to others such as logistic regression |

In summary, as long as the number of features is not too large, the normal equation is a good alternative for computing the parameters $\theta$. Specifically, when the number of features is below 10000, I usually use the normal equation rather than gradient descent.

As we move on to more complex learning algorithms, for example classification algorithms such as logistic regression, we will see that the normal equation cannot be used for them, and we will still have to use gradient descent. Gradient descent is therefore a very useful algorithm, and it also works for linear regression problems with a large number of features. Later in the course we will meet other algorithms for which the normal equation is unsuitable or unusable. For this particular linear regression model, however, the normal equation is a faster alternative to gradient descent. Depending on the specific problem and the number of features, both algorithms are worth learning.

Python implementation of the normal equation:

import numpy as np

def normalEqn(X, y):
    # X: (m, n+1) feature matrix including the x_0 = 1 column; y: targets
    theta = np.linalg.inv(X.T @ X) @ X.T @ y  # X.T @ X is equivalent to X.T.dot(X)
    return theta
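A short usage sketch follows. The function is redefined here under the hypothetical name `normal_eqn` so that the block runs on its own, and the toy data are made up. The comparison with `np.linalg.solve`, which solves $X^{T} X \theta = X^{T} y$ without forming the inverse explicitly, is an addition for illustration.

```python
import numpy as np

def normal_eqn(X, y):
    """Closed-form least-squares solution via the normal equation."""
    return np.linalg.inv(X.T @ X) @ X.T @ y

# Toy data generated from y = 1 + 2x (hypothetical).
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])   # first column is x_0 = 1
y = 1.0 + 2.0 * x

theta = normal_eqn(X, y)

# Same system solved without an explicit inverse.
theta_solve = np.linalg.solve(X.T @ X, X.T @ y)
```

Both results agree on this example. Solving the linear system directly is generally the numerically safer choice.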

4.7 Normal Equation and Non-invertibility (Optional)

Reference video : 4 - 7 - Normal Equation Noninvertibility (Optional) (6 min).mkv

This video discusses the normal equation and non-invertibility.
This is a more advanced concept, and people often ask me about it, so I want to discuss it here. Because it is more advanced, treat this as optional material. You may want to explore it further, and you may find it useful to understand. But even if you do not fully understand how the normal equation relates to linear regression, that is fine.

The formula in question is: $\theta = \left( X^{T} X \right)^{-1} X^{T} y$

Remark: the derivation is written out at the end of this section.

Supplementary material:

Derivation of $\theta = \left( X^{T} X \right)^{-1} X^{T} y$:

$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_{\theta}\left( x^{(i)} \right) - y^{(i)} \right)^{2}$$
where $h_{\theta}(x) = \theta^{T} x = \theta_{0} x_{0} + \theta_{1} x_{1} + \theta_{2} x_{2} + \cdots + \theta_{n} x_{n}$

Converting the summation into matrix form, and dropping the constant factor $\frac{1}{m}$, which does not change the minimizer, gives $J(\theta) = \frac{1}{2} \left\| X\theta - y \right\|^{2}$, where $X$ is a matrix with $m$ rows and $n$ columns ($m$ is the number of samples and $n$ the number of features), $\theta$ is a matrix with $n$ rows and 1 column, and $y$ is a matrix with $m$ rows and 1 column. We transform $J(\theta)$ as follows:

$$J(\theta) = \frac{1}{2} \left( X\theta - y \right)^{T} \left( X\theta - y \right)$$

$$= \frac{1}{2} \left( \theta^{T} X^{T} - y^{T} \right) \left( X\theta - y \right)$$

$$= \frac{1}{2} \left( \theta^{T} X^{T} X \theta - \theta^{T} X^{T} y - y^{T} X \theta + y^{T} y \right)$$

Next we take the derivative of $J(\theta)$ with respect to $\theta$, using the following matrix differentiation rules:

$$\frac{d(AB)}{dB} = A^{T}$$

$$\frac{d\left( X^{T} A X \right)}{dX} = 2AX \quad \text{(for symmetric } A \text{)}$$

So we have:

$$\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{2} \left( 2 X^{T} X \theta - X^{T} y - \left( y^{T} X \right)^{T} + 0 \right)$$

$$= \frac{1}{2} \left( 2 X^{T} X \theta - X^{T} y - X^{T} y + 0 \right)$$

$$= X^{T} X \theta - X^{T} y$$

Setting $\frac{\partial J(\theta)}{\partial \theta} = 0$ gives $X^{T} X \theta = X^{T} y$,

and therefore $\theta = \left( X^{T} X \right)^{-1} X^{T} y$
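The derived gradient can be checked numerically. The sketch below compares $X^{T} X \theta - X^{T} y$ with central finite differences of $J(\theta) = \frac{1}{2} (X\theta - y)^{T} (X\theta - y)$ on random data. The data, the seed, and the step size `eps` are arbitrary choices made for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(5), rng.normal(size=(5, 2))])
y = rng.normal(size=5)
theta = rng.normal(size=3)

def J(t):
    """Cost in matrix form: 0.5 * (X t - y)^T (X t - y)."""
    r = X @ t - y
    return 0.5 * (r @ r)

# Gradient from the derivation.
analytic = X.T @ X @ theta - X.T @ y

# Central finite differences along each coordinate direction.
eps = 1e-6
numeric = np.array([
    (J(theta + eps * e) - J(theta - eps * e)) / (2 * eps) for e in np.eye(3)
])

# At the normal-equation solution the gradient should vanish.
theta_star = np.linalg.solve(X.T @ X, X.T @ y)
grad_at_star = X.T @ X @ theta_star - X.T @ y
```

Agreement between `analytic` and `numeric`, together with a near-zero `grad_at_star`, supports both the gradient formula and the closed-form solution.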

Original site

Copyright notice
This article was written by [Thick Cub with thorns]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/02/202202170554155201.html