[Machine Learning] Parameter Learning and Gradient Descent
2022-06-25 12:57:00 【Coconut brine Engineer】
Gradient descent
For our hypothesis function, we have a way of measuring how well it fits the data. Now we need to estimate the parameters in the hypothesis function. This is where gradient descent comes in.
Imagine that we graph our hypothesis function as a function of its parameters θ0 and θ1 (strictly speaking, we are graphing the cost function as a function of the parameter estimates). We are not graphing x and y themselves; instead we graph the parameter range of our hypothesis function and the cost that results from choosing a particular set of parameters.
We put θ0 on the x axis and θ1 on the y axis, with the cost function on the vertical z axis. The points on the surface are the values of the cost function for our hypothesis with those particular θ parameters. The figure below depicts such a setup:
We will know that we have succeeded when our cost function sits at the very bottom of the pit in the figure, that is, when its value is at a minimum. The red arrow marks the minimum point in the figure.
The way we get there is by taking the derivative of the cost function (the slope of the tangent line at a point). The tangent's slope is the derivative at that point, and it gives us a direction to move in. We step down the cost function in the direction of steepest descent, and the size of each step is determined by the parameter α, called the learning rate.
The gradient descent algorithm is defined as follows. Repeat until convergence, updating each parameter θj:

θj := θj − α · ∂J(θ0, θ1)/∂θj        (simultaneously for j = 0 and j = 1)

where:
- := denotes assignment; it is the assignment operator
- α is the learning rate; it controls how big a step we take downhill
- ∂J(θ0, θ1)/∂θj is the partial derivative term, the slope of the cost function along θj
A subtlety in implementing gradient descent is that applying this update requires updating θ0 and θ1 simultaneously. The figure below describes the process in more detail:
By contrast, the following is an incorrect implementation, because it does not update the parameters synchronously:
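To make the difference concrete, here is a minimal Python sketch of both versions (the function names and the caller-supplied derivative functions are illustrative assumptions, not part of the original material):

```python
def simultaneous_update(theta0, theta1, dJ_dtheta0, dJ_dtheta1, alpha):
    """One correct gradient descent step: both partial derivatives are
    evaluated at the old (theta0, theta1) before either parameter changes."""
    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    return temp0, temp1  # assign both only after both are computed


def broken_update(theta0, theta1, dJ_dtheta0, dJ_dtheta1, alpha):
    """The incorrect version: theta0 is overwritten first, so the second
    derivative is evaluated at a mixed, partially updated point."""
    theta0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    theta1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)  # uses the NEW theta0
    return theta0, theta1
```

Here dJ_dtheta0 and dJ_dtheta1 stand for whatever functions compute the two partial derivatives of J; the only point being illustrated is the order of assignment.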
Gradient descent intuition
When the cost function has only one parameter — say a cost function J with a single parameter θ1 — the derivative term becomes an ordinary derivative, and the update is:

θ1 := θ1 − α · dJ(θ1)/dθ1
We initialize θ1 to some value and run gradient descent from that point, as shown in the figure:
The figure below shows that when the slope is negative, θ1 increases (we subtract α times a negative number), and when the slope is positive, θ1 decreases:
On a related note, we should tune the learning rate α so that the gradient descent algorithm converges in a reasonable time. If α is too large the algorithm may overshoot and fail to converge; if α is too small, reaching the minimum takes too long. Either way the step size is wrong, as shown below:

Notice that the steps shrink on their own. As we approach a local minimum, the derivative approaches 0 (and equals 0 exactly at the minimum), so gradient descent automatically takes smaller and smaller steps. That is simply how gradient descent behaves, so there is really no need to decrease α over time. This also explains why gradient descent can reach a local optimum even with the learning rate α held fixed.
The following figure illustrates this well. Suppose we have a cost function J(θ1) that we want to minimize, and we initialize the algorithm at some point on its curve. Each gradient descent step takes us to a new point where the curve is less steep, because the closer we get to the minimum, the closer the derivative is to 0. As we approach the optimum, every new derivative is smaller, so each successive step is naturally a little smaller, until we settle at the minimum.
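A tiny numeric sketch of this behavior (the cost J(θ1) = (θ1 − 3)² below is an illustrative assumption, chosen so the minimum sits at θ1 = 3):

```python
# Gradient descent on the illustrative cost J(theta1) = (theta1 - 3)**2,
# whose derivative is dJ/dtheta1 = 2 * (theta1 - 3).
theta1 = 0.0  # initial guess
alpha = 0.1   # fixed learning rate

for step in range(10):
    grad = 2 * (theta1 - 3)
    theta1 -= alpha * grad
    print(f"step {step}: theta1 = {theta1:.4f}, step size = {abs(alpha * grad):.4f}")

# The printed step sizes shrink toward 0 even though alpha never changes,
# because the derivative itself goes to 0 near the minimum at theta1 = 3.
```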
This is the gradient descent algorithm; you can use it to minimize any cost function J.
Gradient descent for linear regression

We now have the three formulas mentioned so far: the gradient descent algorithm, the linear regression model with its linear hypothesis, and the squared error cost function:

- Gradient descent: repeat until convergence { θj := θj − α · ∂J(θ0, θ1)/∂θj, for j = 0, 1 }
- Hypothesis: hθ(x) = θ0 + θ1·x
- Cost function: J(θ0, θ1) = (1/(2m)) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))²
Next, we will apply gradient descent to minimize the squared error cost function. To do that, we need to work out what the partial derivative terms in the update rule are. Substituting the definitions of J and hθ and differentiating gives:

∂J(θ0, θ1)/∂θ0 = (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))
∂J(θ0, θ1)/∂θ1 = (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) · x^(i)
These derivative terms are the slopes of the cost function J. Plugging them back into the gradient descent method gives gradient descent specialized to linear regression. Repeat the updates until convergence, applying both simultaneously:

θ0 := θ0 − α · (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))
θ1 := θ1 − α · (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) · x^(i)
In fact, this can be simplified further by substituting hθ(x^(i)) = θ0 + θ1·x^(i) directly into the sums.
So the update is just a step along the gradient of the original cost function J. Because each step looks at every example in the entire training set, this method is called batch gradient descent: at every iteration we sum over all m training examples, i.e., we process the whole batch at once.
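Putting the pieces together, here is a minimal batch gradient descent sketch for one-variable linear regression (the synthetic data, the learning rate, and the iteration count are illustrative assumptions):

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=0.1, iterations=1000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x.
    Every iteration sums the error over all m training examples,
    which is what makes this the *batch* variant."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        errors = theta0 + theta1 * x - y  # h(x^(i)) - y^(i) for every i
        # Simultaneous update using the partial derivatives derived above.
        temp0 = theta0 - alpha * errors.sum() / m
        temp1 = theta1 - alpha * (errors * x).sum() / m
        theta0, theta1 = temp0, temp1
    return theta0, theta1

# Illustrative usage on synthetic data drawn from y = 2x + 1 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=50)
y = 2 * x + 1 + rng.normal(0, 0.2, size=50)
print(batch_gradient_descent(x, y))  # should land close to (1.0, 2.0)
```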
Note that while gradient descent in general can be trapped by local minima, the linear regression optimization problem posed here has only one global optimum and no other local optima, so gradient descent always converges to the global minimum (assuming the learning rate α is not too large). Indeed, J is a convex quadratic function. Below is an example of gradient descent run to minimize such a quadratic function.
The ellipses shown above are the contours of the quadratic function. The figure also traces the trajectory of gradient descent, initialized at (48, 30). The x's in the figure (connected by straight lines) mark the successive values of θ as gradient descent converges to the minimum.