[Machine Learning] Parameter Learning and Gradient Descent
2022-06-25 12:57:00 【Coconut brine Engineer】
Gradient descent
For our hypothesis function, we have a way of measuring how well it fits the data. Now we need to estimate the parameters in the hypothesis function. This is where gradient descent comes in.
Imagine that we graph our hypothesis function as a function of its parameters θ0 and θ1 (strictly speaking, we are graphing the cost function as a function of the parameter estimates). We are not graphing x and y themselves; instead we graph the parameter range of our hypothesis function and the cost that results from choosing a particular set of parameters.
We put θ0 on the x axis and θ1 on the y axis, with the cost function on the vertical z axis. The points on the surface are the values of the cost function for our hypothesis with those particular θ parameters. The figure below depicts such a setup:
We will know that we have succeeded when our cost function sits at the very bottom of the pit in the figure, that is, when its value is at a minimum. The red arrow marks the minimum point in the figure.
The way we get there is by taking the derivative of the cost function (the slope of the tangent line at a point). The tangent's slope is the derivative at that point, and it gives us a direction to move in. We step down the cost function in the direction of steepest descent, and the size of each step is determined by the parameter α, called the learning rate.
The gradient descent algorithm is defined as follows. Repeat until convergence, updating each parameter θj:

θj := θj − α · ∂J(θ0, θ1)/∂θj        (simultaneously for j = 0 and j = 1)

where:
- := denotes assignment; it is the assignment operator
- α is the learning rate; it controls how big a step we take downhill
- ∂J(θ0, θ1)/∂θj is the partial derivative term, the slope of the cost function along θj
A subtlety in implementing gradient descent is that applying this update requires updating θ0 and θ1 simultaneously. The figure below describes the process in more detail:
By contrast, the following is an incorrect implementation, because it does not update the parameters synchronously:
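To make the difference concrete, here is a minimal Python sketch of both versions (the function names and the caller-supplied derivative functions are illustrative assumptions, not part of the original material):

```python
def simultaneous_update(theta0, theta1, dJ_dtheta0, dJ_dtheta1, alpha):
    """One correct gradient descent step: both partial derivatives are
    evaluated at the old (theta0, theta1) before either parameter changes."""
    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    return temp0, temp1  # assign both only after both are computed


def broken_update(theta0, theta1, dJ_dtheta0, dJ_dtheta1, alpha):
    """The incorrect version: theta0 is overwritten first, so the second
    derivative is evaluated at a mixed, partially updated point."""
    theta0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    theta1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)  # uses the NEW theta0
    return theta0, theta1
```

Here dJ_dtheta0 and dJ_dtheta1 stand for whatever functions compute the two partial derivatives of J; the only point being illustrated is the order of assignment.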
Gradient descent intuition
When the cost function has only one parameter — say a cost function J with a single parameter θ1 — the derivative term becomes an ordinary derivative, and the update is:

θ1 := θ1 − α · dJ(θ1)/dθ1
We initialize θ1 to some value and run gradient descent from that point, as shown in the figure:
The figure below shows that when the slope is negative, θ1 increases (we subtract α times a negative number), and when the slope is positive, θ1 decreases:
On a related note, we should tune the learning rate α so that the gradient descent algorithm converges in a reasonable time. If α is too large the algorithm may overshoot and fail to converge; if α is too small, reaching the minimum takes too long. Either way the step size is wrong, as shown below:

Notice that the steps shrink on their own. As we approach a local minimum, the derivative approaches 0 (and equals 0 exactly at the minimum), so gradient descent automatically takes smaller and smaller steps. That is simply how gradient descent behaves, so there is really no need to decrease α over time. This also explains why gradient descent can reach a local optimum even with the learning rate α held fixed.
The following figure illustrates this well. Suppose we have a cost function J(θ1) that we want to minimize, and we initialize the algorithm at some point on its curve. Each gradient descent step takes us to a new point where the curve is less steep, because the closer we get to the minimum, the closer the derivative is to 0. As we approach the optimum, every new derivative is smaller, so each successive step is naturally a little smaller, until we settle at the minimum.
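A tiny numeric sketch of this behavior (the cost J(θ1) = (θ1 − 3)² below is an illustrative assumption, chosen so the minimum sits at θ1 = 3):

```python
# Gradient descent on the illustrative cost J(theta1) = (theta1 - 3)**2,
# whose derivative is dJ/dtheta1 = 2 * (theta1 - 3).
theta1 = 0.0  # initial guess
alpha = 0.1   # fixed learning rate

for step in range(10):
    grad = 2 * (theta1 - 3)
    theta1 -= alpha * grad
    print(f"step {step}: theta1 = {theta1:.4f}, step size = {abs(alpha * grad):.4f}")

# The printed step sizes shrink toward 0 even though alpha never changes,
# because the derivative itself goes to 0 near the minimum at theta1 = 3.
```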
This is the gradient descent algorithm; you can use it to minimize any cost function J.
Gradient descent for linear regression

We now have the three formulas mentioned so far: the gradient descent algorithm, the linear regression model with its linear hypothesis, and the squared error cost function:

- Gradient descent: repeat until convergence { θj := θj − α · ∂J(θ0, θ1)/∂θj, for j = 0, 1 }
- Hypothesis: hθ(x) = θ0 + θ1·x
- Cost function: J(θ0, θ1) = (1/(2m)) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))²
Next, we will apply gradient descent to minimize the squared error cost function. To do that, we need to work out what the partial derivative terms in the update rule are. Substituting the definitions of J and hθ and differentiating gives:

∂J(θ0, θ1)/∂θ0 = (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))
∂J(θ0, θ1)/∂θ1 = (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) · x^(i)
These derivative terms are the slopes of the cost function J. Plugging them back into the gradient descent method gives gradient descent specialized to linear regression. Repeat the updates until convergence, applying both simultaneously:

θ0 := θ0 − α · (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))
θ1 := θ1 − α · (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) · x^(i)
In fact, this can be simplified further by substituting hθ(x^(i)) = θ0 + θ1·x^(i) directly into the sums.
So the update is just a step along the gradient of the original cost function J. Because each step looks at every example in the entire training set, this method is called batch gradient descent: at every iteration we sum over all m training examples, i.e., we process the whole batch at once.
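Putting the pieces together, here is a minimal batch gradient descent sketch for one-variable linear regression (the synthetic data, the learning rate, and the iteration count are illustrative assumptions):

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=0.1, iterations=1000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x.
    Every iteration sums the error over all m training examples,
    which is what makes this the *batch* variant."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        errors = theta0 + theta1 * x - y  # h(x^(i)) - y^(i) for every i
        # Simultaneous update using the partial derivatives derived above.
        temp0 = theta0 - alpha * errors.sum() / m
        temp1 = theta1 - alpha * (errors * x).sum() / m
        theta0, theta1 = temp0, temp1
    return theta0, theta1

# Illustrative usage on synthetic data drawn from y = 2x + 1 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, size=50)
y = 2 * x + 1 + rng.normal(0, 0.2, size=50)
print(batch_gradient_descent(x, y))  # should land close to (1.0, 2.0)
```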
Note that while gradient descent in general can be trapped by local minima, the linear regression optimization problem posed here has only one global optimum and no other local optima, so gradient descent always converges to the global minimum (assuming the learning rate α is not too large). Indeed, J is a convex quadratic function. Below is an example of gradient descent run to minimize such a quadratic function.
The ellipses shown above are the contours of the quadratic function. The figure also traces the trajectory of gradient descent, initialized at (48, 30). The x's in the figure (connected by straight lines) mark the successive values of θ as gradient descent converges to the minimum.