Wu Enda (Andrew Ng)'s Machine Learning Course Notes 02: Univariate Linear Regression
2 Univariate linear regression
2.1 Model representation
We continue with the house price prediction example introduced in Chapter 1.
Notation conventions
In supervised learning there is a data set called the training set. Throughout the course, $m$ denotes the number of training examples.
$x$ denotes the input variable (feature), and $y$ denotes the output variable (the target variable to be predicted).
$(x, y)$ denotes a single training example, and $(x^{(i)}, y^{(i)})$ denotes the $i$-th training example. Note that the superscript $(i)$ is not exponentiation; it is an index into the training set.
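As a concrete illustration of the notation, here is a tiny toy training set in Python (the numbers are made up for this note and only echo the spirit of the housing example):

```python
# Toy training set: each pair is (x^(i), y^(i)),
# e.g. house size in square feet -> price in thousands of dollars.
training_set = [(2104, 460), (1416, 232), (1534, 315), (852, 178)]

m = len(training_set)        # m: number of training examples (here 4)
x1, y1 = training_set[0]     # (x^(1), y^(1)): the first training example
print(m, x1, y1)             # 4 2104 460
```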
How a supervised learning algorithm works
The training set is fed to the learning algorithm, whose job is to output a hypothesis function, usually denoted $h$.
The hypothesis function maps an input to its predicted output.
Hypothesis function
Designing a learning algorithm first requires deciding how to represent the hypothesis function $h$.
One possible representation is $h_{\theta}(x) = \theta_0 + \theta_1 x$, which means the hypothesis predicts $y$ as a linear function of $x$.
Because it contains only one feature (input variable) and the model is linear, this kind of regression problem is called linear regression with one variable, or univariate linear regression.
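A minimal sketch of this hypothesis in code (the parameter values below are arbitrary, chosen only for illustration):

```python
def h(x, theta0, theta1):
    """Univariate linear hypothesis: h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# With theta0 = 50 and theta1 = 0.1 (arbitrary values), an input of 1000
# is predicted as 50 + 0.1 * 1000 = 150.
print(h(1000, 50, 0.1))  # 150.0
```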
summary
The supervised learning workflow is: provide a training set to the learning algorithm, and the learning algorithm outputs a hypothesis function $h$.
Here we first take the hypothesis function $h$ to be a linear function.
2.2 Cost function
The cost function lets us find the straight line that best fits the data.
Model parameters
For the hypothesis function $h_{\theta}(x) = \theta_0 + \theta_1 x$, $\theta_0$ and $\theta_1$ are the model parameters.
Squared error function
Through the cost function we can find the line that fits the data best, i.e. the model parameters that minimize the modeling error.
Usually we use the squared error function
$$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_{\theta}(x^{(i)}) - y^{(i)} \right)^2$$
as the cost function. For most problems, especially regression problems, the squared error cost function is a reasonable choice, although other kinds of cost functions exist.
Our optimization goal is to minimize the cost function .
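The squared error cost function translates directly into code; a minimal sketch (variable names are my own):

```python
def compute_cost(xs, ys, theta0, theta1):
    """J(theta0, theta1) = (1 / (2m)) * sum_i (h_theta(x_i) - y_i)^2."""
    m = len(xs)
    total = sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)

xs, ys = [1, 2, 3], [1, 2, 3]
print(compute_cost(xs, ys, 0.0, 1.0))  # 0.0: the line y = x fits perfectly
print(compute_cost(xs, ys, 0.0, 0.5))  # ~0.583: a worse fit gives a larger cost
```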
summary
For the hypothesis function $h_{\theta}(x) = \theta_0 + \theta_1 x$, $\theta_0$ and $\theta_1$ are the model parameters.
Usually the squared error function is used as the cost function, which yields the model parameters that minimize the modeling error.
2.3-2.4 An intuitive understanding of the cost function
Hypothesis function versus cost function
| | Hypothesis function $h_{\theta_i}(x)$ | Cost function $J(\theta_i)$ |
|---|---|---|
| Independent variable | For a fixed $\theta$, $h$ is a function of $x$ | $J$ is a function of $\theta$ |
| Function plot | (figure omitted) | (figure omitted) |
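To make the distinction concrete, the short sketch below (toy data, with $\theta_0$ held at 0 for simplicity) keeps the data fixed and varies only $\theta_1$, so each value of $\theta_1$ gives one value of $J$:

```python
xs, ys = [1, 2, 3], [1, 2, 3]   # fixed toy data; theta0 is held at 0

def J(theta1):
    """Cost as a function of the parameter alone, once the data is fixed."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

for theta1 in [0.0, 0.5, 1.0, 1.5, 2.0]:
    print(theta1, J(theta1))    # J is smallest at theta1 = 1.0
```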
2.5-2.6 Gradient descent method
Gradient descent method
Gradient descent is an algorithm for finding the minimum of a function; it can be used to minimize the cost function.
Start by randomly choosing a combination of parameters and evaluating the cost function, then move to the parameter combination that decreases the cost function the most. Repeat until a local minimum is reached.
Because not all parameter combinations are tried, it is not guaranteed that the local minimum found is the global minimum.
Choosing different initial parameter combinations may lead to different local minima.
The mathematical principle of gradient descent method
Repeat until convergence {
$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) \quad \text{(for } j = 0 \text{ and } j = 1\text{)}$
}
Note that $:=$ denotes the assignment operator, whereas $=$ asserts that the left and right sides are equal.
$\alpha$ is the learning rate; it controls the size of the step taken in each gradient descent update.
Note that the parameters $\theta_0$ and $\theta_1$ must be updated simultaneously: first evaluate the right-hand side of the update rule for both parameters and store the results in temporary variables, then use those temporaries to update $\theta_0$ and $\theta_1$ at the same time.
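A sketch of one simultaneous update step (the two partial-derivative functions are passed in as placeholders; their concrete form for linear regression appears in section 2.7):

```python
def gradient_step(theta0, theta1, alpha, dJ_dtheta0, dJ_dtheta1):
    """One gradient descent step with a simultaneous update: both partial
    derivatives are evaluated at the OLD (theta0, theta1) before either
    parameter is overwritten."""
    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    return temp0, temp1  # assign both at the same time

# Example with the toy cost J(t0, t1) = t0^2 + t1^2 (partials 2*t0 and 2*t1):
t0, t1 = gradient_step(4.0, -2.0, 0.1, lambda a, b: 2 * a, lambda a, b: 2 * b)
print(t0, t1)  # 3.2 -1.6: both parameters move toward the minimum at (0, 0)
```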
Understanding of gradient descent method
If $\alpha$ is too small, many steps are needed to reach the local minimum; if $\alpha$ is too large, gradient descent may overshoot and fail to converge.
Moreover, as gradient descent runs, the derivative becomes smaller and smaller, so the steps taken become smaller and smaller, until it converges to the local minimum. At a local optimum the derivative is 0, so there is no need to further reduce the value of $\alpha$ over time.
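A quick numeric illustration on the toy one-parameter cost $J(\theta) = \theta^2$ (not the regression cost): a small $\alpha$ makes slow but steady progress, too large an $\alpha$ overshoots and diverges, and because the derivative $2\theta$ shrinks near the minimum, the steps shrink on their own.

```python
def run(alpha, theta=1.0, steps=5):
    """Gradient descent on J(theta) = theta^2, whose derivative is 2 * theta."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta

print(run(alpha=0.01))  # ~0.904: small alpha, slow but steady progress
print(run(alpha=0.5))   # 0.0: lands on the minimum
print(run(alpha=1.1))   # ~-2.49: too large, the iterates oscillate and grow
```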
summary
Gradient descent is an algorithm for finding the minimum of a function; it can be used to minimize the cost function.
$\alpha$ is the learning rate; it controls the step size of each gradient descent update.
As gradient descent runs, the derivative becomes smaller and the steps become smaller, until it converges to a local minimum.
2.7 Gradient descent for linear regression
The local optimal solution of the linear regression cost function is the global optimal solution
In general, initializing with different values may lead to convergence to different local optima, and the local optimum reached may not be the global optimum.

However, the cost function of linear regression is a convex function, so its local optimum is also the global optimum.
Batch gradient descent method (Batch Gradient Descent)
Gradient descent as described here is also called batch gradient descent (Batch Gradient Descent), meaning that every step of gradient descent uses all the training samples: when computing the partial derivatives, the error terms are summed over all training samples.
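Putting the pieces together, here is a minimal sketch of batch gradient descent for univariate linear regression (names are my own). Differentiating the squared error cost gives $\frac{\partial J}{\partial \theta_0} = \frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)$ and $\frac{\partial J}{\partial \theta_1} = \frac{1}{m}\sum_{i=1}^{m}\big(h_\theta(x^{(i)}) - y^{(i)}\big)x^{(i)}$, and each update sums these error terms over all $m$ samples:

```python
def batch_gradient_descent(xs, ys, alpha=0.1, iterations=1000):
    """Batch gradient descent: every update sums over ALL m training samples."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                             # dJ/dtheta0
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m  # dJ/dtheta1
        # simultaneous update via a single tuple assignment
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Toy data lying exactly on y = 1 + 2x; the fit should be close to (1, 2).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
print(batch_gradient_descent(xs, ys))  # approximately (1.0, 2.0)
```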
summary
Applying gradient descent to the cost function of linear regression converges to the global optimum of that function.
Gradient descent is also called batch gradient descent: each step of the descent sums over all training samples when computing the gradient.