Andrew Ng's Machine Learning Course Notes 02: Univariate Linear Regression
2 Univariate linear regression
2.1 Model description
We continue with the house price prediction example introduced in Chapter 1.
Notation conventions
In supervised learning we are given a data set called the training set. Throughout the course, $m$ denotes the number of training samples.
$x$ denotes the input variable (feature) and $y$ denotes the output variable (the target variable to be predicted).
$(x, y)$ denotes a training sample, and $(x^{(i)}, y^{(i)})$ denotes the $i$-th training sample. Note that the superscript $i$ is not an exponent but an index into the training set.
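To make the notation concrete, here is a minimal sketch in Python with made-up house-price numbers (the values and variable names are my own, not data from the course):

```python
# Toy training set for house price prediction (made-up numbers).
# Each pair is one training sample (x^(i), y^(i)):
#   x = house size in square feet, y = price in thousands of dollars.
training_set = [(2104, 460), (1416, 232), (1534, 315), (852, 178)]

m = len(training_set)       # m = number of training samples (here 4)
x_1, y_1 = training_set[0]  # (x^(1), y^(1)): math indices start at 1, Python lists at 0
print(m, x_1, y_1)          # 4 2104 460
```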
How a supervised learning algorithm works
The training set is fed to the learning algorithm, whose task is to output a hypothesis function, usually denoted $h$.
The hypothesis function takes an input and predicts the corresponding output.
Hypothesis function
Designing the learning algorithm first requires deciding how to represent the hypothesis function $h$.
One possible representation is $h_{\theta}(x)=\theta_{0}+\theta_{1} x$, which means the hypothesis predicts $y$ as a linear function of $x$.
Because the model contains only one feature (input variable) and is linear, this kind of regression problem is called univariate (single-variable) linear regression.
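As an illustration, the hypothesis can be written directly as a Python function; this is a minimal sketch, and the parameter values used in the example call are chosen arbitrarily:

```python
def h(x, theta0, theta1):
    """Hypothesis h_theta(x) = theta0 + theta1 * x for univariate linear regression."""
    return theta0 + theta1 * x

# Illustrative only: with theta0 = 50 and theta1 = 0.2, a 1500 sq-ft house
# is predicted to cost 50 + 0.2 * 1500 = 350 (thousand dollars).
print(h(1500, theta0=50.0, theta1=0.2))  # 350.0
```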
summary
The workflow of supervised learning is: a training set is provided to the learning algorithm, and the learning algorithm outputs a hypothesis function $h$.
Here we first take the hypothesis function $h$ to be a linear function.
2.2 Cost function
The cost function measures how well a line fits the data; minimizing it lets us find the line that fits the data best.
Model parameters
For the hypothesis function $h_{\theta}(x)=\theta_{0}+\theta_{1} x$, $\theta_{0}$ and $\theta_{1}$ are the model parameters.
Squared error function
Through the cost function we can find a line that fits the data well, i.e., obtain the model parameters that minimize the modeling error.
Usually we use the squared error function $J\left(\theta_{0}, \theta_{1}\right)=\frac{1}{2 m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2}$ as the cost function. For most problems, especially regression problems, the squared error cost function is a reasonable choice, although other kinds of cost functions exist.
Our optimization goal is to minimize the cost function.
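The squared error cost above translates directly into code. Below is a minimal sketch (the function and variable names are my own):

```python
def compute_cost(xs, ys, theta0, theta1):
    """Squared error cost J(theta0, theta1) = (1/2m) * sum_i (h(x^(i)) - y^(i))^2."""
    m = len(xs)
    total = 0.0
    for x_i, y_i in zip(xs, ys):
        prediction = theta0 + theta1 * x_i   # h_theta(x^(i))
        total += (prediction - y_i) ** 2     # squared error of sample i
    return total / (2 * m)

# Toy check: for data lying exactly on y = 2x, the cost at (theta0, theta1) = (0, 2) is 0.
xs, ys = [1, 2, 3], [2, 4, 6]
print(compute_cost(xs, ys, theta0=0.0, theta1=2.0))  # 0.0
print(compute_cost(xs, ys, theta0=0.0, theta1=1.0))  # about 2.33, a worse fit
```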
summary
For the hypothesis function $h_{\theta}(x)=\theta_{0}+\theta_{1} x$, $\theta_{0}$ and $\theta_{1}$ are the model parameters.
Usually the squared error function is used as the cost function; minimizing it yields the model parameters with the smallest modeling error.
2.3-2.4 An intuitive understanding of the cost function
Hypothesis function and cost function

| | Hypothesis function $h_{\theta_{i}}(x)$ | Cost function $J(\theta_{i})$ |
|---|---|---|
| Independent variable | For fixed $\theta$, $h$ is a function of $x$ | $J$ is a function of $\theta$ |
| Plot | (figure omitted) | (figure omitted) |
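To see this distinction numerically, one can fix $\theta_0 = 0$, sweep $\theta_1$, and tabulate the resulting cost; this is a minimal sketch with toy data of my own choosing:

```python
def cost(xs, ys, theta1, theta0=0.0):
    """Same squared error cost J as in section 2.2, written compactly."""
    m = len(xs)
    return sum(((theta0 + theta1 * x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs, ys = [1, 2, 3], [2, 4, 6]  # toy data lying exactly on the line y = 2x
for theta1 in [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]:
    print(f"theta1 = {theta1}, J(theta1) = {cost(xs, ys, theta1):.3f}")
# J is 0 at theta1 = 2.0 and grows as theta1 moves away in either direction,
# tracing out the bowl-shaped curve of J as a function of theta.
```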
2.5-2.6 Gradient descent method
Gradient descent method
Gradient descent is an algorithm for finding the minimum of a function; it can be used to minimize the cost function.
We first pick a combination of parameters at random and evaluate the cost function, then move to the parameter combination that decreases the value of the cost function the most. This is repeated until a local minimum is reached.
Because not all parameter combinations are tried, it is not certain whether the local minimum found is the global minimum.
Choosing different initial parameter combinations may lead to different local minima.
The mathematical principle of the gradient descent method
Repeat until convergence {
$\theta_{j}:=\theta_{j}-\alpha \frac{\partial}{\partial \theta_{j}} J\left(\theta_{0}, \theta_{1}\right) \quad \text{(for } j=0 \text{ and } j=1 \text{)}$
}
Note that $:=$ denotes the assignment operator, while $=$ asserts that the left and right sides are equal.
$\alpha$ is the learning rate, which controls the step size taken in each gradient descent update.
It is important that the parameters $\theta_{0}$ and $\theta_{1}$ are updated simultaneously: the right-hand side of the update rule is first computed and stored in temporary variables, and then the temporary variables are used to update $\theta_{0}$ and $\theta_{1}$ at the same time, as in the sketch below.
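Below is a minimal sketch of one simultaneous-update step. The function and variable names are my own, and the partial derivatives are passed in as functions because their explicit form for linear regression is only derived in section 2.7:

```python
def gradient_descent_step(theta0, theta1, alpha, dJ_dtheta0, dJ_dtheta1):
    """One gradient descent step with a *simultaneous* update of theta0 and theta1.

    dJ_dtheta0 and dJ_dtheta1 are functions returning the partial derivatives of
    the cost J evaluated at the current parameters.
    """
    # First compute both right-hand sides using the *old* parameter values...
    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    # ...then assign them together, so the theta1 update never sees the new theta0.
    theta0, theta1 = temp0, temp1
    return theta0, theta1
```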
Understanding the gradient descent method
If $\alpha$ is too small, many steps are needed to reach the local minimum; if $\alpha$ is too large, gradient descent may overshoot and fail to converge.
Moreover, as gradient descent runs, the derivative becomes smaller and smaller, so the steps become smaller and smaller, until it converges to a local minimum. At a local optimum the derivative is 0, so there is no need to further decrease the value of $\alpha$ over time.
summary
Gradient descent is an algorithm for finding the minimum of a function; it can be used to minimize the cost function.
$\alpha$ is the learning rate, which controls the step size of each gradient descent update.
As gradient descent runs, the derivative and the step size become smaller and smaller, until it converges to a local minimum.
2.7 Gradient descent for linear regression
The local optimal solution of the linear regression cost function is the global optimal solution
In general, initializing with different values may lead to convergence to different local optima, and the local optimum reached may not be the global optimum.
However, the cost function of linear regression is a convex function, so its local optimum is also the global optimum.
Batch gradient descent method (Batch Gradient Descent)
The gradient descent method used here is also called batch gradient descent (Batch Gradient Descent): every step of gradient descent uses all the training samples, because the partial derivative being computed sums over the contributions of all training samples.
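Putting the pieces together, below is a minimal sketch of batch gradient descent for univariate linear regression. The partial-derivative formulas in the comments follow from differentiating the squared error cost above; the data, learning rate, and iteration count are made up for illustration:

```python
def batch_gradient_descent(xs, ys, alpha=0.05, num_iters=5000):
    """Batch gradient descent for h_theta(x) = theta0 + theta1 * x.

    Every iteration uses *all* m training samples:
      dJ/dtheta0 = (1/m) * sum_i (h(x^(i)) - y^(i))
      dJ/dtheta1 = (1/m) * sum_i (h(x^(i)) - y^(i)) * x^(i)
    """
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        errors = [(theta0 + theta1 * x_i) - y_i for x_i, y_i in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x_i for e, x_i in zip(errors, xs)) / m
        # Simultaneous update of both parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Toy data lying near y = 1 + 2x; the learned parameters should approach (1, 2).
xs, ys = [0, 1, 2, 3, 4], [1.1, 2.9, 5.2, 6.8, 9.1]
print(batch_gradient_descent(xs, ys))
```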
summary
Applying gradient descent to the cost function of linear regression converges to the global optimum of the function.
This form of gradient descent is also called batch gradient descent; each descent step sums the gradient contributions of all training samples.