Li Hongyi, Machine Learning 3: Gradient Descent
2022-07-26 15:05:00 【Hua Weiyun】
1. Source of Error
Error has two main sources: bias (Bias) and variance (Variance).
Error reflects the overall accuracy of the model. Bias reflects the gap between the model's output on the samples and the true value, i.e. the accuracy of the model itself. Variance reflects the gap between each individual output of the model and the model's expected output, i.e. the stability of the model.
1.1 Underfitting and Overfitting

▲ Bias vs. variance
A simple model (left) gives an error that mainly comes from large bias; this situation is called underfitting. A complex model (right) gives an error that mainly comes from large variance; this situation is called overfitting.
- If the model has a large error even on the training set, it is underfitting. Remedies: redesign the model, e.g. add higher-order terms or use a more complex model.
- If the model achieves a small error on the training set but a large error on the test set, it probably has a large variance, i.e. it is overfitting. Remedies: collect more data; apply regularization. A small diagnostic sketch is given below.
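The following is a minimal sketch (toy data and polynomial degrees are assumptions for illustration, not from the original post) of diagnosing underfitting and overfitting by comparing training and test error:

```python
# Diagnose underfitting vs. overfitting by comparing train/test error.
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = sin(x) + noise
x_train = rng.uniform(0, 6, 20)
x_test = rng.uniform(0, 6, 200)
y_train = np.sin(x_train) + rng.normal(0, 0.2, x_train.shape)
y_test = np.sin(x_test) + rng.normal(0, 0.2, x_test.shape)

def poly_errors(degree):
    """Fit a polynomial of the given degree and return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

for degree in (1, 3, 10):
    train_mse, test_mse = poly_errors(degree)
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
    # degree 1: both errors large          -> underfitting (high bias)
    # degree 10: tiny train, large test MSE -> overfitting (high variance)
```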
1.2 Model Selection
The task is mainly a trade-off between bias and variance so as to minimize the total error.
- Cross validation (Cross Validation): split the training data into two parts, one used as the training set and one as the validation set. Train the candidate models on the training set, compare them on the validation set, pick the best model, and then retrain it on all of the training data.
- N-fold cross validation (N-fold Cross Validation): split the training data into N folds; train each candidate model N times, each time holding out a different fold for validation, and compute its average error over the folds. Choose the model with the smallest average error, then retrain it on all of the training data. A minimal sketch follows below.
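Below is a minimal sketch of N-fold cross validation (the helper names, toy data, and candidate models are assumptions for illustration, not from the original post):

```python
# N-fold cross validation: pick the model with the lowest average error.
import numpy as np

def n_fold_cv(x, y, fit, predict, n_folds=5):
    """Return the average validation MSE over n_folds folds."""
    indices = np.random.permutation(len(x))
    folds = np.array_split(indices, n_folds)
    errors = []
    for i in range(n_folds):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        model = fit(x[train_idx], y[train_idx])
        pred = predict(model, x[val_idx])
        errors.append(np.mean((pred - y[val_idx]) ** 2))
    return np.mean(errors)

# Example: compare polynomial models of different degrees.
x = np.linspace(0, 6, 60)
y = np.sin(x) + np.random.normal(0, 0.2, x.shape)

for degree in (1, 3, 10):
    avg_err = n_fold_cv(
        x, y,
        fit=lambda xs, ys, d=degree: np.polyfit(xs, ys, d),
        predict=np.polyval,
        n_folds=5,
    )
    print(f"degree={degree:2d}  average CV error={avg_err:.3f}")
```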
2. Gradient Descent
Why do we need gradient descent?
1. Gradient descent is an iterative method that can be used, for example, to solve least-squares problems.
2. When solving for the parameters of a machine learning model without constraints, the main approaches are gradient descent and the least-squares method.
3. To find the minimum of a loss function, gradient descent solves for it iteratively, yielding the minimal value of the loss and the corresponding model parameters.
4. If we instead need the maximum of an objective, we can iterate with gradient ascent; gradient descent and gradient ascent can be converted into each other.
5. In machine learning, the main variants are stochastic gradient descent and batch gradient descent.
—— Machine learning: why gradient descent is needed
In the third step of the machine learning framework, gradient descent is used to optimize the model, i.e. to solve the following optimization problem:
$$\theta^* = \arg\min_{\theta} L(\theta)$$
- $L$: loss function (Loss Function)
- $\theta$: parameters (it denotes a set of parameters; there may be more than one)
Goal: find the set of parameters $\theta$ that minimizes the loss function $L(\theta)$. (Gradient descent is used to solve this problem; a minimal sketch follows.)
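The update rule is $\theta \leftarrow \theta - \eta \nabla L(\theta)$, where $\eta$ is the learning rate. Below is a minimal sketch on a toy quadratic loss (the loss function and all values are assumptions for illustration, not from the lecture):

```python
# Vanilla gradient descent on a toy loss L(theta) = (theta - 3)^2.
import numpy as np

def loss(theta):
    return (theta - 3.0) ** 2

def grad(theta):
    return 2.0 * (theta - 3.0)

theta = np.array(0.0)   # initial parameter
eta = 0.1               # learning rate
for step in range(100):
    theta = theta - eta * grad(theta)   # theta <- theta - eta * dL/dtheta

print(theta, loss(theta))   # theta approaches 3, the minimizer
```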
2.1 Adjusting the Learning Rate

▲ Carefully adjust the learning rate
When there are only one or two parameters, the loss surface can be visualized and the learning rate adjusted by looking at it, but this is hard to do in high dimensions.
Solution: instead, visualize the effect of the parameter updates on the loss, i.e. plot the loss against the number of updates (as in the sketch below).
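As a rough illustration, the sketch below (the toy loss and learning-rate values are assumptions, not from the lecture) plots the loss against the number of updates for several learning rates; a too-small rate decreases the loss very slowly, a good rate decreases it quickly, and a too-large rate makes the loss oscillate or blow up:

```python
# Visualize loss vs. number of updates for several learning rates.
import numpy as np
import matplotlib.pyplot as plt

def grad(theta):
    return 2.0 * (theta - 3.0)          # gradient of the toy loss (theta - 3)^2

for eta in (0.01, 0.1, 0.9, 1.05):      # small, good, large, too large
    theta, losses = 0.0, []
    for _ in range(30):
        theta -= eta * grad(theta)
        losses.append((theta - 3.0) ** 2)
    plt.plot(losses, label=f"eta={eta}")

plt.xlabel("number of updates")
plt.ylabel("loss")
plt.legend()
plt.show()
```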
2.2 Gradient Descent Optimization
- SGD (Stochastic Gradient Descent)
  Idea: pick a single example and update the parameters using only that one example.
  Shortcomings:
  ① Sensitive to the parameters, so parameter initialization needs care.
  ② Easily falls into local minima.
  ③ Training takes a long time when there is a lot of data.
  ④ (Batch gradient descent, by contrast, uses all of the training data at every iteration step.)
- Adagrad (Adaptive Gradient)
  Idea: accumulate the squared historical gradients of each parameter dimension, and divide the update by the root of this accumulated value.
  Each parameter's learning rate therefore depends on its own gradient history, so different parameters get different learning rates.
  Shortcoming: it is strongly affected by past gradients, so the learning rate shrinks quickly; the ability to keep learning grows weaker and weaker, and learning may stop too early.
- RMSProp (Root Mean Square Propagation)
  Idea: introduce a decay factor on top of the adaptive gradient, so that the accumulated squared gradients balance the "past" against the "present"; the balance is adjusted through a hyperparameter.
  Suitable for non-stationary objectives (i.e. objectives that change over time); works very well for RNNs.
- Adam (Adaptive Moment Estimation)
  Currently the most popular optimizer in deep learning. It combines the strengths of the adaptive gradient (good at handling sparse gradients) and root-mean-square (good at handling non-stationary objectives) approaches, and is suitable for large datasets and high-dimensional spaces. The update rules of these optimizers are sketched in the code below.
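The following sketch (toy gradient and assumed hyperparameter values, not from the lecture) shows one update step of SGD, Adagrad, RMSProp and Adam side by side for a small parameter vector:

```python
# Update rules behind SGD, Adagrad, RMSProp and Adam (single step shown).
import numpy as np

def grad(theta):
    return 2.0 * (theta - 3.0)          # gradient of the toy loss (theta - 3)^2

theta = np.zeros(2)
eta, eps = 0.1, 1e-8
g = grad(theta)

# --- SGD: theta <- theta - eta * g
theta_sgd = theta - eta * g

# --- Adagrad: accumulate squared gradients, divide by their root
accum = np.zeros(2)
accum += g ** 2
theta_adagrad = theta - eta * g / (np.sqrt(accum) + eps)

# --- RMSProp: exponentially decayed accumulation (decay factor alpha)
alpha = 0.9
ms = np.zeros(2)
ms = alpha * ms + (1 - alpha) * g ** 2
theta_rmsprop = theta - eta * g / (np.sqrt(ms) + eps)

# --- Adam: first and second moment estimates with bias correction
beta1, beta2, t = 0.9, 0.999, 1
m, v = np.zeros(2), np.zeros(2)
m = beta1 * m + (1 - beta1) * g
v = beta2 * v + (1 - beta2) * g ** 2
m_hat = m / (1 - beta1 ** t)
v_hat = v / (1 - beta2 ** t)
theta_adam = theta - eta * m_hat / (np.sqrt(v_hat) + eps)

print(theta_sgd, theta_adagrad, theta_rmsprop, theta_adam)
```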
2.3 Feature Scaling
Different input features can have very different ranges; feature scaling brings the ranges of the different inputs onto the same scale.
This way the different features have a comparable influence on the output, which makes the parameter updates more efficient (with scaled features the loss contours are closer to circles, so the gradient points more directly towards the minimum).
As shown in the figure below, for each dimension $i$ (green box) compute its mean $m_i$ and its standard deviation $\sigma_i$. Then, for the $i$-th input of the $r$-th example, replace $x_i^r$ with $(x_i^r - m_i)/\sigma_i$. After this, every dimension has mean 0 and variance 1.

▲ Method of feature scaling
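A minimal sketch of this standardization, using a made-up data matrix X with one row per example and one column per feature:

```python
# Feature scaling (standardization) of a data matrix X.
import numpy as np

X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 200.0]])     # two features with very different ranges

mean = X.mean(axis=0)            # m_i: mean of each dimension
std = X.std(axis=0)              # sigma_i: standard deviation of each dimension
X_scaled = (X - mean) / std      # x_i^r <- (x_i^r - m_i) / sigma_i

print(X_scaled.mean(axis=0))     # ~0 in every dimension
print(X_scaled.std(axis=0))      # 1 in every dimension
```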
3. Limitations of Gradient Descent
- It easily falls into local minima (local minima);
- It can get stuck at points that are not extrema but where the derivative is 0 (stationary points such as saddle points);
- It stops as soon as the derivative is close to 0, but such a point may just be a plateau rather than an extremum.

▲ Limitation of gradient descent
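For instance, $f(x) = x^3$ has $f'(0) = 0$ even though $x = 0$ is neither a minimum nor a maximum. The toy sketch below (function and values are assumptions for illustration, not from the lecture) shows gradient descent crawling to a halt near that point:

```python
# Gradient descent stalls at a stationary point that is not a minimum.
def f(x):
    return x ** 3

def df(x):
    return 3 * x ** 2

x, eta = 1.0, 0.01
for _ in range(10000):
    x -= eta * df(x)      # step size shrinks as the derivative approaches 0

print(x, f(x))            # x gets stuck just above 0, although f keeps decreasing for x < 0
```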
4. Summary
Datawhale team learning, Li Hongyi's Machine Learning, Task 3: Gradient Descent. It mainly covers the sources of error, recognizing underfitting and overfitting, gradient descent, tuning the learning rate, optimizers for gradient descent, and the limitations of gradient descent.