
2021 Li Hongyi machine learning (3): what if neural network training fails

2022-07-05 02:39:00 Three ears 01

1 Task Introduction

1.1 If the loss on the training set is never small enough

  • Situation 1: model bias (the model itself is too limited). Remedy: build a more complex, flexible model.
  • Situation 2: an optimization problem (the optimization method cannot drive the loss down to its minimum). Remedy: use a more effective optimization method.

How do we tell which situation we are in?

  • First train a shallower network (or some other model that is easier to optimize).
  • If the deeper network cannot reach a training loss as low as that shallower model, it is situation 2: an optimization problem.

1.2 If the loss is small on the training set but large on the test set

  • Situation 1: overfitting. The simplest remedy is to add more training data, for example through data augmentation (a code sketch follows after this list):
     [Figure: data augmentation examples on an image]
    The original data can be transformed to create new training samples. In the figure, the first image is the original, and the second and third are a flip and a local crop/zoom of it. The fourth image, however, is turned upside down, which makes recognition harder for the model, so that kind of augmentation is not a good idea.
    Another remedy is to constrain the model, for example by requiring it to be a quadratic curve, but be careful not to constrain it too much.
    How should we select a model?
     [Figure: model selection]
  • Situation 2: mismatch
    Overfitting can be addressed by adding training data, but mismatch means the training and testing data come from different distributions, so it cannot be solved that way.
     [Figure: training and testing data drawn from different distributions]
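A minimal data-augmentation sketch with torchvision (the dataset and image size are assumptions for illustration, not part of the original notes): it applies the kinds of transformations described above, flips and local crops, while deliberately leaving out vertical flips.

```python
# Hypothetical augmentation pipeline; the image size and dataset are assumptions.
from torchvision import transforms
import torchvision

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),              # left-right flip, as in image 2
    transforms.RandomResizedCrop(32, scale=(0.6, 1.0)),  # local crop/zoom, as in image 3
    transforms.ToTensor(),
])
# No vertical flip: an upside-down image is the kind of augmentation the notes warn against.

train_set = torchvision.datasets.CIFAR10(root="data", train=True,
                                         download=True, transform=augment)
```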

1.3 Schematic of the overall training strategy

 [Figure: flowchart summarizing the training strategy: check the training loss first, then the testing loss]

2 Local minima and saddle points

When the loss can no longer decrease, the gradient may be close to 0. This can be either a local minimum or a saddle point; together they are called critical points.
 [Figure: local minimum vs. saddle point on the loss surface]
At this point we need to determine which case it is, and computing the Hessian is enough: if all eigenvalues of the Hessian are positive, the critical point is a local minimum; if some eigenvalues are positive and some are negative, it is a saddle point.
 [Figure: Taylor approximation of the loss around a critical point and the role of the Hessian H]
An example:
 [Figure: worked example of computing the gradient and Hessian for a tiny network]
If it is a saddle point, we can still find a descent direction: move along an eigenvector of the Hessian whose eigenvalue is negative, and training can continue.
 [Figure: escaping a saddle point along the eigenvector of a negative eigenvalue]
In practice, however, this method is rarely used, because computing the Hessian and its eigenvectors is too expensive.
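As a toy illustration (not from the original notes), the eigenvalue test above can be written with torch.autograd on a small analytic loss; for a real network the full Hessian is usually far too large to compute, which is why the method is rarely used.

```python
import torch

def classify_critical_point(loss_fn, theta, eps=1e-6):
    """Classify a critical point of loss_fn at theta by the Hessian's eigenvalues."""
    H = torch.autograd.functional.hessian(loss_fn, theta)
    eigvals = torch.linalg.eigvalsh(H)        # Hessian is symmetric, so eigvalsh is fine
    if torch.all(eigvals > eps):
        return "local minimum"                # all eigenvalues positive
    if torch.all(eigvals < -eps):
        return "local maximum"                # all eigenvalues negative
    return "saddle point"                     # mixed signs

# Toy loss f(x, y) = x^2 - y^2 has a saddle point at the origin.
f = lambda t: t[0] ** 2 - t[1] ** 2
print(classify_critical_point(f, torch.zeros(2)))   # -> saddle point
```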

3 Batch and momentum

3.1 Batch

A smaller batch size tends to give better results: the noisier gradients help both optimization and generalization.
 [Figure: comparison of small and large batch sizes in terms of speed, training accuracy and testing accuracy]
Many papers try to get the best of both worlds, large-batch speed together with small-batch performance:
 [Figure: examples of work on making large-batch training perform well]
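A minimal sketch (the model, data and hyperparameters are placeholders, not from the notes) of where the batch size enters training: the DataLoader splits the shuffled data into mini-batches, and one parameter update is made per batch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 10), torch.randn(1000, 1))
loader = DataLoader(dataset, batch_size=32, shuffle=True)   # small batches: noisier updates

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for x, y in loader:                          # one iteration = one batch = one update
    loss = torch.nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```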

3.2 Momentum

With momentum, the update does not stop at a critical point but keeps moving downhill:
 [Figure: a ball rolling down the error surface, carried past a critical point by its momentum]
Vanilla gradient descent looks like this:
 [Figure: vanilla gradient descent, each step moving opposite to the current gradient]

After adding momentum, each update also takes the previous movement into account: the next movement is the combination of the current (negative) gradient and the previous movement.
 [Figure: gradient descent with momentum: movement = λ · previous movement − η · current gradient]
In the figure below, the red arrows are the gradients, the dashed blue arrows are the momentum (the previous movement), and the solid blue arrows are the actual movement, the combination of the two. With momentum the parameters may even climb over a small hill and reach a lower loss minimum.
 [Figure: momentum carrying the parameters over a small hill toward a lower minimum]
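A minimal sketch (a toy one-dimensional loss, not from the original notes) of the momentum update described above:

```python
import numpy as np

def gd_with_momentum(grad_fn, theta0, lr=0.1, lam=0.9, steps=200):
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)            # previous movement, m^0 = 0
    for _ in range(steps):
        g = grad_fn(theta)              # gradient at the current position
        m = lam * m - lr * g            # new movement = λ · old movement − η · gradient
        theta = theta + m               # take the step
    return theta

# Toy quadratic loss L(θ) = θ², whose gradient is 2θ; the iterates settle near θ = 0.
print(gd_with_momentum(lambda th: 2 * th, theta0=[3.0]))
```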

4 Adapting the learning rate automatically

Earlier we said that when the loss stops falling it may be a critical point, but it may also be the situation below, where the gradient is still large and the parameters simply bounce back and forth:
 [Figure: the loss plateaus while the gradient norm stays large, because the update oscillates across a valley]
The learning rate should therefore be customized for each parameter:
 [Figure: with a single fixed learning rate, the update either oscillates in the steep direction or barely moves in the flat one]
In the original update $\theta_i^{t+1} = \theta_i^t - \eta\, g_i^t$, the learning rate $\eta$ is the same at every iteration and for every parameter. After the modification, $\eta$ becomes $\frac{\eta}{\sigma_i^t}$, i.e. $\theta_i^{t+1} = \theta_i^t - \frac{\eta}{\sigma_i^t}\, g_i^t$, so the effective learning rate is parameter dependent (the subscript $i$) and iteration dependent (the superscript $t$).

4.1 The most common choice: the root mean square of past gradients

 [Figure: $\sigma_i^t = \sqrt{\frac{1}{t+1}\sum_{k=0}^{t}\left(g_i^k\right)^2}$, the root mean square of the gradients seen so far]
This is the method used in Adagrad:
 [Figure: the Adagrad update, dividing the learning rate by the root mean square of past gradients]
When the gradients are small, the computed $\sigma_i^t$ is small, so the effective learning rate $\frac{\eta}{\sigma_i^t}$ is large; conversely, when the gradients are large, the effective learning rate becomes small.
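A minimal sketch (the toy loss and hyperparameters are assumptions, not from the original notes) of this root-mean-square update:

```python
import numpy as np

def adagrad_rms(grad_fn, theta0, lr=0.1, steps=200, eps=1e-8):
    theta = np.asarray(theta0, dtype=float)
    sum_sq = np.zeros_like(theta)                  # running sum of squared gradients
    for t in range(steps):
        g = grad_fn(theta)
        sum_sq += g ** 2
        sigma = np.sqrt(sum_sq / (t + 1)) + eps    # root mean square of g^0 ... g^t
        theta = theta - lr / sigma * g             # per-parameter effective learning rate
    return theta

# Toy loss L(θ) = θ1² + 100·θ2²: flat in θ1, steep in θ2; with the RMS scaling,
# both directions now take steps of a similar size.
print(adagrad_rms(lambda th: np.array([2 * th[0], 200 * th[1]]), [3.0, 3.0]))
```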

4.2 Adjusting the importance of the current gradient: RMSProp

RMSProp introduces a weight $\alpha$ to control how important the current gradient is:
 [Figure: the RMSProp update, $\sigma_i^t = \sqrt{\alpha\left(\sigma_i^{t-1}\right)^2 + (1-\alpha)\left(g_i^t\right)^2}$]
As shown in the figure below:
 [Figure: RMSProp reacting quickly as the error surface switches between flat and steep]
By choosing $\alpha$ relatively small, $\sigma_i^t$ depends more on the current gradient $g_i^t$. Then, when the surface suddenly changes from flat to steep, $g_i^t$ becomes large, $\sigma_i^t$ grows quickly, and the step size shrinks; likewise, when the surface turns flat again, $\sigma_i^t$ shrinks quickly and the step size grows.
In short, every step reacts quickly to how the error surface is changing.
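A minimal sketch (toy loss, assumed hyperparameters) of the RMSProp-style $\sigma$: an exponentially weighted average of squared gradients replaces Adagrad's plain average, so recent gradients dominate.

```python
import numpy as np

def rmsprop(grad_fn, theta0, lr=0.01, alpha=0.9, steps=500, eps=1e-8):
    theta = np.asarray(theta0, dtype=float)
    sigma_sq = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        sigma_sq = alpha * sigma_sq + (1 - alpha) * g ** 2   # σ² ← α·σ² + (1−α)·g²
        theta = theta - lr / (np.sqrt(sigma_sq) + eps) * g
    return theta

print(rmsprop(lambda th: 2 * th, [3.0]))   # toy loss L(θ) = θ²; iterates move steadily toward 0
```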

4.3 Adam: RMSProp + Momentum

 [Figure: the Adam algorithm, combining RMSProp's adaptive σ with momentum]
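In practice one rarely implements Adam by hand; a minimal usage sketch with PyTorch's built-in optimizer (the model and data here are placeholders, not from the notes):

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
# betas[0] plays the role of the momentum weight, betas[1] the role of RMSProp's α.

x, y = torch.randn(64, 10), torch.randn(64, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```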

4.4 Learning Rate Scheduling

Let the learning rate $\eta$ change over time:
 [Figure: learning rate decay and warm-up schedules]
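A minimal sketch (assumed model and schedule, not from the original notes) of learning rate scheduling in PyTorch, here a simple step decay; a warm-up can be written similarly with LambdaLR:

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(100):
    # ... run one epoch of training here, calling optimizer.step() for each batch ...
    scheduler.step()                       # shrink the learning rate by 10x every 30 epochs
    current_lr = scheduler.get_last_lr()[0]
```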

4.5 Summary

Together with Section 3.2, we now have three ways to improve gradient descent: momentum, a per-parameter adaptive learning rate, and a learning rate that changes over time (scheduling).
$m_i^t$ and $\sigma_i^t$ do not cancel each other out, because $m_i^t$ keeps the direction of the past gradients while $\sigma_i^t$ only uses their magnitude.
 [Figure: the combined update rule with momentum, adaptive σ, and scheduled η]
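Putting the three pieces together, the update in the figure can be written as follows (reconstructed from the definitions above):

```latex
% Combined update rule:
%   eta^t     - scheduled (time-dependent) learning rate
%   m_i^t     - momentum: weighted sum of past gradients, keeps their direction
%   sigma_i^t - root mean square of past gradients, uses only their magnitude
\theta_i^{t+1} = \theta_i^t \;-\; \frac{\eta^t}{\sigma_i^t}\, m_i^t
```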

5 The loss function may also have an impact

 [Figure: classification set-up: the network output y is compared with the class label]
In classification, a softmax is usually applied to the output:
 [Figure: the softmax function, which squashes the outputs into values between 0 and 1 that sum to 1]
For binary classification a sigmoid is more commonly used, but the two approaches give the same result.

Now for the loss function:
 [Figure: MSE and cross-entropy as candidate classification losses]
In fact, cross-entropy is the most commonly used loss for classification. In PyTorch, the CrossEntropyLoss function already contains the softmax; the two are bound together.
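A minimal sketch (shapes and values are placeholders) showing that PyTorch's CrossEntropyLoss expects raw logits, so the network should not apply its own softmax before it:

```python
import torch

logits = torch.randn(4, 3)              # raw network outputs: 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 2])    # ground-truth class indices

criterion = torch.nn.CrossEntropyLoss()
loss = criterion(logits, targets)       # softmax + negative log-likelihood in one call

# Equivalent two-step computation, for comparison:
log_probs = torch.log_softmax(logits, dim=1)
same_loss = torch.nn.functional.nll_loss(log_probs, targets)
print(loss.item(), same_loss.item())    # the two values match
```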

Why?
 [Figure: error surfaces of MSE vs. cross-entropy for the same classification problem]
As the diagram shows, when the loss is large the MSE surface is very flat, so gradient descent cannot move toward the low-loss region and gets stuck, whereas cross-entropy keeps a usable gradient all the way down.

6 Batch Normalization

We would like the different parameters to influence the loss on a comparable scale, as in the right-hand error surface below:
 [Figure: an elongated error surface (left) vs. a more rounded one (right)]
The method is feature normalization:
 [Figure: feature normalization, applied separately to each feature dimension across the training examples]
After normalization, each feature dimension has mean 0 and variance 1.
Generally speaking, feature normalization makes gradient descent converge faster.
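A minimal sketch (toy data) of feature normalization: for each dimension, subtract the mean and divide by the standard deviation computed over the training examples.

```python
import torch

x = torch.randn(1000, 16) * 5.0 + 3.0      # raw features on an arbitrary scale
mean = x.mean(dim=0, keepdim=True)          # per-dimension mean
std = x.std(dim=0, keepdim=True)            # per-dimension standard deviation
x_norm = (x - mean) / (std + 1e-8)          # each dimension now has mean ≈ 0, variance ≈ 1

print(x_norm.mean(dim=0)[:3], x_norm.std(dim=0)[:3])
```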

The intermediate outputs of the network also need to be normalized, and this normalization is computed over a batch:
 [Figure: normalizing the intermediate outputs z using the mean and standard deviation of the current batch]
So that the output is not forced to have exactly mean 0 and variance 1, two learnable vectors $\beta$ and $\gamma$ are added:
 [Figure: the normalized output is rescaled as $\hat{z} = \gamma \odot \tilde{z} + \beta$]
$\beta$ and $\gamma$ are initialized to the all-zeros and all-ones vectors respectively, and are then learned along with the rest of the network. So at the beginning the distributions of the dimensions are still similar, and only later, once the error surface already behaves well, do $\beta$ and $\gamma$ start to take effect.

At testing time, the statistics obtained during training (moving averages of the batch means and standard deviations) are used in place of batch statistics.
 [Figure: at test time, moving averages computed during training replace the batch mean and standard deviation]
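A minimal sketch (layer sizes are placeholders) of batch normalization in PyTorch: nn.BatchNorm1d holds the learnable $\gamma$ (weight, initialized to ones) and $\beta$ (bias, initialized to zeros), and keeps running statistics that are used at test time.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(16, 64),
    nn.BatchNorm1d(64),   # normalizes the 64 intermediate outputs over the batch
    nn.ReLU(),
    nn.Linear(64, 10),
)

model.train()
_ = model(torch.randn(32, 16))   # training: use batch statistics, update the running averages

model.eval()
_ = model(torch.randn(1, 16))    # testing: use the stored running mean/std instead
```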
Some well-known normalization variants:
 [Figure: a list of related normalization methods from the literature]


Copyright notice
This article was written by [Three ears 01]; please include a link to the original when reposting: https://yzsam.com/2022/02/202202140912208462.html