2021 Li Hongyi machine learning (3): what if neural network training fails
2022-07-05 02:39:00 【Three ears 01】
1 Task Introduction
1.1 If the loss on the training set never gets small enough
- Case 1: model bias (the model itself is too limited). Remedy: build a more complex, more flexible model.
- Case 2: optimization problem (the optimization method cannot drive the loss to its minimum). Remedy: use a more effective optimization method.
How do we tell which case it is?
- First train a shallower network (or another simpler model), which is easier to optimize.
- If the deeper network cannot reach a training loss as small as the shallow one's, it is case 2, an optimization problem; see the sketch right after this list.
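A minimal sketch of this diagnosis, assuming a toy regression task, made-up layer sizes, and plain SGD (none of these choices come from the lecture):

```python
import torch
import torch.nn as nn

def train(model, x, y, steps=2000, lr=1e-2):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

x = torch.rand(1000, 10)   # toy inputs
y = torch.rand(1000, 1)    # toy targets

shallow = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
deep = nn.Sequential(nn.Linear(10, 16), nn.ReLU(),
                     nn.Linear(16, 16), nn.ReLU(),
                     nn.Linear(16, 16), nn.ReLU(),
                     nn.Linear(16, 1))

shallow_loss = train(shallow, x, y)
deep_loss = train(deep, x, y)
# If the deeper (more flexible) model ends with a HIGHER training loss,
# the problem is optimization, not model bias.
print(shallow_loss, deep_loss)
```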
1.2 If the loss is small on the training set but large on the test set
- Case 1: overfitting. The simplest remedy is more training data; when more data cannot be collected, data augmentation helps:
The original data can be transformed to create new training examples: in the figure above, the first picture is the original, and the second and third are a horizontal flip and a local crop (zoom-in) of it. The fourth picture, however, is flipped upside down; upside-down images do not normally appear in the data, so this augmentation only makes recognition harder and is not a good choice.
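A hedged sketch of such augmentations using torchvision (the specific transforms and parameters are my own choices, not the lecture's):

```python
from PIL import Image
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),               # mirror image: still a plausible picture
    transforms.RandomResizedCrop(224, scale=(0.6, 1.0)),  # local zoom / crop
    # transforms.RandomVerticalFlip(p=1.0),               # upside-down: usually harmful, as noted above
    transforms.ToTensor(),
])

img = Image.new("RGB", (256, 256))   # placeholder image
augmented = augment(img)             # a new training example derived from the original
```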
The second remedy is to constrain the model, for example by forcing it to be a quadratic curve, but be careful not to constrain it too much.
How should we select a model?
- Case 2: mismatch
Overfitting can be alleviated by adding training data, but mismatch means the training and testing data come from different distributions, which cannot be fixed that way.
1.3 Schematic diagram of the overall task strategy
2 Local minima and saddle points
When the loss cannot decrease any further, the gradient may be close to 0. Both a local minimum and a saddle point are possible at that point; collectively they are called critical points.
Here we need to determine which case we are in, and computing the Hessian is enough: if all of its eigenvalues are positive, the point is a local minimum; if some eigenvalues are positive and some are negative, it is a saddle point:
An example:
If it is a saddle point, the eigenvector belonging to a negative eigenvalue of the Hessian gives a direction along which the loss still decreases, so training can continue.
In practice, however, this Hessian-based escape is seldom used, because computing second derivatives is expensive.
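A minimal sketch of the Hessian test on a toy two-parameter loss (my own example, not from the lecture): all-positive eigenvalues mean a local minimum, mixed signs mean a saddle point, and the eigenvector of a negative eigenvalue gives the escape direction.

```python
import torch

def loss_fn(w):
    w1, w2 = w[0], w[1]
    return w1 ** 2 - w2 ** 2          # (0, 0) is a saddle point of this toy loss

w = torch.zeros(2)                     # a critical point: the gradient here is 0
H = torch.autograd.functional.hessian(loss_fn, w)
eigvals, eigvecs = torch.linalg.eigh(H)
print(eigvals)                         # tensor([-2., 2.]) -> mixed signs: saddle point

escape_dir = eigvecs[:, 0]             # eigenvector of the negative eigenvalue
print(escape_dir)                      # moving along this direction still lowers the loss
```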
3 Batch and momentum
3.1 Batch size
A smaller batch size tends to give better results: the noisier gradient updates help both optimization and generalization.
Many papers try to get the best of both worlds, i.e. large-batch speed together with small-batch performance:
3.2 Momentum
With momentum, the update does not stop at a critical point but keeps moving:
Plain gradient descent looks like this:
With momentum added, the next step also takes the previous movement into account: the new movement is a combination of the current gradient and the previous movement.
In the figure below, the red line is the gradient, the blue dotted line is the momentum, and the blue solid line is the combination of the two. As you can see, the update may even climb over a small hill and reach the true loss minimum.
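A minimal sketch of gradient descent with momentum (the hyperparameters eta and lam and the toy objective are placeholders, not values from the lecture):

```python
import torch

def gd_momentum(grad_fn, theta, eta=0.01, lam=0.9, steps=100):
    """theta_{t+1} = theta_t + m_{t+1}, where m_{t+1} = lam * m_t - eta * g_t"""
    m = torch.zeros_like(theta)      # previous movement, starts at 0
    for _ in range(steps):
        g = grad_fn(theta)           # current gradient
        m = lam * m - eta * g        # combine previous movement and current gradient
        theta = theta + m            # move
    return theta

# usage: minimize f(x) = x^2, whose gradient is 2x
theta = gd_momentum(lambda x: 2 * x, torch.tensor([3.0]))
print(theta)   # close to 0
```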
4 Adaptive learning rate (Learning Rate)
Earlier, when the loss stopped decreasing, we said it might be a critical point, but it can also be the situation below, where the gradient is not actually small:
The learning rate should be customized for each parameter:
In the original update $\theta^{t+1}_i = \theta^t_i - \eta\, g^t_i$, the learning rate $\eta$ is the same at every iteration and for every parameter. After our modification, $\eta$ becomes $\frac{\eta}{\sigma^t_i}$, i.e. $\theta^{t+1}_i = \theta^t_i - \frac{\eta}{\sigma^t_i}\, g^t_i$; the learning rate is now parameter dependent and iteration dependent.
4.1 The most common choice of $\sigma^t_i$ is the root mean square of the past gradients
This is what Adagrad uses:
When the gradients of a parameter are small, the computed $\sigma^t_i$ is small, so its effective learning rate is large; conversely, when the gradients are large, its learning rate becomes small.
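Written out (my transcription of the slide's notation, which is no longer visible here), the Adagrad-style root mean square and update are:

$$\sigma^t_i = \sqrt{\frac{1}{t+1}\sum_{k=0}^{t}\left(g^k_i\right)^2}, \qquad \theta^{t+1}_i = \theta^t_i - \frac{\eta}{\sigma^t_i}\, g^t_i$$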
4.2 Adjusting the importance of the current gradient: RMSProp
This method uses a hyperparameter $\alpha$ to control how much the current gradient matters:
As shown in the figure below:
If $\alpha$ is set relatively small, $\sigma^t_i$ depends more on the current gradient $g^t_i$. Then, when the surface suddenly changes from flat to steep, $g^t_i$ becomes large, $\sigma^t_i$ grows quickly, and the step immediately becomes smaller; likewise, when the surface turns flat again, $\sigma^t_i$ shrinks quickly and the step becomes larger.
In effect, every step reacts quickly to how the surface is changing.
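My reconstruction of the RMSProp accumulator described above, with $0 < \alpha < 1$:

$$\sigma^0_i = \sqrt{\left(g^0_i\right)^2}, \qquad \sigma^t_i = \sqrt{\alpha\left(\sigma^{t-1}_i\right)^2 + (1-\alpha)\left(g^t_i\right)^2}$$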
4.3 Adam: RMSProp + Momentum
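A hedged usage sketch in PyTorch (the model is a placeholder and the hyperparameters are the library defaults, not values from the lecture):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)   # placeholder model
# betas[0] is the momentum coefficient, betas[1] controls the RMSProp-style accumulator
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
```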
4.4 Learning Rate Scheduling
Let the learning rate $\eta$ change over time, for example by decaying it gradually or by warming it up first:
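A hedged sketch of learning-rate decay with a built-in PyTorch scheduler (the model, optimizer, and schedule length are placeholder choices):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for epoch in range(100):
    # ... one epoch of training (optimizer.step() per batch) would go here ...
    scheduler.step()                       # shrink eta as training progresses
    print(epoch, scheduler.get_last_lr())  # current learning rate
```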
4.5 Summary
Together with Section 3.2, we now have three improvements to gradient descent: momentum, a per-parameter learning rate, and a learning rate that changes over time.
$m^t_i$ and $\sigma^t_i$ do not cancel each other out, because $m^t_i$ keeps the direction of the past gradients, while $\sigma^t_i$ uses only their magnitude.
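Putting the three pieces together (my transcription of the lecture's summary), the update can be written as:

$$\theta^{t+1}_i = \theta^t_i - \frac{\eta^t}{\sigma^t_i}\, m^t_i$$

where $m^t_i$ is the momentum (a direction-aware sum of past gradients), $\sigma^t_i$ is the root mean square of past gradients (magnitude only), and $\eta^t$ is the scheduled learning rate.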
5 The loss function can also matter
In classification, a softmax is usually added to the network output:
For binary classification, a sigmoid is more commonly used, but for two classes the two are in fact equivalent.
Here are the loss functions:
In fact, cross-entropy is the most commonly used loss for classification. In PyTorch, the CrossEntropyLoss function already contains the softmax; the two are bound together.
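A hedged illustration of that binding (the numbers are made up): CrossEntropyLoss expects raw logits and applies the softmax internally, so no explicit softmax layer is needed before it.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 0.5, -1.0]])   # raw network outputs, no softmax applied
target = torch.tensor([0])                  # index of the correct class

loss_fn = nn.CrossEntropyLoss()             # softmax + cross-entropy in one call
print(loss_fn(logits, target))
```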
Why? ?
As the figure shows, when the loss is large, the MSE surface is very flat, so gradient descent cannot move toward the low-loss region and gets stuck; the cross-entropy surface, by contrast, keeps sloping downward all the way.
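A small numeric check of this (my own toy example, not from the lecture): starting from a confidently wrong prediction, cross-entropy still produces a large gradient on the logits, while MSE on the softmax outputs produces an almost-zero gradient.

```python
import torch
import torch.nn.functional as F

target = torch.tensor([0])                                       # the correct class is 0

for name in ["cross_entropy", "mse"]:
    logits = torch.tensor([[-10.0, 10.0]], requires_grad=True)   # a very wrong prediction
    if name == "cross_entropy":
        loss = F.cross_entropy(logits, target)                   # applied to raw logits
    else:
        probs = F.softmax(logits, dim=1)
        one_hot = F.one_hot(target, num_classes=2).float()
        loss = F.mse_loss(probs, one_hot)
    loss.backward()
    print(name, logits.grad)   # CE gradient is about [-1, 1]; MSE gradient is nearly 0
```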
6 Batch Normalization
We would like different parameters to influence the loss on a similar scale, as in the right-hand figure below:
One way to achieve this is feature normalization:
After normalization, every dimension of the features has mean 0 and variance 1.
Generally speaking, feature normalization makes gradient descent converge faster.
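A minimal sketch of feature normalization (standardization) over a batch; the shapes and the made-up scale and shift are placeholders:

```python
import torch

x = torch.randn(64, 10) * 5 + 3             # a batch of 64 examples with 10 features each
mean = x.mean(dim=0, keepdim=True)           # per-dimension mean over the batch
std = x.std(dim=0, keepdim=True)             # per-dimension standard deviation
x_norm = (x - mean) / (std + 1e-8)           # each dimension now has mean 0, variance ~1
print(x_norm.mean(dim=0), x_norm.std(dim=0))
```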
The outputs of the intermediate layers also need to be normalized, and all of these statistics are computed over one batch (hence Batch Normalization):
To allow the output mean to differ from 0 and the variance to differ from 1, two learnable vectors $\gamma$ and $\beta$ are added:
$\gamma$ and $\beta$ are initialized to all ones and all zeros respectively, and are then updated step by step as the network learns. So at the beginning the distributions of the dimensions stay close to each other, and only after the error surface has already improved do $\gamma$ and $\beta$ start to take effect; that is why they are added this way.
At test time, the statistics collected during training (moving averages of the batch mean and standard deviation) are used in place of the batch statistics.
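A hedged sketch of this train/test behaviour with PyTorch's BatchNorm1d (the shapes and the fake data distribution are placeholders): during training it normalizes with the current batch statistics and updates running averages; model.eval() switches it to those stored statistics at test time.

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(10)                 # gamma initialized to 1, beta to 0

bn.train()
for _ in range(100):
    batch = torch.randn(32, 10) * 2 + 5
    _ = bn(batch)                       # uses batch mean/std and updates running stats

bn.eval()
test_x = torch.randn(4, 10) * 2 + 5
out = bn(test_x)                        # uses running_mean / running_var instead
print(bn.running_mean, bn.running_var)
```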
Some well-known normalization variants: