Deep learning training strategy -- warming up the learning rate
2022-07-29 04:14:00 【ytusdc】
Background
The learning rate is one of the hyperparameters that affects performance the most; if we could tune only one hyperparameter, it would be the best choice. In fact, in most cases where the loss becomes NaN, the cause is an improperly chosen learning rate.
1. Where was Warmup originally proposed?
Warmup is a learning-rate warm-up method mentioned in the ResNet paper. It uses a smaller learning rate at the beginning of training; after a number of steps (15000 steps, see code 1 at the end) or epochs (5 epochs, see code 2), the learning rate is switched to the preset value for the rest of training.
Code reference: https://blog.csdn.net/dou3516/article/details/105329103/
2. Why use Warmup?
At the beginning of training, the model weights are randomly initialized. If a large learning rate is chosen at this point, the model may become unstable (oscillate). Using Warmup to warm up the learning rate keeps the learning rate small during the first few epochs or steps; under this small warm-up learning rate the model gradually stabilizes, and once it is relatively stable the preset learning rate is used for training. This makes the model converge faster and end up with better performance.
Example: the ResNet paper trains a 110-layer ResNet on CIFAR-10 by first using a learning rate of 0.01 until the training error drops below 80% (roughly 400 steps), and then continuing with a learning rate of 0.1.
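As a rough illustration (not code from the paper or the linked post), here is a minimal Python sketch of constant warmup; the function name and default values (warmup_steps=400, warmup_lr=0.01, base_lr=0.1) follow the ResNet example above but are otherwise placeholders:

def constant_warmup_lr(step, warmup_steps=400, warmup_lr=0.01, base_lr=0.1):
    """Constant warmup: a small fixed learning rate for the first warmup_steps
    updates, then a switch to the preset base learning rate."""
    return warmup_lr if step < warmup_steps else base_lr

# Example usage: query the schedule at each training step.
for step in range(1000):
    lr = constant_warmup_lr(step)
    # apply lr to the optimizer here, e.g. optimizer.learning_rate = lr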
3. Improvements to Warmup
The learning-rate schedule becomes "rising -> preset learning rate -> falling".
The Warmup described in the two sections above is constant warmup. Its drawback is that jumping from a small learning rate to a large one may cause a sudden spike in the training error. To solve this, Facebook proposed gradual warmup in 2018: starting from a small initial learning rate, the learning rate is increased slightly at every step until it reaches the relatively large preset value, after which training continues with that preset learning rate.
Analysis: Because a neural network is very unstable at the start of training, the initial learning rate should be set very low to ensure that the network converges well. But a low learning rate makes training very slow, so the learning rate is gradually raised from low to high; this "warm-up" of network training is called the warmup stage. However, if we want the training loss to reach its minimum, it is not appropriate to keep using a high learning rate throughout, because it makes the weight gradients oscillate back and forth and makes it hard for the training loss to reach the global minimum. Therefore, once the warmup stage reaches the relatively large preset learning rate, training continues with that learning rate; in the code below it then decays following a cosine schedule, and this stage can be called the cosine decay stage.
Relevant source code from the tf-yolov3 author:
with tf.name_scope('learn_rate'):
    self.global_step = tf.Variable(1.0, dtype=tf.float64, trainable=False, name='global_step')
    warmup_steps = tf.constant(self.warmup_periods * self.steps_per_period,
                               dtype=tf.float64, name='warmup_steps')  # warmup_periods epochs
    train_steps = tf.constant((self.first_stage_epochs + self.second_stage_epochs) * self.steps_per_period,
                              dtype=tf.float64, name='train_steps')
    self.learn_rate = tf.cond(
        pred=self.global_step < warmup_steps,
        true_fn=lambda: self.global_step / warmup_steps * self.learn_rate_init,
        false_fn=lambda: self.learn_rate_end + 0.5 * (self.learn_rate_init - self.learn_rate_end) * (
            1 + tf.cos((self.global_step - warmup_steps) / (train_steps - warmup_steps) * np.pi)))
    global_step_update = tf.assign_add(self.global_step, 1.0)
"""
The training is divided into two stages , In the first stage, another section is divided as “ Warm up phase ”:
Warm up phase :learn_rate = (global_step / warmup_steps) * learn_rate_init
Other stages :learn_rate_end + 0.5 * (learn_rate_init - learn_rate_end) * (
1 + tf.cos((global_step - warmup_steps) / (train_steps - warmup_steps) * np.pi))
"""
Learning rate curve (figure omitted): linear rise during warm-up, then cosine decay down to learn_rate_end.
4. Application scenarios
(1) Training produces NaN: when the network very easily produces NaN, training with warm-up allows it to train normally;
(2) Overfitting: the training loss is very low and training accuracy high, but the test loss is large and test accuracy low; warm-up can be used. For details, see: ResNet-18 training experiments -- why the warm-up operation is effective.
Why warm-up works has not been fully proven yet; the effects observed so far are:
(1) it helps slow down the model's early overfitting to the first mini-batches and keeps the distribution stable;
(2) it helps maintain the stability of the model's deeper layers.
The following situations occur during training:
(1) at the beginning of training, the model weights change rapidly;
(2) when the mini-batch size is small, the sample variance is large.
Analysis of why warm-up is effective:
(1) First, at the very beginning the model's understanding of the data "distribution" is essentially zero, or rather "uniform" (initialization is generally based on a uniform distribution). During the first epoch every sample is new to the model, and as training proceeds the model corrects its view of the data distribution quickly; if the learning rate is very high at this point, the model is likely to overfit right at the start and will need many later epochs to pull itself back. After training for a while (for example one or two epochs), the model has seen each sample several times, or at least has some correct prior about the current batches, so a higher learning rate is no longer so likely to bias what the model learns, and the learning rate can be raised appropriately. This process is warmup.
Then why does the learning rate decrease in the later stage? During normal training, a lower learning rate helps the model converge better. Once the model has learned to a certain extent, its distribution is relatively stable; if a large learning rate is still used, it will destroy this stability and the network will fluctuate greatly. At this point the model is very close to the optimum, and to get close to it we need a small learning rate.
(2) The second reason is that if the variance of the data distribution within a mini-batch is particularly large, model learning will fluctuate violently and the learned weights will be very unstable; this is most obvious at the beginning of training and is relatively relieved in the final phase.
So, for the above two reasons, we cannot simply cut the learning rate by several times at will;
In the ResNet paper it is noted that if a large learning rate is used from the very start, training will still converge eventually, but the test accuracy will not improve afterwards; with warm-up, accuracy can keep improving after convergence. In other words, training with and without warm-up converges to different points, which affects the best performance the model can later reach: without warm-up, the model converges to a worse point than with warm-up. This also shows that if the wrong weights are learned right at the start, they cannot be pulled back.
So why did earlier neural networks not use the warm-up technique?
The main reasons are:
(1) earlier networks were not large or deep enough;
(2) datasets were generally small.
5. Summary
Using Warmup to warm up the learning rate means training first with a small initial learning rate, increasing it slightly at every step until it reaches the relatively large preset learning rate (note: at this point the learning-rate warm-up is complete), and then continuing training with the preset learning rate (note: during training after warm-up, the learning rate decays). This helps the model converge faster and achieve better results.
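As a practical note, here is a minimal PyTorch sketch of the same idea (an illustration, not code from any of the sources cited above): gradual warm-up followed by cosine decay expressed with a LambdaLR scheduler. The model, warmup_steps and total_steps below are placeholders.

import math
import torch

model = torch.nn.Linear(10, 2)                            # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # lr here is the peak (preset) learning rate
warmup_steps, total_steps = 500, 10000

def lr_factor(step):
    # Linear warm-up from ~0 to 1, then cosine decay from 1 towards 0.
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

for step in range(total_steps):
    # ... forward pass, loss.backward(), optimizer.step() ...
    scheduler.step()                                      # update the learning rate every step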