Deep learning training strategy -- warming up the learning rate
2022-07-29 04:14:00 【ytusdc】
Background
The learning rate is one of the hyperparameters that affects performance the most; if we could tune only one hyperparameter, it would be the best choice. In fact, in most cases where the loss becomes NaN, the cause is an improperly chosen learning rate.
1. Where was Warmup originally proposed?
Warmup is a learning-rate warm-up method mentioned in the ResNet paper. It uses a smaller learning rate at the beginning of training; after a number of steps (15000 steps, see code 1 at the end) or epochs (5 epochs, see code 2), the learning rate is switched to the preset value for the rest of training.
Code reference: https://blog.csdn.net/dou3516/article/details/105329103/
2. Why use Warmup?
At the beginning of training, the model weights are randomly initialized. If a large learning rate is chosen at this point, the model may become unstable (oscillate). Using Warmup to warm up the learning rate keeps the learning rate small during the first few epochs or steps; under this small warm-up learning rate the model gradually stabilizes, and once it is relatively stable the preset learning rate is used for training. This makes the model converge faster and end up with better performance.
Example: the ResNet paper trains a 110-layer ResNet on CIFAR-10 by first using a learning rate of 0.01 until the training error drops below 80% (roughly 400 steps), and then continuing with a learning rate of 0.1.
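As a rough illustration (not code from the paper or the linked post), here is a minimal Python sketch of constant warmup; the function name and default values (warmup_steps=400, warmup_lr=0.01, base_lr=0.1) follow the ResNet example above but are otherwise placeholders:

def constant_warmup_lr(step, warmup_steps=400, warmup_lr=0.01, base_lr=0.1):
    """Constant warmup: a small fixed learning rate for the first warmup_steps
    updates, then a switch to the preset base learning rate."""
    return warmup_lr if step < warmup_steps else base_lr

# Example usage: query the schedule at each training step.
for step in range(1000):
    lr = constant_warmup_lr(step)
    # apply lr to the optimizer here, e.g. optimizer.learning_rate = lr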
3. Improvements to Warmup
The learning-rate schedule becomes "rising -> preset learning rate -> falling".
The Warmup described in the two sections above is constant warmup. Its drawback is that jumping from a small learning rate to a large one may cause a sudden spike in the training error. To solve this, Facebook proposed gradual warmup in 2018: starting from a small initial learning rate, the learning rate is increased slightly at every step until it reaches the relatively large preset value, after which training continues with that preset learning rate.
Analysis: Because a neural network is very unstable at the start of training, the initial learning rate should be set very low to ensure that the network converges well. But a low learning rate makes training very slow, so the learning rate is gradually raised from low to high; this "warm-up" of network training is called the warmup stage. However, if we want the training loss to reach its minimum, it is not appropriate to keep using a high learning rate throughout, because it makes the weight gradients oscillate back and forth and makes it hard for the training loss to reach the global minimum. Therefore, once the warmup stage reaches the relatively large preset learning rate, training continues with that learning rate; in the code below it then decays following a cosine schedule, and this stage can be called the cosine decay stage.
Relevant source code from the tf-yolov3 author:
with tf.name_scope('learn_rate'):
    self.global_step = tf.Variable(1.0, dtype=tf.float64, trainable=False, name='global_step')
    warmup_steps = tf.constant(self.warmup_periods * self.steps_per_period,
                               dtype=tf.float64, name='warmup_steps')  # warmup_periods epochs
    train_steps = tf.constant((self.first_stage_epochs + self.second_stage_epochs) * self.steps_per_period,
                              dtype=tf.float64, name='train_steps')
    self.learn_rate = tf.cond(
        pred=self.global_step < warmup_steps,
        true_fn=lambda: self.global_step / warmup_steps * self.learn_rate_init,
        false_fn=lambda: self.learn_rate_end + 0.5 * (self.learn_rate_init - self.learn_rate_end) * (
            1 + tf.cos((self.global_step - warmup_steps) / (train_steps - warmup_steps) * np.pi)))
    global_step_update = tf.assign_add(self.global_step, 1.0)
"""
The training is divided into two stages , In the first stage, another section is divided as “ Warm up phase ”:
Warm up phase :learn_rate = (global_step / warmup_steps) * learn_rate_init
Other stages :learn_rate_end + 0.5 * (learn_rate_init - learn_rate_end) * (
1 + tf.cos((global_step - warmup_steps) / (train_steps - warmup_steps) * np.pi))
"""
Learning rate curve (figure omitted): linear rise during warm-up, then cosine decay down to learn_rate_end.
4. Application scenarios
(1) Training produces NaN: when the network very easily produces NaN, training with warm-up allows it to train normally;
(2) Overfitting: the training loss is very low and training accuracy high, but the test loss is large and test accuracy low; warm-up can be used. For details, see: ResNet-18 training experiments -- why the warm-up operation is effective.
Why warm-up works has not been fully proven yet; the effects observed so far are:
(1) it helps slow down the model's early overfitting to the first mini-batches and keeps the distribution stable;
(2) it helps maintain the stability of the model's deeper layers.
The following situations occur during training:
(1) at the beginning of training, the model weights change rapidly;
(2) when the mini-batch size is small, the sample variance is large.
Analysis of why warm-up is effective:
(1) First, at the very beginning the model's understanding of the data "distribution" is essentially zero, or rather "uniform" (initialization is generally based on a uniform distribution). During the first epoch every sample is new to the model, and as training proceeds the model corrects its view of the data distribution quickly; if the learning rate is very high at this point, the model is likely to overfit right at the start and will need many later epochs to pull itself back. After training for a while (for example one or two epochs), the model has seen each sample several times, or at least has some correct prior about the current batches, so a higher learning rate is no longer so likely to bias what the model learns, and the learning rate can be raised appropriately. This process is warmup.
Then why does the learning rate decrease in the later stage? During normal training, a lower learning rate helps the model converge better. Once the model has learned to a certain extent, its distribution is relatively stable; if a large learning rate is still used, it will destroy this stability and the network will fluctuate greatly. At this point the model is very close to the optimum, and to get close to it we need a small learning rate.
(2) The second reason is that if the variance of the data distribution within a mini-batch is particularly large, model learning will fluctuate violently and the learned weights will be very unstable; this is most obvious at the beginning of training and is relatively relieved in the final phase.
So, for the above two reasons, we cannot simply cut the learning rate by several times at will;
In the ResNet paper it is noted that if a large learning rate is used from the very start, training will still converge eventually, but the test accuracy will not improve afterwards; with warm-up, accuracy can keep improving after convergence. In other words, training with and without warm-up converges to different points, which affects the best performance the model can later reach: without warm-up, the model converges to a worse point than with warm-up. This also shows that if the wrong weights are learned right at the start, they cannot be pulled back.
So why did earlier neural networks not use the warm-up technique?
The main reasons are:
(1) earlier networks were not large or deep enough;
(2) datasets were generally small.
5. Summary
Using Warmup to warm up the learning rate means training first with a small initial learning rate, increasing it slightly at every step until it reaches the relatively large preset learning rate (note: at this point the learning-rate warm-up is complete), and then continuing training with the preset learning rate (note: during training after warm-up, the learning rate decays). This helps the model converge faster and achieve better results.
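As a practical note, here is a minimal PyTorch sketch of the same idea (an illustration, not code from any of the sources cited above): gradual warm-up followed by cosine decay expressed with a LambdaLR scheduler. The model, warmup_steps and total_steps below are placeholders.

import math
import torch

model = torch.nn.Linear(10, 2)                            # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # lr here is the peak (preset) learning rate
warmup_steps, total_steps = 500, 10000

def lr_factor(step):
    # Linear warm-up from ~0 to 1, then cosine decay from 1 towards 0.
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

for step in range(total_steps):
    # ... forward pass, loss.backward(), optimizer.step() ...
    scheduler.step()                                      # update the learning rate every step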