Tricks | [Trick 6] Learning rate scheduling strategies in YOLOv5 (One Cycle Policy, cosine annealing, etc.)
2022-06-09 05:29:00 【Clichong】
If there is a mistake, please point it out.
Adjusting the learning rate has always been a tricky problem. YOLOv5 offers two ways to schedule the learning rate: one is a linear schedule, the other is the One Cycle Policy. While searching for material I also came across other learning-rate scheduling strategies, which this note summarizes.
These include: LR Range Test, Cyclical LR, One Cycle Policy, SGDR, AdamW, SGDW, and PyTorch's implementation of cosine annealing. For the details of each strategy, see the references.
0. YOLOv5 learning rate schedules
YOLOv5's code provides two learning rate schedules: a linear schedule and a One Cycle schedule.
The relevant code is short, as shown below:
```python
# Scheduler
if opt.linear_lr:
    lf = lambda x: (1 - x / (epochs - 1)) * (1.0 - hyp['lrf']) + hyp['lrf']  # linear
else:
    lf = one_cycle(1, hyp['lrf'], epochs)  # cosine 1->hyp['lrf']
scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda=lf)  # plot_lr_scheduler(optimizer, scheduler, epochs)
```
The auxiliary plotting function plot_lr_scheduler plots how the learning rate changes with the epochs under each schedule. Below I also wrap it in a function plot_lr that is more convenient to call with a given lf.
Reference code:
```python
from copy import copy
from pathlib import Path

import matplotlib.pyplot as plt

def plot_lr_scheduler(optimizer, scheduler, epochs=300, save_dir=''):
    # Plot LR simulating training for full epochs
    optimizer, scheduler = copy(optimizer), copy(scheduler)  # do not modify originals
    y = []
    for _ in range(epochs):
        scheduler.step()
        y.append(optimizer.param_groups[0]['lr'])
    plt.plot(y, '.-', label='LR')
    plt.xlabel('epoch')
    plt.ylabel('LR')
    plt.grid()
    plt.xlim(0, epochs)
    plt.ylim(0)
    plt.savefig(Path(save_dir) / 'LR.png', dpi=200)
    plt.close()
```
```python
import torch
import torch.nn as nn
from torch.optim import SGD, lr_scheduler

from models.yolo import Model  # YOLOv5's model class (assumed import path)

# Function: plot the learning rate over epochs for a given schedule lf
def plot_lr(lf, epochs=30):
    # load model
    weight = r"./runs/train/mask/weights/last.pt"
    device = torch.device('cpu')
    ckpt = torch.load(weight, map_location=device)
    model = Model(ckpt['model'].yaml, ch=3, nc=3, anchors=None).to(device)
    model.load_state_dict(ckpt['model'].state_dict())

    # optimizer parameter groups
    g0, g1, g2 = [], [], []
    for v in model.modules():
        if hasattr(v, 'bias') and isinstance(v.bias, nn.Parameter):  # bias
            g2.append(v.bias)
        if isinstance(v, nn.BatchNorm2d):  # weight (no decay)
            g0.append(v.weight)
        elif hasattr(v, 'weight') and isinstance(v.weight, nn.Parameter):  # weight (with decay)
            g1.append(v.weight)
    optimizer = SGD(g0, lr=0.01, momentum=0.937, nesterov=True)
    optimizer.add_param_group({'params': g1, 'weight_decay': 0.0005})  # add g1 with weight_decay
    optimizer.add_param_group({'params': g2})  # add g2 (biases)

    scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda=lf)
    plot_lr_scheduler(optimizer, scheduler, epochs, save_dir='./runs/test')
    print('plot succeeded')
```
Let's use these functions to look at the linear and One Cycle learning rate curves.
- Linear learning rate curve
Plotting code:
```python
if __name__ == '__main__':
    epochs = 30
    lrf = 0.1
    lf = lambda x: (1 - x / (epochs - 1)) * (1.0 - lrf) + lrf
    plot_lr(lf, epochs)
```
Learning rate curve:
- One Cycle learning rate curve
Plotting code:
```python
import math

def one_cycle(y1=0.0, y2=1.0, steps=100):
    # lambda function for sinusoidal ramp from y1 to y2 https://arxiv.org/pdf/1812.01187.pdf
    return lambda x: ((1 - math.cos(x * math.pi / steps)) / 2) * (y2 - y1) + y1

if __name__ == '__main__':
    epochs = 30
    lf = one_cycle(1, 0.1, 30)  # cosine 1 -> hyp['lrf']
    plot_lr(lf, epochs)
```
Learning rate curve:
Analysis: the One Cycle schedule in YOLOv5 decays the learning rate along a cosine from lr0 = 0.01 down to lr0 * lrf = 0.01 * 0.1 = 0.001. Once you understand the so-called One Cycle Policy (section 3 below), YOLOv5's learning rate curve shows that it does not fully follow the One Cycle Policy shape; it is closer to the common cosine annealing strategy.
The rest of this note is a theoretical analysis and summary of the various learning rate scheduling methods.
1. LR Range Test
Leslie N. Smith proposed this technique in 2015. Its core idea is to run the model for a number of iterations: start with a learning rate small enough, gradually increase it as the iterations proceed, record the loss at each learning rate, and plot the result. (The initial LR is as small as 1e-7 and is then increased up to 10.)

An LR Range Test plot contains three regions: in the first, the learning rate is so small that the loss hardly decreases; in the second, the loss converges quickly; in the last, the learning rate is so high that the loss starts to diverge. The learning rate range of the second region is therefore what we should use in training.
So, just as its name says, this method is a learning rate range test: it finds a suitable learning rate range for training.
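A minimal sketch of such a test, assuming you supply your own model, criterion, train_loader and optimizer (all hypothetical placeholders here): the learning rate grows exponentially at each step while the loss is recorded.

```python
def lr_range_test(model, criterion, train_loader, optimizer,
                  lr_start=1e-7, lr_end=10.0, num_steps=100):
    # Grow the LR exponentially from lr_start to lr_end over num_steps
    # steps, recording the training loss observed at each LR.
    gamma = (lr_end / lr_start) ** (1.0 / num_steps)
    lr, lrs, losses = lr_start, [], []
    data_iter = iter(train_loader)
    for _ in range(num_steps):
        for group in optimizer.param_groups:
            group['lr'] = lr
        try:
            x, y = next(data_iter)
        except StopIteration:  # restart the loader if it runs out
            data_iter = iter(train_loader)
            x, y = next(data_iter)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        lrs.append(lr)
        losses.append(loss.item())
        lr *= gamma
    return lrs, losses  # plot losses vs. lrs on a log-scaled x axis
```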
2. Cyclical LR
In classical methods the learning rate only decreases gradually so that the model converges stably, but Leslie Smith questioned this. He argued that letting the learning rate vary cyclically within a reasonable range (i.e., Cyclical LR: the learning rate cycles between a base lr and max_lr) is more reasonable, and that it can reach a given accuracy in fewer steps.

As shown in the figure above, max_lr and the base lr can be determined by an LR Range Test. The author's reasoning: the optimal learning rate lies somewhere in this range, so if the learning rate keeps varying within the range, it will be close to the optimal value most of the time.
Summary:
- Cyclical LR is an effective way to escape saddle points: gradients near a saddle point are small, and raising the learning rate lets the model get out of the plateau.
- Cyclical LR can speed up the training process.
- Cyclical LR can improve the generalization ability of the model to some extent (it drives the model into a flat minimum region).
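PyTorch provides this schedule as torch.optim.lr_scheduler.CyclicLR; below is a minimal sketch with made-up bounds (in practice base_lr and max_lr would come from an LR Range Test).

```python
from torch import nn
from torch.optim import SGD, lr_scheduler

model = nn.Linear(10, 2)  # tiny stand-in model
optimizer = SGD(model.parameters(), lr=0.001, momentum=0.9)

# LR bounces between base_lr and max_lr in a triangular wave;
# step_size_up is the number of iterations spent rising from base to max.
scheduler = lr_scheduler.CyclicLR(optimizer, base_lr=0.001, max_lr=0.01,
                                  step_size_up=100, mode='triangular')

for step in range(400):
    optimizer.step()    # forward/backward omitted for brevity
    scheduler.step()    # CyclicLR is stepped per batch, not per epoch
```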
3. One Cycle Policy
Building on Cyclical LR and the LR Range Test, Leslie Smith kept improving the idea and proposed the 1cycle policy: a cyclical learning rate schedule whose number of cycles is set to 1. In the one cycle strategy, the maximum learning rate is the highest value found by the LR Range Test, and the minimum learning rate is an order of magnitude or more smaller (for example, 0.1 times the maximum).

As shown above, the whole training run is about 400 iterations: the first ~175 iterations are used for warm-up, the middle ~175 iterations anneal back down to the initial learning rate, and the last few dozen iterations decay the learning rate further. We call these three processes the three phases.
- Phase 1, linear warm-up: works like an ordinary warm-up, preventing the problems caused by a cold start.
- Phase 2, linear decay back to the initial learning rate: because the model spends a considerable amount of time at a high learning rate in the first two phases, the author argues this acts as a regularizer, keeping the model away from sharp minima and biasing it toward a flat local minimum.
- Phase 3, decay of the learning rate toward 0: this makes the model converge, inside that 'flat' region, to a relatively 'sharp' local minimum.

The figure above shows the loss of a model trained with the 1cycle policy on the training and validation sets. Here the learning rate rises from 0.08 to 0.8 between epochs 0 and 41, returns to 0.08 between epochs 41 and 82, and drops to one hundredth of 0.08 over the last few epochs. While the learning rate is high, the validation loss becomes unstable, but on average the gap between validation loss and training loss does not change much, suggesting that what the model learns in this phase generalizes well (i.e., the large learning rate acts as a regularizer to some extent). At the end of training, as the learning rate falls, the training loss drops sharply while the validation loss does not, and the gap between the two widens; so by the end of training the model starts to overfit to some degree.

In this figure, the learning rate rises from 0.15 to 3 between epochs 0 and 22.5, returns to 0.15 between epochs 22.5 and 45, and drops to one hundredth of 0.15 over the last few epochs. With a very high learning rate we can learn faster while also preventing overfitting. Until the learning rate is annealed away, the gap between validation loss and training loss stays very small. This is the super-convergence phenomenon Leslie Smith describes. With this technique, a ResNet-56 can be trained on CIFAR-10 to 92.3% accuracy within 50 epochs; a 70-epoch cycle reaches 93% accuracy.
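PyTorch also ships this policy as torch.optim.lr_scheduler.OneCycleLR; a minimal sketch with made-up values (max_lr would come from an LR Range Test):

```python
from torch import nn
from torch.optim import SGD, lr_scheduler

model = nn.Linear(10, 2)  # tiny stand-in model
optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9)

# three_phase=True reproduces the shape described above: warm-up for
# pct_start of the steps, a symmetric anneal back to the initial LR,
# then a final decay far below it (controlled by final_div_factor).
scheduler = lr_scheduler.OneCycleLR(optimizer, max_lr=0.8, total_steps=400,
                                    pct_start=0.45, three_phase=True)

for step in range(400):
    optimizer.step()    # forward/backward omitted for brevity
    scheduler.step()    # stepped once per batch
```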
- Cyclical momentum
Reference 3 also mentions cyclical momentum, a method of varying momentum periodically.
Along with the shift to higher learning rates, Leslie Smith found in his experiments that lower momentum gave better results. This supports the intuition that, in that part of training, we want SGD to head quickly in a new direction to find a flatter region, so the new gradients should be given more weight. In practice he suggests picking two values such as 0.85 and 0.95: while the learning rate rises, the momentum decreases from the higher value to the lower one, and as the learning rate falls, the momentum returns to the higher value. As shown below:

According to Leslie, hunting for the single best momentum value for the whole run would give the same final result, but cyclical momentum removes the hassle of trying multiple values and running several full cycles, which wastes precious time.
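Incidentally, the suggested 0.85/0.95 pair matches the default base_momentum/max_momentum of PyTorch's OneCycleLR, which cycles momentum inversely to the learning rate when cycle_momentum=True; a minimal sketch:

```python
from torch import nn
from torch.optim import SGD, lr_scheduler

model = nn.Linear(10, 2)  # tiny stand-in model
optimizer = SGD(model.parameters(), lr=0.01, momentum=0.95)

# With cycle_momentum=True (the default), momentum falls from
# max_momentum to base_momentum while the LR rises, then climbs back
# up as the LR decays again.
scheduler = lr_scheduler.OneCycleLR(optimizer, max_lr=0.8, total_steps=400,
                                    cycle_momentum=True,
                                    base_momentum=0.85, max_momentum=0.95)
```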
- Summary:
The meaning of One Cycle Policy is visible in its name: the learning rate change is divided into 3 phases but there is only one cycle, hence a 1-cycle schedule. Looking again at YOLOv5's learning rate curve, it does not fully follow the One Cycle Policy shape and is closer to the common cosine annealing strategy.
4. SGDR
See Reference 2 for the source.
SGDR is SGD with warm restarts. In principle SGDR is essentially similar to CLR: the learning rate keeps changing throughout training.

Here, an aggressive annealing strategy (cosine annealing) is combined with a restart schedule. The restart is a 'warm' restart because the model is not restarted from scratch: after the learning rate is reset, the parameters from before the restart are used as the initial solution. This is very simple to implement, since nothing needs to be done to the model; only the learning rate is updated.
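For reference, within each run the SGDR paper anneals the learning rate with the cosine formula below, where $\eta_{max}$ and $\eta_{min}$ are the learning rate bounds, $T_{cur}$ counts the epochs since the last restart, and $T_i$ is the length of the current run:

$$\eta_t = \eta_{min} + \frac{1}{2}\left(\eta_{max} - \eta_{min}\right)\left(1 + \cos\left(\frac{T_{cur}}{T_i}\pi\right)\right)$$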
To this day, adaptive optimizers such as Adam remain the fastest way to train deep neural networks. However, many of the best solutions on various benchmarks and many Kaggle winning solutions still use SGD, on the grounds that the local minima found by Adam generalize worse.
SGDR combines the two: a quick 'warm' restart to a higher learning rate followed by aggressive annealing lets the model learn as fast as Adam (or even faster) while keeping the generalization ability of plain SGD.
5. AdamW and SGDW
See References 1 and 2 for the source (Reference 1 may cover this part in more detail).
The 'warm' restart strategy works well, and changing the learning rate during training seems feasible. So why didn't the previous paper extend it to an 'AdamR'? The authors of the paper 'Fixing Weight Decay Regularization in Adam' put it this way:
Although our initial version of Adam with warm restarts performed better than Adam, it was still not competitive with SGD with warm restarts.
The authors make the following points in the paper:
- L2 regularization and weight decay are not the same thing.
- L2 regularization is not effective in Adam.
- Weight decay is effective in both Adam and SGD.
- In SGD, a reparameterization makes L2 regularization and weight decay equivalent.
- Mainstream libraries implement weight decay as L2 regularization for both SGD and Adam.
They proposed AdamW and SGDW, two methods that decouple the weight decay step from the L2 regularization step.
With the new AdamW, the authors show that AdamW (and its warm-restart variant AdamWR) is comparable to SGDWR in both speed and performance.
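Decoupled weight decay is now available directly as torch.optim.AdamW; a minimal sketch with made-up hyperparameters:

```python
import torch
from torch import nn

model = nn.Linear(10, 2)  # tiny stand-in model

# In AdamW, weight_decay shrinks the weights directly at each step,
# decoupled from the gradient-based Adam update; in plain Adam the same
# argument is instead added to the gradient as L2 regularization.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
```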
I am still somewhat confused by this part and do not fully understand it.
6. PyTorch cosine annealing learning rate schedules
See Reference 4 for details.
As a side note, PyTorch itself also implements the cosine annealing strategy, mainly via two schedulers: CosineAnnealingLR and CosineAnnealingWarmRestarts.
- CosineAnnealingLR
This one is simpler; only the key parameter T_max needs explaining: it can be understood as the half period of the cosine function. If max_epoch=50, setting T_max=5 makes the learning rate go through the cosine cycle 5 times.
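A minimal sketch of CosineAnnealingLR with made-up values, matching the max_epoch=50 / T_max=5 example above:

```python
from torch import nn
from torch.optim import SGD, lr_scheduler

model = nn.Linear(10, 2)  # tiny stand-in model
optimizer = SGD(model.parameters(), lr=0.01)

# T_max is the half period: the LR anneals from 0.01 down to eta_min over
# 5 epochs, rises back over the next 5, and so on (5 full periods in 50).
scheduler = lr_scheduler.CosineAnnealingLR(optimizer, T_max=5, eta_min=0.001)

for epoch in range(50):
    optimizer.step()    # training loop omitted
    scheduler.step()    # stepped once per epoch
```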

- CosineAnnealingWarmRestarts
It has two main parameters: T_0 is the epoch at which the learning rate first returns to its initial value, and T_mult controls how quickly the restart period grows.
If $T_{mult}=1$, the learning rate returns to its maximum (the initial learning rate) at epochs $T_0$, $2T_0$, $3T_0$, $\dots$, $iT_0$, $\dots$;

If $T_{mult}>1$, the learning rate returns to its maximum at epochs $T_0$, $(1+T_{mult})T_0$, $(1+T_{mult}+T_{mult}^2)T_0$, $\dots$, $(1+T_{mult}+T_{mult}^2+\dots+T_{mult}^i)T_0$, $\dots$.
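A minimal sketch of CosineAnnealingWarmRestarts with made-up values:

```python
from torch import nn
from torch.optim import SGD, lr_scheduler

model = nn.Linear(10, 2)  # tiny stand-in model
optimizer = SGD(model.parameters(), lr=0.01)

# First restart after T_0 = 10 epochs; with T_mult = 2 each subsequent run
# doubles in length, so the LR returns to 0.01 at epochs 10, 30, 70, ...
scheduler = lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10,
                                                     T_mult=2, eta_min=0.001)

for epoch in range(70):
    optimizer.step()    # training loop omitted
    scheduler.step()    # stepped once per epoch
```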

References:
2. Since the appearance of Adam, what has changed in deep learning optimizers?