
Tricks | [trick6] Learning rate adjustment strategies in yolov5 (One Cycle Policy, cosine annealing, etc.)

2022-06-09 05:29:00 Clichong


If there are any mistakes, please point them out.



Adjusting the learning rate has always been a tricky problem. yolov5 offers two ways to adjust it: one is a linear schedule, the other is the One Cycle Policy. While searching for material I also learned about other learning rate adjustment strategies, which I summarize in this note.

They include: LR Range Test, Cyclical LR, One Cycle Policy, SGDR, AdamW, SGDW, and PyTorch's implementation of the cosine annealing strategy. For the details of each strategy, see the references.


0. Yolov5 learning rate adjustment schemes

The yolov5 code provides two learning rate adjustment schemes: a linear learning rate and the One Cycle learning rate.

The relevant code is simple, as shown below:

# Scheduler
if opt.linear_lr:
    lf = lambda x: (1 - x / (epochs - 1)) * (1.0 - hyp['lrf']) + hyp['lrf']  # linear
else:
    lf = one_cycle(1, hyp['lrf'], epochs)  # cosine 1->hyp['lrf']
scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda=lf)  # plot_lr_scheduler(optimizer, scheduler, epochs)

The auxiliary plotting function plot_lr_scheduler draws how the learning rate changes with epochs under each of the two strategies. Below I rewrite it as a function plot_lr that is more convenient to call with a given lf.

Reference code :

from copy import copy
from pathlib import Path

import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from torch.optim import SGD, lr_scheduler

from models.yolo import Model  # yolov5 model class


def plot_lr_scheduler(optimizer, scheduler, epochs=300, save_dir=''):
    # Plot LR simulating training for full epochs
    optimizer, scheduler = copy(optimizer), copy(scheduler)  # do not modify originals
    y = []
    for _ in range(epochs):
        scheduler.step()
        y.append(optimizer.param_groups[0]['lr'])
    plt.plot(y, '.-', label='LR')
    plt.xlabel('epoch')
    plt.ylabel('LR')
    plt.grid()
    plt.xlim(0, epochs)
    plt.ylim(0)
    plt.savefig(Path(save_dir) / 'LR.png', dpi=200)
    plt.close()


# Plot the curve of learning rate against epoch under the schedule lf
def plot_lr(lf, epochs=30):
    # load model
    weight = r"./runs/train/mask/weights/last.pt"
    device = torch.device('cpu')
    ckpt = torch.load(weight, map_location=device)
    model = Model(ckpt['model'].yaml, ch=3, nc=3, anchors=None).to(device)
    model.load_state_dict(ckpt['model'].state_dict())

    # optimizer parameter groups: g0 = BN weights (no decay), g1 = other weights (with decay), g2 = biases
    g0, g1, g2 = [], [], []
    for v in model.modules():
        if hasattr(v, 'bias') and isinstance(v.bias, nn.Parameter):  # bias
            g2.append(v.bias)
        if isinstance(v, nn.BatchNorm2d):  # weight (no decay)
            g0.append(v.weight)
        elif hasattr(v, 'weight') and isinstance(v.weight, nn.Parameter):  # weight (with decay)
            g1.append(v.weight)

    optimizer = SGD(g0, lr=0.01, momentum=0.937, nesterov=True)
    optimizer.add_param_group({'params': g1, 'weight_decay': 0.0005})  # add g1 with weight_decay
    optimizer.add_param_group({'params': g2})  # add g2 (biases)

    scheduler = lr_scheduler.LambdaLR(optimizer, lr_lambda=lf)
    plot_lr_scheduler(optimizer, scheduler, epochs, save_dir='./runs/test')
    print('plot succeeded')

Now let's use the functions above to look at the linear and One Cycle learning rate curves.

  • Linear learning rate curve

Code to draw the curve:

if __name__ == '__main__':
    epochs = 30
    lrf = 0.1  # final LR as a fraction of the initial LR
    lf = lambda x: (1 - x / (epochs - 1)) * (1.0 - lrf) + lrf  # linear decay from 1.0 to lrf
    plot_lr(lf, epochs)

Learning rate curve:
[Figure: linear learning rate curve]

  • OneCycle Learning rate curve

Code to draw the curve:

import math

def one_cycle(y1=0.0, y2=1.0, steps=100):
    # lambda function for sinusoidal ramp from y1 to y2 https://arxiv.org/pdf/1812.01187.pdf
    return lambda x: ((1 - math.cos(x * math.pi / steps)) / 2) * (y2 - y1) + y1

if __name__ == '__main__':
    epochs = 30
    lf = one_cycle(1, 0.1, 30)  # cosine 1->hyp['lrf']
    plot_lr(lf, epochs)

Learning rate curve:
[Figure: One Cycle learning rate curve]
Analysis: the One Cycle schedule here decays the learning rate along a cosine curve from lr0 = 0.01 down to lr0 * lrf = 0.01 * 0.1 = 0.001. Once you understand the One Cycle Policy described later, you can see from yolov5's learning rate curve that it does not fully follow the One Cycle Policy shape; it leans more toward the common cosine annealing strategy.
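
As a quick check of that arithmetic, the lambda returned by one_cycle can be evaluated at both ends of training (a standalone snippet; lr0 = 0.01 matches yolov5's default initial learning rate in its hyperparameter files):

import math

def one_cycle(y1=0.0, y2=1.0, steps=100):
    # same cosine ramp as in the yolov5 code above
    return lambda x: ((1 - math.cos(x * math.pi / steps)) / 2) * (y2 - y1) + y1

lr0, lrf, epochs = 0.01, 0.1, 30
lf = one_cycle(1, lrf, epochs)
print(lr0 * lf(0), lr0 * lf(epochs))  # ~0.01 at the start, ~0.001 at the end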

The rest of this note is a theoretical analysis, introduction, and summary of the various learning rate adjustment methods.


1. LR Range Test

Leslie N. Smith proposed this technique in 2015. The core idea is to run the model for a number of iterations: start with a learning rate small enough, then gradually increase it as the iterations proceed, recording the loss at each learning rate and plotting the result (the LR starts at only 1e-7 and is increased up to 10):

[Figure: LR Range Test, loss vs. learning rate]

An LR Range Test plot should contain three regions: in the first, the learning rate is so small that the loss barely decreases; in the second, the loss converges quickly; and in the last, the learning rate is so large that the loss starts to diverge. The learning rate range of the second region is therefore what we should use for training.

So, as its name suggests, this method is a learning rate range test: it finds a suitable learning rate range for training.
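
To make this concrete, here is a minimal sketch of such a test loop in PyTorch. It is not from any particular library: model, train_loader, and loss_fn are assumed placeholders, and the loader is assumed to yield at least num_steps batches:

import torch

def lr_range_test(model, train_loader, loss_fn, lr_min=1e-7, lr_max=10.0, num_steps=100):
    # Ramp the LR exponentially from lr_min to lr_max, recording the loss at each step
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_min)
    gamma = (lr_max / lr_min) ** (1.0 / num_steps)  # per-step multiplier
    lrs, losses = [], []
    data_iter = iter(train_loader)
    for _ in range(num_steps):
        x, y = next(data_iter)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        lrs.append(optimizer.param_groups[0]['lr'])
        losses.append(loss.item())
        for g in optimizer.param_groups:
            g['lr'] *= gamma  # increase the learning rate for the next step
    return lrs, losses  # plot losses vs. lrs and pick the fast-descent region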


2. Cyclical LR

In classical methods, the learning rate always decreases gradually so that the model converges stably, but Leslie Smith questioned this. He argued that letting the learning rate vary periodically within a reasonable range (i.e. Cyclical LR: the learning rate cycles between base_lr and max_lr) is more reasonable and can reach a given accuracy in fewer steps.

[Figure: cyclical learning rate oscillating between base_lr and max_lr]

As shown in the figure above, max_lr and base_lr can be determined with an LR Range Test. The author's reasoning: the optimal learning rate lies somewhere in this range, so if the learning rate keeps moving through the range, it will be close to the optimal learning rate most of the time.

Summary:

  1. Cyclical LR is an effective way to escape saddle points: the gradient near a saddle point is small, and raising the learning rate helps the model get out of the trap.
  2. Cyclical LR can speed up the training process.
  3. Cyclical LR can improve the model's generalization ability to some extent (by steering the model into flat minimum regions).
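
PyTorch ships this idea as torch.optim.lr_scheduler.CyclicLR. A minimal sketch, assuming base_lr and max_lr have already been chosen with an LR Range Test and using an nn.Linear as a stand-in model:

import torch
import torch.nn as nn
from torch.optim.lr_scheduler import CyclicLR

model = nn.Linear(10, 2)  # stand-in for any model
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
scheduler = CyclicLR(optimizer, base_lr=0.001, max_lr=0.01,
                     step_size_up=2000,   # batches to climb from base_lr to max_lr
                     mode='triangular')   # LR bounces linearly between the two bounds

# inside the training loop:
#   loss.backward(); optimizer.step()
#   scheduler.step()  # note: CyclicLR is stepped per batch, not per epoch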

3. One Cycle Policy

Building on Cyclical LR and the LR Range Test, Leslie kept improving the idea and proposed the 1cycle policy: periodic learning rate adjustment with the number of cycles set to 1. In a one cycle schedule, the maximum learning rate is set to the highest value found by the LR Range Test, and the minimum learning rate is much smaller than the maximum (for example, 0.1 times the maximum).

[Figure: One Cycle Policy learning rate schedule]

As shown above, the whole training run lasts about 400 iterations: the first 175 are used for warm-up, the middle 175 anneal back down to the initial learning rate, and the last few dozen iterations decay the learning rate even further. We refer to these three processes as the three stages.

  1. Stage one: linear warm-up. Its effect is similar to ordinary warm-up: it prevents the problems a cold start can cause.
  2. Stage two: linear descent back to the initial learning rate. Because the model spends a considerable amount of time at a high learning rate in the first and second stages, the author argues this acts as regularization: it keeps the model from settling in steep minima and biases it toward flat local minima.
  3. Stage three: the learning rate decays toward 0, which makes the model converge, within that 'flat' region, to a relatively 'steep' local minimum.

[Figure: training and validation loss over one cycle]

The figure above shows the training-set and validation-set loss over one cycle of training. In this figure, the learning rate rises from 0.08 to 0.8 between epochs 0 and 41, falls back to 0.08 between epochs 41 and 82, and then drops to one hundredth of 0.08 over the last few epochs. Note that while the learning rate is high, the validation loss becomes unstable, but on average the gap between validation loss and training loss does not change much, which suggests that what the model learns at this stage generalizes well (i.e., the large learning rate acts as a form of regularization). At the end of training, as the learning rate decays, the training loss drops sharply while the validation loss does not, and the gap between the two widens; that is, at the end of training the model starts to overfit to some degree.

[Figure: training and validation loss with a very high peak learning rate]

In this figure, the learning rate rises from 0.15 to 3 between epochs 0 and 22.5, returns to 0.15 between epochs 22.5 and 45, and then drops to one hundredth of 0.15 over the last few epochs. With a very high learning rate we can learn faster while also preventing overfitting: right up until the learning rate is annealed away, the gap between validation loss and training loss stays very small. This is the super-convergence phenomenon Leslie Smith describes. With this technique, a ResNet-56 can be trained on CIFAR-10 to 92.3% accuracy within 50 epochs; a 70-epoch cycle reaches 93% accuracy.
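
For reference, PyTorch provides torch.optim.lr_scheduler.OneCycleLR, which follows this recipe of warming up to a peak and then annealing far below the initial learning rate. A minimal sketch, with an nn.Linear standing in for a real model and steps_per_epoch assumed:

import torch
import torch.nn as nn
from torch.optim.lr_scheduler import OneCycleLR

model = nn.Linear(10, 2)   # stand-in for any model
steps_per_epoch = 100      # assumed: len(train_loader)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = OneCycleLR(optimizer,
                       max_lr=0.1,             # peak LR, e.g. found by an LR Range Test
                       epochs=30,
                       steps_per_epoch=steps_per_epoch,
                       pct_start=0.3,          # fraction of steps spent warming up
                       anneal_strategy='cos',  # cosine ramp-down after the peak
                       div_factor=25.0,        # initial LR = max_lr / 25
                       final_div_factor=1e4)   # final LR decays far below the initial LR

# call scheduler.step() once per batch inside the training loop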

  • Cyclical momentum

Reference 3 also mentions cyclical momentum, a method of varying the momentum periodically.

Alongside the shift to higher learning rates, Leslie Smith found in his experiments that lower momentum led to better results. This supports the intuition that, in that part of training, we want SGD to quickly head off in new directions to find flatter areas, so the new gradients should be given more weight. In practice, he suggests picking two values such as 0.85 and 0.95: as the learning rate increases, the momentum decreases from the higher value to the lower one, and as the learning rate decreases it returns to the higher momentum. As shown in the figure below:

[Figure: cyclical momentum varying inversely to the learning rate]

According to Leslie, carefully picking one optimal momentum value for the whole training run can give the same final result, but cyclical momentum removes the hassle of trying multiple values and running several complete cycles each time, which wastes precious time.
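
Incidentally, PyTorch's OneCycleLR (sketched above) also implements this momentum cycling: with cycle_momentum=True (the default), it moves the momentum inversely to the learning rate between max_momentum (default 0.95) and base_momentum (default 0.85), exactly the two values Leslie suggests.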

  • Summary:

The name One Cycle Policy captures the idea: the learning rate change is divided into 3 stages, but there is only one cycle, hence a '1 cycle' learning rate schedule. Looking at yolov5's learning rate curve from this angle as well, it does not follow the One Cycle Policy shape exactly; it leans more toward the common cosine annealing strategy.


4. SGDR

See reference 2 for the source.

SGDR is the good old SGD with warm restarts. In principle, SGDR is essentially similar to CLR: the learning rate keeps changing throughout training.

[Figure: SGDR learning rate schedule with warm restarts]

It combines an aggressive annealing strategy (cosine annealing) with a restart schedule. The restart is a 'warm' restart because the model is not restarted from scratch: after the learning rate is reset, the parameters from before the restart are used as the model's initial solution. This is very simple to implement, because nothing needs to be done to the model itself; the learning rate is simply updated on the spot.

To date, Adam and other adaptive optimization methods are still the fastest ways to train deep neural networks. However, many of the best solutions on various benchmarks, such as winning Kaggle solutions, still use SGD, on the argument that the local minima Adam finds generalize worse.

SGDR combines the two: a quick 'warm' restart to a higher learning rate, followed by the aggressive annealing strategy. This helps the model learn as fast as Adam (or even faster) while keeping plain SGD's generalization ability.
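
In PyTorch, this schedule is available out of the box as torch.optim.lr_scheduler.CosineAnnealingWarmRestarts, which section 6 below covers with a sketch.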


5. AdamW and SGDW

See references 1 and 2 for the source (reference 1 covers this part in more detail).

The 'warm' restart strategy works well, and changing the learning rate during training seems feasible. So why wasn't the previous paper extended to an 'AdamR'? The authors of the paper 'Fixing Weight Decay Regularization in Adam' put it this way:

Although our initial version of Adam with 'warm' restarts performed better than Adam, it was still not competitive with SGD with warm restarts.

The authors make the following points in the paper:

  • L2 regularization and weight decay are not the same thing.
  • L2 regularization is not effective in Adam.
  • Weight decay is equally effective in both Adam and SGD.
  • In SGD, a reparameterization makes L2 regularization and weight decay equivalent.
  • Mainstream libraries implement weight decay as L2 regularization, for both SGD and Adam.

They proposed AdamW and SGDW, two methods that separate the weight decay step from the L2 regularization step.
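
The decoupled Adam variant later landed in mainstream libraries; PyTorch exposes it as torch.optim.AdamW. A minimal sketch contrasting the two, with an nn.Linear as a stand-in model:

import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for any model

# Adam: weight_decay is folded into the gradient as L2 regularization,
# which the paper argues does not work as intended for adaptive methods
adam = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-2)

# AdamW: weight decay is decoupled from the gradient update and applied
# directly to the weights
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)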

Using the new AdamW, the authors show that AdamW (and its warm-restart variant AdamWR) is comparable to SGDWR in both speed and performance.

I found this part a little confusing and do not understand it well.


6. PyTorch cosine annealing learning rate strategies

See reference 4 for details.

Here is a small side note: PyTorch itself also implements the cosine annealing strategy, essentially through two classes: CosineAnnealingLR and CosineAnnealingWarmRestarts.

  • CosineAnnealingLR

This one is simpler; only the most critical parameter, T_max, needs explanation. It can be understood as the half-period of the cosine function: if max_epoch=50, then setting T_max=5 makes the learning rate complete 5 full cosine periods.

[Figure: CosineAnnealingLR learning rate curve]

max_epoch=50, T_max=5
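
A minimal sketch reproducing the setting above (an nn.Linear stands in for a real model; eta_min is the floor the learning rate anneals to):

import torch
import torch.nn as nn
from torch.optim.lr_scheduler import CosineAnnealingLR

model = nn.Linear(10, 2)  # stand-in for any model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = CosineAnnealingLR(optimizer, T_max=5, eta_min=0)  # T_max = half-period (epochs)

for epoch in range(50):   # max_epoch = 50 -> the LR completes 5 full cosine periods
    # ... train one epoch ...
    scheduler.step()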

 

  • CosineAnnealingWarmRestarts

There are two main parameters: T_0 is the epoch at which the learning rate first returns to its initial value, and T_mult controls how quickly the restart period grows.

If T_mult = 1, the learning rate returns to its maximum (the initial learning rate) at epochs T_0, 2*T_0, 3*T_0, ..., i*T_0, ...;

[Figure: CosineAnnealingWarmRestarts, T_mult=1]

T_0=5, T_mult=1

 

If T_mult > 1, the learning rate returns to its maximum at epochs T_0, (1 + T_mult)*T_0, (1 + T_mult + T_mult^2)*T_0, ..., (1 + T_mult + T_mult^2 + ... + T_mult^i)*T_0, ...

[Figure: CosineAnnealingWarmRestarts, T_mult=2]

T_0=5, T_mult=2
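
A minimal sketch of the warm-restart variant, again with an nn.Linear stand-in:

import torch
import torch.nn as nn
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = nn.Linear(10, 2)  # stand-in for any model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=5, T_mult=2, eta_min=0)

for epoch in range(50):
    # ... train one epoch ...
    scheduler.step()  # with T_0=5, T_mult=2 the LR restarts at epochs 5, 15, 35, ...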

 


References:

1. Model optimizer column
2. What has changed in deep learning optimizers since Adam appeared?
3. The 1cycle policy
4. PyTorch cosine annealing learning rate


Copyright notice: this article was written by [Clichong]; please include a link to the original when reposting: https://yzsam.com/2022/160/202206090518244516.html