Bag of Tricks: tricks for training convolutional networks
2022-07-28 06:23:00 【A tavern on the mountain】
0. Overview
Many recent advances in image-classification research can be attributed to refinements of the training procedure (training procedure refinements). The authors summarize these refinements and show that models trained with them also perform well when transferred to downstream tasks.
1. Introduction
From AlexNet to VGG, ResNet, and then NASNet-A, accuracy has kept improving. This is not only a matter of changes in model architecture; the loss function, data augmentation, and optimization method also play an important role in improving model accuracy.
However, these advancements did not solely come from improved model architecture. Training procedure refinements, including changes in loss functions, data preprocessing, and optimization methods, also played a major role.
The authors summarize these training tricks and use them to improve the ResNet-50 network: on the ImageNet dataset, top-1 accuracy rises from 75.3% to 79.29%.
Section 2 introduces the basic training procedure and sets up a baseline.
Section 3 introduces tricks for training the network efficiently on new hardware: FP16 with a large batch size, regularization, and the use of BN.
Section 4 presents the results of three tweaks to the main model structure; keeping the kernel size larger than the stride prevents information loss.
Section 5 introduces four training refinements:
cosine learning rate decay, label smoothing, knowledge distillation, and mixup data augmentation.
Section 6 discusses application to transfer learning.
2. Baseline training process
(1) Decode the raw pixels into values in [0, 255].
(2) Randomly crop a 224*224 region.
(3) Flip horizontally at random.
(4) Randomly scale the hue and saturation coefficients.
(5) Add PCA noise drawn from a normal distribution.
(6) Standardize the RGB channels.
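A minimal torchvision sketch of this preprocessing pipeline, assuming the standard ImageNet channel statistics; the PCA lighting noise of step (5) is omitted here because torchvision ships no built-in transform for it:

```python
from torchvision import transforms

# Training-time preprocessing roughly matching steps (2)-(4) and (6) above.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),                # step (2): random 224x224 crop
    transforms.RandomHorizontalFlip(),                # step (3): horizontal flip
    transforms.ColorJitter(saturation=0.4, hue=0.4),  # step (4): hue/saturation scaling
    transforms.ToTensor(),                            # step (1): pixels -> float tensor
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),  # step (6): standardize RGB channels
])
```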

3. Efficient training
3.1 Use large batch size
The larger the batch size, the slower the convergence; that is, when training for the same number of epochs, a larger batch size leads to lower accuracy on the validation set.
(1) Learning rate warmup and linear scaling of the learning rate
Because each batch is sampled at random, increasing the batch size reduces the noise in the gradient, so the learning rate can be increased proportionally. At the very start of training, the randomly initialized parameters are far from the final solution, and a large learning rate causes numerical instability, so a small learning rate is used first and gradually warmed up (see the sketch after this list).
(2) No bias decay: apply weight decay only to the weights of the convolutional and fully connected layers; the biases (and, in the paper, the BN γ and β parameters) are not regularized.
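A hedged PyTorch sketch of both tricks; the 0.1-per-256 base rate and 5-epoch-style warmup follow the paper's convention, and helper names such as `build_optimizer` and `warmup_lr` are illustrative:

```python
import torch
from torch import nn

def build_optimizer(model: nn.Module, batch_size: int) -> torch.optim.SGD:
    """Linear scaling rule: lr = 0.1 * batch_size / 256.
    Weight decay is applied only to conv/FC weights, not to biases or BN params."""
    lr = 0.1 * batch_size / 256
    decay, no_decay = [], []
    for p in model.parameters():
        if not p.requires_grad:
            continue
        # biases and BN gamma/beta are 1-D tensors -> no weight decay
        (no_decay if p.ndim == 1 else decay).append(p)
    return torch.optim.SGD(
        [{"params": decay, "weight_decay": 1e-4},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=lr, momentum=0.9)

def warmup_lr(base_lr: float, step: int, warmup_steps: int) -> float:
    """Gradual warmup: ramp the learning rate linearly from 0 to base_lr."""
    return base_lr * min(1.0, (step + 1) / warmup_steps)
```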
3.2 Low-precision training
Train in FP16 together with a large batch size.
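A minimal sketch of mixed-precision training with PyTorch's automatic mixed precision; the paper trained directly in FP16 on V100 GPUs, and `torch.cuda.amp` is one common way to get a similar effect (FP32 master weights plus loss scaling):

```python
import torch

scaler = torch.cuda.amp.GradScaler()    # keeps FP32 master weights, scales the loss

def train_step(model, images, labels, criterion, optimizer):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():     # forward pass runs largely in FP16
        loss = criterion(model(images), labels)
    scaler.scale(loss).backward()       # scaled backward to avoid FP16 underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```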

4. Model tweaks


In the original ResNet, the downsampling residual block uses a 1*1 convolution with stride 2 in path A, which skips three quarters of the input feature map and therefore loses information. ResNet-B moves the stride-2 downsampling in path A onto the 3*3 convolution, so no information is lost. ResNet-C replaces the large 7*7 convolution in the input stem with three 3*3 convolutions. ResNet-D, in turn, changes path B: a 2*2 average pooling performs the downsampling before a stride-1 1*1 convolution, so the shortcut no longer loses information either. The final results are shown in the following table.

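A rough sketch of the ResNet-D shortcut (path B) described above, with illustrative names; it average-pools first and then applies a stride-1 1*1 convolution instead of a stride-2 1*1 convolution:

```python
import torch
from torch import nn

class ResNetDShortcut(nn.Module):
    """Downsampling shortcut from ResNet-D: 2x2 average pooling with stride 2,
    then a 1x1 conv with stride 1, so no activations are simply skipped over."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2, ceil_mode=True)
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                              stride=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.bn(self.conv(self.pool(x)))
```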
5. Training refinements
5.1 Cosine learning rate

Corresponding to Section 3.1: the learning rate starts very small and is increased to the initial learning rate; this phase is called learning rate warmup. Then, as training proceeds, the learning rate is gradually decayed along a cosine curve so that training settles into a good solution.
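A minimal sketch of this schedule, following the cosine formula from the paper, η_t = ½(1 + cos(tπ/T))·η, where t is the current step, T the total number of steps, and η the initial learning rate; the linear warmup phase is prepended, and step counts are illustrative:

```python
import math

def cosine_lr(base_lr: float, step: int, total_steps: int,
              warmup_steps: int = 0) -> float:
    """Warmup then cosine decay: lr_t = 0.5 * (1 + cos(pi * t / T)) * base_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps   # linear warmup
    t = step - warmup_steps
    T = max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * t / T)) * base_lr
```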

5.2 Label smoothing
Label smoothing is a further refinement on top of one-hot encoding that makes the label information less absolute (it feels a bit like softening a hard penalty in an SVM, so it can perhaps be seen as a form of regularization), leaving a certain margin. With smoothing parameter ε and K classes, the label distribution becomes

$$q_i = \begin{cases} 1-\varepsilon & \text{if } i = y \\ \varepsilon/(K-1) & \text{otherwise} \end{cases}$$

and the objective function is the cross entropy against this softened distribution:

$$\ell(p, q) = -\sum_{i=1}^{K} q_i \log p_i$$
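A minimal sketch of this loss; recent PyTorch versions also expose it directly via `nn.CrossEntropyLoss(label_smoothing=...)`, so the manual version below is only for illustration:

```python
import torch
import torch.nn.functional as F

def smoothed_cross_entropy(logits: torch.Tensor, target: torch.Tensor,
                           eps: float = 0.1) -> torch.Tensor:
    """Cross entropy against labels smoothed to (1 - eps) for the true class
    and eps / (K - 1) for every other class."""
    K = logits.size(-1)
    log_p = F.log_softmax(logits, dim=-1)
    q = torch.full_like(log_p, eps / (K - 1))          # eps/(K-1) everywhere
    q.scatter_(-1, target.unsqueeze(-1), 1.0 - eps)    # 1-eps on the true class
    return -(q * log_p).sum(dim=-1).mean()
```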
5.3 Knowledge distillation
A teacher model is used to help train the student model. Let p be the true label distribution, r the output of the teacher model, and z the output of the student model. The loss function is

$$\ell\big(p, \mathrm{softmax}(z)\big) + T^2\,\ell\big(\mathrm{softmax}(r/T), \mathrm{softmax}(z/T)\big)$$

where T is the temperature and ℓ is the cross entropy.
The first term learns the true distribution from the labels; the second term learns the prior knowledge of the teacher model.
The true label distribution provides hard labels: high accuracy but low information entropy. The teacher model provides soft labels: lower accuracy but higher information entropy. For example, a car and an SUV are both cars, yet hard labels cannot express this: the one-hot code is (1, 0), whereas the teacher model might output (0.8, 0.2). Such soft labels convey relations between classes (higher information entropy): the label (0.8, 0.2) implies the two classes are more related than (0.7, 0.3) would. Combining soft labels with hard labels lets the student learn the true labels while also acquiring inter-class relation information.
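A rough sketch of this distillation loss in PyTorch, with the second term written in its usual KL-divergence form (equivalent to the cross entropy above up to a constant); T = 20 follows the paper's setting, and the teacher's parameters are assumed frozen elsewhere:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      target: torch.Tensor,
                      T: float = 20.0) -> torch.Tensor:
    """Hard-label cross entropy plus a T^2-weighted term matching the teacher's
    temperature-softened outputs, as in l(p, softmax(z)) + T^2 * l(softmax(r/T), softmax(z/T))."""
    hard = F.cross_entropy(student_logits, target)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean")
    return hard + (T ** 2) * soft
```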
5.4 mixup data augmentation
New samples are generated by linear interpolation between pairs of examples, filling in the empty regions between class samples.
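A minimal sketch of mixup: λ is drawn from a Beta(α, α) distribution (α = 0.2 follows the paper's setting), each image is mixed with a randomly permuted partner, and the loss is mixed with the same weight; the helper name `mixup_batch` is illustrative:

```python
import numpy as np
import torch

def mixup_batch(images: torch.Tensor, targets: torch.Tensor, alpha: float = 0.2):
    """x_hat = lam * x_i + (1 - lam) * x_j; targets of both partners are returned."""
    lam = float(np.random.beta(alpha, alpha))
    index = torch.randperm(images.size(0), device=images.device)
    mixed = lam * images + (1.0 - lam) * images[index]
    return mixed, targets, targets[index], lam

# usage:
#   mixed, y_a, y_b, lam = mixup_batch(images, targets)
#   loss = lam * criterion(model(mixed), y_a) + (1 - lam) * criterion(model(mixed), y_b)
```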

In the results table, knowledge distillation hurts Inception-V3 and MobileNet because the teacher model and the student model are not from the same model family, which has a negative impact on the student.