Bag of Tricks: tricks for training convolutional networks
2022-07-28 06:23:00 【A tavern on the mountain】
0. Overview
Many recent advances in image-classification research can be attributed to refinements of the training procedure (training procedure refinements). The authors summarize these refinements and show that models trained with them also perform well when transferred to downstream tasks.
1. Introduction
From AlexNet to VGG, ResNet, and then NASNet-A, accuracy has kept improving. This is not only a matter of changes in model architecture; the loss function, data augmentation, and optimization method also play an important role in improving model accuracy.
However, these advancements did not solely come from improved model architecture. Training procedure refinements, including changes in loss functions, data preprocessing, and optimization methods, also played a major role.
The authors summarize these training tricks and use them to improve the ResNet-50 network: on the ImageNet dataset, top-1 accuracy rises from 75.3% to 79.29%.
Section 2 introduces the basic training procedure and sets up a baseline.
Section 3 introduces tricks for training the network efficiently on new hardware: FP16 with a large batch size, regularization, and the use of BN.
Section 4 presents the results of three tweaks to the main model structure; keeping the kernel size larger than the stride prevents information loss.
Section 5 introduces four training refinements:
cosine learning rate decay, label smoothing, knowledge distillation, and mixup data augmentation.
Section 6 discusses application to transfer learning.
2. Baseline training process
(1) Decode the raw pixels into values in [0, 255].
(2) Randomly crop a 224*224 region.
(3) Flip horizontally at random.
(4) Randomly scale the hue and saturation coefficients.
(5) Add PCA noise drawn from a normal distribution.
(6) Standardize the RGB channels.
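A minimal torchvision sketch of this preprocessing pipeline, assuming the standard ImageNet channel statistics; the PCA lighting noise of step (5) is omitted here because torchvision ships no built-in transform for it:

```python
from torchvision import transforms

# Training-time preprocessing roughly matching steps (2)-(4) and (6) above.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),                # step (2): random 224x224 crop
    transforms.RandomHorizontalFlip(),                # step (3): horizontal flip
    transforms.ColorJitter(saturation=0.4, hue=0.4),  # step (4): hue/saturation scaling
    transforms.ToTensor(),                            # step (1): pixels -> float tensor
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),  # step (6): standardize RGB channels
])
```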

3. Efficient training
3.1 Use large batch size
The larger the batch size, the slower the convergence; that is, when training for the same number of epochs, a larger batch size leads to lower accuracy on the validation set.
(1) Learning rate warmup and linear scaling of the learning rate
Because each batch is sampled at random, increasing the batch size reduces the noise in the gradient, so the learning rate can be increased proportionally. At the very start of training, the randomly initialized parameters are far from the final solution, and a large learning rate causes numerical instability, so a small learning rate is used first and gradually warmed up (see the sketch after this list).
(2) No bias decay: apply weight decay only to the weights of the convolutional and fully connected layers; the biases (and, in the paper, the BN γ and β parameters) are not regularized.
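A hedged PyTorch sketch of both tricks; the 0.1-per-256 base rate and 5-epoch-style warmup follow the paper's convention, and helper names such as `build_optimizer` and `warmup_lr` are illustrative:

```python
import torch
from torch import nn

def build_optimizer(model: nn.Module, batch_size: int) -> torch.optim.SGD:
    """Linear scaling rule: lr = 0.1 * batch_size / 256.
    Weight decay is applied only to conv/FC weights, not to biases or BN params."""
    lr = 0.1 * batch_size / 256
    decay, no_decay = [], []
    for p in model.parameters():
        if not p.requires_grad:
            continue
        # biases and BN gamma/beta are 1-D tensors -> no weight decay
        (no_decay if p.ndim == 1 else decay).append(p)
    return torch.optim.SGD(
        [{"params": decay, "weight_decay": 1e-4},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=lr, momentum=0.9)

def warmup_lr(base_lr: float, step: int, warmup_steps: int) -> float:
    """Gradual warmup: ramp the learning rate linearly from 0 to base_lr."""
    return base_lr * min(1.0, (step + 1) / warmup_steps)
```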
3.2 Low-precision training
Train in FP16 together with a large batch size.
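A minimal sketch of mixed-precision training with PyTorch's automatic mixed precision; the paper trained directly in FP16 on V100 GPUs, and `torch.cuda.amp` is one common way to get a similar effect (FP32 master weights plus loss scaling):

```python
import torch

scaler = torch.cuda.amp.GradScaler()    # keeps FP32 master weights, scales the loss

def train_step(model, images, labels, criterion, optimizer):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():     # forward pass runs largely in FP16
        loss = criterion(model(images), labels)
    scaler.scale(loss).backward()       # scaled backward to avoid FP16 underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```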

4. Model tweaks


In the original ResNet, the downsampling residual block uses a 1*1 convolution with stride 2 in path A, which skips three quarters of the input feature map and therefore loses information. ResNet-B moves the stride-2 downsampling in path A onto the 3*3 convolution, so no information is lost. ResNet-C replaces the large 7*7 convolution in the input stem with three 3*3 convolutions. ResNet-D, in turn, changes path B: a 2*2 average pooling performs the downsampling before a stride-1 1*1 convolution, so the shortcut no longer loses information either. The final results are shown in the following table.

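A rough sketch of the ResNet-D shortcut (path B) described above, with illustrative names; it average-pools first and then applies a stride-1 1*1 convolution instead of a stride-2 1*1 convolution:

```python
import torch
from torch import nn

class ResNetDShortcut(nn.Module):
    """Downsampling shortcut from ResNet-D: 2x2 average pooling with stride 2,
    then a 1x1 conv with stride 1, so no activations are simply skipped over."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2, ceil_mode=True)
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                              stride=1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.bn(self.conv(self.pool(x)))
```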
5. Training refinements
5.1 Cosine learning rate

Corresponding to Section 3.1: the learning rate starts very small and is increased to the initial learning rate; this phase is called learning rate warmup. Then, as training proceeds, the learning rate is gradually decayed along a cosine curve so that training settles into a good solution.
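A minimal sketch of this schedule, following the cosine formula from the paper, η_t = ½(1 + cos(tπ/T))·η, where t is the current step, T the total number of steps, and η the initial learning rate; the linear warmup phase is prepended, and step counts are illustrative:

```python
import math

def cosine_lr(base_lr: float, step: int, total_steps: int,
              warmup_steps: int = 0) -> float:
    """Warmup then cosine decay: lr_t = 0.5 * (1 + cos(pi * t / T)) * base_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps   # linear warmup
    t = step - warmup_steps
    T = max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * t / T)) * base_lr
```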

5.2 Label smoothing
Label smoothing is a further refinement on top of one-hot encoding that makes the label information less absolute (it feels a bit like softening a hard penalty in an SVM, so it can perhaps be seen as a form of regularization), leaving a certain margin. With smoothing parameter ε and K classes, the label distribution becomes

$$q_i = \begin{cases} 1-\varepsilon & \text{if } i = y \\ \varepsilon/(K-1) & \text{otherwise} \end{cases}$$

and the objective function is the cross entropy against this softened distribution:

$$\ell(p, q) = -\sum_{i=1}^{K} q_i \log p_i$$
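A minimal sketch of this loss; recent PyTorch versions also expose it directly via `nn.CrossEntropyLoss(label_smoothing=...)`, so the manual version below is only for illustration:

```python
import torch
import torch.nn.functional as F

def smoothed_cross_entropy(logits: torch.Tensor, target: torch.Tensor,
                           eps: float = 0.1) -> torch.Tensor:
    """Cross entropy against labels smoothed to (1 - eps) for the true class
    and eps / (K - 1) for every other class."""
    K = logits.size(-1)
    log_p = F.log_softmax(logits, dim=-1)
    q = torch.full_like(log_p, eps / (K - 1))          # eps/(K-1) everywhere
    q.scatter_(-1, target.unsqueeze(-1), 1.0 - eps)    # 1-eps on the true class
    return -(q * log_p).sum(dim=-1).mean()
```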
5.3 Knowledge distillation
A teacher model is used to help train the student model. Let p be the true label distribution, r the output of the teacher model, and z the output of the student model. The loss function is

$$\ell\big(p, \mathrm{softmax}(z)\big) + T^2\,\ell\big(\mathrm{softmax}(r/T), \mathrm{softmax}(z/T)\big)$$

where T is the temperature and ℓ is the cross entropy.
The first term learns the true distribution from the labels; the second term learns the prior knowledge of the teacher model.
The true label distribution provides hard labels: high accuracy but low information entropy. The teacher model provides soft labels: lower accuracy but higher information entropy. For example, a car and an SUV are both cars, yet hard labels cannot express this: the one-hot code is (1, 0), whereas the teacher model might output (0.8, 0.2). Such soft labels convey relations between classes (higher information entropy): the label (0.8, 0.2) implies the two classes are more related than (0.7, 0.3) would. Combining soft labels with hard labels lets the student learn the true labels while also acquiring inter-class relation information.
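A rough sketch of this distillation loss in PyTorch, with the second term written in its usual KL-divergence form (equivalent to the cross entropy above up to a constant); T = 20 follows the paper's setting, and the teacher's parameters are assumed frozen elsewhere:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      target: torch.Tensor,
                      T: float = 20.0) -> torch.Tensor:
    """Hard-label cross entropy plus a T^2-weighted term matching the teacher's
    temperature-softened outputs, as in l(p, softmax(z)) + T^2 * l(softmax(r/T), softmax(z/T))."""
    hard = F.cross_entropy(student_logits, target)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean")
    return hard + (T ** 2) * soft
```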
5.4 mixup data augmentation
New samples are generated by linear interpolation between pairs of examples, filling in the empty regions between class samples.
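A minimal sketch of mixup: λ is drawn from a Beta(α, α) distribution (α = 0.2 follows the paper's setting), each image is mixed with a randomly permuted partner, and the loss is mixed with the same weight; the helper name `mixup_batch` is illustrative:

```python
import numpy as np
import torch

def mixup_batch(images: torch.Tensor, targets: torch.Tensor, alpha: float = 0.2):
    """x_hat = lam * x_i + (1 - lam) * x_j; targets of both partners are returned."""
    lam = float(np.random.beta(alpha, alpha))
    index = torch.randperm(images.size(0), device=images.device)
    mixed = lam * images + (1.0 - lam) * images[index]
    return mixed, targets, targets[index], lam

# usage:
#   mixed, y_a, y_b, lam = mixup_batch(images, targets)
#   loss = lam * criterion(model(mixed), y_a) + (1 - lam) * criterion(model(mixed), y_b)
```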

In the results table, knowledge distillation hurts Inception-V3 and MobileNet because the teacher model and the student model are not from the same model family, which has a negative impact on the student.