Paper recommendation: EfficientNetV2 - smaller models and faster training through NAS, scaling, and Fused-MBConv
2022-06-28 06:06:00 【deephub】
EfficientNetV2 is a paper by Google Research, Brain Team, published at ICML 2021. It combines NAS and scaling to optimize training speed and parameter efficiency, and adds new operations such as Fused-MBConv to the search space. EfficientNetV2 models train much faster than EfficientNetV1 while being up to 6.8x smaller.

The outline of the paper is as follows:
- Understanding and improving the training efficiency of EfficientNetV1
- NAS and scaling
- Progressive learning
- SOTA comparison
- Ablation studies
Understanding and improving the training efficiency of EfficientNetV1
1. Training with very large image sizes is slow
Large image sizes in EfficientNet lead to heavy memory usage. Since the total memory on a GPU/TPU is fixed, a smaller batch size must be used, which greatly slows down training.
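A rough back-of-the-envelope sketch of this memory/batch-size trade-off; the memory model and constants below are illustrative assumptions, not measurements:

```python
# Back-of-envelope estimate: activation memory grows roughly with
# batch_size * H * W, so halving the image side lets the batch grow ~4x
# under a fixed memory budget. Numbers are illustrative only.

def max_batch_size(memory_budget, image_size, bytes_per_pixel=4 * 64):
    """Largest batch that fits if activations cost ~bytes_per_pixel per
    input pixel (a crude stand-in for a real profiler measurement)."""
    per_image = image_size * image_size * bytes_per_pixel
    return memory_budget // per_image

budget = 16 * 1024**3  # 16 GiB, e.g. one V100
print(max_batch_size(budget, 512))  # large images -> small batch
print(max_batch_size(budget, 256))  # half the side -> ~4x the batch
```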

FixRes (from the paper FixRes: Fixing the Train-Test Resolution Discrepancy) can be applied by using a smaller image size for training than for inference. A smaller image size means less computation and allows a larger batch size, improving training speed by up to 2.2x while also improving accuracy.
2. Depthwise convolutions are slow in early layers but effective in later layers
Fused-MBConv, described in a Google AI blog post, replaces the depthwise 3×3 convolution and the 1×1 expansion convolution in MBConv with a single regular 3×3 convolution.

MBConv and Fused-MBConv structures
The original MBConv blocks in EfficientNet-B4 are progressively replaced with Fused-MBConv.

When Fused-MBConv is applied in the early stages 1-3, it speeds up training with only a small cost in parameters and FLOPs.
But if all blocks use Fused-MBConv (stages 1-7), parameters and FLOPs increase significantly and training actually becomes slower.
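A quick way to see this trade-off is to count parameters for both block types. The sketch below ignores BatchNorm and SE layers for simplicity, and the channel widths are illustrative:

```python
def mbconv_params(c_in, expand=4, k=3):
    """Parameter count of an MBConv block (BN/SE layers ignored):
    1x1 expand -> kxk depthwise -> 1x1 project."""
    c_mid = c_in * expand
    return c_in * c_mid + k * k * c_mid + c_mid * c_in

def fused_mbconv_params(c_in, expand=4, k=3):
    """Fused-MBConv replaces the 1x1 expand + kxk depthwise pair
    with a single regular kxk convolution."""
    c_mid = c_in * expand
    return k * k * c_in * c_mid + c_mid * c_in

# Narrow early stage: the absolute overhead of fusing is tiny.
print(mbconv_params(24), fused_mbconv_params(24))
# Wide late stage: the same fusion costs millions of extra parameters.
print(mbconv_params(256), fused_mbconv_params(256))
```

This is why fusing only the early, narrow stages buys a speedup cheaply, while fusing everything inflates parameters and FLOPs.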
3. Scaling up every stage equally is not optimal
EfficientNet uses a simple compound scaling rule to scale all stages equally. For example, with a depth coefficient of 2, every stage in the network doubles its number of layers. In practice, however, these stages do not contribute equally to training speed and parameter efficiency. EfficientNetV2 therefore uses a non-uniform scaling strategy that gradually adds more layers to the later parts of the model. EfficientNet also scales up the image size aggressively, leading to heavy memory consumption and slow training; to address this, EfficientNetV2 slightly modifies the scaling rule and caps the maximum image size at a smaller value.
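To make the difference concrete, here is a small sketch contrasting uniform compound scaling with a hypothetical non-uniform rule that favors later stages. The layer counts and the non-uniform formula are illustrative, not the paper's actual schedule:

```python
import math

base_layers = [2, 4, 4, 6, 9, 15]  # hypothetical per-stage layer counts

def uniform_scale(layers, depth_coeff):
    """EfficientNetV1-style compound scaling: every stage grows by the same factor."""
    return [math.ceil(n * depth_coeff) for n in layers]

def nonuniform_scale(layers, depth_coeff):
    """Hypothetical V2-style heuristic: later stages receive a larger
    share of the added depth than earlier stages."""
    k = len(layers)
    return [math.ceil(n * (1 + (depth_coeff - 1) * (i + 1) / k))
            for i, n in enumerate(layers)]

print(uniform_scale(base_layers, 2))     # all stages doubled
print(nonuniform_scale(base_layers, 2))  # early stages grow less
```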
NAS and scaling
1. NAS search
The neural architecture search (NAS) space is similar to PNASNet. The search covers design choices such as the convolution operation type {MBConv, Fused-MBConv}, the number of layers, the kernel size {3×3, 5×5}, and the expansion ratio {1, 4, 6}. The size of the search space is reduced by:
- removing unnecessary search options, such as pooling skip operations, since they were never used in the original EfficientNets;
- reusing the channel sizes already searched for in EfficientNets.
With reduced image sizes, about 1000 models were sampled and each trained for roughly 10 epochs. The search ranks candidates by model accuracy A, normalized training step time S, and parameter size P, combined via the simple weighted product A × S^w × P^v, where w = -0.07 and v = -0.05.
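The weighted-product objective can be written down directly; the candidate numbers below are made up for illustration:

```python
def search_reward(accuracy, train_step_time, param_size, w=-0.07, v=-0.05):
    """Weighted-product search objective from the paper: A * S^w * P^v.
    A is accuracy, S the normalized training step time, P the parameter
    size; the negative exponents penalize slow, large models."""
    return accuracy * (train_step_time ** w) * (param_size ** v)

# A slightly less accurate but much faster and smaller candidate can win:
baseline = search_reward(0.84, train_step_time=1.0, param_size=1.0)
candidate = search_reward(0.83, train_step_time=0.5, param_size=0.7)
print(baseline, candidate, candidate > baseline)
```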

There are several key differences between EfficientNetV2 and EfficientNetV1:
- EfficientNetV2 makes extensive use of both MBConv and the newly added Fused-MBConv in the early layers.
- EfficientNetV2 prefers smaller expansion ratios for MBConv, since smaller expansion ratios tend to have less memory access overhead.
- EfficientNetV2 prefers smaller kernel sizes (3×3), but adds more layers to compensate for the reduced receptive field caused by the smaller kernels.
- EfficientNetV2 entirely removes the last stride-1 stage of the original EfficientNet, probably because of its large parameter count and memory access overhead.
2. Scaling
EfficientNetV2-S is scaled up to EfficientNetV2-M/L with compound scaling similar to EfficientNet, plus some additional optimizations:
- the maximum inference image size is limited to 480, because very large images usually incur expensive memory and training speed overhead;
- as a heuristic, more layers are added to the later stages (e.g., stages 5 and 6) to increase network capacity without adding much runtime overhead.

With training-aware NAS and scaling, the proposed EfficientNetV2 models train much faster than other models.
Progressive Learning

Progressive learning improves the training process

EfficientNetV2 training settings

ImageNet top-1 accuracy
The model performs best when the image size is small and the augmentation is weak, but on larger images it performs better with stronger augmentation. Training starts with a small image size and weak regularization (epoch = 1), then gradually increases the difficulty with larger image sizes and stronger regularization: a larger dropout rate, RandAugment magnitude, and mixup ratio (e.g., epoch = 300).
The paper summarizes this process as pseudocode.
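A minimal Python sketch of such a progressive schedule, assuming linear interpolation per stage; the range values are illustrative, not the paper's actual settings:

```python
def progressive_schedule(num_stages, size_range=(128, 300),
                         dropout_range=(0.1, 0.3), randaug_range=(5, 15)):
    """Sketch of a progressive-learning schedule: linearly interpolate
    image size and regularization strength from easy to hard per stage."""
    def lerp(lo, hi, t):
        return lo + (hi - lo) * t

    stages = []
    for i in range(num_stages):
        t = i / (num_stages - 1) if num_stages > 1 else 1.0
        stages.append({
            "image_size": round(lerp(*size_range, t)),
            "dropout": round(lerp(*dropout_range, t), 3),
            "randaug_magnitude": round(lerp(*randaug_range, t), 1),
        })
    return stages

for stage in progressive_schedule(4):
    print(stage)  # image size and regularization grow together
```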

SOTA comparison
1. ImageNet

Model size, FLOPs, and inference latency; latency is measured on a V100 GPU with batch size 16

Models marked with 21k are pre-trained on ImageNet21k with 13M images; the other models are trained from scratch directly on ImageNet ILSVRC2012 with 1.28M images.
The EfficientNetV2 models are significantly faster than the previous ConvNets and Transformer models on ImageNet, while achieving better accuracy and parameter efficiency.
EfficientNetV2-M achieves accuracy comparable to EfficientNet-B7 while training 11x faster with the same computing resources.
The EfficientNetV2 models also significantly outperform all the recent RegNet and ResNeSt models in both accuracy and inference speed.
The first figure at the top shows these results.
With pre-training on ImageNet21k (32 TPU cores, two days), EfficientNetV2-L(21k) improves top-1 accuracy by 1.5% (85.3% vs. 86.8%), while using 2.5x fewer parameters and 3.6x fewer FLOPs, and training and inferring 6x to 7x faster.
2. Transfer learning
The following datasets are used for the transfer learning tests:

Each model is fine-tuned for a few steps. The EfficientNetV2 models outperform the previous ConvNets and Vision Transformers on all of these datasets.

On CIFAR-100, EfficientNetV2-L is 0.6% more accurate than the previous GPipe/EfficientNets and 1.5% more accurate than the previous ViT/DeiT models. These results suggest that EfficientNetV2's generalization ability goes well beyond ImageNet.
Ablation studies
1. Performance with the same training settings

Compared under the same learning settings, the EfficientNetV2 models still perform much better than EfficientNets: EfficientNetV2-M has 17% fewer parameters and 37% fewer FLOPs than EfficientNet-B7, while training 4.1x faster and inferring 3.1x faster.
2. Model scaling

Smaller models are compared by scaling down EfficientNetV2-S with EfficientNet compound scaling. All models are trained without progressive learning under the same settings. The EfficientNetV2 (V2) models are generally faster while maintaining comparable parameter efficiency.
3. Progressive learning for different networks

Progressive learning generally reduces training time while improving accuracy across all the different networks.
4. Importance of adaptive regularization

Adaptive regularization 
Because a TPU needs to recompile its computation graph for every new image size, the image size is randomly sampled every 8 epochs rather than every batch.
Adaptive regularization applies very little regularization to small images early in training, allowing the model to converge faster and reach better final accuracy.
References
https://www.overfit.cn/post/053825be64b64acfa9cbd527a4a1cab7
[2021 ICML] EfficientNetV2: Smaller Models and Faster Training
https://arxiv.org/abs/2104.00298