
[2206] An Improved One millisecond Mobile Backbone

2022-06-22 11:16:00 koukouvagia

paper

Abstract

  • propose MobileOne, achieving SOTA performance on classification, detection, and segmentation
  • analyze performance bottlenecks in activations and branching that incur high latency on mobile
  • introduce re-parameterization, annealed weight decay, and a progressive learning curriculum during training

Method

metric correlations


Left: FLOPs vs latency on iPhone 12. Right: parameter count vs latency on iPhone 12. Some networks are indicated with numbers, as listed in the table above.

models with higher FLOPs or parameter counts can still have lower latency
CNN models have lower latency than ViT models at similar FLOPs and parameter counts


Spearman rank correlation coefficients between latency and FLOPs, and between latency and parameter count.

on mobile, latency is moderately correlated with FLOPs and only weakly correlated with parameter count; both correlations are even lower on a desktop CPU
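As a quick illustration of how such a rank correlation is computed, a minimal sketch with made-up numbers (not the paper's measurements):

```python
# Hypothetical sketch: Spearman rank correlation between latency and FLOPs / parameter
# count. The numbers below are invented for illustration; the paper measures real models.
from scipy.stats import spearmanr

flops   = [0.3, 0.6, 1.0, 1.8, 4.1]    # GFLOPs of five hypothetical models
params  = [2.5, 7.8, 3.4, 25.0, 5.3]   # parameter count in millions
latency = [0.9, 1.1, 1.6, 1.9, 7.3]    # measured latency in ms

print(spearmanr(flops, latency).correlation)    # 1.0: FLOPs and latency rankings agree
print(spearmanr(params, latency).correlation)   # 0.5: much weaker rank agreement
```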

key bottlenecks

benchmark a 30-layer CNN on iPhone 12, swapping in different activation functions and architectural blocks commonly used in efficient networks

activation functions


Comparison of latency on mobile device of different activation functions in a 30-layer convolutional neural network.

recently introduced activation functions have low FLOPs but high synchronization cost
⇒ ReLU is used in MobileOne
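The paper times this on-device; as a rough desktop-side analogue (a sketch only, not the authors' harness), the 30-layer stack can be built in PyTorch with a swappable activation and timed on CPU:

```python
# Sketch of the activation ablation: a 30-layer conv stack whose activation can be
# swapped, timed on CPU with PyTorch. This is only a proxy; the paper times on iPhone 12.
import time
import torch
import torch.nn as nn

def make_net(act_cls, depth=30, channels=64):
    layers = []
    for _ in range(depth):
        layers += [nn.Conv2d(channels, channels, 3, padding=1), act_cls()]
    return nn.Sequential(*layers).eval()

@torch.no_grad()
def bench_ms(net, x, iters=10):
    net(x)                                   # warm-up pass
    start = time.perf_counter()
    for _ in range(iters):
        net(x)
    return (time.perf_counter() - start) / iters * 1e3

x = torch.randn(1, 64, 56, 56)
for act in (nn.ReLU, nn.GELU, nn.SiLU):
    print(f"{act.__name__}: {bench_ms(make_net(act), x):.1f} ms")
```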

architecture blocks


Ablation on latency of different architectural blocks in a 30-layer convolutional neural network.

two key factors that affect runtime performance are memory access cost and degree of parallelism
global pooling in SE blocks ⇒ larger synchronization cost
skip connections introduce multiple branches ⇒ larger memory access cost
⇒ MobileOne uses no branches at inference and only a limited number of SE blocks
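For reference, a minimal Squeeze-and-Excite block in its standard form (a sketch, not the paper's exact implementation); the global average pool on the first line of `forward` is the synchronization point mentioned above:

```python
# Standard SE block: every spatial position must be reduced by the global pool before
# the channel-wise re-weighting can proceed, which serializes the computation.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc1 = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.fc2 = nn.Conv2d(channels // reduction, channels, kernel_size=1)

    def forward(self, x):
        s = x.mean(dim=(2, 3), keepdim=True)              # global average pool (sync point)
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(s))))
        return x * s                                      # channel-wise re-weighting

x = torch.randn(1, 64, 56, 56)
print(SEBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```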

MobileOne block


MobileOne block has two different structures at train time and test time. Left: Train time MobileOne block with re-parameterizable branches. Right: MobileOne block at inference where the branches are re-parameterized. Either ReLU or SE-ReLU is used as activation. The trivial over-parameterization factor k is a hyper-parameter which is tuned for every variant.

  • based on the MobileNet-V1 block: a 3×3 depth-wise conv followed by a 1×1 point-wise conv
  • at train time, adds re-parameterizable branches and a skip connection containing only BN; all branches are folded back into a single conv at inference (see the sketch below)
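A hedged sketch of the re-parameterization step in the spirit of RepVGG/MobileOne (not the authors' released code): each conv+BN branch is folded into an equivalent conv, and the folded kernels and biases of all parallel branches are summed into a single inference-time conv. The BN-only skip branch can be handled the same way by first writing the identity as a conv kernel (omitted here for brevity).

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d):
    """Return (kernel, bias) of one conv equivalent to conv followed by bn (eval mode)."""
    std = (bn.running_var + bn.eps).sqrt()
    scale = bn.weight / std                               # per-output-channel scaling
    kernel = conv.weight * scale.reshape(-1, 1, 1, 1)
    bias = bn.bias - bn.running_mean * scale
    if conv.bias is not None:
        bias = bias + conv.bias * scale
    return kernel, bias

@torch.no_grad()
def reparameterize(branches, fused: nn.Conv2d):
    """Fold parallel (conv, bn) branches of identical shape into one conv with bias."""
    kernel = torch.zeros_like(fused.weight)
    bias = torch.zeros(fused.out_channels)
    for conv, bn in branches:
        k, b = fuse_conv_bn(conv, bn)
        kernel += k
        bias += b
    fused.weight.copy_(kernel)
    fused.bias.copy_(bias)

# usage: two parallel 3x3 conv+BN branches collapse into one conv at inference
c = 8
branches = [(nn.Conv2d(c, c, 3, padding=1, bias=False), nn.BatchNorm2d(c).eval())
            for _ in range(2)]
fused = nn.Conv2d(c, c, 3, padding=1, bias=True)
reparameterize(branches, fused)
x = torch.randn(1, c, 16, 16)
with torch.no_grad():
    ref = sum(bn(conv(x)) for conv, bn in branches)
    print(torch.allclose(fused(x), ref, atol=1e-5))  # True: outputs match after folding
```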

model scaling

scaling model dimensions such as width, depth, and resolution can improve performance
scaling up the input resolution is not explored, since it increases both FLOPs and memory cost, which is detrimental to runtime on mobile devices


MobileOne Network Specifications.


Comparison of Top-1 accuracy on ImageNet against recent train-time over-parameterization works. The number of parameters listed is at inference.

MobileOne has a single-branched structure at inference ⇒ model parameters can be scaled more aggressively than in competing multi-branched models

Experiment

measurement of latency

  • Mobile: an iOS app on iPhone 12 using Core ML
  • CPU: 2.3 GHz Intel Xeon Gold 5118 processor
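A hedged sketch of how a model might be exported for such on-device measurement, assuming coremltools and torchvision (MobileNetV2 is only a stand-in backbone; the actual timing happens inside an iOS app, as the paper does):

```python
# Sketch only: trace a PyTorch backbone and convert it to Core ML; latency is then
# measured on the phone itself, not on the desktop machine doing the conversion.
import torch
import coremltools as ct
from torchvision.models import mobilenet_v2

model = mobilenet_v2().eval()
example = torch.randn(1, 3, 224, 224)
traced = torch.jit.trace(model, example)

mlmodel = ct.convert(traced, inputs=[ct.TensorType(name="image", shape=example.shape)])
mlmodel.save("backbone.mlpackage")   # bundle this into an iOS app to time inference
```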

image classification

dataset: ImageNet
optimizer: SGD with momentum, batch size 256, 300 epochs, weight decay 1e-4 cosine-annealed to 1e-5
lr schedule: initial LR 0.1, cosine annealing
label smoothing: 0.1
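A minimal sketch of these settings with assumed PyTorch equivalents (the backbone is a placeholder and the momentum value of 0.9 is assumed, not stated in the classification recipe):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 64, 3), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
                      nn.Flatten(), nn.Linear(64, 1000))        # stand-in backbone
epochs = 300
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9,                       # assumed value
                            weight_decay=1e-4)
lr_schedule = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)            # label smoothing 0.1

# per epoch: forward/backward over the ImageNet loader, then lr_schedule.step();
# the weight decay coefficient is itself cosine-annealed from 1e-4 to 1e-5
# (see the ablation on training settings below)
```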


Performance of various models on ImageNet-1k validation set. Note: All results are without distillation for a fair comparison. Results are grouped based on latency on mobile device.

key findings

  • lower latency on mobile: even the smallest transformer variant in the comparison has a latency of about 4 ms
    • MobileFormer attains top-1 accuracy of 79.3% with a latency of 70.76ms, while MobileOne-S4 attains 79.4% with a latency of only 1.86ms (38x faster)
    • MobileOne-S3 has 1% better top-1 accuracy than EfficientNet-B0 and is faster by 11% on mobile
  • lower latency on CPU
    • MobileOne-S4 has 2.3% better top-1 accuracy than EfficientNet-B0 while being faster by 7.3% on CPU

object detection and semantic segmentation

dataset: MS-COCO
optimizer: SGD with momentum 0.9, batch size 192, 200 epochs, weight decay 1e-4
lr schedule: initial LR 0.05, linear warm-up for 4500 iterations, then cosine annealing (see the scheduler sketch after the segmentation settings below)

datasets: Pascal-VOC, ADE20k
optimizer: AdamW, 50 epochs, weight decay 0.01
lr schedule: initial LR 1e-4, warm-up for 500 iterations, then cosine annealing to 1e-6
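Both recipes share a warm-up-then-cosine LR schedule; a hedged sketch with assumed PyTorch pieces (iteration counts follow the detection settings, and the total iteration count is a placeholder):

```python
import math
import torch

params = [torch.nn.Parameter(torch.zeros(1))]             # placeholder parameters
optimizer = torch.optim.SGD(params, lr=0.05, momentum=0.9, weight_decay=1e-4)

warmup_iters = 4500
total_iters = 200_000                                      # placeholder: 200 epochs x iters/epoch

def lr_factor(step: int) -> float:
    if step < warmup_iters:                                # linear warm-up to the base LR
        return step / warmup_iters
    t = (step - warmup_iters) / max(1, total_iters - warmup_iters)
    return 0.5 * (1.0 + math.cos(math.pi * min(t, 1.0)))  # cosine decay after warm-up

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
# call scheduler.step() once per training iteration
```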


(a) Quantitative performance of object detection on MS-COCO. (b) Quantitative performance of semantic segmentation on the Pascal-VOC and ADE20k datasets. † This model was trained without Squeeze-Excite layers.

key findings

  • detection (MS-COCO): MobileOne-S4 outperforms MNASNet by 27.8% and the best version of MobileViT by 6.1%
  • segmentation (Pascal-VOC): MobileOne-S4 outperforms MobileViT by 1.3% and MobileNetV2 by 5.8%
  • segmentation (ADE20k): MobileOne-S4 outperforms MobileNetV2 by 12%; even the smaller MobileOne-S1 backbone outperforms it by 2.9%

ablation studies

training settings

As opposed to large models, small models need less regularization to combat overfitting.
Instead of removing weight decay entirely, its coefficient is annealed so that the regularization is strong only in the early stages of training (sketched below).
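A sketch of such weight decay annealing, assuming a cosine schedule matching the classification recipe (1e-4 annealed to 1e-5 over 300 epochs); the helper below is hypothetical, not the authors' implementation:

```python
import math
import torch

model = torch.nn.Linear(8, 8)                              # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)

wd_start, wd_end, epochs = 1e-4, 1e-5, 300

def set_weight_decay(epoch: int) -> None:
    """Cosine-anneal the weight decay coefficient from wd_start to wd_end."""
    t = 0.5 * (1.0 + math.cos(math.pi * epoch / epochs))    # goes 1 -> 0 over training
    for group in optimizer.param_groups:
        group["weight_decay"] = wd_end + (wd_start - wd_end) * t

# call set_weight_decay(epoch) at the start of each epoch
```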


Ablation on various train settings for MobileOne-S2 showing Top-1 accuracy on ImageNet.

over-parameterization factor


Comparison of Top-1 accuracy on ImageNet for various values of the trivial over-parameterization factor k.

effect of re-parameterizable branches


Effect of re-parameterizable branches on Top-1 ImageNet accuracy.


Copyright notice
This article was written by [koukouvagia]; please include a link to the original post when reposting. Thanks.
https://blog.csdn.net/weixin_43355838/article/details/125360546