
MicroNet: Image Recognition with Extremely Low FLOPs

2022-06-09 10:21:00 AI Hao

Abstract


In this paper, we present MicroNet, an efficient convolutional neural network that operates at extremely low computational cost (e.g., 6 MFLOPs for ImageNet classification). Such low-cost networks are highly desirable on edge devices, but they usually suffer from significant performance degradation. We handle the extremely low FLOP regime with two design principles: (a) avoid reducing the network width by lowering node connectivity, and (b) compensate for the reduced network depth by introducing more complex nonlinearity per layer. First, we propose Micro-Factorized Convolution, which factorizes pointwise and depthwise convolutions into low-rank matrices to achieve a good trade-off between the number of channels and the input/output connectivity. Second, we propose a new activation function, named Dynamic Shift-Max, which improves nonlinearity by taking the maximum over multiple dynamic fusions between an input feature map and its circular channel shifts. The fusions are dynamic because their parameters adapt to the input. Building on Micro-Factorized Convolution and Dynamic Shift-Max, a family of MicroNets achieves significant performance gains over the state of the art in the low-FLOP regime. For example, MicroNet-M1 achieves 61.1% top-1 accuracy on ImageNet classification with 12 MFLOPs, outperforming MobileNetV3 by 11.3%.

1. Introduction

Designing efficient CNN architectures [15, 12, 26, 11, 42, 24, 29] has recently been an active research area. These efforts enable high-quality services on edge devices. However, when the computational cost becomes extremely low, even state-of-the-art efficient CNNs (e.g., MobileNetV3 [11]) suffer significant performance degradation. For example, when MobileNetV3 is constrained from 112M to 12M MAdds for classifying 224×224 images [8], its top-1 accuracy drops from 71.7% to 49.8%. This makes adoption on low-power devices (e.g., IoT devices) even harder. In this paper, we tackle an even more challenging problem by cutting the budget in half: can we classify images over 1,000 categories at 224×224 resolution with no more than 6 MFLOPs?

Such an extremely low computational cost (6M FLOPs) requires a careful redesign of every layer. For instance, even a thin stem layer with a single 3×3 convolution over a 112×112 grid (stride 2), 3 input channels, and 8 output channels already costs 2.7M MAdds. The remaining budget is far too limited for the convolutions and the 1000-class classifier to learn good representations. A naive strategy to fit existing efficient CNNs (e.g., MobileNet [12, 26, 11] and ShuffleNet [42, 24]) into such a low budget is to drastically reduce the network width or depth, which leads to severe performance degradation.

We propose a new architecture, named MicroNet, to handle extremely low FLOPs. It is built on two design principles:

  • Avoid reducing the network width by lowering node connectivity.
  • Compensate for the reduced network depth by improving the nonlinearity of each layer.

These principles guide the design of more efficient convolutions and activation functions.

First, we propose Micro-Factorized Convolution, which factorizes pointwise and depthwise convolutions into low-rank matrices. This strikes a good balance between the number of channels and the input/output connectivity. Specifically, we design group-adaptive convolution to factorize the pointwise convolution, where the number of groups adapts to the number of channels via a square-root relationship. Stacking two group-adaptive convolutions essentially approximates the pointwise convolution matrix by a block matrix whose blocks have rank 1. The factorization of the depthwise convolution is a straightforward rank-1 decomposition of a k×k depthwise kernel into a 1×k and a k×1 depthwise convolution. We show that properly combining these two approximations at different layers significantly reduces the computational cost without sacrificing the number of channels.

Second, we propose a new activation function, named Dynamic Shift-Max, which improves nonlinearity in two ways: (a) it takes the maximum over multiple fusions between the input feature map and its circular channel shifts, and (b) each fusion is dynamic, since its parameters adapt to the input. Moreover, it strengthens both node connectivity and nonlinearity within a single function at low computational cost.

Experimental results show that MicroNets outperform the state of the art by a large margin (see Figure 1). For example, compared with MobileNetV3, our method improves top-1 accuracy on ImageNet classification by 11.3% and 7.7% under the 12M and 21M FLOP constraints, respectively. Under the challenging 6 MFLOPs constraint, our method achieves 53.0% top-1 accuracy, 3.2% higher than MobileNetV3 with twice the complexity (12 MFLOPs). In addition, the MicroNet family provides strong baselines at very low computational cost for two pixel-level tasks: semantic segmentation and keypoint detection.

2. Related Work

Efficient CNNs: MobileNets [12, 26, 11] decompose a k×k convolution into a depthwise and a pointwise convolution. ShuffleNets [42, 24] use group convolution and channel shuffle to simplify the pointwise convolution. [33] approximates the pointwise convolution with a butterfly transform. EfficientNet [29, 31] finds a proper relationship between input resolution and network width/depth. MixNet [30] mixes multiple kernel sizes in a single convolution. AdderNet [2] trades massive multiplications for cheaper additions. GhostNet [10] generates ghost feature maps via cheap linear transformations. Sandglass [43] flips the inverted residual block structure to reduce information loss. [39] and [1] train one network that supports multiple sub-networks.

Dynamic neural networks: Dynamic networks improve representation power by adapting parameters to the input. HyperNet [9] uses another network to generate parameters for the main network. SENet [13] reweights channels by squeezing global context. SKNet [18] adapts attention over kernels of different sizes. Dynamic convolution [37, 5] aggregates multiple convolution kernels according to attention. Dynamic ReLU [6] adapts the slopes and intercepts of the two linear functions in ReLU [25, 16]. [23] directly generates convolution weights with grouped fully connected layers. [3] extends dynamic convolution from spatially agnostic to spatially specific. [27] proposes dynamic group convolution that adaptively groups input channels. [32] applies dynamic convolution to instance segmentation. [19] learns dynamic routing across scales for semantic segmentation.

3. Our Approach: MicroNet

In this section, we describe MicroNet's design principles and key components in detail.

3.1. Design Principles

Extremely low FLOPs limit both the network width (number of channels) and the network depth (number of layers); we analyze the two separately. If we view a convolution layer as a graph, the connections (edges) between input and output channels (nodes) are weighted by the kernel parameters. Here, we define connectivity as the number of connections per output node, so the total number of connections equals the number of output channels multiplied by the connectivity. When the computational cost (proportional to the number of connections) is fixed, the number of channels conflicts with the connectivity. We argue that a good balance between them effectively avoids channel reduction and improves the representation power of a layer. Hence our first design principle: avoid reducing the network width by lowering node connectivity. We achieve this by factorizing the pointwise and depthwise convolutions at a finer scale.

When the network depth (number of layers) is significantly reduced, its nonlinearity (encoded in ReLU) is constrained, resulting in significant performance degradation. This motivates our second design principle: compensate for the reduced network depth by improving the nonlinearity of each layer. We design a new activation function, Dynamic Shift-Max, to achieve this.

3.2. Micro-Factorized Convolution

We factorize both pointwise and depthwise convolutions at a finer scale, hence the name Micro-Factorized Convolution. The goal is to balance the number of channels against the input/output connectivity.

Micro-Factorized Pointwise Convolution: We propose group-adaptive convolution to factorize the pointwise convolution. For brevity, we assume that the convolution kernel $W$ has the same number of input and output channels ($C_{in} = C_{out} = C$) and ignore the bias. The kernel matrix $W$ is factorized into two group-adaptive convolutions, where the number of groups $G$ depends on the number of channels $C$. Mathematically, this is written as:

$$\boldsymbol{W}=\boldsymbol{P} \boldsymbol{\Phi} \boldsymbol{Q}^{T} \tag{1}$$

where $W$ is a $C \times C$ matrix. $Q$ has shape $C \times \frac{C}{R}$, compressing the number of channels by a ratio $R$. $P$ has shape $C \times \frac{C}{R}$, expanding the number of channels back to $C$ for the output. $P$ and $Q$ are block-diagonal matrices with $G$ blocks, each block corresponding to the convolution of one group. $\boldsymbol{\Phi}$ is a $\frac{C}{R} \times \frac{C}{R}$ permutation matrix, similar to the channel shuffle in [42]. The computational complexity is $\mathcal{O}=\frac{2 C^{2}}{R G}$. Figure 2-left shows an example with $C = 18$, $R = 2$, and $G = 3$.
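
To make the factorization concrete, here is a minimal PyTorch sketch of Micro-Factorized pointwise convolution: two group-adaptive 1×1 convolutions with a channel permutation (implemented here as a ShuffleNet-style channel shuffle) in between. The class name and the choice of shuffle as the permutation $\Phi$ are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch of Micro-Factorized pointwise convolution (Eq. 1),
# assuming C_in = C_out = C. Illustrative only.
import torch
import torch.nn as nn

class MicroFactorizedPointwise(nn.Module):
    def __init__(self, channels: int, reduction: int = 2, groups: int = 3):
        super().__init__()
        hidden = channels // reduction  # C/R hidden channels
        # Q^T: group-adaptive 1x1 conv that compresses C -> C/R
        self.compress = nn.Conv2d(channels, hidden, kernel_size=1,
                                  groups=groups, bias=False)
        # Phi: permutation between the two group convolutions,
        # realized as a channel shuffle (similar to ShuffleNet)
        self.groups = groups
        # P: group-adaptive 1x1 conv that expands C/R -> C
        self.expand = nn.Conv2d(hidden, channels, kernel_size=1,
                                groups=groups, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.compress(x)
        n, c, h, w = x.shape
        # channel shuffle: interleave channels across the G groups
        x = x.view(n, self.groups, c // self.groups, h, w)
        x = x.transpose(1, 2).reshape(n, c, h, w)
        return self.expand(x)

# Example matching Figure 2-left: C = 18, R = 2, G = 3
block = MicroFactorizedPointwise(channels=18, reduction=2, groups=3)
print(block(torch.randn(1, 18, 56, 56)).shape)  # torch.Size([1, 18, 56, 56])
```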

Note that the number of groups $G$ is not fixed; it adapts to the number of channels $C$ and the reduction ratio $R$ as:

$$G=\sqrt{C / R} \tag{2}$$

This square-root relationship is derived from the balance between the number of channels $C$ and the input/output connectivity. Here, we define the connectivity $E$ as the number of input/output connections per output channel. Each output channel connects to $\frac{C}{RG}$ hidden channels between the two group-adaptive convolutions, and each hidden channel connects to $\frac{C}{G}$ input channels, so $E = \frac{C^{2}}{R G^{2}}$. When we fix the computational complexity $\mathcal{O}=\frac{2 C^{2}}{R G}$ and the reduction ratio $R$, the number of channels $C$ and the connectivity $E$ change in opposite directions as $G$ varies:

$$C=\sqrt{\frac{\mathcal{O} R G}{2}}, \quad E=\frac{\mathcal{O}}{2 G} \tag{3}$$

as shown in Figure 3: as the number of groups $G$ increases, $C$ increases but $E$ decreases. The two curves intersect ($C = E$) when $G=\sqrt{C / R}$, i.e., when Eq. 2 holds, at which point each output channel connects to every input channel exactly once. Mathematically, the resulting convolution matrix $W$ is divided into $G \times G$ blocks, each of rank 1 (see Figure 2-left).
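
The trade-off in Eqs. (2)-(3) can be checked numerically. The snippet below plugs in the Figure 2-left example (C = 18, R = 2, G = 3) and confirms that C = E at G = √(C/R); it only illustrates the formulas and is not code from the paper.

```python
# Illustrative check of the channel/connectivity trade-off in Eqs. (2)-(3).
import math

def channels_and_connectivity(cost: float, R: float, G: float):
    C = math.sqrt(cost * R * G / 2)  # Eq. (3): number of channels
    E = cost / (2 * G)               # Eq. (3): connections per output channel
    return C, E

C, R, G = 18, 2, 3
cost = 2 * C * C / (R * G)           # O = 2C^2/(RG) = 108 multiply-adds per position
print(channels_and_connectivity(cost, R, G))   # (18.0, 18.0): C = E at G = sqrt(C/R)
print(channels_and_connectivity(cost, R, 6))   # more channels, fewer connections
print(channels_and_connectivity(cost, R, 1))   # fewer channels, more connections
```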

Micro-Factorized Depthwise Convolution: As shown in Figure 2-middle, we factorize a $k \times k$ depthwise convolution kernel into a $k \times 1$ kernel and a $1 \times k$ kernel. This takes the same mathematical form as Micro-Factorized pointwise convolution (Eq. 1): the $k \times k$ kernel matrix $W$ of each channel is decomposed into a $k \times 1$ vector $P$ and a $1 \times k$ vector $Q^{T}$, while $\boldsymbol{\Phi}$ is a scalar equal to 1. This low-rank approximation reduces the computational complexity from $\mathcal{O}(k^{2}C)$ to $\mathcal{O}(kC)$.
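
A minimal PyTorch sketch of this factorization: the $k \times k$ depthwise convolution is replaced by a $k \times 1$ depthwise convolution followed by a $1 \times k$ one. Layer names and padding choices are illustrative.

```python
# Minimal sketch of Micro-Factorized depthwise convolution.
import torch
import torch.nn as nn

class MicroFactorizedDepthwise(nn.Module):
    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        self.vertical = nn.Conv2d(channels, channels, kernel_size=(k, 1),
                                  padding=(k // 2, 0), groups=channels, bias=False)
        self.horizontal = nn.Conv2d(channels, channels, kernel_size=(1, k),
                                    padding=(0, k // 2), groups=channels, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # cost drops from O(k^2 * C) to O(k * C) multiply-adds per position
        return self.horizontal(self.vertical(x))

print(MicroFactorizedDepthwise(8, k=5)(torch.randn(1, 8, 56, 56)).shape)
# torch.Size([1, 8, 56, 56])
```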

Combining Micro-Factorized Pointwise and Depthwise Convolutions: We combine them in two different ways: (a) a regular combination and (b) a lite combination. The former simply concatenates the two convolutions. The lite combination uses Micro-Factorized depthwise convolution to expand the number of channels by applying multiple spatial filters per channel, and then applies a single group-adaptive convolution to fuse and compress the channels (see Figure 2-right). Compared with its regular counterpart, the lite combination is more effective at lower layers, because it saves computation on channel fusion (pointwise) to afford learning more spatial filters (depthwise).
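
A rough sketch of the lite combination is given below. The expansion ratio and group number are illustrative assumptions; the point is the ordering — a factorized depthwise stage that expands channels, followed by a single group-adaptive pointwise convolution that fuses and compresses them.

```python
# Rough sketch of the "lite" combination; hyperparameters are illustrative.
import torch
import torch.nn as nn

class LiteCombination(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 3,
                 expand: int = 2, groups: int = 4):
        super().__init__()
        mid = in_ch * expand
        # factorized depthwise stage that expands channels (groups = in_ch)
        self.dw1 = nn.Conv2d(in_ch, mid, (k, 1), padding=(k // 2, 0),
                             groups=in_ch, bias=False)
        self.dw2 = nn.Conv2d(mid, mid, (1, k), padding=(0, k // 2),
                             groups=mid, bias=False)
        # single group-adaptive pointwise conv fuses and compresses channels
        self.pw = nn.Conv2d(mid, out_ch, 1, groups=groups, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pw(self.dw2(self.dw1(x)))

y = LiteCombination(8, 12, k=3, expand=2, groups=4)(torch.randn(1, 8, 112, 112))
print(y.shape)  # torch.Size([1, 12, 112, 112])
```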

3.3. Dynamic Shift-Max

We now introduce Dynamic Shift-Max, a new activation function that strengthens nonlinearity. It dynamically fuses an input feature map with its circular group shifts, in which channels are shifted across groups. Dynamic Shift-Max also strengthens connections between groups, complementing Micro-Factorized pointwise convolution, which focuses on connections within each group.

Definition: Let $x = \{x_{i}\}$ $(i = 1, \ldots, C)$ denote an input vector (or tensor) whose $C$ channels are divided into $G$ groups, each containing $\frac{C}{G}$ channels. Its circular shift by $N$ channels can be written as $x_{N}(i)=x_{(i+N) \bmod C}$. We define the corresponding group circular shift as:

$$x_{\frac{C}{G}}(i, j)=x_{\left(i+j \frac{C}{G}\right) \bmod C}, \quad j=0, \ldots, G-1 \tag{4}$$

where $x_{\frac{C}{G}}(i,j)$ shifts channel $i$ by $j$ groups. Dynamic Shift-Max combines multiple ($J$) group shifts as follows:

$$y_{i}=\max _{1 \leq k \leq K}\left\{\sum_{j=0}^{J-1} a_{i, j}^{k}(\boldsymbol{x})\, x_{\frac{C}{G}}(i, j)\right\} \tag{5}$$

where the parameters $a_{i,j}^{k}(x)$ adapt to the input $x$ through a hyper-function, which can be easily implemented with two fully connected layers after average pooling, similar to Squeeze-and-Excitation [13].
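
The following PyTorch sketch implements Eqs. (4)-(5) directly: the hyper-function (average pooling plus two 1×1 layers) produces the $C \cdot J \cdot K$ coefficients, the $J$ group circular shifts are realized with torch.roll, and the output takes the maximum over $K$ fusions. The reduction ratio of the hyper-function and the layer names are assumptions, not the paper's exact configuration.

```python
# Simplified sketch of Dynamic Shift-Max (Eqs. 4-5); illustrative only.
import torch
import torch.nn as nn

class DynamicShiftMax(nn.Module):
    def __init__(self, channels: int, groups: int, J: int = 2, K: int = 2,
                 reduction: int = 4):
        super().__init__()
        self.C, self.G, self.J, self.K = channels, groups, J, K
        # hyper-function: avg-pool + two 1x1 layers -> C*J*K coefficients a_{i,j}^k(x)
        self.hyper = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels * J * K, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        a = self.hyper(x).view(n, self.K, self.J, c, 1, 1)        # a_{i,j}^k(x)
        # group circular shifts: x_{C/G}(i, j) = x_{(i + j*C/G) mod C}
        shifts = [torch.roll(x, shifts=-(j * c // self.G), dims=1)
                  for j in range(self.J)]
        shifted = torch.stack(shifts, dim=1)                      # (n, J, c, h, w)
        fused = (a * shifted.unsqueeze(1)).sum(dim=2)             # (n, K, c, h, w)
        return fused.max(dim=1).values                            # max over K fusions

dsm = DynamicShiftMax(channels=16, groups=4, J=2, K=2)
print(dsm(torch.randn(2, 16, 28, 28)).shape)  # torch.Size([2, 16, 28, 28])
```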

Nonlinearity: Dynamic Shift-Max provides two types of nonlinearity: (a) it outputs the maximum of $K$ different fusions of $J$ groups, and (b) the parameters $a_{i,j}^{k}(x)$ are not static but functions of the input $x$. These give Dynamic Shift-Max stronger representation power, compensating for the reduced number of layers. The recent Dynamic ReLU [6] is a special case of Dynamic Shift-Max ($J = 1$), in which each channel is activated individually.

Connectivity: Dynamic Shift-Max improves connectivity between channel groups. It complements Micro-Factorized pointwise convolution, which focuses on connectivity within each group. Figure 4 shows that even a static group shift $\left(y_{i}=a_{i, 0}\, x_{\frac{C}{G}}(i, 0)+a_{i, 1}\, x_{\frac{C}{G}}(i, 1)\right)$ can effectively increase the rank of the Micro-Factorized pointwise convolution. By inserting it between the two group-adaptive convolutions, the rank of each block in the resulting convolution matrix $W$ (a $G \times G$ block matrix) increases from 1 to 2. Note that the static group shift is a simple special case of Dynamic Shift-Max with $K = 1$, $J = 2$, and static $a_{i,j}^{k}$.

Computational complexity: Dynamic Shift-Max generates $CJK$ parameters $a_{i,j}^{k}(x)$ from the input $x$. Its computational complexity consists of three parts: (a) average pooling, $\mathcal{O}(HWC)$; (b) generating the parameters $a_{i,j}^{k}(x)$ in Eq. 5, $\mathcal{O}(C^{2}JK)$; and (c) applying Dynamic Shift-Max at every channel and every spatial location, $\mathcal{O}(HWCJK)$. It is lightweight when $J$ and $K$ are small. Empirically, $J = 2$ and $K = 2$ achieve a good trade-off.

3.4. Relationship with Previous Work

MicroNet is related to two popular efficient networks (MobileNet [12, 26, 11] and ShuffleNet [42, 24]). It shares the inverted bottleneck structure with MobileNet and the use of group convolution with ShuffleNet. In contrast, MicroNet differs from both in its convolution and its activation function. First, it factorizes the pointwise convolution into group-adaptive convolutions, where the number of groups adapts to the number of channels as $G=\sqrt{C / R}$. Second, it factorizes the depthwise convolution. Finally, it proposes a new activation function (Dynamic Shift-Max) to improve inter-group connectivity and nonlinearity.

4. MicroNet Architecture

We now describe the architectures of four MicroNet models, ranging from 6M to 44M FLOPs. They consist of three types of Micro-Blocks (see Figure 5), which combine Micro-Factorized pointwise and depthwise convolutions in different ways. All of them use Dynamic Shift-Max as the activation function. Details are as follows:

Micro-Block-A: As shown in Figure 5a, Micro-Block-A uses the lite combination of Micro-Factorized pointwise and depthwise convolutions (see Figure 2-right). It is effective at lower layers with higher resolution (e.g., 112×112 or 56×56). Note that the number of channels is expanded by the Micro-Factorized depthwise convolution and compressed by a group-adaptive convolution.

Micro-Block-B: Micro-Block-B is used to connect Micro-Block-A and Micro-Block-C. Unlike Micro-Block-A, it uses the full Micro-Factorized pointwise convolution, which includes two group-adaptive convolutions (see Figure 5b): the former compresses the number of channels and the latter expands it. Each MicroNet has exactly one Micro-Block-B (see Table 1).

Micro-Block-C: Micro-Block-C (see Figure 5c) uses the regular combination that concatenates Micro-Factorized depthwise and pointwise convolutions. It is used at higher layers (see Table 1), because it spends more computation on channel fusion (pointwise) than the lite combination. A skip connection is used when the dimensions match.

Each Micro-Block has four hyperparameters: the kernel size $k$, the number of output channels $C$, the reduction ratio $R$ in the bottleneck of the Micro-Factorized pointwise convolution, and the pair of group numbers of the two group-adaptive convolutions $(G_1, G_2)$. Note that we relax Eq. 2 to $G_1 G_2 = C/R$ and find near-integer solutions.
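
As an illustration of how such a pair might be chosen, the helper below searches for integers $G_1, G_2$ with $G_1 G_2$ close to $C/R$ and both close to $\sqrt{C/R}$; this is our own approximation of the "near-integer solution" step, not the authors' procedure.

```python
# Hypothetical helper for picking a near-integer group pair (G1, G2).
import math

def pick_groups(C: int, R: int):
    target = C / R
    best = None
    for g1 in range(1, int(target) + 1):
        if C % g1:                    # groups should divide the channel count
            continue
        g2 = round(target / g1)
        if g2 < 1 or C % g2:
            continue
        score = abs(g1 * g2 - target) + abs(g1 - g2)  # near C/R, near-square
        if best is None or score < best[0]:
            best = (score, g1, g2)
    return best[1], best[2]

print(pick_groups(C=18, R=2))   # (3, 3), matching G = sqrt(C/R) = 3
```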

Stem layer: We redesign the stem layer to fit the low-FLOP constraint. It consists of a 3×1 convolution and a 1×3 group convolution, followed by ReLU; the second convolution expands the number of channels $R$ times. This saves considerable computation: for example, the stem layer of MicroNet-M3 (see Table 1) costs only 1.5M MAdds.
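
A hedged sketch of such a stem is shown below. The channel counts, strides, and group number are illustrative guesses rather than the exact M0-M3 settings; it only demonstrates a 3×1 convolution followed by a 1×3 group convolution and ReLU.

```python
# Hedged sketch of the redesigned stem layer; numbers are illustrative.
import torch
import torch.nn as nn

class MicroStem(nn.Module):
    def __init__(self, out_ch: int = 8, expand: int = 4, stride: int = 2):
        super().__init__()
        mid = out_ch // expand                      # narrow first stage
        self.conv1 = nn.Conv2d(3, mid, kernel_size=(3, 1), stride=(stride, 1),
                               padding=(1, 0), bias=False)
        # group convolution that expands channels by `expand`
        self.conv2 = nn.Conv2d(mid, out_ch, kernel_size=(1, 3), stride=(1, stride),
                               padding=(0, 1), groups=mid, bias=False)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.conv2(self.conv1(x)))

print(MicroStem()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 8, 112, 112])
```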

Four MicroNet models (M0-M3): We design four models (M0, M1, M2, M3) with different computational costs (6M, 12M, 21M, 44M MAdds). Table 1 shows their full specifications. These networks follow the same pattern from low to high layers: stem layer → Micro-Block-A → Micro-Block-B → Micro-Block-C. Note that all models are designed manually, without neural architecture search (NAS).

5. Experiments: ImageNet Classification

We evaluate the four MicroNet models (M0-M3) and perform comprehensive ablations on ImageNet classification [8]. ImageNet has 1000 classes, with 1,281,167 images for training and 50,000 images for validation.

5.1. Implementation Details

Training strategies: Each model is trained in two ways: (a) standalone and (b) mutual learning. The former is straightforward: the model learns by itself. In the latter, each MicroNet learns jointly with a full-rank counterpart that shares the same network width and depth but replaces the Micro-Factorized pointwise and depthwise convolutions with the original pointwise and depthwise ($k \times k$) convolutions. KL divergence is used to encourage the MicroNet to learn from its full-rank counterpart.
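
A minimal sketch of this mutual-learning objective: the MicroNet's cross-entropy loss is augmented with a KL term toward the counterpart's predictions. The temperature, loss weight, and one-directional (detached) form below are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of a KL-based mutual-learning loss.
import torch
import torch.nn.functional as F

def mutual_learning_loss(student_logits, partner_logits, targets,
                         kl_weight: float = 1.0, T: float = 1.0):
    ce = F.cross_entropy(student_logits, targets)
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(partner_logits.detach() / T, dim=1),
                  reduction="batchmean") * (T * T)
    return ce + kl_weight * kl

logits_s = torch.randn(4, 1000)          # MicroNet outputs
logits_p = torch.randn(4, 1000)          # full-rank counterpart outputs
labels = torch.randint(0, 1000, (4,))
print(mutual_learning_loss(logits_s, logits_p, labels))
```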

Training setup: All models are trained with the SGD optimizer with momentum 0.9. The image resolution is 224×224. We use a mini-batch size of 512 and an initial learning rate of 0.02. Each model is trained for 600 epochs with cosine learning rate decay. For the smaller MicroNets (M0 and M1), the weight decay is 3e-5 and the dropout rate is 0.05. For the larger models (M2 and M3), the weight decay is 4e-5 and the dropout rate is 0.1. Label smoothing (0.1) and Mixup [41] (0.2) are used for MicroNet-M3 to avoid overfitting.
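
For reference, the reported optimizer and schedule map to roughly the following PyTorch setup; the model and the training loop body are placeholders, not the authors' training script.

```python
# Rough mapping of the reported training settings to PyTorch.
import torch

model = torch.nn.Linear(10, 1000)  # placeholder for a MicroNet model
optimizer = torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9,
                            weight_decay=3e-5)   # 4e-5 for M2/M3
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=600)

for epoch in range(600):
    # ... one training pass over ImageNet (batch size 512, 224x224) would go here ...
    scheduler.step()
```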

5.2. Main Results


Table 2 compares MicroNets at four different computational costs with the state of the art on ImageNet classification. MicroNets significantly outperform all prior work under all four FLOP constraints. For example, without mutual learning, MicroNets outperform MobileNetV3 by 9.6%, 9.6%, 6.1%, and 4.4% at 6M, 12M, 21M, and 44M FLOPs, respectively. When trained with mutual learning, all four MicroNets consistently gain about 1.5% top-1 accuracy. Our method achieves 53.0% top-1 accuracy at 6M FLOPs, 3.2% higher than MobileNetV3 with twice the complexity (12M FLOPs). Compared with recent MobileNet and ShuffleNet variants such as GhostNet [10], WeightNet [23], and ButterflyTransform [33], our method achieves more than 5% higher top-1 accuracy at similar FLOPs. This demonstrates that MicroNet handles extremely low FLOPs effectively.

5.3. Ablation Studies

We run a number of ablations to analyze MicroNet. MicroNet-M1 (12M FLOPs) is used for all ablations, and each model is trained for 300 epochs. The default hyperparameters of Dynamic Shift-Max are set to J = 2 and K = 2.

From MobileNet to MicroNet: Table 3 shows the path from MobileNet to our MicroNet; both share the inverted bottleneck structure. Here, we modify MobileNetV2 [26] (without SE [13]) to have similar complexity (10.5M MAdds) and compare three Micro-Factorized Convolution variants (rows 2-4). Micro-Factorized pointwise and depthwise convolutions and their lite combination at lower layers gradually improve top-1 accuracy from 44.9% to 51.7%. Furthermore, using static and dynamic Shift-Max gains an additional 2.7% and 6.8% top-1 accuracy, respectively, at a small extra cost. This shows that the proposed Micro-Factorized Convolution and Dynamic Shift-Max are both effective and complementary in handling extremely low computational cost.

Number of groups G: Micro-Factorized pointwise convolution includes two group-adaptive convolutions, whose number of groups is relaxed from Eq. 2 ($G=\sqrt{C / R}$) so that a nearby integer can be chosen. Table 4a compares it with counterparts that have similar structure and FLOPs (about 10.5M MAdds) but use a fixed number of groups. Group-adaptive convolution achieves higher accuracy, demonstrating a good balance between the number of channels and the input/output connectivity.

Table 4b compares different choices for the number of adaptive groups, controlled by a multiplier $\lambda$ such that $G=\lambda \sqrt{C / R}$. A larger $\lambda$ corresponds to more channels but fewer input/output connections (see Figure 3). A good balance is achieved when $\lambda$ is between 0.5 and 1. When $\lambda$ increases (more channels but less connectivity) or decreases (fewer channels but more connectivity), the top-1 accuracy drops. We therefore use $\lambda = 1$ for the rest of the paper. Note that all models in Table 4b have similar computational cost (about 10.5M MAdds).

Lite combination at different layers: Table 4c compares using the lite combination of Micro-Factorized pointwise and depthwise convolutions (see Figure 2-right) at different layers. Using it only at the lower layers achieves the highest accuracy, validating that the lite combination is more effective at lower layers: compared with the regular combination, it saves computation on channel fusion (pointwise) to afford learning more spatial filters (depthwise).

Comparison with other activation functions: We compare Dynamic Shift-Max with three existing activation functions: ReLU [25], SE+ReLU [13], and Dynamic ReLU [6]. The results are shown in Table 5. Our new Shift-Max clearly outperforms the other three (by 2.5%), demonstrating its superiority. Note that Dynamic ReLU is a special case of Dynamic Shift-Max with J = 1 (see Eq. 5).

Dynamic Shift-Max at different layers: Table 6 shows the top-1 accuracy when Dynamic Shift-Max is used at three different layers within a Micro-Block (see Figure 5). Using it at more layers yields consistent improvements, and the best accuracy is achieved when it is used at all three layers. If only one layer is allowed to use Dynamic Shift-Max, we recommend placing it after the depthwise convolution.

Hyperparameters of Dynamic Shift-Max: Table 7 shows the results of using different combinations of K and J (in Eq. 5). For K = 1, we add a ReLU, since only one element remains inside the max operator. The baseline in the first row (J = 1, K = 1) is equivalent to SE+ReLU [13]. With J = 2 fixed (fusing two groups), taking the maximum of two fusions (K = 2) is better than a single fusion (K = 1). Adding a third fusion does not help, since it is largely covered by the other two while introducing more parameters. With K = 2 fixed (maximum of two fusions), involving more groups (larger J) is consistently better but costs more FLOPs. J = 2 and K = 2 achieve a good trade-off, where the 4.1% gain comes at an extra 1.5M MAdds.

6. MicroNet for Pixel-Level Classification

MicroNet is effective not only for image-level classification but also for pixel-level tasks. In this section, we demonstrate its application to human pose estimation and semantic segmentation.

6.1. Human Pose Estimation

We evaluate MicroNet on single-person keypoint detection using the COCO 2017 dataset [21]. Our models are trained on train2017, which includes 57K images and 150K person instances annotated with 17 keypoints. We evaluate on val2017, which contains 5000 images, and report the average precision (AP) over 10 object keypoint similarity (OKS) thresholds as the metric.

Implementation details: Similar to image classification, we have four MicroNet models (M0-M3) for keypoint detection at different FLOPs. We adapt the models to the keypoint detection task by doubling (×2) the resolution of a selected set of blocks (e.g., all blocks with stride 32); the selection differs across MicroNet models (see Appendix 8.1 for details). Each model has a head containing three Micro-Blocks (one at stride 8 and two at stride 4) and a pointwise convolution that generates the heatmaps of the 17 keypoints. We use bilinear upsampling to increase the resolution in the head and apply spatial attention [6] at each level.

Training setup: We follow the training setup in [28]. Human detection boxes are cropped and resized to 256×192. Data augmentation includes random rotation ([−45°, 45°]), random scaling ([0.65, 1.35]), flipping, and half-body augmentation. All models are trained from scratch for 250 epochs with the Adam optimizer [17]. The initial learning rate is set to 1e-3 and dropped to 1e-4 and 1e-5 at the 210th and 240th epochs, respectively.

Testing: The two-stage top-down paradigm [36, 28] is used for testing: detect person instances and then predict keypoints. We use the same person detector provided by [36]. The heatmaps of the original and flipped images are averaged, and each keypoint is predicted by adjusting the highest-response location with a quarter offset toward the second highest response.
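
A hedged sketch of this decoding step, following the common SimpleBaseline-style practice (argmax plus a quarter-pixel shift toward the larger neighboring response); it is not the paper's exact code.

```python
# Hedged sketch of heatmap decoding with a quarter-pixel offset.
import numpy as np

def decode_heatmap(heatmap: np.ndarray) -> np.ndarray:
    """heatmap: (K, H, W), already averaged over the original and flipped images."""
    K, H, W = heatmap.shape
    coords = np.zeros((K, 2), dtype=np.float32)
    for k in range(K):
        y, x = divmod(int(np.argmax(heatmap[k])), W)
        coords[k] = (x, y)
        # quarter-pixel offset toward the larger neighboring response
        if 0 < x < W - 1:
            coords[k, 0] += 0.25 * np.sign(heatmap[k, y, x + 1] - heatmap[k, y, x - 1])
        if 0 < y < H - 1:
            coords[k, 1] += 0.25 * np.sign(heatmap[k, y + 1, x] - heatmap[k, y - 1, x])
    return coords

print(decode_heatmap(np.random.rand(17, 64, 48)).shape)  # (17, 2)
```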

Main results: Table 8 compares MicroNets with prior work [6, 5] on efficient pose estimation at computational costs below 850 MFLOPs. Both of these works use the MobileNet inverted residual bottleneck block in the backbone and head, and show significant improvements by adapting the parameters of convolutions [5] and activation functions [6] to the input. Our MicroNet-M3 uses only 33% of their FLOPs but achieves comparable performance, demonstrating that our method is also effective for keypoint detection. In addition, MicroNet-M2, M1, and M0 provide good baselines for keypoint detection at even lower computational complexity, ranging from 77M to 163M FLOPs.

6.2. Semantic Segmentation

We conduct experiments on the Cityscapes dataset [7] with fine annotations to evaluate MicroNet for semantic segmentation. Our models are trained on the fine training set, which includes 2,975 images, and evaluated on the validation set, which contains 500 images, using mIoU as the metric.

Implementation details: We modify the four MicroNet models (M0-M3) as backbones by increasing the resolution of all blocks with stride 32 to stride 16, similar to MobileNetV3 [11]. Our models have very low computational cost at 1024×2048 image resolution, ranging from 0.8B to 2.5B FLOPs. For the segmentation head, we follow the Lite Reduced Atrous Spatial Pyramid Pooling (LR-ASPP) design [11]: the feature map is bilinearly upsampled by 2×, spatial attention is applied, and the result is merged with the stride-8 feature map from the backbone. We replace the 1×1 convolutions with Micro-Factorized Convolution to make LR-ASPP even lighter, and call it Micro-Reduced ASPP (MR-ASPP).

Training setup: All models are randomly initialized and trained for 240 epochs. The initial learning rate is set to 0.2 and decays to 1e-4 with a cosine schedule. The weight decay is set to 4e-5. We use the data augmentation in [4].

Main results: Table 9 reports the mIoU of all four MicroNets. Compared with MobileNetV3 (68.4 mIoU at 2.90B MAdds), our MicroNet-M3 is more accurate (69.1 mIoU) at a lower computational cost (2.52B MAdds). This demonstrates the advantage of our method for semantic segmentation. In addition, our MicroNet-M2, M1, and M0 provide good baselines for semantic segmentation at even lower FLOPs, from 1.75B down to 0.81B MAdds.

7. Conclusion

In this paper, we presented MicroNet to handle extremely low computational cost. It builds on two proposed operators: Micro-Factorized Convolution and Dynamic Shift-Max. The former approximates pointwise and depthwise convolutions while balancing the number of channels against the input/output connectivity. The latter dynamically fuses consecutive channel groups, strengthening both node connectivity and nonlinearity to compensate for the reduced depth. A family of MicroNets achieves solid improvements at very low FLOPs on three tasks: image classification, human pose estimation, and semantic segmentation. We hope this work provides good baselines for efficient CNNs across multiple vision tasks.
