
MicroNet: Image Recognition with Extremely Low FLOPs

2022-06-09 10:21:00 AI Hao

Abstract


In this paper, we present MicroNet, an efficient convolutional neural network that operates at extremely low computational cost (e.g., 6 MFLOPs for ImageNet classification). Such low-cost networks are highly desirable on edge devices, but they usually suffer from significant performance degradation. We handle the extremely low FLOP regime with two design principles: (a) avoid reducing the network width by lowering node connectivity, and (b) compensate for the reduced network depth by introducing more complex nonlinearity per layer. First, we propose Micro-Factorized Convolution, which factorizes pointwise and depthwise convolutions into low-rank matrices to achieve a good trade-off between the number of channels and the input/output connectivity. Second, we propose a new activation function, named Dynamic Shift-Max, which improves nonlinearity by taking the maximum over multiple dynamic fusions between an input feature map and its circular channel shifts. The fusions are dynamic because their parameters adapt to the input. Building on Micro-Factorized Convolution and Dynamic Shift-Max, a family of MicroNets achieves significant performance gains over the state of the art in the low-FLOP regime. For example, MicroNet-M1 achieves 61.1% top-1 accuracy on ImageNet classification with 12 MFLOPs, outperforming MobileNetV3 by 11.3%.

1. Introduction

Designing efficient CNN architectures [15, 12, 26, 11, 42, 24, 29] has recently been an active research area. These efforts enable high-quality services on edge devices. However, when the computational cost becomes extremely low, even state-of-the-art efficient CNNs (e.g., MobileNetV3 [11]) suffer significant performance degradation. For example, when MobileNetV3 is constrained from 112M to 12M MAdds for classifying 224×224 images [8], its top-1 accuracy drops from 71.7% to 49.8%. This makes adoption on low-power devices (e.g., IoT devices) even harder. In this paper, we tackle an even more challenging problem by cutting the budget in half: can we classify images over 1,000 categories at 224×224 resolution with no more than 6 MFLOPs?

Such an extremely low computational cost (6M FLOPs) requires a careful redesign of every layer. For instance, even a thin stem layer with a single 3×3 convolution over a 112×112 grid (stride 2), 3 input channels, and 8 output channels already costs 2.7M MAdds. The remaining budget is far too limited for the convolutions and the 1000-class classifier to learn good representations. A naive strategy to fit existing efficient CNNs (e.g., MobileNet [12, 26, 11] and ShuffleNet [42, 24]) into such a low budget is to drastically reduce the network width or depth, which leads to severe performance degradation.

We propose a new architecture, named MicroNet, to handle extremely low FLOPs. It is built on two design principles:

  • Avoid reducing the network width by lowering node connectivity.
  • Compensate for the reduced network depth by improving the nonlinearity of each layer.

These principles guide the design of more efficient convolutions and activation functions.

First, we propose Micro-Factorized Convolution, which factorizes pointwise and depthwise convolutions into low-rank matrices. This strikes a good balance between the number of channels and the input/output connectivity. Specifically, we design group-adaptive convolution to factorize the pointwise convolution, where the number of groups adapts to the number of channels via a square-root relationship. Stacking two group-adaptive convolutions essentially approximates the pointwise convolution matrix by a block matrix whose blocks have rank 1. The factorization of the depthwise convolution is a straightforward rank-1 decomposition of a k×k depthwise kernel into a 1×k and a k×1 depthwise convolution. We show that properly combining these two approximations at different layers significantly reduces the computational cost without sacrificing the number of channels.

Second, we propose a new activation function, named Dynamic Shift-Max, which improves nonlinearity in two ways: (a) it takes the maximum over multiple fusions between the input feature map and its circular channel shifts, and (b) each fusion is dynamic, since its parameters adapt to the input. Moreover, it strengthens both node connectivity and nonlinearity within a single function at low computational cost.

Experimental results show that MicroNets outperform the state of the art by a large margin (see Figure 1). For example, compared with MobileNetV3, our method improves top-1 accuracy on ImageNet classification by 11.3% and 7.7% under the 12M and 21M FLOP constraints, respectively. Under the challenging 6 MFLOPs constraint, our method achieves 53.0% top-1 accuracy, 3.2% higher than MobileNetV3 with twice the complexity (12 MFLOPs). In addition, the MicroNet family provides strong baselines at very low computational cost for two pixel-level tasks: semantic segmentation and keypoint detection.

2. Related Work

Efficient CNNs: MobileNets [12, 26, 11] decompose a k×k convolution into a depthwise and a pointwise convolution. ShuffleNets [42, 24] use group convolution and channel shuffle to simplify the pointwise convolution. [33] approximates the pointwise convolution with a butterfly transform. EfficientNet [29, 31] finds a proper relationship between input resolution and network width/depth. MixNet [30] mixes multiple kernel sizes in a single convolution. AdderNet [2] trades massive multiplications for cheaper additions. GhostNet [10] generates ghost feature maps via cheap linear transformations. Sandglass [43] flips the inverted residual block structure to reduce information loss. [39] and [1] train one network that supports multiple sub-networks.

Dynamic neural networks: Dynamic networks improve representation power by adapting parameters to the input. HyperNet [9] uses another network to generate parameters for the main network. SENet [13] reweights channels by squeezing global context. SKNet [18] adapts attention over kernels of different sizes. Dynamic convolution [37, 5] aggregates multiple convolution kernels according to attention. Dynamic ReLU [6] adapts the slopes and intercepts of the two linear functions in ReLU [25, 16]. [23] directly generates convolution weights with grouped fully connected layers. [3] extends dynamic convolution from spatially agnostic to spatially specific. [27] proposes dynamic group convolution that adaptively groups input channels. [32] applies dynamic convolution to instance segmentation. [19] learns dynamic routing across scales for semantic segmentation.

3. Our Approach: MicroNet

In this section, we describe MicroNet's design principles and key components in detail.

3.1. Design Principles

Extremely low FLOPs limit both the network width (number of channels) and the network depth (number of layers); we analyze the two separately. If we view a convolution layer as a graph, the connections (edges) between input and output channels (nodes) are weighted by the kernel parameters. Here, we define connectivity as the number of connections per output node, so the total number of connections equals the number of output channels multiplied by the connectivity. When the computational cost (proportional to the number of connections) is fixed, the number of channels conflicts with the connectivity. We argue that a good balance between them effectively avoids channel reduction and improves the representation power of a layer. Hence our first design principle: avoid reducing the network width by lowering node connectivity. We achieve this by factorizing the pointwise and depthwise convolutions at a finer scale.

When the network depth (number of layers) is significantly reduced, its nonlinearity (encoded in ReLU) is constrained, resulting in significant performance degradation. This motivates our second design principle: compensate for the reduced network depth by improving the nonlinearity of each layer. We design a new activation function, Dynamic Shift-Max, to achieve this.

3.2. Micro-Factorized Convolution

We factorize both pointwise and depthwise convolutions at a finer scale, hence the name Micro-Factorized Convolution. The goal is to balance the number of channels against the input/output connectivity.

Micro-Factorized Pointwise Convolution: We propose group-adaptive convolution to factorize the pointwise convolution. For brevity, we assume that the convolution kernel $W$ has the same number of input and output channels ($C_{in} = C_{out} = C$) and ignore the bias. The kernel matrix $W$ is factorized into two group-adaptive convolutions, where the number of groups $G$ depends on the number of channels $C$. Mathematically, this is written as:

$$\boldsymbol{W}=\boldsymbol{P} \boldsymbol{\Phi} \boldsymbol{Q}^{T} \tag{1}$$

where $W$ is a $C \times C$ matrix. $Q$ has shape $C \times \frac{C}{R}$, compressing the number of channels by a ratio $R$. $P$ has shape $C \times \frac{C}{R}$, expanding the number of channels back to $C$ for the output. $P$ and $Q$ are block-diagonal matrices with $G$ blocks, each block corresponding to the convolution of one group. $\boldsymbol{\Phi}$ is a $\frac{C}{R} \times \frac{C}{R}$ permutation matrix, similar to the channel shuffle in [42]. The computational complexity is $\mathcal{O}=\frac{2 C^{2}}{R G}$. Figure 2-left shows an example with $C = 18$, $R = 2$, and $G = 3$.
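
To make the factorization concrete, here is a minimal PyTorch sketch of Micro-Factorized pointwise convolution: two group-adaptive 1×1 convolutions with a channel permutation (implemented here as a ShuffleNet-style channel shuffle) in between. The class name and the choice of shuffle as the permutation $\Phi$ are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch of Micro-Factorized pointwise convolution (Eq. 1),
# assuming C_in = C_out = C. Illustrative only.
import torch
import torch.nn as nn

class MicroFactorizedPointwise(nn.Module):
    def __init__(self, channels: int, reduction: int = 2, groups: int = 3):
        super().__init__()
        hidden = channels // reduction  # C/R hidden channels
        # Q^T: group-adaptive 1x1 conv that compresses C -> C/R
        self.compress = nn.Conv2d(channels, hidden, kernel_size=1,
                                  groups=groups, bias=False)
        # Phi: permutation between the two group convolutions,
        # realized as a channel shuffle (similar to ShuffleNet)
        self.groups = groups
        # P: group-adaptive 1x1 conv that expands C/R -> C
        self.expand = nn.Conv2d(hidden, channels, kernel_size=1,
                                groups=groups, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.compress(x)
        n, c, h, w = x.shape
        # channel shuffle: interleave channels across the G groups
        x = x.view(n, self.groups, c // self.groups, h, w)
        x = x.transpose(1, 2).reshape(n, c, h, w)
        return self.expand(x)

# Example matching Figure 2-left: C = 18, R = 2, G = 3
block = MicroFactorizedPointwise(channels=18, reduction=2, groups=3)
print(block(torch.randn(1, 18, 56, 56)).shape)  # torch.Size([1, 18, 56, 56])
```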

Note that the number of groups $G$ is not fixed; it adapts to the number of channels $C$ and the reduction ratio $R$ as:

$$G=\sqrt{C / R} \tag{2}$$

This square-root relationship is derived from the balance between the number of channels $C$ and the input/output connectivity. Here, we define the connectivity $E$ as the number of input/output connections per output channel. Each output channel connects to $\frac{C}{RG}$ hidden channels between the two group-adaptive convolutions, and each hidden channel connects to $\frac{C}{G}$ input channels, so $E = \frac{C^{2}}{R G^{2}}$. When we fix the computational complexity $\mathcal{O}=\frac{2 C^{2}}{R G}$ and the reduction ratio $R$, the number of channels $C$ and the connectivity $E$ change in opposite directions as $G$ varies:

$$C=\sqrt{\frac{\mathcal{O} R G}{2}}, \quad E=\frac{\mathcal{O}}{2 G} \tag{3}$$

as shown in Figure 3: as the number of groups $G$ increases, $C$ increases but $E$ decreases. The two curves intersect ($C = E$) when $G=\sqrt{C / R}$, i.e., when Eq. 2 holds, at which point each output channel connects to every input channel exactly once. Mathematically, the resulting convolution matrix $W$ is divided into $G \times G$ blocks, each of rank 1 (see Figure 2-left).
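
The trade-off in Eqs. (2)-(3) can be checked numerically. The snippet below plugs in the Figure 2-left example (C = 18, R = 2, G = 3) and confirms that C = E at G = √(C/R); it only illustrates the formulas and is not code from the paper.

```python
# Illustrative check of the channel/connectivity trade-off in Eqs. (2)-(3).
import math

def channels_and_connectivity(cost: float, R: float, G: float):
    C = math.sqrt(cost * R * G / 2)  # Eq. (3): number of channels
    E = cost / (2 * G)               # Eq. (3): connections per output channel
    return C, E

C, R, G = 18, 2, 3
cost = 2 * C * C / (R * G)           # O = 2C^2/(RG) = 108 multiply-adds per position
print(channels_and_connectivity(cost, R, G))   # (18.0, 18.0): C = E at G = sqrt(C/R)
print(channels_and_connectivity(cost, R, 6))   # more channels, fewer connections
print(channels_and_connectivity(cost, R, 1))   # fewer channels, more connections
```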

Micro-Factorized Depthwise Convolution: As shown in Figure 2-middle, we factorize a $k \times k$ depthwise convolution kernel into a $k \times 1$ kernel and a $1 \times k$ kernel. This takes the same mathematical form as Micro-Factorized pointwise convolution (Eq. 1): the $k \times k$ kernel matrix $W$ of each channel is decomposed into a $k \times 1$ vector $P$ and a $1 \times k$ vector $Q^{T}$, while $\boldsymbol{\Phi}$ is a scalar equal to 1. This low-rank approximation reduces the computational complexity from $\mathcal{O}(k^{2}C)$ to $\mathcal{O}(kC)$.
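
A minimal PyTorch sketch of this factorization: the $k \times k$ depthwise convolution is replaced by a $k \times 1$ depthwise convolution followed by a $1 \times k$ one. Layer names and padding choices are illustrative.

```python
# Minimal sketch of Micro-Factorized depthwise convolution.
import torch
import torch.nn as nn

class MicroFactorizedDepthwise(nn.Module):
    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        self.vertical = nn.Conv2d(channels, channels, kernel_size=(k, 1),
                                  padding=(k // 2, 0), groups=channels, bias=False)
        self.horizontal = nn.Conv2d(channels, channels, kernel_size=(1, k),
                                    padding=(0, k // 2), groups=channels, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # cost drops from O(k^2 * C) to O(k * C) multiply-adds per position
        return self.horizontal(self.vertical(x))

print(MicroFactorizedDepthwise(8, k=5)(torch.randn(1, 8, 56, 56)).shape)
# torch.Size([1, 8, 56, 56])
```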

Combining Micro-Factorized Pointwise and Depthwise Convolutions: We combine them in two different ways: (a) a regular combination and (b) a lite combination. The former simply concatenates the two convolutions. The lite combination uses Micro-Factorized depthwise convolution to expand the number of channels by applying multiple spatial filters per channel, and then applies a single group-adaptive convolution to fuse and compress the channels (see Figure 2-right). Compared with its regular counterpart, the lite combination is more effective at lower layers, because it saves computation on channel fusion (pointwise) to afford learning more spatial filters (depthwise).
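
A rough sketch of the lite combination is given below. The expansion ratio and group number are illustrative assumptions; the point is the ordering — a factorized depthwise stage that expands channels, followed by a single group-adaptive pointwise convolution that fuses and compresses them.

```python
# Rough sketch of the "lite" combination; hyperparameters are illustrative.
import torch
import torch.nn as nn

class LiteCombination(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 3,
                 expand: int = 2, groups: int = 4):
        super().__init__()
        mid = in_ch * expand
        # factorized depthwise stage that expands channels (groups = in_ch)
        self.dw1 = nn.Conv2d(in_ch, mid, (k, 1), padding=(k // 2, 0),
                             groups=in_ch, bias=False)
        self.dw2 = nn.Conv2d(mid, mid, (1, k), padding=(0, k // 2),
                             groups=mid, bias=False)
        # single group-adaptive pointwise conv fuses and compresses channels
        self.pw = nn.Conv2d(mid, out_ch, 1, groups=groups, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pw(self.dw2(self.dw1(x)))

y = LiteCombination(8, 12, k=3, expand=2, groups=4)(torch.randn(1, 8, 112, 112))
print(y.shape)  # torch.Size([1, 12, 112, 112])
```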

3.3. Dynamic Shift-Max

We now introduce Dynamic Shift-Max, a new activation function that strengthens nonlinearity. It dynamically fuses an input feature map with its circular group shifts, in which channels are shifted across groups. Dynamic Shift-Max also strengthens connections between groups, complementing Micro-Factorized pointwise convolution, which focuses on connections within each group.

Definition: Let $x = \{x_{i}\}$ $(i = 1, \ldots, C)$ denote an input vector (or tensor) whose $C$ channels are divided into $G$ groups, each containing $\frac{C}{G}$ channels. Its circular shift by $N$ channels can be written as $x_{N}(i)=x_{(i+N) \bmod C}$. We define the corresponding group circular shift as:

$$x_{\frac{C}{G}}(i, j)=x_{\left(i+j \frac{C}{G}\right) \bmod C}, \quad j=0, \ldots, G-1 \tag{4}$$

where $x_{\frac{C}{G}}(i,j)$ shifts channel $i$ by $j$ groups. Dynamic Shift-Max combines multiple ($J$) group shifts as follows:

$$y_{i}=\max _{1 \leq k \leq K}\left\{\sum_{j=0}^{J-1} a_{i, j}^{k}(\boldsymbol{x})\, x_{\frac{C}{G}}(i, j)\right\} \tag{5}$$

where the parameters $a_{i,j}^{k}(x)$ adapt to the input $x$ through a hyper-function, which can be easily implemented with two fully connected layers after average pooling, similar to Squeeze-and-Excitation [13].
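
The following PyTorch sketch implements Eqs. (4)-(5) directly: the hyper-function (average pooling plus two 1×1 layers) produces the $C \cdot J \cdot K$ coefficients, the $J$ group circular shifts are realized with torch.roll, and the output takes the maximum over $K$ fusions. The reduction ratio of the hyper-function and the layer names are assumptions, not the paper's exact configuration.

```python
# Simplified sketch of Dynamic Shift-Max (Eqs. 4-5); illustrative only.
import torch
import torch.nn as nn

class DynamicShiftMax(nn.Module):
    def __init__(self, channels: int, groups: int, J: int = 2, K: int = 2,
                 reduction: int = 4):
        super().__init__()
        self.C, self.G, self.J, self.K = channels, groups, J, K
        # hyper-function: avg-pool + two 1x1 layers -> C*J*K coefficients a_{i,j}^k(x)
        self.hyper = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels * J * K, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        a = self.hyper(x).view(n, self.K, self.J, c, 1, 1)        # a_{i,j}^k(x)
        # group circular shifts: x_{C/G}(i, j) = x_{(i + j*C/G) mod C}
        shifts = [torch.roll(x, shifts=-(j * c // self.G), dims=1)
                  for j in range(self.J)]
        shifted = torch.stack(shifts, dim=1)                      # (n, J, c, h, w)
        fused = (a * shifted.unsqueeze(1)).sum(dim=2)             # (n, K, c, h, w)
        return fused.max(dim=1).values                            # max over K fusions

dsm = DynamicShiftMax(channels=16, groups=4, J=2, K=2)
print(dsm(torch.randn(2, 16, 28, 28)).shape)  # torch.Size([2, 16, 28, 28])
```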

Nonlinearity: Dynamic Shift-Max provides two types of nonlinearity: (a) it outputs the maximum of $K$ different fusions of $J$ groups, and (b) the parameters $a_{i,j}^{k}(x)$ are not static but functions of the input $x$. These give Dynamic Shift-Max stronger representation power, compensating for the reduced number of layers. The recent Dynamic ReLU [6] is a special case of Dynamic Shift-Max ($J = 1$), in which each channel is activated individually.

Connectivity: Dynamic Shift-Max improves connectivity between channel groups. It complements Micro-Factorized pointwise convolution, which focuses on connectivity within each group. Figure 4 shows that even a static group shift $\left(y_{i}=a_{i, 0}\, x_{\frac{C}{G}}(i, 0)+a_{i, 1}\, x_{\frac{C}{G}}(i, 1)\right)$ can effectively increase the rank of the Micro-Factorized pointwise convolution. By inserting it between the two group-adaptive convolutions, the rank of each block in the resulting convolution matrix $W$ (a $G \times G$ block matrix) increases from 1 to 2. Note that the static group shift is a simple special case of Dynamic Shift-Max with $K = 1$, $J = 2$, and static $a_{i,j}^{k}$.

Computational complexity: Dynamic Shift-Max generates $CJK$ parameters $a_{i,j}^{k}(x)$ from the input $x$. Its computational complexity consists of three parts: (a) average pooling, $\mathcal{O}(HWC)$; (b) generating the parameters $a_{i,j}^{k}(x)$ in Eq. 5, $\mathcal{O}(C^{2}JK)$; and (c) applying Dynamic Shift-Max at every channel and every spatial location, $\mathcal{O}(HWCJK)$. It is lightweight when $J$ and $K$ are small. Empirically, $J = 2$ and $K = 2$ achieve a good trade-off.

3.4. Relationship with Previous Work

MicroNet is related to two popular efficient networks (MobileNet [12, 26, 11] and ShuffleNet [42, 24]). It shares the inverted bottleneck structure with MobileNet and the use of group convolution with ShuffleNet. In contrast, MicroNet differs from both in its convolution and its activation function. First, it factorizes the pointwise convolution into group-adaptive convolutions, where the number of groups adapts to the number of channels as $G=\sqrt{C / R}$. Second, it factorizes the depthwise convolution. Finally, it proposes a new activation function (Dynamic Shift-Max) to improve inter-group connectivity and nonlinearity.

4. MicroNet Architecture

We now describe the architectures of four MicroNet models, ranging from 6M to 44M FLOPs. They consist of three types of Micro-Blocks (see Figure 5), which combine Micro-Factorized pointwise and depthwise convolutions in different ways. All of them use Dynamic Shift-Max as the activation function. Details are as follows:

Micro-Block-A: As shown in Figure 5a, Micro-Block-A uses the lite combination of Micro-Factorized pointwise and depthwise convolutions (see Figure 2-right). It is effective at lower layers with higher resolution (e.g., 112×112 or 56×56). Note that the number of channels is expanded by the Micro-Factorized depthwise convolution and compressed by a group-adaptive convolution.

Micro-Block-B: Micro-Block-B is used to connect Micro-Block-A and Micro-Block-C. Unlike Micro-Block-A, it uses the full Micro-Factorized pointwise convolution, which includes two group-adaptive convolutions (see Figure 5b): the former compresses the number of channels and the latter expands it. Each MicroNet has exactly one Micro-Block-B (see Table 1).

Micro-Block-C: Micro-Block-C (see Figure 5c) uses the regular combination that concatenates Micro-Factorized depthwise and pointwise convolutions. It is used at higher layers (see Table 1), because it spends more computation on channel fusion (pointwise) than the lite combination. A skip connection is used when the dimensions match.

Each Micro-Block has four hyperparameters: the kernel size $k$, the number of output channels $C$, the reduction ratio $R$ in the bottleneck of the Micro-Factorized pointwise convolution, and the pair of group numbers of the two group-adaptive convolutions $(G_1, G_2)$. Note that we relax Eq. 2 to $G_1 G_2 = C/R$ and find near-integer solutions.
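
As an illustration of how such a pair might be chosen, the helper below searches for integers $G_1, G_2$ with $G_1 G_2$ close to $C/R$ and both close to $\sqrt{C/R}$; this is our own approximation of the "near-integer solution" step, not the authors' procedure.

```python
# Hypothetical helper for picking a near-integer group pair (G1, G2).
import math

def pick_groups(C: int, R: int):
    target = C / R
    best = None
    for g1 in range(1, int(target) + 1):
        if C % g1:                    # groups should divide the channel count
            continue
        g2 = round(target / g1)
        if g2 < 1 or C % g2:
            continue
        score = abs(g1 * g2 - target) + abs(g1 - g2)  # near C/R, near-square
        if best is None or score < best[0]:
            best = (score, g1, g2)
    return best[1], best[2]

print(pick_groups(C=18, R=2))   # (3, 3), matching G = sqrt(C/R) = 3
```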

Stem layer: We redesign the stem layer to fit the low-FLOP constraint. It consists of a 3×1 convolution and a 1×3 group convolution, followed by ReLU; the second convolution expands the number of channels $R$ times. This saves considerable computation: for example, the stem layer of MicroNet-M3 (see Table 1) costs only 1.5M MAdds.
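
A hedged sketch of such a stem is shown below. The channel counts, strides, and group number are illustrative guesses rather than the exact M0-M3 settings; it only demonstrates a 3×1 convolution followed by a 1×3 group convolution and ReLU.

```python
# Hedged sketch of the redesigned stem layer; numbers are illustrative.
import torch
import torch.nn as nn

class MicroStem(nn.Module):
    def __init__(self, out_ch: int = 8, expand: int = 4, stride: int = 2):
        super().__init__()
        mid = out_ch // expand                      # narrow first stage
        self.conv1 = nn.Conv2d(3, mid, kernel_size=(3, 1), stride=(stride, 1),
                               padding=(1, 0), bias=False)
        # group convolution that expands channels by `expand`
        self.conv2 = nn.Conv2d(mid, out_ch, kernel_size=(1, 3), stride=(1, stride),
                               padding=(0, 1), groups=mid, bias=False)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.conv2(self.conv1(x)))

print(MicroStem()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 8, 112, 112])
```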

Four MicroNet models (M0-M3): We design four models (M0, M1, M2, M3) with different computational costs (6M, 12M, 21M, 44M MAdds). Table 1 shows their full specifications. These networks follow the same pattern from low to high layers: stem layer → Micro-Block-A → Micro-Block-B → Micro-Block-C. Note that all models are designed manually, without neural architecture search (NAS).

5. Experiments: ImageNet Classification

We evaluate the four MicroNet models (M0-M3) and perform comprehensive ablations on ImageNet classification [8]. ImageNet has 1000 classes, with 1,281,167 images for training and 50,000 images for validation.

5.1. Implementation Details

Training strategies: Each model is trained in two ways: (a) standalone and (b) mutual learning. The former is straightforward: the model learns by itself. In the latter, each MicroNet learns jointly with a full-rank counterpart that shares the same network width and depth but replaces the Micro-Factorized pointwise and depthwise convolutions with the original pointwise and depthwise ($k \times k$) convolutions. KL divergence is used to encourage the MicroNet to learn from its full-rank counterpart.
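
A minimal sketch of this mutual-learning objective: the MicroNet's cross-entropy loss is augmented with a KL term toward the counterpart's predictions. The temperature, loss weight, and one-directional (detached) form below are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of a KL-based mutual-learning loss.
import torch
import torch.nn.functional as F

def mutual_learning_loss(student_logits, partner_logits, targets,
                         kl_weight: float = 1.0, T: float = 1.0):
    ce = F.cross_entropy(student_logits, targets)
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(partner_logits.detach() / T, dim=1),
                  reduction="batchmean") * (T * T)
    return ce + kl_weight * kl

logits_s = torch.randn(4, 1000)          # MicroNet outputs
logits_p = torch.randn(4, 1000)          # full-rank counterpart outputs
labels = torch.randint(0, 1000, (4,))
print(mutual_learning_loss(logits_s, logits_p, labels))
```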

Training setup: All models are trained with the SGD optimizer with momentum 0.9. The image resolution is 224×224. We use a mini-batch size of 512 and an initial learning rate of 0.02. Each model is trained for 600 epochs with cosine learning rate decay. For the smaller MicroNets (M0 and M1), the weight decay is 3e-5 and the dropout rate is 0.05. For the larger models (M2 and M3), the weight decay is 4e-5 and the dropout rate is 0.1. Label smoothing (0.1) and Mixup [41] (0.2) are used for MicroNet-M3 to avoid overfitting.
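
For reference, the reported optimizer and schedule map to roughly the following PyTorch setup; the model and the training loop body are placeholders, not the authors' training script.

```python
# Rough mapping of the reported training settings to PyTorch.
import torch

model = torch.nn.Linear(10, 1000)  # placeholder for a MicroNet model
optimizer = torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9,
                            weight_decay=3e-5)   # 4e-5 for M2/M3
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=600)

for epoch in range(600):
    # ... one training pass over ImageNet (batch size 512, 224x224) would go here ...
    scheduler.step()
```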

5.2. Main Results


Table 2 compares MicroNets at four different computational costs with the state of the art on ImageNet classification. MicroNets significantly outperform all prior work under all four FLOP constraints. For example, without mutual learning, MicroNets outperform MobileNetV3 by 9.6%, 9.6%, 6.1%, and 4.4% at 6M, 12M, 21M, and 44M FLOPs, respectively. When trained with mutual learning, all four MicroNets consistently gain about 1.5% top-1 accuracy. Our method achieves 53.0% top-1 accuracy at 6M FLOPs, 3.2% higher than MobileNetV3 with twice the complexity (12M FLOPs). Compared with recent MobileNet and ShuffleNet variants such as GhostNet [10], WeightNet [23], and ButterflyTransform [33], our method achieves more than 5% higher top-1 accuracy at similar FLOPs. This demonstrates that MicroNet handles extremely low FLOPs effectively.

5.3. Ablation Studies

We run a number of ablations to analyze MicroNet. MicroNet-M1 (12M FLOPs) is used for all ablations, and each model is trained for 300 epochs. The default hyperparameters of Dynamic Shift-Max are set to J = 2 and K = 2.

From MobileNet to MicroNet: Table 3 shows the path from MobileNet to our MicroNet; both share the inverted bottleneck structure. Here, we modify MobileNetV2 [26] (without SE [13]) to have similar complexity (10.5M MAdds) and compare three Micro-Factorized Convolution variants (rows 2-4). Micro-Factorized pointwise and depthwise convolutions and their lite combination at lower layers gradually improve top-1 accuracy from 44.9% to 51.7%. Furthermore, using static and dynamic Shift-Max gains an additional 2.7% and 6.8% top-1 accuracy, respectively, at a small extra cost. This shows that the proposed Micro-Factorized Convolution and Dynamic Shift-Max are both effective and complementary in handling extremely low computational cost.

Number of groups G: Micro-Factorized pointwise convolution includes two group-adaptive convolutions, whose number of groups is relaxed from Eq. 2 ($G=\sqrt{C / R}$) so that a nearby integer can be chosen. Table 4a compares it with counterparts that have similar structure and FLOPs (about 10.5M MAdds) but use a fixed number of groups. Group-adaptive convolution achieves higher accuracy, demonstrating a good balance between the number of channels and the input/output connectivity.

Table 4b compares different choices for the number of adaptive groups, controlled by a multiplier $\lambda$ such that $G=\lambda \sqrt{C / R}$. A larger $\lambda$ corresponds to more channels but fewer input/output connections (see Figure 3). A good balance is achieved when $\lambda$ is between 0.5 and 1. When $\lambda$ increases (more channels but less connectivity) or decreases (fewer channels but more connectivity), the top-1 accuracy drops. We therefore use $\lambda = 1$ for the rest of the paper. Note that all models in Table 4b have similar computational cost (about 10.5M MAdds).

Lite combination at different layers: Table 4c compares using the lite combination of Micro-Factorized pointwise and depthwise convolutions (see Figure 2-right) at different layers. Using it only at the lower layers achieves the highest accuracy, validating that the lite combination is more effective at lower layers: compared with the regular combination, it saves computation on channel fusion (pointwise) to afford learning more spatial filters (depthwise).

Comparison with other activation functions: We compare Dynamic Shift-Max with three existing activation functions: ReLU [25], SE+ReLU [13], and Dynamic ReLU [6]. The results are shown in Table 5. Our new Shift-Max clearly outperforms the other three (by 2.5%), demonstrating its superiority. Note that Dynamic ReLU is a special case of Dynamic Shift-Max with J = 1 (see Eq. 5).

Dynamic Shift-Max at different layers: Table 6 shows the top-1 accuracy when Dynamic Shift-Max is used at three different layers within a Micro-Block (see Figure 5). Using it at more layers yields consistent improvements, and the best accuracy is achieved when it is used at all three layers. If only one layer is allowed to use Dynamic Shift-Max, we recommend placing it after the depthwise convolution.

Hyperparameters of Dynamic Shift-Max: Table 7 shows the results of using different combinations of K and J (in Eq. 5). For K = 1, we add a ReLU, since only one element remains inside the max operator. The baseline in the first row (J = 1, K = 1) is equivalent to SE+ReLU [13]. With J = 2 fixed (fusing two groups), taking the maximum of two fusions (K = 2) is better than a single fusion (K = 1). Adding a third fusion does not help, since it is largely covered by the other two while introducing more parameters. With K = 2 fixed (maximum of two fusions), involving more groups (larger J) is consistently better but costs more FLOPs. J = 2 and K = 2 achieve a good trade-off, where the 4.1% gain comes at an extra 1.5M MAdds.

6. MicroNet for Pixel-Level Classification

MicroNet is effective not only for image-level classification but also for pixel-level tasks. In this section, we demonstrate its application to human pose estimation and semantic segmentation.

6.1. Human Pose Estimation

We evaluate MicroNet on single-person keypoint detection using the COCO 2017 dataset [21]. Our models are trained on train2017, which includes 57K images and 150K person instances annotated with 17 keypoints. We evaluate on val2017, which contains 5000 images, and report the average precision (AP) over 10 object keypoint similarity (OKS) thresholds as the metric.

Implementation details: Similar to image classification, we have four MicroNet models (M0-M3) for keypoint detection at different FLOPs. We adapt the models to the keypoint detection task by doubling (×2) the resolution of a selected set of blocks (e.g., all blocks with stride 32); the selection differs across MicroNet models (see Appendix 8.1 for details). Each model has a head containing three Micro-Blocks (one at stride 8 and two at stride 4) and a pointwise convolution that generates the heatmaps of the 17 keypoints. We use bilinear upsampling to increase the resolution in the head and apply spatial attention [6] at each level.

Training setup: We follow the training setup in [28]. Human detection boxes are cropped and resized to 256×192. Data augmentation includes random rotation ([−45°, 45°]), random scaling ([0.65, 1.35]), flipping, and half-body augmentation. All models are trained from scratch for 250 epochs with the Adam optimizer [17]. The initial learning rate is set to 1e-3 and dropped to 1e-4 and 1e-5 at the 210th and 240th epochs, respectively.

Testing: The two-stage top-down paradigm [36, 28] is used for testing: detect person instances and then predict keypoints. We use the same person detector provided by [36]. The heatmaps of the original and flipped images are averaged, and each keypoint is predicted by adjusting the highest-response location with a quarter offset toward the second highest response.
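
A hedged sketch of this decoding step, following the common SimpleBaseline-style practice (argmax plus a quarter-pixel shift toward the larger neighboring response); it is not the paper's exact code.

```python
# Hedged sketch of heatmap decoding with a quarter-pixel offset.
import numpy as np

def decode_heatmap(heatmap: np.ndarray) -> np.ndarray:
    """heatmap: (K, H, W), already averaged over the original and flipped images."""
    K, H, W = heatmap.shape
    coords = np.zeros((K, 2), dtype=np.float32)
    for k in range(K):
        y, x = divmod(int(np.argmax(heatmap[k])), W)
        coords[k] = (x, y)
        # quarter-pixel offset toward the larger neighboring response
        if 0 < x < W - 1:
            coords[k, 0] += 0.25 * np.sign(heatmap[k, y, x + 1] - heatmap[k, y, x - 1])
        if 0 < y < H - 1:
            coords[k, 1] += 0.25 * np.sign(heatmap[k, y + 1, x] - heatmap[k, y - 1, x])
    return coords

print(decode_heatmap(np.random.rand(17, 64, 48)).shape)  # (17, 2)
```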

Main results: Table 8 compares MicroNets with prior work [6, 5] on efficient pose estimation at computational costs below 850 MFLOPs. Both of these works use the MobileNet inverted residual bottleneck block in the backbone and head, and show significant improvements by adapting the parameters of convolutions [5] and activation functions [6] to the input. Our MicroNet-M3 uses only 33% of their FLOPs but achieves comparable performance, demonstrating that our method is also effective for keypoint detection. In addition, MicroNet-M2, M1, and M0 provide good baselines for keypoint detection at even lower computational complexity, ranging from 77M to 163M FLOPs.

6.2. Semantic Segmentation

We conduct experiments on the Cityscapes dataset [7] with fine annotations to evaluate MicroNet for semantic segmentation. Our models are trained on the fine training set, which includes 2,975 images, and evaluated on the validation set, which contains 500 images, using mIoU as the metric.

Implementation details: We modify the four MicroNet models (M0-M3) as backbones by increasing the resolution of all blocks with stride 32 to stride 16, similar to MobileNetV3 [11]. Our models have very low computational cost at 1024×2048 image resolution, ranging from 0.8B to 2.5B FLOPs. For the segmentation head, we follow the Lite Reduced Atrous Spatial Pyramid Pooling (LR-ASPP) design [11]: the feature map is bilinearly upsampled by 2×, spatial attention is applied, and the result is merged with the stride-8 feature map from the backbone. We replace the 1×1 convolutions with Micro-Factorized Convolution to make LR-ASPP even lighter, and call it Micro-Reduced ASPP (MR-ASPP).

Training setup: All models are randomly initialized and trained for 240 epochs. The initial learning rate is set to 0.2 and decays to 1e-4 with a cosine schedule. The weight decay is set to 4e-5. We use the data augmentation in [4].

Main results: Table 9 reports the mIoU of all four MicroNets. Compared with MobileNetV3 (68.4 mIoU at 2.90B MAdds), our MicroNet-M3 is more accurate (69.1 mIoU) at a lower computational cost (2.52B MAdds). This demonstrates the advantage of our method for semantic segmentation. In addition, our MicroNet-M2, M1, and M0 provide good baselines for semantic segmentation at even lower FLOPs, from 1.75B down to 0.81B MAdds.

7. Conclusion

In this paper, we presented MicroNet to handle extremely low computational cost. It builds on two proposed operators: Micro-Factorized Convolution and Dynamic Shift-Max. The former approximates pointwise and depthwise convolutions while balancing the number of channels against the input/output connectivity. The latter dynamically fuses consecutive channel groups, strengthening both node connectivity and nonlinearity to compensate for the reduced depth. A family of MicroNets achieves solid improvements at very low FLOPs on three tasks: image classification, human pose estimation, and semantic segmentation. We hope this work provides good baselines for efficient CNNs across multiple vision tasks.
