MicroNet: Image Recognition with Extremely Low FLOPs
2022-06-09 10:21:00 【AI Hao】
Abstract

In this paper, we present MicroNet, an efficient convolutional neural network with extremely low computational cost (e.g., 6 MFLOPs for ImageNet classification). Such low-cost networks are highly desirable on edge devices but usually suffer significant performance degradation. We handle extremely low FLOPs based on two design principles: (a) avoid reducing the network width by lowering node connectivity, and (b) compensate for the reduced network depth by introducing more complex nonlinearity in each layer. First, we propose Micro-Factorized convolution, which factorizes pointwise and depthwise convolutions into low-rank matrices to achieve a good trade-off between the number of channels and input/output connectivity. Second, we propose a new activation function, named Dynamic Shift-Max, which improves nonlinearity by maximizing over multiple dynamic fusions between an input feature map and its circular channel shifts. The fusions are dynamic because their parameters adapt to the input. Building on Micro-Factorized convolution and Dynamic Shift-Max, a family of MicroNets achieves significant performance gains over the state of the art in the low-FLOP regime. For example, MicroNet-M1 achieves 61.1% top-1 accuracy on ImageNet classification with 12 MFLOPs, outperforming MobileNetV3 by 11.3%.
1. Introduction
Recently, designing efficient CNN architectures [15, 12, 26, 11, 42, 24, 29] has been an active research area, enabling high-quality services on edge devices. However, when the computational cost becomes extremely low, even the most advanced efficient CNNs (e.g., MobileNetV3 [11]) suffer significant performance degradation. For example, when MobileNetV3 is constrained from 112M to 12M MAdds for image classification [8] at 224×224 resolution, its top-1 accuracy drops from 71.7% to 49.8%. This makes adoption on low-power devices (e.g., IoT devices) even harder. In this paper, we tackle an even more challenging problem by cutting the budget in half: can we perform image classification over 1,000 categories at 224×224 resolution under 6 MFLOPs?
Such an extremely low computational budget (6M FLOPs) requires a careful redesign of every layer. For example, even a thin stem layer containing a single 3×3 convolution with 3 input and 8 output channels on a 112×112 grid (stride 2) already costs 2.7M MAdds. The resources left for the remaining convolutions and the 1000-class classifier are too limited to learn a good representation. A naive strategy for adapting existing efficient CNNs (e.g., MobileNet [12, 26, 11] and ShuffleNet [42, 24]) to such a low budget is to aggressively reduce the network width or depth, which leads to severe performance degradation.
We propose a new architecture, named MicroNet, to handle extremely low FLOPs. It is based on two design principles:
- Avoid the reduction of network width by lowering node connectivity.
- Compensate for the reduction of network depth by improving the nonlinearity of each layer.
These principles guide the design of more efficient convolutions and activation functions.
First, we propose Micro-Factorized convolution, which factorizes both pointwise and depthwise convolutions into low-rank matrices, achieving a good balance between the number of channels and input/output connectivity. Specifically, we factorize a pointwise convolution into group-adaptive convolutions, where the number of groups adapts to the number of channels via a square-root relationship. Stacking two group-adaptive convolutions essentially approximates the pointwise convolution matrix by a block matrix in which each block has rank 1. The (rank-1) factorization of a depthwise convolution is straightforward: a k×k depthwise convolution is factorized into a 1×k and a k×1 depthwise convolution. We show that a proper combination of these two approximations at different layers significantly reduces the computational cost without sacrificing the number of channels.
Second, we propose a new activation function, named Dynamic Shift-Max, which improves nonlinearity in two ways: (a) it takes the maximum over multiple fusions between an input feature map and its circular channel shifts, and (b) each fusion is dynamic, as its parameters adapt to the input. Moreover, it effectively strengthens both node connectivity and nonlinearity at low computational cost.
Experimental results show that MicroNet outperforms the state of the art by a clear margin (see Figure 1). For example, compared with MobileNetV3, our method improves top-1 accuracy on ImageNet classification by 11.3% and 7.7% under the 12M and 21M FLOP constraints, respectively. Under the challenging 6 MFLOP constraint, our method achieves 53.0% top-1 accuracy, 3.2% higher than MobileNetV3 at double the complexity (12 MFLOPs). In addition, the family of MicroNets provides strong baselines at very low computational cost for two pixel-level tasks: semantic segmentation and keypoint detection.
2. Related Work
Efficient CNNs: MobileNets [12, 26, 11] decompose a k×k convolution into a depthwise and a pointwise convolution. ShuffleNets [42, 24] use group convolution and channel shuffle to simplify the pointwise convolution. [33] approximates the pointwise convolution with a butterfly transform. EfficientNet [29, 31] finds a proper relationship between input resolution and network width/depth. MixNet [30] mixes multiple kernel sizes in a single convolution. AdderNet [2] trades massive multiplications for much cheaper additions. GhostNet [10] generates ghost feature maps with cheap linear transformations. Sandglass [43] flips the structure of the inverted residual block to reduce information loss. [39] and [1] train one network to support multiple sub-networks.
Dynamic neural networks: Dynamic networks improve representation power by adapting parameters to the input. HyperNet [9] uses another network to generate parameters for the main network. SENet [13] reweights channels by squeezing the global context. SKNet [18] adapts attention over kernels of different sizes. Dynamic convolution [37, 5] aggregates multiple convolution kernels based on attention. Dynamic ReLU [6] adapts the slopes and intercepts of the two linear functions in ReLU [25, 16]. [23] generates convolution weights directly with grouped fully connected layers. [3] extends dynamic convolution from spatially agnostic to spatially specific. [27] proposes dynamic group convolution that adaptively groups input channels. [32] applies dynamic convolution to instance segmentation. [19] learns dynamic routing across scales for semantic segmentation.
3. Our Approach: MicroNet
We now describe the design principles and key components of MicroNet in detail.
3.1. Design Principles
Extremely low FLOPs limit both the network width (number of channels) and the network depth (number of layers); we analyze the two separately. If we view a convolution layer as a graph, with inputs and outputs as nodes and the connections (edges) between them weighted by kernel parameters, we define connectivity as the number of connections per output node. The total number of connections then equals the number of output channels times the connectivity. For example, a pointwise convolution with C input and C output channels has connectivity C and C² connections, while splitting it into G groups lowers the connectivity to C/G. When the computational cost (proportional to the number of connections) is fixed, the number of channels conflicts with connectivity. We believe that a good balance between the two effectively avoids channel reduction and improves the representation power of a layer. Hence, our first design principle is: avoid the reduction of network width by lowering node connectivity. We achieve this by factorizing the pointwise and depthwise convolutions at a finer scale.
When the network depth (number of layers) is significantly reduced, its nonlinearity (encoded in ReLU) is bounded, leading to significant performance degradation. This motivates our second design principle: compensate for the reduction of network depth by improving the nonlinearity of each layer. We design a new activation function, Dynamic Shift-Max, to achieve this.
3.2. Micro-Factorized Convolution
We factorize the pointwise and depthwise convolutions at a finer scale, from which Micro-Factorized convolution gets its name. The goal is to balance the number of channels against input/output connectivity.
Micro-Factorized pointwise convolution: We propose group-adaptive convolutions to factorize a pointwise convolution. For brevity, we assume the convolution kernel $W$ has the same number of input and output channels ($C_{in} = C_{out} = C$) and ignore the bias. The kernel matrix $W$ is factorized into two group-adaptive convolutions, where the number of groups $G$ depends on the number of channels $C$. Mathematically, this is expressed as:
$$\boldsymbol{W}=\boldsymbol{P} \boldsymbol{\Phi} \boldsymbol{Q}^{T} \tag{1}$$
where $W$ is a $C \times C$ matrix. $Q$ has shape $C \times \frac{C}{R}$ and compresses the number of channels by a ratio $R$. $P$ has shape $C \times \frac{C}{R}$ and expands the number of channels back to $C$ for the output. Both $P$ and $Q$ are block-diagonal matrices with $G$ blocks, each block corresponding to the convolution of one group. $\boldsymbol{\Phi}$ is a $\frac{C}{R} \times \frac{C}{R}$ permutation matrix, similar to the channel shuffle in [42]. The computational complexity is $\mathcal{O}=\frac{2 C^{2}}{R G}$. Figure 2-left shows an example with C = 18, R = 2, and G = 3.
Note that the number of groups $G$ is not fixed; it adapts to the number of channels $C$ and the reduction ratio $R$ as:
$$G=\sqrt{C / R} \tag{2}$$
This square-root relationship is derived from balancing the number of channels $C$ against input/output connectivity. Here, we define the connectivity $E$ as the number of input/output connections per output channel. Each output channel connects to $\frac{C}{RG}$ hidden channels between the two group-adaptive convolutions, and each hidden channel connects to $\frac{C}{G}$ input channels. Therefore $E = \frac{C^{2}}{R G^{2}}$. When we fix the computational complexity $\mathcal{O}=\frac{2 C^{2}}{R G}$ and the reduction ratio $R$, the number of channels $C$ and the connectivity $E$ change in opposite directions with $G$:
$$C=\sqrt{\frac{\mathcal{O} R G}{2}}, \quad E=\frac{\mathcal{O}}{2 G} \tag{3}$$
This is shown in Figure 3. As the number of groups $G$ increases, $C$ increases but $E$ decreases. The two curves intersect ($C = E$) at $G = \sqrt{C/R}$, where each output channel connects to all input channels exactly once. Mathematically, the resulting convolution matrix $W$ is divided into $G \times G$ blocks, each of rank 1 (see Figure 2-left).
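To make the factorization concrete, below is a minimal PyTorch sketch of Micro-Factorized pointwise convolution (our own illustration, not the authors' released code): a group convolution plays the role of $Q^T$ (compression), a channel shuffle implements the permutation $\Phi$, and a second group convolution plays the role of $P$ (expansion).

```python
import torch
import torch.nn as nn

class MicroFactorizedPointwise(nn.Module):
    """Sketch of W ≈ P Φ Qᵀ as two group convolutions with a channel
    shuffle in between. Class and variable names are our own."""
    def __init__(self, channels, reduction=2):
        super().__init__()
        hidden = channels // reduction                      # C/R hidden channels
        groups = int(round((channels / reduction) ** 0.5))  # G = sqrt(C/R), Eq. (2)
        assert channels % groups == 0 and hidden % groups == 0
        # Qᵀ: compress C -> C/R with G groups (block-diagonal Q)
        self.compress = nn.Conv2d(channels, hidden, 1, groups=groups, bias=False)
        # P: expand C/R -> C with G groups (block-diagonal P)
        self.expand = nn.Conv2d(hidden, channels, 1, groups=groups, bias=False)
        self.groups = groups

    def forward(self, x):
        x = self.compress(x)
        # Φ: channel shuffle so every output channel reaches all groups
        b, c, h, w = x.shape
        x = x.view(b, self.groups, c // self.groups, h, w)
        x = x.transpose(1, 2).reshape(b, c, h, w)
        return self.expand(x)

# e.g. MicroFactorizedPointwise(18, reduction=2) yields G = 3,
# matching the C = 18, R = 2, G = 3 example of Figure 2-left.
```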
Micro-Factorized depthwise convolution: As shown in Figure 2-middle, we factorize a $k \times k$ depthwise convolution kernel into a $k \times 1$ kernel and a $1 \times k$ kernel. This has the same mathematical form as Micro-Factorized pointwise convolution (Eq. 1): the $k \times k$ kernel matrix of each channel is factorized into a $k \times 1$ vector $P$ and a $1 \times k$ vector $Q^{T}$, while $\boldsymbol{\Phi}$ is a scalar of value 1. This low-rank approximation reduces the computational complexity from $\mathcal{O}(k^{2}C)$ to $\mathcal{O}(kC)$.
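A matching sketch for the depthwise factorization, under the same caveat that the module name and defaults are ours:

```python
import torch.nn as nn

class MicroFactorizedDepthwise(nn.Module):
    """Sketch: a k×k depthwise convolution factorized into a k×1 and a 1×k
    depthwise convolution, cutting the cost from O(k²C) to O(kC)."""
    def __init__(self, channels, k=3):
        super().__init__()
        self.vert = nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0),
                              groups=channels, bias=False)  # k×1 kernel (vector P)
        self.horz = nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2),
                              groups=channels, bias=False)  # 1×k kernel (vector Qᵀ)

    def forward(self, x):
        return self.horz(self.vert(x))
```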
Combining Micro-Factorized pointwise and depthwise convolutions: We combine the two in two different ways: (a) regular combination and (b) lite combination. The former simply concatenates the two convolutions. The lite combination uses the Micro-Factorized depthwise convolution to expand the number of channels, applying multiple spatial filters to each channel; it then applies a single group-adaptive convolution to fuse and compress the channels (see Figure 2-right). Compared with its regular counterpart, the lite combination is more effective at lower levels, since it saves computation on channel fusion (pointwise) to compensate for learning more spatial filters (depthwise); a sketch follows below.
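A hedged sketch of the lite combination, with illustrative expansion ratio and group count (the paper's exact per-block values live in Table 1):

```python
import torch.nn as nn

class LiteCombination(nn.Module):
    """Sketch of the lite combination: the factorized depthwise pair expands
    the channels (several spatial filters per channel), then one
    group-adaptive pointwise convolution fuses and compresses them."""
    def __init__(self, channels=16, expand=2, k=3, groups=4):
        super().__init__()
        hidden = channels * expand
        self.dw = nn.Sequential(
            nn.Conv2d(channels, hidden, (k, 1), padding=(k // 2, 0),
                      groups=channels, bias=False),  # k×1, expands per channel
            nn.Conv2d(hidden, hidden, (1, k), padding=(0, k // 2),
                      groups=hidden, bias=False),    # 1×k, depthwise
        )
        # single group-adaptive convolution: fuse and compress back to C
        self.fuse = nn.Conv2d(hidden, channels, 1, groups=groups, bias=False)

    def forward(self, x):
        return self.fuse(self.dw(x))
```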
3.3. Dynamic Shift-Max
We now present Dynamic Shift-Max, a new activation function that strengthens nonlinearity. It dynamically fuses the input feature map with its circular group shifts, where channels are shifted by groups. Dynamic Shift-Max also strengthens the connections between groups, complementing Micro-Factorized pointwise convolution, which focuses on connections within each group.
Definition: Let $x = \{x_{i}\}$ $(i = 1, \ldots, C)$ denote an input vector (or tensor) whose $C$ channels are divided into $G$ groups, each with $\frac{C}{G}$ channels. Its $N$-channel circular shift can be expressed as $x_{N}(i)=x_{(i+N) \bmod C}$. We define the group circular shift function as:
$$x_{\frac{C}{G}}(i, j)=x_{\left(i+j \frac{C}{G}\right) \bmod C}, \quad j=0, \ldots, G-1 \tag{4}$$
where $x_{\frac{C}{G}}(i,j)$ corresponds to shifting channel $i$ by $j$ groups. Dynamic Shift-Max combines $J$ group shifts as follows:
$$y_{i}=\max _{1 \leq k \leq K}\left\{\sum_{j=0}^{J-1} a_{i, j}^{k}(\boldsymbol{x})\, x_{\frac{C}{G}}(i, j)\right\} \tag{5}$$
where the parameters $a_{i,j}^{k}(x)$ adapt to the input $x$ via a hyper-function, which is easily implemented as average pooling followed by two fully connected layers, similar to Squeeze-and-Excitation [13].
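The definition translates almost directly into code. Below is a minimal PyTorch sketch of Dynamic Shift-Max for the default J = 2, K = 2; it is our reading of Eq. 4 and Eq. 5, with the hyper-function realized as average pooling plus two FC layers, and the squeeze ratio is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicShiftMax(nn.Module):
    """Sketch of Eq. (5): max over K dynamic fusions of J group-shifted
    copies of the input. Coefficients a_{i,j}^k(x) come from an
    SE-style hyper-function (avg-pool + two FC layers)."""
    def __init__(self, channels, groups, J=2, K=2, squeeze=4):
        super().__init__()
        self.G, self.J, self.K = groups, J, K
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // squeeze),
            nn.ReLU(inplace=True),
            nn.Linear(channels // squeeze, channels * J * K),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        # hyper-function: a_{i,j}^k(x), C*J*K values per sample
        a = self.fc(F.adaptive_avg_pool2d(x, 1).flatten(1))
        a = a.view(b, c, self.J, self.K, 1, 1)
        shift = c // self.G
        fused = torch.zeros(b, c, self.K, h, w, device=x.device, dtype=x.dtype)
        for j in range(self.J):
            # group circular shift x_{C/G}(i, j) = x_{(i + j*C/G) mod C}, Eq. (4)
            xj = torch.roll(x, shifts=-j * shift, dims=1)
            fused = fused + a[:, :, j] * xj.unsqueeze(2)
        return fused.max(dim=2).values  # max over the K fusions
```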
Nonlinearity: Dynamic Shift-Max provides two forms of nonlinearity: (a) it outputs the maximum of $K$ fusions of $J$ groups, and (b) the parameters $a_{i,j}^{k}(x)$ are not static but functions of the input $x$. These give Dynamic Shift-Max stronger representation power to compensate for the reduced number of layers. The recent Dynamic ReLU [6] is a special case of Dynamic Shift-Max ($J = 1$), in which each channel is activated separately.
Connectivity: Dynamic Shift-Max improves the connectivity between channel groups, complementing Micro-Factorized pointwise convolution, which focuses on connectivity within each group. Figure 4 shows that even a static group shift $\left(y_{i}=a_{i, 0}\, x_{\frac{C}{G}}(i, 0)+a_{i, 1}\, x_{\frac{C}{G}}(i, 1)\right)$ effectively increases the rank of Micro-Factorized pointwise convolution: inserting it between the two group-adaptive convolutions raises the rank of each block in the resulting convolution matrix $W$ (a $G \times G$ block matrix) from 1 to 2. Note that the static group shift is a simple special case of Dynamic Shift-Max with $K = 1$, $J = 2$, and static $a_{i,j}^{k}$.
Computational complexity: Dynamic Shift-Max generates $CJK$ parameters $a_{i,j}^{k}(x)$ from the input $x$. Its computational complexity has three parts: (a) average pooling, $\mathcal{O}(HWC)$; (b) generating the parameters $a_{i,j}^{k}(x)$ of Eq. 5, $\mathcal{O}(C^{2}JK)$; and (c) applying Dynamic Shift-Max at every channel and spatial location, $\mathcal{O}(HWCJK)$. It is lightweight when $J$ and $K$ are small. Empirically, $J = 2$ and $K = 2$ achieve a good trade-off.
3.4. Relation to Prior Work
MicroNet is related to two popular efficient networks (MobileNet [12, 26, 11] and ShuffleNet [42, 24]). It shares the inverted bottleneck structure with MobileNet and the use of group convolution with ShuffleNet. In contrast, MicroNet differs from both in its convolutions and activation function. First, it factorizes the pointwise convolution into group-adaptive convolutions, where the number of groups adapts to the number of channels as $G=\sqrt{C / R}$. Second, it factorizes the depthwise convolution. Finally, it proposes a new activation function (Dynamic Shift-Max) to improve cross-channel connectivity and nonlinearity.
4. MicroNet Architecture
We now describe the architectures of the four MicroNet models, which range from 6M to 44M FLOPs. They are composed of three types of Micro-Blocks (see Figure 5) that combine Micro-Factorized pointwise and depthwise convolutions in different ways. All of them use Dynamic Shift-Max as the activation function. The details are as follows:
Micro-Block-A: As shown in Figure 5a, Micro-Block-A uses the lite combination of Micro-Factorized pointwise and depthwise convolutions (see Figure 2-right). It is effective at lower levels with higher resolution (e.g., 112×112 or 56×56). Note that the number of channels is expanded by the Micro-Factorized depthwise convolution and compressed by a group-adaptive convolution.
Micro-Block-B: Micro-Block-B is used to connect Micro-Block-A and Micro-Block-C. Unlike Micro-Block-A, it uses the full Micro-Factorized pointwise convolution, which includes two group-adaptive convolutions (see Figure 5b): the former compresses the number of channels and the latter expands it. Each MicroNet has only one Micro-Block-B (see Table 1).
Micro-Block-C: Micro-Block-C (see Figure 5c) uses the regular combination, concatenating Micro-Factorized depthwise and pointwise convolutions. It is used at higher levels (see Table 1), since it spends more computation on channel fusion (pointwise) than the lite combination does. A skip connection is used when the dimensions match.
Each Micro-Block has four hyperparameters: the kernel size k, the number of output channels C, the reduction ratio R in the bottleneck of the Micro-Factorized pointwise convolution, and the pair of group numbers (G1, G2) of its two group-adaptive convolutions. Note that we relax Eq. 2 to $G_1 G_2 = C / R$ and find near-integer solutions, as in the sketch below.
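For illustration, one simple way to realize the relaxed relation; the rounding heuristic here is our assumption, not the paper's exact recipe:

```python
import math

def pick_group_numbers(channels, reduction):
    """Sketch: relax Eq. (2) to G1 * G2 = C / R and return a near-integer
    pair of group numbers."""
    target = channels / reduction       # G1 * G2 should be close to C/R
    g1 = max(1, round(math.sqrt(target)))
    g2 = max(1, round(target / g1))
    return g1, g2

# e.g. C = 18, R = 2 -> target 9 -> (3, 3), i.e. G1 = G2 = sqrt(C/R)
```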
Stem layer: We redesign the stem layer to meet the low-FLOP constraint. It consists of a 3×1 convolution and a 1×3 group convolution, followed by a ReLU; the second convolution expands the number of channels by a factor of R. This saves computation significantly: for example, the stem layer of MicroNet-M3 (see Table 1) costs only 1.5M MAdds.
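A hedged sketch of this factorized stem; the intermediate channel count, group number, and strides are illustrative rather than the exact Table 1 values:

```python
import torch.nn as nn

class StemLayer(nn.Module):
    """Sketch of the redesigned stem: a 3×1 convolution followed by a 1×3
    group convolution that expands the channels, then a ReLU."""
    def __init__(self, in_ch=3, mid_ch=4, out_ch=8, groups=2, stride=2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, mid_ch, (3, 1), stride=(stride, 1),
                               padding=(1, 0), bias=False)
        self.conv2 = nn.Conv2d(mid_ch, out_ch, (1, 3), stride=(1, stride),
                               padding=(0, 1), groups=groups, bias=False)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # 224×224 -> 112×224 -> 112×112 with the default strides
        return self.act(self.conv2(self.conv1(x)))
```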
Four MicroNet models (M0-M3): We design four models (M0, M1, M2, M3) with different computational costs (6M, 12M, 21M, and 44M MAdds). Table 1 shows their full specifications. All networks follow the same pattern from low to high levels: stem layer → Micro-Block-A → Micro-Block-B → Micro-Block-C. Note that all models are designed manually, without neural architecture search (NAS).
5. Experiments: ImageNet Classification
We evaluate the four MicroNet models (M0-M3) on ImageNet [8] classification and report comprehensive ablations. ImageNet has 1000 classes, with 1,281,167 images for training and 50,000 images for validation.
5.1. Implementation Details
Training strategies: Each model is trained in two ways: (a) standalone and (b) mutual learning. The former is straightforward: the model learns by itself. In the latter, each MicroNet learns jointly with a full-rank counterpart that shares the same network width/depth but replaces the Micro-Factorized pointwise and depthwise convolutions with the original pointwise and depthwise (k×k) convolutions. A KL-divergence term encourages the MicroNet to learn from its full-rank counterpart.
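A minimal sketch of such a mutual-learning objective; the loss weighting and temperature are our assumptions, as the paper only states that a KL-divergence term is used:

```python
import torch.nn.functional as F

def mutual_learning_loss(student_logits, teacher_logits, labels,
                         alpha=1.0, temperature=1.0):
    """Sketch: cross-entropy on the labels plus KL divergence toward the
    full-rank counterpart's predictions."""
    t = temperature
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(F.log_softmax(student_logits / t, dim=1),
                  F.softmax(teacher_logits / t, dim=1),
                  reduction="batchmean") * (t * t)
    return ce + alpha * kl
```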
Training setup: All models are trained with the SGD optimizer with momentum 0.9, at an image resolution of 224×224. We use a mini-batch size of 512 and a learning rate of 0.02. Each model is trained for 600 epochs with cosine learning rate decay. For the smaller MicroNets (M0 and M1), the weight decay is 3e-5 and the dropout rate is 0.05; for the larger models (M2 and M3), the weight decay is 4e-5 and the dropout rate is 0.1. Label smoothing (0.1) and Mixup [41] (0.2) are used for MicroNet-M3 to avoid overfitting.
5.2. Main Results

Table 2 compares MicroNets at four computational costs with the state of the art on ImageNet classification. MicroNets significantly outperform all previous work under all four FLOP constraints. For example, without mutual learning, MicroNets outperform MobileNetV3 by 9.6%, 9.6%, 6.1%, and 4.4% at 6M, 12M, 21M, and 44M FLOPs, respectively. With mutual learning, all four MicroNets consistently gain about 1.5% top-1 accuracy. Our method achieves 53.0% top-1 accuracy at 6M FLOPs, 3.2% higher than MobileNetV3 at double the complexity (12M FLOPs). Compared with recent MobileNet and ShuffleNet variants such as GhostNet [10], WeightNet [23], and ButterflyTransform [33], our method achieves more than 5% higher top-1 accuracy at similar FLOPs. This demonstrates that MicroNet handles extremely low FLOPs effectively.
5.3. Ablation Studies
We run extensive ablations to analyze MicroNet. MicroNet-M1 (12M FLOPs) is used for all ablations, and each model is trained for 300 epochs. The default hyperparameters of Dynamic Shift-Max are J = 2 and K = 2.
From MobileNet to MicroNet: Table 3 shows the path from MobileNet to our MicroNet; both share the inverted bottleneck structure. Here, we modify MobileNetV2 [26] (without SE [13]) so that it has similar complexity (10.5M MAdds) and compare three Micro-Factorized convolution variants (rows 2-4). Micro-Factorized pointwise and depthwise convolutions and their lite combination at the lower levels gradually improve top-1 accuracy from 44.9% to 51.7%. Furthermore, static and dynamic Shift-Max add an extra 2.7% and 6.8% top-1 accuracy, respectively, at a small additional cost. This shows that the proposed Micro-Factorized convolution and Dynamic Shift-Max are effective and complementary in handling extremely low computational cost.
Group number G: Micro-Factorized pointwise convolution includes two group-adaptive convolutions whose group number is relaxed from $G=\sqrt{C/R}$ to a nearby integer. Table 4a compares it with counterparts of similar architecture and FLOPs (about 10.5M MAdds) that use fixed group numbers. Group-adaptive convolution achieves higher accuracy, validating the good balance between the number of channels and input/output connectivity.
Table 4b compares different choices of the adaptive group number, controlled by a multiplier λ such that $G=\lambda\sqrt{C/R}$. Larger λ corresponds to more channels but fewer input/output connections (see Figure 3). A good balance is achieved when λ is between 0.5 and 1: top-1 accuracy drops when λ increases (more channels but less connectivity) or decreases (fewer channels but more connectivity). We therefore use λ = 1 for the rest of the paper. Note that all models in Table 4b have similar computational cost (about 10.5M MAdds).
Lite combination at different levels: Table 4c compares using the lite combination of Micro-Factorized pointwise and depthwise convolutions (see Figure 2-right) at different levels. Using it only at the lower levels achieves the highest accuracy, verifying that the lite combination is more effective at lower levels: compared with the regular combination, it saves computation on channel fusion (pointwise) to compensate for learning more spatial filters (depthwise).
Comparison with other activation functions: We compare Dynamic Shift-Max with three existing activation functions: ReLU [25], SE+ReLU [13], and Dynamic ReLU [6]. The results are shown in Table 5. Our Dynamic Shift-Max clearly outperforms the other three (by 2.5%), demonstrating its superiority. Note that Dynamic ReLU is a special case of Dynamic Shift-Max with J = 1 (see Eq. 5).
Dynamic Shift-Max in different layers: Table 6 shows the top-1 accuracy of using Dynamic Shift-Max in three different layers within a Micro-Block (see Figure 5). Using it in more layers yields consistent improvement, and the best accuracy is achieved when it is used in all three layers. If only one layer is allowed to use Dynamic Shift-Max, we suggest placing it after the depthwise convolution.
Hyperparameters of Dynamic Shift-Max: Table 7 shows the results of different combinations of K and J (in Eq. 5). We add a ReLU for K = 1, since only one element is left in the max operator. The baseline in the first row (J = 1, K = 1) is equivalent to SE+ReLU [13]. With J = 2 fixed (fusing two groups), the maximum of two fusions (K = 2) outperforms a single fusion (K = 1). Adding a third fusion does not help, as it is largely covered by the other two fusions yet involves more parameters. With K = 2 fixed (maximum of two fusions), involving more groups (larger J) is consistently better but costs more FLOPs. A good trade-off is achieved at J = 2 and K = 2, where a 4.1% gain is achieved with an extra 1.5M MAdds.
6. MicroNet for Pixel-Level Classification
MicroNet is effective not only for image-level classification but also for pixel-level tasks. In this section, we demonstrate its application to human pose estimation and semantic segmentation.
6.1. Human Pose Estimation
We evaluate MicroNet on single-person keypoint detection using the COCO 2017 dataset [21]. Our models are trained on train2017, which includes 57K images and 150K person instances labeled with 17 keypoints. We evaluate on val2017, which contains 5000 images, and use the average precision (AP) over 10 object keypoint similarity (OKS) thresholds as the metric.
Implementation details: Similar to image classification, we use the four MicroNet models (M0-M3) for keypoint detection at different FLOPs. Each model is adapted to the keypoint detection task by doubling (×2) the resolution of a set of selected blocks (e.g., all blocks with stride 32); the selection differs across MicroNet models (see Appendix 8.1 for details). Each model has a head consisting of three Micro-Blocks (one at stride 8, two at stride 4) and a pointwise convolution that generates heatmaps for the 17 keypoints. We use bilinear upsampling to increase the resolution in the head and apply spatial attention [6] at each level.
Training setup: We use the training setup of [28]. Human detection boxes are cropped and resized to 256×192. Data augmentation includes random rotation ([−45°, 45°]), random scaling ([0.65, 1.35]), flipping, and half-body augmentation. All models are trained from scratch for 250 epochs with the Adam optimizer [17]. The initial learning rate is set to 1e-3 and is dropped to 1e-4 and 1e-5 at epochs 210 and 240, respectively.
Testing: A two-stage top-down paradigm [36, 28] is used for testing: person instances are detected first and their keypoints are then predicted. We use the same person detectors as provided in [36]. The heatmaps of the original and the flipped image are averaged, and each keypoint is predicted by adjusting the highest-response location with a quarter offset toward the second-highest response.
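A sketch of the quarter-offset decoding rule for a single keypoint heatmap, our illustration of the standard post-processing in [28, 36]:

```python
import numpy as np

def decode_keypoint(heatmap):
    """Sketch: take the peak location and shift it a quarter pixel toward
    the higher of its two neighbors along each axis."""
    h, w = heatmap.shape
    y, x = np.unravel_index(np.argmax(heatmap), (h, w))
    px, py = float(x), float(y)
    if 0 < x < w - 1:
        px += 0.25 * np.sign(heatmap[y, x + 1] - heatmap[y, x - 1])
    if 0 < y < h - 1:
        py += 0.25 * np.sign(heatmap[y + 1, x] - heatmap[y - 1, x])
    return px, py
```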
Main results: Table 8 compares MicroNets with previous work [6, 5] on efficient pose estimation under 850 MFLOPs. Both of these works use MobileNet inverted residual bottleneck blocks in the backbone and head, and show significant improvements by adapting the parameters of the convolutions [5] and the activation functions [6] to the input. Our MicroNet-M3 uses only 33% of their FLOPs yet achieves comparable performance, demonstrating that our method is also effective for keypoint detection. Furthermore, MicroNet-M2, M1, and M0 provide strong baselines for keypoint detection at even lower computational complexity, from 77M to 163M FLOPs.
6.2. Semantic Segmentation
We conduct experiments on the Cityscapes dataset [7] with fine annotations to evaluate MicroNet on semantic segmentation. Our models are trained on the fine training set, which includes 2,975 images, and evaluated on the validation set of 500 images, using mIoU as the metric.
Implementation details: We modify the four MicroNet models (M0-M3) as backbones by raising the resolution of all blocks with stride 32 to stride 16, similar to MobileNetV3 [11]. Our models have very low computational cost at 1024×2048 image resolution, ranging from 0.8B to 2.5B FLOPs. For the segmentation head, we follow the Lite Reduced Atrous Spatial Pyramid Pooling (LR-ASPP) design [11]: the feature map is bilinearly upsampled 2×, spatial attention is applied, and the result is merged with the stride-8 feature map from the backbone. We replace the 1×1 convolutions with Micro-Factorized convolutions to make LR-ASPP even lighter, and call the result Micro-Reduced ASPP (MR-ASPP).
Training setup: All models are randomly initialized and trained for 240 epochs. The initial learning rate is set to 0.2 and decays to 1e-4 following a cosine schedule. Weight decay is set to 4e-5. We use the data augmentation of [4].
Main results: Table 9 reports the mIoU of all four MicroNets. Compared with MobileNetV3 (68.4 mIoU, 2.90B MAdds), our MicroNet-M3 is more accurate (69.1 mIoU) at a lower computational cost (2.52B MAdds), demonstrating the superiority of our method on semantic segmentation. Furthermore, MicroNet-M2, M1, and M0 provide strong baselines for semantic segmentation at even lower FLOPs, from 0.81B to 1.75B MAdds.
7. Conclusion
In this paper, we proposed MicroNet to handle extremely low computational cost. It builds on two proposed operators: Micro-Factorized convolution and Dynamic Shift-Max. The former balances the number of channels against input/output connectivity when approximating pointwise and depthwise convolutions. The latter dynamically fuses consecutive channel groups, strengthening both node connectivity and nonlinearity to compensate for the reduction in depth. A family of MicroNets achieves solid improvements at very low FLOPs on three tasks: image classification, human pose estimation, and semantic segmentation. We hope this work provides good baselines for efficient CNNs across multiple vision tasks.