Squeeze-and-Excitation Networks learning notes
2022-06-09 18:22:00 【Fried dough twist】
Squeeze-and-Excitation Networks Learning notes
Squeeze-and-Excitation Networks
Abstract
Convolutional neural networks (CNNs) are built upon the convolution operator, which enables the network to construct informative features by fusing spatial and channel-wise information within each local receptive field. A large body of prior work has investigated the spatial component of this relationship, seeking to strengthen the representational power of a CNN by improving the quality of spatial encodings throughout its feature hierarchy. In this work, we focus instead on the channel relationship and propose a novel architectural unit, which we term the "Squeeze-and-Excitation" (SE) block, that explicitly models interdependencies between channels and adaptively recalibrates channel-wise feature responses. We show that these blocks can be stacked together to form SENet architectures that generalise effectively across different datasets. We further demonstrate that SE blocks bring significant improvements in the performance of existing state-of-the-art CNNs at a slight additional computational cost. Squeeze-and-Excitation networks formed the foundation of our ILSVRC 2017 classification submission, which won first place and reduced the top-5 error to 2.251%, a relative improvement of ~25% over the winning entry of 2016. Models and code are available at https://github.com/hujie-frank/SENet.
**Keywords:** Squeeze-and-Excitation, Image representations, Attention, Convolutional Neural Networks.
1 INTRODUCTION
Convolutional neural networks (CNNs) have proven to be useful models for a wide range of visual tasks [1], [2], [3], [4]. At each convolutional layer in the network, a set of filters expresses neighbourhood spatial connectivity patterns along the input channels, fusing spatial and channel-wise information within local receptive fields. By interleaving a series of convolutional layers with non-linear activation functions and downsampling operators, CNNs are able to produce image representations that capture hierarchical patterns and attain global theoretical receptive fields. A central theme of computer vision research is the search for more powerful representations that capture only those properties of an image that are most salient for a given task, enabling improved performance. As a widely used family of models for visual tasks, the development of new neural network architecture designs now represents a key frontier of this research. Recent work has shown that the representations produced by CNNs can be strengthened by integrating learning mechanisms into the network that help capture spatial correlations between features. One such approach, popularised by the Inception family of architectures [5], [6], incorporates multi-scale processing into network modules to achieve improved performance. Further work has sought to better model spatial dependencies [7], [8] and to incorporate spatial attention into the structure of the network [9].
In this paper, we investigate a different aspect of network design: the relationship between channels. We introduce a new architectural unit, which we term the Squeeze-and-Excitation (SE) block, whose goal is to improve the quality of the representations produced by a network by explicitly modelling the interdependencies between the channels of its convolutional features. To this end, we propose a mechanism that allows the network to perform feature recalibration, through which it can learn to use global information to selectively emphasise informative features and suppress less useful ones.
The structure of the SE building block is shown in Fig. 1. For any given transformation $F_{tr}$ mapping an input $X$ to feature maps $U \in \mathbb{R}^{H \times W \times C}$, e.g. a convolution, we can construct a corresponding SE block to perform feature recalibration. The features $U$ are first passed through a squeeze operation, which aggregates the feature maps across their spatial dimensions (H×W) to produce a channel descriptor. The function of this descriptor is to embed the global distribution of channel-wise feature responses, allowing information from the global receptive field of the network to be used by all of its layers. The aggregation is followed by an excitation operation, which takes the form of a simple self-gating mechanism that takes the embedding as input and produces a collection of per-channel modulation weights. These weights are applied to the feature maps $U$ to generate the output of the SE block, which can be fed directly into subsequent layers of the network.
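As a concrete illustration of this flow, here is a minimal functional sketch in PyTorch, assuming an (N, C, H, W) tensor layout; the function name and the explicit weight arguments are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def se_block(u: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor) -> torch.Tensor:
    """Minimal squeeze-excite-rescale sketch for feature maps u of shape (N, C, H, W).

    w1 has shape (C // r, C) and w2 has shape (C, C // r).
    """
    n, c, _, _ = u.shape
    z = u.mean(dim=(2, 3))                                     # squeeze: (N, C) channel descriptor
    s = torch.sigmoid(F.linear(F.relu(F.linear(z, w1)), w2))   # excitation: (N, C) gates in (0, 1)
    return u * s.view(n, c, 1, 1)                              # rescale U channel-wise
```

For example, `se_block(torch.randn(2, 64, 32, 32), torch.randn(4, 64), torch.randn(64, 4))` returns a tensor with the same shape as its input, with every channel multiplied by a learned gate.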

An SE network (SENet) can be constructed simply by stacking a collection of SE blocks. Moreover, these SE blocks can also be used as a drop-in replacement for the original block at any depth in the network architecture.
Although the template for the building block is generic, the role it performs differs at different depths throughout the network. In earlier layers, it excites informative features in a class-agnostic manner, strengthening the shared low-level representations. In later layers, the SE blocks become increasingly specialised and respond to different inputs in a highly class-specific manner (Section 7.2). As a consequence, the benefits of the feature recalibration performed by SE blocks can accumulate through the network.
Designing and developing new CNN architectures is a difficult engineering task, typically requiring the selection of many new hyperparameters and layer configurations. In contrast, the structure of the SE block is simple and can be used directly in existing state-of-the-art architectures by replacing components with their SE counterparts, effectively improving performance. SE blocks are also computationally lightweight and impose only a slight increase in model complexity and computational burden.
To provide evidence for these claims, we developed several SENets and performed an extensive evaluation on the ImageNet dataset [10]. We also present results beyond ImageNet, showing that the benefits of our approach are not restricted to a specific dataset or task. Using SENets, we ranked first in the ILSVRC 2017 classification competition. Our best model ensemble achieved a 2.251% top-5 error on the test set. This represents a relative improvement of roughly 25% compared with the winner of the previous year (whose top-5 error was 2.991%).
2 RELATED WORK
Deeper architectures.
VGGNets [11] and the Inception models [5] showed that increasing the depth of a network can significantly improve the quality of the representations it is able to learn. By regulating the distribution of the inputs to each layer, Batch Normalization (BN) [6] added stability to the learning process in deep networks and produced smoother optimisation surfaces [12]. Building on these works, ResNets demonstrated that it is possible to learn considerably deeper and stronger networks through the use of identity-based skip connections [13], [14]. Highway networks [15] introduced a gating mechanism to regulate the flow of information along shortcut connections. Following these works, there have been further reformulations of the connections between network layers [16], [17], which show promising improvements to the learning and representational properties of deep networks.
An alternative, but closely related, line of research has focused on methods to improve the functional form of the computational elements within a network. Grouped convolutions have proven to be a popular approach for increasing the cardinality of learned transformations [18], [19]. More flexible compositions of operators can be achieved with multi-branch convolutions [5], [6], [20], [21], which can be viewed as a natural extension of the grouping operator. In prior work, cross-channel correlations are typically mapped as new combinations of features, either independently of spatial structure [22], [23] or jointly by using standard convolutional filters [24] with 1×1 convolutions. Much of this research has concentrated on the objective of reducing model and computational complexity, reflecting an assumption that channel relationships can be formulated as a composition of instance-agnostic functions with local receptive fields. In contrast, we claim that providing the unit with a mechanism to explicitly model dynamic, non-linear dependencies between channels using global information can ease the learning process and significantly enhance the representational power of the network.
Algorithmic Architecture Search.
In addition to the works described above, there is also a rich history of research that aims to forgo manual architecture design and instead seeks to learn the structure of the network automatically. Much of the early work in this domain was conducted in the neuro-evolution community, which established methods for searching across network topologies with evolutionary methods [25], [26]. While often computationally demanding, evolutionary search has had notable successes, which include finding good memory cells for sequence models [27], [28] and learning sophisticated architectures for large-scale image classification [29], [30], [31]. With the goal of reducing the computational burden of these methods, efficient alternatives have been proposed based on Lamarckian inheritance [32] and differentiable architecture search [33].
By formulating architecture search as hyperparameter optimisation, random search [34] and other more sophisticated model-based optimisation techniques [35], [36] can also be used to tackle the problem. Topology selection as a path through a fabric of possible designs [37] and direct architecture prediction [38], [39] have been proposed as additional viable architecture search tools. Particularly strong results have been achieved with techniques from reinforcement learning [40], [41], [42], [43], [44]. SE blocks can be used as atomic building blocks for these search algorithms, and have been demonstrated to be highly effective in this capacity in concurrent work [45].
Attention and gating mechanisms.
Attention can be interpreted as a means of biasing the allocation of available computational resources towards the most informative components of a signal [46], [47], [48], [49], [50], [51]. Attention mechanisms have demonstrated their utility across many tasks, including sequence learning [52], [53], localisation and understanding in images [9], [54], image captioning [55], [56] and lip reading [57]. In these applications, attention can be incorporated as an operator following one or more layers representing higher-level abstractions for adaptation between modalities. Some works provide interesting studies into the combined use of spatial and channel attention [58], [59]. Wang et al. [58] introduced a powerful trunk-and-mask attention mechanism based on hourglass modules [8], which is inserted between the intermediate stages of deep residual networks. In contrast, our proposed SE block comprises a lightweight gating mechanism that focuses on enhancing the representational power of the network by modelling channel-wise relationships in a computationally efficient manner.
3 SQUEEZE-AND-EXCITATION BLOCKS
The Squeeze-and-Excitation block is a computational unit that can be built upon a transformation $F_{tr}$ mapping an input $X \in \mathbb{R}^{H' \times W' \times C'}$ to feature maps $U \in \mathbb{R}^{H \times W \times C}$. In the notation that follows, we take $F_{tr}$ to be a convolutional operator and use $V = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_C]$ to denote the learned set of filter kernels, where $\mathbf{v}_c$ refers to the parameters of the $c$-th filter. We can then write the outputs as $U = [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_C]$, where

$$\mathbf{u}_c = \mathbf{v}_c * \mathbf{X} = \sum_{s=1}^{C'} \mathbf{v}_c^{s} * \mathbf{x}^{s}.$$

Here $*$ denotes convolution, $\mathbf{v}_c = [\mathbf{v}_c^{1}, \mathbf{v}_c^{2}, \ldots, \mathbf{v}_c^{C'}]$, $\mathbf{X} = [\mathbf{x}^{1}, \mathbf{x}^{2}, \ldots, \mathbf{x}^{C'}]$ and $\mathbf{u}_c \in \mathbb{R}^{H \times W}$. $\mathbf{v}_c^{s}$ is a 2D spatial kernel representing a single channel of $\mathbf{v}_c$ that acts on the corresponding channel of $\mathbf{X}$. To simplify the notation, bias terms are omitted. Since the output is produced by a summation through all channels, channel dependencies are implicitly embedded in $\mathbf{v}_c$, but are entangled with the local spatial correlation captured by the filters. The channel relationships modelled by convolution are inherently implicit and local (except the ones at the top-most layers). We expect the learning of convolutional features to be enhanced by explicitly modelling channel interdependencies, so that the network is able to increase its sensitivity to informative features which can be exploited by subsequent transformations. Consequently, we would like to provide it with access to global information and recalibrate filter responses in two steps, squeeze and excitation, before they are fed into the next transformation. A diagram illustrating the structure of an SE block is shown in Fig. 1.
3.1 Squeeze: Global Information Embedding
In order to tackle the issue of exploiting channel dependencies, we first consider the signal to each channel in the output features. Each of the learned filters operates with a local receptive field, and consequently each unit of the transformation output U is unable to exploit contextual information outside of this region.
To mitigate this problem, we propose to squeeze global spatial information into a channel descriptor. This is achieved by using global average pooling to generate channel-wise statistics. Formally, a statistic $\mathbf{z} \in \mathbb{R}^{C}$ is generated by shrinking $U$ through its spatial dimensions $H \times W$, such that the $c$-th element of $\mathbf{z}$ is calculated by:

$$z_c = F_{sq}(\mathbf{u}_c) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} u_c(i, j).$$

Discussion.
The output of the transformation U can be interpreted as a collection of local descriptors whose statistics are expressive for the whole image. Exploiting such information is prevalent in prior feature engineering work [60], [61], [62]. We opt for the simplest aggregation technique, global average pooling, noting that more sophisticated strategies could be employed here as well.
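In most tensor frameworks the squeeze step therefore reduces to a single spatial mean. A minimal sketch in PyTorch, assuming an (N, C, H, W) layout, confirming that the pooling module and the explicit average in the formula above agree:

```python
import torch

u = torch.randn(2, 4, 7, 7)                              # example feature maps U: (N, C, H, W)
z_mean = u.mean(dim=(2, 3))                              # global average pooling over H x W
z_sum = u.sum(dim=(2, 3)) / (u.shape[2] * u.shape[3])    # the explicit double sum, normalised
z_pool = torch.nn.AdaptiveAvgPool2d(1)(u).flatten(1)     # equivalent pooling-module form
assert torch.allclose(z_mean, z_sum) and torch.allclose(z_mean, z_pool)
```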
3.2 Excitation: Adaptive Recalibration
To make use of the information aggregated in the squeeze operation, we follow it with a second operation that aims to fully capture channel-wise dependencies. To fulfil this objective, the function must meet two criteria: first, it must be flexible (in particular, it must be capable of learning a non-linear interaction between channels) and second, it must learn a non-mutually-exclusive relationship, since we would like to ensure that multiple channels are allowed to be emphasised (rather than enforcing a one-hot activation). To meet these criteria, we opt to employ a simple gating mechanism with a sigmoid activation:

$$\mathbf{s} = F_{ex}(\mathbf{z}, \mathbf{W}) = \sigma\big(g(\mathbf{z}, \mathbf{W})\big) = \sigma\big(\mathbf{W}_2\, \delta(\mathbf{W}_1 \mathbf{z})\big),$$

where $\delta$ refers to the ReLU [63] function, $\mathbf{W}_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $\mathbf{W}_2 \in \mathbb{R}^{C \times \frac{C}{r}}$. To limit model complexity and aid generalisation, we parameterise the gating mechanism by forming a bottleneck with two fully-connected (FC) layers around the non-linearity, i.e. a dimensionality-reduction layer with reduction ratio $r$ (this parameter choice is discussed in Section 6.1), a ReLU and then a dimensionality-increasing layer returning to the channel dimension of the transformation output $U$. The final output of the block is obtained by rescaling $U$ with the activations $\mathbf{s}$:

$$\widetilde{\mathbf{x}}_c = F_{scale}(\mathbf{u}_c, s_c) = s_c\, \mathbf{u}_c,$$

where $\widetilde{\mathbf{X}} = [\widetilde{\mathbf{x}}_1, \widetilde{\mathbf{x}}_2, \ldots, \widetilde{\mathbf{x}}_C]$ and $F_{scale}(\mathbf{u}_c, s_c)$ refers to channel-wise multiplication between the scalar $s_c$ and the feature map $\mathbf{u}_c \in \mathbb{R}^{H \times W}$.
Discussion.
The excitation operator maps the input-specific descriptor $\mathbf{z}$ to a set of channel weights. In this regard, SE blocks intrinsically introduce dynamics conditioned on the input, which can be regarded as a self-attention function on channels whose relationships are not confined to the local receptive field that the convolutional filters respond to.
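Combining the squeeze, the bottlenecked gating and the rescaling described above, a module-level sketch in PyTorch might look as follows. The class name SELayer and the default reduction ratio of 16 follow common re-implementations and are assumptions here, not the authors' released code.

```python
import torch
import torch.nn as nn

class SELayer(nn.Module):
    """Squeeze-and-Excitation layer for feature maps of shape (N, C, H, W)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: global average pooling
        self.fc = nn.Sequential(                       # excitation: bottleneck of ratio r
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = u.shape
        z = self.pool(u).view(n, c)                    # (N, C) channel descriptor
        s = self.fc(z).view(n, c, 1, 1)                # (N, C, 1, 1) per-channel gates
        return u * s                                   # rescale the feature maps
```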
3.3 Instantiations
The SE block can be integrated into standard architectures such as VGGNet by inserting it after the non-linearity following each convolution. Moreover, the flexibility of the SE block means that it can be directly applied to transformations beyond standard convolutions. To illustrate this point, we develop SENets by incorporating SE blocks into several examples of more complex architectures, described next.
We first consider the construction of SE blocks for Inception networks [5]. Here, we simply take the transformation $F_{tr}$ to be an entire Inception module (see Fig. 2), and by making this change for each such module in the architecture, we obtain an SE-Inception network. SE blocks can also be used directly with residual networks (Fig. 3 depicts the schema of an SE-ResNet module). Here, the SE block transformation $F_{tr}$ is taken to be the non-identity branch of a residual module, and both squeeze and excitation act before summation with the identity branch. Further variants that integrate SE blocks with ResNeXt [19], Inception-ResNet [21], MobileNet [64] and ShuffleNet [65] can be constructed in a similar way. For concrete examples of SENet architectures, a detailed description of SE-ResNet-50 and SE-ResNeXt-50 is given in Table 1.
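To make the SE-ResNet schema of Fig. 3 concrete, the sketch below (reusing the SELayer sketch above) applies squeeze and excitation to the non-identity branch of a simplified residual block before the summation with the identity branch; the single pair of 3×3 convolutions is a simplification for illustration rather than the full bottleneck design of Table 1.

```python
import torch
import torch.nn as nn

class SEBasicResidualBlock(nn.Module):
    """Simplified residual block with an SE layer on the non-identity branch."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.se = SELayer(channels, reduction)   # recalibrate the residual branch
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.se(self.conv(x))              # squeeze and excitation before the summation
        return self.relu(out + x)                # add the identity branch, then non-linearity
```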



One consequence of the flexibility of the SE block is that there are several viable ways in which it could be integrated into these architectures. Therefore, to assess the sensitivity to the integration strategy used to incorporate SE blocks into a network architecture, we also provide ablation experiments in Section 6.5 that explore different designs for block inclusion.
4 MODEL AND COMPUTATIONAL COMPLEXITY
For the proposed SE block design to be of practical use, it must offer a good trade-off between improved performance and increased model complexity. To illustrate the computational burden associated with the module, we consider a comparison between ResNet-50 and SE-ResNet-50. ResNet-50 requires ~3.86 GFLOPs in a single forward pass for a 224×224 pixel input image. Each SE block makes use of a global average pooling operation in the squeeze phase and two small FC layers in the excitation phase, followed by an inexpensive channel-wise scaling operation. In aggregate, when setting the reduction ratio r (introduced in Section 3.2) to 16, SE-ResNet-50 requires ~3.87 GFLOPs, corresponding to a 0.26% relative increase over the original ResNet-50. In exchange for this slight additional computational burden, the accuracy of SE-ResNet-50 surpasses that of ResNet-50 and in fact approaches that of a deeper network requiring ~7.58 GFLOPs (Table 2).

In practice, a single pass forwards and backwards through ResNet-50 takes 190 ms, compared to 209 ms for SE-ResNet-50 with a training minibatch of 256 images (both timings performed on a server with 8 NVIDIA Titan X GPUs). We suggest that this represents a reasonable runtime overhead, which may be further reduced as global pooling and small inner-product operations receive further optimisation in popular GPU libraries. Due to its importance for embedded device applications, we further benchmark CPU inference time for each model: for a 224×224 pixel input image, ResNet-50 takes 164 ms, compared to 167 ms for SE-ResNet-50. We believe that the small additional computational cost incurred by the SE block is justified by its contribution to model performance.
Next, we consider the additional parameters introduced by the proposed SE block. These additional parameters result solely from the two FC layers of the gating mechanism and therefore constitute a small fraction of the total network capacity. Concretely, the total number of weight parameters introduced by these FC layers is given by:

$$\frac{2}{r} \sum_{s=1}^{S} N_s \cdot C_s^{2},$$

where r denotes the reduction ratio, S refers to the number of stages (a stage being the collection of blocks operating on feature maps of a common spatial dimension), $C_s$ denotes the dimension of the output channels and $N_s$ denotes the number of repeated blocks for stage s (when bias terms are used in the FC layers, the introduced parameters and computational cost are typically negligible). SE-ResNet-50 introduces ~2.5 million additional parameters beyond the ~25 million parameters required by ResNet-50, corresponding to a ~10% increase. In practice, the majority of these parameters come from the final stage of the network, where the excitation operation is performed over the greatest number of channels. However, we found that this comparatively costly final stage of SE blocks could be removed at only a small cost in performance (less than 0.1% top-5 error on ImageNet), reducing the relative parameter increase to ~4%, which may prove useful in cases where parameter usage is a key consideration (see Section 6.4 and Section 7.2 for further discussion).
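As a sanity check of this formula, the short computation below plugs in the standard ResNet-50 stage configuration (3, 4, 6 and 3 blocks with output channel widths 256, 512, 1024 and 2048, an assumption stated here for illustration) and r = 16, and reproduces the figure of roughly 2.5 million additional parameters quoted above.

```python
# Additional FC parameters introduced by SE blocks: (2 / r) * sum_s N_s * C_s^2
r = 16
stages = [(3, 256), (4, 512), (6, 1024), (3, 2048)]   # (N_s, C_s) for ResNet-50

extra_params = (2 / r) * sum(n_s * c_s ** 2 for n_s, c_s in stages)
print(f"{extra_params:,.0f}")   # prints 2,514,944, i.e. ~2.5 million additional parameters
```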
5 EXPERIMENTS
In this section, we conduct experiments to investigate the effectiveness of SE blocks across a range of tasks, datasets and model architectures.
(The remainder of this section is omitted in these notes.)
6 ABLATION STUDY


7 ROLE OF SE BLOCKS
Although the proposed SE block has been shown to improve network performance on multiple visual tasks, we would also like to understand the relative importance of the squeeze operation and how the excitation mechanism operates in practice. A rigorous theoretical analysis of the representations learned by deep neural networks remains challenging, so we instead take an empirical approach to examining the role played by the SE block, with the goal of attaining at least a primitive understanding of its practical function.
7.1 Effect of Squeeze
To assess whether the global embedding produced by the squeeze operation plays an important role in performance, we experiment with a variant of the SE block that adds an equal number of parameters but does not perform global average pooling. Specifically, we remove the pooling operation and replace the two FC layers with corresponding 1×1 convolutions of identical channel dimensions in the excitation operator (i.e. NoSqueeze), where the excitation output maintains the spatial dimensions of the input. In contrast to the SE block, these pointwise convolutions can only remap the channels as a function of the output of a local operator. While in practice the later layers of a deep network often possess a (theoretical) global receptive field, global embeddings are no longer directly accessible throughout the network in the NoSqueeze variant. The accuracy and computational complexity of both variants are compared with a standard ResNet-50 model in Table 16. We observe that the use of global information has a significant influence on model performance, underlining the importance of the squeeze operation. Moreover, in comparison to the NoSqueeze design, the SE block allows this global information to be used in a computationally parsimonious manner.
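For illustration, a sketch of such a NoSqueeze variant in the same PyTorch style as the SELayer sketch above: the global pooling is removed and the two FC layers are replaced by 1×1 convolutions of matching channel dimensions, so the gates retain the spatial dimensions of the input and depend only on local responses. The class name is an assumption for illustration.

```python
import torch
import torch.nn as nn

class NoSqueezeLayer(nn.Module):
    """SE variant without global pooling: per-position gating via 1x1 convolutions."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        s = self.gate(u)          # (N, C, H, W): gates keep the spatial dimensions
        return u * s              # recalibration now depends only on local responses
```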

7.2 Role of Excitation
To provide a clearer picture of the function of the excitation operator in SE blocks, in this section we study example activations from the SE-ResNet-50 model and examine their distribution with respect to different classes and different input images at various depths in the network. In particular, we would like to understand how excitations vary across images of different classes, and across images within a class.
We first consider the distribution of excitations for different classes. Specifically, we sample four classes from the ImageNet dataset that exhibit semantic and appearance diversity, namely goldfish, pug, plane and cliff (example images from these classes are shown in the appendix). We then draw fifty samples for each class from the validation set, compute the average activations for fifty uniformly sampled channels in the last SE block of each stage (immediately prior to downsampling) and plot their distribution in Fig. 6. For reference, we also plot the distribution of the mean activations across all 1000 classes.
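A hedged sketch of how such per-class excitation statistics could be collected with forward hooks is shown below; the dictionary of SE gate sub-modules to monitor, and their names, are assumptions that depend on how the SE layers are exposed in a particular implementation, and the per-class sampling of fifty images is left to the data loader.

```python
import torch

def collect_gate_activations(model, loader, se_modules, device="cpu"):
    """Average the sigmoid gate activations of the given SE modules over a data loader.

    se_modules: dict mapping a name (e.g. "SE_5_2") to the sub-module whose
    output is the (N, C) or (N, C, 1, 1) tensor of channel gates.
    """
    sums, counts, handles = {}, {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            gates = output.detach().flatten(1)               # (N, C) per-image gate vectors
            sums[name] = sums.get(name, 0) + gates.sum(dim=0)
            counts[name] = counts.get(name, 0) + gates.shape[0]
        return hook

    for name, module in se_modules.items():
        handles.append(module.register_forward_hook(make_hook(name)))

    model.eval()
    with torch.no_grad():
        for images, _ in loader:                             # e.g. fifty images of one class
            model(images.to(device))

    for h in handles:
        h.remove()
    return {name: sums[name] / counts[name] for name in sums}
```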


We make the following three observations about the role of the excitation operation. First, the distribution across different classes is very similar at the earlier layers of the network, e.g. SE_2_3. This suggests that the importance of feature channels is likely to be shared by different classes in the early stages of the network. The second observation is that at greater depth, the value of each channel becomes much more class-specific, as different classes exhibit different preferences for the discriminative value of features, e.g. SE_4_6 and SE_5_1. These observations are consistent with findings in previous work [81], [82], namely that earlier layer features are typically more general (e.g. class-agnostic in the context of a classification task) while later layer features exhibit greater levels of specificity [83].
Next, we observe a somewhat different phenomenon in the last stage of the network. SE_5_2 exhibits an interesting tendency towards a saturated state in which most of the activations are close to one. At the point at which all activations take the value one, an SE block reduces to the identity operator. At the end of the network in SE_5_3 (which is immediately followed by global pooling prior to the classifier), a similar pattern emerges over different classes, up to a modest change in scale (which could be tuned by the classifier). This suggests that SE_5_2 and SE_5_3 are less important than previous blocks in providing recalibration to the network. This finding is consistent with the empirical investigation in Section 4, which demonstrated that the additional parameter count could be significantly reduced by removing the SE blocks of the last stage at only a marginal loss of performance.
Finally, in Fig. 7 we show the mean and standard deviations of the activations for image instances within the same class for two sample classes (goldfish and plane). We observe a trend consistent with the inter-class visualisation, indicating that the dynamic behaviour of SE blocks varies over both classes and instances within a class. Particularly in the later layers of the network, where there is considerable diversity of representation within a single class, the network learns to take advantage of feature recalibration to improve its discriminative performance [84]. In summary, SE blocks produce instance-specific responses which nevertheless function to support the increasingly class-specific needs of the model at different layers in the architecture.

8 CONCLUSION
In this paper we proposed the SE block, an architectural unit designed to improve the representational power of a network by enabling it to perform dynamic channel-wise feature recalibration. A wide range of experiments show the effectiveness of SENets, which achieve state-of-the-art performance across multiple datasets and tasks. In addition, SE blocks shed some light on the inability of previous architectures to adequately model channel-wise feature dependencies. We hope this insight may prove useful for other tasks that require strong discriminative features. Finally, the feature importance values produced by SE blocks may be of use for other tasks, such as network pruning for model compression.