
Squeeze-and-Excitation Networks

2022-07-23 16:48:00 TJMtaotao

Squeeze-and-Excitation Networks

Jie Hu [0000−0002−5150−1003]
Li Shen[0000−0002−2283−4976]
Samuel Albanie[0000−0001−9736−5134]
Gang Sun[0000−0001−6913−6799]
Enhua Wu[0000−0002−2174−1428]

 

Abstract: The central building block of convolutional neural networks (CNNs) is the convolution operator, which enables networks to construct informative features by fusing spatial and channel-wise information within the local receptive field of each layer. A broad range of prior research has investigated the spatial component of this relationship, seeking to strengthen the representational power of a CNN by enhancing the quality of spatial encodings throughout its feature hierarchy. In this work, we focus instead on the channel relationship and propose a novel architectural unit, which we term the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling the interdependencies between channels. We show that these blocks can be stacked together to form SENet architectures that generalise extremely effectively across different datasets. We further demonstrate that SE blocks bring significant improvements in the performance of existing state-of-the-art CNNs at only a slight additional computational cost. Squeeze-and-Excitation Networks formed the foundation of our ILSVRC 2017 classification submission, which won first place and reduced the top-5 error to 2.251%, a relative improvement of roughly 25% over the winning entry of 2016.

Keywords: Squeeze-and-Excitation, image representations, attention, convolutional neural networks.

1 INTRODUCTION

Convolutional neural networks (CNNs) have proven to be useful models for tackling a wide variety of visual tasks [1], [2], [3], [4]. At each convolutional layer of a network, a set of filters expresses neighbourhood spatial connectivity patterns along the input channels, fusing spatial and channel-wise information together within local receptive fields. By interleaving a series of convolutional layers with non-linear activation functions and downsampling operators, CNNs produce image representations that capture hierarchical patterns and attain global theoretical receptive fields. A central theme of computer vision research is the search for more powerful representations that capture only those properties of an image that are most salient for a given task, enabling improved performance. As a widely used family of models for visual tasks, the development of new neural network architecture designs is a key frontier of this research. Recent work has shown that the representations produced by CNNs can be strengthened by integrating learning mechanisms into the network that help capture spatial correlations between features. One such approach, popularised by the Inception family of architectures [5], [6], incorporates multi-scale processing into network modules to achieve improved performance. Further work has sought to better model spatial dependencies [7], [8] and to incorporate spatial attention into the structure of the network [9].

In this paper, we investigate a different aspect of network design: the relationship between channels. We introduce a new architectural unit, which we term the Squeeze-and-Excitation (SE) block, with the goal of improving the quality of the representations produced by a network by explicitly modelling the interdependencies between the channels of its convolutional features. To this end, we propose a mechanism that allows the network to perform feature recalibration, through which it can learn to use global information to selectively emphasise informative features and suppress less useful ones.

The structure of an SE building block is depicted in Figure 1. For any given transformation F_tr mapping an input X to feature maps U ∈ R^(H×W×C), e.g. a convolution, we can construct a corresponding SE block to perform feature recalibration. The features U are first passed through a squeeze operation, which aggregates the feature maps across their spatial dimensions (H×W) to produce a channel descriptor. The function of this descriptor is to produce an embedding of the global distribution of channel-wise feature responses, allowing information from the global receptive field of the network to be used by all its layers. The aggregation is followed by an excitation operation, which takes the form of a simple self-gating mechanism that takes the embedding as input and produces a collection of per-channel modulation weights. These weights are applied to the feature maps U to generate the output of the SE block, which can be fed directly into subsequent layers of the network.

An SE network (SENet) can be constructed by simply stacking a collection of SE blocks. Moreover, these SE blocks can also be used as a drop-in replacement for the original block at a range of depths in the network architecture (Section 6.4). While the template for the building block is generic, the role it performs differs at different depths of the network. In earlier layers, it excites informative features in a class-agnostic manner, strengthening the shared low-level representations. In later layers, the SE blocks become increasingly specialised, responding to different inputs in a highly class-specific manner (Section 7.2). As a consequence, the benefits of the feature recalibration performed by SE blocks can be accumulated through the network.

 

 

 

Figure 1. A Squeeze-and-Excitation block.

Designing and developing new CNN architectures is a difficult engineering task, typically requiring the selection of many new hyperparameters and layer configurations. By contrast, the structure of the SE block is simple and can be used directly in existing state-of-the-art architectures by replacing components with their SE counterparts, effectively improving performance. SE blocks are also computationally lightweight and impose only a slight increase in model complexity and computational burden.

To provide evidence for these claims, we developed several SENets and performed an extensive evaluation on the ImageNet dataset [10]. We also present results beyond ImageNet which indicate that the benefits of our approach are not restricted to a specific dataset or task. By making use of SENets, we ranked first in the ILSVRC 2017 classification competition. Our best entry achieved a top-5 error of 2.251% on the test set (results: http://image-net.org/challenges/LSVRC/2017/results). Compared to the winner of the previous year (a top-5 error of 2.991%), this represents a relative improvement of roughly 25%.

2 RELATED WORK

Deeper architectures. VGGNets [11] and Inception models [5] showed that increasing the depth of a network can significantly improve the quality of the representations it learns. By regulating the distribution of the inputs to each layer, Batch Normalization (BN) [6] added stability to the learning process in deep networks and produced smoother optimisation surfaces [12]. Building on these works, ResNets demonstrated that it is possible to learn considerably deeper and stronger networks through the use of identity-based skip connections [13], [14]. Highway networks [15] introduced a gating mechanism to regulate the flow of information along shortcut connections. Following these works, there have been further reformulations of the connections between network layers [16], [17], which show promising improvements to the learning and representational properties of deep networks.

An alternative, closely related line of research has focused on improving the functional form of the computational elements contained within a network. Grouped convolutions have proven to be a popular approach for increasing the cardinality of learned transformations [18], [19]. More flexible compositions of operators can be achieved with multi-branch convolutions [5], [6], [20], [21], which can be viewed as a natural extension of the grouping operator. In prior work, cross-channel correlations are typically mapped as new combinations of features, either independently of spatial structure [22], [23] or jointly by using standard 1×1 convolutional filters [24]. Much of this research has concentrated on reducing model and computational complexity, reflecting an assumption that channel relationships can be formulated as a composition of instance-agnostic functions with local receptive fields. In contrast, we argue that providing the network with a mechanism to explicitly model dynamic, non-linear dependencies between channels using global information can ease the learning process and significantly enhance its representational power.

Algorithmic architecture search. Alongside the works described above, there is also a rich history of research that aims to forgo manual architecture design and instead searches for network structures automatically. Much of the early work in this domain was conducted in the neuro-evolution community, which established methods for searching across network topologies with evolutionary methods [25], [26]. While often computationally demanding, evolutionary search has had notable successes, including finding good memory cells for sequence models [27], [28] and learning sophisticated architectures for large-scale image classification [29], [30], [31]. With the goal of reducing the computational burden of these methods, efficient alternatives have been proposed based on Lamarckian inheritance [32] and differentiable architecture search [33].

By formulating architecture search as hyperparameter optimisation, random search [34] and other more sophisticated model-based optimisation techniques [35], [36] can also be used to tackle the problem. Topology selection as a path through a fabric of possible designs [37] and direct architecture prediction [38], [39] have been proposed as additional viable architecture search tools. Particularly strong results have been achieved with techniques from reinforcement learning [40], [41], [42], [43], [44]. SE blocks can be used as atomic building blocks for these search algorithms and were demonstrated to be highly effective in this capacity in concurrent work.

Attention and gating mechanisms. Attention can be interpreted as a means of biasing the allocation of available computational resources towards the most informative components of a signal [46], [47], [48], [49], [50], [51]. Attention mechanisms have demonstrated their utility across many tasks, including sequence learning [52], [53], [9], [54], [55], [56] and lip reading [57]. In these applications, attention can be incorporated as an operator following one or more layers representing higher-level abstractions for adaptation between modalities. Some works provide interesting studies of the combined use of spatial and channel attention [58], [59]. Wang et al. [58] introduced a powerful trunk-and-mask attention mechanism based on hourglass modules [8], which is inserted between the intermediate stages of deep residual networks. By contrast, our proposed SE block comprises a lightweight gating mechanism that focuses on enhancing the representational power of the network by modelling channel-wise relationships in a computationally efficient manner.

3 SQUEEZE-AND-EXCITATION BLOCKS

A Squeeze-and-Excitation block is a computational unit which can be built upon a transformation F_tr mapping an input X ∈ R^(H′×W′×C′) to feature maps U ∈ R^(H×W×C). In the notation that follows, we take F_tr to be a convolutional operator and use V = [v_1, v_2, …, v_C] to denote the learned set of filter kernels, where v_c refers to the parameters of the c-th filter. We can then write the outputs as U = [u_1, u_2, …, u_C], where

$$u_c = v_c * X = \sum_{s=1}^{C'} v_c^s * x^s.$$

Here * denotes convolution, v_c = [v_c^1, v_c^2, …, v_c^{C′}], X = [x^1, x^2, …, x^{C′}] and u_c ∈ R^(H×W). v_c^s is a 2D spatial kernel representing a single channel of v_c that acts on the corresponding channel of X. To simplify the notation, bias terms are omitted. Since the output is produced by a summation through all channels, channel dependencies are implicitly embedded in v_c, but they are entangled with the local spatial correlation captured by the filters. The channel relationships modelled by convolution are thus inherently implicit and local (except those at the top-most layers). We expect the learning of convolutional features to be enhanced by explicitly modelling channel interdependencies, so that the network can increase its sensitivity to informative features, which can then be exploited by subsequent transformations. Consequently, we would like to provide it with access to global information and recalibrate filter responses in two steps, squeeze and excitation, before they are fed into the next transformation. A diagram illustrating the structure of an SE block is shown in Figure 1.
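To make the entanglement of channel information concrete, the following minimal PyTorch sketch (illustrative shapes and names, not taken from the paper) verifies that each output channel of a standard convolution is the sum of per-input-channel 2D convolutions, exactly as in the equation above.

import torch
import torch.nn.functional as F

in_ch, out_ch, H, W = 4, 8, 16, 16
x = torch.randn(1, in_ch, H, W)
conv = torch.nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)

u = conv(x)  # standard convolution: channel information is mixed implicitly

# Reconstruct output channel c as a sum of single-channel convolutions v_c^s * x^s
c = 0
u_c = sum(
    F.conv2d(x[:, s:s + 1], conv.weight[c:c + 1, s:s + 1], padding=1)
    for s in range(in_ch)
)
print(torch.allclose(u[:, c:c + 1], u_c, atol=1e-6))  # True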

3.1 Squeeze: Global Information Embedding

In order to tackle the issue of exploiting channel dependencies, we first consider the signal to each channel in the output features. Each of the learned filters operates with a local receptive field, and consequently each unit of the transformation output U is unable to exploit contextual information outside of this region.

To mitigate this problem, we propose to squeeze global spatial information into a channel descriptor. This is achieved by using global average pooling to generate channel-wise statistics. Formally, a statistic z ∈ R^C is generated by shrinking U through its spatial dimensions H×W, such that the c-th element of z is calculated by:

$$z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j).$$

Discussion. The output of the transformation U can be interpreted as a collection of local descriptors whose statistics are expressive for the whole image. Exploiting such information is prevalent in prior feature engineering work [60], [61], [62]. We opt for the simplest aggregation technique, global average pooling, noting that more sophisticated strategies could also be employed here.
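As an illustration (shapes chosen arbitrarily), the squeeze step amounts to nothing more than a spatial mean per channel:

import torch
import torch.nn as nn

U = torch.randn(1, 64, 56, 56)                    # feature maps U with C = 64 channels
z = U.mean(dim=(2, 3))                            # z in R^C: one statistic per channel
z_pool = nn.AdaptiveAvgPool2d(1)(U).view(1, 64)   # the equivalent pooling-layer form
print(torch.allclose(z, z_pool))                  # True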

3.2 Excitation: Adaptive Recalibration

To make use of the information aggregated in the squeeze operation, we follow it with a second operation that aims to fully capture channel-wise dependencies. To fulfil this objective, the function must meet two criteria: first, it must be flexible (in particular, it must be capable of learning a non-linear interaction between channels); second, it must learn a non-mutually-exclusive relationship, since we would like to ensure that multiple channels can be emphasised (rather than enforcing a one-hot activation). To meet these criteria, we opt to employ a simple gating mechanism with a sigmoid activation:

$$s = F_{ex}(z, W) = \sigma(g(z, W)) = \sigma(W_2\,\delta(W_1 z)),$$

where δ refers to the ReLU [63] function, W_1 ∈ R^((C/r)×C) and W_2 ∈ R^(C×(C/r)). To limit model complexity and aid generalisation, we parameterise the gating mechanism by forming a bottleneck with two fully-connected (FC) layers around the non-linearity, i.e. a dimensionality-reduction layer with reduction ratio r (the choice of this parameter is discussed in Section 6.1), a ReLU, and then a dimensionality-increasing layer returning to the channel dimension of the transformation output U. The final output of the block is obtained by rescaling U with the activations s:

$$\tilde{x}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c,$$

where X̃ = [x̃_1, x̃_2, …, x̃_C] and F_scale(u_c, s_c) refers to channel-wise multiplication between the scalar s_c and the feature map u_c ∈ R^(H×W).

Discussion. The excitation operator maps the input-specific descriptor z to a set of channel weights. In this regard, SE blocks intrinsically introduce dynamics conditioned on the input, which can be regarded as a self-attention function on channels whose relationships are not confined to the local receptive field to which the convolutional filters respond.
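A minimal functional sketch of the excitation and rescaling steps follows; the weight matrices here are randomly initialised purely for illustration, with assumed values C = 64 and r = 16.

import torch

C, r = 64, 16
z = torch.randn(1, C)                # channel descriptor from the squeeze step
W1 = torch.randn(C // r, C)          # dimensionality-reduction FC weights
W2 = torch.randn(C, C // r)          # dimensionality-increasing FC weights

# s = sigma(W2 * delta(W1 * z)): per-channel weights in (0, 1)
s = torch.sigmoid(torch.relu(z @ W1.t()) @ W2.t())

U = torch.randn(1, C, 56, 56)
X_tilde = U * s.view(1, C, 1, 1)     # channel-wise rescaling of U
print(X_tilde.shape)                 # torch.Size([1, 64, 56, 56])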

3.3 Instantiations

SE blocks can be integrated into standard architectures such as VGGNet [11] by inserting one after the non-linearity following each convolution. Moreover, the flexibility of the SE block means that it can be applied directly to transformations beyond standard convolutions. To illustrate this point, we develop SENets by incorporating SE blocks into several examples of more complex architectures, described next.

We first consider the construction of SE blocks for Inception networks [5]. Here, we simply take the transformation F_tr to be an entire Inception module (see Figure 2), and by making this change for each such module in the architecture, we obtain an SE-Inception network. SE blocks can also be used directly with residual networks (Figure 3 depicts the schema of an SE-ResNet module). Here, the SE block transformation F_tr is taken to be the non-identity branch of a residual module; both squeeze and excitation act before summation with the identity branch. Further variants that integrate SE blocks with ResNeXt [19], Inception-ResNet [21], MobileNet [64] and ShuffleNet [65] can be constructed by following similar schemes. For concrete examples of SENet architectures, a detailed description of SE-ResNet-50 and SE-ResNeXt-50 is given in Table 1.
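As a sketch of the SE-ResNet schema described above (a hypothetical minimal block, not the exact module used in the paper), the SE gate is applied to the output of the non-identity branch before summation with the identity:

import torch
import torch.nn as nn

class SEBasicBlock(nn.Module):
    """A ResNet-style basic block with an SE gate on the residual branch (sketch)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)
        # Squeeze-and-excitation gate: pool -> FC bottleneck -> sigmoid channel weights
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        w = self.se(out).view(out.size(0), -1, 1, 1)
        out = out * w                       # recalibration before the residual addition
        return self.relu(out + identity)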

One consequence of the flexibility of the SE block is that there are several viable ways in which it can be integrated into these architectures. Therefore, to assess sensitivity to the integration strategy used to incorporate SE blocks into a network architecture, we also provide ablation experiments exploring different designs for block inclusion in Section 6.5.

For the proposed SE block design to be of practical use, it must offer a good trade-off between improved performance and increased model complexity. To illustrate the computational burden associated with the module, we consider a comparison between ResNet-50 and SE-ResNet-50 as an example. ResNet-50 requires approximately 3.86 GFLOPs in a single forward pass for a 224×224 pixel input image; each SE block adds a global average pooling operation in the squeeze phase, two small FC layers in the excitation phase, and an inexpensive channel-wise scaling operation. In aggregate, when the reduction ratio r (see Section 3.2) is set to 16, SE-ResNet-50 requires approximately 3.87 GFLOPs, a 0.26% relative increase over the original ResNet-50. In exchange for this slight additional computational burden, the accuracy of SE-ResNet-50 surpasses that of ResNet-50 and indeed approaches that of a deeper ResNet-101 network, which requires approximately 7.58 GFLOPs (Table 2).

In practice, with a training mini-batch of 256 images, a single pass forwards and backwards through ResNet-50 takes 190 ms, compared to 209 ms for SE-ResNet-50 (both timings were performed on a server with 8 NVIDIA Titan X GPUs). We suggest that this represents a reasonable runtime overhead, which may be further reduced as global pooling and small inner-product operations receive further optimisation in popular GPU libraries. Due to its importance for embedded device applications, we further benchmark CPU inference time for each model: for a 224×224 pixel input image, ResNet-50 takes 164 ms, compared to 167 ms for SE-ResNet-50. We believe that the small additional computational cost incurred by the SE block is justified by its contribution to model performance.

We next consider the additional parameters introduced by the proposed SE block. These additional parameters result solely from the two FC layers of the gating mechanism and therefore constitute a small fraction of the total network capacity. Concretely, the total number of weight parameters introduced by these FC layers is given by:

$$\frac{2}{r} \sum_{s=1}^{S} N_s \cdot C_s^2,$$

where r denotes the reduction ratio, S refers to the number of stages (a stage refers to the collection of blocks operating on feature maps of a common spatial dimension), C_s denotes the dimension of the output channels and N_s denotes the number of repeated blocks for stage s (when bias terms are used in the FC layers, the parameters and computational cost they introduce are typically negligible). SE-ResNet-50 introduces ~2.5 million additional parameters beyond the ~25 million required by ResNet-50, corresponding to a ~10% increase. In practice, the majority of these parameters come from the final stage of the network, where the excitation operation is performed across the largest number of channels. However, we found that this comparatively expensive final stage of SE blocks can be removed at only a small cost in performance (less than 0.1% top-5 error on ImageNet), reducing the relative parameter increase to ~4%, which may prove useful in cases where parameter usage is a key consideration (see Sections 6.4 and 7.2 for further discussion).
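As a worked check of the formula above for SE-ResNet-50 (using the standard ResNet-50 stage configuration and r = 16), a few lines of Python reproduce the ~2.5 million figure:

# Additional SE parameters for SE-ResNet-50; (C_s, N_s) pairs follow the standard
# ResNet-50 design of four stages with 3, 4, 6 and 3 blocks.
r = 16
stages = [(256, 3), (512, 4), (1024, 6), (2048, 3)]

extra_params = (2 / r) * sum(n * c ** 2 for c, n in stages)
print(f"{extra_params / 1e6:.2f}M additional parameters")   # ~2.51M, i.e. ~10% of ResNet-50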

The strongest reported model, SENet-154, is built on a modified ResNeXt that extends the original ResNeXt-101 [19] by adopting the block stacking strategy of ResNet-152 [13]. Further differences in the design and training of this model (beyond the use of SE blocks) are as follows: (a) the number of channels of the first 1×1 convolution in each bottleneck building block was halved, reducing the computational cost of the model with minimal loss in performance; (b) the first 7×7 convolutional layer was replaced with three consecutive 3×3 convolutional layers; (c) the 1×1 stride-2 down-sampling projection was replaced with a 3×3 stride-2 convolution to preserve information; (d) a dropout layer (with a dropout ratio of 0.2) was inserted before the classifier layer to reduce overfitting; (e) label-smoothing regularisation (as introduced in [20]) was used during training; (f) the parameters of all BN layers were frozen for the last few training epochs to ensure consistency between training and testing; (g) training was performed with 8 servers (64 GPUs) in parallel to enable large batch sizes (2048), with the initial learning rate set to 1.0. A minimal sketch of point (f) is given below.
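The sketch below is an assumption about how one might freeze BN layers in PyTorch for the final training epochs; it is not the authors' released training code.

import torch.nn as nn

def freeze_batchnorm(model: nn.Module):
    """Freeze all BatchNorm layers so train-time and test-time statistics agree."""
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.eval()                       # stop updating running mean / variance
            for p in m.parameters():
                p.requires_grad = False    # stop updating affine weight and bias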

# Reference PyTorch implementation of the SE layer: squeeze (global average pooling),
# excitation (FC bottleneck with a sigmoid gate), and channel-wise rescaling.
import torch.nn as nn

class SELayer(nn.Module):
    def __init__(self, channel, reduction=16):
        super(SELayer, self).__init__()
        # Squeeze: collapse each H x W feature map to a single channel statistic
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        # Excitation: FC bottleneck with reduction ratio r, then a sigmoid gate
        self.fc = nn.Sequential(
            nn.Linear(channel, channel // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channel // reduction, channel, bias=False),
            nn.Sigmoid()
        )

    def forward(self, x):
        b, c, h, w = x.size()
        y = self.avgpool(x).view(b, c)       # squeeze: (b, c, h, w) -> (b, c)
        y = self.fc(y).view(b, c, 1, 1)      # excitation: per-channel weights
        return x * y.expand_as(x)            # scale: recalibrate the input channels
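A short usage example (batch size and channel count chosen arbitrarily):

import torch

se = SELayer(channel=64, reduction=16)
x = torch.randn(8, 64, 56, 56)       # a batch of 64-channel feature maps
y = se(x)                            # same shape; each channel rescaled by its weight
print(y.shape)                       # torch.Size([8, 64, 56, 56])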


Copyright notice: this article was written by [TJMtaotao]. Please include a link to the original when reposting.
https://yzsam.com/2022/204/202207231255173700.html