Squeeze-and-Excitation Networks learning notes
2022-06-09 18:22:00 【Fried dough twist】
Squeeze-and-Excitation Networks Learning notes
Squeeze-and-Excitation Networks
Abstract
Convolutional neural networks (CNNs) are built upon the convolution operator, which enables the network to construct informative features by fusing spatial and channel-wise information within each local receptive field. A large body of prior work has investigated the spatial component of this relationship, seeking to strengthen the representational power of a CNN by improving the quality of spatial encodings throughout its feature hierarchy. In this work, we focus instead on the channel relationship and propose a novel architectural unit, which we term the "Squeeze-and-Excitation" (SE) block, that explicitly models interdependencies between channels and adaptively recalibrates channel-wise feature responses. We show that these blocks can be stacked together to form SENet architectures that generalise effectively across different datasets. We further demonstrate that SE blocks bring significant improvements in the performance of existing state-of-the-art CNNs at a slight additional computational cost. Squeeze-and-Excitation networks formed the foundation of our ILSVRC 2017 classification submission, which won first place and reduced the top-5 error to 2.251%, a relative improvement of ~25% over the winning entry of 2016. Models and code are available at https://github.com/hujie-frank/SENet.
**Keywords:** Squeeze-and-Excitation, Image representations, Attention, Convolutional Neural Networks.
1 INTRODUCTION
Convolutional neural networks (CNNs) have proven to be useful models for a wide range of visual tasks [1], [2], [3], [4]. At each convolutional layer in the network, a set of filters expresses neighbourhood spatial connectivity patterns along the input channels, fusing spatial and channel-wise information within local receptive fields. By interleaving a series of convolutional layers with non-linear activation functions and downsampling operators, CNNs are able to produce image representations that capture hierarchical patterns and attain global theoretical receptive fields. A central theme of computer vision research is the search for more powerful representations that capture only those properties of an image that are most salient for a given task, enabling improved performance. As a widely used family of models for visual tasks, the development of new neural network architecture designs now represents a key frontier of this research. Recent work has shown that the representations produced by CNNs can be strengthened by integrating learning mechanisms into the network that help capture spatial correlations between features. One such approach, popularised by the Inception family of architectures [5], [6], incorporates multi-scale processing into network modules to achieve improved performance. Further work has sought to better model spatial dependencies [7], [8] and to incorporate spatial attention into the structure of the network [9].
In this paper, we investigate a different aspect of network design: the relationship between channels. We introduce a new architectural unit, which we term the Squeeze-and-Excitation (SE) block, whose goal is to improve the quality of the representations produced by a network by explicitly modelling the interdependencies between the channels of its convolutional features. To this end, we propose a mechanism that allows the network to perform feature recalibration, through which it can learn to use global information to selectively emphasise informative features and suppress less useful ones.
The structure of the SE building block is shown in Fig. 1. For any given transformation $F_{tr}$ mapping an input $X$ to feature maps $U \in \mathbb{R}^{H \times W \times C}$, e.g. a convolution, we can construct a corresponding SE block to perform feature recalibration. The features $U$ are first passed through a squeeze operation, which aggregates the feature maps across their spatial dimensions (H×W) to produce a channel descriptor. The function of this descriptor is to embed the global distribution of channel-wise feature responses, allowing information from the global receptive field of the network to be used by all of its layers. The aggregation is followed by an excitation operation, which takes the form of a simple self-gating mechanism that takes the embedding as input and produces a collection of per-channel modulation weights. These weights are applied to the feature maps $U$ to generate the output of the SE block, which can be fed directly into subsequent layers of the network.
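As a concrete illustration of this flow, here is a minimal functional sketch in PyTorch, assuming an (N, C, H, W) tensor layout; the function name and the explicit weight arguments are illustrative assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def se_block(u: torch.Tensor, w1: torch.Tensor, w2: torch.Tensor) -> torch.Tensor:
    """Minimal squeeze-excite-rescale sketch for feature maps u of shape (N, C, H, W).

    w1 has shape (C // r, C) and w2 has shape (C, C // r).
    """
    n, c, _, _ = u.shape
    z = u.mean(dim=(2, 3))                                     # squeeze: (N, C) channel descriptor
    s = torch.sigmoid(F.linear(F.relu(F.linear(z, w1)), w2))   # excitation: (N, C) gates in (0, 1)
    return u * s.view(n, c, 1, 1)                              # rescale U channel-wise
```

For example, `se_block(torch.randn(2, 64, 32, 32), torch.randn(4, 64), torch.randn(64, 4))` returns a tensor with the same shape as its input, with every channel multiplied by a learned gate.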

An SE network (SENet) can be constructed simply by stacking a collection of SE blocks. Moreover, these SE blocks can also be used as a drop-in replacement for the original block at any depth in the network architecture.
Although the template for the building block is generic, the role it performs differs at different depths throughout the network. In earlier layers, it excites informative features in a class-agnostic manner, strengthening the shared low-level representations. In later layers, the SE blocks become increasingly specialised and respond to different inputs in a highly class-specific manner (Section 7.2). As a consequence, the benefits of the feature recalibration performed by SE blocks can accumulate through the network.
Designing and developing new CNN architectures is a difficult engineering task, typically requiring the selection of many new hyperparameters and layer configurations. In contrast, the structure of the SE block is simple and can be used directly in existing state-of-the-art architectures by replacing components with their SE counterparts, effectively improving performance. SE blocks are also computationally lightweight and impose only a slight increase in model complexity and computational burden.
To provide evidence for these claims, we developed several SENets and performed an extensive evaluation on the ImageNet dataset [10]. We also present results beyond ImageNet, showing that the benefits of our approach are not restricted to a specific dataset or task. Using SENets, we ranked first in the ILSVRC 2017 classification competition. Our best model ensemble achieved a 2.251% top-5 error on the test set. This represents a relative improvement of roughly 25% compared with the winner of the previous year (whose top-5 error was 2.991%).
2 RELATED WORK
Deeper architectures.
VGGNets [11] and the Inception models [5] showed that increasing the depth of a network can significantly improve the quality of the representations it is able to learn. By regulating the distribution of the inputs to each layer, Batch Normalization (BN) [6] added stability to the learning process in deep networks and produced smoother optimisation surfaces [12]. Building on these works, ResNets demonstrated that it is possible to learn considerably deeper and stronger networks through the use of identity-based skip connections [13], [14]. Highway networks [15] introduced a gating mechanism to regulate the flow of information along shortcut connections. Following these works, there have been further reformulations of the connections between network layers [16], [17], which show promising improvements to the learning and representational properties of deep networks.
An alternative, but closely related, line of research has focused on methods to improve the functional form of the computational elements within a network. Grouped convolutions have proven to be a popular approach for increasing the cardinality of learned transformations [18], [19]. More flexible compositions of operators can be achieved with multi-branch convolutions [5], [6], [20], [21], which can be viewed as a natural extension of the grouping operator. In prior work, cross-channel correlations are typically mapped as new combinations of features, either independently of spatial structure [22], [23] or jointly by using standard convolutional filters [24] with 1×1 convolutions. Much of this research has concentrated on the objective of reducing model and computational complexity, reflecting an assumption that channel relationships can be formulated as a composition of instance-agnostic functions with local receptive fields. In contrast, we claim that providing the unit with a mechanism to explicitly model dynamic, non-linear dependencies between channels using global information can ease the learning process and significantly enhance the representational power of the network.
Algorithmic Architecture Search.
In addition to the works described above, there is also a rich history of research that aims to forgo manual architecture design and instead seeks to learn the structure of the network automatically. Much of the early work in this domain was conducted in the neuro-evolution community, which established methods for searching across network topologies with evolutionary methods [25], [26]. While often computationally demanding, evolutionary search has had notable successes, which include finding good memory cells for sequence models [27], [28] and learning sophisticated architectures for large-scale image classification [29], [30], [31]. With the goal of reducing the computational burden of these methods, efficient alternatives have been proposed based on Lamarckian inheritance [32] and differentiable architecture search [33].
By formulating architecture search as hyperparameter optimisation, random search [34] and other more sophisticated model-based optimisation techniques [35], [36] can also be used to tackle the problem. Topology selection as a path through a fabric of possible designs [37] and direct architecture prediction [38], [39] have been proposed as additional viable architecture search tools. Particularly strong results have been achieved with techniques from reinforcement learning [40], [41], [42], [43], [44]. SE blocks can be used as atomic building blocks for these search algorithms, and have been demonstrated to be highly effective in this capacity in concurrent work [45].
Attention and gating mechanisms.
Attention can be interpreted as a means of biasing the allocation of available computational resources towards the most informative components of a signal [46], [47], [48], [49], [50], [51]. Attention mechanisms have demonstrated their utility across many tasks, including sequence learning [52], [53], localisation and understanding in images [9], [54], image captioning [55], [56] and lip reading [57]. In these applications, attention can be incorporated as an operator following one or more layers representing higher-level abstractions for adaptation between modalities. Some works provide interesting studies into the combined use of spatial and channel attention [58], [59]. Wang et al. [58] introduced a powerful trunk-and-mask attention mechanism based on hourglass modules [8], which is inserted between the intermediate stages of deep residual networks. In contrast, our proposed SE block comprises a lightweight gating mechanism that focuses on enhancing the representational power of the network by modelling channel-wise relationships in a computationally efficient manner.
3 SQUEEZE-AND-EXCITATION BLOCKS
The Squeeze-and-Excitation block is a computational unit that can be built upon a transformation $F_{tr}$ mapping an input $X \in \mathbb{R}^{H' \times W' \times C'}$ to feature maps $U \in \mathbb{R}^{H \times W \times C}$. In the notation that follows, we take $F_{tr}$ to be a convolutional operator and use $V = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_C]$ to denote the learned set of filter kernels, where $\mathbf{v}_c$ refers to the parameters of the $c$-th filter. We can then write the outputs as $U = [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_C]$, where

$$\mathbf{u}_c = \mathbf{v}_c * \mathbf{X} = \sum_{s=1}^{C'} \mathbf{v}_c^{s} * \mathbf{x}^{s}.$$

Here $*$ denotes convolution, $\mathbf{v}_c = [\mathbf{v}_c^{1}, \mathbf{v}_c^{2}, \ldots, \mathbf{v}_c^{C'}]$, $\mathbf{X} = [\mathbf{x}^{1}, \mathbf{x}^{2}, \ldots, \mathbf{x}^{C'}]$ and $\mathbf{u}_c \in \mathbb{R}^{H \times W}$. $\mathbf{v}_c^{s}$ is a 2D spatial kernel representing a single channel of $\mathbf{v}_c$ that acts on the corresponding channel of $\mathbf{X}$. To simplify the notation, bias terms are omitted. Since the output is produced by a summation through all channels, channel dependencies are implicitly embedded in $\mathbf{v}_c$, but are entangled with the local spatial correlation captured by the filters. The channel relationships modelled by convolution are inherently implicit and local (except the ones at the top-most layers). We expect the learning of convolutional features to be enhanced by explicitly modelling channel interdependencies, so that the network is able to increase its sensitivity to informative features which can be exploited by subsequent transformations. Consequently, we would like to provide it with access to global information and recalibrate filter responses in two steps, squeeze and excitation, before they are fed into the next transformation. A diagram illustrating the structure of an SE block is shown in Fig. 1.
3.1 Squeeze: Global Information Embedding
In order to tackle the issue of exploiting channel dependencies, we first consider the signal to each channel in the output features. Each of the learned filters operates with a local receptive field, and consequently each unit of the transformation output U is unable to exploit contextual information outside of this region.
To mitigate this problem, we propose to squeeze global spatial information into a channel descriptor. This is achieved by using global average pooling to generate channel-wise statistics. Formally, a statistic $\mathbf{z} \in \mathbb{R}^{C}$ is generated by shrinking $U$ through its spatial dimensions $H \times W$, such that the $c$-th element of $\mathbf{z}$ is calculated by:

$$z_c = F_{sq}(\mathbf{u}_c) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} u_c(i, j).$$

Discussion.
The output of the transformation U can be interpreted as a collection of local descriptors whose statistics are expressive for the whole image. Exploiting such information is prevalent in prior feature engineering work [60], [61], [62]. We opt for the simplest aggregation technique, global average pooling, noting that more sophisticated strategies could be employed here as well.
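In most tensor frameworks the squeeze step therefore reduces to a single spatial mean. A minimal sketch in PyTorch, assuming an (N, C, H, W) layout, confirming that the pooling module and the explicit average in the formula above agree:

```python
import torch

u = torch.randn(2, 4, 7, 7)                              # example feature maps U: (N, C, H, W)
z_mean = u.mean(dim=(2, 3))                              # global average pooling over H x W
z_sum = u.sum(dim=(2, 3)) / (u.shape[2] * u.shape[3])    # the explicit double sum, normalised
z_pool = torch.nn.AdaptiveAvgPool2d(1)(u).flatten(1)     # equivalent pooling-module form
assert torch.allclose(z_mean, z_sum) and torch.allclose(z_mean, z_pool)
```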
3.2 Excitation: Adaptive Recalibration
To make use of the information aggregated in the squeeze operation, we follow it with a second operation that aims to fully capture channel-wise dependencies. To fulfil this objective, the function must meet two criteria: first, it must be flexible (in particular, it must be capable of learning a non-linear interaction between channels) and second, it must learn a non-mutually-exclusive relationship, since we would like to ensure that multiple channels are allowed to be emphasised (rather than enforcing a one-hot activation). To meet these criteria, we opt to employ a simple gating mechanism with a sigmoid activation:

$$\mathbf{s} = F_{ex}(\mathbf{z}, \mathbf{W}) = \sigma\big(g(\mathbf{z}, \mathbf{W})\big) = \sigma\big(\mathbf{W}_2\, \delta(\mathbf{W}_1 \mathbf{z})\big),$$

where $\delta$ refers to the ReLU [63] function, $\mathbf{W}_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $\mathbf{W}_2 \in \mathbb{R}^{C \times \frac{C}{r}}$. To limit model complexity and aid generalisation, we parameterise the gating mechanism by forming a bottleneck with two fully-connected (FC) layers around the non-linearity, i.e. a dimensionality-reduction layer with reduction ratio $r$ (this parameter choice is discussed in Section 6.1), a ReLU and then a dimensionality-increasing layer returning to the channel dimension of the transformation output $U$. The final output of the block is obtained by rescaling $U$ with the activations $\mathbf{s}$:

$$\widetilde{\mathbf{x}}_c = F_{scale}(\mathbf{u}_c, s_c) = s_c\, \mathbf{u}_c,$$

where $\widetilde{\mathbf{X}} = [\widetilde{\mathbf{x}}_1, \widetilde{\mathbf{x}}_2, \ldots, \widetilde{\mathbf{x}}_C]$ and $F_{scale}(\mathbf{u}_c, s_c)$ refers to channel-wise multiplication between the scalar $s_c$ and the feature map $\mathbf{u}_c \in \mathbb{R}^{H \times W}$.
Discussion.
The excitation operator maps the input-specific descriptor $\mathbf{z}$ to a set of channel weights. In this regard, SE blocks intrinsically introduce dynamics conditioned on the input, which can be regarded as a self-attention function on channels whose relationships are not confined to the local receptive field that the convolutional filters respond to.
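Combining the squeeze, the bottlenecked gating and the rescaling described above, a module-level sketch in PyTorch might look as follows. The class name SELayer and the default reduction ratio of 16 follow common re-implementations and are assumptions here, not the authors' released code.

```python
import torch
import torch.nn as nn

class SELayer(nn.Module):
    """Squeeze-and-Excitation layer for feature maps of shape (N, C, H, W)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # squeeze: global average pooling
        self.fc = nn.Sequential(                       # excitation: bottleneck of ratio r
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = u.shape
        z = self.pool(u).view(n, c)                    # (N, C) channel descriptor
        s = self.fc(z).view(n, c, 1, 1)                # (N, C, 1, 1) per-channel gates
        return u * s                                   # rescale the feature maps
```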
3.3 Instantiations
The SE block can be integrated into standard architectures such as VGGNet by inserting it after the non-linearity following each convolution. Moreover, the flexibility of the SE block means that it can be directly applied to transformations beyond standard convolutions. To illustrate this point, we develop SENets by incorporating SE blocks into several examples of more complex architectures, described next.
We first consider the construction of SE blocks for Inception networks [5]. Here, we simply take the transformation $F_{tr}$ to be an entire Inception module (see Fig. 2), and by making this change for each such module in the architecture, we obtain an SE-Inception network. SE blocks can also be used directly with residual networks (Fig. 3 depicts the schema of an SE-ResNet module). Here, the SE block transformation $F_{tr}$ is taken to be the non-identity branch of a residual module, and both squeeze and excitation act before summation with the identity branch. Further variants that integrate SE blocks with ResNeXt [19], Inception-ResNet [21], MobileNet [64] and ShuffleNet [65] can be constructed in a similar way. For concrete examples of SENet architectures, a detailed description of SE-ResNet-50 and SE-ResNeXt-50 is given in Table 1.
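To make the SE-ResNet schema of Fig. 3 concrete, the sketch below (reusing the SELayer sketch above) applies squeeze and excitation to the non-identity branch of a simplified residual block before the summation with the identity branch; the single pair of 3×3 convolutions is a simplification for illustration rather than the full bottleneck design of Table 1.

```python
import torch
import torch.nn as nn

class SEBasicResidualBlock(nn.Module):
    """Simplified residual block with an SE layer on the non-identity branch."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.se = SELayer(channels, reduction)   # recalibrate the residual branch
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.se(self.conv(x))              # squeeze and excitation before the summation
        return self.relu(out + x)                # add the identity branch, then non-linearity
```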



One consequence of the flexibility of the SE block is that there are several viable ways in which it could be integrated into these architectures. Therefore, to assess the sensitivity to the integration strategy used to incorporate SE blocks into a network architecture, we also provide ablation experiments in Section 6.5 that explore different designs for block inclusion.
4 MODEL AND COMPUTATIONAL COMPLEXITY
For the proposed SE block design to be of practical use, it must offer a good trade-off between improved performance and increased model complexity. To illustrate the computational burden associated with the module, we consider a comparison between ResNet-50 and SE-ResNet-50. ResNet-50 requires ~3.86 GFLOPs in a single forward pass for a 224×224 pixel input image. Each SE block makes use of a global average pooling operation in the squeeze phase and two small FC layers in the excitation phase, followed by an inexpensive channel-wise scaling operation. In aggregate, when setting the reduction ratio r (introduced in Section 3.2) to 16, SE-ResNet-50 requires ~3.87 GFLOPs, corresponding to a 0.26% relative increase over the original ResNet-50. In exchange for this slight additional computational burden, the accuracy of SE-ResNet-50 surpasses that of ResNet-50 and in fact approaches that of a deeper network requiring ~7.58 GFLOPs (Table 2).

In practice, a single pass forwards and backwards through ResNet-50 takes 190 ms, compared to 209 ms for SE-ResNet-50 with a training minibatch of 256 images (both timings performed on a server with 8 NVIDIA Titan X GPUs). We suggest that this represents a reasonable runtime overhead, which may be further reduced as global pooling and small inner-product operations receive further optimisation in popular GPU libraries. Due to its importance for embedded device applications, we further benchmark CPU inference time for each model: for a 224×224 pixel input image, ResNet-50 takes 164 ms, compared to 167 ms for SE-ResNet-50. We believe that the small additional computational cost incurred by the SE block is justified by its contribution to model performance.
Next, we consider the additional parameters introduced by the proposed SE block. These additional parameters result solely from the two FC layers of the gating mechanism and therefore constitute a small fraction of the total network capacity. Concretely, the total number of weight parameters introduced by these FC layers is given by:

$$\frac{2}{r} \sum_{s=1}^{S} N_s \cdot C_s^{2},$$

where r denotes the reduction ratio, S refers to the number of stages (a stage being the collection of blocks operating on feature maps of a common spatial dimension), $C_s$ denotes the dimension of the output channels and $N_s$ denotes the number of repeated blocks for stage s (when bias terms are used in the FC layers, the introduced parameters and computational cost are typically negligible). SE-ResNet-50 introduces ~2.5 million additional parameters beyond the ~25 million parameters required by ResNet-50, corresponding to a ~10% increase. In practice, the majority of these parameters come from the final stage of the network, where the excitation operation is performed over the greatest number of channels. However, we found that this comparatively costly final stage of SE blocks could be removed at only a small cost in performance (less than 0.1% top-5 error on ImageNet), reducing the relative parameter increase to ~4%, which may prove useful in cases where parameter usage is a key consideration (see Section 6.4 and Section 7.2 for further discussion).
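As a sanity check of this formula, the short computation below plugs in the standard ResNet-50 stage configuration (3, 4, 6 and 3 blocks with output channel widths 256, 512, 1024 and 2048, an assumption stated here for illustration) and r = 16, and reproduces the figure of roughly 2.5 million additional parameters quoted above.

```python
# Additional FC parameters introduced by SE blocks: (2 / r) * sum_s N_s * C_s^2
r = 16
stages = [(3, 256), (4, 512), (6, 1024), (3, 2048)]   # (N_s, C_s) for ResNet-50

extra_params = (2 / r) * sum(n_s * c_s ** 2 for n_s, c_s in stages)
print(f"{extra_params:,.0f}")   # prints 2,514,944, i.e. ~2.5 million additional parameters
```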
5 EXPERIMENTS
In this section, we conduct experiments to investigate the effectiveness of SE blocks across a range of tasks, datasets and model architectures.
(The remainder of this section is omitted in these notes.)
6 ABLATION STUDY


7 ROLE OF SE BLOCKS
Although the proposed SE block has been shown to improve network performance on multiple visual tasks, we would also like to understand the relative importance of the squeeze operation and how the excitation mechanism operates in practice. A rigorous theoretical analysis of the representations learned by deep neural networks remains challenging, so we instead take an empirical approach to examining the role played by the SE block, with the goal of attaining at least a primitive understanding of its practical function.
7.1 Effect of Squeeze
To assess whether the global embedding produced by the squeeze operation plays an important role in performance, we experiment with a variant of the SE block that adds an equal number of parameters but does not perform global average pooling. Specifically, we remove the pooling operation and replace the two FC layers with corresponding 1×1 convolutions of identical channel dimensions in the excitation operator (i.e. NoSqueeze), where the excitation output maintains the spatial dimensions of the input. In contrast to the SE block, these pointwise convolutions can only remap the channels as a function of the output of a local operator. While in practice the later layers of a deep network often possess a (theoretical) global receptive field, global embeddings are no longer directly accessible throughout the network in the NoSqueeze variant. The accuracy and computational complexity of both variants are compared with a standard ResNet-50 model in Table 16. We observe that the use of global information has a significant influence on model performance, underlining the importance of the squeeze operation. Moreover, in comparison to the NoSqueeze design, the SE block allows this global information to be used in a computationally parsimonious manner.
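For illustration, a sketch of such a NoSqueeze variant in the same PyTorch style as the SELayer sketch above: the global pooling is removed and the two FC layers are replaced by 1×1 convolutions of matching channel dimensions, so the gates retain the spatial dimensions of the input and depend only on local responses. The class name is an assumption for illustration.

```python
import torch
import torch.nn as nn

class NoSqueezeLayer(nn.Module):
    """SE variant without global pooling: per-position gating via 1x1 convolutions."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        s = self.gate(u)          # (N, C, H, W): gates keep the spatial dimensions
        return u * s              # recalibration now depends only on local responses
```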

7.2 Role of Excitation
To provide a clearer picture of the function of the excitation operator in SE blocks, in this section we study example activations from the SE-ResNet-50 model and examine their distribution with respect to different classes and different input images at various depths in the network. In particular, we would like to understand how excitations vary across images of different classes, and across images within a class.
We first consider the distribution of excitations for different classes. Specifically, we sample four classes from the ImageNet dataset that exhibit semantic and appearance diversity, namely goldfish, pug, plane and cliff (example images from these classes are shown in the appendix). We then draw fifty samples for each class from the validation set, compute the average activations for fifty uniformly sampled channels in the last SE block of each stage (immediately prior to downsampling) and plot their distribution in Fig. 6. For reference, we also plot the distribution of the mean activations across all 1000 classes.
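A hedged sketch of how such per-class excitation statistics could be collected with forward hooks is shown below; the dictionary of SE gate sub-modules to monitor, and their names, are assumptions that depend on how the SE layers are exposed in a particular implementation, and the per-class sampling of fifty images is left to the data loader.

```python
import torch

def collect_gate_activations(model, loader, se_modules, device="cpu"):
    """Average the sigmoid gate activations of the given SE modules over a data loader.

    se_modules: dict mapping a name (e.g. "SE_5_2") to the sub-module whose
    output is the (N, C) or (N, C, 1, 1) tensor of channel gates.
    """
    sums, counts, handles = {}, {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            gates = output.detach().flatten(1)               # (N, C) per-image gate vectors
            sums[name] = sums.get(name, 0) + gates.sum(dim=0)
            counts[name] = counts.get(name, 0) + gates.shape[0]
        return hook

    for name, module in se_modules.items():
        handles.append(module.register_forward_hook(make_hook(name)))

    model.eval()
    with torch.no_grad():
        for images, _ in loader:                             # e.g. fifty images of one class
            model(images.to(device))

    for h in handles:
        h.remove()
    return {name: sums[name] / counts[name] for name in sums}
```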


We make the following three observations about the role of the excitation operation. First, the distribution across different classes is very similar at the earlier layers of the network, e.g. SE_2_3. This suggests that the importance of feature channels is likely to be shared by different classes in the early stages of the network. The second observation is that at greater depth, the value of each channel becomes much more class-specific, as different classes exhibit different preferences for the discriminative value of features, e.g. SE_4_6 and SE_5_1. These observations are consistent with findings in previous work [81], [82], namely that earlier layer features are typically more general (e.g. class-agnostic in the context of a classification task) while later layer features exhibit greater levels of specificity [83].
Next, we observe a somewhat different phenomenon in the last stage of the network. SE_5_2 exhibits an interesting tendency towards a saturated state in which most of the activations are close to one. At the point at which all activations take the value one, an SE block reduces to the identity operator. At the end of the network in SE_5_3 (which is immediately followed by global pooling prior to the classifier), a similar pattern emerges over different classes, up to a modest change in scale (which could be tuned by the classifier). This suggests that SE_5_2 and SE_5_3 are less important than previous blocks in providing recalibration to the network. This finding is consistent with the empirical investigation in Section 4, which demonstrated that the additional parameter count could be significantly reduced by removing the SE blocks of the last stage at only a marginal loss of performance.
Finally, in Fig. 7 we show the mean and standard deviations of the activations for image instances within the same class for two sample classes (goldfish and plane). We observe a trend consistent with the inter-class visualisation, indicating that the dynamic behaviour of SE blocks varies over both classes and instances within a class. Particularly in the later layers of the network, where there is considerable diversity of representation within a single class, the network learns to take advantage of feature recalibration to improve its discriminative performance [84]. In summary, SE blocks produce instance-specific responses which nevertheless function to support the increasingly class-specific needs of the model at different layers in the architecture.

8 CONCLUSION
In this paper we proposed the SE block, an architectural unit designed to improve the representational power of a network by enabling it to perform dynamic channel-wise feature recalibration. A wide range of experiments show the effectiveness of SENets, which achieve state-of-the-art performance across multiple datasets and tasks. In addition, SE blocks shed some light on the inability of previous architectures to adequately model channel-wise feature dependencies. We hope this insight may prove useful for other tasks that require strong discriminative features. Finally, the feature importance values produced by SE blocks may be of use for other tasks, such as network pruning for model compression.