In-depth Understanding of the SE Module of the Attention Mechanism in CV
2022-07-08 02:18:00 【Strawberry sauce toast】
Summary of Attention Mechanisms in CV (1): The SE Module
Squeeze-and-Excitation Networks
Paper link: Squeeze-and-Excitation Networks
1. Abstract
In this work, we focus instead on the channel relationship and propose a novel architectural unit, which we term the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels.
The SE module is a channel attention mechanism: it adaptively learns the dependencies between different channels.
2. Detailed Understanding of the SE Module
The SE module diagram given in the original paper is as follows:
Combining the contents of Section 3 of the paper, the following two questions are examined in detail:
- How does the SE module learn the dependencies between different channels?
- How does the SE module use this channel information to guide the model to weight features differently?
2.1 Multiple input and multiple output channels
Part ① of Figure 1 describes a convolutional layer with multiple input and multiple output channels.
Multiple input channels: each channel of the input feature map corresponds to a two-dimensional convolution kernel, and the sum of the per-channel convolution results over all input channels is the final convolution result, as shown below (for simplicity, the bias is omitted):
In the formula, $C$ denotes the $C$-th output channel and $S$ denotes the $S$-th input channel.
Each input channel corresponds to a two-dimensional convolution kernel; therefore, the number of channels of the three-dimensional convolution kernel = the number of channels of the input feature map.
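A minimal sketch, assuming PyTorch and arbitrarily chosen shapes, that verifies this: a multi-input-channel convolution is exactly the sum of the per-channel two-dimensional convolutions.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes only: 1 output channel, 3 input channels, bias omitted.
x = torch.randn(1, 3, 8, 8)   # input feature map (N, S, H, W)
w = torch.randn(1, 3, 3, 3)   # one three-dimensional kernel of shape (C_out, S, kH, kW)

full = F.conv2d(x, w)         # regular multi-input-channel convolution

# Sum of the per-input-channel two-dimensional convolutions gives the same result.
per_channel = sum(F.conv2d(x[:, s:s + 1], w[:, s:s + 1]) for s in range(x.shape[1]))

print(torch.allclose(full, per_channel, atol=1e-5))  # True
```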
2.2 Multiple output channels

Each output channel corresponds to an independent three-dimensional convolution kernel; therefore, the number of channels of the output feature map = the number of three-dimensional convolution kernels. The number of output channels is usually a hyperparameter.
From the principle of multi-input- and multi-output-channel convolution, it is easy to see that in conventional convolution the correlation between different input channels is folded into each output channel only through simple summation, and that different output channels correspond to independent three-dimensional kernels. The correlation between input channels is therefore not exploited in any deliberate way.
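A quick check in PyTorch (the channel counts are chosen only for illustration) that a convolutional layer stores one independent three-dimensional kernel per output channel:

```python
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, bias=False)
print(conv.weight.shape)  # torch.Size([16, 3, 3, 3]): 16 kernels, each of shape (3, 3, 3)
```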
Therefore, the authors propose the SE module to explicitly exploit the information between different input channels.
2.3 Squeeze-and-Excitation Block

2.3.1 Squeeze: Global Information Embedding
The authors adopt global average pooling to obtain the information of each channel:
$$z_c=\mathbf{F}_{sq}(\mathbf{u}_c)=\frac{1}{H\times W}\sum_{i=1}^{H}\sum_{j=1}^{W}u_c(i,j)$$
Why do this? The original paper explains:
Each of the learned filters operates with a local receptive field and consequently each unit of the transformation output $U$ is unable to exploit contextual information outside of this region.
In an $H\times W$ feature map, each element corresponds only to a local region of the input feature map (its receptive field), so each element of the output feature map contains only local rather than global information.
To mitigate this problem, we propose to squeeze global spatial information into a channel descriptor. This is achieved by using global average pooling to generate channel-wise statistics.
The authors use global average pooling to fuse local information into a global, per-channel descriptor. Global average pooling is chosen because it is simple to implement; other more refined but more complex aggregation operations could also be used.
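A minimal sketch (shapes are arbitrary) showing that the squeeze step is simply a per-channel spatial mean, which `nn.AdaptiveAvgPool2d(1)` also computes:

```python
import torch
import torch.nn as nn

u = torch.randn(2, 64, 32, 32)                    # feature map U of shape (N, C, H, W)
z_pool = nn.AdaptiveAvgPool2d(1)(u).view(2, 64)   # global average pooling
z_mean = u.mean(dim=(2, 3))                       # the formula above: mean over H and W
print(torch.allclose(z_pool, z_mean))             # True
```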
2.3.2 Excitation: Adaptive Recalibration
The Excitation module aims to capture the dependencies between channels well, which imposes two requirements:
- it must be able to learn nonlinear relationships between channels;
- it must give every channel its own output, i.e. a soft weight vector rather than a one-hot vector.
Therefore, the authors use two fully connected layers to learn the nonlinear relationships, followed by a sigmoid activation function.
To reduce the number of parameters and the model complexity, the fully connected layers follow a "bottleneck" design, which introduces a hyperparameter $r$ (the reduction ratio); the paper sets $r=16$.
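As a rough count (ignoring biases, which the implementations below also omit): with $C=512$ channels and $r=16$, the two fully connected layers hold $C\cdot\frac{C}{r}+\frac{C}{r}\cdot C=\frac{2C^2}{r}=32768$ parameters, compared with $2C^2=524288$ if the hidden layer kept the full channel dimension.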
Why use the sigmoid function here?
Sigmoid is one of the common activation functions. The final output of the SE module is, in effect, the learned weight of each channel. First, the weights should not be driven to exactly 0, because a zero weight throws away all of that channel's information, so ReLU is unsuitable. Second, what is wanted here is a weight in $[0,1]$ for every channel rather than a vector that singles out one channel; the setting resembles "multi-label classification" more than "multi-class classification", so softmax is not appropriate either.
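A tiny numerical sketch (the values are picked arbitrarily) of the difference:

```python
import torch

z = torch.tensor([2.0, 2.0, -1.0, 0.5])
print(torch.sigmoid(z))         # each weight lies independently in (0, 1); several channels can stay strong at once
print(torch.softmax(z, dim=0))  # weights are forced to sum to 1, so channels compete and suppress each other
```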
In formula form, the Excitation module is:
$$\mathbf{s}=\mathbf{F}_{ex}(\mathbf{z},\mathbf{W})=\sigma(g(\mathbf{z},\mathbf{W}))=\sigma(\mathbf{W}_2\,\delta(\mathbf{W}_1\mathbf{z}))$$
In the formula, $\delta(\cdot)$ denotes the ReLU activation function and $\sigma(\cdot)$ denotes the sigmoid activation function.
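For reference, the dimensions in the paper are $\mathbf{z}\in\mathbb{R}^{C}$, $\mathbf{W}_1\in\mathbb{R}^{\frac{C}{r}\times C}$ and $\mathbf{W}_2\in\mathbb{R}^{C\times\frac{C}{r}}$, so the output $\mathbf{s}\in\mathbb{R}^{C}$ contains one weight in $(0,1)$ per channel.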

2.3.3 Weighting
Finally, the output of the SE module is applied to the output of the convolutional layer, yielding a feature map weighted by channel attention.
The obtained channel-wise vector weights every element of the feature map of its channel (formula (4) in the paper, which is simply the product of a scalar and a matrix).
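For reference, formula (4) of the paper is
$$\tilde{\mathbf{x}}_c=\mathbf{F}_{scale}(\mathbf{u}_c,s_c)=s_c\,\mathbf{u}_c$$
i.e. the scalar weight $s_c$ multiplies the whole $H\times W$ feature map of channel $c$.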
3. Using the SE Module

4. Implementing the SE Module in PyTorch
4.1 Implementing Excitation with Fully Connected Layers
```python
import torch.nn as nn

class SE(nn.Module):
    def __init__(self, channels, reduction=16):  # the paper uses r=16; if the feature map has few channels, reduce it appropriately
        super(SE, self).__init__()
        # Squeeze: global average pooling compresses each channel to a single value
        self.squeeze = nn.AdaptiveAvgPool2d((1, 1))
        # Excitation: a two-layer fully connected bottleneck ending in sigmoid
        self.excitation = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.size()
        y = self.squeeze(x).view(b, c)           # (b, c, 1, 1) -> (b, c)
        y = self.excitation(y).view(b, c, 1, 1)  # channel weights, reshaped for broadcasting
        return x * y.expand_as(x)                # scale: channel-wise re-weighting
```
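A quick sanity check (shapes chosen arbitrarily) that the module can be dropped into a network without changing tensor shapes:

```python
import torch

se = SE(channels=64, reduction=16)
x = torch.randn(2, 64, 32, 32)
print(se(x).shape)  # torch.Size([2, 64, 32, 32]): same shape, channels re-weighted
```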
4.2 Implementing Excitation with $1\times 1$ Convolutions
Using $1\times 1$ convolutions instead of fully connected layers avoids the reshaping between feature maps and vectors.
```python
import torch.nn as nn

class SE(nn.Module):
    def __init__(self, channels, reduction=2):
        super(SE, self).__init__()
        self.squeeze = nn.AdaptiveAvgPool2d((1, 1))
        # 1x1 convolutions act as fully connected layers on the (b, c, 1, 1) descriptor
        self.excitation = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1, stride=1, padding=0, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1, stride=1, padding=0, bias=False),
            nn.Sigmoid())

    def forward(self, x):
        y = self.squeeze(x)     # (b, c, h, w) -> (b, c, 1, 1)
        y = self.excitation(y)  # (b, c, 1, 1) channel weights
        return x * y            # broadcasting handles the spatial dimensions
```
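As a usage sketch (my own illustrative example, not code from the paper), the SE block is typically placed after the last convolution of a residual branch and before the identity addition, as in SE-ResNet. The block below reuses either `SE` class defined above and assumes equal input and output channel counts:

```python
import torch
import torch.nn as nn

class SEBasicBlock(nn.Module):
    """Minimal SE-ResNet-style residual block sketch; `SE` is either class defined above."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)
        self.se = SE(channels, reduction)

    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = self.se(out)                 # recalibrate channels before the residual addition
        return self.relu(out + identity)


block = SEBasicBlock(64)
print(block(torch.randn(2, 64, 32, 32)).shape)  # torch.Size([2, 64, 32, 32])
```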