
Deep Learning: Non-local Neural Networks

2022-06-25 05:24:00 HheeFish

Paper download

0. Summary

Capturing long-range dependencies is of central importance in deep neural networks. For sequential data (e.g., in speech and language), recurrent operations are the dominant solution for modeling long-range dependencies. For image data, long-range dependencies are modeled by the large receptive fields formed by deep stacks of convolutions. Both convolutional and recurrent operations process a local neighborhood, in either space or time; long-range dependencies can therefore be captured only by applying these operations repeatedly, propagating signals step by step through the data. Repeating local operations has several limitations:

  • It is computationally inefficient.
  • It causes optimization difficulties that need to be carefully addressed.
  • These challenges make multi-hop dependency modeling difficult, e.g., when messages need to be delivered back and forth between distant positions.

In this paper, we present non-local operations (Non-local Operation) as an efficient, simple, and generic component for capturing long-range dependencies with deep neural networks. Our non-local operation is a generalization of the classical non-local means operation (Non-local Means) in computer vision. Intuitively, a non-local operation computes the response at a position as a weighted sum of the features at all positions in the input feature maps. The set of positions can be in space, time, or spacetime, so our operation is applicable to image, sequence, and video problems.
Using non-local operations has several advantages:

  • In contrast to the progressive behavior of recurrent and convolutional operations, non-local operations capture long-range dependencies directly by computing interactions between any two positions, regardless of their positional distance;
  • As we show in experiments, non-local operations are efficient and achieve their best results even with only a few layers (e.g., 5 layers);
  • Finally, our non-local operations maintain variable input sizes and can be easily combined with other operations (e.g., the convolutions we will use).
    We demonstrate the effectiveness of non-local operations in the application of video classification. In videos, long-range interactions occur between distant pixels in space as well as time. A single non-local block, which is our basic unit, can directly capture these spacetime dependencies in a feedforward fashion. With a few non-local blocks, our architectures, called non-local neural networks, are more accurate for video classification than 2D and 3D convolutional networks (including their inflated variants). Moreover, non-local neural networks are more computationally economical than their 3D convolutional counterparts. Comprehensive ablation studies are presented on the Kinetics and Charades datasets. Using RGB only and without any bells and whistles (e.g., optical flow, multi-scale testing), our method achieves results on par with or better than the latest competition winners on both datasets.
    To demonstrate the generality of non-local operations, we further present object detection/segmentation and pose estimation experiments on the COCO dataset. On top of the strong Mask R-CNN baseline, our non-local blocks improve accuracy on all three tasks at a small extra computational cost. Together with the evidence on videos, these image experiments show that non-local operations are generally useful and can become a basic building block in designing deep neural networks.

1. Related work

1.1. Non-local image processing

Non-local means is a classical filtering algorithm that computes a weighted mean of all pixels in an image. It allows distant pixels to contribute to the filtered response at a location based on patch appearance similarity. This non-local filtering idea was later developed into BM3D (block-matching 3D), which performs filtering on a group of similar, but non-local, patches. BM3D is a solid image denoising baseline even in comparison with deep neural networks. Non-local matching is also the essence of successful texture synthesis, super-resolution, and inpainting algorithms.

1.2. Graphical models

Long-range dependencies can be modeled by graphical models such as conditional random fields (CRF). In the context of deep neural networks, a CRF can be exploited to post-process the semantic segmentation predictions of a network. The iterative mean-field inference of a CRF can be turned into a recurrent network and trained. In contrast, our method is a simpler feedforward block for computing non-local filtering. Unlike these methods, which were developed for segmentation, our general-purpose component is applied to classification and detection. These methods and ours are also related to a more abstract model called graph neural networks.

1.3. Feedforward modeling for sequences

Recently there has been a trend of using feedforward (i.e., non-recurrent) networks for modeling sequences in speech and language. In these methods, long-term dependencies are captured by the large receptive fields contributed by very deep 1D convolutions. These feedforward models are amenable to parallelized implementations and can be more effective than the widely used recurrent models.

1.4. Self-attention

A self-attention module computes the response at a position in a sequence (e.g., a sentence) by attending to all positions and taking their weighted average in an embedding space. As we will discuss next, self-attention can be viewed as a form of the non-local mean, and in this sense our work bridges self-attention for machine translation to the more general class of non-local filtering operations that are applicable to image and video problems in computer vision.

1.5. Interaction networks

Interaction Networks (IN) were proposed recently for modeling physical systems. They operate on graphs of objects involved in pairwise interactions. Hoshen [23] presented the more efficient Vertex Attention IN in the context of multi-agent predictive modeling. Another variant, named Relation Networks, computes a function on the feature embeddings at all pairs of positions in its input. Our method also processes all pairs, as we will explain in equation (1) (f(x_i, x_j)). While our non-local networks are connected to these approaches, our experiments indicate that the non-locality of the model, which is orthogonal to the ideas of attention/interaction/relations (e.g., a network can attend only to a local region), is the key to their success. Non-local modeling, a long-time crucial element of image processing, has been largely overlooked in recent neural networks for computer vision.

1.6. Video classification architectures

A natural solution to video classification is to combine the success of CNNs for images with that of RNNs for sequences. In contrast, feedforward models are realized by 3D convolutions (C3D) in spacetime, and the 3D filters can be formed by "inflating" pre-trained 2D filters. In addition to end-to-end modeling on raw video inputs, optical flow and trajectories have been found to be helpful. Both flow and trajectories are off-the-shelf modules that may find long-range, non-local dependencies.

2. Non-local neural networks

We first give a general definition of non-local operations and then provide several specific instantiations of it.

2.1. Formulation

Following the definition of non-local means, we define a generic non-local operation in deep neural networks as:

$$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j) \tag{1}$$

  • x is the input signal (in computer vision, typically a feature map);
  • i is the index of an output position (in space, time, or spacetime) whose response is computed by enumerating all positions j;
  • f computes a scalar similarity between i and j;
  • g computes a representation of the input feature map at position j;
  • the final y is normalized by the factor C(x).
    In other words, i indexes the current position whose response is being computed, j enumerates all positions globally, and the non-local response is their weighted combination.
    f(x_i, x_j) computes the pairwise relationship between i and every possible position j; for example, the relationship could be such that the farther apart i and j are, the smaller the value of f, meaning position j has less influence on i. g(x_j) computes the feature value of the input signal at position j. C(x) is a normalization factor.
    In a fully-connected (fc) layer, the relationship between x_j and x_i is not a function of the input data; it is encoded directly in learned weights. In a non-local operation, by contrast, responses are computed from the relationships between different positions. Furthermore, our formulation in (1) supports inputs of variable size and maintains the corresponding size in the output. In contrast, an fc layer requires a fixed-size input/output and loses positional correspondence. A non-local operation is a flexible building block that can be easily used together with convolutional/recurrent layers. It can be added into the earlier parts of deep neural networks, unlike fc layers that are often used at the end. This allows us to build a richer hierarchy that combines both non-local and local information.
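To make the formulation concrete, here is a minimal sketch of equation (1) in PyTorch, with the input flattened to N positions by C channels and with f and g left pluggable (concrete choices are given in section 2.2). The function name and shapes are illustrative, not from an official implementation.

```python
import torch

def non_local_op(x, f, g):
    """Generic non-local operation: y_i = (1/C(x)) * sum_j f(x_i, x_j) g(x_j).

    x: (N, C) tensor of N positions (space, time, or spacetime) by C channels.
    f: callable returning an (N, N) matrix of pairwise similarities f(x_i, x_j).
    g: callable returning an (N, C') representation of each position j.
    """
    pair = f(x, x)                       # (N, N) pairwise terms
    C = pair.sum(dim=-1, keepdim=True)   # C(x) = sum_j f(x_i, x_j), per row i
    return (pair / C) @ g(x)             # (N, C') weighted sum over all j
```

Note that the normalization C(x) = Σ_j f(x_i, x_j) shown here matches the Gaussian variants below; the dot-product and concatenation variants instead use the constant C(x) = N.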

2.2. Instantiations

Next we describe several versions of f and g. Interestingly, we will show by experiments (Table 2a) that our non-local models are not sensitive to these choices, indicating that the generic non-local behavior is the main reason for the observed improvements. For simplicity, we only consider g in the form of a linear embedding: g(x_j) = W_g x_j, where W_g is a weight matrix to be learned. This is implemented as a 1×1 convolution in space or a 1×1×1 convolution in spacetime. Next we discuss choices for the pairwise function f.

2.2.1. Gaussian

Following the non-local mean and bilateral filters, a natural choice for f is the Gaussian function. In this paper we consider:

$$f(x_i, x_j) = e^{x_i^T x_j} \tag{2}$$

Here $x_i^T x_j$ is dot-product similarity, and the normalization factor is set to $C(x) = \sum_{\forall j} f(x_i, x_j)$.

2.2.2. Embedded Gaussian

A simple extension of the Gaussian function is to compute similarity in an embedding space. In this paper we consider:

$$f(x_i, x_j) = e^{\theta(x_i)^T \phi(x_j)} \tag{3}$$

where $\theta(x_i) = W_\theta x_i$ and $\phi(x_j) = W_\phi x_j$ are two embeddings, and the normalization factor is set to $C(x) = \sum_{\forall j} f(x_i, x_j)$.
The self-attention module is in fact a special case of the non-local operation in the embedded Gaussian version. For a given i, $\frac{1}{C(x)} f(x_i, x_j)$ becomes a softmax computed over all j, so equation (1) can be written as $y = \mathrm{softmax}(x^T W_\theta^T W_\phi x)\, g(x)$, which is exactly the self-attention form. Our work thus connects the self-attention model to the classic non-local means, and extends the sequential self-attention network to a more general space/spacetime non-local network that can be used in image and video recognition tasks.
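This equivalence can be spelled out in a few lines of PyTorch: normalizing $e^{\theta(x_i)^T \phi(x_j)}$ by $C(x) = \sum_{\forall j} f(x_i, x_j)$ is exactly a row-wise softmax. A small illustrative sketch follows; the weights here are random stand-ins for the learned 1×1 convolutions.

```python
import torch
import torch.nn.functional as F

N, C, Ce = 196, 1024, 512           # positions, input channels, embedded channels
x = torch.randn(N, C)               # flattened feature map
W_theta = torch.randn(C, Ce)        # stand-ins for learned embedding weights
W_phi = torch.randn(C, Ce)
W_g = torch.randn(C, Ce)

theta, phi, g = x @ W_theta, x @ W_phi, x @ W_g
attn = F.softmax(theta @ phi.T, dim=-1)   # e^{theta^T phi} / C(x), row-wise over j
y = attn @ g                              # (N, Ce): the self-attention form
```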

2.2.3. Dot product

f can also be defined as dot-product similarity:

$$f(x_i, x_j) = \theta(x_i)^T \phi(x_j) \tag{4}$$
Here we use the embedded version. In this case, we set the normalization factor to C(x) = N, where N is the number of positions in x. Normalization of this kind is necessary because the input can have variable size. The main difference between the dot-product and the embedded Gaussian versions is the presence of softmax, which plays the role of an activation function.
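In code, the change from the embedded Gaussian sketch above is a single line: the softmax is replaced by a constant scaling. A minimal standalone sketch, with random stand-in embeddings:

```python
import torch

N, Ce = 196, 512
theta = torch.randn(N, Ce)     # theta(x) embeddings (illustrative values)
phi = torch.randn(N, Ce)       # phi(x) embeddings
g = torch.randn(N, Ce)         # g(x) embeddings

pair = theta @ phi.T           # f(x_i, x_j) = theta(x_i)^T phi(x_j)
y = (pair / N) @ g             # C(x) = N: plain scaling, no softmax activation
```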

2.2.4. Concatenation

Concatenation is the pairwise function used in Relation Networks. We also evaluate a concatenation form of f:

$$f(x_i, x_j) = \mathrm{ReLU}\!\left(w_f^T \left[\theta(x_i), \phi(x_j)\right]\right) \tag{5}$$

Here $[\cdot, \cdot]$ denotes concatenation, and $w_f$ is a weight vector that projects the concatenated vector to a scalar. As above, we set C(x) = N.
The above variants demonstrate the flexibility of our generic non-local operation. We believe alternative versions are possible and may further improve results.
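For completeness, here is a sketch of the concatenation variant, forming all (i, j) pairs by broadcasting; the weight vector w_f and the shapes are illustrative. Note that this materializes an N×N×2C′ tensor, making it the most memory-hungry of the four forms.

```python
import torch
import torch.nn.functional as F

N, Ce = 49, 256
theta = torch.randn(N, Ce)        # theta(x_i) embeddings (illustrative)
phi = torch.randn(N, Ce)          # phi(x_j) embeddings
g = torch.randn(N, Ce)            # g(x_j) embeddings
w_f = torch.randn(2 * Ce)         # projects each [theta_i, phi_j] pair to a scalar

# Broadcast to all (i, j) pairs: shape (N, N, 2*Ce)
pairs = torch.cat([theta[:, None, :].expand(N, N, Ce),
                   phi[None, :, :].expand(N, N, Ce)], dim=-1)
f = F.relu(pairs @ w_f)           # (N, N) pairwise scores, ReLU as in eq. (5)
y = (f / N) @ g                   # normalize by C(x) = N
```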

2.3. Non-local block

We wrap the non-local operation in equation (1) into a non-local block that can be inserted into existing architectures. We define a non-local block as:

$$z_i = W_z y_i + x_i \tag{6}$$

where $y_i$ is given by equation (1) and $+x_i$ denotes a residual connection. The residual connection allows us to insert a new non-local block into any pre-trained model without breaking its initial behavior.
[Figure 2] The spacetime non-local block. The feature maps are shown as the shape of their tensors, e.g., T×H×W×1024 for 1024 channels (proper reshaping is performed when noted). "⊗" denotes matrix multiplication and "⊕" denotes element-wise sum. The softmax operation is performed on each row. The blue boxes denote 1×1×1 convolutions. Here we show the embedded Gaussian version, with a bottleneck of 512 channels. The vanilla Gaussian version can be obtained by removing θ and φ, and the dot-product version by replacing softmax with scaling by 1/N.

Figure 2 illustrates a non-local block. The pairwise computation in equation (2), (3), or (4) can be done simply by matrix multiplication, as shown in Figure 2; the concatenation version in (5) is straightforward. The pairwise computation of a non-local block is lightweight when it is used on high-level, sub-sampled feature maps. For example, typical values in Figure 2 are T=4 and H=W=14 or 7. The pairwise computation done by matrix multiplication is comparable to a typical convolutional layer in standard networks. We further adopt the following implementations that make it more efficient.
We set the number of channels represented by W_g, W_θ, and W_φ to half the number of channels in x. This follows the bottleneck design and reduces the computation of a block by about half. The weight matrix W_z in equation (6) computes a position-wise embedding on y_i, matching the number of channels to that of x; see Figure 2. A subsampling trick can be used to further reduce computation. We modify equation (1) as $y_i = \frac{1}{C(\hat{x})} \sum_{\forall j} f(x_i, \hat{x}_j)\, g(\hat{x}_j)$, where $\hat{x}$ is a subsampled version of x (e.g., by pooling). We perform this in the spatial domain, which reduces the amount of pairwise computation to 1/4. This trick does not alter the non-local behavior but only makes the computation sparser. It can be done by adding a max pooling layer after φ and g in Figure 2. We use these efficient modifications for all non-local blocks studied in this paper.
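Putting Figure 2 and these efficiency tricks together, here is a sketch of a spacetime non-local block (embedded Gaussian version) in PyTorch. The class name and arguments are our own; this is a minimal illustration rather than the authors' official implementation, and it zero-initializes W_z so that a freshly inserted block starts as an identity mapping (the paper instead zero-initializes the scale of a BatchNorm layer appended after W_z).

```python
import torch
import torch.nn as nn

class NonLocalBlock3D(nn.Module):
    """Embedded Gaussian non-local block with the bottleneck, residual
    connection, and optional max-pool subsampling trick on phi and g."""

    def __init__(self, channels, subsample=True):
        super().__init__()
        inter = channels // 2                    # bottleneck: half the channels
        self.theta = nn.Conv3d(channels, inter, kernel_size=1)
        self.phi = nn.Conv3d(channels, inter, kernel_size=1)
        self.g = nn.Conv3d(channels, inter, kernel_size=1)
        self.w_z = nn.Conv3d(inter, channels, kernel_size=1)
        nn.init.zeros_(self.w_z.weight)          # block initially acts as identity
        nn.init.zeros_(self.w_z.bias)
        # Spatial subsampling of phi and g sparsifies the pairwise computation.
        self.pool = nn.MaxPool3d((1, 2, 2)) if subsample else nn.Identity()

    def forward(self, x):                        # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        theta = self.theta(x).flatten(2).transpose(1, 2)      # (B, THW, C/2)
        phi = self.pool(self.phi(x)).flatten(2)               # (B, C/2, THW/4)
        g = self.pool(self.g(x)).flatten(2).transpose(1, 2)   # (B, THW/4, C/2)
        attn = torch.softmax(theta @ phi, dim=-1)             # softmax over j, per row i
        y = (attn @ g).transpose(1, 2).reshape(b, c // 2, t, h, w)
        return self.w_z(y) + x                   # z = W_z y + x (residual)
```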

2.4. Understanding the non-local block, with an efficient implementation strategy

Take the embedded Gaussian version as an example (see Figure 2):

  • x denotes the feature map; x_i carries the information at the current position of interest, and x_j represents global information;
  • θ denotes θ(x_i) = W_θ x_i; in practice it is implemented and learned as a 1×1 convolution;
  • φ denotes φ(x_j) = W_φ x_j, likewise implemented as a learned 1×1 convolution;
  • g is implemented the same way;
  • C(x) denotes the normalization; in the embedded Gaussian version it is realized with a softmax.

The pairwise computation of a non-local block can be very lightweight if it is used at high levels of the network, where the feature maps are small. For example, typical values in Figure 2 are T=4 and H=W=14 or 7. Computing the pairwise function via matrix operations costs about as much as a typical conv layer. In addition, we make it more efficient as follows.
We set the number of channels of W_g, W_θ, and W_φ to half the number of channels of x. This forms a bottleneck that reduces the amount of data in the channel domain, cutting the computation roughly in half. W_z then restores the channel count of x, keeping the input and output dimensions consistent.
A further subsampling trick can be applied by rewriting equation (1) as $y_i = \sum_{\forall j} f(x_i, \hat{x}_j)\, g(\hat{x}_j) / C(\hat{x})$, where $\hat{x}$ is a downsampled version of x (e.g., obtained by pooling). Applying this in the spatial domain reduces the pairwise computation to 1/4. The trick does not change the non-local behavior; it only makes the computation sparser. It can be implemented by adding a max pooling layer after φ and g in Figure 2.
All non-local modules in this paper use the efficiency strategies above.
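As a quick sanity check, the block sketched in section 2.3 above can be applied to a feature tensor with the typical sizes mentioned here (T=4, H=W=14, 1024 channels); the output shape matches the input, illustrating that variable input sizes are preserved.

```python
import torch

# Assumes the NonLocalBlock3D sketch from section 2.3 is in scope.
block = NonLocalBlock3D(channels=1024, subsample=True)
clip_features = torch.randn(2, 1024, 4, 14, 14)    # (batch, C, T, H, W)
out = block(clip_features)
assert out.shape == clip_features.shape            # same spacetime size and channels
```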
(Section 2.4 partially references: [Paper notes] Non-local Neural Networks.)

Original site

Copyright notice
This article was written by [HheeFish]; please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/02/202202210508588847.html