
Deep Learning: Non-local Neural Networks

2022-06-25 05:24:00 HheeFish

Paper download

0. Summary

Capturing long-range dependencies is of central importance in deep neural networks. For sequential data (e.g., in speech and language), recurrent operations are the dominant solution for modeling long-range dependencies. For image data, long-range dependencies are modeled by the large receptive fields formed by deep stacks of convolutions. Both convolutional and recurrent operations process a local neighborhood, in either space or time; long-range dependencies can therefore be captured only by applying these operations repeatedly, propagating signals step by step through the data. Repeating local operations has several limitations:

  • It is computationally inefficient.
  • It causes optimization difficulties that need to be carefully addressed.
  • These challenges make multi-hop dependency modeling difficult, e.g., when messages need to be delivered back and forth between distant positions.

In this paper, we present non-local operations (Non-local Operation) as an efficient, simple, and generic component for capturing long-range dependencies with deep neural networks. Our non-local operation is a generalization of the classical non-local means operation (Non-local Means) in computer vision. Intuitively, a non-local operation computes the response at a position as a weighted sum of the features at all positions in the input feature maps. The set of positions can be in space, time, or spacetime, so our operation is applicable to image, sequence, and video problems.
Using non-local operations has several advantages:

  • In contrast to the progressive behavior of recurrent and convolutional operations, non-local operations capture long-range dependencies directly by computing interactions between any two positions, regardless of their positional distance;
  • As we show in experiments, non-local operations are efficient and achieve their best results even with only a few layers (e.g., 5 layers);
  • Finally, our non-local operations maintain variable input sizes and can be easily combined with other operations (e.g., the convolutions we will use).
    We demonstrate the effectiveness of non-local operations in the application of video classification. In videos, long-range interactions occur between distant pixels in space as well as time. A single non-local block, which is our basic unit, can directly capture these spacetime dependencies in a feedforward fashion. With a few non-local blocks, our architectures, called non-local neural networks, are more accurate for video classification than 2D and 3D convolutional networks (including their inflated variants). Moreover, non-local neural networks are more computationally economical than their 3D convolutional counterparts. Comprehensive ablation studies are presented on the Kinetics and Charades datasets. Using RGB only and without any bells and whistles (e.g., optical flow, multi-scale testing), our method achieves results on par with or better than the latest competition winners on both datasets.
    To demonstrate the generality of non-local operations, we further present object detection/segmentation and pose estimation experiments on the COCO dataset. On top of the strong Mask R-CNN baseline, our non-local blocks improve accuracy on all three tasks at a small extra computational cost. Together with the evidence on videos, these image experiments show that non-local operations are generally useful and can become a basic building block in designing deep neural networks.

1. Related work

1.1. Non-local image processing

Non-local means is a classical filtering algorithm that computes a weighted mean of all pixels in an image. It allows distant pixels to contribute to the filtered response at a location based on patch appearance similarity. This non-local filtering idea was later developed into BM3D (block-matching 3D), which performs filtering on a group of similar, but non-local, patches. BM3D is a solid image denoising baseline even in comparison with deep neural networks. Non-local matching is also the essence of successful texture synthesis, super-resolution, and inpainting algorithms.

1.2. Graphical models

Long-range dependencies can be modeled by graphical models such as conditional random fields (CRF). In the context of deep neural networks, a CRF can be exploited to post-process the semantic segmentation predictions of a network. The iterative mean-field inference of a CRF can be turned into a recurrent network and trained. In contrast, our method is a simpler feedforward block for computing non-local filtering. Unlike these methods, which were developed for segmentation, our general-purpose component is applied to classification and detection. These methods and ours are also related to a more abstract model called graph neural networks.

1.3. Feedforward modeling for sequences

Recently there has been a trend of using feedforward (i.e., non-recurrent) networks for modeling sequences in speech and language. In these methods, long-term dependencies are captured by the large receptive fields contributed by very deep 1D convolutions. These feedforward models are amenable to parallelized implementations and can be more effective than the widely used recurrent models.

1.4. Self-attention

A self-attention module computes the response at a position in a sequence (e.g., a sentence) by attending to all positions and taking their weighted average in an embedding space. As we will discuss next, self-attention can be viewed as a form of the non-local mean, and in this sense our work bridges self-attention for machine translation to the more general class of non-local filtering operations that are applicable to image and video problems in computer vision.

1.5. Interaction networks

Interaction Networks (IN) were proposed recently for modeling physical systems. They operate on graphs of objects involved in pairwise interactions. Hoshen [23] presented the more efficient Vertex Attention IN in the context of multi-agent predictive modeling. Another variant, named Relation Networks, computes a function on the feature embeddings at all pairs of positions in its input. Our method also processes all pairs, as we will explain in equation (1) (f(x_i, x_j)). While our non-local networks are connected to these approaches, our experiments indicate that the non-locality of the model, which is orthogonal to the ideas of attention/interaction/relations (e.g., a network can attend only to a local region), is the key to their success. Non-local modeling, a long-time crucial element of image processing, has been largely overlooked in recent neural networks for computer vision.

1.6. Video classification architectures

A natural solution to video classification is to combine the success of CNNs for images with that of RNNs for sequences. In contrast, feedforward models are realized by 3D convolutions (C3D) in spacetime, and the 3D filters can be formed by "inflating" pre-trained 2D filters. In addition to end-to-end modeling on raw video inputs, optical flow and trajectories have been found to be helpful. Both flow and trajectories are off-the-shelf modules that may find long-range, non-local dependencies.

2. Non-local neural networks

We first give a general definition of non-local operations and then provide several specific instantiations of it.

2.1. Formulation

Following the definition of non-local means, we define a generic non-local operation in deep neural networks as:

$$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j) \tag{1}$$

  • x is the input signal (in computer vision, typically a feature map);
  • i is the index of an output position (in space, time, or spacetime) whose response is computed by enumerating all positions j;
  • f computes a scalar similarity between i and j;
  • g computes a representation of the input feature map at position j;
  • the final y is normalized by the factor C(x).
    In other words, i indexes the current position whose response is being computed, j enumerates all positions globally, and the non-local response is their weighted combination.
    f(x_i, x_j) computes the pairwise relationship between i and every possible position j; for example, the relationship could be such that the farther apart i and j are, the smaller the value of f, meaning position j has less influence on i. g(x_j) computes the feature value of the input signal at position j. C(x) is a normalization factor.
    In a fully-connected (fc) layer, the relationship between x_j and x_i is not a function of the input data; it is encoded directly in learned weights. In a non-local operation, by contrast, responses are computed from the relationships between different positions. Furthermore, our formulation in (1) supports inputs of variable size and maintains the corresponding size in the output. In contrast, an fc layer requires a fixed-size input/output and loses positional correspondence. A non-local operation is a flexible building block that can be easily used together with convolutional/recurrent layers. It can be added into the earlier parts of deep neural networks, unlike fc layers that are often used at the end. This allows us to build a richer hierarchy that combines both non-local and local information.
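To make the formulation concrete, here is a minimal sketch of equation (1) in PyTorch, with the input flattened to N positions by C channels and with f and g left pluggable (concrete choices are given in section 2.2). The function name and shapes are illustrative, not from an official implementation.

```python
import torch

def non_local_op(x, f, g):
    """Generic non-local operation: y_i = (1/C(x)) * sum_j f(x_i, x_j) g(x_j).

    x: (N, C) tensor of N positions (space, time, or spacetime) by C channels.
    f: callable returning an (N, N) matrix of pairwise similarities f(x_i, x_j).
    g: callable returning an (N, C') representation of each position j.
    """
    pair = f(x, x)                       # (N, N) pairwise terms
    C = pair.sum(dim=-1, keepdim=True)   # C(x) = sum_j f(x_i, x_j), per row i
    return (pair / C) @ g(x)             # (N, C') weighted sum over all j
```

Note that the normalization C(x) = Σ_j f(x_i, x_j) shown here matches the Gaussian variants below; the dot-product and concatenation variants instead use the constant C(x) = N.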

2.2. Instantiations

Next we describe several versions of f and g. Interestingly, we will show by experiments (Table 2a) that our non-local models are not sensitive to these choices, indicating that the generic non-local behavior is the main reason for the observed improvements. For simplicity, we only consider g in the form of a linear embedding: g(x_j) = W_g x_j, where W_g is a weight matrix to be learned. This is implemented as a 1×1 convolution in space or a 1×1×1 convolution in spacetime. Next we discuss choices for the pairwise function f.

2.2.1. Gaussian

Following the non-local mean and bilateral filters, a natural choice for f is the Gaussian function. In this paper we consider:

$$f(x_i, x_j) = e^{x_i^T x_j} \tag{2}$$

Here $x_i^T x_j$ is dot-product similarity, and the normalization factor is set to $C(x) = \sum_{\forall j} f(x_i, x_j)$.

2.2.2. Embedded Gaussian

A simple extension of the Gaussian function is to compute similarity in an embedding space. In this paper we consider:

$$f(x_i, x_j) = e^{\theta(x_i)^T \phi(x_j)} \tag{3}$$

where $\theta(x_i) = W_\theta x_i$ and $\phi(x_j) = W_\phi x_j$ are two embeddings, and the normalization factor is set to $C(x) = \sum_{\forall j} f(x_i, x_j)$.
The self-attention module is in fact a special case of the non-local operation in the embedded Gaussian version. For a given i, $\frac{1}{C(x)} f(x_i, x_j)$ becomes a softmax computed over all j, so equation (1) can be written as $y = \mathrm{softmax}(x^T W_\theta^T W_\phi x)\, g(x)$, which is exactly the self-attention form. Our work thus connects the self-attention model to the classic non-local means, and extends the sequential self-attention network to a more general space/spacetime non-local network that can be used in image and video recognition tasks.
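This equivalence can be spelled out in a few lines of PyTorch: normalizing $e^{\theta(x_i)^T \phi(x_j)}$ by $C(x) = \sum_{\forall j} f(x_i, x_j)$ is exactly a row-wise softmax. A small illustrative sketch follows; the weights here are random stand-ins for the learned 1×1 convolutions.

```python
import torch
import torch.nn.functional as F

N, C, Ce = 196, 1024, 512           # positions, input channels, embedded channels
x = torch.randn(N, C)               # flattened feature map
W_theta = torch.randn(C, Ce)        # stand-ins for learned embedding weights
W_phi = torch.randn(C, Ce)
W_g = torch.randn(C, Ce)

theta, phi, g = x @ W_theta, x @ W_phi, x @ W_g
attn = F.softmax(theta @ phi.T, dim=-1)   # e^{theta^T phi} / C(x), row-wise over j
y = attn @ g                              # (N, Ce): the self-attention form
```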

2.2.3. Dot product

f can also be defined as dot-product similarity:

$$f(x_i, x_j) = \theta(x_i)^T \phi(x_j) \tag{4}$$
Here we use the embedded version. In this case, we set the normalization factor to C(x) = N, where N is the number of positions in x. Normalization of this kind is necessary because the input can have variable size. The main difference between the dot-product and the embedded Gaussian versions is the presence of softmax, which plays the role of an activation function.
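In code, the change from the embedded Gaussian sketch above is a single line: the softmax is replaced by a constant scaling. A minimal standalone sketch, with random stand-in embeddings:

```python
import torch

N, Ce = 196, 512
theta = torch.randn(N, Ce)     # theta(x) embeddings (illustrative values)
phi = torch.randn(N, Ce)       # phi(x) embeddings
g = torch.randn(N, Ce)         # g(x) embeddings

pair = theta @ phi.T           # f(x_i, x_j) = theta(x_i)^T phi(x_j)
y = (pair / N) @ g             # C(x) = N: plain scaling, no softmax activation
```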

2.2.4. Concatenation

Concatenation is the pairwise function used in Relation Networks. We also evaluate a concatenation form of f:

$$f(x_i, x_j) = \mathrm{ReLU}\!\left(w_f^T \left[\theta(x_i), \phi(x_j)\right]\right) \tag{5}$$

Here $[\cdot, \cdot]$ denotes concatenation, and $w_f$ is a weight vector that projects the concatenated vector to a scalar. As above, we set C(x) = N.
The above variants demonstrate the flexibility of our generic non-local operation. We believe alternative versions are possible and may further improve results.
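For completeness, here is a sketch of the concatenation variant, forming all (i, j) pairs by broadcasting; the weight vector w_f and the shapes are illustrative. Note that this materializes an N×N×2C′ tensor, making it the most memory-hungry of the four forms.

```python
import torch
import torch.nn.functional as F

N, Ce = 49, 256
theta = torch.randn(N, Ce)        # theta(x_i) embeddings (illustrative)
phi = torch.randn(N, Ce)          # phi(x_j) embeddings
g = torch.randn(N, Ce)            # g(x_j) embeddings
w_f = torch.randn(2 * Ce)         # projects each [theta_i, phi_j] pair to a scalar

# Broadcast to all (i, j) pairs: shape (N, N, 2*Ce)
pairs = torch.cat([theta[:, None, :].expand(N, N, Ce),
                   phi[None, :, :].expand(N, N, Ce)], dim=-1)
f = F.relu(pairs @ w_f)           # (N, N) pairwise scores, ReLU as in eq. (5)
y = (f / N) @ g                   # normalize by C(x) = N
```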

2.3. Non-local block

We wrap the non-local operation in equation (1) into a non-local block that can be inserted into existing architectures. We define a non-local block as:

$$z_i = W_z y_i + x_i \tag{6}$$

where $y_i$ is given by equation (1) and $+x_i$ denotes a residual connection. The residual connection allows us to insert a new non-local block into any pre-trained model without breaking its initial behavior.
[Figure 2] The spacetime non-local block. The feature maps are shown as the shape of their tensors, e.g., T×H×W×1024 for 1024 channels (proper reshaping is performed when noted). "⊗" denotes matrix multiplication and "⊕" denotes element-wise sum. The softmax operation is performed on each row. The blue boxes denote 1×1×1 convolutions. Here we show the embedded Gaussian version, with a bottleneck of 512 channels. The vanilla Gaussian version can be obtained by removing θ and φ, and the dot-product version by replacing softmax with scaling by 1/N.

Figure 2 illustrates a non-local block. The pairwise computation in equation (2), (3), or (4) can be done simply by matrix multiplication, as shown in Figure 2; the concatenation version in (5) is straightforward. The pairwise computation of a non-local block is lightweight when it is used on high-level, sub-sampled feature maps. For example, typical values in Figure 2 are T=4 and H=W=14 or 7. The pairwise computation done by matrix multiplication is comparable to a typical convolutional layer in standard networks. We further adopt the following implementations that make it more efficient.
We set the number of channels represented by W_g, W_θ, and W_φ to half the number of channels in x. This follows the bottleneck design and reduces the computation of a block by about half. The weight matrix W_z in equation (6) computes a position-wise embedding on y_i, matching the number of channels to that of x; see Figure 2. A subsampling trick can be used to further reduce computation. We modify equation (1) as $y_i = \frac{1}{C(\hat{x})} \sum_{\forall j} f(x_i, \hat{x}_j)\, g(\hat{x}_j)$, where $\hat{x}$ is a subsampled version of x (e.g., by pooling). We perform this in the spatial domain, which reduces the amount of pairwise computation to 1/4. This trick does not alter the non-local behavior but only makes the computation sparser. It can be done by adding a max pooling layer after φ and g in Figure 2. We use these efficient modifications for all non-local blocks studied in this paper.
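Putting Figure 2 and these efficiency tricks together, here is a sketch of a spacetime non-local block (embedded Gaussian version) in PyTorch. The class name and arguments are our own; this is a minimal illustration rather than the authors' official implementation, and it zero-initializes W_z so that a freshly inserted block starts as an identity mapping (the paper instead zero-initializes the scale of a BatchNorm layer appended after W_z).

```python
import torch
import torch.nn as nn

class NonLocalBlock3D(nn.Module):
    """Embedded Gaussian non-local block with the bottleneck, residual
    connection, and optional max-pool subsampling trick on phi and g."""

    def __init__(self, channels, subsample=True):
        super().__init__()
        inter = channels // 2                    # bottleneck: half the channels
        self.theta = nn.Conv3d(channels, inter, kernel_size=1)
        self.phi = nn.Conv3d(channels, inter, kernel_size=1)
        self.g = nn.Conv3d(channels, inter, kernel_size=1)
        self.w_z = nn.Conv3d(inter, channels, kernel_size=1)
        nn.init.zeros_(self.w_z.weight)          # block initially acts as identity
        nn.init.zeros_(self.w_z.bias)
        # Spatial subsampling of phi and g sparsifies the pairwise computation.
        self.pool = nn.MaxPool3d((1, 2, 2)) if subsample else nn.Identity()

    def forward(self, x):                        # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        theta = self.theta(x).flatten(2).transpose(1, 2)      # (B, THW, C/2)
        phi = self.pool(self.phi(x)).flatten(2)               # (B, C/2, THW/4)
        g = self.pool(self.g(x)).flatten(2).transpose(1, 2)   # (B, THW/4, C/2)
        attn = torch.softmax(theta @ phi, dim=-1)             # softmax over j, per row i
        y = (attn @ g).transpose(1, 2).reshape(b, c // 2, t, h, w)
        return self.w_z(y) + x                   # z = W_z y + x (residual)
```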

2.4. Understanding the non-local block, with an efficient implementation strategy

Take the embedded Gaussian version as an example (see Figure 2):

  • x denotes the feature map; x_i carries the information at the current position of interest, and x_j represents global information;
  • θ denotes θ(x_i) = W_θ x_i; in practice it is implemented and learned as a 1×1 convolution;
  • φ denotes φ(x_j) = W_φ x_j, likewise implemented as a learned 1×1 convolution;
  • g is implemented the same way;
  • C(x) denotes the normalization; in the embedded Gaussian version it is realized with a softmax.

The pairwise computation of a non-local block can be very lightweight if it is used at high levels of the network, where the feature maps are small. For example, typical values in Figure 2 are T=4 and H=W=14 or 7. Computing the pairwise function via matrix operations costs about as much as a typical conv layer. In addition, we make it more efficient as follows.
We set the number of channels of W_g, W_θ, and W_φ to half the number of channels of x. This forms a bottleneck that reduces the amount of data in the channel domain, cutting the computation roughly in half. W_z then restores the channel count of x, keeping the input and output dimensions consistent.
A further subsampling trick can be applied by rewriting equation (1) as $y_i = \sum_{\forall j} f(x_i, \hat{x}_j)\, g(\hat{x}_j) / C(\hat{x})$, where $\hat{x}$ is a downsampled version of x (e.g., obtained by pooling). Applying this in the spatial domain reduces the pairwise computation to 1/4. The trick does not change the non-local behavior; it only makes the computation sparser. It can be implemented by adding a max pooling layer after φ and g in Figure 2.
All non-local modules in this paper use the efficiency strategies above.
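As a quick sanity check, the block sketched in section 2.3 above can be applied to a feature tensor with the typical sizes mentioned here (T=4, H=W=14, 1024 channels); the output shape matches the input, illustrating that variable input sizes are preserved.

```python
import torch

# Assumes the NonLocalBlock3D sketch from section 2.3 is in scope.
block = NonLocalBlock3D(channels=1024, subsample=True)
clip_features = torch.randn(2, 1024, 4, 14, 14)    # (batch, C, T, H, W)
out = block(clip_features)
assert out.shape == clip_features.shape            # same spacetime size and channels
```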
(Section 2.4 partially references: [Paper notes] Non-local Neural Networks.)

Original site

Copyright notice
This article was written by [HheeFish]; please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/02/202202210508588847.html