
[SCA-CNN Interpretation] Spatial and Channel-wise Attention

2022-06-13 00:38:00 AI bacteria


Abstract

Visual attention has been successfully applied to structured prediction tasks such as visual captioning and question answering. Existing visual attention models are generally spatial: attention is modeled as a spatial probability map that re-weights the last convolutional feature map of the CNN encoding the input image. We argue, however, that such spatial attention does not necessarily match the attention mechanism, i.e., a dynamic feature extractor that combines contextual fixations over time, because CNN features are naturally spatial, channel-wise, and multi-layer. In this paper, we introduce a novel convolutional neural network, dubbed SCA-CNN, that incorporates spatial and channel-wise attention into a CNN. In the image-captioning task, SCA-CNN dynamically modulates the sentence-generation context in multi-layer feature maps, encoding where (i.e., attentive spatial locations at multiple layers) and what (i.e., attentive channels) the visual attention is. We evaluate the proposed SCA-CNN architecture on three benchmark image-captioning datasets: Flickr8K, Flickr30K, and MSCOCO. It is consistently observed that SCA-CNN significantly outperforms state-of-the-art visual-attention-based image-captioning methods.
[Figure 1: motivation for channel-wise attention over multi-layer feature maps]

1. Introduction

Visual attention has proven effective in various structured prediction tasks, such as image/video captioning and visual question answering. Its success mainly stems from a reasonable assumption: human vision does not tend to process a whole image at once; instead, one attends to specific parts of the visual space only when and where needed. Specifically, rather than encoding an image into a static vector, attention allows the image features to evolve with the sentence context. Visual attention can thus be regarded as a dynamic feature-extraction mechanism that combines contextual fixations over time.

In this paper, we take full advantage of three characteristics of CNN features for visual-attention-based image captioning. In particular, we propose a novel spatial- and channel-wise attention-based convolutional neural network, dubbed SCA-CNN, which learns to attend to every feature entry in the multi-layer 3D feature maps. Figure 1 illustrates the motivation for introducing channel-wise attention over multi-layer feature maps. First, since a channel-wise feature map is essentially the detector response map of the corresponding filter, channel-wise attention can be viewed as a process of selecting semantic attributes on the demand of the sentence context. For example, when we want to predict "cake", our channel-wise attention (e.g., in the conv5_3/conv5_4 feature maps) assigns more weight to the channel-wise feature maps generated by filters for semantics such as cake, fire, light, and candle-like shapes. Second, since a feature map depends on its lower-layer feature maps, it is natural to apply attention at multiple layers, thereby obtaining visual attention at multiple semantic abstractions. For example, it is helpful to emphasize lower-layer channels corresponding to more elemental shapes, such as the arrays and cylinders that make up the cake.

We validate the effectiveness of the proposed SCA-CNN on three well-known image-captioning benchmarks: Flickr8K, Flickr30K, and MSCOCO. On BLEU-4, SCA-CNN significantly surpasses the spatial attention model by 4.8%. In summary, we propose a unified SCA-CNN framework that effectively integrates spatial, channel-wise, and multi-layer visual attention in CNN features for image captioning. In particular, a novel spatial and channel-wise attention model is proposed. The model is generic, so it can be applied to any layer of any CNN architecture, such as the popular VGG [25] and ResNet [8]. SCA-CNN helps us better understand how CNN features evolve during sentence generation.

2. Related work

We are interested in the visual attention models used in the encoder-decoder framework for neural image/video captioning (NIC) and visual question answering (VQA), in line with the recent trend of connecting computer vision and natural language. Pioneering NIC and VQA works use a CNN to encode an image or video into a static visual feature vector, which is then fed into an RNN to decode a language sequence such as a caption or an answer.

However, a static vector does not allow the image features to adapt to the sentence context at hand. Inspired by the attention mechanism introduced in machine translation, where the decoder dynamically selects useful source-language words or sub-sequences for translation into the target language, visual attention models have been widely used in NIC and VQA. We classify these attention-based models into the following three categories, which inspire our SCA-CNN.

Xu et al. proposed the first visual attention model for image captioning. In general, they used either "hard" pooling, which selects the most probable attentive region, or "soft" pooling, which averages the spatial features with attentive weights. For VQA, Zhu et al. applied "soft" attention to merge image-region features. To further refine spatial attention, Yang et al. and Xu et al. studied stacked spatial attention models, where the second attention is based on the attentive feature map modulated by the first. Different from them, our multi-layer attention is applied across multiple layers of the CNN. A common drawback of the above spatial models is that they generally resort to weighted pooling over the attentive feature map; spatial information is therefore inevitably lost. More severely, their attention is applied only to the last convolutional layer, where the receptive field is quite large and the differences between receptive-field regions are quite small, rendering spatial attention less significant.

Besides spatial information, You et al. proposed selecting semantic concepts in NIC, where the image feature is the confidence vector of attribute classifiers. Jia et al. used the correlation between images and captions as global semantic information to guide the LSTM in generating sentences. However, these models require external resources to train the semantic attributes. In SCA-CNN, each filter kernel of a convolutional layer serves as a semantic detector; therefore, SCA-CNN's channel-wise attention is similar to semantic attention.

3. Spatial and channel-wise attention

3.1 Overview

We adopt the popular encoder-decoder framework for image-caption generation, where a CNN first encodes the input image into a vector and an LSTM then decodes the vector into a sequence of words. As shown in Figure 2, SCA-CNN adapts the original multi-layer CNN feature maps to the sentence context through multi-layer channel-wise attention and spatial attention.
[Figure 2: overview of the SCA-CNN encoder-decoder framework]

Formally, suppose we want to generate the t-th word of the image caption. The previous sentence context is encoded in the LSTM memory h_{t−1} ∈ R^d, where d is the hidden-state dimension. At layer l, the spatial and channel-wise attention weights γ^l are a function of h_{t−1} and the current CNN features V^l. SCA-CNN thus uses the attention weights γ^l to modulate V^l in a recurrent and multi-layer fashion, as follows:
γ^l = Φ(h_{t−1}, V^l)
X^l = f(V^l, γ^l)

where X^l is the modulated feature, Φ(·) is the spatial and channel-wise attention function detailed in Sections 3.2 and 3.3, V^l is the feature map output by the previous convolutional layer, e.g., a convolution following pooling or down-sampling [25, 8], and f(·) is a linear weighting function that modulates the CNN features with the attention weights. Unlike the modulation strategy popular in existing attention models [34], which sums over all attended visual features, f(·) is applied as an element-wise multiplication. We are now ready to generate the t-th word as follows:
h_t = LSTM(h_{t−1}, X^L, y_{t−1})
y_t ∼ p_t = softmax(h_t, y_{t−1})
where L is the total number of convolutional layers, p_t ∈ R^{|D|} is a probability vector, and D is a predefined dictionary containing all caption words.
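The recurrent, layer-by-layer modulation above can be sketched in a few lines of numpy. Note this is a toy sketch: phi below is a placeholder stand-in, not the actual attention function of the paper (which Sections 3.2 and 3.3 define), and all shapes are illustrative.

```python
import numpy as np

# Stand-in for the attention function Phi(h_{t-1}, V^l): it maps the
# sentence context and the feature map to per-entry weights in (0, 1).
def phi(h_prev, V):
    scores = V * h_prev.mean()            # illustrative interaction only
    return 1.0 / (1.0 + np.exp(-scores))  # sigmoid keeps weights in (0, 1)

# Element-wise modulation f(V^l, gamma^l): unlike weighted-sum pooling,
# the modulated map X keeps the same shape as V.
def f(V, gamma):
    return V * gamma

rng = np.random.default_rng(0)
h_prev = rng.standard_normal(8)           # previous LSTM hidden state h_{t-1}
X = rng.standard_normal((4, 3, 3))        # initial feature map (C, W, H)
for l in range(2):                        # modulate two stacked layers
    gamma = phi(h_prev, X)                # attention weights for layer l
    X = f(X, gamma)                       # X^l = f(V^l, gamma^l)
print(X.shape)  # (4, 3, 3)
```

The key design point this illustrates is that element-wise modulation preserves the spatial layout of the feature map, which weighted-sum pooling would collapse.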

3.2 Spatial attention

In general, a caption word relates to only part of the image. For example, in Figure 1, when we want to predict "cake", only the image region containing the cake is useful; generating captions from a global image feature vector may therefore lead to sub-optimal results due to the irrelevant regions. Instead of considering every image region equally, the spatial attention mechanism attempts to pay more attention to the semantically related regions. Without loss of generality, we drop the layer superscript l. We flatten the width and height of the original V to obtain V = [v_1, v_2, …, v_m], where v_i ∈ R^C and m = W·H. Each v_i can be viewed as the visual feature at the i-th location. Given the LSTM hidden state h_{t−1} of the previous time step, we use a single-layer neural network followed by a softmax function to generate the attention distribution α over the image regions. The spatial attention model Φ_s is defined as:

a = tanh((W_s V + b_s) ⊕ W_{hs} h_{t−1})
α = softmax(W_i a + b_i)

where ⊕ denotes the addition of a matrix and a vector (broadcast over columns).
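A minimal numpy sketch of this single-layer-network-plus-softmax spatial attention follows. The parameter names (Ws, Whs, bs, wi, bi) mirror the model description; their shapes and the random inputs are illustrative assumptions, not the paper's code.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def spatial_attention(V, h_prev, Ws, Whs, bs, wi, bi):
    """V: (m, C) flattened feature map (m = W*H positions),
    h_prev: (d,) previous LSTM hidden state. Returns alpha: (m,)."""
    # a = tanh((Ws V + bs) (+) Whs h_{t-1}); (+) is broadcast addition
    a = np.tanh(V @ Ws.T + Whs @ h_prev + bs)  # (m, k)
    alpha = softmax(a @ wi + bi)               # (m,) distribution over positions
    return alpha

rng = np.random.default_rng(0)
m, C, d, k = 9, 5, 6, 4
alpha = spatial_attention(rng.standard_normal((m, C)),
                          rng.standard_normal(d),
                          rng.standard_normal((k, C)),
                          rng.standard_normal((k, d)),
                          rng.standard_normal(k),
                          rng.standard_normal(k),
                          0.0)
print(alpha.shape, round(alpha.sum(), 6))  # (9,) 1.0
```

Because the weights come out of a softmax, alpha is a proper probability distribution over the m spatial locations.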

3.3 Channel attention

Note that the spatial attention function in Equation (3) still requires the visual features V to compute the spatial attention weights, yet the visual features used for spatial attention are not themselves attended. We therefore introduce a channel-wise attention mechanism to attend to the features V. It is worth noting that each CNN filter acts as a pattern detector, and each channel of a CNN feature map is the response activation of the corresponding convolutional filter. Applying the attention mechanism channel-wise can therefore be viewed as a process of selecting semantic attributes.

To obtain channel-wise attention, we first reshape V to U, where U = [u_1, u_2, …, u_C], u_i ∈ R^{W×H} represents the i-th channel of the feature map V, and C is the total number of channels. We then apply average pooling to each channel to obtain the channel feature v:
v = [v_1, v_2, …, v_C],  v ∈ R^C
where the scalar v_i is the mean of the i-th channel feature map u_i. Following the definition of the spatial attention model, the channel-wise attention model Φ_c can be defined as:
b = tanh((W_c ⊗ v + b_c) ⊕ W_{hc} h_{t−1})
β = softmax(W'_i b + b'_i)

where ⊗ denotes the outer product of vectors.
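The channel-wise model Φ_c, including the per-channel average pooling, can be sketched analogously. Treating W_c ⊗ v as an outer product (one k-dimensional code per channel) is our reading of the formula; the parameter names and shapes are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def channel_attention(U, h_prev, Wc, Whc, bc, wi, bi):
    """U: (C, W, H) feature map, h_prev: (d,) LSTM state. Returns beta: (C,)."""
    v = U.reshape(U.shape[0], -1).mean(axis=1)        # (C,) mean-pool each channel
    # Outer product gives one k-dim code per channel, shifted by the context.
    b = np.tanh(np.outer(v, Wc) + Whc @ h_prev + bc)  # (C, k)
    beta = softmax(b @ wi + bi)                       # (C,) weights over channels
    return beta

rng = np.random.default_rng(0)
C, W, H, d, k = 6, 3, 3, 5, 4
beta = channel_attention(rng.standard_normal((C, W, H)),
                         rng.standard_normal(d),
                         rng.standard_normal(k),
                         rng.standard_normal((k, d)),
                         rng.standard_normal(k),
                         rng.standard_normal(k),
                         0.0)
print(beta.shape, round(beta.sum(), 6))  # (6,) 1.0
```

The pooling step is what makes this attention "what"-oriented: each channel is summarized by a single activation level before the context decides which detectors to amplify.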

3.4 Hybrid attention mechanisms

Depending on the order in which channel-wise attention and spatial attention are applied, there are two types of models that contain both attention mechanisms. We distinguish them as follows:

Channel-Spatial attention. The first type, called Channel-Spatial (C-S), applies channel-wise attention before spatial attention; its flowchart is shown in Figure 2. First, given the initial feature map V, we use the channel-wise attention model Φ_c to obtain the channel attention weights β. Through a linear combination of β and V, we obtain a channel-weighted feature map, which is then fed to the spatial attention model Φ_s to obtain the spatial attention weights α. Given the two attention weights α and β, we feed V, β, and α to the modulation function f to compute the modulated feature map X. The whole process is summarized as follows:
β = Φ_c(h_{t−1}, V)
α = Φ_s(h_{t−1}, f_c(V, β))
X = f(V, α, β)

Spatial-Channel attention. The second type, called Spatial-Channel (S-C), applies spatial attention first. For the S-C type, given the initial feature map V, we first use the spatial attention model Φ_s to obtain the spatial attention weights α. Based on α, the linear function f_s(·), and the channel-wise attention model Φ_c, we can compute the modulated feature X following the same recipe as the C-S type:
α = Φ_s(h_{t−1}, V)
β = Φ_c(h_{t−1}, f_s(V, α))
X = f(V, α, β)
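The two orderings differ only in which attention sees the raw map and which sees the already-weighted map; the final modulation f(V, α, β) is shared. A toy numpy sketch (phi_s and phi_c here are illustrative placeholders, not the trained models):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def phi_s(h, V):
    # illustrative spatial attention: one weight per position
    return softmax(V.mean(axis=0) + h.mean())   # (m,)

def phi_c(h, V):
    # illustrative channel attention: one weight per channel
    return softmax(V.mean(axis=1) + h.mean())   # (C,)

def modulate(V, h, order="C-S"):
    """V: (C, m) feature map with m = W*H, h: (d,) sentence context."""
    if order == "C-S":
        beta = phi_c(h, V)
        alpha = phi_s(h, V * beta[:, None])     # spatial attention on channel-weighted map
    else:  # "S-C"
        alpha = phi_s(h, V)
        beta = phi_c(h, V * alpha[None, :])     # channel attention on spatially weighted map
    return V * beta[:, None] * alpha[None, :]   # X = f(V, alpha, beta), element-wise

rng = np.random.default_rng(1)
V, h = rng.standard_normal((4, 9)), rng.standard_normal(6)
print(modulate(V, h, "C-S").shape)  # (4, 9)
```

Either way, X keeps the full (C, m) shape, so later layers still receive a spatially resolved map rather than a pooled vector.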

4. Experiments

We validate the effectiveness of the proposed SCA-CNN image-captioning framework by answering the following questions:

  • First, is channel-wise attention effective? Does it improve over spatial attention?
  • Second, is multi-layer attention effective? How does SCA-CNN perform compared with other state-of-the-art visual attention models?

4.1 Datasets and evaluation metrics

We conducted experiments on three well-known benchmarks:

  • 1) Flickr8k: contains 8,000 images. Following its official split, 6,000 images are used for training, 1,000 for validation, and 1,000 for testing;
  • 2) Flickr30k: contains 31,000 images. As there is no official split, for a fair comparison with previous work we report results on the publicly available split used there: 29,000 images for training, 1,000 for validation, and 1,000 for testing;
  • 3) MSCOCO: the training set contains 82,783 images, the validation set 40,504 images, and the test set 40,775 images. Since the ground truth of the MSCOCO test set is not available, the validation set is further divided into a validation subset for model selection and a test subset for offline experiments. Following this split, the entire 82,783-image training set is used for training, and 5,000 images are selected from the official validation set for validation and another 5,000 for testing. For sentence preprocessing, we follow the publicly available code. We use BLEU (B@1, B@2, B@3, B@4), METEOR (MT), CIDEr (CD), and ROUGE-L (RG) as evaluation metrics. In short, all four metrics measure the n-gram agreement between a generated sentence and the ground-truth sentences, weighted by the saliency and rarity of the n-grams. All four metrics can be computed directly with the MSCOCO caption evaluation tool, and our source code is publicly available.
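The n-gram agreement idea behind these metrics can be illustrated with a minimal modified n-gram precision, the core quantity of BLEU. This is a toy stdlib sketch for intuition, not the official MSCOCO evaluation code (which also handles multiple references, brevity penalties, and the other metrics).

```python
from collections import Counter

def modified_ngram_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also occur in the reference,
    clipping each n-gram's count by its count in the reference."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate.split()), ngrams(reference.split())
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

p = modified_ngram_precision("a cake with lit candles", "a cake with candles", n=1)
print(p)  # 0.8 -> 4 of the 5 candidate words are supported by the reference
```

Count clipping is what stops a degenerate caption like "cake cake cake" from scoring well: each repeated n-gram is credited at most as often as it appears in the reference.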


5. Conclusion

This paper proposes a novel deep attention model, SCA-CNN, for image captioning. SCA-CNN takes full advantage of the characteristics of CNN features to generate attentive image features that are spatial, channel-wise, and multi-layer, achieving state-of-the-art performance on popular benchmarks. The contribution of SCA-CNN is not only a more powerful attention model but also a better understanding of where (i.e., spatial) and what (i.e., channel-wise) the attention is as CNN features evolve during sentence generation. In future work, we plan to introduce temporal attention into SCA-CNN so as to attend to video features across different frames for video captioning. We will also study how to increase the number of attentive layers without overfitting.

Original site

Copyright notice
This article was written by [AI bacteria]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/164/202206130035042351.html