Paper reproduction - AC-FPN: Attention-guided Context Feature Pyramid Network for Object Detection

2022-06-29 12:29:00 【RooKiChen】

The paper reproduced here has been open-sourced, but the official code depends on the Detectron library, whereas most papers now use mmdetection (if you need to install the mmdetection library, see my previous article: Ubuntu install mmdetection). If you find configuring the Detectron library troublesome, you can refer to my reproduced code.
AC-FPN original paper: https://arxiv.org/pdf/2005.11475.pdf
AC-FPN official code: https://github.com/Caojunxu/AC-FPN

CxAM and CnAM are not implemented in the official source code! The reason is that the authors ran many experiments and found that the CEM module on its own also works well.

Some implementation details in the paper are not clear, so I reproduced them according to my own understanding. If you have a different approach, you are welcome to discuss it in the comment area.

1. AC-FPN overall structure

AC-FPN is designed to resolve the conflict between receptive field and feature map resolution on high-resolution images. Intuitively, high-resolution images need a larger receptive field, but a large receptive field detects small objects poorly and tends to misclassify them as background. To address this, AC-FPN proposes two modules: the Context Extraction Module (CEM) and the Attention-guided Module (AM). AM itself has two sub-modules: the Context Attention Module (CxAM) and the Content Attention Module (CnAM). The innovation is to extract features with dilated (atrous) convolutions at different dilation rates; the CnAM and CxAM sub-modules of AM follow an idea similar to self-attention.
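To make the effect of the dilation rate concrete, here is a small arithmetic sketch. A k x k convolution with dilation d spans d*(k-1)+1 positions per side, so the receptive field grows without adding parameters. The rates 3, 6, 12, 18, 24 are an assumption on my part (the text above only says that several rates are used), so check them against the paper or the official code.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Side length covered by a dilated kernel: d * (k - 1) + 1."""
    return dilation * (kernel_size - 1) + 1


# Assumed dilation rates, for illustration only.
ASSUMED_RATES = (3, 6, 12, 18, 24)

# Map each rate to the span of a 3x3 kernel at that rate.
sizes = {rate: effective_kernel_size(3, rate) for rate in ASSUMED_RATES}

for rate, size in sizes.items():
    print(f"dilation={rate:2d}: a 3x3 kernel spans {size}x{size} positions")
```

A 3x3 kernel always has 9 weights, so the larger rates enlarge the receptive field at the same parameter cost, which is why several rates are combined.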

2. CEM: Context Extraction Module

In the original paper, the authors say that the CEM module applies dilated convolutions with five different rates to the feature map F5 and then densely connects them (see the original paper for the dense connections). The structure diagram in the paper also shows arrows linking each feature map to the others. Deformable convolution is then applied to each feature map, and finally these feature maps are concatenated and passed through a 1x1 convolution to adjust the number of channels. However, there is no deformable convolution in the source code, and the paper mentions it only in passing, so I did not add deformable convolution in my implementation. The structure diagram also shows an upsampling operation below the dilated convolution branches, but it is not implemented in the source code either. I think it compresses the feature map into a 1x1xC feature vector whose function should be similar to that of spatial attention. Perhaps the authors found that adding it did not help, so the structure was dropped from the source code.
Personally, I think the reason CEM works so well is that a GroupNorm operation is added after each dilated convolution, which is not mentioned in the original paper. Unlike BatchNorm, GroupNorm does not need a large batch_size (when training on the COCO dataset, batch_size is usually 2 per GPU).
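The description above can be sketched in PyTorch roughly as follows. This is a minimal sketch under my own assumptions, not the official implementation: the class names, channel widths, the 1x1 reduction in front of each dilated convolution, and the dilation rates are all hypothetical choices. It only illustrates dense connections, GroupNorm after each dilated convolution, and the final 1x1 fusion, and it has no deformable convolution.

```python
import torch
from torch import nn


class DenseAtrousBranch(nn.Module):
    """Hypothetical branch: 1x1 reduction, then a 3x3 dilated conv with GroupNorm + ReLU."""

    def __init__(self, in_channels, mid_channels, out_channels, dilation, groups):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False),
            nn.GroupNorm(groups, mid_channels),
            nn.ReLU(inplace=True),
            # padding == dilation keeps the spatial size for a 3x3 kernel
            nn.Conv2d(mid_channels, out_channels, kernel_size=3,
                      padding=dilation, dilation=dilation, bias=False),
            nn.GroupNorm(groups, out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)


class CEMSketch(nn.Module):
    """Densely connected dilated branches on F5, fused by a 1x1 convolution."""

    def __init__(self, in_channels=2048, mid_channels=512, branch_channels=256,
                 out_channels=256, rates=(3, 6, 12, 18, 24), groups=32):
        super().__init__()
        self.branches = nn.ModuleList()
        channels = in_channels
        for rate in rates:
            self.branches.append(
                DenseAtrousBranch(channels, mid_channels, branch_channels, rate, groups))
            channels += branch_channels  # dense link: later branches see earlier outputs
        self.fuse = nn.Conv2d(len(rates) * branch_channels, out_channels, kernel_size=1)

    def forward(self, f5):
        features = [f5]
        for branch in self.branches:
            # each branch takes F5 concatenated with all previous branch outputs
            features.append(branch(torch.cat(features, dim=1)))
        # concatenate the branch outputs (not F5 itself) and adjust the channel count
        return self.fuse(torch.cat(features[1:], dim=1))


# Small channel counts so the example runs quickly on CPU.
cem = CEMSketch(in_channels=64, mid_channels=32, branch_channels=32, out_channels=32)
out = cem(torch.randn(2, 64, 16, 16))
print(out.shape)
```

GroupNorm computes its statistics per sample, so the block behaves the same with 2 images per GPU as with a large batch, which fits the observation above.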

3. AM: Attention-guided Module

3.1 CxAM

This module is an ordinary self-attention; the only difference is that an average pooling operation is added after the feature map R. F is the output feature of CEM; it contains multi-scale receptive field information and is fed into the CxAM module. Based on this information, CxAM adaptively focuses on the relationships between related subregions. Therefore, the output features of CxAM have clear semantics and contain the contextual dependencies within surrounding objects.

3.2 CnAM

CnAM has the same structure as CxAM. The original paper says that because CEM uses deformable convolution, the geometric properties of the given image are severely damaged, which causes location shifts. The authors therefore designed a new attention module, called the Content Attention Module (CnAM), to maintain accurate location information for each object.
The difference is that CnAM takes the F5 feature map as an input to make up for the damaged location information.
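Since the official code implements neither module, the following is only my own sketch of how CxAM and CnAM could look, reusing one hypothetical class for both. The assumptions: queries and keys come from 1x1 convolutions, the relation matrix R is passed through a sigmoid and averaged over one axis to get a single-channel attention map, and that map reweights a 1x1-projected copy of F. For CxAM the queries and keys come from F itself; for CnAM they come from F5. The channel reduction factor of 8 is arbitrary, and F5 is assumed to have the same spatial size as F.

```python
import torch
from torch import nn


class AttentionSketch(nn.Module):
    """Hypothetical CxAM/CnAM-style block; not taken from the official repository."""

    def __init__(self, value_channels, query_channels=None, reduced_channels=None):
        super().__init__()
        query_channels = query_channels or value_channels
        reduced_channels = reduced_channels or max(query_channels // 8, 1)
        self.query = nn.Conv2d(query_channels, reduced_channels, kernel_size=1)
        self.key = nn.Conv2d(query_channels, reduced_channels, kernel_size=1)
        self.value = nn.Conv2d(value_channels, value_channels, kernel_size=1)

    def forward(self, f, f5=None):
        # CxAM: queries/keys from F; CnAM: queries/keys from F5 (same H and W assumed)
        source = f if f5 is None else f5
        b, _, h, w = f.shape
        q = self.query(source).flatten(2)             # (B, C', N) with N = H * W
        k = self.key(source).flatten(2)               # (B, C', N)
        r = torch.bmm(q.transpose(1, 2), k)           # relation matrix R: (B, N, N)
        # sigmoid, then average over the first N axis -> one weight per position
        attention = torch.sigmoid(r).mean(dim=1).reshape(b, 1, h, w)
        return attention * self.value(f)              # reweight the projected F


f = torch.randn(2, 32, 8, 8)     # stands in for the CEM output F
f5 = torch.randn(2, 64, 8, 8)    # stands in for the backbone feature F5

cxam = AttentionSketch(value_channels=32)
cnam = AttentionSketch(value_channels=32, query_channels=64)

e_context = cxam(f)
e_content = cnam(f, f5)
print(e_context.shape, e_content.shape)
```

The relation matrix has N x N entries, so memory grows quadratically with the number of positions; this is manageable on F5 because it is the lowest-resolution level.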

4. Training strategy

Here is my personal training strategy: the backbone network is ResNet50, trained on the COCO dataset for 12 epochs using 8 GPUs with 40 GB of memory each and 2 images per GPU. The initial learning rate is 0.02, and it is multiplied by 0.1 at epoch 8 and at epoch 11. The results are not bad.
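As a quick check of the numbers, the schedule above can be written out in plain Python. This sketch assumes epochs are counted from 0 and that the decay takes effect once the epoch index reaches a milestone; it ignores any warmup. The function name `step_lr` is only illustrative.

```python
def step_lr(epoch, base_lr=0.02, milestones=(8, 11), gamma=0.1):
    """Step decay: multiply base_lr by gamma once for every milestone already reached."""
    passed = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * gamma ** passed


gpus, images_per_gpu, total_epochs = 8, 2, 12
total_batch_size = gpus * images_per_gpu  # images processed per optimizer step

schedule = [step_lr(epoch) for epoch in range(total_epochs)]

print(f"total batch size: {total_batch_size}")
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch:2d}: lr = {lr:.4f}")
```

If you train with fewer GPUs, the total batch size shrinks, and a common heuristic is to scale the initial learning rate in proportion to it.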