Feature pyramid networks for object detection
2022-07-01 03:29:00 【It's seventh uncle】
Abstract
Feature pyramids are a basic component of recognition systems for detecting objects at different scales. But recent deep learning object detectors have avoided pyramid representations, in part because they are compute- and memory-intensive. In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. A top-down architecture with lateral connections is developed for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries, including those from the winners of the COCO 2016 challenge. In addition, our method can run at 6 FPS on a GPU and is thus a practical and accurate solution to multi-scale object detection. Code will be made publicly available.
Feature Pyramid Networks

- (a) Recognizing objects at different scales is a fundamental challenge in computer vision; the common solution is to construct multi-scale image pyramids.
- (b) Researchers found that we can exploit a property of convolutional networks: convolving and pooling the original image yields feature maps of different sizes, which is analogous to building a pyramid in the feature space of the image. Experiments show that shallow layers attend more to details, while deeper layers attend more to semantic information, and it is the high-level semantic information that helps us detect objects accurately; so we can predict from the feature map of the last convolutional layer.
This approach is used by most deep networks, such as VGG, ResNet, and Inception: they all classify from the last feature map of the deep network. Its advantage is that it is fast and requires little memory. Its disadvantage is that we only use the features of the last layer and ignore the features of the other layers, even though detail information can improve detection accuracy to some extent.
- (c) The architecture in figure (c) is designed to use both low-level and high-level features, making predictions at several layers at the same time. A single image may contain multiple objects of different sizes, and distinguishing them may require different features: a simple object can be detected from shallow features, while a complex object needs richer features. The whole process is to run the deep convolutions on the original image and then predict on the different feature layers. Its advantage is that objects are output at different layers without having to pass through all layers (i.e., for some objects no unnecessary forward computation is needed), which speeds up the network to some extent while also improving detection performance.
Its disadvantage is that the obtained features are not robust; they are rather weak features (because many of them come from shallow layers).
- (d) The FPN architecture is shown in figure (d). The whole process is as follows: first apply the deep convolutions to the input image, then reduce the dimensionality of the features on Layer2 (by adding a 1×1 convolution layer) and upsample the features on Layer4 to the matching size; the processed Layer2 and the processed Layer4 are then added element-wise, and the result is fed into Layer5. The idea behind this is to obtain strong semantic information, which improves detection performance. If you look carefully, you may have noticed that this time we use deeper layers to construct the feature pyramid, so as to use more robust information. Besides, we sum the processed low-level features and the processed high-level features: low-level features provide more accurate location information, while the many downsampling and upsampling operations make the localization of the deep network inaccurate, so we combine the two. In this way we build a deeper feature pyramid that fuses multi-layer feature information and makes predictions on different feature maps. This is the detailed explanation of the figure above (just a personal interpretation). A minimal sketch of this merge step follows.
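To make the figure (d) merge step concrete, here is a minimal sketch in PyTorch (the post itself contains no code, so the tensor shapes and names below are illustrative assumptions, not the authors' implementation): the shallow map is channel-reduced with a 1×1 convolution, the deep map is projected and upsampled to the same spatial size, and the two are added element-wise.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch of the figure (d) merge step (assumed shapes, not the
# paper's code): a shallow map ("Layer2") is channel-reduced with a 1x1 conv,
# a deep map ("Layer4") is projected and upsampled, then the two are added.
shallow = torch.randn(1, 256, 64, 64)   # higher resolution, weaker semantics
deep    = torch.randn(1, 1024, 32, 32)  # lower resolution, stronger semantics

reduce_1x1 = nn.Conv2d(256, 256, kernel_size=1)   # dimensionality reduction
project    = nn.Conv2d(1024, 256, kernel_size=1)  # match the channel count

up = F.interpolate(project(deep), scale_factor=2, mode="nearest")
merged = reduce_1x1(shallow) + up  # element-wise addition -> (1, 256, 64, 64)
# (in the figure, this merged result is then fed into the next layer, "Layer5")
```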
Reference: FPN detailed explanation
Our goal is to leverage the pyramidal feature hierarchy of a ConvNet, which has semantics from low to high levels, and to build a feature pyramid with high-level semantics throughout. The resulting Feature Pyramid Network is general-purpose; in this paper we focus on sliding-window proposers (Region Proposal Network, RPN for short) [29] and region-based detectors (Fast R-CNN) [11]. We also generalize FPN to instance segmentation in Sec. 6.
Our method takes a single-scale image of arbitrary size as input and outputs proportionally sized feature maps at multiple levels, in a fully convolutional fashion. This process is independent of the backbone convolutional architecture (e.g., [19, 36, 16]); in this paper we present results using ResNets [16]. The construction of our pyramid involves a bottom-up pathway, a top-down pathway, and lateral connections, as introduced below.
Bottom-up pathway
The bottom-up pathway is the feed-forward computation of the backbone ConvNet, which computes a feature hierarchy consisting of feature maps at several scales with a scaling step of 2. There are often many layers producing output maps of the same size; we say these layers are in the same network stage. For the feature pyramid, we define one pyramid level per stage. We choose the output of the last layer of each stage as our reference set of feature maps, which we will enrich to create the pyramid. This choice is natural, since the deepest layer of each stage should have the strongest features.
Specifically, for ResNets [16], we use the feature activations output by the last residual block of each stage. For the conv2, conv3, conv4, and conv5 outputs, we denote the outputs of these last residual blocks as {C2, C3, C4, C5}, and note that they have strides of {4, 8, 16, 32} pixels with respect to the input image. We do not include conv1 in the pyramid because of its large memory footprint.
- The bottom-up process is the ordinary forward propagation of the network: feature maps are computed by the convolution kernels and usually get smaller and smaller. Specifically, for ResNets we use the feature activations output by the last residual block of each stage. For the conv2, conv3, conv4, and conv5 outputs, we denote the outputs of the last residual blocks as {C2, C3, C4, C5}; relative to the input image they have strides of {4, 8, 16, 32}. Because of its huge memory footprint, we do not include conv1 in the pyramid. A minimal sketch of extracting these stage outputs is shown below.
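As a concrete illustration, here is a minimal sketch that extracts {C2, C3, C4, C5} from a torchvision ResNet-50 (an assumption for illustration; the post does not specify an implementation). torchvision exposes the conv2_x–conv5_x stages as layer1–layer4:

```python
import torch
from torchvision.models import resnet50

# Sketch (assumes torchvision): run the ResNet-50 forward pass by hand and
# keep the last activation of each stage as {C2, C3, C4, C5}.
net = resnet50(weights=None).eval()

def bottom_up(x):
    x = net.maxpool(net.relu(net.bn1(net.conv1(x))))  # stem (conv1), excluded from the pyramid
    c2 = net.layer1(x)   # stride 4,  256 channels
    c3 = net.layer2(c2)  # stride 8,  512 channels
    c4 = net.layer3(c3)  # stride 16, 1024 channels
    c5 = net.layer4(c4)  # stride 32, 2048 channels
    return c2, c3, c4, c5

with torch.no_grad():
    feats = bottom_up(torch.randn(1, 3, 224, 224))
print([tuple(f.shape) for f in feats])
# [(1, 256, 56, 56), (1, 512, 28, 28), (1, 1024, 14, 14), (1, 2048, 7, 7)]
```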
Top-down pathway and lateral connections
- The top-down pathway produces higher-resolution features by upsampling spatially coarser, but semantically stronger, feature maps from higher pyramid levels.
- These features are then enhanced with features from the bottom-up pathway via lateral connections. Each lateral connection merges feature maps of the same spatial size from the bottom-up pathway and the top-down pathway. The bottom-up feature map is of lower-level semantics, but its activations are more accurately localized because it has been subsampled fewer times.
Figure 3 shows the building block that constructs the top-down feature maps. Given a coarser-resolution feature map, we upsample its spatial resolution by a factor of 2 (using nearest-neighbor upsampling for simplicity).
The upsampled map is then merged with the corresponding bottom-up map by element-wise addition. This process iterates until the finest-resolution map is generated. To start the iteration, we simply attach a 1×1 convolutional layer on C5, which reduces the channel dimension for the lateral connection, to produce the coarsest-resolution map. Finally, we append a 3×3 convolution on each merged map to generate the final feature map, which reduces the aliasing effect of upsampling. This final set of feature maps is called {P2, P3, P4, P5}, corresponding to {C2, C3, C4, C5}, which are respectively of the same spatial sizes.
Because all levels of the pyramid use shared classifiers/regressors, as with a traditional featurized image pyramid, we fix the feature dimension (the number of channels, denoted d) in all the feature maps. We set d = 256 in this paper, so all the extra convolutional layers have 256-channel outputs. There are no non-linearities in these extra layers, which we have empirically found to have minor impact. Simplicity is central to our design, and we found our model to be robust to many design choices. We experimented with more sophisticated blocks (e.g., using multi-layer residual blocks as the connections) and observed marginally better results. Designing better connection modules is not the focus of this paper, so we opt for the simple design described above.
- The top-down process upsamples the more abstract, semantically stronger high-level feature maps, and the lateral connections merge each upsampled result with the bottom-up feature map of the same size. The two laterally connected feature maps have the same spatial dimensions, which makes it possible to exploit the precise localization details of the lower layers. A lower-resolution feature map is upsampled by a factor of 2 (nearest-neighbor upsampling, for simplicity); the upsampled map is then merged with the corresponding bottom-up map by element-wise addition. This process iterates until the finest-resolution map is generated.
- To start the iteration, we simply attach a 1×1 convolutional layer on C5 to generate the coarsest-resolution map P5. Finally, we append a 3×3 convolution on each merged map to generate the final feature map, which reduces the aliasing effect of upsampling. This final set of feature maps is called {P2, P3, P4, P5}, corresponding to {C2, C3, C4, C5} of the same spatial sizes.
- Because all levels of the pyramid use shared classifiers/regressors, as with a traditional featurized image pyramid, we fix the feature dimension (the number of channels, denoted d) in all the feature maps. Here we set d = 256, so all the extra convolutional layers have 256-channel outputs. The sketch below puts the whole construction together.
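Putting the pieces together, the following is a minimal FPN sketch in PyTorch (an illustrative assumption, not the authors' released code; channel counts assume ResNet-50 stage outputs {C2, C3, C4, C5} and d = 256):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNSketch(nn.Module):
    """Minimal FPN top-down pathway (a sketch, not the reference code)."""

    def __init__(self, in_channels=(256, 512, 1024, 2048), d=256):
        super().__init__()
        # 1x1 lateral convolutions: bring every Ci down to d channels.
        self.lateral = nn.ModuleList(nn.Conv2d(c, d, 1) for c in in_channels)
        # 3x3 output convolutions: reduce the aliasing effect of upsampling.
        self.output = nn.ModuleList(nn.Conv2d(d, d, 3, padding=1) for _ in in_channels)

    def forward(self, c2, c3, c4, c5):
        # The 1x1 conv on C5 starts the iteration with the coarsest-resolution map.
        laterals = [conv(c) for conv, c in zip(self.lateral, (c2, c3, c4, c5))]
        merged = [laterals[-1]]
        # Walk top-down: upsample by 2 (nearest neighbor), add the lateral map.
        for lat in reversed(laterals[:-1]):
            merged.append(lat + F.interpolate(merged[-1], scale_factor=2, mode="nearest"))
        merged = merged[::-1]  # reorder to [finest, ..., coarsest]
        # A final 3x3 conv on each merged map yields {P2, P3, P4, P5}.
        return [conv(m) for conv, m in zip(self.output, merged)]

# Usage with dummy stage outputs (strides 4, 8, 16, 32 for a 224x224 input):
fpn = FPNSketch()
c2, c3, c4, c5 = (torch.randn(1, c, s, s)
                  for c, s in [(256, 56), (512, 28), (1024, 14), (2048, 7)])
p2, p3, p4, p5 = fpn(c2, c3, c4, c5)
print([tuple(p.shape) for p in (p2, p3, p4, p5)])
# [(1, 256, 56, 56), (1, 256, 28, 28), (1, 256, 14, 14), (1, 256, 7, 7)]
```

All four pyramid levels share the same 256-channel output dimension, which is what allows a single classifier/regressor head to be shared across levels, as described above.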
Reference: FPN paper interpretation and code details