Feature pyramid networks for object detection
2022-07-01 03:29:00 【It's seventh uncle】
Abstract
Feature pyramids are a basic component of recognition systems for detecting objects at different scales. But recent deep learning object detectors have avoided pyramid representations, in part because they are compute- and memory-intensive. In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. A top-down architecture with lateral connections is developed for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries, including those from the winners of the COCO 2016 challenge. In addition, our method can run at 6 FPS on a GPU and is thus a practical and accurate solution to multi-scale object detection. Code will be made publicly available.
Feature Pyramid Networks

- (a) Recognizing objects at different scales is a fundamental challenge in computer vision; the common solution is to construct multi-scale image pyramids.
- (b) Researchers found that we can exploit a property of convolutional networks: convolving and pooling the original image yields feature maps of different sizes, which is analogous to building a pyramid in the feature space of the image. Experiments show that shallow layers attend more to details, while deeper layers attend more to semantic information, and it is the high-level semantic information that helps us detect objects accurately; so we can predict from the feature map of the last convolutional layer.
This approach is used by most deep networks, such as VGG, ResNet, and Inception: they all classify from the last feature map of the deep network. Its advantage is that it is fast and requires little memory. Its disadvantage is that we only use the features of the last layer and ignore the features of the other layers, even though detail information can improve detection accuracy to some extent.
- (c) The architecture in figure (c) is designed to use both low-level and high-level features, making predictions at several layers at the same time. A single image may contain multiple objects of different sizes, and distinguishing them may require different features: a simple object can be detected from shallow features, while a complex object needs richer features. The whole process is to run the deep convolutions on the original image and then predict on the different feature layers. Its advantage is that objects are output at different layers without having to pass through all layers (i.e., for some objects no unnecessary forward computation is needed), which speeds up the network to some extent while also improving detection performance.
Its disadvantage is that the obtained features are not robust; they are rather weak features (because many of them come from shallow layers).
- (d) The FPN architecture is shown in figure (d). The whole process is as follows: first apply the deep convolutions to the input image, then reduce the dimensionality of the features on Layer2 (by adding a 1×1 convolution layer) and upsample the features on Layer4 to the matching size; the processed Layer2 and the processed Layer4 are then added element-wise, and the result is fed into Layer5. The idea behind this is to obtain strong semantic information, which improves detection performance. If you look carefully, you may have noticed that this time we use deeper layers to construct the feature pyramid, so as to use more robust information. Besides, we sum the processed low-level features and the processed high-level features: low-level features provide more accurate location information, while the many downsampling and upsampling operations make the localization of the deep network inaccurate, so we combine the two. In this way we build a deeper feature pyramid that fuses multi-layer feature information and makes predictions on different feature maps. This is the detailed explanation of the figure above (just a personal interpretation). A minimal sketch of this merge step follows.
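To make the figure (d) merge step concrete, here is a minimal sketch in PyTorch (the post itself contains no code, so the tensor shapes and names below are illustrative assumptions, not the authors' implementation): the shallow map is channel-reduced with a 1×1 convolution, the deep map is projected and upsampled to the same spatial size, and the two are added element-wise.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch of the figure (d) merge step (assumed shapes, not the
# paper's code): a shallow map ("Layer2") is channel-reduced with a 1x1 conv,
# a deep map ("Layer4") is projected and upsampled, then the two are added.
shallow = torch.randn(1, 256, 64, 64)   # higher resolution, weaker semantics
deep    = torch.randn(1, 1024, 32, 32)  # lower resolution, stronger semantics

reduce_1x1 = nn.Conv2d(256, 256, kernel_size=1)   # dimensionality reduction
project    = nn.Conv2d(1024, 256, kernel_size=1)  # match the channel count

up = F.interpolate(project(deep), scale_factor=2, mode="nearest")
merged = reduce_1x1(shallow) + up  # element-wise addition -> (1, 256, 64, 64)
# (in the figure, this merged result is then fed into the next layer, "Layer5")
```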
Reference: FPN detailed explanation
Our goal is to leverage the pyramidal feature hierarchy of a ConvNet, which has semantics from low to high levels, and to build a feature pyramid with high-level semantics throughout. The resulting Feature Pyramid Network is general-purpose; in this paper we focus on sliding-window proposers (Region Proposal Network, RPN for short) [29] and region-based detectors (Fast R-CNN) [11]. We also generalize FPN to instance segmentation in Sec. 6.
Our method takes a single-scale image of arbitrary size as input and outputs proportionally sized feature maps at multiple levels, in a fully convolutional fashion. This process is independent of the backbone convolutional architecture (e.g., [19, 36, 16]); in this paper we present results using ResNets [16]. The construction of our pyramid involves a bottom-up pathway, a top-down pathway, and lateral connections, as introduced below.
Bottom-up pathway
The bottom-up pathway is the feed-forward computation of the backbone ConvNet, which computes a feature hierarchy consisting of feature maps at several scales with a scaling step of 2. There are often many layers producing output maps of the same size; we say these layers are in the same network stage. For the feature pyramid, we define one pyramid level per stage. We choose the output of the last layer of each stage as our reference set of feature maps, which we will enrich to create the pyramid. This choice is natural, since the deepest layer of each stage should have the strongest features.
Specifically, for ResNets [16], we use the feature activations output by the last residual block of each stage. For the conv2, conv3, conv4, and conv5 outputs, we denote the outputs of these last residual blocks as {C2, C3, C4, C5}, and note that they have strides of {4, 8, 16, 32} pixels with respect to the input image. We do not include conv1 in the pyramid because of its large memory footprint.
- The bottom-up process is the ordinary forward propagation of the network: feature maps are computed by the convolution kernels and usually get smaller and smaller. Specifically, for ResNets we use the feature activations output by the last residual block of each stage. For the conv2, conv3, conv4, and conv5 outputs, we denote the outputs of the last residual blocks as {C2, C3, C4, C5}; relative to the input image they have strides of {4, 8, 16, 32}. Because of its huge memory footprint, we do not include conv1 in the pyramid. A minimal sketch of extracting these stage outputs is shown below.
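As a concrete illustration, here is a minimal sketch that extracts {C2, C3, C4, C5} from a torchvision ResNet-50 (an assumption for illustration; the post does not specify an implementation). torchvision exposes the conv2_x–conv5_x stages as layer1–layer4:

```python
import torch
from torchvision.models import resnet50

# Sketch (assumes torchvision): run the ResNet-50 forward pass by hand and
# keep the last activation of each stage as {C2, C3, C4, C5}.
net = resnet50(weights=None).eval()

def bottom_up(x):
    x = net.maxpool(net.relu(net.bn1(net.conv1(x))))  # stem (conv1), excluded from the pyramid
    c2 = net.layer1(x)   # stride 4,  256 channels
    c3 = net.layer2(c2)  # stride 8,  512 channels
    c4 = net.layer3(c3)  # stride 16, 1024 channels
    c5 = net.layer4(c4)  # stride 32, 2048 channels
    return c2, c3, c4, c5

with torch.no_grad():
    feats = bottom_up(torch.randn(1, 3, 224, 224))
print([tuple(f.shape) for f in feats])
# [(1, 256, 56, 56), (1, 512, 28, 28), (1, 1024, 14, 14), (1, 2048, 7, 7)]
```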
Top-down pathway and lateral connections
- The top-down pathway produces higher-resolution features by upsampling spatially coarser, but semantically stronger, feature maps from higher pyramid levels.
- These features are then enhanced with features from the bottom-up pathway via lateral connections. Each lateral connection merges feature maps of the same spatial size from the bottom-up pathway and the top-down pathway. The bottom-up feature map is of lower-level semantics, but its activations are more accurately localized because it has been subsampled fewer times.
Figure 3 shows the building block that constructs the top-down feature maps. Given a coarser-resolution feature map, we upsample its spatial resolution by a factor of 2 (using nearest-neighbor upsampling for simplicity).
The upsampled map is then merged with the corresponding bottom-up map by element-wise addition. This process iterates until the finest-resolution map is generated. To start the iteration, we simply attach a 1×1 convolutional layer on C5, which reduces the channel dimension for the lateral connection, to produce the coarsest-resolution map. Finally, we append a 3×3 convolution on each merged map to generate the final feature map, which reduces the aliasing effect of upsampling. This final set of feature maps is called {P2, P3, P4, P5}, corresponding to {C2, C3, C4, C5}, which are respectively of the same spatial sizes.
Because all levels of the pyramid use shared classifiers/regressors, as with a traditional featurized image pyramid, we fix the feature dimension (the number of channels, denoted d) in all the feature maps. We set d = 256 in this paper, so all the extra convolutional layers have 256-channel outputs. There are no non-linearities in these extra layers, which we have empirically found to have minor impact. Simplicity is central to our design, and we found our model to be robust to many design choices. We experimented with more sophisticated blocks (e.g., using multi-layer residual blocks as the connections) and observed marginally better results. Designing better connection modules is not the focus of this paper, so we opt for the simple design described above.
- The top-down process upsamples the more abstract, semantically stronger high-level feature maps, and the lateral connections merge each upsampled result with the bottom-up feature map of the same size. The two laterally connected feature maps have the same spatial dimensions, which makes it possible to exploit the precise localization details of the lower layers. A lower-resolution feature map is upsampled by a factor of 2 (nearest-neighbor upsampling, for simplicity); the upsampled map is then merged with the corresponding bottom-up map by element-wise addition. This process iterates until the finest-resolution map is generated.
- To start the iteration, we simply attach a 1×1 convolutional layer on C5 to generate the coarsest-resolution map P5. Finally, we append a 3×3 convolution on each merged map to generate the final feature map, which reduces the aliasing effect of upsampling. This final set of feature maps is called {P2, P3, P4, P5}, corresponding to {C2, C3, C4, C5} of the same spatial sizes.
- Because all levels of the pyramid use shared classifiers/regressors, as with a traditional featurized image pyramid, we fix the feature dimension (the number of channels, denoted d) in all the feature maps. Here we set d = 256, so all the extra convolutional layers have 256-channel outputs. The sketch below puts the whole construction together.
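Putting the pieces together, the following is a minimal FPN sketch in PyTorch (an illustrative assumption, not the authors' released code; channel counts assume ResNet-50 stage outputs {C2, C3, C4, C5} and d = 256):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNSketch(nn.Module):
    """Minimal FPN top-down pathway (a sketch, not the reference code)."""

    def __init__(self, in_channels=(256, 512, 1024, 2048), d=256):
        super().__init__()
        # 1x1 lateral convolutions: bring every Ci down to d channels.
        self.lateral = nn.ModuleList(nn.Conv2d(c, d, 1) for c in in_channels)
        # 3x3 output convolutions: reduce the aliasing effect of upsampling.
        self.output = nn.ModuleList(nn.Conv2d(d, d, 3, padding=1) for _ in in_channels)

    def forward(self, c2, c3, c4, c5):
        # The 1x1 conv on C5 starts the iteration with the coarsest-resolution map.
        laterals = [conv(c) for conv, c in zip(self.lateral, (c2, c3, c4, c5))]
        merged = [laterals[-1]]
        # Walk top-down: upsample by 2 (nearest neighbor), add the lateral map.
        for lat in reversed(laterals[:-1]):
            merged.append(lat + F.interpolate(merged[-1], scale_factor=2, mode="nearest"))
        merged = merged[::-1]  # reorder to [finest, ..., coarsest]
        # A final 3x3 conv on each merged map yields {P2, P3, P4, P5}.
        return [conv(m) for conv, m in zip(self.output, merged)]

# Usage with dummy stage outputs (strides 4, 8, 16, 32 for a 224x224 input):
fpn = FPNSketch()
c2, c3, c4, c5 = (torch.randn(1, c, s, s)
                  for c, s in [(256, 56), (512, 28), (1024, 14), (2048, 7)])
p2, p3, p4, p5 = fpn(c2, c3, c4, c5)
print([tuple(p.shape) for p in (p2, p3, p4, p5)])
# [(1, 256, 56, 56), (1, 256, 28, 28), (1, 256, 14, 14), (1, 256, 7, 7)]
```

All four pyramid levels share the same 256-channel output dimension, which is what allows a single classifier/regressor head to be shared across levels, as described above.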
Reference: FPN paper interpretation and code details