Detailed explanation of retinanet network structure
2022-07-28 01:12:00 【@BangBang】
1. Overview
The RetinaNet paper: Focal Loss for Dense Object Detection
The paper was published in 2017 at CVPR (Computer Vision and Pattern Recognition). After it was proposed, one-stage networks surpassed two-stage networks for the first time.
The difference between one-stage and two-stage networks
- two-stage: represented by Faster R-CNN. An RPN network first generates proposals, which are then passed to Fast R-CNN to make the final predictions on the targets, so detection happens in two steps.
- one-stage: represented by the SSD and YOLO series. The final result is predicted directly in a single step. Before this paper was proposed, one-stage networks were less accurate than two-stage networks; after it, a one-stage network surpassed two-stage networks for the first time.
RetinaNet performance metrics

From the performance figures reported for RetinaNet, its AP (averaged over IoU thresholds 0.5-0.95) reaches 40.8%. Contemporary one-stage networks such as YOLOv2 and SSD513 all had APs between 21 and 33, and Faster R-CNN, the mainstream two-stage network at the time, reached 36.8, which is still noticeably lower than RetinaNet's 40.8.
2. The detailed structure of the RetinaNet network

Differences between the RetinaNet and FPN network structures
The RetinaNet network structure is similar to the FPN network structure, but differs from FPN in three places.
- First difference: FPN uses C2 to generate P2, but RetinaNet does not generate P2 from C2. The reason given by the paper's authors is that P2 would occupy too many computing resources (its feature map is larger than those of P3-P6), so to save resources they do not use P2 and instead start from C3 to generate P3; otherwise this part of the backbone is similar to FPN. Reference: object detection FPN (Feature Pyramid Networks) usage.
- Second difference: at P6, FPN downsamples with max pooling, while here P6 is obtained by downsampling with a 3x3 convolution with stride 2.
- Third difference: FPN uses P2-P6, but the RetinaNet network uses P3-P7. P7 is obtained from P6 by applying a ReLU activation followed by a 3x3 convolution with stride 2 for downsampling.
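The second and third differences above can be sketched in PyTorch. This is a minimal illustration, not the paper's reference code; it assumes 256 feature channels and that P6 is built from the C5-level map, with P7 built from P6 via ReLU plus a stride-2 3x3 convolution as described above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RetinaNetExtraLevels(nn.Module):
    """Sketch of how RetinaNet derives its extra pyramid levels.

    P6: 3x3 conv with stride 2 (instead of FPN's max pooling).
    P7: ReLU followed by a 3x3 conv with stride 2, applied to P6.
    Channel count 256 is an assumption matching the rest of the head.
    """
    def __init__(self, in_channels=256, out_channels=256):
        super().__init__()
        self.p6 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=2, padding=1)
        self.p7 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=2, padding=1)

    def forward(self, c5):
        p6 = self.p6(c5)            # spatial size halves
        p7 = self.p7(F.relu(p6))    # ReLU first, then stride-2 conv
        return p6, p7

# A stand-in 16x16 C5-level feature map: P6 comes out 8x8, P7 comes out 4x4.
p6, p7 = RetinaNetExtraLevels()(torch.randn(1, 256, 16, 16))
print(p6.shape, p7.shape)
```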
Scales and ratios used on the prediction feature layers
As mentioned in the earlier FPN blog post, each prediction feature layer of FPN uses one scale and 3 ratios, i.e. 3 anchors. In RetinaNet, the authors use 3 scales and 3 ratios, giving 9 different anchors per position.
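The 3 scales x 3 ratios = 9 anchors per position can be generated as follows. The specific values (scale multipliers 2^0, 2^(1/3), 2^(2/3) and aspect ratios 0.5, 1, 2) come from the RetinaNet paper, not from this post, so treat them as an assumption; the base size per level is also illustrative:

```python
import itertools
import math

def make_anchors(base_size):
    """Generate the 9 anchor (w, h) pairs for one feature level:
    3 scales x 3 aspect ratios, keeping the area fixed per scale."""
    scales = [2 ** 0, 2 ** (1 / 3), 2 ** (2 / 3)]  # per the RetinaNet paper
    ratios = [0.5, 1.0, 2.0]                        # h / w aspect ratios
    anchors = []
    for scale, ratio in itertools.product(scales, ratios):
        area = (base_size * scale) ** 2
        w = math.sqrt(area / ratio)   # solve w*h = area with h = ratio*w
        h = w * ratio
        anchors.append((w, h))
    return anchors

print(len(make_anchors(32)))  # → 9
```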
Predictor section
As introduced in the earlier blog post, the FPN network is in fact similar to Faster R-CNN: both are two-stage networks that first generate proposals through an RPN and then produce the final prediction parameters through Fast R-CNN. RetinaNet, however, is a one-stage network that applies a predictor directly.
A single predictor is shared across the prediction feature layers P3-P7. Its details are as follows: it is divided into two branches, a class subnet and a box subnet, which predict the category of each target and the bounding-box regression parameters for each anchor, respectively.
- class subnet: first 4 convolution layers with 3x3 kernels, each followed by a ReLU activation; the final convolution layer has no activation function, likewise with kernel size 3x3 and stride 1, and its channel count is KA. Here K is the number of detection classes (not including a background class), and A is the number of anchors at each position on the prediction feature layer, which is 9 here.
- box subnet: likewise 4 convolution layers with 3x3 kernels, stride 1 and 256 channels; the final convolution layer has no activation function, likewise with kernel size 3x3 and stride 1, and its channel count is 4A, with A corresponding to the anchor count 9. This differs from Faster R-CNN, which generates a set of bounding-box regression parameters for each class of each anchor on the prediction feature layer, so its channel count is 4KA.
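The shared structure of the two subnets described above can be sketched directly. This is a minimal illustration (no weight init, no per-level sharing logic); K=80 is an assumed class count for a COCO-like setup:

```python
import torch
import torch.nn as nn

def subnet(out_channels, in_channels=256):
    """Four 3x3 stride-1 convs, each followed by ReLU, then a final
    3x3 stride-1 conv with no activation -- the structure shared by
    the class subnet and the box subnet."""
    layers = []
    for _ in range(4):
        layers += [nn.Conv2d(in_channels, 256, 3, stride=1, padding=1), nn.ReLU()]
        in_channels = 256
    layers.append(nn.Conv2d(256, out_channels, 3, stride=1, padding=1))
    return nn.Sequential(*layers)

K, A = 80, 9                     # assumed: 80 classes, 9 anchors per position
class_subnet = subnet(K * A)     # channels = KA (no background class)
box_subnet = subnet(4 * A)       # channels = 4A, shared across classes

x = torch.randn(1, 256, 32, 32)  # one prediction feature map
print(class_subnet(x).shape, box_subnet(x).shape)
```

With 4KA instead of 4A in the last line, the same helper would reproduce the Faster R-CNN-style per-class regression head the post contrasts against.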
3. Loss function
Positive and negative sample matching
- As covered in the earlier Faster R-CNN post, all anchors are first matched and divided into positive and negative samples; positives and negatives are then sampled, and the loss is computed on the sampled anchors.
- In RetinaNet we also need to match positive and negative samples. The main differences from Faster R-CNN are: first, the IoU between each anchor and the GT boxes is computed. If the IoU is greater than or equal to 0.5, the anchor is marked as a positive sample; if an anchor's IoU with every GT box is less than 0.4, it is marked as a negative sample. Anchors whose IoU lies in [0.4, 0.5) are discarded.
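The threshold rule above can be sketched as a small labeling function. This is a simplification for illustration (real implementations may also force each GT's best-matching anchor to be positive, which is omitted here), and the -1 "ignore" label is a common convention, not something this post specifies:

```python
import torch

def match_anchors(iou_matrix):
    """Label anchors from a (num_anchors, num_gt) IoU matrix:
    1  = positive (max IoU with any GT >= 0.5),
    0  = negative (max IoU with every GT < 0.4),
    -1 = ignored  (max IoU in [0.4, 0.5))."""
    max_iou, _ = iou_matrix.max(dim=1)
    labels = torch.full((iou_matrix.shape[0],), -1, dtype=torch.long)
    labels[max_iou >= 0.5] = 1
    labels[max_iou < 0.4] = 0
    return labels

# Three anchors vs. one GT box: positive, ignored, negative.
print(match_anchors(torch.tensor([[0.55], [0.42], [0.10]])))  # → tensor([ 1, -1,  0])
```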
Loss function calculation

A core knowledge point of RetinaNet is Focal Loss; see the blog post: Focal Loss explained in detail.
- The loss function above is divided into two parts: classification loss and regression loss.
- The classification loss (Focal Loss) is computed over all positive and negative samples and divided by the number of positive samples. The regression loss is computed over positive samples only, summed and divided by the number of positive samples.
- The classification loss uses sigmoid Focal Loss; the regression loss uses L1 loss.
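The sigmoid Focal Loss mentioned above can be written out in a few lines. A minimal sketch following the paper's formula FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t); the caller is expected to sum the result and divide by the number of positive anchors, and alpha=0.25, gamma=2 are the paper's defaults:

```python
import torch
import torch.nn.functional as F

def sigmoid_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Per-element sigmoid Focal Loss. `targets` are 0/1 per class.
    With gamma=0 and alpha=0.5 this reduces to 0.5 * BCE."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)          # prob of the true label
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return alpha_t * (1 - p_t) ** gamma * ce             # down-weights easy examples

logits = torch.tensor([2.0, -1.0])   # one confident positive, one negative
targets = torch.tensor([1.0, 0.0])
print(sigmoid_focal_loss(logits, targets))
```

The (1 - p_t)^gamma factor is what lets the one-stage detector train on all negatives instead of sampling them: well-classified easy negatives contribute almost nothing to the loss.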