[Reading notes] Summary of the three papers in the R-CNN series
2022-07-28 17:59:00 【jsBeSelf】
1 R-CNN
1.1 Introduction
- R-CNN, namely region proposals (candidate regions) + CNN, is the pioneering work that introduced CNNs into object detection.
1.2 General steps

- As shown in the figure (from the original paper https://arxiv.org/abs/1311.2524), first extract region proposals, about 2000 per image, then crop each proposal's region out of the image and warp it to a uniform size (fully connected (FC) layers are used later, and once an FC layer's size is fixed, the dimension of its input vector is fixed too, so the inputs must all be the same size), feed each warped region into a CNN to extract features, producing a large number of feature maps, and finally classify with SVMs and perform bounding-box regression. A rough sketch of this pipeline follows.
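A runnable, high-level sketch of the pipeline just described. Everything here is a stand-in (random boxes instead of Selective Search, a dummy feature extractor instead of the CNN); `region_proposals` and `warp` are hypothetical helpers meant only to show the data flow, not the paper's code:

```python
import numpy as np

def region_proposals(image, n=2000):
    # Stand-in for Selective Search: n random (x1, y1, x2, y2) boxes.
    h, w = image.shape[:2]
    x1 = np.random.randint(0, w - 32, n)
    y1 = np.random.randint(0, h - 32, n)
    return np.stack([x1, y1,
                     x1 + np.random.randint(16, 32, n),
                     y1 + np.random.randint(16, 32, n)], axis=1)

def warp(image, box, size=227):
    # Crop the proposal and "warp" it to the fixed input size the FC
    # layers require (nearest-neighbour resize, for simplicity).
    x1, y1, x2, y2 = box
    crop = image[y1:y2, x1:x2]
    iy = np.linspace(0, crop.shape[0] - 1, size).astype(int)
    ix = np.linspace(0, crop.shape[1] - 1, size).astype(int)
    return crop[iy][:, ix]

image = np.random.rand(480, 640, 3)
for box in region_proposals(image)[:5]:      # every proposal runs alone,
    patch = warp(image, box)                 # hence the repeated computation
    features = patch.mean(axis=(0, 1))       # dummy stand-in for CNN features
    # ...per-class SVMs would then score `features`,
    # and a bounding-box regressor would refine `box`.
```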
1.3 Characteristics
- The paper highlights two key points: first, feature extraction, applying a high-capacity CNN to bottom-up region proposals; second, the training strategy, pre-training the network on a large dataset and then fine-tuning it for the target domain (similar to the idea of transfer learning).
- Advantages: compared with OverFeat, a model from the same period that also uses a CNN to extract features, R-CNN greatly improved mAP on VOC 2012, truly bringing CNNs into the field of object detection.
- Its disadvantages: it is slow; a great deal of computation is repeated when extracting proposals (each proposal from an image must pass through the subsequent network separately); SVMs are still used for classification, which works poorly here, since more categories require more SVMs and make the training process cumbersome; and the downstream classification and regression networks are separate from the upstream feature-extraction network.
1.4 Knowledge supplement
1) Traditional methods (such as HOG and SIFT) used to be how features were extracted; here a CNN is used instead. Because the CNN shares its weights, the computation is effectively shared across all classes, and after CNN feature extraction the feature map is much smaller than the original image, which improves computational efficiency and saves both memory and time.
2) For classification, hard negative mining is used. Without going into detail, the rough idea is to record the samples the classifier misclassifies in each round of training, keep training on them, and repeat until the classifier's performance stops improving, as in the toy sketch below.
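A toy sketch of that mining loop, assuming scikit-learn's LinearSVC and random stand-in features (the real R-CNN mines negatives from the huge pool of background windows in each image; thresholds here are illustrative):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
pos = rng.normal(1.0, 1.0, (200, 10))          # stand-in positive features
neg_pool = rng.normal(-0.5, 1.0, (5000, 10))   # large pool of negatives

neg = neg_pool[:200]                           # start from a small negative set
for _ in range(5):
    X = np.vstack([pos, neg])
    y = np.array([1] * len(pos) + [0] * len(neg))
    clf = LinearSVC().fit(X, y)
    # "Hard" negatives: pool samples the current model scores above the margin.
    hard = neg_pool[clf.decision_function(neg_pool) > -1.0]
    if len(hard) <= len(neg):                  # no new hard negatives: stop
        break
    neg = hard                                 # retrain with the hard negatives
```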
3) Bounding-box regression trains four parameters: two control the position of the target box's center point, and the other two control the target box's size. The next section introduces this further.
2 Fast R-CNN
2.1 Introduction
- Fast R-CNN, i.e., the Fast Region-based Convolutional Network method, builds on R-CNN to further improve detection accuracy while also speeding up detection.
- R-CNN's speed problem was addressed in SPP-net. A CNN can be deepened to extract more features and raise accuracy, but R-CNN's structure causes too much repeated computation during feature extraction; SPP-net spotted this and improved on it. SPP-net has drawbacks of its own, however: its network parameters cannot be updated during backpropagation (fine-tuning cannot reach the layers below the spatial pyramid pooling layer), so its final detection accuracy is limited. Fast R-CNN solves these problems while keeping the advantages of both models.
2.2 General steps

- The network structure is shown in the figure (from the original paper https://arxiv.org/abs/1504.08083). The steps are: feed the whole image into a CNN to extract features, generate RoIs on the resulting feature map, pass each RoI through an RoI pooling layer to bring it to a fixed size, and after the FC layers split into two sibling branches, so that bounding-box regression and classification are attached to the end of the feature-extraction network and the whole network becomes a single-stage, multi-task model. A minimal sketch of RoI pooling follows.
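A minimal NumPy sketch of RoI max-pooling (an assumption: the RoI is already given in feature-map coordinates; real implementations such as torchvision.ops.roi_pool also handle the image-to-feature-map scaling):

```python
import numpy as np

def roi_pool(feature_map, roi, out_size=7):
    # Split the RoI into an out_size x out_size grid and max-pool each cell,
    # so RoIs of any size come out at a fixed size for the FC layers.
    c = feature_map.shape[0]
    x1, y1, x2, y2 = roi
    xs = np.linspace(x1, x2, out_size + 1).astype(int)
    ys = np.linspace(y1, y2, out_size + 1).astype(int)
    out = np.zeros((c, out_size, out_size), dtype=feature_map.dtype)
    for i in range(out_size):
        for j in range(out_size):
            cell = feature_map[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                                  xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[:, i, j] = cell.max(axis=(1, 2))
    return out

fm = np.random.rand(256, 38, 50)            # C x H x W feature map
print(roi_pool(fm, (10, 5, 30, 20)).shape)  # -> (256, 7, 7), whatever the RoI
```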
2.3 Characteristics
- Several key parts are mentioned in the paper:
- 1) How can accuracy be improved while also speeding up computation? The earlier SPP-net could not update its network parameters through backpropagation because the sampling strategy used by SGD (stochastic gradient descent) when updating parameters was inefficient and wasted computing resources. The sampling strategy is therefore improved: put simply, where a mini-batch originally drew a single RoI from each image in a batch of images, it now draws a batch of RoIs from a small number of images. RoIs from the same image can share computation and memory during forward and backward propagation, so efficiency improves. In addition, structurally, Fast R-CNN streamlines the computation by becoming a multi-task model: classification and bounding-box regression are trained together with the CNN. Because the network performs multiple tasks, the loss is also computed differently (discussed in section 2.4).
- 2) Classification changes from SVM to softmax: an SVM performs strict binary classification and cannot account for overlap between categories, while softmax considers the possibility that the target belongs to each of the categories.
- 3) For the RoI sampling strategy, the paper first takes 25% of the RoIs from those whose IoU with a ground-truth box is at least 0.5, then fills the rest with RoIs whose maximum IoU with ground truth lies in the interval [0.1, 0.5); see the sketch after this list.
- 4) To achieve scale invariance during network training, there was originally a brute-force method: simply process all training and test images to a single uniform size. The image-pyramid method is used here instead: preset several sizes and, when sampling an image during training, pick one size at random; this is effectively a form of data augmentation.
- Finally, the paper verifies experimentally whether these ideas really work, for example by controlling variables and comparing the change in mAP. It checks: 1) multi-stage versus multi-task training; 2) the two implementations of scale invariance, brute force versus the image pyramid; 3) whether more data is needed: with more data, model accuracy improves further, which is consistent with what a good model should do; 4) SVM versus softmax: softmax is better; 5) whether more proposals are always better: not at all; too many proposals can even reduce accuracy.
- Shortcoming: the speed of Selective Search, the algorithm that picks the candidate boxes, has not improved, and it becomes the network's bottleneck.
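A minimal sketch of the sampling rule from point 3) above. The thresholds follow the paper's description; the helper names are mine, and the demo's batch size and foreground fraction are picked only so the tiny example runs:

```python
import numpy as np

def iou(box, boxes):
    # IoU of one (x1, y1, x2, y2) box against an (N, 4) array of boxes.
    ix1 = np.maximum(box[0], boxes[:, 0]); iy1 = np.maximum(box[1], boxes[:, 1])
    ix2 = np.minimum(box[2], boxes[:, 2]); iy2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def sample_rois(rois, gt_boxes, batch=64, fg_frac=0.25):
    # For each RoI, its best overlap with any ground-truth box.
    max_iou = np.stack([iou(gt, rois) for gt in gt_boxes]).max(axis=0)
    fg = np.where(max_iou >= 0.5)[0]                      # foreground RoIs
    bg = np.where((max_iou >= 0.1) & (max_iou < 0.5))[0]  # background RoIs
    n_fg = min(int(batch * fg_frac), len(fg))
    pick = np.concatenate([
        np.random.choice(fg, n_fg, replace=False),
        np.random.choice(bg, batch - n_fg, replace=len(bg) < batch - n_fg)])
    return rois[pick], (max_iou[pick] >= 0.5).astype(int)

rois = np.array([[0, 0, 10, 10], [0, 0, 6, 10], [0, 0, 3, 10], [0, 0, 2, 10]])
gts = np.array([[0, 0, 10, 10]])
print(sample_rois(rois, gts, batch=4, fg_frac=0.5))  # boxes + fg/bg labels
```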
2.4 Knowledge supplement
1) Bounding box regression
- The goal is to find a mapping that brings the predicted box close to the real box, consisting of a translation (two parameters) plus a scaling (two parameters). But two questions arise: why are tx and ty designed to be divided by the width and height? And why do tw and th take a log form?
- Answer: the first is for scale invariance. If you look only at Δx or Δy, targets of different sizes in the image produce offsets of very different magnitudes, which is hard to train; dividing by the width and height of the target box makes the offsets comparable across scales and easy to train. The second is that the scaling factor must be greater than 0, so an exp function is used when predicting, which becomes a log when learning the inverse. Written out, the targets look like the sketch below.
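For concreteness, the four regression targets with boxes given as (center x, center y, width, height), P the proposal and G the ground-truth box, following the definitions in the R-CNN paper:

```python
import numpy as np

def regression_targets(P, G):
    px, py, pw, ph = P
    gx, gy, gw, gh = G
    tx = (gx - px) / pw    # centre shift, normalised by width -> scale-invariant
    ty = (gy - py) / ph    # centre shift, normalised by height
    tw = np.log(gw / pw)   # log of the scale factor, so exp(tw) > 0 at test time
    th = np.log(gh / ph)
    return tx, ty, tw, th

print(regression_targets((100, 100, 50, 40), (110, 95, 60, 44)))
```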
2) Multi-task loss
- The multi-task loss consists of two parts. The first represents category prediction: the negative logarithm of the predicted probability of the true category. The second is the bounding-box regression loss, smooth L1. The two parts are balanced by a coefficient λ that adjusts their relative weight so neither dominates.
- Questions: why take the negative logarithm? And how does smooth L1 improve on what came before?
- Answer: 1) In practice, maximizing the log-likelihood is more convenient: the logarithm is a monotonically increasing function of its argument, and multiplying many probabilities together risks numerical underflow, whereas taking logs turns the products into sums. Since each probability lies strictly between 0 and 1, its logarithm is negative, so a minus sign is added to make the loss positive. 2) The L1 loss is less sensitive to outliers (extreme values) than the L2 loss used in R-CNN, giving better robustness. Both terms appear in the sketch below.
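Both loss terms in one short sketch (a toy version, assuming the softmax probabilities are already computed; λ = 1, and as in the paper background RoIs would skip the localization term):

```python
import numpy as np

def smooth_l1(x):
    x = np.abs(x)
    # Quadratic near zero, linear beyond |x| = 1: the gradient stays bounded,
    # so outliers pull on the model less than with an L2 loss.
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)

def multitask_loss(class_probs, true_class, pred_deltas, target_deltas, lam=1.0):
    cls_loss = -np.log(class_probs[true_class])   # negative log of p for class u
    loc_loss = smooth_l1(pred_deltas - target_deltas).sum()
    return cls_loss + lam * loc_loss

probs = np.array([0.1, 0.7, 0.2])                 # softmax output over classes
print(multitask_loss(probs, 1,
                     np.array([0.20, 0.10, 0.00, 0.30]),
                     np.array([0.25, 0.00, 0.10, 0.20])))
```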
3 Faster R-CNN
3.1 Introduction
- Faster R-CNN = RPN + Fast R-CNN. The bottleneck of the previous models was the generation of region proposals, so Faster R-CNN targets exactly that bottleneck by proposing the RPN (Region Proposal Network), which produces proposals from a set of preset anchors on the feature map.
3.2 General steps

The network structure is shown in the figure (from the original paper https://arxiv.org/abs/1506.01497). The general procedure: first feed the image into a backbone feature-extraction network (such as ResNet) to obtain a feature map, then feed the feature map into the RPN. There, a 3×3 convolution first extracts features, after which two branches follow: one branch learns the four regression parameters of each anchor, and the other learns the probability that each anchor contains foreground versus background. Merging the two branches yields the proposals, which then enter RoI pooling together with the earlier feature map, followed finally by classification and regression. The key point is really the RPN; a minimal sketch of its head follows.
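A minimal PyTorch sketch of the RPN head just described (an assumption on my part: the paper's original implementation was in Caffe; the channel counts follow section 3.3 below):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=512, k=9):
        super().__init__()
        # 3x3 conv: the "sliding window" over the backbone feature map.
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        # Branch 1: 2k channels -> foreground/background score per anchor.
        self.cls = nn.Conv2d(512, 2 * k, kernel_size=1)
        # Branch 2: 4k channels -> four box-regression parameters per anchor.
        self.reg = nn.Conv2d(512, 4 * k, kernel_size=1)

    def forward(self, feat):
        h = torch.relu(self.conv(feat))
        return self.cls(h), self.reg(h)

scores, deltas = RPNHead()(torch.randn(1, 512, 38, 50))
print(scores.shape, deltas.shape)   # (1, 18, 38, 50) and (1, 36, 38, 50)
```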
3.3 Characteristics
- The two branches in the RPN have 2k and 4k output channels, where k is the number of anchors; in the paper k = 9, obtained by combining 3 aspect ratios with 3 scales (see the anchor sketch after this list).
- The RPN learns the probabilities of each anchor belonging to the foreground and the background, a binary classification learned through softmax.
- The RPN uses a 3×3 convolution to implement the sliding window.
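A sketch of generating the k = 3 × 3 = 9 anchors for a single feature-map cell. The scales match the paper's 128², 256², 512² pixel areas; the exact width/height convention here is an assumption for illustration:

```python
import numpy as np

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    anchors = []
    for s in scales:
        for r in ratios:
            # Keep the area s*s constant while varying the width:height ratio.
            w, h = s * np.sqrt(r), s / np.sqrt(r)
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)

print(make_anchors(300, 300).round(1))   # nine (x1, y1, x2, y2) boxes
```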
4 Summary
- The above walks through the two-stage object-detection models of the three papers in the R-CNN series. As the name suggests, they are all built around CNNs, and each improvement aims to balance accuracy against speed, trying to get the best of both. The improvements come from the network structure and the training methods.
- Next up: the papers of the representative one-stage model, the YOLO series.