
Object detection series: a detailed explanation of the RCNN principle

2022-06-22 07:35:00 【Bald little Su】

About the author: Bald little Su, committed to describing problems in the plainest possible language


RCNN principle

Preface

RCNN is a pioneering work in the field of object detection. Its author is Ross Girshick, whom we like to call the great master RBG. You can look up his papers on Google Scholar; one glance at the citation counts and all you can do is marvel!!!

Next we go through the principle of RCNN in detail, starting from the classic figure in the paper that shows the RCNN pipeline. There are four main steps, and each one is explained below.

Candidate region generation

RCNN generates candidate regions with selective search 【abbreviated as the SS algorithm】. Roughly speaking, the algorithm groups parts of the image by features such as color, size, and shape, and its output is a set of candidate boxes for the image. RCNN generates about 2000 candidate boxes per image. These boxes overlap heavily, so the overlapping ones have to be removed later to obtain relatively accurate boxes.【Note: the SS algorithm is not explained in detail here; interested readers can look it up themselves】 The figure below shows the approximate result of the SS algorithm. As you can see, several candidate boxes are generated for a single target.

Feature extraction with a neural network

In the previous step the SS algorithm gave us 2000 candidate boxes from one image. Next we need to extract features from these boxes, that is, feed each of the 2000 candidate regions into the AlexNet network and take the features it extracts.【Note: I introduced the AlexNet network structure in an earlier post; see that post if anything is unclear】 For convenience, the AlexNet network structure is also reproduced here for reference, as shown in the figure below:

Note that RCNN does not need the final softmax layer. A region only has to pass through the two fully connected layers before it, and the 4096-dimensional features they produce are used directly. Because of the fully connected layers, the size of the input image must be fixed: its resolution has to be 227×227. The method used in the paper ignores the size and aspect ratio of the candidate region: it first expands the region by 16 neighboring pixels on every side and then forcibly scales all pixels to 227×227.【Note: this scheme distorts the original image, for example people become shorter and fatter】 The scaling scheme is shown in the figure below:

Image source: Tongji Zihao on Bilibili
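The cropping and scaling step described above can be sketched in a few lines of NumPy. This is only a simplified illustration, not the paper's implementation: the function name `warp_region`, the `(x1, y1, x2, y2)` box format, and the nearest-neighbor resize are all assumptions made for this example, and the 16-pixel context is added in the original image before scaling.

```python
import numpy as np

def warp_region(image, box, context=16, size=227):
    """Crop `box` plus `context` pixels on each side, then resize to size x size.

    image: H x W x C array; box: (x1, y1, x2, y2) in pixels, x2 and y2 exclusive.
    The aspect ratio is deliberately ignored, so the content gets distorted.
    """
    h, w = image.shape[:2]
    x1, y1, x2, y2 = box
    # Expand the box by `context` pixels, clipping at the image border.
    x1, y1 = max(0, x1 - context), max(0, y1 - context)
    x2, y2 = min(w, x2 + context), min(h, y2 + context)
    crop = image[y1:y2, x1:x2]
    # Nearest-neighbor resize: choose one source row/column per output pixel.
    rows = np.arange(size) * crop.shape[0] // size
    cols = np.arange(size) * crop.shape[1] // size
    return crop[rows][:, cols]

image = np.random.rand(300, 500, 3)           # stand-in for a real image
patch = warp_region(image, (100, 50, 180, 260))
print(patch.shape)                            # (227, 227, 3)
```

Whatever the shape of the candidate box, the output is always 227×227, which is exactly what the fully connected layers require.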

Classification with SVM classifiers

In the previous step the AlexNet network extracted the features. Each candidate region yields a 4096-dimensional feature vector, as shown in the figure below:

Image source: 霹雳吧啦Wz on Bilibili

The figure above shows the features extracted from a single candidate box. The SS algorithm generates 2000 candidate boxes from one image, so feeding all of them into the network gives a 2000×4096 feature matrix. Multiplying this feature matrix by the 4096×20 weight matrix formed by the 20 SVMs gives a 2000×20 matrix of class scores, in which each row holds the scores of one candidate box for each target class.【Note: on the VOC dataset there are 21 classes in total once the background class is included】

Image source: 霹雳吧啦Wz on Bilibili
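The matrix shapes in this step are easy to check with NumPy. The sketch below uses random numbers in place of real AlexNet features and trained SVM weights, and the variable names and the bias vector are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
num_boxes, feat_dim, num_classes = 2000, 4096, 20

# Random stand-ins: one row per candidate box, one column per class SVM.
features = rng.standard_normal((num_boxes, feat_dim))
svm_weights = rng.standard_normal((feat_dim, num_classes))
svm_bias = np.zeros(num_classes)            # hypothetical per-class bias

scores = features @ svm_weights + svm_bias  # shape (2000, 20)
best_class = scores.argmax(axis=1)          # highest-scoring class of each box
print(scores.shape, best_class.shape)       # (2000, 20) (2000,)
```

Row `i` of `scores` describes candidate box `i`, and column `j` collects the scores of all 2000 boxes for class `j`.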

To make this easier to understand, part ① of the structure above is explained in more detail in the figure below:

As the figure above shows, each column of the 2000×20 matrix holds the scores of the 2000 candidate boxes for one class. For example, the first column holds the score of each of the 2000 candidate boxes for the class dog. We apply non-maximum suppression (NMS) to each column, that is, to each class, in order to eliminate overlapping candidate boxes and keep the highest-scoring boxes in that column. The NMS procedure is as follows:

Image source: 霹雳吧啦Wz on Bilibili

This part can be confusing at first: why delete the boxes with a large IoU? I had the same question, and it came from not understanding the procedure clearly. First we find the highest-scoring box in a column, then we compute the IoU between every other box and that highest-scoring box【note that this is not the IoU with the ground truth】. What does a large IoU mean here? The larger the value, the more the two candidate boxes overlap, which means they very likely cover the same object, so it makes sense to delete the one with the lower score. The figure below illustrates the process:

Image source: 霹雳吧啦Wz on Bilibili
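The NMS procedure for one class can be sketched as follows. This is a minimal illustration rather than the original implementation: the `(x1, y1, x2, y2)` box format, the function names, and the IoU threshold of 0.5 are assumptions chosen for this example.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept for one class, best score first."""
    order = np.argsort(scores)[::-1]        # indices sorted by descending score
    keep = []
    while order.size > 0:
        best = order[0]                     # highest-scoring remaining box
        keep.append(int(best))
        rest = order[1:]
        overlaps = iou(boxes[best], boxes[rest])
        # Drop boxes that overlap the best box too much; keep the others.
        order = rest[overlaps <= iou_threshold]
    return keep

boxes = np.array([[10, 10, 110, 110],
                  [12, 12, 112, 112],       # almost the same object as box 0
                  [200, 200, 300, 300]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))                   # [0, 2]
```

Box 1 overlaps box 0 with an IoU of about 0.92, so it is treated as a duplicate detection of the same object and removed, while the distant box 2 survives.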

Correcting the candidate box positions with regressors

The previous step eliminated many candidate boxes. Next we refine the ones that remain: 20 regressors, one per class, are applied to the candidate boxes left in each of the 20 classes, which finally gives the corrected, highest-scoring bounding box for each class.

So how do we get the final predicted box from a candidate box? We again use the feature vector output by AlexNet to obtain the regressor's prediction, $(d_x(P), d_y(P), d_w(P), d_h(P))$, which represents the offsets of the center coordinates and the scaling factors of the width and height of the candidate box. The predicted box $\hat{G}$ is expressed as follows:

Image source: Tongji Zihao on Bilibili
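For reference, the box transformation as given in the RCNN paper can be written out as below. Here $P = (P_x, P_y, P_w, P_h)$ denotes the center coordinates, width, and height of the candidate box; this notation follows the paper and is not defined elsewhere in this post.

$$
\begin{aligned}
\hat{G}_x &= P_w \, d_x(P) + P_x \\
\hat{G}_y &= P_h \, d_y(P) + P_y \\
\hat{G}_w &= P_w \exp\big(d_w(P)\big) \\
\hat{G}_h &= P_h \exp\big(d_h(P)\big)
\end{aligned}
$$

The center is shifted by a fraction of the box size, and the width and height are scaled by an exponential factor, so the predicted size is always positive.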

Solving the equations above for $(d_x(P), d_y(P), d_w(P), d_h(P))$ gives the inverse expressions, which we now denote by $(t_x, t_y, t_w, t_h)$. Because the parameters of the annotated (ground-truth) box and of the candidate box are both given, $(t_x, t_y, t_w, t_h)$ can be computed directly, and these are the true target values.

Image source: Tongji Zihao on Bilibili
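A small numerical check makes the relationship between the two sets of formulas concrete. The sketch below assumes boxes are given as `(center_x, center_y, width, height)` and uses the transformation from the RCNN paper; the function names and the sample boxes are made up for illustration.

```python
import numpy as np

def regression_targets(P, G):
    """Targets (t_x, t_y, t_w, t_h) that map candidate box P onto ground truth G."""
    Px, Py, Pw, Ph = P
    Gx, Gy, Gw, Gh = G
    return np.array([(Gx - Px) / Pw,      # center shift, relative to box width
                     (Gy - Py) / Ph,      # center shift, relative to box height
                     np.log(Gw / Pw),     # log of the width scaling factor
                     np.log(Gh / Ph)])    # log of the height scaling factor

def apply_deltas(P, d):
    """Predicted box obtained by applying offsets d = (d_x, d_y, d_w, d_h) to P."""
    Px, Py, Pw, Ph = P
    dx, dy, dw, dh = d
    return np.array([Pw * dx + Px, Ph * dy + Py,
                     Pw * np.exp(dw), Ph * np.exp(dh)])

P = (100.0, 80.0, 50.0, 40.0)   # hypothetical candidate box
G = (110.0, 90.0, 60.0, 30.0)   # hypothetical ground-truth box
t = regression_targets(P, G)
print(t)
print(apply_deltas(P, t))       # recovers G up to floating-point error
```

If the regressor predicted exactly the target values, applying them to the candidate box would reproduce the ground-truth box, which is why the targets serve as the values to fit.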

Next we fit the predicted values $(d_x(P), d_y(P), d_w(P), d_h(P))$ to the target values $(t_x, t_y, t_w, t_h)$ by minimizing the loss function, which is as follows:
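In the RCNN paper the loss is a regularized least-squares (ridge regression) objective. Each prediction is a linear function of the CNN feature of the candidate box, $d_\star(P) = w_\star^{\mathsf T} \phi(P)$, where the paper takes $\phi(P)$ from the pool5 layer. The symbols $N$ (number of training pairs), $\phi$, and $\lambda$ (regularization weight) follow the paper and are not defined elsewhere in this post.

$$
w_\star = \arg\min_{\hat{w}_\star} \sum_{i=1}^{N} \left( t_\star^{i} - \hat{w}_\star^{\mathsf T} \phi(P^{i}) \right)^2 + \lambda \left\lVert \hat{w}_\star \right\rVert^2, \qquad \star \in \{x, y, w, h\}
$$

One such problem is solved for each of the four outputs, so every class gets four weight vectors.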

Summary

That's all for the principle of RCNN; I hope it helps. Fast RCNN, Faster RCNN, and walkthroughs of the related code will follow in future posts. Keep it up!!!