[Paper Intensive Reading] Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation (R-CNN)
2022-08-05 05:55:00 【takedachia】
Paper title: Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. Published in 2014.
This paper is a foundational work in object detection. It proposed the R-CNN model and established a paradigm for the detection pipeline that influenced the detection algorithms that followed.
The R in R-CNN stands for region, meaning the candidate boxes (region proposals) used in detection. As the name suggests, the model applies a CNN to each region to extract features, then classifies those features.
R-CNN also proposes correcting the candidate boxes with offsets: a regression model improves how well a candidate box localizes the object.
In addition, the R-CNN paper proposes a way to visualize the features extracted by a convolutional neural network, which belongs to interpretability research on deep neural networks.
What follows is my summary after reading the paper, together with some of my own thoughts.
R-CNN solved two problems of its time:
① How to use a deep neural network to train an effective model for localizing objects.
Before this, object detection was done almost entirely with traditional computer vision methods.
The paper proposes a two-stage paradigm (first generate candidate boxes, then classify and refine them).
R-CNN extracts features from each box with a convolutional neural network and classifies those features.
It also uses a regression model to fine-tune the position of each candidate box.
② How to do object detection with less data.
The paper proposes the paradigm of “supervised pre-training first, then domain-specific fine-tuning”.
That is, transfer learning via fine-tuning.
- 1. Generate multiple candidate boxes from one image.
- 2. Extract features from each candidate box with a large convolutional neural network; each candidate box yields a fixed-length feature vector.
- 3. Predict the class from the extracted vector with linear support vector machines (linear SVMs); one linear SVM is trained per class to predict whether the box belongs to that class.
- [In parallel] Predict an offset from the extracted vector with a regression model, used to fine-tune the position of the candidate box.
The figure below is the R-CNN flow diagram given in the paper (the regression model is not drawn in the figure):
Candidate boxes are generated with selective search: clustering produces an initial over-segmentation, and regions are then merged according to a weighted similarity in color, texture, size, and shape, producing about 2000 candidate boxes at different levels of the hierarchy.
For our purposes it can be treated as a black box that proposes candidate boxes. It is not literally random, but it has no learnable parameters and nothing here is tied to the network.
Each candidate box is first warped into a 227×227 RGB image. Details: the candidate box is extended by 16 pixels of surrounding context, then forcibly scaled to 227×227 regardless of its aspect ratio. The effect is shown in the figure below.
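To make the warping step concrete, here is a minimal sketch that follows this post's description: pad the box by 16 pixels, clip to the image, and stretch the crop to 227×227. The function name `warp_proposal` is hypothetical, and the nearest-neighbour resize is only a stand-in for whatever interpolation a real pipeline would use.

```python
import numpy as np

def warp_proposal(image, box, pad=16, size=227):
    """Crop `box` = (x1, y1, x2, y2) with `pad` pixels of context and warp it to size x size."""
    h, w = image.shape[:2]
    x1, y1, x2, y2 = box
    # Extend the box by `pad` pixels on every side, clipped to the image bounds.
    x1, y1 = max(0, x1 - pad), max(0, y1 - pad)
    x2, y2 = min(w, x2 + pad), min(h, y2 + pad)
    crop = image[y1:y2, x1:x2]
    # Nearest-neighbour resize that ignores the aspect ratio (a forced warp).
    rows = np.arange(size) * crop.shape[0] // size
    cols = np.arange(size) * crop.shape[1] // size
    return crop[rows][:, cols]

image = np.zeros((300, 400, 3), dtype=np.uint8)  # dummy H x W x RGB image
warped = warp_proposal(image, (50, 60, 150, 120))
print(warped.shape)  # (227, 227, 3)
```

Because the warp ignores aspect ratio, a tall thin box and a short wide box both end up as the same square input, which is what lets every proposal pass through a network with a fixed input size.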
The warped image is then fed into the convolutional neural network (AlexNet in the paper, hence the 227×227 input), which finally produces a 4096-dimensional vector.
- Pass in an image, generate about 2000 candidate boxes with selective search, and warp them.
- Feed each warped box into the CNN to extract a 4096-dimensional vector. This step is the most time-consuming.
- Use the trained linear SVM classifiers to score each class (each class has its own linear SVM).
The output of step 2 is a 2000×4096 matrix (number of candidate boxes × vector length). Multiplying it by a 4096×N matrix (N is the number of classes; the columns are the trained SVM weights) gives a confidence score for every box and every class.
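The matrix product described above can be sketched in a few lines of NumPy. The features and SVM weights here are random placeholders, N is set to 20 only for illustration, and `score_proposals` is a hypothetical helper name.

```python
import numpy as np

def score_proposals(features, svm_weights, svm_bias):
    """Return a (num_boxes, num_classes) matrix of per-class SVM scores."""
    return features @ svm_weights + svm_bias

rng = np.random.default_rng(0)
features = rng.standard_normal((2000, 4096), dtype=np.float32)   # stand-in CNN features
svm_weights = rng.standard_normal((4096, 20), dtype=np.float32)  # one column per class SVM
svm_bias = np.zeros(20, dtype=np.float32)

scores = score_proposals(features, svm_weights, svm_bias)
print(scores.shape)  # (2000, 20)
```

Row i of `scores` holds the scores of candidate box i for all classes, so all per-class SVMs are evaluated in a single matrix multiplication.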
Finally, NMS (non-maximum suppression) removes redundant candidate boxes that repeat the same prediction.
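A minimal sketch of greedy NMS for one class: keep the highest-scoring box, drop every remaining box that overlaps it too much, and repeat. The default `iou_threshold=0.3` is an assumed illustrative value, not one taken from this post.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.3):
    """Greedy non-maximum suppression. boxes: rows of (x1, y1, x2, y2). Returns kept indices."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(int(i))
        # Intersection of the best box with every remaining box.
        iw = np.clip(np.minimum(x2[i], x2[rest]) - np.maximum(x1[i], x1[rest]), 0, None)
        ih = np.clip(np.minimum(y2[i], y2[rest]) - np.maximum(y1[i], y1[rest]), 0, None)
        inter = iw * ih
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_threshold]  # drop boxes that overlap the kept box too much
    return keep

boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.68, so it is suppressed; the third box does not overlap and survives.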
While the classes are being predicted, the offset of each candidate box is predicted as well, correcting the box to a more accurate position.
The diagram below summarizes the whole process very clearly (taken from the R-CNN paper walkthrough by Bilibili uploader “同济子豪兄”):
Next, the details of training the model.
1 Training the CNN model
① Training with a pre-trained model plus fine-tuning
The CNN in the second module starts from a model pre-trained on the ImageNet image classification dataset (the authors use AlexNet) and is then adapted to the VOC object detection dataset by fine-tuning, a transfer learning method. (This addresses the small amount of object detection data.)
Concretely, the candidate boxes on the VOC dataset are used as the training set: they are warped and fed into the CNN for fine-tuning. The last fully connected layer of the CNN is changed from its original 1000 outputs to 21, corresponding to the 20 classes of the PASCAL VOC 2012 dataset plus 1 background class.
The background class covers candidate boxes that do not effectively frame any object.
Note that when training the CNN, the 21-way output of the last layer is used only for training, in order to obtain a feature extractor.
Once the last fully connected layer has been changed to 21 outputs, the whole model can be trained. What we ultimately use, however, is the 4096-dimensional vector from the penultimate fully connected layer (or even the output of the earlier pooling layer), which classification relies on at test time. Classification itself uses linear SVM classifiers, described below.
So where do the training samples come from?
At the start we have some ground truth, that is, manually annotated boxes, but there are very few of them, so training on them alone is unrealistic.
We also have a large number of candidate boxes generated by selective search. By labeling these candidate boxes we can turn them into training samples.
So how are they labeled?
First, the training samples are divided into positive samples and negative samples. Positive samples are candidate boxes labeled with one of the 20 classes; negative samples are candidate boxes labeled as the background class.
How are positives and negatives told apart? This is where IoU (intersection over union) comes in: it measures whether a candidate box effectively frames a target.
As shown in the following two figures:
We compare a candidate box (the light blue box in the figure) with every ground-truth box in the image (the red box in the figure; there is 1 GT box here, but there can be several, representing several classes), compute the IoU with each, and take the largest IoU value and its class.
Candidate boxes with IoU > 0.5 are treated as positive samples. For example, a candidate box that effectively frames the bird is a positive sample for bird.
A box that does not effectively frame the bird, and instead frames the branches and the greenery in the distance, is a negative sample for bird and goes to the background class.
This gives us the training set: candidate boxes for the 20 classes plus background candidate boxes.
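The labeling rule for CNN fine-tuning can be sketched as follows. Boxes are assumed to be `(x1, y1, x2, y2)` tuples, label 0 is assumed to mean background, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def label_for_finetuning(proposal, gt_boxes, gt_labels, pos_thresh=0.5):
    """Return the class of the best-overlapping GT box, or 0 (background)."""
    best_iou, best_label = 0.0, 0
    for gt, label in zip(gt_boxes, gt_labels):
        overlap = iou(proposal, gt)
        if overlap > best_iou:
            best_iou, best_label = overlap, label
    return best_label if best_iou > pos_thresh else 0

gt_boxes, gt_labels = [(0, 0, 10, 10)], [3]  # one GT box; class id 3 stands in for "bird"
print(label_for_finetuning((0, 0, 10, 8), gt_boxes, gt_labels))     # 3
print(label_for_finetuning((20, 20, 30, 30), gt_boxes, gt_labels))  # 0
```

The first proposal has an IoU of 0.8 with the ground-truth box and inherits its class; the second does not overlap at all and becomes a background sample.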
③ Handling class imbalance in the training set
In the training set generated above, negative samples (background class samples) are still the vast majority. Training on it directly would cause class imbalance and bias the model toward predicting the negative class.
So during training, each batch must contain a certain proportion of positive samples (for example, the paper describes batches of 128 with 32 positive samples and 96 negative samples).
Because positive samples are far fewer than negative samples, this means oversampling positives and undersampling negatives, resampling until the classes are relatively balanced.
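A minimal sketch of building one such batch, assuming the positives and negatives are already held in two lists. `sample_batch` is a hypothetical helper; it samples with replacement only when a pool is too small, which is one simple way to oversample.

```python
import random

def sample_batch(positives, negatives, n_pos=32, n_neg=96, seed=None):
    """Build one batch of (sample, is_positive) pairs with a fixed positive/negative ratio."""
    rng = random.Random(seed)

    def draw(pool, k):
        # Without replacement when the pool is large enough, otherwise with replacement.
        return rng.sample(pool, k) if len(pool) >= k else rng.choices(pool, k=k)

    batch = [(x, 1) for x in draw(positives, n_pos)]
    batch += [(x, 0) for x in draw(negatives, n_neg)]
    rng.shuffle(batch)
    return batch

positives = list(range(10))          # only 10 positives: they get oversampled
negatives = list(range(100, 1100))   # 1000 negatives: they get undersampled
batch = sample_batch(positives, negatives, seed=0)
print(len(batch), sum(flag for _, flag in batch))  # 128 32
```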
2 Training the linear SVM classifiers
One binary linear SVM classifier is trained for each class.
We take the features extracted by the CNN (one 4096-dimensional vector per candidate box) and make a binary decision on them, for example whether the box is a bird. That is the binary classifier for bird.
We can train another binary classifier for car, and in this way train binary classifiers for all 20 classes.
Take the class “car” as an example. Its binary classifier needs positive samples that are “car” and negative samples that are not “car” (a box covering only a small part of a car, background, or even another class).
Where does the training data for this binary classifier come from?
Recall how the CNN feature extractor was trained: it treats a box as “car” whenever IoU > 0.5. On one hand, this looser criterion makes the network better at fuzzy recognition, so it learns a wide variety of car features. On the other hand, there were few samples when training the CNN, and the relatively low IoU requirement was needed to gather more positives.
The binary classifier, however, has to strictly identify a reasonably complete car and decide how likely the box is to be a car; after all, 20 classes are waiting to have their likelihoods compared. (As mentioned later, we also predict localization information for the candidate box, and that prediction needs the features of a complete car to compare against when predicting the offset.) Therefore, the positive and negative samples used to train the classifier differ from those used to train the CNN.
Here, the positive samples are the ground-truth boxes annotated in the dataset itself.
A candidate box with IoU below 0.3 is used as a negative sample for training. (Note that samples with IoU above 0.3 are not used for training and are discarded.)
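The stricter labeling rule for one class's SVM can be sketched like this. The return convention (1 for positive, 0 for negative, `None` for discarded) and the function names are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def svm_training_label(box, class_gt_boxes, is_ground_truth, neg_thresh=0.3):
    """Label one box for a single class's SVM: 1 positive, 0 negative, None discarded."""
    if is_ground_truth:
        return 1  # only ground-truth boxes of this class are positives
    best = max((iou(box, gt) for gt in class_gt_boxes), default=0.0)
    if best < neg_thresh:
        return 0  # clearly not this class: use as a negative
    return None   # partial overlap: neither positive nor negative, so it is dropped

car_gt = [(0, 0, 10, 10)]
print(svm_training_label((0, 0, 10, 10), car_gt, True))     # 1
print(svm_training_label((20, 20, 30, 30), car_gt, False))  # 0
print(svm_training_label((0, 0, 10, 6), car_gt, False))     # None
```

Compared with the CNN fine-tuning labels, the middle band of partially overlapping boxes is simply left out, so the SVM only sees clean positives and clear negatives.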
The balance between positive and negative samples also needs attention during training.
③ Hard negative mining
R-CNN also uses hard negative mining: negative samples that are hard to distinguish are collected, like a “notebook of wrong answers”, and added to the next round of training.
This further improves the performance of the classifier.
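The mining step itself can be sketched as a filter over the negatives using the current classifier. This assumes a linear scorer `w·x + b` and treats negatives scoring above the margin of -1 as “hard”, which is one common convention; the retraining loop around it is omitted, and `mine_hard_negatives` is a hypothetical name.

```python
import numpy as np

def mine_hard_negatives(w, b, neg_features, margin=-1.0):
    """Return indices of negatives that the current linear classifier scores above `margin`."""
    scores = neg_features @ w + b
    return np.flatnonzero(scores > margin)

w = np.array([1.0, 0.0])  # toy 2-D classifier weights
b = 0.0
neg_features = np.array([[-3.0, 0.0],   # score -3.0: easy negative
                         [0.5, 0.0],    # score  0.5: misclassified as positive
                         [-0.5, 0.0]])  # score -0.5: inside the margin
hard = mine_hard_negatives(w, b, neg_features)
print(hard.tolist())  # [1, 2]
```

In a full loop, the rows selected by `hard` would be appended to the training set before the classifier is fitted again.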
Thinking: why classify with SVMs instead of using the softmax output of the CNN directly?
This ultimately comes down to the lack of training data.
When training the CNN there are not enough positive samples, so the IoU requirement has to be lowered: candidate boxes with IoU > 0.5 are counted as positive samples for the model to learn from. With this design, localization performance clearly suffers (the model makes do with learning from half a car, so at prediction time it may accept a box that contains only half a car).
If we had enough samples, there would be no need to lower the IoU requirement; only candidate boxes that frame the target precisely would count as positives, and there would be no loss in localization performance.
So the authors had little choice but to post-process the not-so-precise candidate box features (the 4096-dimensional vectors extracted by the CNN), splitting the use of the feature information into classification plus offset prediction (described later).
The authors ran some preliminary experiments and found that how positives and negatives are defined matters: using different definitions for training the CNN feature extractor and for classifying the candidate boxes improves mAP (the model's performance metric). When classifying candidate boxes, positives and negatives must be divided more strictly, and a positive sample must be a ground-truth box.
As shown in the diagram above, offset prediction is the bounding box regression discussed next.
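3 Predicting candidate box offsets (bounding box regression)
Candidate boxes inevitably carry localization error, so we can correct the generated boxes with an offset.
Bounding box regression is inspired by the DPM algorithm. It trains a linear regression model that, given a set of features (extracted by the CNN), predicts a new detection box; the offset of this new box is the regression target.
The offset is measured relative to the correct position (such as a ground-truth box that fully frames the target). It is computed from a set of offset coefficients, and those coefficients are learned.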
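As a concrete illustration, the sketch below uses the parameterization from the R-CNN paper: the center shift is expressed relative to the proposal's width and height, and the size change is expressed in log space. Only the target computation and its inverse are shown; the learned regressor that predicts the offsets from CNN features is omitted, and the function names are hypothetical.

```python
import math

def to_center(box):
    """Convert (x1, y1, x2, y2) to (center_x, center_y, width, height)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return x1 + w / 2, y1 + h / 2, w, h

def regression_targets(proposal, gt):
    """Offsets (tx, ty, tw, th) that map the proposal onto the ground-truth box."""
    px, py, pw, ph = to_center(proposal)
    gx, gy, gw, gh = to_center(gt)
    return ((gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph))

def apply_offsets(proposal, offsets):
    """Apply predicted offsets (dx, dy, dw, dh) to a proposal and return the corrected box."""
    px, py, pw, ph = to_center(proposal)
    dx, dy, dw, dh = offsets
    cx, cy = pw * dx + px, ph * dy + py
    w, h = pw * math.exp(dw), ph * math.exp(dh)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

proposal, gt = (10, 10, 50, 90), (12, 8, 60, 100)
targets = regression_targets(proposal, gt)
recovered = apply_offsets(proposal, targets)  # reproduces gt up to floating-point rounding
```

Normalizing by the proposal's size makes the targets independent of box scale, so one regressor can serve both small and large proposals.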
Below, I combine the paper with my own thinking to explain in detail what bounding box regression does.
I moved this part into a separate sub-article: A detailed look at R-CNN bounding box regression
Some thoughts on R-CNN and its contributions
The above is the main content of R-CNN for the object detection task. Now for some of the other contributions and ideas in the paper.
Of these, bounding box regression has already been discussed in the main section.
Visualizing learned features (interpretability of neural networks)
R-CNN is also one of the cornerstones of interpretability analysis for convolutional neural networks (alongside other work such as ZFNet). It proposes a visualization method that shows directly what the network has learned.
① Neuron activation values (activations)
First, a review of two concepts:
In a CNN, the output of a layer generally has shape height × width × channels, and each number in that array represents one neuron.
The output of each layer is called the activations of that group of neurons, and it is passed to the next layer as input.
The activation value mentioned in the paper is a single number in the output of one channel.
The authors' method for visualizing learned features is to find the image regions that give the largest activation for particular neurons in AlexNet. What does that mean?
② Mapping an activation value to its receptive field in the original image
I drew the picture below to show the receptive field in the original image that corresponds to one activation value in a given layer. When a 227×227 image is passed into AlexNet, the feature map output by the last pooling layer, pool5, is 6×6 with 256 channels.
Take the activation at (3,3) on channel 1 of the pooling layer output (the red block): it corresponds to a 195×195 receptive field in the original image. Since (3,3) is close to the center, this receptive field covers most of the original image.
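The 195×195 figure can be checked with the usual receptive-field recurrence: each layer adds (kernel − 1) × (product of all earlier strides). The kernel sizes and strides below are the standard AlexNet configuration, which is an assumption here because the post does not list them.

```python
# (name, kernel size, stride) for the layers up to pool5 in a standard AlexNet.
ALEXNET_LAYERS = [
    ("conv1", 11, 4), ("pool1", 3, 2),
    ("conv2", 5, 1), ("pool2", 3, 2),
    ("conv3", 3, 1), ("conv4", 3, 1), ("conv5", 3, 1),
    ("pool5", 3, 2),
]

def receptive_field(layers):
    """Return (receptive field size, cumulative stride) of one unit in the last layer."""
    rf, jump = 1, 1
    for _name, kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) input-space steps
        jump *= stride             # distance in input pixels between neighbouring units
    return rf, jump

print(receptive_field(ALEXNET_LAYERS))  # (195, 32)
```

The cumulative stride of 32 means neighbouring pool5 units look at windows shifted by 32 pixels in the 227×227 input.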
We can think of the 256 channels of the pooling layer output as 256 classes of high-level semantic features. Suppose, for example, that channel 1 represents a “halo” feature, channel 2 represents “parallel texture”, and so on.
Then the higher the activation at (3,3) on channel 1 of the pooling layer output, the more likely the corresponding receptive field contains the “halo” feature.
So, from the activation of a given neuron on each channel of the pooling layer output, we can check whether the feature found in the receptive field of the original image is correct and reasonable. In this way the features extracted by a convolutional neural network can be visualized!
How do the authors do it?
They feed all candidate boxes of the entire dataset (about 10 million) into the convolutional neural network and extract the pooling layer output (6×6×256).
For each channel, the activation at (3,3) of the 6×6 feature map is selected, which stands for most of the receptive field of the original image. Within a channel, the activations are sorted from largest to smallest across candidate boxes, the candidate boxes with the top activations are found, and their receptive field regions are displayed.
In this way, within each channel, the boxes with high activations visualize a set of well-extracted features.
That channel can then be analyzed for interpretability.
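The ranking step described here can be sketched with NumPy. It assumes the pool5 outputs of all candidate boxes are stacked into one array of shape (num_boxes, 256, 6, 6), with channels first and 0-based indices; both are layout assumptions, and `top_activating_proposals` is a hypothetical name.

```python
import numpy as np

def top_activating_proposals(pool5, channel, k=5, pos=(3, 3)):
    """Return (indices, values) of the k boxes with the largest activation at `pos` on `channel`."""
    values = pool5[:, channel, pos[0], pos[1]]
    order = np.argsort(values)[::-1][:k]  # descending by activation
    return order, values[order]

pool5 = np.zeros((4, 256, 6, 6))          # toy stand-in for ~10 million proposals
pool5[:, 1, 3, 3] = [0.1, 0.9, 0.5, 0.7]  # activations of one unit for 4 proposals
idx, vals = top_activating_proposals(pool5, channel=1, k=3)
print(idx.tolist(), vals.tolist())  # [1, 3, 2] [0.9, 0.7, 0.5]
```

The image regions of the returned boxes are what would be displayed, one row per channel unit.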
Look at the figure from the paper. Row 1 shows, for one channel unit, the receptive fields in the original images of the candidate boxes with the highest activations; the high-level semantic feature extracted here is people.
Row 2 extracts a high-level dot (dot array) feature. Notice that a dog's face is also grouped under the dot pattern, because a dog's two eyes and nose resemble a dot array.
The effect of fine-tuning and the role of the fully connected layers (ablation study)
The authors compare model performance with and without fine-tuning.
They also remove AlexNet's last two fully connected layers in turn and observe the effect on performance.
The results are shown in the figure below (annotated figure from Bilibili uploader “同济子豪兄”):
The figure shows that the model trained with fine-tuning performs significantly better.
With fine-tuning, the fully connected layers contribute noticeably (mAP rises from 47.3 to 54.2 in the figure); without fine-tuning, their contribution is small (mAP goes from 44.2 to 44.7, barely an improvement).
This may indicate that, when a pre-trained model is used for transfer learning, the convolutional layers extract general features, while the fully connected layers (fc) handle the task of the specific domain.
Finally, the paradigm of “supervised pre-training/domain-specific fine-tuning” (supervised pre-training first, then fine-tuning on the specific domain) has become a common way of solving computer vision problems where data is relatively scarce.