
Positive and negative sample assignment and architecture understanding in image classification and object detection

2022-07-03 10:01:00 【Star soul is not a dream】

The key to understanding supervised deep learning is to look at the inference stage and the training stage separately. Once you understand what a deep neural network architecture does at inference time and what it does during training, you understand the model.

In the inference stage, the model is treated as a black-box nonlinear function. For example, various convolution modules are combined into a backbone that outputs a tensor of the desired shape, which is then post-processed.

In the training stage, we need to divide the samples into positives and negatives, design a loss function suited to the task, and use an optimization algorithm such as SGD to iteratively update the weights and biases of the neurons. The optimization objective is to minimize the loss function, so that the trained model fits the training set.
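The update rule described above can be sketched on a deliberately tiny problem. This is only an illustration under simplifying assumptions: a single linear unit y = w*x + b stands in for the network, the loss is squared error, and the names `sgd_fit`, `lr`, and `epochs` are made up for this example.

```python
import random

def sgd_fit(data, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w*x + b with per-sample SGD on the loss 0.5 * (pred - y)**2."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)              # visit samples in random order
        for x, y in samples:
            err = (w * x + b) - y         # prediction error for this sample
            w -= lr * err * x             # d(loss)/dw = err * x
            b -= lr * err                 # d(loss)/db = err
    return w, b

# Noise-free toy training set generated from y = 2x + 1 (hypothetical data).
data = [(x, 2.0 * x + 1.0) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
w, b = sgd_fit(data)
```

After training, `w` and `b` should be close to the values that generated the toy data, which is what "fitting the training set" means here.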

It usually helps to view any neural network as an encoder-decoder architecture.

Image classification:

Inference stage: the input is an image. An encoder (such as a CNN) encodes it into a tensor, typically shrinking W and H by a factor of x and increasing the number of channels C by a factor of y, which gives a new tensor of shape (W/x, H/x, yC). A decoder follows, built from FC layers, softmax, and so on. Alternatively, everything before the softmax can be regarded as the encoder and the softmax itself as the decoder.
Training stage: the forward pass is the same as in the inference stage, but the vector output by the softmax is compared with the annotated label to compute a loss, most commonly cross-entropy, which is then backpropagated to update all the weights and biases before the softmax.
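To make the softmax and cross-entropy step concrete, here is a minimal sketch in plain Python. The logits are invented numbers standing in for the output of the last FC layer, and `label` is assumed to be the index of the 1 in the one-hot label vector.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, label):
    """Cross-entropy against a one-hot label: -log of the labeled class probability."""
    return -math.log(probs[label])

logits = [2.0, 0.5, -1.0]                     # hypothetical scores for 3 classes
probs = softmax(logits)
loss_if_class_0 = cross_entropy(probs, 0)     # label agrees with the largest score
loss_if_class_2 = cross_entropy(probs, 2)     # label disagrees, so the loss is larger
```

Only the probability at the labeled position enters the loss, which is why the one-hot label is enough to tell the network which output to raise.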

Object detection:

Inference stage: object detection is more complex. In general, a detection architecture consists of Backbone + Neck + Detection head. The names are apt: first the trunk, then the neck, and finally the head that makes the decision. The backbone is usually a model pre-trained on a large image classification dataset (the encoder of an image classifier). Classification annotations are cheaper to obtain, and the features the network learns transfer between the two tasks, so this is a form of transfer learning. The neck performs feature fusion on some of the backbone's output tensors, producing combined features that are better suited to detecting objects of different sizes. The detection head operates on the fused tensors from the neck and outputs a tensor of the desired shape. Finally comes post-processing: predictions that fall below a threshold are discarded, and NMS (non-maximum suppression) removes redundant boxes.
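The post-processing step can be sketched as follows. This is a simplified greedy NMS for a single class, assuming boxes in (x1, y1, x2, y2) format. The function names and the two 0.5 thresholds are illustrative choices, not values taken from any particular detector.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def postprocess(boxes, scores, score_thr=0.5, iou_thr=0.5):
    """Return indices of boxes kept after score filtering and greedy NMS."""
    # Step 1: drop predictions whose score is below the threshold.
    candidates = [i for i, s in enumerate(scores) if s >= score_thr]
    # Step 2: visit the rest from highest to lowest score and keep a box
    # only if it does not overlap too much with a box already kept.
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    for i in candidates:
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in kept):
            kept.append(i)
    return kept

boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.2]
kept = postprocess(boxes, scores)
```

In this toy input the second box heavily overlaps the first and has a lower score, so it is suppressed, and the last box is removed by the score threshold.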

Of course, we can also regard Backbone + Neck as the encoder and the detection head as the decoder. Note that some architectures, such as YOLOv1, have no neck, which costs some performance.

The Backbone + Neck + Detection head architecture lets us design each module separately and then build different object detection models by swapping modules.
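The idea of swapping modules can be illustrated with a toy pipeline that tracks only tensor shapes, not real features. Everything here is a hypothetical stand-in: the strides, the channel counts, and the 85 outputs per location are arbitrary numbers chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Shape = Tuple[int, int, int]  # (height, width, channels)

def toy_backbone(image: Shape) -> List[Shape]:
    """Pretend backbone: feature-map shapes at strides 8, 16 and 32."""
    h, w, _ = image
    return [(h // s, w // s, c) for s, c in ((8, 128), (16, 256), (32, 512))]

def no_neck(features: List[Shape]) -> List[Shape]:
    """Architectures without a neck pass backbone features straight to the head."""
    return features

def toy_fusion_neck(features: List[Shape], out_channels: int = 256) -> List[Shape]:
    """Pretend fusion neck: every level ends up with the same channel count."""
    return [(h, w, out_channels) for h, w, _ in features]

def make_toy_head(num_outputs: int) -> Callable[[List[Shape]], List[Shape]]:
    """Pretend head: predicts num_outputs values at every location of every level."""
    def head(features: List[Shape]) -> List[Shape]:
        return [(h, w, num_outputs) for h, w, _ in features]
    return head

@dataclass
class Detector:
    backbone: Callable[[Shape], List[Shape]]
    neck: Callable[[List[Shape]], List[Shape]]
    head: Callable[[List[Shape]], List[Shape]]

    def __call__(self, image: Shape) -> List[Shape]:
        return self.head(self.neck(self.backbone(image)))

with_neck = Detector(toy_backbone, toy_fusion_neck, make_toy_head(85))
without_neck = Detector(toy_backbone, no_neck, make_toy_head(85))
output_shapes = with_neck((256, 256, 3))
```

Because each part only has to agree on what it receives and returns, replacing `toy_fusion_neck` with `no_neck` yields a different detector without touching the other two modules.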

Training stage:

The core of the training stage is the design of the loss function. The loss is computed between the tensor output by the detection head and the label annotations, and is used to update the network. The post-processing described above is therefore not involved in this stage. The key issue here is how to choose the positive and negative samples used to compute the loss.

In image classification, the positive samples of a class are all images labeled with that class, and the negative samples are all images of other classes. When the network is fed a positive sample, the loss is computed at the position where the label vector is 1, so training pushes that predicted value up and the loss down. Because the softmax outputs sum to 1, the other entries of the prediction vector shrink in total. When the network is fed an image that is a negative sample for a given class, the predicted value of the class the image actually belongs to is pushed up, which leaves less probability mass for the other classes. Therefore, in image classification we do not need to divide positive and negative samples explicitly: the one-hot encoding of the label already does so implicitly.
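The effect of one training step on the prediction vector can be checked numerically. As a simplification, the sketch below applies the gradient step directly to the logits instead of to network weights. It relies on the standard result that, for softmax followed by cross-entropy, the gradient with respect to logit k is p_k - y_k. The starting logits and the learning rate are arbitrary.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def step_on_logits(logits, label, lr=0.5):
    """One gradient-descent step on the logits for softmax + cross-entropy."""
    probs = softmax(logits)
    # Gradient of the loss w.r.t. logit k is p_k - y_k, with y one-hot at `label`.
    grads = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    return [z - lr * g for z, g in zip(logits, grads)]

logits = [0.2, 1.0, -0.3]      # hypothetical scores; class 1 currently wins
label = 0                      # but the image is labeled as class 0
before = softmax(logits)
after = softmax(step_on_logits(logits, label))

others_before = sum(p for k, p in enumerate(before) if k != label)
others_after = sum(p for k, p in enumerate(after) if k != label)
```

The labeled class gains probability and the remaining classes lose probability in total, without any explicit notion of "negative sample" appearing in the code.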

In object detection, the unit of a positive or negative sample is no longer a whole image, as in classification, but a region within an image, so a single image yields many positive and negative samples. Each region is smaller than a full classification image, but there are so many of them that object detection is much slower than image classification. How do we obtain these regions (samples)? And how do we divide so many regions into positives and negatives? These are two important questions. For the first, a common approach is anchor-based: prior boxes (anchors) generated at each small cell of the image serve as the samples. For the second, the usual approach is to divide positives and negatives by their IoU with the ground-truth boxes, with different algorithms using different strategies. If an anchor is assigned as a positive sample, regressing from it yields a predicted box, which then takes part in the localization term of the loss, measured as a distance between the predicted box and the ground-truth box.
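One possible IoU-based assignment rule is sketched below. The thresholds (positive at IoU >= 0.5, negative below 0.4, ignored in between) and the label constants are assumptions made for this example; as noted above, each algorithm defines its own strategy.

```python
NEGATIVE = -1   # anchor is treated as background
IGNORE = -2     # anchor is neither positive nor negative and is skipped in the loss

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def assign_anchors(anchors, gt_boxes, pos_thr=0.5, neg_thr=0.4):
    """For each anchor, return the matched ground-truth index, NEGATIVE, or IGNORE."""
    labels = []
    for anchor in anchors:
        ious = [iou(anchor, gt) for gt in gt_boxes]
        if not ious:                       # image without objects: all background
            labels.append(NEGATIVE)
            continue
        best = max(range(len(ious)), key=ious.__getitem__)
        if ious[best] >= pos_thr:
            labels.append(best)            # positive sample for ground truth `best`
        elif ious[best] < neg_thr:
            labels.append(NEGATIVE)
        else:
            labels.append(IGNORE)
    return labels

anchors = [(10, 10, 50, 50), (4, 4, 44, 44), (30, 30, 70, 70), (100, 100, 140, 140)]
gt_boxes = [(12, 12, 52, 52)]
labels = assign_anchors(anchors, gt_boxes)
```

Only the first anchor overlaps the ground-truth box enough to become a positive sample. The second falls in the ignored band, and the last two are negatives.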

Note that there are three kinds of boxes:

Ground-truth box
Prior box (anchor)
Predicted box

To sum up, in object detection the positive samples are not the ground-truth boxes. The ground-truth box is the optimization target, like the one-hot encoded vector in image classification. The positive samples are the selected subset of prior boxes (anchors), like the images of a given class in image classification. The predicted box is what the model produces from an anchor, like the prediction vector in image classification, so the loss is computed between the predicted box and the ground-truth box. Of course, models such as YOLOv1 have no anchors, so the details differ somewhat.
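The relationship between the three boxes can be sketched with a center/size offset parameterization of the kind used in Faster R-CNN-style detectors. The parameterization, the plain L1 distance, and all function names are assumptions chosen for illustration, since the text does not commit to a specific encoding or loss.

```python
import math

def to_cxcywh(box):
    """Convert (x1, y1, x2, y2) to (center_x, center_y, width, height)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)

def decode(anchor, deltas):
    """Predicted box = anchor shifted and scaled by the offsets (tx, ty, tw, th)."""
    ax, ay, aw, ah = to_cxcywh(anchor)
    tx, ty, tw, th = deltas
    cx, cy = ax + tx * aw, ay + ty * ah
    w, h = aw * math.exp(tw), ah * math.exp(th)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def encode(anchor, gt):
    """Offsets that would map the anchor exactly onto the ground-truth box."""
    ax, ay, aw, ah = to_cxcywh(anchor)
    gx, gy, gw, gh = to_cxcywh(gt)
    return ((gx - ax) / aw, (gy - ay) / ah, math.log(gw / aw), math.log(gh / ah))

def l1_box_distance(pred, gt):
    """Simple localization distance between two boxes (sum of coordinate errors)."""
    return sum(abs(p - g) for p, g in zip(pred, gt))

anchor = (10.0, 10.0, 50.0, 50.0)          # positive sample (prior box)
gt = (12.0, 14.0, 56.0, 52.0)              # optimization target (ground-truth box)

untrained_pred = decode(anchor, (0.0, 0.0, 0.0, 0.0))   # zero offsets reproduce the anchor
ideal_pred = decode(anchor, encode(anchor, gt))         # ideal offsets reproduce the target

loss_untrained = l1_box_distance(untrained_pred, gt)
loss_ideal = l1_box_distance(ideal_pred, gt)
```

The anchor is only the starting point, the offsets are what the network outputs, and the loss compares the decoded predicted box with the ground-truth box.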

Backbone + Neck + Detection head modules:

Input: Image, Patches, Image Pyramid
Backbones: VGG16, ResNet-50, SpineNet, EfficientNet-B0/B7, CSPResNeXt50, CSPDarknet53, Swin Transformer
Neck:
- Additional blocks: SPP, ASPP, RFB, SAM
- Path-aggregation blocks: FPN, PAN, NAS-FPN, Fully-connected FPN, BiFPN, ASFF, SFAM
Heads:
- Dense Prediction (one-stage):
  - RPN, SSD, YOLO (v2-v5), RetinaNet (anchor based)
  - YOLOv1, CornerNet, CenterNet, MatrixNet, FCOS (anchor free)
- Sparse Prediction (two-stage):
  - Faster R-CNN, R-FCN, Mask R-CNN (anchor based)
  - RepPoints (anchor free)