A summary of the evolution of deep-learning-based object detection (continuously updated)

2022-07-07 20:18:00 【Breeze_】

The R-CNN pipeline

Use Selective Search to extract candidate boxes (region proposals), about 1000 to 2000 per image
Resize each candidate box to a fixed size and feed it into a CNN to extract features
Feed the features directly into an SVM classifier to get the classification result
Refine the box position further with bounding-box regression

SPPNet innovations

Combines spatial pyramid pooling with the CNN to support multi-scale input.
SPPNet's first contribution is to attach a spatial pyramid pooling (SPP) layer after the last convolutional layer, which guarantees that the input to the following fully connected layer has a fixed size. In other words, in an ordinary CNN the input image size is usually fixed (for example 224×224 pixels) and the output is a vector of fixed dimension. SPPNet adds an SPP layer to the ordinary CNN structure, so the input image can be of any size while the output is still a vector of fixed dimension. In short, a plain CNN needs fixed input to give fixed output; with SPP added, it accepts arbitrary input and still gives fixed output. (The ROI Pooling layer of Fast R-CNN is a single-level special case of the same idea.)
The SPP layer generally sits after the convolutional layers. The network input can then be of any scale: each pooling window in the SPP layer adapts its size to the input, so the SPP output is a vector of fixed dimension, which is then passed to the fully connected (FC) layer.
Convolutional features are extracted from the original image only once.
In R-CNN, each candidate box is first resized to a uniform size and then fed to the CNN separately, which is very inefficient.
SPPNet addresses this shortcoming: it runs the convolutions over the original image only once to get the feature map of the whole image, then finds the patch on the feature map that each candidate box maps to, and feeds that patch, as the convolutional feature of the candidate box, into the SPP layer and the layers after it to complete feature extraction.
So R-CNN computes convolutions for every region, while SPPNet computes them only once. This saves a great deal of computation: SPPNet is up to roughly a hundred times faster than R-CNN at test time.
More specifically: the SPP layer in Fast R-CNN (ROI Pooling) gives a consistent output dimension because the pooled bins are flattened and concatenated into a single vector (which is then fed to the FC layers). The SPP layer in YOLOv3 and later models keeps the output size consistent in a different way: it applies max pooling with several kernel sizes, each with stride s=1 and padding=k//2, so every branch keeps the width and height of its input (W_out = W + 2*(k//2) - k + 1 = W for odd k) and the branches can be concatenated along the channel dimension. The two serve different purposes: the former solves the arbitrary-input, fixed-output problem, while the latter aims to improve small-object detection ability, among other things.
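
The two kinds of SPP layer described above can be contrasted with a minimal pure-Python sketch on a single-channel feature map. The function names `spp_fixed_length` and `maxpool_same`, the pyramid levels (1, 2, 4), and the bin-boundary rounding are assumptions chosen for illustration, not taken from any particular implementation.

```python
import math


def spp_fixed_length(fmap, levels=(1, 2, 4)):
    """SPPNet-style pooling: max-pool a 2D map into n x n bins per level
    and concatenate everything into one flat vector of fixed length."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for n in levels:
        for i in range(n):
            # Floor for the bin start, ceil for the bin end, so bins cover the map.
            r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
            for j in range(n):
                c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
                out.append(max(fmap[r][c]
                               for r in range(r0, r1)
                               for c in range(c0, c1)))
    return out


def maxpool_same(fmap, k):
    """YOLO-style branch: stride-1 max pooling with padding k // 2 (odd k).
    Clipping the window at the border is equivalent to padding with -inf."""
    h, w = len(fmap), len(fmap[0])
    p = k // 2
    return [[max(fmap[rr][cc]
                 for rr in range(max(0, r - p), min(h, r + p + 1))
                 for cc in range(max(0, c - p), min(w, c + p + 1)))
             for c in range(w)]
            for r in range(h)]


small = [[r * 7 + c for c in range(7)] for r in range(5)]           # 5 x 7 map
large = [[(r * c) % 11 for c in range(13)] for r in range(9)]       # 9 x 13 map
print(len(spp_fixed_length(small)), len(spp_fixed_length(large)))   # same length
print(len(maxpool_same(small, 5)), len(maxpool_same(small, 5)[0]))  # same H, W
```

The first function always returns 1 + 4 + 16 = 21 values per channel regardless of the input size, while the second returns a map with the same height and width as its input.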

The Fast R-CNN pipeline

Use Selective Search to extract candidate boxes, about 1000 to 2000 per image
Compute a shared feature map for the whole image, and map each candidate box (ROI, region of interest) onto that shared feature map (New)
Note: the mapping rule is simple. Divide the box coordinates by the ratio between the input image size and the feature map size to get the box coordinates on the feature map
Use ROI Pooling to resize the features to a fixed size (New)
Feed the features into the following layers (FC) to extract new features
Train the classification and regression losses jointly under supervision (fully connected heads) (New)
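
The mapping rule in the note above can be sketched as follows. The stride of 16 (typical for a VGG16 backbone) and the rounding convention (floor for the top-left corner, ceil for the bottom-right corner) are assumptions; real implementations differ in how they round.

```python
import math


def map_roi_to_feature_map(box, stride=16):
    """Map an (x1, y1, x2, y2) box from image pixels to feature-map cells.

    `stride` is the ratio between input image size and feature map size
    (assumed to be 16 here)."""
    x1, y1, x2, y2 = box
    return (math.floor(x1 / stride), math.floor(y1 / stride),
            math.ceil(x2 / stride), math.ceil(y2 / stride))


print(map_roi_to_feature_map((32, 48, 200, 310)))  # (2, 3, 13, 20)
```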

The ROI Pooling operation
1. Map the ROI to the corresponding position on the feature map, according to the input image
2. Divide the mapped region into sections of equal size (the number of sections equals the output dimension)
3. Apply max pooling to each section, giving features of dimension batch×channel×W×H
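
Steps 2 and 3 can be illustrated with a minimal single-channel sketch. It assumes the ROI is already expressed in feature-map cells with exclusive right and bottom edges; the function name `roi_max_pool` and the way section boundaries are rounded are illustrative choices.

```python
import math


def roi_max_pool(fmap, roi, out_h=2, out_w=2):
    """Max-pool the region `roi` = (x1, y1, x2, y2) of a 2D feature map
    into an out_h x out_w grid (x2 and y2 are exclusive)."""
    x1, y1, x2, y2 = roi
    rh, rw = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        r0 = y1 + (i * rh) // out_h
        r1 = y1 + math.ceil((i + 1) * rh / out_h)
        row = []
        for j in range(out_w):
            c0 = x1 + (j * rw) // out_w
            c1 = x1 + math.ceil((j + 1) * rw / out_w)
            # One output value per section: the maximum inside it.
            row.append(max(fmap[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


fmap = [[r * 8 + c for c in range(8)] for r in range(6)]  # 6 x 8 feature map
print(roi_max_pool(fmap, (1, 1, 6, 5)))  # [[19, 21], [35, 37]]
```

Whatever the size of the ROI, the output is always out_h × out_w, which is what lets the following FC layers take a fixed-size input.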

The Faster R-CNN pipeline

Compute the feature map of the whole image
Feed the features into the RPN (region proposal network), which returns a set of candidate boxes (an objectness score plus coordinates for each of the k anchor boxes at every location); the RPN itself needs regression training (New)
Use ROI Pooling to resize the features to a fixed size
Feed the features into the following layers (FC) to extract new features
Train the classification and regression losses jointly under supervision (fully connected heads)
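
The "k anchor boxes" in the RPN step can be made concrete with a small sketch. It assumes the commonly used setting of 3 scales and 3 aspect ratios (so k = 9); the function name `make_anchors` and the default values are illustrative, not read from any specific code base.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchors centred at (cx, cy).

    Each anchor is (x1, y1, x2, y2). `ratio` is height / width, and the
    area of every anchor at a given scale is kept at scale * scale."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


for a in make_anchors(100, 100)[:3]:
    print([round(v, 1) for v in a])
```

For every one of these anchors the RPN predicts an objectness score and four coordinate offsets.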

SSD innovations

Uses VGG16 as the feature extractor (the same CNN used in Faster R-CNN), replaces the trailing fully connected layers with convolutional layers, appends extra convolutional layers, and finally performs detection directly with convolutions.
Unlike Faster R-CNN, which places anchors only on the last feature layer, SSD places default boxes on several feature layers, which gives default boxes at different scales
On each cell of a feature map, default boxes with different aspect ratios are used, typically chosen from {1,2,3,1/2,1/3}; sometimes one more box with aspect ratio 1 but a special scale is added
To keep positive and negative samples as balanced as possible (typically about 1:3 positives to negatives), SSD uses hard negative mining: the negative samples are ranked by their predicted background confidence, and the top-k with the lowest background confidence (the hardest ones) are kept as training negatives.
Q1: How are the default boxes set up? Q2: How are the prior boxes matched? Q3: How are the predictions obtained?
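
The hard negative mining step can be sketched as below. This is a simplified illustration, assuming one background confidence per default box and a boolean positive/negative flag; the function name and argument names are hypothetical.

```python
def hard_negative_mining(bg_confidence, is_positive, neg_pos_ratio=3):
    """Return the indices of the negatives kept for training.

    bg_confidence[i] is the predicted probability that box i is background.
    A negative with a LOW background confidence is a hard negative, because
    the classifier is wrong about it and its loss is large."""
    num_pos = sum(is_positive)
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    negatives.sort(key=lambda i: bg_confidence[i])  # hardest first
    return negatives[: neg_pos_ratio * num_pos]


bg_conf = [0.9, 0.1, 0.8, 0.3, 0.95, 0.2, 0.5]
is_pos = [False, True, False, False, False, False, False]
print(hard_negative_mining(bg_conf, is_pos))  # [5, 3, 6]
```

With one positive and a 1:3 ratio, only the three hardest negatives contribute to the loss; the easy background boxes are ignored.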

YOLOv1 (treats the detection task as a regression task)

Network structure: 24 convolutional layers + 2 fully connected layers (predicting box locations + class probabilities)
Input: an image of size 1x3x448x448
Output: a 7 × 7 × 30 tensor, where 30=20+(4+1)*2: 20 is the number of classes, 4 is the box position, and 1 is the confidence score
Loss function: four parts, namely coordinate prediction, confidence prediction for bounding boxes that contain an object (higher weight), confidence prediction for bounding boxes that do not contain an object (lower weight), and class prediction; all of them use L2 loss
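
The output size arithmetic can be checked with a short sketch. The shape formula follows directly from the text above; the memory layout used in `split_cell` (class scores first, then one (x, y, w, h, confidence) group per box) is only an assumption for illustration, since the text does not specify the ordering.

```python
def yolo_v1_output_shape(S=7, B=2, C=20):
    """S x S grid cells, each predicting B boxes (4 coords + 1 confidence)
    and C class probabilities."""
    return (S, S, C + (4 + 1) * B)


def split_cell(vec, B=2, C=20):
    """Split one cell's prediction vector into class scores and boxes.
    Assumed layout: [C class scores | box 1 (x, y, w, h, conf) | box 2 ...]."""
    assert len(vec) == C + 5 * B
    class_probs = vec[:C]
    boxes = [tuple(vec[C + 5 * b: C + 5 * (b + 1)]) for b in range(B)]
    return class_probs, boxes


print(yolo_v1_output_shape())  # (7, 7, 30)
```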

YOLOv2 innovations

Uses DarkNet as the backbone
Introduces the anchor mechanism, which avoids the information loss in YOLOv1 caused by regressing results directly from fully connected layers; anchor sizes are obtained with K-means clustering
Introduces Batch Normalization, which helps the model converge faster and helps prevent overfitting
Uses high-resolution network input
With anchors, the target position is predicted directly as a box center offset relative to the grid cell
Following SSD, uses multi-scale feature maps for detection
Multi-scale training; prediction is better at larger input scales
Removes the last convolutional layer, the global average pooling layer, and the softmax layer; adds three $3\times 3 \times 1024$ convolutional layers and one passthrough layer; and finally uses a $1\times 1$ convolutional layer to output the predictions
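
The direct location prediction with anchors can be sketched with the usual YOLOv2 decoding: the center is a sigmoid offset inside the grid cell, and the width and height scale the anchor prior exponentially. The function and variable names are illustrative, and all quantities are assumed to be in grid-cell units.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, prior):
    """Decode raw outputs t = (tx, ty, tw, th) for the grid cell at
    (cx, cy) with anchor prior (pw, ph).

    b_x = sigmoid(tx) + cx,   b_y = sigmoid(ty) + cy
    b_w = pw * exp(tw),       b_h = ph * exp(th)"""
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = prior
    return (sigmoid(tx) + cx, sigmoid(ty) + cy,
            pw * math.exp(tw), ph * math.exp(th))


print(decode_box((0.0, 0.0, 0.0, 0.0), cell=(3, 4), prior=(1.5, 2.0)))
# (3.5, 4.5, 1.5, 2.0)
```

Because the sigmoid keeps the center inside its own cell, the prediction stays bounded, which is the main point of predicting offsets rather than unconstrained coordinates.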

YOLOv3 innovations

Uses the new backbone Darknet-53 (introduces residual blocks, 53 convolutional layers)
Uses an FPN for multi-scale prediction
Uses logistic regression instead of Softmax as the classifier
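
The effect of replacing Softmax with independent logistic classifiers can be seen in a small sketch. The logits below are made up for illustration: two classes that may both apply to the same box (for example overlapping labels) get high scores under independent sigmoids, whereas softmax forces them to compete.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def independent_sigmoids(logits):
    """One logistic classifier per class; the scores need not sum to 1."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


logits = [3.0, 2.5, -4.0]  # hypothetical class logits for one box
print([round(p, 3) for p in softmax(logits)])
print([round(p, 3) for p in independent_sigmoids(logits)])
```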

YOLOv4 innovations

Input side: new data augmentation such as CutMix and Mosaic
Backbone network: CSPDarkNet-53, Mish activation function, DropBlock
Neck network: spatial pyramid pooling (SPP), path aggregation (PAN), feature pyramid network (FPN)
Head network: CIoU loss, DIoU_NMS
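
The IoU variants named in the head can be sketched in plain Python. This follows the standard definitions (DIoU subtracts the normalized center distance from IoU; CIoU additionally penalizes aspect-ratio mismatch); the function names and the small `eps` guard are illustrative choices. Boxes are (x1, y1, x2, y2) with positive width and height.

```python
import math


def iou(a, b):
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def diou(a, b):
    """IoU minus (squared center distance / squared enclosing-box diagonal)."""
    d2 = (((a[0] + a[2]) - (b[0] + b[2])) / 2) ** 2 \
        + (((a[1] + a[3]) - (b[1] + b[3])) / 2) ** 2
    cw = max(a[2], b[2]) - min(a[0], b[0])
    ch = max(a[3], b[3]) - min(a[1], b[1])
    return iou(a, b) - d2 / (cw ** 2 + ch ** 2)


def ciou_loss(pred, gt, eps=1e-9):
    """1 - CIoU, where CIoU = DIoU - alpha * v and v measures the
    difference in aspect ratio between the two boxes."""
    w, h = pred[2] - pred[0], pred[3] - pred[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    v = (4 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(w / h)) ** 2
    alpha = v / ((1 - iou(pred, gt)) + v + eps)
    return 1 - (diou(pred, gt) - alpha * v)


print(round(iou((0, 0, 2, 2), (1, 1, 3, 3)), 4))
print(round(ciou_loss((0, 0, 2, 2), (1, 1, 3, 3)), 4))
```

DIoU_NMS uses the same `diou` value in place of plain IoU when deciding whether to suppress a neighboring box.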

RetinaNet innovations:

The authors studied one-stage algorithms, identified the class imbalance problem, and proposed Focal Loss, an improvement to the loss function; the one-stage network combined with Focal Loss is RetinaNet
What is class imbalance?
Answer: negative samples far outnumber positive samples. Regions that contain an object (positive samples) are few, while regions that do not contain an object (negative samples) are many. A detection algorithm generates a large number of bboxes early on, yet an ordinary image contains only a few objects at most, which means most bboxes belong to the background. Simply put, the number of bboxes explodes. Because background bboxes are so numerous, a classifier that blindly labels every bbox as background can still reach a very high accuracy. Training of the classifier therefore fails, and the detection accuracy is naturally low.
Definition of Focal Loss: it introduces a modulating factor $(1-p_t)^\gamma$ into the cross-entropy loss, giving $FL(p_t) = -(1-p_t)^\gamma \log(p_t)$, where $p_t$ is the predicted probability of the true class and reflects how hard the sample is to classify
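
A minimal sketch of the binary focal loss for a single prediction is shown below. It implements only the modulating factor discussed above and omits the optional class-balancing weight; the function name and the default gamma of 2.0 are illustrative choices.

```python
import math


def focal_loss(p, y, gamma=2.0):
    """Focal loss for one binary prediction.

    p is the predicted probability of the positive class, y is 0 or 1.
    p_t is the probability assigned to the true class; gamma = 0 reduces
    the expression to ordinary cross-entropy."""
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


easy = focal_loss(0.9, 1)   # well-classified sample, p_t = 0.9
hard = focal_loss(0.1, 1)   # badly classified sample, p_t = 0.1
print(easy, hard, hard / easy)
```

With gamma = 2, the easy sample's loss is scaled by (1 - 0.9)^2 = 0.01 while the hard sample's loss is scaled by 0.81, so the many easy background boxes no longer dominate training.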