[Single Shot MultiBox Detector (SSD): single-stage, multi-box object detection]
2022-07-05 11:42:00 【Network starry sky (LUOC)】
Project source code :
SSD characteristics
• Uniform dense sampling
• Sampling on feature maps of different scales
• Works well for small-object detection
• Fast prediction
• Hard to train (positive and negative samples are extremely imbalanced)
The following figure compares the efficiency of several network structures. Generally speaking, an mAP of 75% to 80% is good enough.
Comparison of algorithms
Description of the SSD algorithm
Input: a 300×300×3 image.
Backbone network: VGG16 plus 10 additional convolution layers.
Six feature-map outputs: feature maps are extracted from 6 different depths of the backbone, and each of them is passed through two output branches. In one branch the feature map goes through 1 convolution layer and directly outputs the coordinate regression values for 4 or 6 anchor boxes. In the other branch the feature map goes through 1 convolution layer and directly outputs, for the same 4 or 6 anchor boxes, the confidences of 21 classes (20 object classes + background).
Final output: the outputs from the 6 scales are sent to NMS to obtain unique prediction boxes.
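The size of these outputs can be checked with a little arithmetic. The sketch below is only an illustration: the feature-map side lengths (38, 19, 10, 5, 3, 1) and the layer names are assumed to be the usual SSD300 values and are not stated in this article, while the 4-or-6 anchor counts and the 21 classes follow the description above.

```python
# Assumed SSD300 layout: (layer name, feature-map side length, anchors per location).
FEATURE_MAPS = [
    ("conv4_3", 38, 4),
    ("conv7", 19, 6),
    ("conv8_2", 10, 6),
    ("conv9_2", 5, 6),
    ("conv10_2", 3, 4),
    ("conv11_2", 1, 4),
]
N_CLASSES = 21  # 20 object classes + background


def count_outputs(feature_maps, n_classes):
    """Return the total number of anchors and, per layer, the channel
    counts of the localization and confidence prediction convolutions."""
    total = 0
    heads = []
    for name, side, n_anchors in feature_maps:
        total += side * side * n_anchors
        # 4 offsets per anchor for localization, n_classes scores per anchor for confidence
        heads.append((name, n_anchors * 4, n_anchors * n_classes))
    return total, heads


total_priors, head_channels = count_outputs(FEATURE_MAPS, N_CLASSES)
print(total_priors)
for name, loc_channels, conf_channels in head_channels:
    print(name, loc_channels, conf_channels)
```

Under these assumptions the total comes out at 8732, the number of default boxes used in the rest of this article.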
Advantages and disadvantages
Advantage: compared with its contemporary YOLOv1, SSD uses multi-scale features for object detection, which greatly improves accuracy.
Disadvantage: SSD's recognition of small objects is still poor and does not reach the level of Faster R-CNN. This is mainly because small objects are usually trained with the anchors of the lower-level feature maps (small objects have a higher IoU with those anchors), and the lower-level features are not nonlinear enough to be trained to sufficient accuracy.
Backbone network
Its backbone is a modified VGG network.
Positive and negative sample selection:
An anchor (prior) box whose IoU with a GT box is greater than 0.5 is a candidate positive sample; all others are candidate negative samples (and there are a huge number of them).
How SSD turns candidate positive and negative samples into actual training samples, and how it ensures a 1:3 positive-to-negative ratio:
Example: suppose that among the 8732 default boxes, FindMatches yields P candidate positive samples, so there are 8732−P candidate negative samples. Sort the prior boxes by prediction loss (the classification loss) in descending order and select the top M. If a of the P candidate positives are not among these M prior boxes (their loss is small, they are already very close to the GT box, and the network does not need to learn from them), remove these a boxes from the candidate positive set. If b of the 8732−P candidate negatives are among the M prior boxes (their loss is large, they differ greatly from the GT box, and they are hard negatives), use these b candidates as the actual negative samples. In other words, the easily recognized positives are discarded and the hard negatives are kept, forming a prior-box sample set with a 1:3 ratio.
Other details
In SSD, why is normalization applied only to the conv4_3 layer?
In the loss calculation, what is the role of the variance parameters?
With the summary above in place, let's analyze its characteristics:
First, it reuses the anchor box mechanism of Faster R-CNN.
Default boxes of various sizes are extracted on the feature maps; like anchors, they are a series of fixed-size boxes. The scale of the boxes differs from one feature map to another.
Second, why is prediction on multi-scale feature maps needed?
Because small objects are easier to detect on large feature maps, which have small receptive fields; conversely, small feature maps have large receptive fields, which favors detecting large objects.
The figure above shows an example for one feature-map layer. Each anchor point of this layer has 4 anchor boxes; loc is the coordinate offset and conf is the class confidence, so each anchor box yields 20+1+4 = 25 predicted values.
Third, a fully convolutional network structure (Convolutional predictors for detection)
• Drawing on the advantages of R-FCN, all fully connected layers are replaced so that the network is fully convolutional.
• Convolution is used to extract the candidate box features (offset box + score).
Pay particular attention: when an anchor point set on the feature map is mapped back to the original image, its coordinate is (coordinate on the feature map) × (downsampling factor) + (downsampling factor) / 2. This prevents an anchor point at (0,0) from staying at (0,0), which is what multiplying by the downsampling factor alone would give.
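The mapping from a feature-map cell back to the original image can be sketched as follows. The function name and the stride of 8 are illustrative assumptions, not taken from the project source.

```python
def anchor_center(col, row, stride):
    """Map feature-map cell (col, row) to its centre in input-image pixels.

    Adding stride / 2 moves the point from the cell's top-left corner to
    its centre, so the cell (0, 0) no longer maps to pixel (0, 0).
    """
    return (col * stride + stride / 2, row * stride + stride / 2)


stride = 8  # assumed downsampling factor, for illustration only
print(anchor_center(0, 0, stride))
print(anchor_center(2, 1, stride))
```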
The following figure shows how anchor points and anchor boxes are calculated:
I divided the first figure into three parts, as shown:
Next, the specific changes made to the VGG network:
1. The base network is VGG, with the FC6 and FC7 layers converted to convolution layers, and the original max-pooling layer pool5 changed from 2x2-s2 to 3x3-s1 (it no longer downsamples as before; this amounts to a fusion of features). The feature map after pool5 therefore stays larger, which makes the receptive field of the following layers smaller, that is, a point corresponds to a smaller area of the original image.
2. To preserve the receptive field while reusing the original FC6 and FC7 model parameters, the atrous algorithm is used to enlarge the receptive field, that is, dilated (atrous) convolution. (In the figure above, dilated convolution is applied from fc6 to fc7.)
Dilated convolution: to enlarge the receptive field without increasing the number of parameters, zeros are inserted into the convolution kernel, as shown in the figure below. Some information is lost, however.
Dilated kernel size = dilation rate × (original kernel size − 1) + 1
• In Conv6 (fc6) the kernel size is 3, pad is 6 and dilation is 6, so the effective kernel size is 13; pad 6 guarantees that the output feature map keeps its size, still 19x19.
• Dilated convolution (Dilated Convolution) addresses the following problems:
• The parameters of ordinary sampling layers (such as pooling) cannot be learned;
• Internal data structure and spatial hierarchy information are lost;
• Small-object information cannot be reconstructed.
• The dilated (atrous) convolution API in TensorFlow: tf.nn.atrous_conv2d(value, filters, rate, padding, name=None)
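The kernel-size rule and the Conv6 numbers above can be verified with a few lines of arithmetic. This is a minimal sketch; the function names are made up for illustration, and the output-size expression is the standard convolution formula rather than anything specific to this project.

```python
def effective_kernel(kernel, dilation):
    """Size of a dilated kernel: dilation * (kernel - 1) + 1."""
    return dilation * (kernel - 1) + 1


def conv_output_size(size, kernel, pad, dilation=1, stride=1):
    """Spatial output size of a convolution along one dimension."""
    k_eff = effective_kernel(kernel, dilation)
    return (size + 2 * pad - k_eff) // stride + 1


# Conv6 (fc6): kernel 3, pad 6, dilation 6 on a 19x19 feature map
print(effective_kernel(3, 6))
print(conv_output_size(19, kernel=3, pad=6, dilation=6))
```

With these values the effective kernel is 13 and the output side length stays at 19, matching the text.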
A few explanations of the framework:
• After the base network, convolutional feature maps from different levels are used to extract default boxes. For the feature map of each layer, two parallel 3x3 convolutions extract the position information (offset box) and the confidence information respectively; the loss function is built from the default boxes and the ground truth boxes.
• When extracting from Conv4_3, an L2 norm operation is first applied to the feature map (before the 3x3 prediction convolutions). Because this layer sits early in the network, this prevents its values from being too large.
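As a rough illustration of the L2 norm step on Conv4_3, the sketch below normalizes the channel vector at a single spatial position and rescales it. The scale of 20 is assumed here as the initial value of a learnable factor, as described in the SSD paper; it does not come from this article, and a real implementation would operate on whole tensors.

```python
import math


def l2_normalize(channels, scale=20.0, eps=1e-10):
    """L2-normalize the channel values at one spatial position, then rescale.

    `scale` stands in for the learnable rescaling factor (assumed value).
    """
    norm = math.sqrt(sum(v * v for v in channels)) + eps
    return [scale * v / norm for v in channels]


features = [3.0, 4.0]  # toy channel vector at one position
normalized = l2_normalize(features)
print(normalized)
```

Whatever the magnitude of the input, the output vector has a length close to `scale`, which keeps Conv4_3 on a footing comparable with the deeper layers.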
In a CNN, the deeper the layer, the smaller the feature map becomes. This design mainly serves two purposes:
• Reducing computation and memory requirements;
• Giving the finally extracted feature map a degree of translation and scale invariance, which meets the needs of classification.
• In object detection, objects of different scales often have to be handled. Some networks convert the image to several scales, process each independently through the network and then merge the results. In practice, however, using feature maps from different levels of a single network achieves the same effect, and because the parameters are shared across all object scales, computation is faster.
Some structural details:
In the SSD structure, the default boxes do not need to correspond to the receptive field of each layer; instead, boxes of different scales are made responsible for specific regions of the image and specific object sizes.
How the prior box sizes are calculated:
Prior boxes are defined mainly by scale (size) and aspect ratio. The scale follows a linearly increasing rule: as the feature map gets smaller, the prior box scale increases linearly.
This corresponds to the figure below: the ratio of the prior box size to the original image ranges from 0.2 to 0.9 over the five layers. The figure below shows the prior box sizes calculated for the network:
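The linear rule can be sketched as below. It assumes evenly spaced scales between 0.2 and 0.9 over five layers, as stated in the text; actual implementations may round the step differently, so the exact values are illustrative only.

```python
def prior_scales(n_layers, s_min=0.2, s_max=0.9):
    """Evenly spaced prior-box scales, relative to the input image size."""
    if n_layers == 1:
        return [s_min]
    step = (s_max - s_min) / (n_layers - 1)
    return [s_min + step * k for k in range(n_layers)]


scales = prior_scales(5)
print([round(s, 3) for s in scales])
```

Multiplying each scale by the input side length (300 for SSD300) gives the box size in pixels.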
After calculating the sizes, the width-to-height ratios must also be calculated:
For the aspect ratio, the paper suggests the values [1, 2, 3, 1/2, 1/3]. The three layers Conv4_3, Conv10_2 and Conv11_2 use only 4 prior boxes each, so they do not use the ratios 1:3 and 3:1.
In addition to the 5 aspect ratios above, one more prior box with a special scale and an aspect ratio of 1 is introduced, so that the final set contains two square prior boxes (aspect ratio 1) of different sizes.
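A minimal sketch of how one scale and a list of aspect ratios turn into box shapes. The width and height formulas (scale × √ratio and scale / √ratio) and the extra square box at √(s_k · s_k+1) follow the SSD paper; the function name and the sample scales are assumptions for illustration.

```python
import math


def prior_shapes(scale, next_scale, aspect_ratios):
    """Return (width, height) pairs, relative to the image, for one layer."""
    shapes = [(scale * math.sqrt(ar), scale / math.sqrt(ar)) for ar in aspect_ratios]
    # Extra square box whose scale lies between this layer's and the next layer's
    extra = math.sqrt(scale * next_scale)
    shapes.append((extra, extra))
    return shapes


# A layer with 4 boxes (no 1:3 or 3:1) and a layer with 6 boxes
four_boxes = prior_shapes(0.2, 0.375, [1, 2, 1 / 2])
six_boxes = prior_shapes(0.375, 0.55, [1, 2, 3, 1 / 2, 1 / 3])
print(len(four_boxes), len(six_boxes))
```

Every box derived from an aspect ratio keeps the same area as the square box of that scale; only the extra box is larger.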
Finally, all prior box sizes are calculated:
Now the composition of the positive and negative samples:
Hard Negative Mining:
• After the prior boxes are generated, many of them match a ground truth box, but far more do not. That is, the number of negative boxes far exceeds the number of positive boxes, which makes the data extremely imbalanced and training hard to converge. SSD therefore uses the hard negative mining approach introduced in R-CNN: the default boxes that are negative are sorted by their forward-pass loss, the N negative boxes with the largest loss take part in training, and the final ratio of positive to negative samples is kept at about 1:3.
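The selection step can be sketched in plain Python as follows. This is a simplified illustration for one image, with made-up loss values; the function name is hypothetical and not part of the project code.

```python
def hard_negative_mining(losses, is_positive, neg_pos_ratio=3):
    """Return the indices of the negative priors kept for training.

    losses:      per-prior confidence loss from the forward pass
    is_positive: per-prior flag, True if the prior matched a ground truth box
    """
    n_pos = sum(is_positive)
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    # Hardest negatives first: largest loss at the front
    negatives.sort(key=lambda i: losses[i], reverse=True)
    return sorted(negatives[: neg_pos_ratio * n_pos])


losses = [0.1, 2.0, 0.3, 1.5, 0.05, 0.9, 0.7, 0.2]
is_positive = [False, True, False, False, False, False, False, False]
kept = hard_negative_mining(losses, is_positive)
print(kept)
```

With one positive prior and a ratio of 3, only the three negatives with the largest loss are kept.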
Data augmentation is then applied:
• Horizontal flip (Horizontal Flip)
• Random crop and color distortion (Random Crop & Color Distortion)
• Random patch sampling (Randomly sample a patch)
Criteria for labeling training samples:
• Positive sample: if a prior box matches a ground truth box, that prior box is a positive sample;
• Negative sample: if a prior box matches none of the ground truth boxes, that prior box is a negative sample.
• NOTE: hard negative mining selects the samples with large loss as negative samples, with a positive-to-negative ratio of 1:3;
The main principles for matching prior boxes with ground truth boxes in SSD:
• 1. For each ground truth box in an image, find the prior box with the largest IoU and match that prior box to it.
• 2. For the remaining unmatched prior boxes, if the IoU with some ground truth box is greater than a threshold (usually 0.5), the prior box is also matched to that ground truth box (the GT box with the largest IoU is chosen). This means one ground truth box may be matched to several prior boxes, which is fine.
• 3. If a prior box has an IoU above the threshold with several ground truth boxes (or is the largest-IoU prior box for more than one of them), it is matched only to the ground truth box with the largest IoU.
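The three matching rules above can be sketched as follows. This is a simplified, pure-Python illustration with hypothetical function names; it assumes at least one ground truth box and boxes given as (x_min, y_min, x_max, y_max).

```python
def iou(a, b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_priors(priors, gt_boxes, threshold=0.5):
    """Return, for each prior, the index of its matched GT box or -1 (negative)."""
    matches = [-1] * len(priors)
    # Rules 2 and 3: a prior takes its best-overlapping GT box if the IoU exceeds the threshold
    for p, prior in enumerate(priors):
        overlaps = [iou(prior, gt) for gt in gt_boxes]
        best = max(range(len(gt_boxes)), key=lambda g: overlaps[g])
        if overlaps[best] > threshold:
            matches[p] = best
    # Rule 1: every GT box claims its best prior, whatever the IoU
    for g, gt in enumerate(gt_boxes):
        best_prior = max(range(len(priors)), key=lambda p: iou(priors[p], gt))
        matches[best_prior] = g
    return matches


priors = [(0, 0, 10, 10), (5, 5, 15, 15), (20, 20, 30, 30), (0, 0, 4, 4)]
gt_boxes = [(1, 1, 10, 10), (22, 22, 40, 40)]
print(match_priors(priors, gt_boxes))
```

In this toy example the second ground truth box overlaps no prior by more than 0.5, yet rule 1 still gives it one positive prior.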
Finally, we come to the last part …
Loss function
In SSD, the loss function is defined as the weighted sum of the localization error (localization loss, loc) and the confidence error (confidence loss, conf).
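Written out in the form given in the SSD paper, where N is the number of matched (positive) prior boxes, x the match indicators, c the class confidences, l the predicted boxes, g the ground truth boxes and α the weight between the two terms (the `alpha` argument of the code in this section is assumed to play that role):

```latex
L(x, c, l, g) = \frac{1}{N} \Bigl( L_{\mathrm{conf}}(x, c) + \alpha \, L_{\mathrm{loc}}(x, l, g) \Bigr)
```

When N = 0, the paper sets the loss to 0.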
import torch
from torch import nn

# Assumption: the box helpers below live in the project's own utils module, which this article does not show.
from utils import cxcy_to_xy, xy_to_cxcy, cxcy_to_gcxgcy, find_jaccard_overlap

# Assumption: a module-level device, because the original code uses `device` without defining it.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class MultiBoxLoss(nn.Module):
    """
    MultiBox loss, a loss function for object detection.

    It combines:
    (1) a localization loss for the predicted box locations, and
    (2) a confidence loss for the predicted class scores.
    """

    def __init__(self, priors_cxcy, threshold=0.5, neg_pos_ratio=3, alpha=1.):
        super(MultiBoxLoss, self).__init__()
        self.priors_cxcy = priors_cxcy  # the 8732 prior boxes
        # Convert the priors from center-size coordinates (c_x, c_y, w, h) to boundary coordinates (x_min, y_min, x_max, y_max)
        self.priors_xy = cxcy_to_xy(priors_cxcy)
        self.threshold = threshold
        self.neg_pos_ratio = neg_pos_ratio
        self.alpha = alpha

        self.smooth_l1 = nn.SmoothL1Loss()  # Smooth L1, the localization loss used by SSD
        self.cross_entropy = nn.CrossEntropyLoss(reduction='none')  # keep one loss value per prior

    def forward(self, predicted_locs, predicted_scores, boxes, labels):
        """
        Forward propagation.

        :param predicted_locs: predicted locations/boxes w.r.t. the 8732 prior boxes, a tensor of size (N, 8732, 4)
        :param predicted_scores: class scores for each encoded location/box, a tensor of size (N, 8732, n_classes)
        :param boxes: ground truth object bounding boxes in boundary coordinates, a list of N tensors
        :param labels: ground truth object labels, a list of N tensors
        :return: multibox loss, a scalar
        """
        batch_size = predicted_locs.size(0)  # N, the size of the first dimension
        n_priors = self.priors_cxcy.size(0)  # 8732 prior boxes
        n_classes = predicted_scores.size(2)  # size of the third dimension

        assert n_priors == predicted_locs.size(1) == predicted_scores.size(1)

        true_locs = torch.zeros((batch_size, n_priors, 4), dtype=torch.float).to(device)  # (N, 8732, 4)
        true_classes = torch.zeros((batch_size, n_priors), dtype=torch.long).to(device)  # (N, 8732)

        # For each image
        for i in range(batch_size):
            n_objects = boxes[i].size(0)

            # Compute the IoU between every object and every prior
            overlap = find_jaccard_overlap(boxes[i], self.priors_xy)  # (n_objects, 8732)

            # For each prior, find the object with the largest overlap (IoU value and object index)
            overlap_for_each_prior, object_for_each_prior = overlap.max(dim=0)  # (8732)

            # We don't want an object to end up unrepresented among the positive (non-background) priors:
            # 1. An object might not be the best object for any prior, and so is absent from object_for_each_prior.
            # 2. All priors assigned to an object might fall below the threshold (0.5) and be labeled background.

            # To remedy this, first find the prior with the largest overlap for each object.
            _, prior_for_each_object = overlap.max(dim=1)  # (n_objects)

            # Then assign each object to its maximum-overlap prior. (This fixes 1.)
            object_for_each_prior[prior_for_each_object] = torch.arange(n_objects).to(device)

            # To make sure these priors qualify, artificially give them an overlap greater than 0.5. (This fixes 2.)
            overlap_for_each_prior[prior_for_each_object] = 1.

            # Label for each prior
            label_for_each_prior = labels[i][object_for_each_prior]  # (8732)
            # Priors whose overlap with their object is below the threshold become background (no object)
            label_for_each_prior[overlap_for_each_prior < self.threshold] = 0  # (8732)

            # Store
            true_classes[i] = label_for_each_prior

            # Encode the center-size object coordinates into the form the predicted boxes are regressed to
            true_locs[i] = cxcy_to_gcxgcy(xy_to_cxcy(boxes[i][object_for_each_prior]), self.priors_cxcy)  # (8732, 4)

        # Identify the positive (object / non-background) priors
        positive_priors = true_classes != 0  # (N, 8732)

        # LOCALIZATION LOSS
        # Computed only over the positive (non-background) priors
        loc_loss = self.smooth_l1(predicted_locs[positive_priors], true_locs[positive_priors])  # (), scalar

        # Note: indexing with a boolean mask flattens the tensor when the mask spans multiple dimensions (N & 8732)
        # So, if predicted_locs has the shape (N, 8732, 4), predicted_locs[positive_priors] will have (total positives, 4)

        # CONFIDENCE LOSS
        # Computed over the positive priors and the most difficult (hardest) negative priors in each image
        # That is, FOR EACH IMAGE,
        # we take the hardest (neg_pos_ratio * n_positives) negative priors, i.e. those with the largest loss
        # This is called Hard Negative Mining: it concentrates on the hardest negatives in each image and minimizes the pos/neg imbalance

        # Number of positive and hard-negative priors per image
        n_positives = positive_priors.sum(dim=1)  # (N)
        n_hard_negatives = self.neg_pos_ratio * n_positives  # (N)

        # First, find the loss for all priors
        conf_loss_all = self.cross_entropy(predicted_scores.view(-1, n_classes), true_classes.view(-1))  # (N * 8732)
        conf_loss_all = conf_loss_all.view(batch_size, n_priors)  # (N, 8732)

        # We already know which priors are positive
        conf_loss_pos = conf_loss_all[positive_priors]  # (sum(n_positives))

        # Next, find which priors are hard-negative
        # To do this, sort ONLY the negative priors in each image in order of decreasing loss and take the top n_hard_negatives
        conf_loss_neg = conf_loss_all.clone()  # (N, 8732)
        conf_loss_neg[positive_priors] = 0.  # (N, 8732), positive priors are ignored (never in top n_hard_negatives)
        conf_loss_neg, _ = conf_loss_neg.sort(dim=1, descending=True)  # (N, 8732), sorted by decreasing hardness
        hardness_ranks = torch.arange(n_priors).unsqueeze(0).expand_as(conf_loss_neg).to(device)  # (N, 8732)
        hard_negatives = hardness_ranks < n_hard_negatives.unsqueeze(1)  # (N, 8732)
        conf_loss_hard_neg = conf_loss_neg[hard_negatives]  # (sum(n_hard_negatives))

        # As in the paper, average over the positive priors only, although the sum covers both positive and hard-negative priors
        conf_loss = (conf_loss_hard_neg.sum() + conf_loss_pos.sum()) / n_positives.sum().float()  # (), scalar

        # TOTAL LOSS
        return conf_loss + self.alpha * loc_loss
The prediction process
For each prediction box, first determine its class (the one with the highest confidence) and its confidence value from the class confidences, and filter out the boxes that belong to the background. Then filter out the boxes whose confidence is below a confidence threshold (such as 0.5). The remaining boxes are decoded: the default (prior) box and the predicted offsets are combined to recover the true position parameters (after decoding, a clip is usually needed to keep the box from extending beyond the image). After decoding, the boxes are generally sorted by confidence in descending order and only the top-k (such as 400) are kept. Finally, the NMS algorithm is run to filter out boxes with large overlap. The boxes that remain are the detection result.
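The filtering steps after decoding can be sketched as below for the boxes of a single class. This is an illustrative pure-Python version, not the project's implementation: the function names are hypothetical, the boxes are assumed to be already decoded to (x_min, y_min, x_max, y_max) pixels, and the NMS threshold of 0.45 is an assumed value.

```python
def iou(a, b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def clip_box(box, width, height):
    """Keep a box inside the image."""
    x0, y0, x1, y1 = box
    return (max(0.0, x0), max(0.0, y0), min(float(width), x1), min(float(height), y1))


def postprocess(boxes, scores, width, height,
                score_threshold=0.5, top_k=400, nms_threshold=0.45):
    """Confidence filtering, clipping, top-k selection and greedy NMS."""
    boxes = [clip_box(b, width, height) for b in boxes]
    order = [i for i, s in enumerate(scores) if s >= score_threshold]
    order.sort(key=lambda i: scores[i], reverse=True)
    order = order[:top_k]
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a higher-scoring kept box
        if all(iou(boxes[i], boxes[j]) <= nms_threshold for j in keep):
            keep.append(i)
    return [(boxes[i], scores[i]) for i in keep]


boxes = [(-2.0, -2.0, 10.0, 10.0), (1.0, 1.0, 11.0, 11.0),
         (20.0, 20.0, 30.0, 30.0), (0.0, 0.0, 10.0, 10.0)]
scores = [0.9, 0.8, 0.7, 0.3]
detections = postprocess(boxes, scores, width=300, height=300)
print(detections)
```

In this toy input the last box is dropped by the confidence threshold and the second box is suppressed by NMS because it overlaps the highest-scoring box.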
Finally, a comparison of the results with Faster R-CNN: