[Object Detection] KL Loss: Bounding Box Regression with Uncertainty for Accurate Object Detection
2022-07-29 06:04:00 【Dull cat】

Paper: Bounding Box Regression with Uncertainty for Accurate Object Detection
Code: https://github.com/yihui-he/KL-Loss
Venue: CVPR 2019
Key points:
- Changes the regression targets: the network learns the four box boundaries (x1, y1, x2, y2) directly, whereas previous methods learn (x1, y1, w, h).
- Proposes KL Loss to replace smooth L1 loss.
- Adds variance voting after Soft-NMS to refine both box coordinates and scores.
Results:
- On the MS-COCO dataset, VGG-16 Faster R-CNN improves AP from 23.6% to 29.1%.
- ResNet-50-FPN Mask R-CNN improves AP by 1.8%.
1. Background
In large detection datasets (such as COCO), the annotated boxes are mostly clear and accurate, but some are ambiguous, which makes labeling difficult.
As shown in Figures 1a and 1c, when the target is partially occluded or its boundary is unclear, the ground-truth box is hard to determine.
Bounding box regression generally uses smooth L1 loss, which does not take annotation ambiguity into account. It is also generally assumed that a high classification score implies accurate localization, but cases like Figure 2 show this is not always true.


2. Method
To address ambiguous annotations (box positions that are not well defined), the author proposes KL Loss, which learns two things jointly: box regression and localization uncertainty.
Details:
- To capture the uncertainty of bbox regression, the ground-truth box position is modeled as a Dirac delta (impulse) function, and the predicted box position is modeled as a Gaussian distribution.
- The bbox regression loss is defined as the KL divergence between the predicted distribution and the ground-truth distribution (replacing smooth L1 loss, etc.).
Three advantages of KL Loss:
- It captures the ambiguity in the dataset: the regressor can obtain a smaller loss on ambiguous boxes by predicting a larger variance.
- The learned variance is useful in post-processing: the author proposes variance voting, which, during NMS, weights the positions of neighboring boxes by their predicted variances to vote for the candidate box.
- The learned probability distribution is interpretable: it reflects the uncertainty of the predicted box, which is valuable for downstream tasks such as autonomous driving and robotics.

2.1 Modeling the bbox distribution
Based on Faster R-CNN or Mask R-CNN (Figure 3), the author regresses each bbox boundary separately; a bbox is represented as $(x_1, y_1, x_2, y_2) \in \mathbb{R}^4$.
For convenience, a single coordinate is denoted $x$, since each coordinate is optimized independently.
To estimate localization confidence, the network predicts a probability distribution for each coordinate rather than only a point estimate of the bbox position.
The predicted distribution is simplified to a Gaussian:

$$P_{\Theta}(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - x_e)^2}{2\sigma^2}}$$
where:
- $\Theta$ is the set of learnable network parameters
- $x_e$ is the estimated bbox position
- $\sigma$ is the standard deviation, which measures uncertainty: as $\sigma \to 0$, the network is more confident that the predicted position is accurate.
The ground-truth position can be modeled as the limiting case $\sigma \to 0$ of a Gaussian, i.e. a Dirac delta distribution:

$$P_D(x) = \delta(x - x_g)$$
where:
- $x_g$ is the ground-truth bbox position
The Dirac delta distribution is infinite at 0 and 0 everywhere else, with the property $\int \delta(x)\,dx = 1$.
2.2 KL Loss: computing the loss between the predicted and ground-truth distributions

- Orange: the ground-truth (Dirac delta) distribution
- Gray: an accurate prediction, with small variance and a position close to the ground truth
- Blue: a poor prediction, with large variance and a position far from the ground truth
After modeling the prediction and the ground truth as above, the parameters $\hat{\Theta}$ can be estimated over $N$ samples by minimizing the KL divergence between the predicted and the ground-truth distribution:

$$\hat{\Theta} = \arg\min_{\Theta} \frac{1}{N} \sum D_{KL}\left(P_D(x) \,\|\, P_{\Theta}(x)\right)$$

The KL divergence serves as the regression loss for the box position:

$$L_{reg} = D_{KL}\left(P_D(x) \,\|\, P_{\Theta}(x)\right) = \frac{(x_g - x_e)^2}{2\sigma^2} + \frac{1}{2}\log \sigma^2 + \frac{1}{2}\log 2\pi - H\left(P_D(x)\right)$$

When the network estimates an inaccurate position $x_e$, it is expected to predict a larger variance $\sigma^2$ so that $L_{reg}$ is lower, as shown in Figure 4.
Since the last two terms do not depend on the estimated parameters, the loss is proportional to:

$$L_{reg} \propto \frac{(x_g - x_e)^2}{2\sigma^2} + \frac{1}{2}\log \sigma^2$$
When $\sigma = 1$, KL loss degenerates into the standard Euclidean loss:

$$L_{reg} \propto \frac{(x_g - x_e)^2}{2}$$
The loss is differentiable with respect to the estimated position and the standard deviation:

$$\frac{\partial L_{reg}}{\partial x_e} = \frac{x_e - x_g}{\sigma^2}, \qquad \frac{\partial L_{reg}}{\partial \sigma} = -\frac{(x_g - x_e)^2}{\sigma^3} + \frac{1}{\sigma}$$
Since $\sigma$ appears in the denominator, gradients can explode at the start of training. To avoid this, the network is trained to predict $\alpha = \log(\sigma^2)$ instead of $\sigma$:

$$L_{reg} \propto \frac{e^{-\alpha}}{2}(x_g - x_e)^2 + \frac{1}{2}\alpha$$
When $|x_g - x_e| > 1$, a smooth-L1-style form is used instead:

$$L_{reg} \propto e^{-\alpha}\left(|x_g - x_e| - \frac{1}{2}\right) + \frac{1}{2}\alpha$$
At the start of training, the FC layer that predicts $\alpha$ is initialized with a random Gaussian (mean 0, standard deviation 0.0001), so that KL loss initially behaves similarly to smooth L1 loss.
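To make the final form concrete, here is a minimal sketch (not the authors' released code) of the per-coordinate loss with $\alpha = \log \sigma^2$. The quadratic/linear switch at $|x_g - x_e| = 1$ follows the formulas above, and constant terms that do not affect optimization are dropped:

```python
import math

def kl_loss(x_g, x_e, alpha):
    """Per-coordinate KL loss, with alpha = log(sigma^2) predicted by the network.

    For small errors (|x_g - x_e| <= 1) the quadratic form is used;
    for large errors, a smooth-L1-style linear form avoids huge gradients.
    """
    diff = abs(x_g - x_e)
    if diff <= 1.0:
        return 0.5 * math.exp(-alpha) * diff ** 2 + 0.5 * alpha
    return math.exp(-alpha) * (diff - 0.5) + 0.5 * alpha
```

Predicting a larger $\alpha$ (i.e. a larger variance) shrinks the error term but pays the $\alpha/2$ penalty, which is exactly how an ambiguous box can receive a smaller loss.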
2.3 Variance voting: correcting box positions
NMS: boxes whose IoU with the selected box exceeds the threshold are deleted outright.
Non-maximum suppression is used at inference time, or to generate proposals in two-stage methods; it is not used during training, and is generally applied per class.
As the name suggests, the box with the highest score is not suppressed; every other box whose IoU with it exceeds the threshold is suppressed (its score is set to 0), and boxes below the threshold are kept.

Problems:
- If the threshold is too small, many boxes are suppressed, which easily causes missed detections (especially for boxes close to the highest-scoring box).
- If the threshold is too large, few boxes are suppressed, which easily causes false positives.
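For reference, hard NMS as described above can be sketched as follows (an illustrative implementation, not from the paper; boxes are `(x1, y1, x2, y2)` tuples):

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Classic hard NMS: keep the highest-scoring box, drop boxes overlapping it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # suppress every remaining box whose IoU with the kept box is too high
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```

Lowering `iou_thresh` suppresses more neighbors (risking missed detections); raising it keeps more (risking false positives), which is exactly the trade-off listed above.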
Soft-NMS: boxes above the IoU threshold have their scores decayed; boxes below the threshold keep their scores.
It was proposed to address the problems of NMS, especially in crowded scenes.
NMS sets a suppressed box's score to 0, while Soft-NMS assumes that the closer a box is to the candidate box, the more likely it is to be a false positive, so its score should decay more. Decay modes:
Linear: multiply the score by $1 - IoU$:

$$s_i = \begin{cases} s_i, & IoU(M, b_i) < N_t \\ s_i\left(1 - IoU(M, b_i)\right), & IoU(M, b_i) \ge N_t \end{cases}$$
When the overlap between a neighboring box and the candidate box exceeds the threshold $N_t$, the score decays linearly. However, this function is discontinuous at the threshold, so scores change abruptly. What is needed is a continuous function: boxes with no overlap are not decayed, highly overlapping boxes are decayed heavily, high IoU receives a high penalty, low IoU a low penalty, with a gradual transition in between. Hence the second, Gaussian penalty function:

$$s_i = s_i\, e^{-\frac{IoU(M,\, b_i)^2}{\sigma}}$$
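A minimal sketch of Soft-NMS with the Gaussian penalty (illustrative only; the `sigma` and `score_thresh` defaults are arbitrary example values, not the paper's settings):

```python
import math

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def soft_nms_gaussian(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Soft-NMS: instead of deleting overlapping boxes, decay their scores
    with the Gaussian penalty s_i *= exp(-IoU^2 / sigma)."""
    scores = list(scores)
    keep = []
    idx = list(range(len(boxes)))
    while idx:
        # pick the remaining box with the highest (possibly decayed) score
        best = max(idx, key=lambda i: scores[i])
        keep.append(best)
        idx.remove(best)
        # decay every remaining box by its overlap with the kept box
        for i in idx:
            scores[i] *= math.exp(-iou(boxes[best], boxes[i]) ** 2 / sigma)
        # drop boxes whose score has decayed below the threshold
        idx = [i for i in idx if scores[i] >= score_thresh]
    return keep, scores
```

Note that an overlapping box is rescored rather than removed, so in crowded scenes a genuinely distinct object next to the kept box can survive with a reduced score.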
Var voting: score correction (Soft-NMS) + position correction (boxes that are close to the candidate box and have lower uncertainty receive higher weights)
After the position variances are predicted, neighboring bboxes vote on the position according to these variances.
For each box, the network predicts:
$(x_1, y_1, x_2, y_2, s, \sigma_{x_1}, \sigma_{y_1}, \sigma_{x_2}, \sigma_{y_2})$
Variance voting is a modification of the Soft-NMS pipeline: after Soft-NMS, the coordinates of the resulting box $b_m$ are corrected using the variances learned by the network.
The new coordinates are computed as follows, where $x_i$ is a coordinate of the $i$-th box:
First, select the box $b$ with the highest classification score.
Then, for every box $b_i$ that overlaps $b$ (i.e. $IoU(b_i, b) > 0$), compute a weight:

$$p_i = e^{-\frac{(1 - IoU(b_i, b))^2}{\sigma_t}}$$

The larger $IoU(b_i, b)$ is, the larger $p_i$ is; that is, boxes closer to $b$ receive larger weights.
Finally, update each of the four coordinates of $b$ independently according to the weights and the predicted variances:

$$x = \frac{\sum_i p_i\, x_i / \sigma_{x,i}^2}{\sum_i p_i / \sigma_{x,i}^2}, \qquad IoU(b_i, b) > 0$$

In this way, a box that is close to the true position but has a low classification score can still receive a high weight in the vote.
where:
- $\sigma_t$ is a tunable parameter
Two kinds of neighboring boxes receive lower weights:
- boxes with high variance
- boxes with small IoU with the candidate box
Classification scores do not participate in the voting, since a box with a low classification score may still have high localization confidence.
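The voting steps above can be sketched per coordinate as follows (illustrative; `sigma_t = 0.05` is an arbitrary example value for the tunable parameter, and `variances` stands for the per-coordinate $\sigma^2$ values predicted by the network):

```python
import math

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def variance_vote(b, neighbors, variances, sigma_t=0.05):
    """Update the coordinates of the selected box `b` by a variance-weighted vote.

    b          -- (x1, y1, x2, y2) of the selected (highest-score) box
    neighbors  -- list of (x1, y1, x2, y2) boxes, including b itself
    variances  -- per-box tuples (var_x1, var_y1, var_x2, var_y2)
    """
    new_coords = []
    for k in range(4):  # each of the four coordinates is voted on independently
        num, den = 0.0, 0.0
        for box, var in zip(neighbors, variances):
            overlap = iou(b, box)
            if overlap <= 0.0:
                continue  # only boxes overlapping b vote
            # closer boxes (higher IoU) get exponentially larger weights
            p = math.exp(-((1.0 - overlap) ** 2) / sigma_t)
            # each vote is additionally weighted by 1/variance (confidence)
            num += p * box[k] / var[k]
            den += p / var[k]
        new_coords.append(num / den)
    return tuple(new_coords)
```

Classification scores never enter the computation: only overlap and predicted variance decide how much a neighbor pulls the final coordinates.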

3. Results




