Focal Loss Explanation
2022-07-28 01:12:00 【@BangBang】
1. Overview
Paper: Focal Loss for Dense Object Detection
There is a lot of controversy about Focal Loss online: some people think it is useful, others think it is not.
In YOLOv3, the author tried Focal Loss but found that mAP dropped by about 2 points after using it, which also left him puzzled.

The original paper gives a set of ablations over the Focal Loss parameters. In the first row, $\gamma = 0$ (i.e., Focal Loss is not used) gives an AP of 31.1, while with Focal Loss the AP reaches 34.0, an improvement of about 3 points, which is quite significant.
The paper's author notes that Focal Loss is mainly aimed at one-stage object detectors, which face the class imbalance problem, i.e., the imbalance between positive and negative samples. In an image, the candidate boxes that match a ground-truth object (positive samples) usually number only a dozen or a few dozen, while the unmatched candidate boxes (negative samples) number roughly $10^4$ to $10^5$.
As the figure above shows, the red boxes do not match any object, while the yellow box does. So when positive and negative samples are matched, most candidate boxes are unmatched, i.e., negative samples; positive samples are actually very few.
A natural question arises here: why do we never hear about the class imbalance problem for two-stage networks?
- My personal understanding: a two-stage detector works in two steps, and class imbalance certainly also exists in the first step. However, the final result (the target's final coordinates and whether it is a target at all) is determined by the second-stage detector. The first stage, e.g. the RPN in Faster R-CNN, ultimately passes only about 2000 proposals to the second-stage network. Besides keeping relatively high-quality boxes, this also raises the probability that a proposal is a positive sample, whereas a one-stage detector faces tens or even hundreds of thousands of candidates. The second stage of a two-stage detector still suffers from positive/negative imbalance, but far less than a one-stage detector, which is why the paper proposes Focal Loss mainly for one-stage networks.
- Among the $10^4$ to $10^5$ unmatched candidate boxes of a one-stage detector, most are easy negative samples (they contribute little to training, but in such numbers they drown out the small number of samples that are actually useful for training). So training directly on all samples gives poor results.
- Earlier one-stage networks also screened positive and negative samples, namely hard negative mining: instead of using all negative samples to train the network, only the negatives that contribute the most to the loss are selected, which indeed achieves better results.
In the table above, the author also makes a series of comparisons. The first rows of the table use the hard negative mining method to select positive and negative samples. But if we use Focal Loss directly, we find that it still works very well: relative to hard negative mining, AP improves by about 3 points. Why Focal Loss works better is explained in the theory below.
2. Focal Loss
The paper states that Focal Loss is designed to address the extreme imbalance between foreground and background samples in one-stage object detection (for example, 1:1000). For the binary cross-entropy loss, the formula is:

$CE(p, y) = -\log(p)$ if $y = 1$, and $-\log(1-p)$ otherwise.

To simplify, we define $p_t$:

$p_t = p$ if $y = 1$, and $p_t = 1 - p$ otherwise.

Then:

$CE(p, y) = CE(p_t) = -\log(p_t)$
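As a quick sanity check of the $p_t$ notation, here is a minimal PyTorch sketch (my own illustration, not code from the paper) showing that the piecewise binary cross-entropy and the $-\log(p_t)$ form give the same values:

```python
import torch

# Toy predictions and labels, chosen only for illustration
p = torch.tensor([0.9, 0.3, 0.7, 0.1])   # predicted probability of the positive class
y = torch.tensor([1.0, 1.0, 0.0, 0.0])   # ground-truth labels

# Piecewise binary cross-entropy: -log(p) for positives, -log(1-p) for negatives
ce_piecewise = -(y * torch.log(p) + (1 - y) * torch.log(1 - p))

# The same loss written through p_t
p_t = torch.where(y == 1, p, 1 - p)
ce_pt = -torch.log(p_t)

print(torch.allclose(ce_piecewise, ce_pt))  # True
```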
Balanced cross-entropy
A common way to deal with sample imbalance is to introduce a weighting factor $\alpha \in [0, 1]$: for a positive sample ($y = 1$) the weight is $\alpha$, and for a negative sample it is $1 - \alpha$. Writing this weight as $\alpha_t$ (by analogy with $p_t$), the $\alpha$-weighted loss is:

$CE(p_t) = -\alpha_t \log(p_t)$
In the figure, the author runs an experiment and finds that $\alpha = 0.75$ works best. This shows that $\alpha$ is not simply the ratio of positive to negative samples, because that ratio is tiny (perhaps 1:1000), not 0.75.
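Continuing the toy example above, a minimal sketch of the $\alpha$-balanced cross-entropy (the value 0.75 is simply taken from the experiment mentioned here):

```python
import torch

alpha = 0.75  # weight for positive samples; negatives get 1 - alpha

p = torch.tensor([0.9, 0.3, 0.7, 0.1])
y = torch.tensor([1.0, 1.0, 0.0, 0.0])

p_t = torch.where(y == 1, p, 1 - p)
alpha_t = torch.where(y == 1, torch.full_like(p, alpha), torch.full_like(p, 1 - alpha))

balanced_ce = -alpha_t * torch.log(p_t)  # alpha-balanced cross-entropy per sample
print(balanced_ce)
```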
Focal Loss Definition
$\alpha$ balances the weight of positive and negative samples, but it does not distinguish easy samples from hard samples. So the author improves the loss function so that it down-weights easy samples and lets training focus on hard negatives (negative samples that are difficult to classify). The author introduces a new modulating factor $(1 - p_t)^\gamma$, and Focal Loss is defined as:

$FL(p_t) = -(1 - p_t)^\gamma \log(p_t)$

The factor $(1 - p_t)^\gamma$ reduces the loss contribution of easy samples.
When $\gamma = 0$, this reduces to the original $CE(p_t) = -\log(p_t)$, corresponding to the blue curve in the figure.
The horizontal axis in the figure is $p_t$. For a positive sample, $p_t = p$, and we want $p$ to be as large as possible: the higher the probability, the more accurate the prediction. For a negative sample, we want $p$ to be as small as possible, i.e., $1 - p$ as large as possible. So for both positive and negative samples, we want $p_t$ to be as large as possible.
When $p_t$ lies in the interval $[0.6, 1]$, the sample is already well classified; such easy samples do not need much weight. From the figure we can see that for $\gamma > 0$ (e.g. $\gamma = 1, 2, 5$), the loss drops faster and faster as $p_t$ grows, and the larger $\gamma$ is, the smaller the weight of easy samples.
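To make the down-weighting concrete, here is a small numeric sketch (the $p_t$ values are made up for illustration): with $\gamma = 2$, a well-classified sample with $p_t = 0.9$ gets the factor $(1 - 0.9)^2 = 0.01$, so its loss shrinks 100-fold, while a hard sample with $p_t = 0.1$ gets the factor $0.81$ and is almost unaffected:

```python
import math

gamma = 2.0
for p_t in (0.1, 0.5, 0.9, 0.99):
    factor = (1 - p_t) ** gamma   # modulating factor (1 - p_t)^gamma
    ce = -math.log(p_t)           # plain cross-entropy -log(p_t)
    fl = factor * ce              # focal loss (alpha omitted)
    print(f"p_t={p_t:.2f}  factor={factor:.4f}  CE={ce:.3f}  FL={fl:.5f}")
```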
In practice, Focal Loss also uses $\alpha$ to balance the samples. The final Focal Loss is:

$FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t)$

Expanding $p_t$, Focal Loss takes the following form:

$FL(p) = -\alpha (1 - p)^\gamma \log(p)$ if $y = 1$, and $-(1 - \alpha)\, p^\gamma \log(1 - p)$ otherwise.

With Focal Loss, training focuses more on the hard samples; for easy samples, Focal Loss reduces their loss weight.
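Putting the pieces together, here is a minimal sketch of a binary focal loss in PyTorch, written directly from the formulas above (not the paper's reference code); the function name and the defaults `gamma=2.0`, `alpha=0.25` are illustrative choices:

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, gamma=2.0, alpha=0.25, reduction="mean"):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    logits:  raw scores of shape (N,), before the sigmoid
    targets: ground-truth labels of shape (N,), values 0.0 or 1.0
    """
    p = torch.sigmoid(logits)
    # -log(p_t), computed in a numerically stable way from the logits
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)              # p for positives, 1 - p for negatives
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # alpha for positives, 1 - alpha for negatives
    loss = alpha_t * (1 - p_t) ** gamma * ce                 # down-weight the easy samples
    if reduction == "mean":
        return loss.mean()
    if reduction == "sum":
        return loss.sum()
    return loss

# Tiny usage example with made-up logits and labels
logits = torch.tensor([2.0, -1.0, 0.5, -3.0])
targets = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(binary_focal_loss(logits, targets))
```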
When using Focal Loss, try to make sure the training set is labeled correctly. If a sample is mislabeled, it will inevitably look like a hard sample, and Focal Loss will keep trying to fit these wrongly labeled samples, so the model gets worse and worse.