
Learning CV from Scratch, Part 2: Loss Functions (3)

2022-07-08 02:19:00 pogg_

Note: Most of the content of this post is not original; it is a reorganization of material I collected earlier, combined with my own naive explanations, for ease of review. All sources are listed in the references. Likes and bookmarks are appreciated~

Preface: The previous part covered the loss functions commonly used in image classification and object detection. This article continues the series, focusing on the loss functions used in face recognition.

Face recognition is the most mature direction for deploying CV in practice, and the loss function is crucial to face models; popular face recognition frameworks such as facenet and insightface spend a lot of space in their papers introducing their loss functions.

So in this chapter we introduce the loss functions commonly used in face recognition in the order softmax → Triplet Loss → Center Loss → Sphereface → Cosface → Arcface.

1. Face recognition

1.1 Softmax Loss

softmax has already been introduced in detail in the chapter on activation functions; readers who are not familiar with it can look back at the earlier post:

https://zhuanlan.zhihu.com/p/380237014

softmax can act as an activation function, and it can also be used as a loss function (slightly different from the activation-function usage discussed before). In the classic face recognition framework facenet, one of the loss functions is softmax (the others are triplet loss and center loss, covered below). For a concrete application, see the official open-source code:

https://github.com/davidsandberg/facenet/blob/master/src/train_softmax.py
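To make the idea concrete, here is a minimal sketch of softmax + cross entropy used as a classification loss on top of face embeddings. It is written in PyTorch; all names and sizes are illustrative and it is not taken from the facenet repository.

```python
import torch
import torch.nn.functional as F

# Minimal sketch: softmax + cross entropy as a classification loss.
# `embeddings` stands in for the output of a face backbone; the sizes
# and names are illustrative, not taken from the facenet code.
batch_size, emb_dim, num_classes = 4, 128, 10
embeddings = torch.randn(batch_size, emb_dim)           # backbone features
classifier = torch.nn.Linear(emb_dim, num_classes)      # last fully-connected layer
labels = torch.randint(0, num_classes, (batch_size,))   # ground-truth identity ids

logits = classifier(embeddings)                          # raw per-class scores
# F.cross_entropy applies log-softmax + negative log-likelihood in one call
loss = F.cross_entropy(logits, labels)
loss.backward()
```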

1.2 Triplet Loss

A triplet consists of three parts: anchor, positive, and negative:

  • anchor is the reference sample
  • positive is a positive sample with respect to the anchor, i.e. it comes from the same person as the anchor
  • negative is a negative sample with respect to the anchor, i.e. it comes from a different person

[Figure: example of an anchor, positive, and negative sample]

The goal of triplet loss is to make:
  • samples with the same label have embeddings that are as close as possible in the embedding space
  • samples with different labels have embeddings that are as far apart as possible

As shown in the figure below, the anchor and the positive belong to the same id, i.e. $y_{anchor} = y_{positive}$, while the anchor and the negative belong to different ids, i.e. $y_{anchor} \ne y_{negative}$. Through training, the Euclidean distance between anchor and positive becomes smaller, and the distance between anchor and negative becomes larger. Here anchor, positive, and negative are all $d$-dimensional embedding vectors of the images (which we call embeddings).
[Figure: after learning, the anchor moves closer to the positive and farther from the negative]
Expressed mathematically, what triplet loss wants to achieve is:
$$d\left(x_{i}^{a}, x_{i}^{p}\right) + \alpha < d\left(x_{i}^{a}, x_{i}^{n}\right)$$
where $d(\cdot,\cdot)$ denotes the Euclidean distance between two vectors and $\alpha$ is the margin between the two distances, which prevents the trivial solution $d\left(x_{i}^{a}, x_{i}^{p}\right)=d\left(x_{i}^{a}, x_{i}^{n}\right)=0$. This goal can therefore be achieved by minimizing the triplet loss function:
$$L = \sum_{i}^{N}\left[d\left(x_{i}^{a}, x_{i}^{p}\right) - d\left(x_{i}^{a}, x_{i}^{n}\right) + \alpha\right]_{+}$$
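A minimal PyTorch sketch of this formula, assuming Euclidean distance and an illustrative margin value (PyTorch also ships torch.nn.TripletMarginLoss, which implements the same idea):

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """L = sum_i [ d(a_i, p_i) - d(a_i, n_i) + margin ]_+  with Euclidean d.

    anchor, positive, negative: (batch, d) embedding tensors.
    The margin value here is illustrative, not a recommended setting.
    """
    d_ap = F.pairwise_distance(anchor, positive)   # d(a, p) for each triplet
    d_an = F.pairwise_distance(anchor, negative)   # d(a, n) for each triplet
    return torch.clamp(d_ap - d_an + margin, min=0).sum()
```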
To summarize:

  • The ultimate optimization goal of triplet loss is to shrink the $(a, p)$ distance and enlarge the $(a, n)$ distance
  • easy triplets: $L = 0$, i.e. $d(a, p) + \text{margin} < d(a, n)$; there is nothing to optimize in this case, since $a, p$ are already close and $a, n$ are far apart
  • hard triplets: $d(a, n) < d(a, p)$, i.e. the negative is even closer to the anchor than the positive is
  • semi-hard triplets: $d(a, p) < d(a, n) < d(a, p) + \text{margin}$, i.e. the negative is farther from the anchor than the positive, but still within the margin, so the loss is still positive (the three cases are illustrated in the small sketch after this list)
  • FaceNet randomly selects semi-hard triplets for training (hard triplets can also be selected, or both can be trained together)
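As a concrete illustration of the three cases above, here is a tiny helper (hypothetical, not part of FaceNet) that classifies a single triplet from its two distances:

```python
def classify_triplet(d_ap: float, d_an: float, margin: float = 0.2) -> str:
    """Classify a triplet by comparing d(a, p) and d(a, n) against the margin."""
    if d_ap + margin < d_an:
        return "easy"       # loss is already 0, nothing to learn
    if d_an < d_ap:
        return "hard"       # the negative is closer to the anchor than the positive
    return "semi-hard"      # d(a, p) <= d(a, n) <= d(a, p) + margin, loss still positive

# e.g. classify_triplet(0.5, 0.6) -> "semi-hard" with the default margin of 0.2
```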

How to mine triplets for training:

  • offline mining

At the beginning of each training epoch, compute the embeddings of the whole training set, pick out all hard triplets and semi-hard triplets, and train on these triplets during that epoch.

This method is not very efficient, because every epoch we need to traverse the entire dataset to produce triplets.

  • online mining

The idea here is to dynamically compute useful triplets from the input of each batch. Given a batch of $B$ samples ($B$ has to be a multiple of 3), we compute the corresponding $B$ embeddings, from which we can form at most $B^3$ triplets. Of course, many of these triplets are not valid (a valid triplet needs two samples with the same label and one with a different label).
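Below is a rough sketch of online ("batch-all") mining in PyTorch: given the embeddings and identity labels of one batch, it enumerates every (anchor, positive, negative) combination, masks out the invalid ones, and averages the loss over the triplets that are still active. This only illustrates the idea; it is not FaceNet's implementation, and the semi-hard selection step is omitted.

```python
import torch

def online_triplet_loss(embeddings, labels, margin=0.2):
    """Batch-all online mining sketch.

    embeddings: (B, d) tensor of the batch's embeddings.
    labels:     (B,) tensor of identity ids.
    """
    dist = torch.cdist(embeddings, embeddings)             # (B, B) pairwise Euclidean distances
    B = dist.size(0)

    # loss[a, p, n] = d(a, p) - d(a, n) + margin
    loss = dist.unsqueeze(2) - dist.unsqueeze(1) + margin   # (B, B, B)

    same = labels.unsqueeze(0) == labels.unsqueeze(1)       # (B, B): same identity?
    not_self = ~torch.eye(B, dtype=torch.bool, device=labels.device)
    # valid triplet: label[a] == label[p], a != p, label[a] != label[n]
    valid = (same & not_self).unsqueeze(2) & (~same).unsqueeze(1)

    loss = torch.clamp(loss * valid.float(), min=0)
    num_active = (loss > 1e-12).sum().clamp(min=1)          # triplets with a positive loss
    return loss.sum() / num_active
```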

1.3 Center Loss

The Center Loss function comes from a paper published at ECCV 2016; paper link:

http://ydwen.github.io/papers/WenECCV16.pdf

To improve the discriminative power of features, the authors propose the center loss function, which not only narrows intra-class differences but also enlarges inter-class differences.

The authors start with the MNIST dataset, change the output dimension of the last hidden layer to 2, use softmax + cross entropy as the loss function, and visualize the result, as shown in the figure below. It can be seen that cross entropy can separate the classes and the feature distribution is radial, but it is not discriminative enough, i.e. there is large intra-class variation.

[Figure: 2-D feature distributions on MNIST trained with softmax + cross entropy]
The left plot is the 50K training set and the right plot is the 10K test set, which also indirectly shows that a large enough amount of data makes the algorithm more robust.

Therefore, the authors want to keep the data separable while further narrowing the intra-class differences. To achieve this goal, they propose the Center Loss function:
$$L_{C} = \frac{1}{2} \sum_{i=1}^{m} \left\| x_{i} - c_{y_{i}} \right\|_{2}^{2}$$
Center loss works by first generating a vector for every class, then computing the Euclidean distance between these (initially random) vectors and the features of the corresponding class; the resulting Euclidean distance is taken as the center loss, and the initially random vectors are adjusted automatically through back propagation.

where $c_{y_i}$ denotes the center of class $y_i$. Center Loss is therefore usually combined with cross entropy to form a joint loss function:
$$L = L_{S} + \lambda L_{C} = -\sum_{i=1}^{m} \log \frac{e^{W_{y_{i}}^{T} x_{i} + b_{y_{i}}}}{\sum_{j=1}^{n} e^{W_{j}^{T} x_{i} + b_{j}}} + \frac{\lambda}{2} \sum_{i=1}^{m} \left\| x_{i} - c_{y_{i}} \right\|_{2}^{2}$$
where $\lambda$ controls the strength of the center loss penalty. Again on MNIST, the results are shown in the figure below. You can see that as $\lambda$ increases, the constraint becomes stronger and each class clusters more tightly around its center.
[Figure: MNIST feature distributions under the joint loss for different values of $\lambda$]
Using the Center Loss function requires two hyperparameters: $\alpha$ and $\lambda$. $\lambda$ controls the strength of the center loss penalty, while $\alpha$ controls the learning rate of the class centers $c_{y_i}$. The class centers $c_{y_i}$ should change as the features change, and they are usually updated in every mini-batch.
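Finally, a minimal sketch of a center loss module in PyTorch, assuming the centers live in a buffer and are moved toward the current batch's features with learning rate $\alpha$ after each forward pass. The update rule is a simplified version of the paper's (it averages over the samples of each class in the batch), and all names and defaults are illustrative.

```python
import torch
import torch.nn as nn

class CenterLoss(nn.Module):
    """Sketch of center loss: L_C = 1/2 * sum_i ||x_i - c_{y_i}||^2 (averaged over the batch here)."""

    def __init__(self, num_classes, feat_dim, alpha=0.5):
        super().__init__()
        self.alpha = alpha                                   # learning rate of the class centers
        self.register_buffer("centers", torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):
        centers_batch = self.centers[labels]                 # c_{y_i} for each sample
        loss = 0.5 * ((features - centers_batch) ** 2).sum(dim=1).mean()

        # mini-batch update of the centers that appear in this batch
        with torch.no_grad():
            diff = centers_batch - features                  # c_{y_i} - x_i
            for j in labels.unique().tolist():
                mask = labels == j
                self.centers[j] -= self.alpha * diff[mask].mean(dim=0)
        return loss

# joint objective, with lam weighting the center loss term:
#   total_loss = F.cross_entropy(logits, labels) + lam * center_loss(features, labels)
```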

References

[1] https://www.cnblogs.com/dengshunge/p/12252820.html
[2] https://zhuanlan.zhihu.com/p/295512971
[3] https://blog.csdn.net/u013082989/article/details/83537370

Copyright notice: this article was written by [pogg_]; please include a link to the original when reposting. Thanks.
Original post: https://yzsam.com/2022/02/202202130540225760.html