
[Reading papers] Deep Learning Face Representation by Joint Identification-Verification (DeepID2)


Deep Learning Face Representation by Joint Identification-Verification

  • The key challenge of face recognition is to develop effective feature representations that reduce intra-personal variations while enlarging inter-personal differences.
  • The face identification task increases inter-personal differences by pulling apart DeepID2 features extracted from different identities, while the face verification task reduces intra-personal variations by pulling together DeepID2 features extracted from the same identity.
  • On the challenging LFW dataset, face verification accuracy reaches 99.15%.

Face verification

  • Faces of the same identity can look very different under varying pose, illumination, expression, age, and occlusion. Enlarging inter-personal differences while reducing intra-personal variations is a long-standing theme of face recognition research.

  • Traditional face recognition approaches: LDA, Bayesian face, and unified subspace methods.

    • LDA approximates inter- and intra-personal face variations with two linear subspaces and finds the projection directions that maximize the ratio between them.

      • LDA is a supervised dimensionality-reduction technique: every sample in its dataset carries a class label. This differs from PCA, which is unsupervised and ignores class labels.

      • LDA can be summarized in one sentence: after projection, minimize the within-class variance and maximize the between-class variance. When projecting the data to a lower dimension, we want the projected points of each class to be as close together as possible, and the distance between the class centers of different classes to be as large as possible.

      • Suppose there are two classes of data, red and blue, with two-dimensional features. We project the data onto a line (one dimension) so that the projected points of each class are as close together as possible, while the distance between the red and blue class centers is as large as possible.

        • [Figure: two classes of 2-D points projected onto a 1-D line]
  • Mathematical prerequisites

    • Assumption: the data follow Gaussian distributions (this is the theoretical basis).
    • Rayleigh quotient and generalized Rayleigh quotient
      • The Rayleigh quotient is the function $R(A,x)=\frac{x^HAx}{x^Hx}$, where $x$ is a nonzero vector and $A$ is an $n\times n$ Hermitian matrix, i.e. a matrix equal to its own conjugate transpose, $A^H=A$. If $A$ is real, any matrix satisfying $A^T=A$ is Hermitian.
      • The Rayleigh quotient has an important property: its maximum equals the largest eigenvalue of $A$ and its minimum equals the smallest eigenvalue of $A$, i.e. $\lambda_{min}\le\frac{x^HAx}{x^Hx}\le\lambda_{max}$. When $x$ is a unit vector, so that $x^Hx=1$, the Rayleigh quotient degenerates to $R(A,x)=x^HAx$, a form that also appears in PCA.
      • The generalized Rayleigh quotient is the function $R(A,B,x)=\frac{x^HAx}{x^HBx}$, where $x$ is a nonzero vector and $A$, $B$ are $n\times n$ Hermitian matrices with $B$ positive definite.
        • It can be reduced to a standard Rayleigh quotient by a change of variables. Let $x=B^{-\frac{1}{2}}x'$; then the denominator becomes $x^HBx=x'^H(B^{-\frac{1}{2}})^HBB^{-\frac{1}{2}}x'=x'^HB^{-\frac{1}{2}}BB^{-\frac{1}{2}}x'=x'^Hx'$, and the numerator becomes $x^HAx=x'^HB^{-\frac{1}{2}}AB^{-\frac{1}{2}}x'$.
        • So $R(A,B,x)$ becomes $R(A,B,x')=\frac{x'^HB^{-\frac{1}{2}}AB^{-\frac{1}{2}}x'}{x'^Hx'}$.
        • By the properties of the Rayleigh quotient, the maximum of $R(A,B,x')$ is the largest eigenvalue of $B^{-\frac{1}{2}}AB^{-\frac{1}{2}}$, equivalently of $B^{-1}A$, and the minimum is the smallest eigenvalue of $B^{-1}A$ (a quick numerical check follows this list).
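A quick numerical check of the eigenvalue bounds above (my own illustration, not from the original post):

```python
import numpy as np

# For a real symmetric (Hermitian) A, the Rayleigh quotient is bounded by the
# smallest and largest eigenvalues, and the maximum is attained at the
# eigenvector of the largest eigenvalue.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2                       # real symmetric matrix

def rayleigh(A, x):
    return (x @ A @ x) / (x @ x)

eigvals, eigvecs = np.linalg.eigh(A)    # eigenvalues in ascending order
print(rayleigh(A, eigvecs[:, -1]), eigvals[-1])   # equal up to float error

x = rng.standard_normal(5)
assert eigvals[0] - 1e-9 <= rayleigh(A, x) <= eigvals[-1] + 1e-9
```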
  • LDA Algorithm flow

    • Input: dataset $D=\{(x_1,y_1),(x_2,y_2),...,(x_m,y_m)\}$, where each sample $x_i$ is an $n$-dimensional vector and $y_i\in\{C_1,C_2,...,C_k\}$; target dimension $d$.

    • Output: the dimension-reduced sample set $D'$.

      • Compute the within-class scatter matrix $S_w$
      • Compute the between-class scatter matrix $S_b$
      • Compute the matrix $S_w^{-1}S_b$
      • Compute the $d$ largest eigenvalues of $S_w^{-1}S_b$ and the corresponding $d$ eigenvectors $(w_1,w_2,...,w_d)$ to obtain the projection matrix $W$
      • For each sample feature $x_i$, compute the new sample $z_i=W^Tx_i$
      • Obtain the output sample set $D'=\{(z_1,y_1),(z_2,y_2),...,(z_m,y_m)\}$ (a minimal numpy sketch of these steps follows this list)
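A minimal numpy sketch of the LDA steps above (my own illustration; `lda_fit` is a hypothetical helper name, not from the post):

```python
import numpy as np

def lda_fit(X, y, d):
    """Multi-class LDA following the steps above.
    X: (m, n) samples, y: (m,) integer labels, d: target dimension."""
    overall_mean = X.mean(axis=0)
    n = X.shape[1]
    Sw = np.zeros((n, n))               # within-class scatter
    Sb = np.zeros((n, n))               # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)
        diff = (mu_c - overall_mean)[:, None]
        Sb += len(Xc) * diff @ diff.T
    # Eigen-decomposition of Sw^{-1} Sb; keep the d leading eigenvectors.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1]
    W = eigvecs[:, order[:d]].real      # projection matrix, shape (n, d)
    return W

# Usage: Z = X @ lda_fit(X, y, d) gives the reduced samples z_i = W^T x_i.
```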
  • Besides dimensionality reduction, LDA can also be used for classification. A common approach assumes that each class's samples follow a Gaussian distribution; after the LDA projection, the mean and variance of each class's projected data are estimated by maximum likelihood, which gives each class's Gaussian probability density. When a new sample arrives, it is projected, its projected features are plugged into each class's Gaussian density, and the class with the largest resulting probability is the prediction.

  • Two-class LDA

    • Suppose our dataset is $D=\{(x_1,y_1),(x_2,y_2),...,(x_m,y_m)\}$, where each sample $x_i$ is an $n$-dimensional vector and $y_i\in\{0,1\}$. Define $N_j\,(j=0,1)$ as the number of samples in class $j$, $X_j\,(j=0,1)$ as the set of samples in class $j$, $\mu_j\,(j=0,1)$ as the mean vector of class $j$, and $\Sigma_j\,(j=0,1)$ as the covariance matrix of class $j$ (strictly speaking, the covariance matrix without the normalizing denominator).
    • The expression for $\mu_j$ is $\mu_j=\frac{1}{N_j}\sum_{x\in X_j}x,\ (j=0,1)$.
    • The expression for $\Sigma_j$ is $\Sigma_j=\sum_{x\in X_j}(x-\mu_j)(x-\mu_j)^T,\ (j=0,1)$.
    • Since there are only two classes, we just need to project the data onto a line. Suppose the projection line is the vector $w$; for any sample $x_i$, its projection onto $w$ is $w^Tx_i$, and the projections of the two class centers $\mu_0,\mu_1$ are $w^T\mu_0$ and $w^T\mu_1$. LDA requires the distance between the projected class centers of different classes to be as large as possible, i.e. to maximize $||w^T\mu_0-w^T\mu_1||_2^2$, while keeping the projected points of each class as close as possible, i.e. making the projected within-class covariances $w^T\Sigma_0w$ and $w^T\Sigma_1w$ as small as possible. The optimization objective is therefore $\underbrace{\arg\max}_{w}J(w)=\frac{||w^T\mu_0-w^T\mu_1||_2^2}{w^T\Sigma_0w+w^T\Sigma_1w}=\frac{w^T(\mu_0-\mu_1)(\mu_0-\mu_1)^Tw}{w^T(\Sigma_0+\Sigma_1)w}$.
    • Define the within-class scatter matrix $S_w=\Sigma_0+\Sigma_1=\sum_{x\in X_0}(x-\mu_0)(x-\mu_0)^T+\sum_{x\in X_1}(x-\mu_1)(x-\mu_1)^T$.
    • Define the between-class scatter matrix $S_b=(\mu_0-\mu_1)(\mu_0-\mu_1)^T$.
    • The objective is rewritten as $\underbrace{\arg\max}_{w}J(w)=\frac{w^TS_bw}{w^TS_ww}$ (its closed-form solution is derived right below).
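Since $J(w)$ is a generalized Rayleigh quotient, the optimal direction can be written down in one step (a small completion added here, using the eigenvalue property above):

```latex
\max_w \frac{w^TS_bw}{w^TS_ww}
\;\Longrightarrow\;
S_w^{-1}S_b\,w=\lambda w,
\qquad
S_bw=(\mu_0-\mu_1)\underbrace{(\mu_0-\mu_1)^Tw}_{\text{a scalar}}
\;\Longrightarrow\;
w\propto S_w^{-1}(\mu_0-\mu_1).
```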
  • Metric learning maps faces to a feature representation in which faces of the same identity are close to each other while faces of different identities stay apart.

    • In mathematics, a metric (or distance function) is a function that defines the distance between elements of a set; a set equipped with a metric is called a metric space. Metric learning is also called similarity learning.
    • Distance metric learning aims to measure the similarity between samples, which is one of the core problems of pattern recognition. The performance of many machine learning methods, such as k-nearest neighbors, support vector machines, radial basis function networks, k-means clustering, and various graph-based methods, is largely determined by the choice of the similarity measure between samples.
    • If the goal is to recognize faces, we need a distance function that emphasizes the appropriate features (hair color, face shape, etc.); if the goal is to recognize pose, we need a distance function that captures pose similarity.
    • To handle different notions of feature similarity, one could hand-pick features and hand-craft a distance function for each specific task, but this requires a great deal of manual effort and is not robust to changes in the data.
    • Metric learning is an appealing alternative: a metric distance function can be learned specifically for each task.
    • The common objective of metric learning is to make the distance between same-class samples as small as possible and the distance between samples of different classes as large as possible.
    • [Figure: metric learning pulls same-class samples together and pushes different-class samples apart]
  • These models are largely limited by their linear nature or shallow structures, whereas inter- and intra-personal variations are complex, highly nonlinear, and observed in a high-dimensional image space. (This motivates combining face recognition with deep learning.)

  • Both supervisory signals (the face identification signal and the face verification signal) are used simultaneously to learn these features.

  • Identification classifies an input image into one of a large number of identity classes; verification classifies a pair of images as belonging to the same identity or not (binary classification).

  • To characterize a face from different aspects, complementary DeepID2 features are extracted from different face regions and resolutions, concatenated, and reduced by PCA to form the final feature representation.

  • The idea of jointly solving classification and verification has been applied to general object recognition, where the focus is improving classification accuracy on a fixed set of object classes rather than the learned hidden feature representation.

Identification-verification guided deep feature learning

  • The convolution and pooling operations in deep ConvNets are designed to extract visual features hierarchically, from local low-level features to global high-level features.

  • [Figure: the DeepID2 ConvNet architecture]

  • The network contains four convolutional layers, the first three each followed by max-pooling.

  • To learn a diverse set of high-level features, weights in the higher convolutional layers are not shared across the entire feature map. Specifically, in the third convolutional layer weights are locally shared within every 2×2 local region, and in the fourth convolutional layer, more appropriately called a locally connected layer, the weights are not shared at all.

  • The ConvNet extracts a 160-dimensional DeepID2 vector at the last layer of its feature-extraction cascade. The DeepID2 layer is fully connected to both the third and fourth convolutional layers. Because the fourth convolutional layer extracts more global features than the third, the DeepID2 layer takes multi-scale features as input, forming what is called a multi-scale ConvNet.

  • ReLU is used as the activation function for neurons in the convolutional layers and the DeepID2 layer; for large training datasets, ReLU units fit the data better than sigmoid units.

  • The DeepID2 features are learned under two supervisory signals.

    • The first is the face identification signal, which classifies each face image into one of n (e.g. n = 8192) different identities.

      • Identification is achieved by adding an n-way softmax layer on top of the DeepID2 layer, which outputs a probability distribution over the n classes. The network is trained to minimize the cross-entropy loss, referred to as the identification loss.

      • $\mathrm{Ident}(f,t,\theta_{id})=-\sum_{i=1}^{n}p_i\log\hat{p}_i=-\log\hat{p}_t$

      • Here $f$ is the DeepID2 vector, $t$ is the target class, and $\theta_{id}$ denotes the softmax-layer parameters. $p_i$ is the target probability distribution, with $p_t=1$ for the target class $t$ and $p_i=0$ for all other $i$; $\hat{p}_i$ is the predicted probability distribution.

      • To classify all identities correctly at the same time, the DeepID2 layer must form discriminative, identity-related features (a minimal sketch of this loss follows below).
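A minimal numpy sketch of the identification loss (my own illustration; `ident_loss`, `W`, and `b` are hypothetical names standing in for the softmax layer parameterized by $\theta_{id}$):

```python
import numpy as np

def ident_loss(f, t, W, b):
    """Softmax cross-entropy (identification) loss for one DeepID2 vector.
    f: (160,) DeepID2 vector, t: target identity index,
    W: (n_identities, 160) softmax weights, b: (n_identities,) biases."""
    logits = W @ f + b
    logits -= logits.max()                        # numerical stability
    p_hat = np.exp(logits) / np.exp(logits).sum()
    return -np.log(p_hat[t])                      # -log \hat{p}_t
```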

    • The second is the face verification signal, which encourages DeepID2 vectors extracted from faces of the same identity to be similar.

      • The verification signal directly regularizes the DeepID2 features and effectively reduces intra-personal variations.

      • Commonly used constraints include the L1/L2 norm and cosine similarity. The paper adopts the following L2-norm-based loss, originally proposed by Hadsell et al. for dimensionality reduction.

      • $\mathrm{Verif}(f_i,f_j,y_{ij},\theta_{ve})=\begin{cases}\frac{1}{2}\|f_i-f_j\|_2^2, & y_{ij}=1\\ \frac{1}{2}\max\left(0,\,m-\|f_i-f_j\|_2\right)^2, & y_{ij}=-1\end{cases}$

      • Here $f_i$ and $f_j$ are the DeepID2 vectors extracted from the two face images being compared. $y_{ij}=1$ means $f_i$ and $f_j$ come from the same identity; in that case the L2 distance between the two DeepID2 vectors is minimized. $y_{ij}=-1$ means different identities, whose distance is required to exceed the margin $m$. $\theta_{ve}=\{m\}$ is the parameter learned in the verification loss.

      • If cosine similarity is used instead, the verification loss is $\mathrm{Verif}(f_i,f_j,y_{ij},\theta_{ve})=\frac{1}{2}\left(y_{ij}-\sigma(wd+b)\right)^2$.

        • where $d=\frac{f_i\cdot f_j}{\|f_i\|_2\|f_j\|_2}$ is the cosine similarity between the DeepID2 vectors, $\theta_{ve}=\{w,b\}$ are learnable scaling and shifting parameters, $\sigma$ is the sigmoid function, and $y_{ij}$ is the binary label indicating whether the two face images belong to the same identity (a sketch of the L2-norm loss follows below).
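A matching sketch of the L2-norm verification loss (illustration only; here the margin `m` is passed as an argument, whereas in the paper it is the learned parameter $\theta_{ve}$):

```python
import numpy as np

def verif_loss(f_i, f_j, y_ij, m=1.0):
    """Contrastive (L2-norm) verification loss for a pair of DeepID2 vectors."""
    dist = np.linalg.norm(f_i - f_j)
    if y_ij == 1:                             # same identity: pull together
        return 0.5 * dist ** 2
    return 0.5 * max(0.0, m - dist) ** 2      # different identity: push beyond margin m
```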
  • The goal is to learn the parameters $\theta_c$ of the feature-extraction function $\mathrm{Conv}(\cdot)$; $\theta_{id}$ and $\theta_{ve}$ only serve to propagate the identification and verification signals during training. At test time, only $\theta_c$ is used for feature extraction. Parameters are updated by stochastic gradient descent.

  • The identification and verification gradients are weighted by a hyperparameter $\lambda$.

  • The DeepID2 learning algorithm.

    • Input: training set $\chi=\{(x_i,l_i)\}$, initialized parameters $\theta_c$, $\theta_{id}$ and $\theta_{ve}$, hyperparameter $\lambda$, learning rate $\eta(t)$; $t\leftarrow 0$
    • While not converged:
      • $t\leftarrow t+1$; sample two training examples $(x_i,l_i)$ and $(x_j,l_j)$ from $\chi$.
      • Compute the extracted features $f_i=\mathrm{Conv}(x_i,\theta_c)$ and $f_j=\mathrm{Conv}(x_j,\theta_c)$.
      • $\nabla\theta_{id}=\frac{\partial \mathrm{Ident}(f_i,l_i,\theta_{id})}{\partial\theta_{id}}+\frac{\partial \mathrm{Ident}(f_j,l_j,\theta_{id})}{\partial\theta_{id}}$
      • $\nabla\theta_{ve}=\lambda\cdot\frac{\partial \mathrm{Verif}(f_i,f_j,y_{ij},\theta_{ve})}{\partial\theta_{ve}}$, where $y_{ij}=1$ if $l_i=l_j$ and $y_{ij}=-1$ otherwise.
      • $\nabla f_i=\frac{\partial \mathrm{Ident}(f_i,l_i,\theta_{id})}{\partial f_i}+\lambda\cdot\frac{\partial \mathrm{Verif}(f_i,f_j,y_{ij},\theta_{ve})}{\partial f_i}$
      • $\nabla f_j=\frac{\partial \mathrm{Ident}(f_j,l_j,\theta_{id})}{\partial f_j}+\lambda\cdot\frac{\partial \mathrm{Verif}(f_i,f_j,y_{ij},\theta_{ve})}{\partial f_j}$
      • $\nabla\theta_c=\nabla f_i\cdot\frac{\partial \mathrm{Conv}(x_i,\theta_c)}{\partial\theta_c}+\nabla f_j\cdot\frac{\partial \mathrm{Conv}(x_j,\theta_c)}{\partial\theta_c}$
      • Update $\theta_{id}\leftarrow\theta_{id}-\eta(t)\cdot\nabla\theta_{id}$, $\theta_{ve}\leftarrow\theta_{ve}-\eta(t)\cdot\nabla\theta_{ve}$, $\theta_c\leftarrow\theta_c-\eta(t)\cdot\nabla\theta_c$.
    • Output: $\theta_c$ (a schematic numpy training step follows below).
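Below is a schematic numpy version of one pass of this algorithm, under clearly stated assumptions: $\mathrm{Conv}(\cdot)$ is replaced by a single linear map so every gradient can be written explicitly, the margin `m` plays the role of $\theta_{ve}$, and all sizes are toy values. It illustrates the update rules above and is not the paper's implementation (which trains a deep ConvNet by backpropagation).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: Conv(.) is a single linear map f = theta_c @ x.
n_pix, n_feat, n_id = 64, 160, 8                # input dim, DeepID2 dim, #identities
theta_c = rng.standard_normal((n_feat, n_pix)) * 0.01   # "feature extractor"
W = rng.standard_normal((n_id, n_feat)) * 0.01          # softmax weights (theta_id)
b = np.zeros(n_id)
m = 1.0                                                  # margin (theta_ve)
lam, eta = 0.05, 0.01                                    # lambda, learning rate

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for step in range(100):
    # sample two training examples (x_i, l_i), (x_j, l_j)
    x_i, x_j = rng.standard_normal(n_pix), rng.standard_normal(n_pix)
    l_i, l_j = rng.integers(n_id), rng.integers(n_id)
    y_ij = 1 if l_i == l_j else -1

    f_i, f_j = theta_c @ x_i, theta_c @ x_j              # f = Conv(x, theta_c)

    # identification gradients (softmax cross-entropy for both samples)
    grad_W, grad_b = np.zeros_like(W), np.zeros_like(b)
    grad_fi, grad_fj = np.zeros_like(f_i), np.zeros_like(f_j)
    for f, l, which in ((f_i, l_i, "i"), (f_j, l_j, "j")):
        p = softmax(W @ f + b)
        p[l] -= 1.0                                      # d Ident / d logits
        grad_W += np.outer(p, f)
        grad_b += p
        if which == "i":
            grad_fi += W.T @ p
        else:
            grad_fj += W.T @ p

    # verification gradients (contrastive loss), weighted by lambda
    diff = f_i - f_j
    dist = np.linalg.norm(diff) + 1e-12
    grad_m = 0.0
    if y_ij == 1:
        grad_fi += lam * diff
        grad_fj -= lam * diff
    elif dist < m:
        grad_fi -= lam * (m - dist) * diff / dist
        grad_fj += lam * (m - dist) * diff / dist
        grad_m = lam * (m - dist)

    # back-propagate to the "feature extractor": grad_f . dConv/dtheta_c
    grad_theta_c = np.outer(grad_fi, x_i) + np.outer(grad_fj, x_j)

    # SGD updates (eta(t) taken constant here)
    W -= eta * grad_W
    b -= eta * grad_b
    m -= eta * grad_m
    theta_c -= eta * grad_theta_c
```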

Face Verification

  • First, the SDM algorithm is used to detect 21 facial landmarks.

    • SDM (Supervised Descent Method) is mainly applied to face alignment. It is a function-approximation method that can be used to solve least-squares problems.

    • Newton's method is a common choice for least-squares problems, but it runs into several difficulties when applied to computer vision problems:

      • The Hessian is guaranteed to be positive definite only near a local optimum and may not be positive definite elsewhere, so the Newton direction is not necessarily a descent direction;
      • Newton's method requires the objective function to be twice differentiable, which may not hold in practice;
      • The Hessian can be very large: with 66 landmarks and a 128-dimensional descriptor per landmark, the unrolled vector has 66×128 = 8448 entries, so the Hessian is 8448×8448; inverting such a large matrix is very expensive ($O(p^3)$ operations and $O(p^2)$ storage).
      • Finding a method that avoids computing the Hessian, its possible indefiniteness, and its heavy storage and computation cost is the problem the SDM paper set out to solve (hence the SDM algorithm).
    • Write $Ix=R$, where $I$ is the feature matrix, $x$ is the mapping (regression) matrix, and $R$ is the shape offset. The goal of SDM training for face alignment is to obtain the mapping matrix $x$; the steps are as follows (a minimal least-squares sketch follows this list):

      1. Normalize the samples so that their scales are uniform;
      2. Compute the mean face shape;
      3. Place the mean shape on each sample as the initial estimate, aligning its center with the center of the original face shape;
      4. Extract local features at the landmarks of the mean-shape estimate on each sample (SIFT, SURF, or HOG), avoiding features that depend directly on raw gray values;
      5. Concatenate the features of all landmarks into one feature vector per sample; stacking all samples forms the matrix I;
      6. Compute the offset between the estimated shape and the true shape for each sample, forming the matrix R;
      7. Solve the linear system Ix=R.
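A minimal numpy sketch of the final regression step (an illustration with made-up toy data, not the SDM code; the sizes follow the 66-landmark, 128-dimensional example above):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, feat_dim, shape_dim = 300, 66 * 128, 2 * 66   # toy sizes
I = rng.standard_normal((n_samples, feat_dim))   # landmark features, one row per sample
R = rng.standard_normal((n_samples, shape_dim))  # shape offsets (estimate -> ground truth)

# x minimizes ||I x - R||^2 in the least-squares sense, without ever forming
# or inverting a Hessian explicitly.
x, *_ = np.linalg.lstsq(I, R, rcond=None)

# At test time, a shape estimate would be refined as: shape += features @ x
```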
  • Then, according to the detected landmarks, each face image is globally aligned by a similarity transformation.

  • 400 face patches are cropped; they vary in position, scale, color channel, and horizontal flipping, determined by the globally aligned faces and the landmark positions.

  • The 400 DeepID2 vectors are extracted by a total of 200 deep ConvNets, each trained to extract two 160-dimensional DeepID2 vectors from a particular patch and its horizontally flipped counterpart.

  • To reduce the redundancy among the many DeepID2 features, a forward-backward greedy algorithm selects a small number of effective and complementary DeepID2 vectors (25 in the experiments), which also saves most of the feature-extraction time during testing.

    • [Figure: the 25 selected face patches]
  • The 25 selected patches yield 25 160-dimensional DeepID2 vectors, which are concatenated into a 4000-dimensional DeepID2 vector. The 4000-dimensional vector is further compressed by PCA for face verification (a small sketch follows below).
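A sketch of assembling the final representation (an illustration; the 180-dimensional PCA target is an assumption for concreteness, the post only says the vector is compressed by PCA):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_faces = 1000
per_patch = rng.standard_normal((n_faces, 25, 160))       # 25 selected patches x 160-D each
concatenated = per_patch.reshape(n_faces, 25 * 160)       # 4000-D DeepID2 vector per face

pca = PCA(n_components=180)                               # assumed target dimension
compact = pca.fit_transform(concatenated)                 # features fed to Joint Bayesian
```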

  • A Joint Bayesian model is learned on the extracted DeepID2 features for face verification.

    • Joint Bayesian has been used successfully to model the joint probability of two faces under the same-identity and different-identity hypotheses. It models a face feature $f$ as the sum of inter-personal and intra-personal variations, $f=\mu+\varepsilon$, where both $\mu$ and $\varepsilon$ are estimated from training data as Gaussian distributions.
    • Face verification is performed with a log-likelihood ratio test, $\log\frac{P(f_1,f_2\mid H_{intra})}{P(f_1,f_2\mid H_{inter})}$, where the numerator and denominator are the joint probabilities of the two faces under the intra-personal (same identity) and inter-personal (different identities) variation hypotheses, respectively (a small sketch follows below).
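A sketch of this log-likelihood ratio under the model $f=\mu+\varepsilon$ (my own illustration; in practice $S_\mu$ and $S_\varepsilon$ are estimated from training data and a closed-form expression of the ratio is used for speed):

```python
import numpy as np
from scipy.stats import multivariate_normal

def joint_bayes_llr(f1, f2, S_mu, S_eps):
    """Log-likelihood ratio for Joint Bayesian verification.
    S_mu, S_eps: covariance estimates of the inter-personal component mu
    and the intra-personal component eps."""
    d = len(f1)
    pair = np.concatenate([f1, f2])
    # Same identity: the two faces share mu, so their features are correlated.
    cov_intra = np.block([[S_mu + S_eps, S_mu],
                          [S_mu,         S_mu + S_eps]])
    # Different identities: independent mu's, no cross-covariance.
    cov_inter = np.block([[S_mu + S_eps, np.zeros((d, d))],
                          [np.zeros((d, d)), S_mu + S_eps]])
    zero = np.zeros(2 * d)
    return (multivariate_normal.logpdf(pair, mean=zero, cov=cov_intra)
            - multivariate_normal.logpdf(pair, mean=zero, cov=cov_inter))
```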

Experiments

  • Training uses the CelebFaces+ dataset, which contains 202,599 face images of 10,177 identities (celebrities) collected from the Internet.
  • DeepID2 features are learned from the face images of 8192 identities randomly sampled from CelebFaces+ (called CelebFaces+A), while the face images of the remaining 1985 identities (called CelebFaces+B) are used for the subsequent feature selection and for learning the face verification model (Joint Bayesian).
    • While learning DeepID2 on CelebFaces+A, CelebFaces+B is used as a validation set to determine the learning rate, training epochs, and hyperparameter λ.
    • For feature selection, CelebFaces+B is split into a training set of 1485 identities and a validation set of 500 identities.
    • The Joint Bayesian model is trained on the entire CelebFaces+B, and the selected DeepID2 features are tested on LFW.

Balancing the identification and verification signals

  • By varying λ from 0 to +∞, we study the interaction of the identification and verification signals during feature learning. When λ = 0, the verification signal vanishes and only the identification signal takes effect. As λ increases, the verification signal gradually dominates training. At the other extreme, λ → +∞, only the verification signal remains.
  • The L2-norm verification loss defined earlier, $\mathrm{Verif}(f_i,f_j,y_{ij},\theta_{ve})=\begin{cases}\frac{1}{2}\|f_i-f_j\|_2^2, & y_{ij}=1\\ \frac{1}{2}\max\left(0,\,m-\|f_i-f_j\|_2\right)^2, & y_{ij}=-1\end{cases}$, is used for training.
    • [Figure: face verification accuracy on the test set as a function of λ]
  • Face verification accuracy on the test set is measured while varying the weighting parameter λ, comparing the learned DeepID2 features under the L2 norm and under the Joint Bayesian model. Neither the identification signal nor the verification signal alone is the best signal for learning features; instead, effective features come from an appropriate combination of the two.
  • [Figure: inter-personal and intra-personal variances as a function of λ]
  • The inter-personal and intra-personal variances are the eigenvalues of the corresponding scatter matrices; the corresponding eigenvectors represent different variation modes.
  • [Figure: first two PCA dimensions of features learned with λ = 0, 0.05, and +∞]
  • The first two PCA dimensions of features learned with λ = 0, 0.05, and +∞ are plotted for the six identities with the most face images in LFW, each marked with a different color.
  • When λ = 0 (left), although the cluster centers actually differ, the huge intra-personal variations cause the clusters to blend together.
  • When λ increases to 0.05 (middle), intra-personal variations are significantly reduced and the clusters become distinguishable.
  • When λ increases further toward infinity (right), intra-personal variation shrinks further, but the cluster centers also begin to collapse; some clusters overlap noticeably (the red, blue, and cyan clusters in the right figure) and become hard to distinguish again.

Investigating the verification signals

  • A verification signal of moderate strength mainly acts to reduce intra-personal variation. To verify this further, the full L2-norm verification signal is compared against variants that constrain only the positive or only the negative sample pairs (denoted L2+ and L2-).
  • L2+ only decreases the distance between DeepID2 vectors of the same identity, while L2- only increases the distance between DeepID2 vectors of different identities (when it is smaller than the margin).
  • [Figure/Table: face verification accuracy under different verification signals]
  • Face verification accuracy of the DeepID2 features on the test set is measured with the L2 norm and with Joint Bayesian, respectively.
  • The L1-norm and cosine verification signals, as well as no verification signal (none), are also compared. The identification signal is the same in all comparisons (classification over 8192 identities).
  • Features learned with the L2+ verification signal are only slightly worse than those learned with the full L2 signal. In contrast, the L2- verification signal barely helps feature learning, and its results are almost the same as using no verification signal at all. This confirms that the verification signal mainly works by reducing intra-personal variations.
  • Whenever a verification signal is added on top of the identification signal, face verification accuracy generally improves.
  • The L2 norm outperforms the other verification signals compared. This is probably because the other constraints are weaker than L2 and less effective at reducing intra-personal variations; for example, cosine similarity constrains only the angle, not the magnitude.

Final system and comparison with other methods

  • Before learning Joint Bayesian, the DeepID2 features are first projected into a low-dimensional feature space by PCA. After PCA, the Joint Bayesian model is trained on the entire CelebFaces+B and tested on the 6000 given face pairs in LFW: the log-likelihood ratio given by Joint Bayesian is compared against a threshold optimized on the training data to decide face verification.

  • [Figure: face verification accuracy versus the number of patches used]

  • The figure shows face verification accuracy as DeepID2 features are extracted from an increasing number of patches.

  • To make further use of the rich pool of DeepID2 features extracted from the large number of patches:

    • The feature selection algorithm is repeated six more times, each time selecting DeepID2 features from the patches not chosen in previous feature-selection steps.

    • A Joint Bayesian model is learned on each of the seven groups of selected features. The seven Joint Bayesian scores for each compared face pair are then fused by learning a support vector machine (see the sketch after this list).

    • In this way, an even higher face verification accuracy of 99.15% is achieved. The accuracy and ROC on LFW are compared with previous state-of-the-art methods below.

    • [Table: accuracy comparison with previous state-of-the-art methods on LFW]

    • [Figure: ROC comparison with previous state-of-the-art methods on LFW]
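A sketch of the score-fusion step referenced above (an illustration with made-up data; the linear kernel and C value are assumptions, not from the paper):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_pairs = 2000
scores = rng.standard_normal((n_pairs, 7))   # 7 Joint Bayesian scores per face pair
same = rng.integers(0, 2, n_pairs)           # 1 = same identity, 0 = different

fusion = SVC(kernel="linear", C=1.0)         # fuse the 7 scores into one decision
fusion.fit(scores, same)
decisions = fusion.predict(scores)
```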

Conclusion

  • The effects of the face identification and verification supervisory signals on the deep feature representation correspond to the two aspects of constructing ideal features for face recognition, namely enlarging inter-personal differences and reducing intra-personal variations; the combination of the two supervisory signals yields features much better than either signal alone.