
[Reading papers] Deep Learning Face Representation by Joint Identification-Verification (DeepID2)


Deep Learning Face Representation by Joint Identification-Verification

  • The key challenge of face recognition is to develop effective feature representations that reduce intra-personal variations while enlarging inter-personal differences.
  • The face identification task increases inter-personal differences by pulling apart DeepID2 features extracted from different identities, while the face verification task reduces intra-personal variations by pulling together DeepID2 features extracted from the same identity.
  • On the challenging LFW dataset, face verification accuracy reaches 99.15%.

Face verification

  • Faces of the same identity can look very different under varying pose, illumination, expression, age, and occlusion. Enlarging inter-personal differences while reducing intra-personal variations is a long-standing theme of face recognition research.

  • Traditional face recognition approaches: LDA, Bayesian face, and unified subspace methods.

    • LDA approximates inter- and intra-personal face variations with two linear subspaces and finds the projection directions that maximize the ratio between them.

      • LDA is a supervised dimensionality-reduction technique: every sample in its dataset carries a class label. This differs from PCA, which is unsupervised and ignores class labels.

      • LDA can be summarized in one sentence: after projection, minimize the within-class variance and maximize the between-class variance. When projecting the data to a lower dimension, we want the projected points of each class to be as close together as possible, and the distance between the class centers of different classes to be as large as possible.

      • Suppose there are two classes of data, red and blue, with two-dimensional features. We project the data onto a line (one dimension) so that the projected points of each class are as close together as possible, while the distance between the red and blue class centers is as large as possible.

        • [Figure: two classes of 2-D points projected onto a 1-D line]
  • Mathematical prerequisites

    • Assumption: the data follow Gaussian distributions (this is the theoretical basis).
    • Rayleigh quotient and generalized Rayleigh quotient
      • The Rayleigh quotient is the function $R(A,x)=\frac{x^HAx}{x^Hx}$, where $x$ is a nonzero vector and $A$ is an $n\times n$ Hermitian matrix, i.e. a matrix equal to its own conjugate transpose, $A^H=A$. If $A$ is real, any matrix satisfying $A^T=A$ is Hermitian.
      • The Rayleigh quotient has an important property: its maximum equals the largest eigenvalue of $A$ and its minimum equals the smallest eigenvalue of $A$, i.e. $\lambda_{min}\le\frac{x^HAx}{x^Hx}\le\lambda_{max}$. When $x$ is a unit vector, so that $x^Hx=1$, the Rayleigh quotient degenerates to $R(A,x)=x^HAx$, a form that also appears in PCA.
      • The generalized Rayleigh quotient is the function $R(A,B,x)=\frac{x^HAx}{x^HBx}$, where $x$ is a nonzero vector and $A$, $B$ are $n\times n$ Hermitian matrices with $B$ positive definite.
        • It can be reduced to a standard Rayleigh quotient by a change of variables. Let $x=B^{-\frac{1}{2}}x'$; then the denominator becomes $x^HBx=x'^H(B^{-\frac{1}{2}})^HBB^{-\frac{1}{2}}x'=x'^HB^{-\frac{1}{2}}BB^{-\frac{1}{2}}x'=x'^Hx'$, and the numerator becomes $x^HAx=x'^HB^{-\frac{1}{2}}AB^{-\frac{1}{2}}x'$.
        • So $R(A,B,x)$ becomes $R(A,B,x')=\frac{x'^HB^{-\frac{1}{2}}AB^{-\frac{1}{2}}x'}{x'^Hx'}$.
        • By the properties of the Rayleigh quotient, the maximum of $R(A,B,x')$ is the largest eigenvalue of $B^{-\frac{1}{2}}AB^{-\frac{1}{2}}$, equivalently of $B^{-1}A$, and the minimum is the smallest eigenvalue of $B^{-1}A$ (a quick numerical check follows this list).
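A quick numerical check of the eigenvalue bounds above (my own illustration, not from the original post):

```python
import numpy as np

# For a real symmetric (Hermitian) A, the Rayleigh quotient is bounded by the
# smallest and largest eigenvalues, and the maximum is attained at the
# eigenvector of the largest eigenvalue.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2                       # real symmetric matrix

def rayleigh(A, x):
    return (x @ A @ x) / (x @ x)

eigvals, eigvecs = np.linalg.eigh(A)    # eigenvalues in ascending order
print(rayleigh(A, eigvecs[:, -1]), eigvals[-1])   # equal up to float error

x = rng.standard_normal(5)
assert eigvals[0] - 1e-9 <= rayleigh(A, x) <= eigvals[-1] + 1e-9
```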
  • LDA Algorithm flow

    • Input: dataset $D=\{(x_1,y_1),(x_2,y_2),...,(x_m,y_m)\}$, where each sample $x_i$ is an $n$-dimensional vector and $y_i\in\{C_1,C_2,...,C_k\}$; target dimension $d$.

    • Output: the dimension-reduced sample set $D'$.

      • Compute the within-class scatter matrix $S_w$
      • Compute the between-class scatter matrix $S_b$
      • Compute the matrix $S_w^{-1}S_b$
      • Compute the $d$ largest eigenvalues of $S_w^{-1}S_b$ and the corresponding $d$ eigenvectors $(w_1,w_2,...,w_d)$ to obtain the projection matrix $W$
      • For each sample feature $x_i$, compute the new sample $z_i=W^Tx_i$
      • Obtain the output sample set $D'=\{(z_1,y_1),(z_2,y_2),...,(z_m,y_m)\}$ (a minimal numpy sketch of these steps follows this list)
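A minimal numpy sketch of the LDA steps above (my own illustration; `lda_fit` is a hypothetical helper name, not from the post):

```python
import numpy as np

def lda_fit(X, y, d):
    """Multi-class LDA following the steps above.
    X: (m, n) samples, y: (m,) integer labels, d: target dimension."""
    overall_mean = X.mean(axis=0)
    n = X.shape[1]
    Sw = np.zeros((n, n))               # within-class scatter
    Sb = np.zeros((n, n))               # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)
        diff = (mu_c - overall_mean)[:, None]
        Sb += len(Xc) * diff @ diff.T
    # Eigen-decomposition of Sw^{-1} Sb; keep the d leading eigenvectors.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1]
    W = eigvecs[:, order[:d]].real      # projection matrix, shape (n, d)
    return W

# Usage: Z = X @ lda_fit(X, y, d) gives the reduced samples z_i = W^T x_i.
```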
  • Besides dimensionality reduction, LDA can also be used for classification. A common approach assumes that each class's samples follow a Gaussian distribution; after the LDA projection, the mean and variance of each class's projected data are estimated by maximum likelihood, which gives each class's Gaussian probability density. When a new sample arrives, it is projected, its projected features are plugged into each class's Gaussian density, and the class with the largest resulting probability is the prediction.

  • Two-class LDA

    • Suppose our dataset is $D=\{(x_1,y_1),(x_2,y_2),...,(x_m,y_m)\}$, where each sample $x_i$ is an $n$-dimensional vector and $y_i\in\{0,1\}$. Define $N_j\,(j=0,1)$ as the number of samples in class $j$, $X_j\,(j=0,1)$ as the set of samples in class $j$, $\mu_j\,(j=0,1)$ as the mean vector of class $j$, and $\Sigma_j\,(j=0,1)$ as the covariance matrix of class $j$ (strictly speaking, the covariance matrix without the normalizing denominator).
    • The expression for $\mu_j$ is $\mu_j=\frac{1}{N_j}\sum_{x\in X_j}x,\ (j=0,1)$.
    • The expression for $\Sigma_j$ is $\Sigma_j=\sum_{x\in X_j}(x-\mu_j)(x-\mu_j)^T,\ (j=0,1)$.
    • Since there are only two classes, we just need to project the data onto a line. Suppose the projection line is the vector $w$; for any sample $x_i$, its projection onto $w$ is $w^Tx_i$, and the projections of the two class centers $\mu_0,\mu_1$ are $w^T\mu_0$ and $w^T\mu_1$. LDA requires the distance between the projected class centers of different classes to be as large as possible, i.e. to maximize $||w^T\mu_0-w^T\mu_1||_2^2$, while keeping the projected points of each class as close as possible, i.e. making the projected within-class covariances $w^T\Sigma_0w$ and $w^T\Sigma_1w$ as small as possible. The optimization objective is therefore $\underbrace{\arg\max}_{w}J(w)=\frac{||w^T\mu_0-w^T\mu_1||_2^2}{w^T\Sigma_0w+w^T\Sigma_1w}=\frac{w^T(\mu_0-\mu_1)(\mu_0-\mu_1)^Tw}{w^T(\Sigma_0+\Sigma_1)w}$.
    • Define the within-class scatter matrix $S_w=\Sigma_0+\Sigma_1=\sum_{x\in X_0}(x-\mu_0)(x-\mu_0)^T+\sum_{x\in X_1}(x-\mu_1)(x-\mu_1)^T$.
    • Define the between-class scatter matrix $S_b=(\mu_0-\mu_1)(\mu_0-\mu_1)^T$.
    • The objective is rewritten as $\underbrace{\arg\max}_{w}J(w)=\frac{w^TS_bw}{w^TS_ww}$ (its closed-form solution is derived right below).
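Since $J(w)$ is a generalized Rayleigh quotient, the optimal direction can be written down in one step (a small completion added here, using the eigenvalue property above):

```latex
\max_w \frac{w^TS_bw}{w^TS_ww}
\;\Longrightarrow\;
S_w^{-1}S_b\,w=\lambda w,
\qquad
S_bw=(\mu_0-\mu_1)\underbrace{(\mu_0-\mu_1)^Tw}_{\text{a scalar}}
\;\Longrightarrow\;
w\propto S_w^{-1}(\mu_0-\mu_1).
```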
  • Metric learning maps faces to a feature representation in which faces of the same identity are close to each other while faces of different identities stay apart.

    • In mathematics, a metric (or distance function) is a function that defines the distance between elements of a set; a set equipped with a metric is called a metric space. Metric learning is also called similarity learning.
    • Distance metric learning aims to measure the similarity between samples, which is one of the core problems of pattern recognition. The performance of many machine learning methods, such as k-nearest neighbors, support vector machines, radial basis function networks, k-means clustering, and various graph-based methods, is largely determined by the choice of the similarity measure between samples.
    • If the goal is to recognize faces, we need a distance function that emphasizes the appropriate features (hair color, face shape, etc.); if the goal is to recognize pose, we need a distance function that captures pose similarity.
    • To handle different notions of feature similarity, one could hand-pick features and hand-craft a distance function for each specific task, but this requires a great deal of manual effort and is not robust to changes in the data.
    • Metric learning is an appealing alternative: a metric distance function can be learned specifically for each task.
    • The common objective of metric learning is to make the distance between same-class samples as small as possible and the distance between samples of different classes as large as possible.
    • [Figure: metric learning pulls same-class samples together and pushes different-class samples apart]
  • These models are largely limited by their linear nature or shallow structures, whereas inter- and intra-personal variations are complex, highly nonlinear, and observed in a high-dimensional image space. (This motivates combining face recognition with deep learning.)

  • Both supervisory signals (the face identification signal and the face verification signal) are used simultaneously to learn these features.

  • Identification classifies an input image into one of a large number of identity classes; verification classifies a pair of images as belonging to the same identity or not (binary classification).

  • To characterize a face from different aspects, complementary DeepID2 features are extracted from different face regions and resolutions, concatenated, and reduced by PCA to form the final feature representation.

  • The idea of jointly solving classification and verification has been applied to general object recognition, where the focus is improving classification accuracy on a fixed set of object classes rather than the learned hidden feature representation.

Identification-verification guided deep feature learning

  • The convolution and pooling operations in deep ConvNets are designed to extract visual features hierarchically, from local low-level features to global high-level features.

  • [Figure: the DeepID2 ConvNet architecture]

  • The network contains four convolutional layers, the first three each followed by max-pooling.

  • To learn a diverse set of high-level features, weights in the higher convolutional layers are not shared across the entire feature map. Specifically, in the third convolutional layer weights are locally shared within every 2×2 local region, and in the fourth convolutional layer, more appropriately called a locally connected layer, the weights are not shared at all.

  • The ConvNet extracts a 160-dimensional DeepID2 vector at the last layer of its feature-extraction cascade. The DeepID2 layer is fully connected to both the third and fourth convolutional layers. Because the fourth convolutional layer extracts more global features than the third, the DeepID2 layer takes multi-scale features as input, forming what is called a multi-scale ConvNet.

  • ReLU is used as the activation function for neurons in the convolutional layers and the DeepID2 layer; for large training datasets, ReLU units fit the data better than sigmoid units.

  • The DeepID2 features are learned under two supervisory signals.

    • The first is the face identification signal, which classifies each face image into one of n (e.g. n = 8192) different identities.

      • Identification is achieved by adding an n-way softmax layer on top of the DeepID2 layer, which outputs a probability distribution over the n classes. The network is trained to minimize the cross-entropy loss, referred to as the identification loss.

      • $\mathrm{Ident}(f,t,\theta_{id})=-\sum_{i=1}^{n}p_i\log\hat{p}_i=-\log\hat{p}_t$

      • Here $f$ is the DeepID2 vector, $t$ is the target class, and $\theta_{id}$ denotes the softmax-layer parameters. $p_i$ is the target probability distribution, with $p_t=1$ for the target class $t$ and $p_i=0$ for all other $i$; $\hat{p}_i$ is the predicted probability distribution.

      • To classify all identities correctly at the same time, the DeepID2 layer must form discriminative, identity-related features (a minimal sketch of this loss follows below).
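A minimal numpy sketch of the identification loss (my own illustration; `ident_loss`, `W`, and `b` are hypothetical names standing in for the softmax layer parameterized by $\theta_{id}$):

```python
import numpy as np

def ident_loss(f, t, W, b):
    """Softmax cross-entropy (identification) loss for one DeepID2 vector.
    f: (160,) DeepID2 vector, t: target identity index,
    W: (n_identities, 160) softmax weights, b: (n_identities,) biases."""
    logits = W @ f + b
    logits -= logits.max()                        # numerical stability
    p_hat = np.exp(logits) / np.exp(logits).sum()
    return -np.log(p_hat[t])                      # -log \hat{p}_t
```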

    • The second is the face verification signal, which encourages DeepID2 vectors extracted from faces of the same identity to be similar.

      • The verification signal directly regularizes the DeepID2 features and effectively reduces intra-personal variations.

      • Commonly used constraints include the L1/L2 norm and cosine similarity. The paper adopts the following L2-norm-based loss, originally proposed by Hadsell et al. for dimensionality reduction.

      • $\mathrm{Verif}(f_i,f_j,y_{ij},\theta_{ve})=\begin{cases}\frac{1}{2}\|f_i-f_j\|_2^2, & y_{ij}=1\\ \frac{1}{2}\max\left(0,\,m-\|f_i-f_j\|_2\right)^2, & y_{ij}=-1\end{cases}$

      • Here $f_i$ and $f_j$ are the DeepID2 vectors extracted from the two face images being compared. $y_{ij}=1$ means $f_i$ and $f_j$ come from the same identity; in that case the L2 distance between the two DeepID2 vectors is minimized. $y_{ij}=-1$ means different identities, whose distance is required to exceed the margin $m$. $\theta_{ve}=\{m\}$ is the parameter learned in the verification loss.

      • If cosine similarity is used instead, the verification loss is $\mathrm{Verif}(f_i,f_j,y_{ij},\theta_{ve})=\frac{1}{2}\left(y_{ij}-\sigma(wd+b)\right)^2$.

        • where $d=\frac{f_i\cdot f_j}{\|f_i\|_2\|f_j\|_2}$ is the cosine similarity between the DeepID2 vectors, $\theta_{ve}=\{w,b\}$ are learnable scaling and shifting parameters, $\sigma$ is the sigmoid function, and $y_{ij}$ is the binary label indicating whether the two face images belong to the same identity (a sketch of the L2-norm loss follows below).
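A matching sketch of the L2-norm verification loss (illustration only; here the margin `m` is passed as an argument, whereas in the paper it is the learned parameter $\theta_{ve}$):

```python
import numpy as np

def verif_loss(f_i, f_j, y_ij, m=1.0):
    """Contrastive (L2-norm) verification loss for a pair of DeepID2 vectors."""
    dist = np.linalg.norm(f_i - f_j)
    if y_ij == 1:                             # same identity: pull together
        return 0.5 * dist ** 2
    return 0.5 * max(0.0, m - dist) ** 2      # different identity: push beyond margin m
```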
  • The goal is to learn the parameters $\theta_c$ of the feature-extraction function $\mathrm{Conv}(\cdot)$; $\theta_{id}$ and $\theta_{ve}$ only serve to propagate the identification and verification signals during training. At test time, only $\theta_c$ is used for feature extraction. Parameters are updated by stochastic gradient descent.

  • The identification and verification gradients are weighted by a hyperparameter $\lambda$.

  • The DeepID2 learning algorithm.

    • Input: training set $\chi=\{(x_i,l_i)\}$, initialized parameters $\theta_c$, $\theta_{id}$ and $\theta_{ve}$, hyperparameter $\lambda$, learning rate $\eta(t)$; $t\leftarrow 0$
    • While not converged:
      • $t\leftarrow t+1$; sample two training examples $(x_i,l_i)$ and $(x_j,l_j)$ from $\chi$.
      • Compute the extracted features $f_i=\mathrm{Conv}(x_i,\theta_c)$ and $f_j=\mathrm{Conv}(x_j,\theta_c)$.
      • $\nabla\theta_{id}=\frac{\partial \mathrm{Ident}(f_i,l_i,\theta_{id})}{\partial\theta_{id}}+\frac{\partial \mathrm{Ident}(f_j,l_j,\theta_{id})}{\partial\theta_{id}}$
      • $\nabla\theta_{ve}=\lambda\cdot\frac{\partial \mathrm{Verif}(f_i,f_j,y_{ij},\theta_{ve})}{\partial\theta_{ve}}$, where $y_{ij}=1$ if $l_i=l_j$ and $y_{ij}=-1$ otherwise.
      • $\nabla f_i=\frac{\partial \mathrm{Ident}(f_i,l_i,\theta_{id})}{\partial f_i}+\lambda\cdot\frac{\partial \mathrm{Verif}(f_i,f_j,y_{ij},\theta_{ve})}{\partial f_i}$
      • $\nabla f_j=\frac{\partial \mathrm{Ident}(f_j,l_j,\theta_{id})}{\partial f_j}+\lambda\cdot\frac{\partial \mathrm{Verif}(f_i,f_j,y_{ij},\theta_{ve})}{\partial f_j}$
      • $\nabla\theta_c=\nabla f_i\cdot\frac{\partial \mathrm{Conv}(x_i,\theta_c)}{\partial\theta_c}+\nabla f_j\cdot\frac{\partial \mathrm{Conv}(x_j,\theta_c)}{\partial\theta_c}$
      • Update $\theta_{id}\leftarrow\theta_{id}-\eta(t)\cdot\nabla\theta_{id}$, $\theta_{ve}\leftarrow\theta_{ve}-\eta(t)\cdot\nabla\theta_{ve}$, $\theta_c\leftarrow\theta_c-\eta(t)\cdot\nabla\theta_c$.
    • Output: $\theta_c$ (a schematic numpy training step follows below).
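Below is a schematic numpy version of one pass of this algorithm, under clearly stated assumptions: $\mathrm{Conv}(\cdot)$ is replaced by a single linear map so every gradient can be written explicitly, the margin `m` plays the role of $\theta_{ve}$, and all sizes are toy values. It illustrates the update rules above and is not the paper's implementation (which trains a deep ConvNet by backpropagation).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: Conv(.) is a single linear map f = theta_c @ x.
n_pix, n_feat, n_id = 64, 160, 8                # input dim, DeepID2 dim, #identities
theta_c = rng.standard_normal((n_feat, n_pix)) * 0.01   # "feature extractor"
W = rng.standard_normal((n_id, n_feat)) * 0.01          # softmax weights (theta_id)
b = np.zeros(n_id)
m = 1.0                                                  # margin (theta_ve)
lam, eta = 0.05, 0.01                                    # lambda, learning rate

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for step in range(100):
    # sample two training examples (x_i, l_i), (x_j, l_j)
    x_i, x_j = rng.standard_normal(n_pix), rng.standard_normal(n_pix)
    l_i, l_j = rng.integers(n_id), rng.integers(n_id)
    y_ij = 1 if l_i == l_j else -1

    f_i, f_j = theta_c @ x_i, theta_c @ x_j              # f = Conv(x, theta_c)

    # identification gradients (softmax cross-entropy for both samples)
    grad_W, grad_b = np.zeros_like(W), np.zeros_like(b)
    grad_fi, grad_fj = np.zeros_like(f_i), np.zeros_like(f_j)
    for f, l, which in ((f_i, l_i, "i"), (f_j, l_j, "j")):
        p = softmax(W @ f + b)
        p[l] -= 1.0                                      # d Ident / d logits
        grad_W += np.outer(p, f)
        grad_b += p
        if which == "i":
            grad_fi += W.T @ p
        else:
            grad_fj += W.T @ p

    # verification gradients (contrastive loss), weighted by lambda
    diff = f_i - f_j
    dist = np.linalg.norm(diff) + 1e-12
    grad_m = 0.0
    if y_ij == 1:
        grad_fi += lam * diff
        grad_fj -= lam * diff
    elif dist < m:
        grad_fi -= lam * (m - dist) * diff / dist
        grad_fj += lam * (m - dist) * diff / dist
        grad_m = lam * (m - dist)

    # back-propagate to the "feature extractor": grad_f . dConv/dtheta_c
    grad_theta_c = np.outer(grad_fi, x_i) + np.outer(grad_fj, x_j)

    # SGD updates (eta(t) taken constant here)
    W -= eta * grad_W
    b -= eta * grad_b
    m -= eta * grad_m
    theta_c -= eta * grad_theta_c
```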

Face Verification

  • First, the SDM algorithm is used to detect 21 facial landmarks.

    • SDM (Supervised Descent Method) is mainly applied to face alignment. It is a function-approximation method that can be used to solve least-squares problems.

    • Newton's method is a common choice for least-squares problems, but it runs into several difficulties when applied to computer vision problems:

      • The Hessian is guaranteed to be positive definite only near a local optimum and may not be positive definite elsewhere, so the Newton direction is not necessarily a descent direction;
      • Newton's method requires the objective function to be twice differentiable, which may not hold in practice;
      • The Hessian can be very large: with 66 landmarks and a 128-dimensional descriptor per landmark, the unrolled vector has 66×128 = 8448 entries, so the Hessian is 8448×8448; inverting such a large matrix is very expensive ($O(p^3)$ operations and $O(p^2)$ storage).
      • Finding a method that avoids computing the Hessian, its possible indefiniteness, and its heavy storage and computation cost is the problem the SDM paper set out to solve (hence the SDM algorithm).
    • Write $Ix=R$, where $I$ is the feature matrix, $x$ is the mapping (regression) matrix, and $R$ is the shape offset. The goal of SDM training for face alignment is to obtain the mapping matrix $x$; the steps are as follows (a minimal least-squares sketch follows this list):

      1. Normalize the samples so that their scales are uniform;
      2. Compute the mean face shape;
      3. Place the mean shape on each sample as the initial estimate, aligning its center with the center of the original face shape;
      4. Extract local features at the landmarks of the mean-shape estimate on each sample (SIFT, SURF, or HOG), avoiding features that depend directly on raw gray values;
      5. Concatenate the features of all landmarks into one feature vector per sample; stacking all samples forms the matrix I;
      6. Compute the offset between the estimated shape and the true shape for each sample, forming the matrix R;
      7. Solve the linear system Ix=R.
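A minimal numpy sketch of the final regression step (an illustration with made-up toy data, not the SDM code; the sizes follow the 66-landmark, 128-dimensional example above):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, feat_dim, shape_dim = 300, 66 * 128, 2 * 66   # toy sizes
I = rng.standard_normal((n_samples, feat_dim))   # landmark features, one row per sample
R = rng.standard_normal((n_samples, shape_dim))  # shape offsets (estimate -> ground truth)

# x minimizes ||I x - R||^2 in the least-squares sense, without ever forming
# or inverting a Hessian explicitly.
x, *_ = np.linalg.lstsq(I, R, rcond=None)

# At test time, a shape estimate would be refined as: shape += features @ x
```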
  • Then, according to the detected landmarks, each face image is globally aligned by a similarity transformation.

  • 400 face patches are cropped; they vary in position, scale, color channel, and horizontal flipping, determined by the globally aligned faces and the landmark positions.

  • The 400 DeepID2 vectors are extracted by a total of 200 deep ConvNets, each trained to extract two 160-dimensional DeepID2 vectors from a particular patch and its horizontally flipped counterpart.

  • To reduce the redundancy among the many DeepID2 features, a forward-backward greedy algorithm selects a small number of effective and complementary DeepID2 vectors (25 in the experiments), which also saves most of the feature-extraction time during testing.

    • [Figure: the 25 selected face patches]
  • The 25 selected patches yield 25 160-dimensional DeepID2 vectors, which are concatenated into a 4000-dimensional DeepID2 vector. The 4000-dimensional vector is further compressed by PCA for face verification (a small sketch follows below).
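A sketch of assembling the final representation (an illustration; the 180-dimensional PCA target is an assumption for concreteness, the post only says the vector is compressed by PCA):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_faces = 1000
per_patch = rng.standard_normal((n_faces, 25, 160))       # 25 selected patches x 160-D each
concatenated = per_patch.reshape(n_faces, 25 * 160)       # 4000-D DeepID2 vector per face

pca = PCA(n_components=180)                               # assumed target dimension
compact = pca.fit_transform(concatenated)                 # features fed to Joint Bayesian
```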

  • A Joint Bayesian model is learned on the extracted DeepID2 features for face verification.

    • Joint Bayesian has been used successfully to model the joint probability of two faces under the same-identity and different-identity hypotheses. It models a face feature $f$ as the sum of inter-personal and intra-personal variations, $f=\mu+\varepsilon$, where both $\mu$ and $\varepsilon$ are estimated from training data as Gaussian distributions.
    • Face verification is performed with a log-likelihood ratio test, $\log\frac{P(f_1,f_2\mid H_{intra})}{P(f_1,f_2\mid H_{inter})}$, where the numerator and denominator are the joint probabilities of the two faces under the intra-personal (same identity) and inter-personal (different identities) variation hypotheses, respectively (a small sketch follows below).
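A sketch of this log-likelihood ratio under the model $f=\mu+\varepsilon$ (my own illustration; in practice $S_\mu$ and $S_\varepsilon$ are estimated from training data and a closed-form expression of the ratio is used for speed):

```python
import numpy as np
from scipy.stats import multivariate_normal

def joint_bayes_llr(f1, f2, S_mu, S_eps):
    """Log-likelihood ratio for Joint Bayesian verification.
    S_mu, S_eps: covariance estimates of the inter-personal component mu
    and the intra-personal component eps."""
    d = len(f1)
    pair = np.concatenate([f1, f2])
    # Same identity: the two faces share mu, so their features are correlated.
    cov_intra = np.block([[S_mu + S_eps, S_mu],
                          [S_mu,         S_mu + S_eps]])
    # Different identities: independent mu's, no cross-covariance.
    cov_inter = np.block([[S_mu + S_eps, np.zeros((d, d))],
                          [np.zeros((d, d)), S_mu + S_eps]])
    zero = np.zeros(2 * d)
    return (multivariate_normal.logpdf(pair, mean=zero, cov=cov_intra)
            - multivariate_normal.logpdf(pair, mean=zero, cov=cov_inter))
```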

Experiments

  • Training uses the CelebFaces+ dataset, which contains 202,599 face images of 10,177 identities (celebrities) collected from the Internet.
  • DeepID2 features are learned from the face images of 8192 identities randomly sampled from CelebFaces+ (called CelebFaces+A), while the face images of the remaining 1985 identities (called CelebFaces+B) are used for the subsequent feature selection and for learning the face verification model (Joint Bayesian).
    • While learning DeepID2 on CelebFaces+A, CelebFaces+B is used as a validation set to determine the learning rate, training epochs, and hyperparameter λ.
    • For feature selection, CelebFaces+B is split into a training set of 1485 identities and a validation set of 500 identities.
    • The Joint Bayesian model is trained on the entire CelebFaces+B, and the selected DeepID2 features are tested on LFW.

Balancing the identification and verification signals

  • By varying λ from 0 to +∞, we study the interaction of the identification and verification signals during feature learning. When λ = 0, the verification signal vanishes and only the identification signal takes effect. As λ increases, the verification signal gradually dominates training. At the other extreme, λ → +∞, only the verification signal remains.
  • The L2-norm verification loss defined earlier, $\mathrm{Verif}(f_i,f_j,y_{ij},\theta_{ve})=\begin{cases}\frac{1}{2}\|f_i-f_j\|_2^2, & y_{ij}=1\\ \frac{1}{2}\max\left(0,\,m-\|f_i-f_j\|_2\right)^2, & y_{ij}=-1\end{cases}$, is used for training.
    • [Figure: face verification accuracy on the test set as a function of λ]
  • Face verification accuracy on the test set is measured while varying the weighting parameter λ, comparing the learned DeepID2 features under the L2 norm and under the Joint Bayesian model. Neither the identification signal nor the verification signal alone is the best signal for learning features; instead, effective features come from an appropriate combination of the two.
  • [Figure: inter-personal and intra-personal variances as a function of λ]
  • The inter-personal and intra-personal variances are the eigenvalues of the corresponding scatter matrices; the corresponding eigenvectors represent different variation modes.
  • [Figure: first two PCA dimensions of features learned with λ = 0, 0.05, and +∞]
  • The first two PCA dimensions of features learned with λ = 0, 0.05, and +∞ are plotted for the six identities with the most face images in LFW, each marked with a different color.
  • When λ = 0 (left), although the cluster centers actually differ, the huge intra-personal variations cause the clusters to blend together.
  • When λ increases to 0.05 (middle), intra-personal variations are significantly reduced and the clusters become distinguishable.
  • When λ increases further toward infinity (right), intra-personal variation shrinks further, but the cluster centers also begin to collapse; some clusters overlap noticeably (the red, blue, and cyan clusters in the right figure) and become hard to distinguish again.

Investigating the verification signals

  • A verification signal of moderate strength mainly acts to reduce intra-personal variation. To verify this further, the full L2-norm verification signal is compared against variants that constrain only the positive or only the negative sample pairs (denoted L2+ and L2-).
  • L2+ only decreases the distance between DeepID2 vectors of the same identity, while L2- only increases the distance between DeepID2 vectors of different identities (when it is smaller than the margin).
  • [Figure/Table: face verification accuracy under different verification signals]
  • Face verification accuracy of the DeepID2 features on the test set is measured with the L2 norm and with Joint Bayesian, respectively.
  • The L1-norm and cosine verification signals, as well as no verification signal (none), are also compared. The identification signal is the same in all comparisons (classification over 8192 identities).
  • Features learned with the L2+ verification signal are only slightly worse than those learned with the full L2 signal. In contrast, the L2- verification signal barely helps feature learning, and its results are almost the same as using no verification signal at all. This confirms that the verification signal mainly works by reducing intra-personal variations.
  • Whenever a verification signal is added on top of the identification signal, face verification accuracy generally improves.
  • The L2 norm outperforms the other verification signals compared. This is probably because the other constraints are weaker than L2 and less effective at reducing intra-personal variations; for example, cosine similarity constrains only the angle, not the magnitude.

Final system and comparison with other methods

  • Before learning Joint Bayesian, the DeepID2 features are first projected into a low-dimensional feature space by PCA. After PCA, the Joint Bayesian model is trained on the entire CelebFaces+B and tested on the 6000 given face pairs in LFW: the log-likelihood ratio given by Joint Bayesian is compared against a threshold optimized on the training data to decide face verification.

  • [Figure: face verification accuracy versus the number of patches used]

  • The figure shows face verification accuracy as DeepID2 features are extracted from an increasing number of patches.

  • To make further use of the rich pool of DeepID2 features extracted from the large number of patches:

    • The feature selection algorithm is repeated six more times, each time selecting DeepID2 features from the patches not chosen in previous feature-selection steps.

    • A Joint Bayesian model is learned on each of the seven groups of selected features. The seven Joint Bayesian scores for each compared face pair are then fused by learning a support vector machine (see the sketch after this list).

    • In this way, an even higher face verification accuracy of 99.15% is achieved. The accuracy and ROC on LFW are compared with previous state-of-the-art methods below.

    • [Table: accuracy comparison with previous state-of-the-art methods on LFW]

    • [Figure: ROC comparison with previous state-of-the-art methods on LFW]
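A sketch of the score-fusion step referenced above (an illustration with made-up data; the linear kernel and C value are assumptions, not from the paper):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_pairs = 2000
scores = rng.standard_normal((n_pairs, 7))   # 7 Joint Bayesian scores per face pair
same = rng.integers(0, 2, n_pairs)           # 1 = same identity, 0 = different

fusion = SVC(kernel="linear", C=1.0)         # fuse the 7 scores into one decision
fusion.fit(scores, same)
decisions = fusion.predict(scores)
```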

Conclusion

  • The effects of the face identification and verification supervisory signals on the deep feature representation correspond to the two aspects of constructing ideal features for face recognition, namely enlarging inter-personal differences and reducing intra-personal variations; the combination of the two supervisory signals yields features much better than either signal alone.