
RetinaFace: Single-stage Dense Face Localization in the Wild

2022-07-03 10:01:00 【Star soul is not a dream】

Paper code: https://github.com/deepinsight/insightface/tree/master/RetinaFace

Accurate and efficient face localization in the wild remains a challenge. This paper presents a robust single-stage face detector, RetinaFace, which performs pixel-wise face localization on faces of various scales by combining extra-supervised (contribution 1) and self-supervised (contribution 2) multi-task learning.

Five contributions:

Five facial landmarks are manually annotated on the WIDER FACE dataset, and face detection is observed to improve significantly with the help of this extra supervision signal.
A self-supervised mesh decoder branch is added to predict pixel-wise 3D face shape information. This branch runs in parallel with the existing supervised branches.
On the WIDER FACE hard test set, RetinaFace outperforms the previous best average precision (AP) by 1.1% (AP = 91.4%).
On the IJB-C test set, RetinaFace enables the state-of-the-art ArcFace to further improve its face verification results (TAR = 89.59% at FAR = 1e-6).
By using lightweight backbone networks, RetinaFace can run in real time on a single CPU core for a VGA-resolution image.

5. Conclusion

This paper studies the challenging problem of simultaneous dense localization and alignment of faces of arbitrary scale, and proposes RetinaFace, which achieves the best detection results on the most challenging current face detection benchmark. Moreover, when RetinaFace is combined with state-of-the-art face recognition practice, it significantly improves accuracy. The data and models have been made publicly available to facilitate further research on this topic.

1. Introduction
Automatic face localization is a prerequisite for facial image analysis tasks such as facial attribute estimation (expression, age) and identity recognition. In the narrow sense, face localization refers to traditional face detection, which aims to estimate face bounding boxes without any scale or position prior. This paper, however, uses a broader definition that includes face detection, face alignment, pixel-wise face parsing, and 3D dense correspondence regression. Such dense face localization provides accurate facial position information for all different scales.

Inspired by generic object detection methods, which have absorbed the recent advances in deep learning, face detection has made remarkable progress in recent years. Unlike generic object detection, face detection features smaller variation in aspect ratio (from 1:1 to 1:1.5) but much larger variation in scale (from a few pixels to a thousand pixels). The most recent state-of-the-art methods focus on single-stage designs, which densely sample face positions and scales on feature pyramids and show promising performance and faster speed compared with two-stage methods. Following this route, we improve the single-stage face detection framework and propose a state-of-the-art dense face localization method that exploits multi-task losses from strongly supervised and self-supervised signals. The idea is shown in Figure 1:

Figure 1. The proposed single-stage pixel-wise face localization method employs extra-supervised and self-supervised multi-task learning in parallel with the existing box classification and regression branches. Each positive anchor outputs: (1) a face score; (2) a face box; (3) five facial landmarks; (4) dense 3D face vertices projected onto the image plane.

Typically, face detection training includes both classification and box regression losses. Chen et al. [6] (D. Chen, S. Ren, Y. Wei, X. Cao, and J. Sun. Joint cascade face detection and alignment. In ECCV, 2014) observed that aligned face shapes provide better features for face classification, and proposed combining face detection and alignment in a joint cascade framework. Inspired by [6], MTCNN and STN detect faces and five facial landmarks simultaneously. Owing to the limitation of training data, JDA, MTCNN and STN have not verified whether tiny-face detection can benefit from the extra supervision of five facial landmarks. One of the questions we aim to answer in this paper is whether extra supervision signals built from five facial landmarks can push forward the current best performance (90.3% [67]) on the WIDER FACE hard test set [60].

In Mask R-CNN, detection performance is significantly improved by adding a branch that predicts an object mask in parallel with the existing branches for bounding-box recognition and regression. This confirms that dense pixel-wise annotation also helps detection. Unfortunately, for the challenging faces of WIDER FACE it is not possible to produce dense face annotation (either in the form of more landmarks or semantic segments). Since supervised signals are not easy to obtain, the question is whether unsupervised methods can be applied to further improve face detection.

FAN proposes an anchor-level attention map to improve the detection of occluded faces. However, the proposed attention map is quite coarse and contains no semantic information. Recently, self-supervised 3D morphable models have achieved promising 3D face modeling in the wild. In particular, Mesh Decoder achieves real-time speed by exploiting graph convolutions on joint shape and texture. The main challenges of applying a mesh decoder to a single-stage detector are: (1) camera parameters are hard to estimate accurately, and (2) the joint latent shape and texture representation is predicted from a single feature vector (a 1×1 Conv on the feature pyramid) rather than from RoI-pooled features, which carries a risk of feature shift. In this paper, we employ a mesh decoder branch trained through self-supervised learning to predict pixel-wise 3D face shape in parallel with the existing supervised branches. In summary, our main contributions are as follows:

Based on a single-stage design, we propose a new pixel-wise face localization method named RetinaFace, which uses a multi-task learning strategy to simultaneously predict the face score, the face box, five facial landmarks, and the 3D position and correspondence of each facial pixel.
On the WIDER FACE hard subset, RetinaFace outperforms the AP of the state-of-the-art two-stage method by 1.1% (AP = 91.4%).
On the IJB-C dataset, RetinaFace helps improve the verification accuracy of ArcFace (TAR = 89.59% at FAR = 1e-6). This indicates that better face localization can significantly improve face recognition.
By using lightweight backbone networks, RetinaFace can run in real time on a single CPU core for a VGA-resolution image.
Extra annotations and code have been released to facilitate future research.

2. Related work
Image pyramid vs. feature pyramid:
Sliding windows, in which a classifier is applied on a dense image grid, can be traced back decades. The milestone work of Viola-Jones explored a cascade chain to reject false face regions from an image pyramid efficiently and in real time, which led to the widespread adoption of this scale-invariant face detection framework. Although sliding windows on image pyramids were the leading detection paradigm, with the emergence of feature pyramids, sliding anchors on multi-scale feature maps quickly came to dominate face detection.
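To make "sliding anchors on multi-scale feature maps" concrete, here is a minimal sketch that tiles square anchors over every cell of several pyramid levels. The function name `make_anchors`, the strides and the anchor sizes are illustrative assumptions, not the configuration used by RetinaFace.

```python
import numpy as np

def make_anchors(image_size, strides, sizes_per_level):
    """Tile square (1:1) anchors over every cell of every pyramid level.

    Returns an (N, 4) array of anchors as (cx, cy, w, h) in image pixels.
    """
    height, width = image_size
    anchors = []
    for stride, sizes in zip(strides, sizes_per_level):
        # Feature-map size at this level (ceil so border pixels are covered).
        fm_h = int(np.ceil(height / stride))
        fm_w = int(np.ceil(width / stride))
        # Anchor centres sit in the middle of each feature-map cell.
        cy, cx = np.meshgrid(
            (np.arange(fm_h) + 0.5) * stride,
            (np.arange(fm_w) + 0.5) * stride,
            indexing="ij",
        )
        for size in sizes:
            side = np.full(cx.size, size, dtype=float)
            anchors.append(np.stack([cx.ravel(), cy.ravel(), side, side], axis=1))
    return np.concatenate(anchors, axis=0)

# Hypothetical configuration: three pyramid levels, two anchor sizes per level,
# applied to a VGA (480 x 640) image.
anchors = make_anchors((480, 640), strides=[8, 16, 32],
                       sizes_per_level=[[16, 32], [64, 128], [256, 512]])
print(anchors.shape)
```

Small strides carry small anchors and large strides carry large ones, so a single forward pass covers the whole range of face scales without resizing the image into a pyramid.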
Two-stage vs. single-stage:
Current face detection methods inherit some achievements of generic object detection and can be divided into two categories: two-stage methods (e.g. Faster R-CNN) and single-stage methods (e.g. SSD and RetinaNet). Two-stage methods use a "proposal and refinement" mechanism, which gives high localization accuracy. Single-stage methods densely sample face positions and scales, which results in a severe imbalance between positive and negative samples during training. To handle this imbalance, sampling and re-weighting methods have been widely adopted. Compared with two-stage methods, single-stage methods are more efficient and have a higher recall rate, but at the risk of a higher false positive rate and reduced localization accuracy.
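The two remedies for class imbalance can be sketched as follows. The sampling variant keeps only the hardest negatives at a fixed negative-to-positive ratio, and the re-weighting variant uses the focal-loss modulating factor from RetinaNet. The function names, the 3:1 ratio and gamma = 2 are assumptions chosen for illustration; this section does not state which scheme RetinaFace uses.

```python
import numpy as np

def hard_negative_mask(neg_loss, is_positive, neg_pos_ratio=3):
    """Sampling: keep only the highest-loss negatives, neg_pos_ratio per positive."""
    num_pos = int(is_positive.sum())
    # With no positives, still keep neg_pos_ratio negatives (an arbitrary choice).
    num_neg = min(neg_pos_ratio * max(num_pos, 1), int((~is_positive).sum()))
    # Positives get -inf so they are never selected as negatives.
    masked = np.where(is_positive, -np.inf, neg_loss)
    keep = np.zeros_like(is_positive)
    keep[np.argsort(-masked)[:num_neg]] = True
    return keep

def focal_weight(p_face, is_positive, gamma=2.0):
    """Re-weighting: down-weight anchors that are already classified well."""
    p_true = np.where(is_positive, p_face, 1.0 - p_face)
    return (1.0 - p_true) ** gamma

# Toy batch: one positive anchor and five negatives.
p_face = np.array([0.9, 0.8, 0.1, 0.05, 0.6, 0.02])
is_positive = np.array([True, False, False, False, False, False])
cls_loss = -np.log(np.where(is_positive, p_face, 1.0 - p_face))

keep = hard_negative_mask(cls_loss, is_positive)   # negatives used for training
weights = focal_weight(p_face, is_positive)        # per-anchor loss weights
print(keep, weights)
```

In both cases the many easy negatives (background anchors the model already rejects with high confidence) stop dominating the gradient.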
Context modeling:
To enhance the model's contextual reasoning ability for capturing tiny faces, SSH and PyramidBox apply context modules on feature pyramids to enlarge the receptive field from Euclidean grids. To enhance the non-rigid transformation modeling capacity of CNNs, the deformable convolutional network (DCN) uses a novel deformable layer to model geometric transformations. The champion solution of the WIDER Face Challenge 2018 shows that rigid (expansion) and non-rigid (deformation) context modeling are complementary and orthogonal, and that both improve face detection performance.
Multi-task learning:
Joint face detection and alignment is widely used because aligned face shapes provide better features for face classification. In Mask R-CNN, detection performance is significantly improved by adding a branch that predicts an object mask in parallel with the existing branches. DensePose adopts the Mask R-CNN architecture to obtain dense part labels and coordinates within each selected region. However, the dense regression branch there is trained by supervised learning. In addition, the dense branch is a small FCN applied to each RoI to predict a pixel-to-pixel dense mapping.

3. RetinaFace

3.1 Multi-task loss
For every training anchor i, we minimize the following multi-task loss:

$\large L=L_{cls}(p_{i}, p_{i}^{*}) + \lambda_{1}p_{i}^{*}L_{box}(t_{i}, t_{i}^{*}) +\lambda_{2}p_{i}^{*}L_{pts}(l_{i}, l_{i}^{*}) +\lambda_{3}p_{i}^{*}L_{pixel}$ （1）

(1) Face classification loss $L_{cls}(p_{i}, p_{i}^{*})$, where $p_{i}$ is the predicted probability that anchor i is a face, and $\large p_{i}^{*}$ is 1 for a positive anchor and 0 for a negative anchor. The classification loss $L_{cls}$ is the softmax loss for binary classification (face / not face).

(2) Face box regression loss $\large L_{box}(t_{i}, t_{i}^{*})$, where $\large t_{i} = \left \{ t_{x}, t_{y}, t_{w}, t_{h}\right \}$ and $\large t_{i}^{*} = \left \{ t_{x}^{*} ,t_{y}^{*},t_{w}^{*}, t_{h}^{*}\right \}$ denote the coordinates of the predicted box and of the ground-truth box associated with the positive anchor. We follow [16] to normalize the box regression targets (i.e. center location, width and height) and use $\large L_{box}(t_{i}, t_{i}^{*}) = R(t_{i}-t_{i}^{*})$, where R is the smooth-L1 loss function defined in [16].

(3) Facial landmark regression loss $\large L_{pts}(l_{i}, l_{i}^{*})$, where $\large l_{i} = \left \{ l_{x1}, l_{y1}, ..., l_{x5}, l_{y5}\right \}{_{i}}$ and $\large l_{i}^{*} = \left \{ l_{x1}^{*} ,l_{y1}^{*}, ... , l_{x5}^{*}, l_{y5}^{*}\right \}{_{i}}$ denote the five predicted facial landmarks and the ground truth associated with the positive anchor. Similar to box center regression, the five-landmark regression also uses target normalization based on the anchor center.

(4) Dense regression loss $\large L_{pixel}$ (see Eq. 3).
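The target normalization in items (2) and (3) can be sketched as follows. The box encoding is the standard center/size parametrization used with the smooth-L1 loss in R-CNN-style detectors. The landmark encoding shown here (offset from the anchor center divided by the anchor size) is one plausible reading of "normalization based on the anchor center" and should be treated as an assumption, as should the helper names.

```python
import numpy as np

def smooth_l1(x):
    """Smooth-L1 penalty: quadratic for |x| < 1, linear elsewhere."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * ax ** 2, ax - 0.5)

def encode_box(gt, anchor):
    """Normalize a ground-truth box (cx, cy, w, h) against its anchor."""
    ax, ay, aw, ah = anchor
    gx, gy, gw, gh = gt
    return np.array([(gx - ax) / aw, (gy - ay) / ah,
                     np.log(gw / aw), np.log(gh / ah)])

def encode_landmarks(points, anchor):
    """Normalize five (x, y) landmarks by the anchor center and size (assumed form)."""
    ax, ay, aw, ah = anchor
    pts = np.asarray(points, dtype=float)
    return ((pts - [ax, ay]) / [aw, ah]).ravel()   # x1, y1, ..., x5, y5

# Toy example with made-up coordinates.
anchor = (100.0, 100.0, 50.0, 50.0)
t_star = encode_box((110.0, 95.0, 60.0, 40.0), anchor)
l_star = encode_landmarks([(95, 90), (120, 90), (108, 100), (98, 112), (118, 112)], anchor)

t_pred = np.zeros(4)                               # an untrained prediction
box_loss = smooth_l1(t_pred - t_star).sum()        # R(t_i - t_i*) summed over coordinates
print(t_star, l_star.shape, box_loss)
```

Because the targets are expressed relative to the anchor, the same regression head can serve anchors of every size and position.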

The loss-balancing parameters $\large \lambda _{1}$ to $\large \lambda _{3}$ are set to 0.25, 0.1 and 0.01, which means that we increase the significance of better box and landmark locations from the supervision signals.
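As a rough illustration of how Eq. (1) combines the four terms for a single anchor, the sketch below uses the weights 0.25, 0.1 and 0.01 given above. It assumes the landmark term reuses smooth-L1 (the text only specifies this for the box term) and takes the dense regression loss as a precomputed scalar, since its definition (Eq. 3) lies outside this section. All function names are hypothetical.

```python
import numpy as np

def smooth_l1(x):
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * ax ** 2, ax - 0.5)

def softmax_loss(logits, label):
    """Cross-entropy over two classes (index 0: not face, index 1: face)."""
    z = logits - logits.max()                  # shift for numerical stability
    log_prob = z - np.log(np.exp(z).sum())
    return -log_prob[label]

def multitask_loss(logits, p_star, t, t_star, l, l_star, pixel_loss,
                   lambdas=(0.25, 0.1, 0.01)):
    """Per-anchor loss of Eq. (1); p_star is 1 for a positive anchor, else 0."""
    lam1, lam2, lam3 = lambdas
    loss = softmax_loss(logits, p_star)
    # p_star gates the regression terms: negative anchors only contribute
    # to the classification loss.
    loss += lam1 * p_star * smooth_l1(t - t_star).sum()
    loss += lam2 * p_star * smooth_l1(l - l_star).sum()   # assumed smooth-L1
    loss += lam3 * p_star * pixel_loss
    return loss

# Toy positive anchor: uninformative logits, one box coordinate off by 0.5.
logits = np.array([0.0, 0.0])
t, t_star = np.array([0.5, 0.0, 0.0, 0.0]), np.zeros(4)
l, l_star = np.zeros(10), np.zeros(10)
pos = multitask_loss(logits, 1, t, t_star, l, l_star, pixel_loss=1.0)
neg = multitask_loss(logits, 0, t, t_star, l, l_star, pixel_loss=1.0)
print(pos, neg)
```

For the negative anchor only the classification term survives, which is exactly the role of the $p_{i}^{*}$ factor in Eq. (1).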