
【Cascade FPD】《Deep Convolutional Network Cascade for Facial Point Detection》

2022-07-02 07:44:00 【bryant_ meng】


CVPR-2013

Table of contents

1 Background and Motivation
2 Related Work
3 Advantages / Contributions
4 Method
5 Experiments
6 Conclusion (own) / Future work

1 Background and Motivation

Facial keypoint detection benefits face recognition and face analysis.

The difficulty of facial keypoint detection lies in scenes with extreme poses, lighting, expressions, and occlusions.

Existing methods:

classifying search windows (component detectors): windows are scanned over the image and classified using local features
directly predicting keypoint positions (or shape parameters)

The authors design a cascaded CNN structure: a cascaded regression approach for facial point detection with three levels of convolutional networks. It significantly improves prediction accuracy over state-of-the-art (SOTA) methods and the latest commercial software.

2 Related Work

Many methods used AdaBoost, SVM, or random forest classifiers as component detectors, with detection based on local image features.
Regression-based approaches
Convolutional networks

3 Advantages / Contributions

Proposes a cascaded CNN structure for accurately locating facial keypoints; on some datasets it outperforms SOTA methods and commercial software.
Uses locally shared weights so that training is more targeted to the different keypoints of the face.

4 Method

The cascade network structure is as follows:
[Figure: cascade network structure]
Three levels of convolutional networks are cascaded to make coarse-to-fine predictions.

The five keypoints:

left eye center (LE)
right eye center (RE)
nose tip (N)
left mouth corner (LM)
right mouth corner (RM)

1) Level 1

The input is the whole face region. The three networks predict:

whole face (F): all five keypoints on the face
eyes and nose (EN)
nose and mouth (NM)

The results of the three networks are averaged and used as part of the input to the following levels.
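The averaging step can be sketched as below. This is a minimal illustration, not the authors' code: it assumes (from the network names) that EN outputs LE, RE, and N, that NM outputs N, LM, and RM, and that each keypoint is averaged over the networks that predict it. The function name `average_level1` and the coordinate values are hypothetical.

```python
def average_level1(predictions):
    """Average (x, y) per keypoint over every network that predicts it.

    predictions: {network_name: {keypoint_name: (x, y)}}
    """
    sums = {}
    counts = {}
    for points in predictions.values():
        for name, (x, y) in points.items():
            sx, sy = sums.get(name, (0.0, 0.0))
            sums[name] = (sx + x, sy + y)
            counts[name] = counts.get(name, 0) + 1
    return {name: (sx / counts[name], sy / counts[name])
            for name, (sx, sy) in sums.items()}


# Made-up level-1 outputs; which points EN and NM cover is an assumption.
preds = {
    "F": {"LE": (30.0, 40.0), "RE": (70.0, 40.0), "N": (50.0, 60.0),
          "LM": (35.0, 80.0), "RM": (65.0, 80.0)},
    "EN": {"LE": (32.0, 42.0), "RE": (72.0, 38.0), "N": (53.0, 57.0)},
    "NM": {"N": (47.0, 63.0), "LM": (37.0, 78.0), "RM": (63.0, 82.0)},
}
avg = average_level1(preds)
```

Under these assumptions the nose tip is averaged over three networks and every other point over two.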

2) Level 2 and level 3

The input is a patch centered on the keypoint coordinates predicted by the previous level.

Level 2 and level 3 each have 10 networks, two per keypoint, which predict the horizontal and vertical coordinates of the 5 keypoints.

Predictions at the last two levels are strictly restricted because local appearance is sometimes ambiguous and unreliable.

3) Final prediction

[Figure: final prediction formula]
The later levels refine the level-1 prediction by adding corrections ( $\Delta$ ).
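A minimal sketch of this combination for a single coordinate follows. It assumes the final value is the average of the level-1 predictions plus, for each later level, the average of that level's corrections; the original formula image is not available here, so treat this as one plausible reading. The function names and numbers are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def final_coordinate(level1_preds, later_level_deltas):
    """Combine one coordinate of one keypoint across the cascade.

    level1_preds: absolute predictions from the level-1 networks.
    later_level_deltas: one list of corrections per later level.
    """
    x = mean(level1_preds)            # coarse estimate from level 1
    for deltas in later_level_deltas:
        x += mean(deltas)             # each later level adds an averaged correction
    return x


# Three level-1 predictions, then two corrections each at level 2 and level 3.
x_final = final_coordinate([50.0, 53.0, 47.0], [[1.0, 2.0], [-0.5, 0.5]])
```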

4) Network structure details

Level 1 has three networks, and level 2 and level 3 each have 10 networks. What do they look like?

First, look at F1 in level 1:

[Figure: structure of network F1]
Now the other structures:

Level 1 uses S0 and S1; level 2 and level 3 all use S2.

5) Locally sharing weights

Globally sharing weights does not work well on images with a fixed spatial layout, such as faces.

For example, while eyes and mouth may share low-level features (e.g. edges), they are very different at high-level.

Let's first look at the formula for convolution:
[Figure: convolution formula]

This layer is abbreviated as $C(s, n, p, q)$.

$CR(s, n, p, q)$ means that an absolute value is applied after the tanh.

Apart from the extra indices $u$ and $v$ on $w$ and $b$, it is the same as a normal convolution (one without locally shared weights).

The input feature map has size $(h, w, m)$:

$m$ is the number of input channels
$n$ is the number of output channels, and $t$ is the output channel index, $t = 0, \ldots, n - 1$
$s$ is the kernel size
$i, j$ are the spatial location indices (they range over one locally shared region defined by the authors rather than over the whole map; the division rule is given by the ranges below)
$i = \Delta h \cdot u + 0, \ldots, \Delta h \cdot u + \Delta h - 1$, where $\Delta h = \frac{h-s+1}{p}$ and $u = 0, \ldots, p - 1$
$j = \Delta w \cdot v + 0, \ldots, \Delta w \cdot v + \Delta w - 1$, where $\Delta w = \frac{w-s+1}{q}$ and $v = 0, \ldots, q - 1$

The whole map $(h, w)$ is divided roughly into $p \times q$ regions (indexed by $u$ and $v$), each of size approximately $\Delta h \times \Delta w$. Weights are shared within each region rather than across the whole map. (A normal convolution shares weights across the whole map; within a single kernel window, of course, the weights are not shared.)
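The region indexing can be made concrete with a small pure-Python sketch. It assumes the layer computes tanh of the weighted sum over input channels and the $s \times s$ window plus a bias, with the kernel and bias picked by the region $(u, v)$ that the output position falls in, and it assumes $h - s + 1$ and $w - s + 1$ divide evenly by $p$ and $q$. The function name `local_conv` and the array layouts are hypothetical.

```python
import math


def local_conv(x, weights, biases, s, p, q):
    """Convolution with weights shared only inside each of p x q regions.

    x[r][row][col]: input with m channels of size h x w.
    weights[u][v][t][r][k][l]: kernel for region (u, v), output map t, input channel r.
    biases[u][v][t]: bias for region (u, v), output map t.
    Returns y[t][i][j] with n maps of size (h-s+1) x (w-s+1).
    """
    m, h, w = len(x), len(x[0]), len(x[0][0])
    out_h, out_w = h - s + 1, w - s + 1
    assert out_h % p == 0 and out_w % q == 0, "sketch assumes an even split"
    dh, dw = out_h // p, out_w // q          # region size (delta h, delta w)
    n = len(weights[0][0])
    y = [[[0.0] * out_w for _ in range(out_h)] for _ in range(n)]
    for t in range(n):
        for i in range(out_h):
            u = i // dh                      # region row containing output row i
            for j in range(out_w):
                v = j // dw                  # region column containing output column j
                acc = biases[u][v][t]
                for r in range(m):
                    for k in range(s):
                        for l in range(s):
                            acc += x[r][i + k][j + l] * weights[u][v][t][r][k][l]
                y[t][i][j] = math.tanh(acc)
    return y


# One 3x4 input channel of ones, 2x2 kernels, p=2 regions stacked vertically.
x = [[[1.0] * 4 for _ in range(3)]]
ones = [[1.0, 1.0], [1.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
weights = [[[[ones]]], [[[zeros]]]]          # region u=0 uses ones, u=1 uses zeros
biases = [[[0.0]], [[0.0]]]
y = local_conv(x, weights, biases, s=2, p=2, q=1)
```

With $p = q = 1$ the same function reduces to an ordinary convolution, since every output position then uses the single kernel of region $(0, 0)$.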

Let's look at the formula for the pooling layer:
[Figure: pooling layer formula]
The pooling result is multiplied by a gain coefficient $g$ and shifted by a bias $b$; $s$ is the side length of the square pooling regions.
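As a rough sketch of such a layer, the code below assumes max pooling over non-overlapping $s \times s$ regions followed by tanh, neither of which is stated in the text above, and it uses a single scalar gain and bias for the whole map as a simplification. The function name `pool_layer` is hypothetical.

```python
import math


def pool_layer(x, s, g, b):
    """Pool one feature map over non-overlapping s x s regions.

    Each pooled value is scaled by gain g, shifted by bias b, then passed
    through tanh. Max pooling and scalar g, b are assumptions of this sketch.
    """
    h, w = len(x), len(x[0])
    out = []
    for i in range(h // s):
        row = []
        for j in range(w // s):
            peak = max(x[i * s + k][j * s + l]
                       for k in range(s) for l in range(s))
            row.append(math.tanh(g * peak + b))
        out.append(row)
    return out


pooled = pool_layer([[1.0, 2.0, 0.0, 0.0],
                     [3.0, 4.0, 0.0, -1.0]], s=2, g=0.5, b=0.0)
```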

FC layer
[Figure: fully connected layer formula]

$n$ is the dimension of the output vector, and $m$ is the dimension of the input vector
$j = 0, \ldots, n - 1$
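A fully connected layer with these dimensions can be sketched as follows, assuming the usual form of a weighted sum plus bias followed by tanh (the original formula image is not available here). The name `fc_layer` and the weight layout `w[i][j]` are hypothetical.

```python
import math


def fc_layer(x, w, b):
    """x: input of length m; w[i][j]: m x n weights; b: biases of length n.

    Returns the n outputs y_j = tanh(sum_i x_i * w[i][j] + b_j).
    """
    m, n = len(x), len(b)
    return [math.tanh(sum(x[i] * w[i][j] for i in range(m)) + b[j])
            for j in range(n)]


out = fc_layer([1.0, 2.0], [[1.0, 0.0], [0.0, 0.5]], [0.0, -1.0])
```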

tanh function
[Figure: tanh function]

6) Input sizes
[Figure: input sizes of each network]

You can see that the input region of F1 is also expanded outward from the face.

Level 2 and level 3 take patches expanded outward around the point positions predicted by the preceding level.

5 Experiments

5.1 Datasets

13,466 face images: 5,590 from LFW and 7,876 from the web
BioID has 1,521 images of 23 subjects
LFPW contains 1,432 face images from the web

Evaluation metric
[Figure: detection error formula]

$(x, y)$ is the predicted keypoint position
$(x', y')$ is the ground truth (GT)
$l$ is the width of the bounding box returned by the authors' face detector

A prediction whose error is greater than 5% is counted as a failure.

Using the bi-ocular distance as $l$ is more common, but it has problems on faces with large pose variations: the bi-ocular distance of near-profile faces is much shorter than that of frontal faces, so it magnifies the error on profile faces. The bounding-box width used above is a better normalizer in that case.
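The metric can be sketched as below, assuming the error is the Euclidean distance between the predicted and ground-truth positions divided by $l$, which is what the definitions above suggest. The function names and sample values are hypothetical.

```python
import math


def detection_error(pred, gt, box_width):
    """Euclidean distance between prediction and ground truth, divided by l."""
    return math.hypot(pred[0] - gt[0], pred[1] - gt[1]) / box_width


def failure_rate(preds, gts, box_widths, threshold=0.05):
    """Fraction of predictions whose normalized error exceeds the threshold."""
    errors = [detection_error(p, g, l)
              for p, g, l in zip(preds, gts, box_widths)]
    return sum(1 for e in errors if e > threshold) / len(errors)


err = detection_error((53.0, 44.0), (50.0, 40.0), 100.0)   # distance 5 on a 100-wide box
rate = failure_rate([(53.0, 44.0), (56.0, 48.0)],
                    [(50.0, 40.0), (50.0, 40.0)],
                    [100.0, 100.0])
```

Here the first point sits exactly at the 5% threshold and is not counted as a failure, while the second (error 10%) is.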