Interpretation of the DaGAN Paper
2022-07-06 19:30:00 【‘Atlas’】
Paper: "Depth-Aware Generative Adversarial Network for Talking Head Video Generation"
GitHub: https://github.com/harlanhong/CVPR2022-DaGAN
Problem Addressed
Existing problem:
Existing talking-head video generation methods mainly rely on 2D representations, while 3D facial information is actually critical to this task; however, annotating such 3D information is costly.
Solution:
The authors propose a self-supervised scheme that automatically recovers dense 3D geometric information from face videos without any annotated data. Based on this geometry, sparse facial keypoints are further estimated to capture the important movements of the head. The depth information is also used to learn a cross-modal (appearance and depth) attention mechanism, which guides the generation of the motion field used to warp the source image.
The proposed DaGAN generates highly realistic faces and also achieves good results on faces never seen during training.
The main contributions of this paper are threefold:
1. A self-supervised method is introduced to recover depth maps from videos, and the recovered depth is used to improve generation quality;
2. A novel depth-aware GAN is proposed, with depth-guided facial keypoint estimation and a cross-modal (depth and image) attention mechanism that inject depth information into the generation network;
3. Extensive experiments demonstrate accurate depth recovery for face images, while the generation quality surpasses SOTA methods.
Algorithm
The DaGAN framework is shown in Figure 2 and consists of a generator and a discriminator.
The generator consists of three parts:
1. A self-supervised depth learning sub-network $F_d$, which learns depth estimation from pairs of consecutive video frames in a self-supervised manner; $F_d$ is then frozen while the rest of the network is trained;
2. A depth-guided sparse keypoint detection sub-network $F_{kp}$;
3. A feature warping module, which uses the keypoints to generate motion fields and combines appearance with motion information by warping the source-image features, producing the warped features $F_w$; to make the model attend to fine details and facial micro-expressions, a depth-aware attention map is learned to refine $F_w$ into $F_g$, which is used to generate the output image $I_g$. A high-level sketch of this data flow is given below.
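To make the data flow concrete, here is a minimal PyTorch-style sketch of the generator's forward pass. All sub-modules ($F_d$, $F_{kp}$, the warping module, the attention module, the decoder) are treated as opaque callables; this is only an illustration of the pipeline described above, not the authors' implementation.

```python
import torch

def dagan_generator_forward(I_s, I_d, F_d, F_kp, warp_module, cross_modal_attn, decoder):
    """High-level DaGAN generator data flow (sketch; sub-modules are assumed callables)."""
    # 1. Frozen self-supervised depth network: depth maps for source and driving frames.
    D_s, D_d = F_d(I_s), F_d(I_d)
    # 2. Depth-guided sparse keypoints, predicted from RGB concatenated with depth.
    kp_s = F_kp(torch.cat([I_s, D_s], dim=1))
    kp_d = F_kp(torch.cat([I_d, D_d], dim=1))
    # 3. Warp source features with the keypoint-driven motion field -> F_w,
    #    refine with depth-guided cross-modal attention -> F_g, decode to I_g.
    F_w = warp_module(I_s, kp_s, kp_d)
    F_g = cross_modal_attn(F_w, D_s)
    I_g = decoder(F_g)
    return I_g
```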
Self-Supervised Face Depth Learning
The authors build on SfM-Learner with some modifications. Using the consecutive frame $I_{i+1}$ as the source image and $I_i$ as the target image, they learn a set of geometric elements: the depth map $D_{I_i}$, the camera intrinsic matrix $K_{I_i \rightarrow I_{i+1}}$, the relative camera rotation $R_{I_i \rightarrow I_{i+1}}$ and translation $t_{I_i \rightarrow I_{i+1}}$. Unlike SfM-Learner, the camera intrinsics $K$ here also need to be learned.
The pipeline is shown in Figure 3:
1. $F_d$ extracts the depth map $D_{I_i}$ of the target image $I_i$;
2. $F_p$ predicts the learnable parameters $R$, $t$, $K$;
3. Following Eqs. 3 and 4, the source image $I_{i+1}$ is warped by this geometric transformation to obtain the reconstruction $I'_i$;
where $q_k$ denotes a warped pixel on the source image $I_{i+1}$ and $p_j$ denotes a pixel on the target image $I_i$.
The photometric loss $P_e$ is given in Eq. 5 and combines an L1 loss with an SSIM loss; a sketch of such an objective follows.
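As an illustration, here is a minimal PyTorch sketch of an L1 + SSIM photometric objective in the SfM-Learner / monodepth style; the 3×3 pooling window and the mixing weight `alpha` are common defaults, not necessarily the exact settings of Eq. 5.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM over 3x3 local windows (average pooling)."""
    mu_x = F.avg_pool2d(x, 3, 1, padding=1)
    mu_y = F.avg_pool2d(y, 3, 1, padding=1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, padding=1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, padding=1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, padding=1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return (num / den).clamp(0, 1)

def photometric_loss(I_target, I_reconstructed, alpha=0.85):
    """Pe-style loss between the target frame I_i and the frame I'_i
    reconstructed from I_{i+1} via depth, pose and intrinsics."""
    l1 = (I_target - I_reconstructed).abs().mean(dim=1, keepdim=True)
    ssim_term = (1 - ssim(I_target, I_reconstructed)).mean(dim=1, keepdim=True) / 2
    return (alpha * ssim_term + (1 - alpha) * l1).mean()
```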

Sparse Keypoint Motion Modeling
1. The RGB image is concatenated with the depth map extracted by $F_d$;
2. The keypoint estimation module $F_{kp}$ then produces the sparse facial keypoints, as in Eq. 6; thanks to the added depth map, the predicted keypoints are more accurate. A minimal sketch of this step is given below.
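The sketch below illustrates one plausible form of a depth-guided keypoint detector: RGB and depth are concatenated into a 4-channel input, a small convolutional backbone (a placeholder, not the paper's architecture) predicts one heatmap per keypoint, and a soft-argmax turns each heatmap into 2D coordinates. The number of keypoints is arbitrary here.

```python
import torch
import torch.nn as nn

class DepthGuidedKeypointDetector(nn.Module):
    """Sketch of F_kp: RGB + depth in, K sparse keypoints out via heatmaps."""
    def __init__(self, num_kp=15):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(4, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, num_kp, 3, padding=1),  # one heatmap per keypoint
        )

    def forward(self, rgb, depth):
        x = torch.cat([rgb, depth], dim=1)            # (B, 4, H, W)
        heatmaps = self.backbone(x)                   # (B, K, H, W)
        b, k, h, w = heatmaps.shape
        probs = heatmaps.view(b, k, -1).softmax(dim=-1).view(b, k, h, w)
        # Expected 2D coordinates in [-1, 1] (soft-argmax over each heatmap).
        ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1)
        xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w)
        kp_x = (probs * xs).sum(dim=(2, 3))
        kp_y = (probs * ys).sum(dim=(2, 3))
        return torch.stack([kp_x, kp_y], dim=-1)      # (B, K, 2)
```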
Feature warping strategy, shown in Figure 4:
1. As in Eq. 7, the initial offsets $O_n$ between the source image and the driving image are computed;
2. A 2D coordinate map $z$ is generated;
3. The offsets are applied to $z$ to obtain the motion fields $w_m$;
4. $w_m$ is used to warp the downsampled source image, giving the initial warped feature maps;
5. An occlusion estimator $\tau$ predicts a motion-flow mask $M_m$ and an occlusion map $M_o$ from the warped feature maps;
6. $M_m$ is used to warp the appearance feature maps obtained by passing the source image $I_s$ through the encoder $\epsilon_I$, and the result is fused with $M_o$ to produce $F_w$, as in Eq. 8. $F_w$ preserves the source-image appearance while also capturing the motion between the two faces. A small sketch of this warping step follows.
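A minimal sketch of the final warping step, in the spirit of Eq. 8: the appearance features from $\epsilon_I$ are sampled along a dense motion field with `grid_sample` and modulated by the occlusion map. The tensor shapes and the assumption that the motion field stores absolute sampling coordinates in [-1, 1] are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def warp_features(appearance_feat, motion_field, occlusion_map):
    """Warp appearance features along a motion field, then apply occlusion.
    motion_field: (B, H, W, 2), absolute sampling coords in [-1, 1]
    occlusion_map: (B, 1, H, W), values in [0, 1]."""
    warped = F.grid_sample(appearance_feat, motion_field, align_corners=True)
    return warped * occlusion_map

# Usage with dummy tensors:
feat = torch.randn(1, 256, 64, 64)            # appearance features from the encoder
flow = torch.rand(1, 64, 64, 2) * 2 - 1       # sampling grid in [-1, 1]
occ = torch.rand(1, 1, 64, 64)                # occlusion map
F_w = warp_features(feat, flow, occ)          # (1, 256, 64, 64)
```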

Cross-Modal Attention Mechanism
To effectively exploit the learned depth maps for better generation, the authors propose a cross-modal attention mechanism, shown in Figure 5:
1. A depth encoder $\epsilon_d$ extracts the feature map $F_d$ of the depth map $D_{sz}$;
2. Three separate 1×1 convolution layers map $F_d$ and $F_w$ to three latent feature maps $F_q$, $F_k$, $F_v$;
3. As in Eq. 9, attention over these maps produces $F_g$;
4. The decoder refines $F_g$ to generate the final image $I_g$.
A minimal sketch of this attention block is given below.
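The sketch below shows one way such a cross-modal attention block can be written: 1×1 convolutions produce query, key and value maps, and standard spatial attention yields the refined features. Which modality feeds the query versus the key/value is an assumption here (query from depth, key/value from the warped appearance features); the paper's Eq. 9 defines the exact assignment.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Sketch of depth/appearance cross-modal attention (Q/K/V assignment assumed)."""
    def __init__(self, channels=256):
        super().__init__()
        self.to_q = nn.Conv2d(channels, channels, 1)  # from depth features (assumption)
        self.to_k = nn.Conv2d(channels, channels, 1)  # from warped appearance features
        self.to_v = nn.Conv2d(channels, channels, 1)

    def forward(self, depth_feat, warped_feat):
        b, c, h, w = warped_feat.shape
        q = self.to_q(depth_feat).flatten(2).transpose(1, 2)   # (B, HW, C)
        k = self.to_k(warped_feat).flatten(2)                  # (B, C, HW)
        v = self.to_v(warped_feat).flatten(2).transpose(1, 2)  # (B, HW, C)
        attn = torch.softmax(q @ k / c ** 0.5, dim=-1)         # (B, HW, HW)
        F_g = (attn @ v).transpose(1, 2).view(b, c, h, w)      # refined features
        return F_g
```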
Training
During training, the source image and the driving image come from the same video; the overall loss function is given in Eq. 10, where:
$L_P$ is the perceptual loss;
$L_G$ is the GAN loss, implemented as a least-squares loss;
$L_E$ is the equivariance loss, which ensures that when the source image is transformed, the keypoints transform accordingly;
$L_D$ is the keypoint distance loss, which prevents the facial keypoints from clustering together.
A sketch of how these terms combine is given below.
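A trivial sketch of an Eq. 10-style objective that combines the four terms as a weighted sum; the weights are placeholders, not the values used in the paper.

```python
def dagan_total_loss(L_P, L_G, L_E, L_D,
                     w_p=10.0, w_g=1.0, w_e=10.0, w_d=10.0):
    """Weighted sum of perceptual, least-squares GAN, equivariance and
    keypoint-distance losses (weights are illustrative placeholders)."""
    return w_p * L_P + w_g * L_G + w_e * L_E + w_d * L_D
```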
Experiments
Comparison with SOTA Methods
On the VoxCeleb1 dataset, comparison results against SOTA methods are shown in Tables 1 and 2.
Cross-identity reenactment results on VoxCeleb1 are shown in Figure 6.
On the CelebV dataset, the comparison with SOTA methods is shown in Table 3, and cross-identity reenactment results are shown in Figure 7.
Ablation Study
FDN: facial depth network;
CAM: cross-modal attention mechanism.
Quantitative results are shown in Table 4.
Qualitative generation results are shown in Figure 8.

DaGAN Demo Video
Conclusion
DaGAN learns facial depth maps in a self-supervised manner. On the one hand, the depth is used for more accurate facial keypoint estimation; on the other hand, a cross-modal (depth map and RGB) attention mechanism is designed to capture micro-expression changes. As a result, DaGAN produces more realistic and natural results.