Interpretation of TPS motion (cvpr2022) video generation paper
2022-07-26 06:10:00 [Atlas]
Paper: "Thin-Plate Spline Motion Model for Image Animation"
GitHub: https://github.com/yoyo-nb/Thin-Plate-Spline-Motion-Model
Problem addressed

Problem:
Recent work performs arbitrary pose transfer with unsupervised methods, but current unsupervised schemes still struggle when there is a large difference between the source image and the driving image.
Method:
This paper proposes the unsupervised TPS Motion model:
1、Thin-plate spline (TPS) motion estimation, which produces more flexible optical flow for warping source-image features toward the driving image;
2、Multi-resolution occlusion masks, used for effective feature fusion when completing missing regions;
3、Additional auxiliary loss functions that enforce a clear division of labor among the network modules, enabling high-quality image generation.
Algorithm

The overall pipeline of TPS Motion is shown in Figure 2. It consists of four modules:
1、Keypoint detector $E_{kp}$: predicts $K \times N$ keypoints, which are used to build $K$ TPS transformations;
2、Background motion predictor $E_{bg}$: estimates the background transformation parameters;
3、Dense Motion Network: an hourglass network that combines the background transformation from $E_{bg}$ with the $K$ TPS transformations from $E_{kp}$ to estimate the optical flow and predict multi-resolution occlusion masks, which guide the completion of missing regions;
4、Inpainting Network: another hourglass network that warps the source-image features with the predicted optical flow and repairs the missing regions of the feature map at each scale.
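Assuming illustrative names and shapes (none of them taken from the paper's code), the dataflow through the four modules can be sketched as:

```python
import numpy as np

K, N, H, W = 10, 5, 64, 64   # illustrative: K TPS transforms, N keypoints each

def animate(source, driving):
    """Shapes-only sketch of one forward pass (all tensors are placeholders)."""
    kp_src = np.zeros((K, N, 2))       # E_kp(source):  K*N keypoints
    kp_drv = np.zeros((K, N, 2))       # E_kp(driving): K*N keypoints
    A_bg = np.eye(2, 3)                # E_bg: 2x3 background affine matrix
    # Dense Motion Network: K TPS flows + 1 background flow -> one flow field
    flow = np.zeros((H, W, 2))
    # ...plus occlusion masks at several resolutions
    masks = [np.ones((H >> i, W >> i)) for i in range(3)]
    # Inpainting Network: warp source features with `flow`, repair occluded
    # regions guided by `masks`, decode the output frame
    return np.zeros_like(source), flow, masks
```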
TPS motion estimation
1、TPS warps the source image to the driving image with minimum distortion, as in Eq. 1, where $P^X_i$ denotes the $i$-th keypoint on image $X$.
$E_{kp}$ predicts $K \times N$ keypoints and computes $K$ TPS transformations, each from $N$ keypoints ($N = 5$). A TPS transformation is computed as in Eq. 2, where $p$ is a coordinate, $A$ and $w$ are the coefficients solved from Eq. 1, and $U$ is the radial basis (offset) term.
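The fit-then-apply step of a single TPS transformation can be sketched with NumPy as follows. This is the textbook thin-plate spline with radial basis $U(r) = r^2 \log r^2$, not the paper's exact implementation; function names are illustrative:

```python
import numpy as np

def tps_radial(r2):
    # U(r) = r^2 log(r^2), with U(0) defined as 0
    return np.where(r2 == 0, 0.0, r2 * np.log(np.maximum(r2, 1e-12)))

def fit_tps(src_pts, dst_pts):
    """Solve for weights w (N, 2) and affine part A (3, 2) so that the
    TPS maps each src keypoint exactly onto its dst keypoint."""
    n = src_pts.shape[0]
    d2 = ((src_pts[:, None, :] - src_pts[None, :, :]) ** 2).sum(-1)
    K = tps_radial(d2)                          # (N, N) radial kernel
    P = np.hstack([np.ones((n, 1)), src_pts])   # (N, 3) affine basis
    L = np.zeros((n + 3, n + 3))
    L[:n, :n], L[:n, n:], L[n:, :n] = K, P, P.T
    rhs = np.zeros((n + 3, 2))
    rhs[:n] = dst_pts
    sol = np.linalg.solve(L, rhs)
    return sol[:n], sol[n:]                     # w, A

def apply_tps(p, src_pts, w, A):
    """Warp query coordinates p (M, 2) with the fitted TPS (Eq. 2 form)."""
    d2 = ((p[:, None, :] - src_pts[None, :, :]) ** 2).sum(-1)   # (M, N)
    return np.hstack([np.ones((len(p), 1)), p]) @ A + tps_radial(d2) @ w
```

By construction the control points are interpolated exactly, which mirrors the minimum-distortion fitting of Eq. 1.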
2、The background transformation is given by Eq. 4, where the affine matrix $A_{bg}$ is produced by the background motion predictor $E_{bg}$.
3、The Dense Motion Network takes the $K+1$ transformations and predicts contribution maps $\tilde M \in \mathbb{R}^{(K+1)\times H \times W}$, which a softmax turns into $M$, as in Eq. 5.
The $K+1$ transformations are then combined with these maps to compute the optical flow, as in Eq. 6.
Because only some TPS transformations take effect early in training, parts of the contribution maps are zero and training easily falls into a local optimum.
The authors therefore apply dropout to the contribution maps, changing Eq. 5 into Eq. 7, where $b_i$ follows a Bernoulli distribution and equals 1 with probability $1 - p$. This keeps the network from over-relying on a few TPS transformations; after a few epochs the dropout is removed.
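Equations 5-7 can be sketched as follows. Two details here are my assumptions, not statements from the text: where exactly the Bernoulli mask enters relative to the softmax, and that the background channel is never dropped:

```python
import numpy as np

def combine_flows(logits, flows, drop_p=0.0, rng=None):
    """logits: (K+1, H, W) raw contribution maps; flows: (K+1, H, W, 2).
    Returns the combined optical flow (H, W, 2)."""
    e = np.exp(logits - logits.max(axis=0, keepdims=True))  # stable softmax
    if drop_p > 0:                                 # Eq. 7, early training only
        rng = rng or np.random.default_rng()
        b = rng.binomial(1, 1 - drop_p, size=(len(logits), 1, 1))
        b[0] = 1                 # keep the background channel (assumption)
        e = e * b
    M = e / np.maximum(e.sum(axis=0, keepdims=True), 1e-8)   # Eq. 5
    return (M[..., None] * flows).sum(axis=0)                # Eq. 6
```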
4、Inpainting Network: the encoder extracts source-image features for warping, and the decoder reconstructs the driving image.
Multi-resolution occlusion masks
Prior work has shown that feature maps at different scales attend to different content: low-resolution features capture abstract shape, while high-resolution features capture fine texture. The authors therefore predict an occlusion mask at every layer.
Besides the optical flow, the Dense Motion Network predicts these multi-resolution occlusion masks by attaching an extra convolution layer to each encoder layer.
The Inpainting Network fuses multi-scale features to generate high-quality images, as shown in Figure 3:
1、The source image $S$ is fed into the encoder, and the optical flow $\tilde T$ warps the feature map at each layer;
2、The predicted occlusion masks are applied to the warped feature maps;
3、A skip connection concatenates the result with the output of the shallower decoder layer;
4、Two residual blocks and an upsampling layer then produce the final image.
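Assuming illustrative function names, and with a nearest-neighbour warp standing in for the bilinear sampling normally used for feature warping, steps 1-3 for one decoder layer can be sketched as:

```python
import numpy as np

def warp_nearest(feat, flow):
    """feat: (C, H, W); flow: (H, W, 2) absolute sampling coords (x, y).
    Nearest-neighbour stand-in for bilinear warping."""
    C, H, W = feat.shape
    x = np.clip(np.round(flow[..., 0]).astype(int), 0, W - 1)
    y = np.clip(np.round(flow[..., 1]).astype(int), 0, H - 1)
    return feat[:, y, x]

def fuse_layer(enc_feat, dec_feat, flow, occlusion_mask):
    """Warp the encoder features, mask occluded regions, then concat with
    the decoder features through the skip connection."""
    warped = warp_nearest(enc_feat, flow)
    masked = warped * occlusion_mask[None]       # (H, W) mask over channels
    return np.concatenate([masked, dec_feat], axis=0)
```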
Training loss functions
Reconstruction loss: computed on VGG-19 features, as in Eq. 9.
Equivariance loss: constrains the keypoint detection module, as in Eq. 10.
Background loss: constrains the background motion predictor so that its prediction is more accurate. $A_{bg}$ is the background affine matrix from $S$ to $D$, and $A'_{bg}$ is the one from $D$ to $S$. To prevent the predictor from outputting a zero matrix, the loss uses Eq. 12 rather than Eq. 11.

Warp loss: constrains the Inpainting Network so that the optical flow estimate is more reliable, as in Eq. 13, where $E_i$ denotes the $i$-th encoder layer of the network.
The overall loss is given in Eq. 14.
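In skeleton form, with placeholder weights (not the paper's values) and the VGG-19 features assumed to be extracted elsewhere, the loss combination reads:

```python
import numpy as np

def reconstruction_loss(feats_gen, feats_drive):
    """Eq. 9 style: mean |difference| of VGG-19 feature maps of generated
    vs. driving frames, summed over layers (and pyramid scales in the paper)."""
    return sum(np.abs(g - d).mean() for g, d in zip(feats_gen, feats_drive))

def total_loss(l_rec, l_eq, l_bg, l_warp, w=(1.0, 1.0, 1.0, 1.0)):
    """Eq. 14: weighted sum of the four terms (weights are placeholders)."""
    return w[0] * l_rec + w[1] * l_eq + w[2] * l_bg + w[3] * l_warp
```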
Testing phase
FOMM has two modes: standard and relative.
The former takes every frame $D_t$ of the driving video together with $S$ and estimates the motion by Eq. 6, but it performs poorly when $S$ and $D$ differ greatly (e.g., the subjects have very different body shapes).
The latter estimates the motion from $D_1$ to $D_t$ and applies it to $S$, which requires the pose of $D_1$ to be close to that of $S$.
MRAA proposes a new mode: animation via disentanglement, training an additional network to predict the motion applied to $S$; this paper adopts the same mode. A shape encoder and a pose encoder are trained: the shape encoder learns the shape of $S$ from its keypoints, the pose encoder learns the pose of $D_t$ from its keypoints, and a decoder reconstructs keypoints that keep the shape of $S$ and the pose of $D_t$. Training uses two frames of the same video, and the keypoints of one frame are randomly transformed to simulate the pose of a different identity.
For image animation, the keypoints of $S$ and $D_t$ are fed into the shape and pose encoders, the decoder produces the reconstructed keypoints, and the motion is estimated by Eq. 6.
Experiments
Evaluation metrics
L1: the pixel-wise L1 distance between the driving frame and the generated frame;
Average keypoint distance (AKD): the distance between the keypoints of the generated frame and those of the driving frame;
Missing keypoint rate (MKR): the fraction of keypoints that exist in the driving frame but are missing from the generated frame;
Average Euclidean distance (AED): features of the generated and driving frames are extracted with a re-ID model and compared with the L2 distance.
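The four metrics can be sketched as follows (the keypoints, detection flags, and re-ID features are assumed to come from external models):

```python
import numpy as np

def l1_metric(gen, drive):
    # Pixel-wise L1 distance between generated and driving frames.
    return np.abs(gen.astype(float) - drive.astype(float)).mean()

def akd(kp_gen, kp_drive):
    # Average keypoint distance over (N, 2) matched keypoint arrays.
    return np.linalg.norm(kp_gen - kp_drive, axis=-1).mean()

def mkr(in_drive, in_gen):
    # Missing keypoint rate: keypoints detected in the driving frame
    # but missed in the generated frame (boolean arrays).
    return (in_drive & ~in_gen).sum() / max(in_drive.sum(), 1)

def aed(feat_gen, feat_drive):
    # Euclidean distance between re-ID features of the two frames.
    return np.linalg.norm(feat_gen - feat_drive)
```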
Video reconstruction results are shown in Table 1.
Figure 6 shows image animation results compared with MRAA on four datasets.
Table 2 reports real users' ratings of continuity and realism.
Table 4 shows the ablation results.
Table 3 shows the effect of different K: FOMM and MRAA use K = 5, 10, 20, while this paper uses K = 2, 4, 8.
Conclusion
The unsupervised image animation method proposed by the authors:
1、estimates optical flow via TPS transformations and applies dropout at the start of training to avoid local optima;
2、uses multi-resolution occlusion masks for more effective feature fusion;
3、designs additional auxiliary losses.
The method achieves SOTA results, but when the identities in the source and driving images are extremely mismatched, the results are still unsatisfactory.