Depth estimation self-supervised model monodepth2: paper summary and source code analysis [theory part]
2022-07-25 13:59:00 [Apple sister]
This post covers the theory and source code of monodepth2. For the hands-on part, please refer to my other blog post, "Depth estimation self-supervised model monodepth2 in practice on your own dataset — single-card / multi-card training, inference, ONNX conversion, and quantitative metric evaluation".
Part 1: Paper understanding
Monodepth2 is trained mainly on monocular video streams, and binocular stereo pairs can also be added to the training. It builds on the video-based unsupervised scheme of Unsupervised Learning of Depth and Ego-Motion from Video (CVPR 2017) and adds three improvements. The basic principle of image reconstruction is the pixel correspondence

p_source ~ K · T · D(p_target) · K⁻¹ · p_target

This formula is a conversion between two camera coordinate systems: the original pixel is first lifted into its own camera coordinate system with the inverse of the intrinsics K, then moved into the other camera's coordinate system by the rotation-translation matrix T, and finally projected into the other camera's image coordinate system with the intrinsics. Note that backward warping is used here, which guarantees a one-to-one correspondence between the pixels of the source image and the reconstructed target image; and since the depth Z is a scalar multiplication, it can change position in the formula. The depth network predicts the depth, i.e. D, taking frame 0 as input; the pose network predicts the pose transformations, i.e. T, taking the image pairs (-1, 0) and (0, 1) as input. Then, using D, T, and the known K, frame 0 (the target image) is reconstructed separately from frame -1 and frame 1, the loss between the original image and each reconstructed image is computed, and finally the loss is minimized pixel by pixel.
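The reconstruction principle described above can be sketched numerically. In this minimal NumPy example, the intrinsics, depth value, and translation are invented purely for illustration and are not from the paper:

```python
import numpy as np

# Pixel correspondence p_source ~ K @ T @ (D * K_inv @ p_target),
# in homogeneous coordinates; all numeric values are illustrative.

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])          # camera intrinsics
K_inv = np.linalg.inv(K)

# A target pixel (u, v) in homogeneous coordinates.
p_t = np.array([100.0, 150.0, 1.0])

D = 10.0                                 # predicted depth at that pixel
cam_point = D * (K_inv @ p_t)            # back-project into 3D camera coords

# 4x4 relative pose from the pose network: identity rotation plus a
# small x-translation, purely illustrative.
T = np.eye(4)
T[:3, 3] = [0.1, 0.0, 0.0]

cam_point_h = np.append(cam_point, 1.0)  # homogeneous 3D point
moved = (T @ cam_point_h)[:3]            # into the source camera frame

p_s = K @ moved                          # project with the intrinsics
p_s = p_s / p_s[2]                       # normalize: pixel in source image
print(p_s[:2])
```

With these values the target pixel (100, 150) maps to (105, 150) in the source image, i.e. a small horizontal shift, as expected from a pure x-translation.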
The loss function has two parts: a photometric reconstruction loss (a weighting of SSIM and L1) and an edge-aware smoothness loss.

Reconstruction loss Lp (pe is the photometric error, minimized over the source frames t'):

pe(I_a, I_b) = (α / 2) · (1 − SSIM(I_a, I_b)) + (1 − α) · ‖I_a − I_b‖₁,  with α = 0.85
Lp = min_{t'} pe(I_t, I_{t'→t})

Smoothness loss Ls (d* = d / mean(d) is the mean-normalized disparity):

Ls = |∂x d*| e^{−|∂x I|} + |∂y d*| e^{−|∂y I|}
The main improvements are:

- A minimum reprojection loss, which improves robustness to occluded scenes
- A full-resolution multi-scale sampling method, which reduces visual artifacts
- An auto-masking loss, used to ignore training pixels that violate the camera-motion assumption

The first point: other methods average the projection errors of the multiple input images, so when some pixels are occluded and no corresponding pixel can be found, the loss function imposes a large penalty, which leads to inaccurate edges. This paper instead takes the minimum reprojection loss over the multiple input images, which makes depth edges sharper and more accurate.
The second point: other methods compute the loss directly on the depth map output at each CNN layer, and the low-resolution depth maps can produce holes and texture-copy artifacts. This paper upsamples the depth map output at each intermediate layer with bilinear interpolation to the input resolution before computing the loss, which reduces visual artifacts.
The third point: other methods rely on a moving-object mask — some cannot be evaluated, and some use more complex optical-flow-based approaches. This paper computes the mask automatically, as a binary per-pixel mask that keeps only the pixels where the warped reprojection error is lower than the error of the un-warped source frame.
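The auto-masking idea can be sketched as follows. This mirrors the trick of concatenating identity and warped reprojection losses and taking a per-pixel minimum; the tensors here are random stand-ins, not real network outputs:

```python
import torch

# A pixel is kept only if warping a source frame explains it better than
# simply copying the un-warped source frame (the "identity" reprojection).
torch.manual_seed(0)

reprojection_loss = torch.rand(1, 2, 4, 4)   # pe(I_t, I_{s->t}) for 2 sources
identity_loss = torch.rand(1, 2, 4, 4)       # pe(I_t, I_s), no warping

# Tiny noise breaks ties between identical pixels.
identity_loss = identity_loss + 1e-5 * torch.randn_like(identity_loss)

combined = torch.cat([identity_loss, reprojection_loss], dim=1)
min_loss, idxs = torch.min(combined, dim=1)

# Pixels whose minimum came from an identity channel (index < 2) are the
# auto-masked ones: likely static scenes or objects moving with the camera.
automask = idxs >= identity_loss.shape[1]
print(automask.float().mean())
```

Only min_loss over the non-masked pixels contributes to training; the mask itself needs no extra network or supervision.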
Other notes:
1. Failure cases of this algorithm: regions that violate the Lambertian assumption, such as distorted, reflective, or highly saturated areas, as well as blurred edges and targets with complex shapes.
2. It mentions that the full eigen dataset contains some sequences with a static camera, and the method still performs well; evaluation on a depth-completed KITTI dataset is also good — how the completion is done can be looked into later.
3. Reflection padding is used instead of zero padding: in the decoder, points beyond the boundary are replaced by the nearest boundary pixels. The pose network uses an axis-angle representation, the predicted rotation and translation are multiplied by 0.01, and a 6-degree-of-freedom pose is predicted. The final scale recovery uses median scaling, which rescales the output and the ground truth to the same scale; the ground-truth scale of the whole test set is used.
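Median scaling at evaluation time can be sketched in a few lines; all values below are invented. Monocular predictions are scale-ambiguous, so each prediction is rescaled by the ratio of ground-truth median to predicted median before metrics are computed:

```python
import numpy as np

# Invented depth values: the prediction is the ground truth divided by 5,
# i.e. correct up to an unknown global scale.
pred_depth = np.array([1.0, 2.0, 4.0, 8.0])
gt_depth = np.array([5.0, 10.0, 20.0, 40.0])

ratio = np.median(gt_depth) / np.median(pred_depth)
pred_depth_scaled = pred_depth * ratio       # now directly comparable to gt
print(ratio, pred_depth_scaled)
```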
4. The experimental section runs ablations around the three improvement points.
Part 2: Code reading
1. Input part:
The data input stage randomly applies color augmentation and flipping; the network's input is the augmented version. If a shared encoder is chosen, all frames must be fed in; otherwise only frame 0 is used to obtain depth. The input data is prepared at four scales — initially five (the original resolution, the configured resolution, and the configured resolution divided by 2, 4, and 8), after which the original is dropped. The intrinsic matrix is likewise scaled four times (this is needed for the image-reconstruction computation). Only the configured resolution is fed into the encoder and depth_decoder, and the different input/output channel configurations yield disparity maps at the four resolutions. If depth_gt is present, it is also fed into the network as a supervision signal to speed up loss convergence.
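A hedged sketch of the multi-scale preparation described above; the resolution and intrinsics values here are illustrative, not the exact values from the code:

```python
import torch
import torch.nn.functional as F

# Build the 4 input scales and matching intrinsics: each scale halves the
# resolution, and the focal lengths and principal point shrink by the same
# factor so that reconstruction stays geometrically consistent per scale.
H, W = 192, 640
img = torch.rand(1, 3, H, W)
K = torch.tensor([[0.58 * W, 0.0, 0.5 * W],
                  [0.0, 1.92 * H, 0.5 * H],
                  [0.0, 0.0, 1.0]])

images, intrinsics = {}, {}
for s in range(4):                       # scales 0..3 -> /1, /2, /4, /8
    h, w = H // (2 ** s), W // (2 ** s)
    images[s] = F.interpolate(img, (h, w), mode="bilinear",
                              align_corners=False)
    K_s = K.clone()
    K_s[:2] = K_s[:2] / (2 ** s)         # scale fx, fy, cx, cy together
    intrinsics[s] = K_s
print(images[3].shape, intrinsics[3][0, 0])
```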
2. The depth network:
The depth network feeds the prepared image into the encoder, obtains features, and then feeds them into depth_decoder. The whole network is similar to a U-Net structure.
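The decoder's sigmoid output is converted into depth roughly as below; the min/max depth bounds (0.1, 100) mirror common defaults for this model but should be treated as assumptions:

```python
import torch

def disp_to_depth(disp, min_depth=0.1, max_depth=100.0):
    """Map a sigmoid output in [0, 1] to depth in [min_depth, max_depth]."""
    min_disp = 1.0 / max_depth
    max_disp = 1.0 / min_depth
    scaled_disp = min_disp + (max_disp - min_disp) * disp
    return scaled_disp, 1.0 / scaled_disp

# disp = 0 gives the farthest depth, disp = 1 the nearest.
disp = torch.tensor([0.0, 0.5, 1.0])
scaled, depth = disp_to_depth(disp)
print(depth)
```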
3. The pose network:
The pose network has three options: a shared encoder, a separate ResNet encoder (the default), or a separate pose network. In shared mode, the encoder features of the corresponding frame_ids are fed directly into the decoder; with a separate encoder, each pair of original-size frames in the input data is concatenated and fed into the encoder. The pose decoder outputs two quantities, R and T, and cam_T_cam is the matrix combining R and T.
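Combining the predicted axis-angle rotation and translation into a single 4×4 transform can be sketched with Rodrigues' formula; the 0.01 scaling follows the note above, and the pose values themselves are invented:

```python
import torch

def axisangle_to_rotation(axisangle):
    """Rodrigues' formula: axis-angle vector (3,) -> rotation matrix (3, 3)."""
    angle = torch.norm(axisangle)
    if angle < 1e-7:
        return torch.eye(3)
    x, y, z = (axisangle / angle).tolist()
    K = torch.tensor([[0.0, -z, y],
                      [z, 0.0, -x],
                      [-y, x, 0.0]])         # skew-symmetric cross matrix
    return torch.eye(3) + torch.sin(angle) * K + (1 - torch.cos(angle)) * (K @ K)

axisangle = 0.01 * torch.tensor([0.0, 0.0, 1.0])    # small yaw, illustrative
translation = 0.01 * torch.tensor([1.0, 0.0, 0.0])

T = torch.eye(4)                     # the combined cam_T_cam-style matrix
T[:3, :3] = axisangle_to_rotation(axisangle)
T[:3, 3] = translation
print(T)
```

The 0.01 factor keeps the predicted rotations and translations small, which matches the small inter-frame motion in video.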
4. Computing the reconstructed image:
The generate_images_pred function uses the depth network's output disp (from which depth is obtained) and the pose network's output R, T to reconstruct images. First each disp map is upsampled to the original resolution with bilinear interpolation; then each depth map is turned into a point cloud, with the meshgrid function building the pixel grid. The point cloud is transformed by R, T into the other camera's coordinate system, and the intrinsics then convert it into the other image's coordinates. This yields sample, whose last two dimensions hold the coordinate correspondence from frame 0 to frame 1 or -1. F.grid_sample then picks values from frame 1 or -1 at the sample coordinates (interpolating at non-integer coordinates), reconstructing the frame-0 image, which can be compared with the original frame 0 to compute the loss. This is the backward-warp operation, which guarantees one-to-one coordinate correspondence (though pixels may repeat); a forward warp would reconstruct frame 1 or -1 and cannot guarantee such a correspondence.
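The core F.grid_sample step can be demonstrated in isolation. Here an identity sampling grid is used so the "reconstruction" simply reproduces the source image, which makes the coordinate-normalization convention easy to verify; shapes are arbitrary:

```python
import torch
import torch.nn.functional as F

# grid_sample pulls a value from the source image for every target pixel,
# so target pixels are covered one-to-one (possibly with repeats).
# Coordinates must be normalized to [-1, 1].
torch.manual_seed(0)
B, C, H, W = 1, 3, 4, 5
source = torch.rand(B, C, H, W)

# Identity grid: each target pixel reads the same source pixel.
ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
grid = torch.stack([xs, ys], dim=-1).float()
grid[..., 0] = 2 * grid[..., 0] / (W - 1) - 1      # x to [-1, 1]
grid[..., 1] = 2 * grid[..., 1] / (H - 1) - 1      # y to [-1, 1]
grid = grid.unsqueeze(0)                            # (B, H, W, 2)

reconstructed = F.grid_sample(source, grid, mode="bilinear",
                              padding_mode="border", align_corners=True)
print(torch.allclose(reconstructed, source, atol=1e-6))
```

In the real pipeline the grid is not the identity: it comes from projecting the depth-lifted point cloud through R, T, and K as described above.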
5. Computing the loss:
The compute_losses function computes the loss, including the reconstruction loss and the smoothness loss. For each layer's depth map (upsampled to the original resolution), it computes the reconstruction loss reprojection_losses for each pair of adjacent input frames, using a weighted mean of SSIM and L1; this operation corresponds to the second main contribution. It then computes identity_reprojection_losses, i.e. the similarity between the un-warped adjacent frames, concatenates everything along the channel dimension, and takes the per-channel minimum. This realizes the first and third main contributions at once: including in the loss the minimum rather than the mean of the per-pixel losses across images handles occluded scenes better and sharpens edges, and it simultaneously implements the auto-mask — regions where the adjacent frames barely change (e.g. objects moving at a similar speed to the camera) are ignored, while regions with changes are preserved. The code also provides other mask variants for comparison; if a mask is supplied as input, the loss is multiplied by the mask matrix. The smoothness loss uses the other-resolution input images from inputs, computed jointly with the corresponding depth maps. Finally the total loss is computed.
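The edge-aware smoothness term mentioned here can be sketched like this, using mean-normalized disparity; the tensors are random stand-ins for a real disparity map and image:

```python
import torch

# Disparity gradients are penalized, but less so where the image itself
# has strong gradients (i.e. at real edges).
torch.manual_seed(0)
disp = torch.rand(1, 1, 6, 6)
img = torch.rand(1, 3, 6, 6)

# Mean-normalize the disparity so the loss is scale-independent.
norm_disp = disp / (disp.mean(2, True).mean(3, True) + 1e-7)

grad_disp_x = (norm_disp[:, :, :, :-1] - norm_disp[:, :, :, 1:]).abs()
grad_disp_y = (norm_disp[:, :, :-1, :] - norm_disp[:, :, 1:, :]).abs()
grad_img_x = (img[:, :, :, :-1] - img[:, :, :, 1:]).abs().mean(1, True)
grad_img_y = (img[:, :, :-1, :] - img[:, :, 1:, :]).abs().mean(1, True)

# Down-weight disparity gradients at image edges via exp(-|image gradient|).
smooth_loss = (grad_disp_x * torch.exp(-grad_img_x)).mean() + \
              (grad_disp_y * torch.exp(-grad_img_y)).mean()
print(smooth_loss)
```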
6. Since this was my first time reading PyTorch code, here are some small takeaways:
a. The difference between nn.Sequential and nn.ModuleList — see the detailed explanation "ModuleList and Sequential in PyTorch" on Zhihu (zhihu.com).
b. All parameter categories are written in options.py and read with self.parser.add_argument — very clear. It also includes the parameters the author used in the ablation experiments; many are configurable.
c. The network uses the initialization function nn.init.kaiming_normal_, designed specifically for ReLU (see https://blog.csdn.net/dss_dssssd/article/details/83959474). There is also the nn.ELU activation function, which can output negative numbers, and the SSIM network makes use of reflection padding: nn.ReflectionPad2d.
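A small demo of the three items in this bullet; shapes and values are arbitrary:

```python
import torch
import torch.nn as nn

# Kaiming initialization tailored to ReLU-family activations.
conv = nn.Conv2d(3, 16, 3)
nn.init.kaiming_normal_(conv.weight, mode="fan_out", nonlinearity="relu")

# ELU can output negative values, bounded below by -1.
elu = nn.ELU()
y = elu(torch.tensor([-2.0, 0.0, 2.0]))
print(y)

# Reflection padding mirrors inner pixels instead of padding with zeros.
pad = nn.ReflectionPad2d(1)
x = torch.arange(9.0).view(1, 1, 3, 3)
padded = pad(x)
print(padded[0, 0])
```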
d. model.state_dict() and optimizer.state_dict() store the model-parameter and optimizer-parameter dictionaries.
e. SummaryWriter can do visualization, including the loss, the model, etc.
f. The SSIM computation uses variance = mean of the squares − square of the mean, and uses average pooling to compute the local means and variances.
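The variance trick with average pooling looks like this; the window size and input are arbitrary:

```python
import torch
import torch.nn as nn

# Var[x] = E[x^2] - (E[x])^2, computed per local window via average pooling.
torch.manual_seed(0)
pool = nn.AvgPool2d(3, 1)          # local means over 3x3 windows, stride 1
x = torch.rand(1, 1, 8, 8)

mu = pool(x)
sigma = pool(x * x) - mu * mu      # local variance in one pooled pass
print(sigma.min())
```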
g. What checkpoints mean: https://www.cnblogs.com/jiangkejie/p/13049684.html
h. __len__(self) and __getitem__(self, index) are used to customize class behavior; they back len() and direct indexing respectively.
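A minimal Dataset example showing the two dunder methods (the data is invented):

```python
import torch
from torch.utils.data import Dataset

class ToyDataset(Dataset):
    def __init__(self, n):
        self.data = [torch.tensor(float(i)) for i in range(n)]

    def __len__(self):
        # Backs len(dataset).
        return len(self.data)

    def __getitem__(self, index):
        # Backs dataset[index].
        return self.data[index]

ds = ToyDataset(5)
print(len(ds), ds[2])
```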