The first ToF dataset usable for deep learning: Deep Learning for Confidence Information in Stereo and ToF Data Fusion (ICCV 2017)

2022-06-11 09:36:00 【F_ L_ O_ W】

Summary

Paper title: Deep Learning for Confidence Information in Stereo and ToF Data Fusion
Citations: 29
Paper link: https://openaccess.thecvf.com/content_ICCV_2017_workshops/papers/w13/Agresti_Deep_Learning_for_ICCV_2017_paper.pdf
Data link: https://lttm.dei.unipd.it//paper_data/deepfusion/
Authors' affiliation: University of Padua, Italy
Journal or conference: ICCV 2017

This paper presents a framework for fusing the depth maps of a stereo camera and a ToF camera. The key step is estimating a confidence for both the ToF depth and the stereo depth. When the two depth maps are fused, a local consistency constraint is imposed on the basis of this confidence information.

A second major contribution of the paper is a synthetic dataset that can be used to train deep learning networks. The network is both trained and tested on this dataset.

The experimental results show that the fusion framework effectively improves the accuracy of the depth map.

PS: After this paper, the same team published another paper dedicated to removing MPI (multi-path interference). A translation and walkthrough of that paper is available at: https://blog.csdn.net/flow_specter/article/details/123073483

Method and network structure

Assuming that the stereo acquisition system and the ToF system have already been calibrated, the algorithm consists of the following four steps:

(1) The depth information obtained from the ToF sensor is first reprojected onto the viewpoint of the reference stereo camera;
(2) A high-resolution depth map is computed with a stereo matching algorithm; this paper uses the SGM algorithm;
(3) A CNN estimates the confidence of the stereo disparity and of the ToF depth map;
(4) The upsampled ToF output and the stereo disparity are fused using an extended version of the LC technique (Mattoccia, 2009).

Learning the confidence with a network

For a given scene $i$, the following quantities are defined:

$D_{T,i}$: the ToF disparity map reprojected onto the stereo system;
$A_{T,i}$: the ToF amplitude map reprojected onto the stereo system;
$D_{S,i}$: the disparity map obtained by the stereo system;
$I_{R,i}$: the right image of the stereo system, converted to grayscale;
$I_{L',i}$: the left image warped to the viewpoint of the right image (converted to grayscale);
$\Delta'_{LR,i}$: the difference between the warped left image and the right image, computed in two steps. First:
$\Delta_{L R, i}=\left|\frac{I_{L, i}}{\mu_{L, i}}-\frac{I_{R, i}}{\mu_{R, i}}\right|$
where the scaling factors $\mu_{L,i}$ and $\mu_{R,i}$ are computed from the left and right images respectively. The result is then divided by $\sigma_{\Delta_{LR}}$, the standard deviation of $\Delta_{LR,j}$ averaged over all scenes $j$ in the training set:

$\Delta_{L R, i}^{\prime}=\Delta_{L R, i} / \sigma_{\Delta_{L R}}$

The other inputs are normalized in the same way:
$\begin{aligned} D_{T, i}^{\prime} &=D_{T, i} / \sigma_{D_{T}} \\ D_{S, i}^{\prime} &=D_{S, i} / \sigma_{D_{S}} \\ A_{T, i}^{\prime} &=A_{T, i} / \sigma_{A_{T}} \end{aligned}$

where $\sigma_{D_T}$, $\sigma_{D_S}$, and $\sigma_{A_T}$ are the corresponding standard deviations averaged over the scenes in the training set.

Finally, $\Delta'_{LR,i}$, $D'_{T,i}$, $D'_{S,i}$, and $A'_{T,i}$ are concatenated into a four-channel input and fed to the CNN. The outputs are the confidence maps $P_T$ and $P_S$ for the ToF data and the stereo data respectively.
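The input construction described above can be sketched in NumPy as follows. This is only an illustrative sketch, not the authors' code: the function names and the `sigmas` dictionary are hypothetical, and the scaling factors $\mu$ are assumed here to be the image means, which the text does not state explicitly.

```python
import numpy as np

def lr_difference(i_left_warped, i_right):
    """Delta_LR: absolute difference of the scaled grayscale images.

    Assumption: the scaling factors mu_L and mu_R are the image means.
    """
    mu_l = i_left_warped.mean()
    mu_r = i_right.mean()
    return np.abs(i_left_warped / mu_l - i_right / mu_r)

def build_cnn_input(d_tof, d_stereo, a_tof, delta_lr, sigmas):
    """Divide each map by its training-set sigma and stack into (4, H, W).

    `sigmas` is a hypothetical dict holding, for each quantity, the
    standard deviation averaged over the training scenes.
    """
    channels = [
        delta_lr / sigmas["delta_lr"],   # Delta'_LR
        d_tof / sigmas["d_tof"],         # D'_T
        d_stereo / sigmas["d_stereo"],   # D'_S
        a_tof / sigmas["a_tof"],         # A'_T
    ]
    return np.stack(channels, axis=0).astype(np.float32)
```

Dividing by a single dataset-wide sigma per quantity keeps the four channels on comparable scales without removing the per-scene differences in magnitude.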

A schematic of the network inference architecture:
[Figure: schematic of the network inference architecture]
The training patches fed to the CNN are 4-channel maps of size $142 \times 142$.
The network is a stack of 6 convolutional layers, each followed by a ReLU except the last one. Each of the first 5 convolutional layers has 128 filters; the window size is $5 \times 5$ in the first layer and $3 \times 3$ in the others. The last convolutional layer has only two filters, so the output has two channels, which contain the estimated confidence of the ToF data and of the stereo data. Note that no pooling layers are used, so that the output keeps the same resolution as the input.
To compensate for the size reduction caused by the convolutions, each image is padded outward by 7 pixels beforehand.
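The 7-pixel padding follows from the kernel sizes, as the short check below shows. It assumes stride-1 convolutions with no internal padding, which the text does not state explicitly but which is consistent with the numbers given; the function name is hypothetical.

```python
def valid_conv_output_size(input_size, kernel_sizes):
    """Spatial size after a stack of stride-1 convolutions without padding."""
    size = input_size
    for k in kernel_sizes:
        size -= k - 1  # each k x k convolution trims (k - 1) pixels per axis
    return size

# One 5x5 layer followed by five 3x3 layers, as described above.
KERNELS = [5, 3, 3, 3, 3, 3]

out_size = valid_conv_output_size(142, KERNELS)
border = (142 - out_size) // 2
print(out_size, border)  # prints "128 7"
```

Under these assumptions a $142 \times 142$ padded patch yields a $128 \times 128$ confidence map, so the 7-pixel border on each side is exactly what the six layers consume.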

Training details

The patches are obtained by random cropping from the full images ($142 \times 142$ after padding), which yields a considerable amount of training data.
Standard data augmentation is applied during training, for example rotations of ±5° and horizontal and vertical flips.
In the experiments, 30 patches are extracted from each image; counting the augmented versions, this gives a total of 6000 patches.
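A possible way to draw such patches is sketched below. It is not the authors' pipeline: zero padding is an assumption (the padding mode is not given), the ±5° rotations are left out to keep the sketch dependency-free, and the function names are hypothetical.

```python
import numpy as np

PAD = 7        # border added to compensate for the convolutions
PATCH = 142    # padded patch size given in the text

def random_patch(stack, rng):
    """Crop a random PATCH x PATCH window from a zero-padded (C, H, W) stack."""
    padded = np.pad(stack, ((0, 0), (PAD, PAD), (PAD, PAD)), mode="constant")
    _, h, w = padded.shape
    top = rng.integers(0, h - PATCH + 1)
    left = rng.integers(0, w - PATCH + 1)
    return padded[:, top:top + PATCH, left:left + PATCH]

def flip_variants(patch):
    """Return the patch together with its horizontal and vertical flips."""
    return [patch, patch[:, :, ::-1], patch[:, ::-1, :]]
```

In practice the ground-truth confidence maps would need to be cropped and flipped with the same offsets so that inputs and targets stay aligned.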

For both the ToF data and the stereo data, the confidence ground truth is derived from the absolute difference between the estimated disparity and the ground-truth disparity. More specifically, given a threshold, all values above the threshold are clipped to it, and the result is divided by the threshold so that every value falls in $[0, 1]$.

The training loss is the MSE between the confidence estimated by the network and the confidence ground truth.
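The target computation and the loss can be sketched as follows. The threshold value is not given in the text, so it is left as a parameter; the function names are hypothetical. The text also does not say whether the normalized error is inverted, so the sketch returns it exactly as described.

```python
import numpy as np

def confidence_target(d_est, d_gt, threshold):
    """Clipped absolute disparity error, rescaled to [0, 1].

    0 means no error; 1 means the error reached or exceeded `threshold`.
    """
    err = np.abs(d_est - d_gt)
    return np.minimum(err, threshold) / threshold

def mse_loss(pred, target):
    """Mean squared error between predicted and target maps."""
    return float(np.mean((pred - target) ** 2))
```

Because the fusion step uses the confidences as weights, an implementation would presumably map low error to high confidence (for example `1 - confidence_target(...)`), but that detail is an assumption here.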

The optimizer is SGD with momentum 0.9 and a batch size of 16. The weights are initialized with Xavier initialization. The initial learning rate is $10^{-7}$, and it is multiplied by a decay factor of 0.9 every 10 epochs. The network is implemented in the MatConvNet framework. On a PC with an i7-4790 CPU and an NVIDIA Titan X GPU, training takes about three hours.
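The learning rate schedule amounts to a simple step decay, sketched here in plain Python. The zero-based epoch index and the exact epochs at which the decay is applied are assumptions; the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-7, decay=0.9, step=10):
    """Step decay: multiply the rate by `decay` once every `step` epochs."""
    return base_lr * decay ** (epoch // step)

for epoch in (0, 9, 10, 25):
    print(epoch, learning_rate(epoch))
```

With these settings the rate stays at $10^{-7}$ for epochs 0 to 9, drops to $0.9 \times 10^{-7}$ for epochs 10 to 19, and so on.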

Stereo and ToF disparity fusion

LC stands for Locally Consistent, a method for refining stereo matching data. The idea behind it is that every valid depth estimate has a plausibility that is a function of color and spatial consistency.
This plausibility is then propagated to the neighboring pixels. In the end, each pixel accumulates plausibility from all directions, and the final disparity value is selected by winner-takes-all (WTA).
The parameters used are: $\gamma_s = 8$, $\gamma_c = \gamma_t = 4$.
The extension of LC used here weights the depth estimates from multiple sources by their confidence:
$\Omega_{f}^{\prime}(d)=\sum_{g \in \mathcal{A}}\left(P_{T}(g) \mathcal{P}_{f, g, T}(d)+P_{S}(g) \mathcal{P}_{f, g, S}(d)\right)$

where $P_T(g)$ and $P_S(g)$ are the confidences of the ToF system and the stereo system at pixel $g$. In this paper, these confidences are estimated by the network.
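The weighted accumulation and the WTA selection for a single pixel $f$ can be sketched as below. The plausibility terms $\mathcal{P}_{f,g,T}(d)$ and $\mathcal{P}_{f,g,S}(d)$ come from the LC method itself and are simply taken as inputs here; the array layout and the function name are hypothetical.

```python
import numpy as np

def fuse_disparity(plaus_tof, plaus_stereo, conf_tof, conf_stereo):
    """Confidence-weighted accumulation followed by winner-takes-all.

    plaus_tof, plaus_stereo: (G, D) plausibility that each of the G
        contributing pixels g assigns to each of D disparity hypotheses.
    conf_tof, conf_stereo: (G,) confidences P_T(g) and P_S(g).
    Returns the index of the winning hypothesis and the accumulated scores.
    """
    omega = (conf_tof[:, None] * plaus_tof
             + conf_stereo[:, None] * plaus_stereo).sum(axis=0)
    return int(np.argmax(omega)), omega

# Two contributing pixels, three disparity hypotheses (made-up numbers).
plaus_tof = np.array([[0.1, 0.8, 0.1], [0.2, 0.7, 0.1]])
plaus_stereo = np.array([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]])
winner, omega = fuse_disparity(plaus_tof, plaus_stereo,
                               np.array([0.9, 0.8]), np.array([0.2, 0.1]))
print(winner)  # prints 1: the ToF hypotheses dominate because P_T is high
```

In this toy example the stereo data prefers hypothesis 0, but its low confidence lets the ToF-supported hypothesis 1 win.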

Synthetic data

Another major contribution of this paper is a synthetic dataset named SYNTH3, which can be used directly to train deep learning networks. The training set contains 40 scenes (20 unique scenes, plus 20 more obtained by rendering the first 20 scenes from different viewpoints).
Although the number of scenes is small, it was already the largest stereo-ToF dataset available at the time, and it still preserves the differing characteristics of the individual scenes.

The test set consists of data collected from 15 unique scenes.
Each synthetic scene is produced with Blender's 3D rendering; specifically, the scene is rendered through virtual cameras.

The scenes include furniture and other objects of different shapes, as well as different indoor environments such as living rooms, kitchens, and offices. The data also includes some outdoor scenes with irregular geometry. Overall, the data looks fairly realistic and is well suited to simulating stereo-ToF acquisition. The depth in the scenes ranges from 50 cm to 10 m, which provides a wide measurement range.

In each simulated scene, a virtual stereo camera with the same parameters as the ZED stereo camera is placed together with a virtual ToF camera that mimics the parameters of the Kinect v2. The baseline of the stereo system is 12 cm. The relevant parameters of the two cameras:
[Figure: parameters of the stereo and ToF cameras]
[Figure: schematic of the stereo-ToF system]

For each scene, the dataset includes:
(1) The left and right color images acquired by the stereo system, at $1920 \times 1080$;
(2) The depth map estimated by the ToF system;
(3) The corresponding amplitude map obtained by the ToF system.

The color images are generated directly by LuxRender, a 3D renderer used with Blender.
The ToF data is produced by the ToF-Explorer simulator developed by Sony EuTEC.
The ToF simulator takes as input the scene information generated by Blender and LuxRender.
The dataset also contains the ground-truth depth of each scene (aligned with the right image of the stereo camera). SYNTH3 is probably the first synthetic ToF dataset that can be used for deep learning.

Experimental results

The fusion algorithm proposed in the paper is both trained and tested on the SYNTH3 dataset.

Test set scenes

[Figure: scenes in the test set]

Confidence estimation results

[Figure: confidence estimation results]

Qualitative and quantitative results of disparity estimation

[Figure: qualitative and quantitative disparity estimation results]

References

[1] S. Mattoccia. A locally global approach to stereo correspondence. In Proc. of 3D Digital Imaging and Modeling (3DIM), October 2009.

Appendix: data description

The SYNTH3 data provided with the paper can be downloaded from: https://lttm.dei.unipd.it//paper_data/deepfusion/

For each scene in the dataset, the following are provided:

The ToF depth map, at $512 \times 424$;
The ToF depth map reprojected to the viewpoint of the stereo system, at $960 \times 540$;
The ToF amplitude maps acquired at 16, 80, and 120 MHz, at $512 \times 424$;
The ToF amplitude map acquired at 120 MHz and reprojected to the viewpoint of the stereo camera;
The ToF intensity maps acquired at 16, 80, and 120 MHz, at $512 \times 424$;
The ground-truth depth map from the ToF viewpoint;
The left and right color images acquired by the stereo system, at $1920 \times 1080$;
The disparity map and depth map estimated by the stereo camera, at $960 \times 540$, right view;
The ground-truth depth map and ground-truth disparity map on the right view of the stereo camera.