3D Semantic Segmentation - 2DPASS
2022-08-03 23:06:00 【Lemon_Yam】
Main contributions of 2DPASS (ECCV 2022):
- It proposes 2D Prior Aided Semantic Segmentation (2DPASS), which uses 2D prior information from the camera to assist 3D semantic segmentation. To the best of the authors' knowledge, 2DPASS is the first method to distill multimodal information and apply it to semantic segmentation on a single modality (point cloud only).
- With the multi-scale fusion-to-single knowledge distillation (MSFSKD) strategy proposed in the paper, 2DPASS achieves significant performance gains and reaches SOTA on the two large-scale benchmarks SemanticKITTI and NuScenes.
Introduction
In autonomous driving, cameras capture dense color information and fine-grained texture, but they cannot obtain accurate depth and are unreliable in low-light conditions; lidar provides accurate depth but only captures sparse, textureless data. Cameras and lidar therefore capture complementary information, which makes semantic segmentation through multimodal data fusion a research hotspot.
Although multimodal data can effectively improve model performance, fusion-based methods require paired data and have the following limitations:
- Because the fields of view (FOVs) of the camera and the lidar differ, no point-to-pixel mapping can be established for points outside the image plane. In general, the FOVs of the lidar and the camera overlap only over a small region (in the figure above, the red part of the right image marks the FOV overlap), which greatly limits the applicability of fusion-based methods.
- Fusion-based methods must process images and point clouds simultaneously at runtime (via multi-task or cascaded designs), consuming more computing resources and placing a heavy burden on real-time applications.
Therefore, the paper proposes 2D Prior Aided Semantic Segmentation (2DPASS), a general training scheme that improves representation learning on point clouds. 2DPASS makes full use of 2D images with rich semantic information during training, and then performs semantic segmentation without the strict paired-data constraint. Concretely, 2DPASS acquires richer semantic and structural information from the multimodal data through auxiliary modal fusion and multi-scale fusion-to-single knowledge distillation (MSFSKD), and then distills that information into a pure 3D network. Compared with fusion-based methods, the paper's solution has the following advantages:
- Generality: it can be integrated easily with any other 3D segmentation model, with only minor modifications to the network structure.
- Flexibility: the fusion module is used only during training to enhance the 3D network; after training, the enhanced 3D model can be deployed without any image input.
- Effectiveness: even when only a small portion of the multimodal data overlaps, the proposed method still improves performance significantly.
Experiments show that, equipped with 2DPASS, the baseline model used in the paper gains a significant performance boost with only point cloud input. Specifically, it achieves SOTA on the two widely used large-scale benchmarks SemanticKITTI and NuScenes.
Network Architecture
- A small patch (480x320) is cropped from the original image as the 2D input (camera images are very large, and feeding raw images into the multimodal pipeline would be hard to handle); this step speeds up training without noticeable performance degradation.
- The cropped image and the lidar point cloud are passed through the 2D and 3D encoders in parallel to generate multi-scale features.
- At each scale, complementary 2D knowledge is transferred effectively into the 3D network through MSFSKD (making full use of the texture- and color-aware 2D priors while preserving the modality-specific knowledge of the original 3D features).
- At every scale, the 2D and 3D features are used to generate semantic segmentation predictions, which are supervised by the pure 3D labels.
Note: during inference, the 2D-related branches can be discarded; compared with fusion-based methods, this effectively avoids the extra computational burden in real applications.
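To make this train/inference asymmetry concrete, here is a minimal, hypothetical PyTorch sketch (module and return-value names are assumptions, not the actual 2DPASS code): the 2D branch only runs in training mode, so deployment needs nothing but the point cloud.

```python
import torch.nn as nn

class TwoDPASSSketch(nn.Module):
    """Hypothetical skeleton: the 2D branch exists only to aid training."""

    def __init__(self, net_2d: nn.Module, net_3d: nn.Module):
        super().__init__()
        self.net_2d = net_2d   # image encoder/decoder (ResNet34-style)
        self.net_3d = net_3d   # sparse-convolution point cloud network

    def forward(self, points, image=None):
        # The 3D branch always runs and is the only branch kept at inference.
        pred_3d, feats_3d = self.net_3d(points)
        if self.training and image is not None:
            # During training, the cropped image patch is processed in parallel and
            # its multi-scale features are fused/distilled into the 3D branch (MSFSKD).
            pred_2d, feats_2d = self.net_2d(image)
            return pred_3d, pred_2d, feats_3d, feats_2d
        return pred_3d
```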
Encoder
The paper uses a 2D-convolution ResNet34 encoder as the 2D network and builds the 3D network with sparse convolutions. Specifically, the paper designs a hierarchical encoder, SPVCNN, adopts a ResNet bottleneck design at each scale, and replaces the ReLU activation with LeakyReLU. From both networks, feature maps are extracted at $L$ different scales, giving the 2D features $\{F_l^{2D}\}_{l=1}^L$ and the 3D features $\{F_l^{3D}\}_{l=1}^L$.
Note: one advantage of sparse convolution is its sparsity: the convolution operation only considers non-empty voxels.
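As a small illustration of the activation swap mentioned above, the sketch below replaces every ReLU in a torchvision ResNet34 with LeakyReLU (a hedged example; the negative slope of 0.1 is an assumption, and the paper's actual backbone is built differently):

```python
import torch.nn as nn
import torchvision

def replace_relu_with_leaky(module: nn.Module, negative_slope: float = 0.1) -> None:
    """Recursively swap every nn.ReLU in `module` for nn.LeakyReLU, in place."""
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, nn.LeakyReLU(negative_slope, inplace=True))
        else:
            replace_relu_with_leaky(child, negative_slope)

encoder_2d = torchvision.models.resnet34()  # 2D backbone; 2DPASS uses a ResNet34-style encoder
replace_relu_with_leaky(encoder_2d)
```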
Decoder
- For the 2D network, the paper adopts an FCN as the decoder to upsample the features of each encoding layer. Specifically, the feature map of the $(L-l+1)$-th encoding layer is upsampled to obtain the $l$-th decoding feature map $D_l^{2D}$, and all upsampled feature maps are merged by element-wise addition. Finally, the fused feature map is passed through a linear classifier for 2D semantic segmentation (see the sketch after this list).
- For the 3D network, the paper adopts a U-Net decoder. The features of different scales are upsampled to the original size, concatenated together, and then fed into the classifier. The paper finds that this structure learns hierarchical information better while producing predictions more efficiently.
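A minimal sketch of the 2D decoding idea described above, assuming all encoder feature maps share the same channel count (shapes and layer names are hypothetical; it only illustrates upsampling, element-wise addition, and a linear classifier):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FCNDecoderSketch(nn.Module):
    """Fuse multi-scale encoder features by upsampling and element-wise addition."""

    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(channels, num_classes)  # linear classifier on fused features

    def forward(self, encoder_feats):
        # encoder_feats: list of L tensors [B, C, H_l, W_l], shallowest (highest resolution) first.
        target_size = encoder_feats[0].shape[-2:]
        fused = torch.zeros_like(encoder_feats[0])
        for feat in encoder_feats:
            # Upsample every scale to the target resolution and merge by addition.
            fused = fused + F.interpolate(feat, size=target_size, mode="bilinear", align_corners=False)
        # Per-pixel classification: [B, C, H, W] -> [B, H, W, C] -> class logits.
        logits = self.classifier(fused.permute(0, 2, 3, 1))
        return logits.permute(0, 3, 1, 2)
```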
Point-to-Pixel Correspondence
- 2D feature generation is shown in figure (a) above (taking the $l$-th layer features as an example). The 2D features $F_l^{2D} \in \mathbb{R}^{H_l \times W_l \times D_l}$ are first upsampled with a deconvolution to the original image resolution, yielding the feature map $\tilde{F}_l^{2D}$; the point cloud is then projected onto the image patch to generate a point-to-pixel (P2P) mapping; finally, according to the P2P mapping, the 2D feature map $\tilde{F}_l^{2D}$ is converted into point-wise 2D features $\hat{F}_l^{2D}$ (a code sketch of this projection and the FOV filtering follows after this list). The point-to-pixel mapping is:
$$
\begin{aligned}
[u_i, v_i, 1]^T &= \frac{1}{z_i} \times K \times T \times [x_i, y_i, z_i, 1]^T \\
M^{img} &= \{ (\lfloor v_i \rfloor, \lfloor u_i \rfloor) \}_{i=1}^N \in \mathbb{R}^{N \times 2}
\end{aligned}
$$
Note: $p_i=(x_i, y_i, z_i)\in \mathbb{R}^3$ is a point in the point cloud, $\hat{p}_i=(u_i, v_i)\in \mathbb{R}^2$ is its projected pixel, $K\in \mathbb{R}^{3\times 4}$ is the camera intrinsic matrix, $T \in \mathbb{R}^{4 \times 4}$ is the camera extrinsic matrix, and $\lfloor \cdot \rfloor$ denotes the floor operation.
Note: because the lidar and the camera in NuScenes operate at different frequencies, the lidar frame at timestamp $t_l$ has to be transformed into the camera frame at timestamp $t_c$ through the global coordinate system. The extrinsic matrix $T$ in the NuScenes dataset is:
$$
T = T_{camera \leftarrow ego_{t_c}} \times T_{ego_{t_c} \leftarrow global} \times T_{global \leftarrow ego_{t_l}} \times T_{ego_{t_l} \leftarrow lidar}
$$
- 3D feature generation is shown in figure (b) above (taking the $l$-th layer features as an example). A point-to-voxel mapping $M_l^{voxel}$ is obtained first; then, given the 3D features $F_l^{3D} \in \mathbb{R}^{N^{'}_l \times D_l}$ of the sparse convolution layer, the point-wise 3D features $\tilde{F}_l^{3D}\in \mathbb{R}^{N \times D_l}$ are obtained by nearest interpolation on the original feature map $F_l^{3D}$ according to $M_l^{voxel}$; finally, points outside the image field of view (FOV) are discarded to keep only the relevant points. The corresponding formulas are:
$$
\begin{aligned}
M_l^{voxel} &= \left\{ \left( \left\lfloor \frac{x_i}{r_l} \right\rfloor, \left\lfloor \frac{y_i}{r_l} \right\rfloor, \left\lfloor \frac{z_i}{r_l} \right\rfloor \right) \right\}_{i=1}^N \in \mathbb{R}^{N \times 3} \\
\hat{F}_l^{3D} &= \left\{ f_i \mid f_i \in \tilde{F}_l^{3D},\ M_{i,1}^{img} \le H,\ M_{i,2}^{img} \le W \right\}_{i=1}^N \in \mathbb{R}^{N^{img} \times D_l}
\end{aligned}
$$
Note: $P=\{(x_i, y_i, z_i)\}_{i=1}^N$ is the point cloud, $r_l$ is the voxel resolution of the $l$-th layer, $H$ is the height of the image field of view, and $W$ is its width.
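The following NumPy sketch illustrates the point-to-pixel projection, FOV filtering, and point-to-voxel mapping described above (the calibration matrices, image size, and the strict in-bounds check are placeholder assumptions; the actual 2DPASS code handles calibration and cropping differently):

```python
import numpy as np

def point_to_pixel(points, K, T, H, W):
    """Project lidar points to pixel coordinates and keep those inside the H x W image.

    points: (N, 3) lidar coordinates; K: (3, 4) intrinsics; T: (4, 4) extrinsics.
    Returns the integer P2P mapping M^img = (floor(v), floor(u)) and an in-FOV mask.
    """
    n = points.shape[0]
    homo = np.concatenate([points, np.ones((n, 1))], axis=1)  # (N, 4) homogeneous coordinates
    cam = (K @ T @ homo.T).T                                  # (N, 3) = z_i * [u_i, v_i, 1]
    z = cam[:, 2:3]
    uv = cam[:, :2] / z                                       # perspective division by depth
    p2p = np.floor(uv[:, [1, 0]]).astype(np.int64)            # (floor(v_i), floor(u_i))
    # Keep only points in front of the camera that land inside the image patch.
    in_fov = (z[:, 0] > 0) & (p2p[:, 0] >= 0) & (p2p[:, 0] < H) \
             & (p2p[:, 1] >= 0) & (p2p[:, 1] < W)
    return p2p, in_fov

def point_to_voxel(points, r_l):
    """Point-to-voxel mapping at voxel resolution r_l (first formula above)."""
    return np.floor(points / r_l).astype(np.int64)
```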
MSFSKD
As shown in the figure below, MSFSKD consists of modal fusion and modality-preserving knowledge distillation. The 2D and 3D features are fused through a 2D learner: two MLPs map the features nonlinearly and add them point-wise, the output features are then fused with the original 2D features, and a classifier (fully connected layer) produces the fused prediction scores $S_l^{2D3D}$. On the 3D side, the features are enhanced and another classifier (fully connected layer) produces the 3D prediction scores $S_l^{3D}$; distillation is then performed at the prediction level.
- The fusion of the 2D and 3D features is expressed as (see the sketch after this list):
$$
\hat{F}_l^{2D3D_e} = \hat{F}_l^{2D} + \sigma(MLP(\hat{F}_l^{2D3D})) \odot \hat{F}_l^{2D3D}
$$
Note: $\sigma$ is the sigmoid activation. In addition, in the figure above $\hat{F}_l^{2D3D_e}$ is obtained after two MLPs, yet the expression contains only one MLP; the reason for this discrepancy is not explained.
- The knowledge distillation loss is:
$$
L_{xM} = D_{KL}(S_l^{2D3D} \parallel S_l^{3D})
$$
Note: the paper uses the KL divergence as the knowledge distillation loss function.
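Below is a hedged PyTorch sketch of the gated fusion and prediction-level distillation formulas above (module names, the hidden size, and the single-MLP gate are illustrative assumptions rather than the paper's exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionDistillSketch(nn.Module):
    """Gated 2D/3D feature fusion plus KL-based prediction-level distillation."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.gate_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.cls_fuse = nn.Linear(dim, num_classes)  # classifier for the fused features
        self.cls_3d = nn.Linear(dim, num_classes)    # classifier for the enhanced 3D features

    def forward(self, feat_2d, feat_2d3d, feat_3d):
        # F^{2D3D_e} = F^{2D} + sigmoid(MLP(F^{2D3D})) * F^{2D3D}
        fused = feat_2d + torch.sigmoid(self.gate_mlp(feat_2d3d)) * feat_2d3d
        s_2d3d = self.cls_fuse(fused)
        s_3d = self.cls_3d(feat_3d)
        # L_xM = KL(S^{2D3D} || S^{3D}); detaching the fused scores makes the distillation
        # one-way, so the gradient only pushes the 3D branch toward the fused predictions.
        loss_xm = F.kl_div(F.log_softmax(s_3d, dim=-1),
                           F.softmax(s_2d3d, dim=-1).detach(),
                           reduction="batchmean")
        return s_2d3d, s_3d, loss_xm
```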
Paper: https://arxiv.org/pdf/2207.04397.pdf
Code: https://github.com/yanx27/2DPASS
Supplementary reading: ECCV 2022 | 2DPASS: 2D priors assisting semantic segmentation of lidar point clouds!