[Point Cloud Processing: Intensive Paper Reading, Frontier Edition 10] MVTN: Multi-View Transformation Network for 3D Shape Recognition

2022-07-03 08:53:00 [LingbinBu]

MVTN: Multi-View Transformation Network for 3D Shape Recognition

Abstract
Related Work
Method
Experiments

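Abstract

Problem: Among point cloud processing methods, multi-view projection methods usually set their viewpoints heuristically or use the same viewpoints for every shape.
Method: The paper proposes learning how to set these viewpoints.
Details: It introduces the Multi-View Transformation Network (MVTN), which searches for the optimal viewpoints for 3D shape recognition; the whole network is differentiable by design. MVTN can be trained end to end and paired with any multi-view network for 3D shape recognition. The paper integrates MVTN with a new adaptive multi-view network that can process both 3D meshes and point clouds.
Code: https://github.com/ajhamdi/MVTN (PyTorch implementation)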

Method

Overview of Multi-View 3D Recognition

Training a multi-view network can be written as:
$\begin{aligned} & \underset{\boldsymbol{\theta}_{\mathbf{C}}}{\arg \min } \sum_{n}^{N} L\left(\mathbf{C}\left(\mathbf{X}_{n}\right), y_{n}\right) \\ =& \underset{\boldsymbol{\theta}_{\mathbf{C}}}{\arg \min } \sum_{n}^{N} L\left(\mathbf{C}\left(\mathbf{R}\left(\mathbf{S}_{n}, \mathbf{u}_{0}\right)\right), y_{n}\right) \end{aligned}$
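where $L$ is the task-specific loss function, $N$ is the number of 3D shapes in the dataset, and $y_{n}$ is the label of the $n$-th 3D shape $\mathbf{S}_{n}$. $\mathbf{u}_{0} \in \mathbb{R}^{\tau}$ is a set of $\tau$ scene parameters shared by the whole dataset; these parameters describe the properties that affect the rendered images, including viewpoint, lighting, color, and background. $\mathbf{R}$ is the renderer, which takes a shape $\mathbf{S}_{n}$ and the parameters $\mathbf{u}_{0}$ as input and produces the $M$ multi-view images $\mathbf{X}_{n}=\left\{\mathbf{x}_{i}\right\}_{i=1}^{M}$ of that shape. In MVCNN, $\mathbf{C}=\operatorname{MLP}\left(\max _{i} \mathbf{f}\left(\mathbf{x}_{i}\right)\right)$, where $\mathbf{f}: \mathbb{R}^{h \times w \times c} \rightarrow \mathbb{R}^{d}$ is a 2D CNN backbone applied to each view; in ViewGCN, $\mathbf{C}=\operatorname{MLP}\left(\operatorname{cat}_{\mathrm{GCN}}\left(\mathbf{f}\left(\mathbf{x}_{i}\right)\right)\right)$, where $\operatorname{cat}_{\mathrm{GCN}}$ is an aggregation of view features learned by a graph convolutional network. $\boldsymbol{\theta}_{\mathbf{C}}$ denotes the parameters of the multi-view network $\mathbf{C}$. In the experiments, the scene parameters $\mathbf{u}$ are the azimuth and elevation angles of cameras pointing at the object center, so $\tau=2M$.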
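
To make the fixed-viewpoint baseline concrete, the following is a minimal sketch of one possible $\mathbf{u}_{0}$ and of MVCNN-style max pooling over view features. The circular layout, the 30 degree elevation, the ordering of the parameter vector (all azimuths first, then all elevations), and the function names `circular_viewpoints` and `max_pool_views` are assumptions made for illustration; they are not taken from the text above.

```python
def circular_viewpoints(num_views, elevation_deg=30.0):
    """Build a fixed u_0 as a flat list [azim_1..azim_M, elev_1..elev_M] in degrees.

    Assumption: cameras sit on a circle around the object at one shared elevation.
    """
    azimuths = [-180.0 + 360.0 * i / num_views for i in range(num_views)]
    elevations = [elevation_deg] * num_views
    return azimuths + elevations


def max_pool_views(view_features):
    """Element-wise max over the M per-view feature vectors f(x_i)."""
    return [max(column) for column in zip(*view_features)]


M = 4
u0 = circular_viewpoints(M)
print(len(u0))  # tau = 2M scene parameters

# Toy stand-ins for the d-dimensional CNN features of the M rendered views.
features = [
    [0.1, 0.9, 0.3],
    [0.7, 0.2, 0.4],
    [0.5, 0.5, 0.8],
    [0.0, 0.1, 0.2],
]
pooled = max_pool_views(features)
print(pooled)
```

The pooled vector is what the MLP head of $\mathbf{C}$ would consume in the MVCNN formulation; the same $\mathbf{u}_{0}$ is reused for every shape, which is the limitation MVTN addresses.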

Multi-View Transformation Network (MVTN)

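Previous multi-view methods take the images $\mathbf{X}$ as the only representation of a 3D shape, where $\mathbf{X}$ is rendered with fixed scene parameters $\mathbf{u}_0$. In contrast, this paper considers a more general setting in which $\mathbf{u}$ is a variable bounded within $\pm \mathbf{u}_{\text {bound }}$, where $\mathbf{u}_{\text {bound }}$ is positive and defines the permissible range of the scene parameters. $\mathbf{u}_{\text {bound }}$ is set to $180^{\circ}$ for each azimuth angle and $90^{\circ}$ for each elevation angle.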

Differentiable Renderer

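The renderer $\mathbf{R}$ takes a 3D shape $\mathbf{S}$ (mesh or point cloud) and the scene parameters $\mathbf{u}$ as input and outputs the corresponding $M$ images $\left\{\mathbf{x}_{i}\right\}_{i=1}^{M}$. Because $\mathbf{R}$ is differentiable, the gradients $\frac{\partial \mathbf{x}_{i}}{\partial \mathbf{u}}$ can be backpropagated from each image to the scene parameters, which makes an end-to-end learning framework possible.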
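
The first step inside such a renderer is turning the angles in $\mathbf{u}$ into a camera pose. The sketch below shows one way to place a camera from an azimuth and an elevation, and uses a finite difference to illustrate that the pose varies smoothly with the angle, which is what lets gradients reach $\mathbf{u}$. The axis convention (y up), the camera distance of 2.2, and the helper names are assumptions for illustration, not details given above; a real differentiable renderer would use automatic differentiation rather than finite differences.

```python
import math


def camera_position(azimuth_deg, elevation_deg, distance=2.2):
    """Camera location on a sphere around the origin, looking at the object center.

    Assumed convention: y is up and azimuth rotates about the y axis.
    """
    azim = math.radians(azimuth_deg)
    elev = math.radians(elevation_deg)
    x = distance * math.cos(elev) * math.sin(azim)
    y = distance * math.sin(elev)
    z = distance * math.cos(elev) * math.cos(azim)
    return (x, y, z)


def central_difference(fn, x, eps=1e-4):
    """Numerical derivative of a scalar function, used here only as a check."""
    return (fn(x + eps) - fn(x - eps)) / (2.0 * eps)


position = camera_position(azimuth_deg=90.0, elevation_deg=0.0)
radius = math.sqrt(sum(c * c for c in position))

# Sensitivity of the camera x coordinate to the azimuth (per degree) at azimuth 0.
dx_dazim = central_difference(lambda a: camera_position(a, 0.0)[0], 0.0)
print(position, radius, dx_dazim)
```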

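When $\mathbf{S}$ is represented as a 3D mesh, $\mathbf{R}$ has two components: a rasterizer and a shader. First, given the camera viewpoint, the rasterizer transforms the mesh from world coordinates to view coordinates and assigns faces to pixels. The shader then uses these face assignments to create multiple values for each pixel and blends them.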

When $\mathbf{S}$ is represented as a point cloud, $\mathbf{R}$ can use an alpha-blending mechanism instead.
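
The text does not spell out how the blending is done, so the following is only a generic sketch of front-to-back alpha compositing for a single pixel. It assumes each point covering the pixel contributes a `(depth, alpha, value)` fragment; how the authors' renderer derives the alpha weights is not described here.

```python
def alpha_blend(fragments, background=0.0):
    """Composite (depth, alpha, value) fragments for one pixel, nearest first."""
    color = 0.0
    transmittance = 1.0  # fraction of light not yet absorbed by nearer points
    for _, alpha, value in sorted(fragments, key=lambda frag: frag[0]):
        color += transmittance * alpha * value
        transmittance *= 1.0 - alpha
    # Whatever passes through all points shows the background.
    return color + transmittance * background


# Two half-transparent points: a dark one in front, a bright one behind.
fragments = [(2.0, 0.5, 1.0), (1.0, 0.5, 0.0)]
pixel_dark_bg = alpha_blend(fragments, background=0.0)
pixel_bright_bg = alpha_blend(fragments, background=1.0)
print(pixel_dark_bg, pixel_bright_bg)
```

Because every operation is a sum or product of smooth weights, a pixel computed this way stays differentiable with respect to the alphas and values, unlike a hard nearest-point selection.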

View-Points Conditioned on 3D Shape

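The scene parameters $\mathbf{u}$ are made a function of the 3D shape by learning a Multi-View Transformation Network (MVTN) $\mathbf{G}: \mathbb{R}^{P \times 3} \rightarrow \mathbb{R}^{\tau}$ with parameters $\boldsymbol{\theta}_{\mathbf{G}}$, where $P$ is the number of points sampled from the shape $\mathbf{S}$. MVTN training can be written as: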
$\begin{aligned} \underset{\boldsymbol{\theta}_{\mathbf{C}}, \boldsymbol{\theta}_{\mathbf{G}}}{\arg \min } \quad & \sum_{n}^{N} L\left(\mathbf{C}\left(\mathbf{R}\left(\mathbf{S}_{n}, \mathbf{u}_{n}\right)\right), y_{n}\right) \\ \text { s.t. } \quad & \mathbf{u}_{n}=\mathbf{u}_{\text {bound }} \cdot \tanh \left(\mathbf{G}\left(\mathbf{S}_{n}\right)\right) \end{aligned}$
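where $\mathbf{G}$ encodes the 3D shape and predicts the optimal viewpoints for the task-specific multi-view network $\mathbf{C}$. Because the only goal of $\mathbf{G}$ is to predict viewpoints, its architecture is simple and lightweight. $\mathbf{G}$ uses a simple point encoder (for example, the shared MLP in PointNet) to process the $P$ points sampled from $\mathbf{S}$ and produce a coarse shape feature of dimension $b$. A shallow MLP then regresses the scene parameters $\mathbf{u}_n$ from this global shape feature. To force the predicted parameters into the range $\pm \mathbf{u}_{\text {bound }}$, the output is passed through a tanh function and scaled by $\mathbf{u}_{\text {bound }}$.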
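
The pipeline described above (shared per-point MLP, global pooling, shallow regressor, scaled tanh) can be sketched in plain Python as follows. This is a toy with random, untrained weights and single-layer networks, not the authors' implementation; the max pooling step, the layer sizes, the output ordering (azimuths first, then elevations), and all function names are assumptions for illustration.

```python
import math
import random


def linear(vec, weights, bias):
    """Dense layer; `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + bias_term
            for row, bias_term in zip(weights, bias)]


def relu(vec):
    return [max(0.0, v) for v in vec]


def random_layer(rng, n_in, n_out):
    """Untrained layer with uniform random weights, for demonstration only."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    bias = [rng.uniform(-1.0, 1.0) for _ in range(n_out)]
    return weights, bias


def predict_views(points, encoder, regressor, num_views):
    """Compute u = u_bound * tanh(G(S)) for one shape given as (x, y, z) points."""
    # Shared MLP: the same layer is applied to every point independently.
    per_point = [relu(linear(p, *encoder)) for p in points]
    # Max pooling over points yields a permutation-invariant feature of size b.
    global_feature = [max(column) for column in zip(*per_point)]
    # Shallow regressor maps the global feature to tau = 2M raw outputs.
    raw = linear(global_feature, *regressor)
    # Assumed layout: M azimuths (bound 180) followed by M elevations (bound 90).
    bounds = [180.0] * num_views + [90.0] * num_views
    return [bound * math.tanh(r) for bound, r in zip(bounds, raw)]


rng = random.Random(0)
M, feature_dim = 4, 16
encoder = random_layer(rng, 3, feature_dim)
regressor = random_layer(rng, feature_dim, 2 * M)
points = [tuple(rng.uniform(-1.0, 1.0) for _ in range(3)) for _ in range(32)]

u = predict_views(points, encoder, regressor, M)
u_reordered = predict_views(points[::-1], encoder, regressor, M)
print(u)
```

Whatever the regressor outputs, the scaled tanh keeps every azimuth within 180 degrees and every elevation within 90 degrees of zero, and reordering the input points does not change the prediction because of the pooling step.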

MVTN for 3D Shape Classification

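To train MVTN for 3D shape classification, we define a cross-entropy loss, although other losses and regularizers can also be used. The multi-view network ($\mathbf{C}$) and MVTN ($\mathbf{G}$) are trained jointly on the same loss. An advantage of our architecture is that it can handle 3D point clouds: when $\mathbf{S}$ is a point cloud, $\mathbf{R}$ is simply defined as a differentiable point cloud renderer.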
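
For reference, the cross-entropy loss on the class scores produced by $\mathbf{C}$ can be computed as in the short sketch below. The function name and the toy logits are illustrative assumptions; in joint training, the gradient of this single scalar would flow back through $\mathbf{C}$, the renderer $\mathbf{R}$, and $\mathbf{G}$.

```python
import math


def cross_entropy(logits, label):
    """Negative log-probability of the true class under softmax(logits)."""
    peak = max(logits)  # subtract the max before exponentiating for stability
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[label]


logits = [2.0, 0.0, 0.0]  # toy class scores for one shape
loss_correct = cross_entropy(logits, label=0)
loss_wrong = cross_entropy(logits, label=1)
loss_uniform = cross_entropy([0.0, 0.0], label=0)
print(loss_correct, loss_wrong, loss_uniform)
```

The loss is small when the highest score belongs to the true class and grows when it does not, which is the signal that would drive both the classifier and the predicted viewpoints.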