[Point cloud processing papers, intensive reading, frontier edition 11] - Unsupervised Point Cloud Pre-training via Occlusion Completion

2022-07-03 09:14:00 【LingbinBu】

OcCo: Unsupervised Point Cloud Pre-training via Occlusion Completion

Abstract
Introduction
Method
- Generating Occlusions
- The Completion Task
Experiments
- OcCo Pre-Training Setup
- Fine-Tuning Setup
Analysis
Discussion
New words

Abstract

Method: the paper proposes Occlusion Completion (OcCo), a pre-training method for point clouds.
Technical details:
1. Mask the points that are occluded from a camera viewpoint
2. Train an encoder-decoder model to reconstruct the occluded points
3. Use the encoder weights as the initialization for downstream point cloud tasks
Applications: object classification, part-based segmentation and semantic segmentation
Code: https://github.com/hansen7/OcCo (supports PyTorch and TensorFlow)

Introduction

OcCo has the following properties:

It improves sample efficiency in few-shot learning experiments
It improves generalization in classification and segmentation tasks
With fine-tuning, it finds local minima more easily
Network dissection shows that it learns more semantically meaningful representations
It maintains better classification quality under jittering, translation and rotation transformations

Method

Let $\mathcal{P}$ be a point cloud in 3D Euclidean space, $\mathcal{P}=\left\{p_{1}, p_{2}, \ldots, p_{n}\right\}$, where each point $p_{i}$ is a vector containing the coordinates $\left(x_{i}, y_{i}, z_{i}\right)$ and other features (such as color and normal vector). We first describe the occlusion mapping $o(\cdot)$ and then introduce the completion model $c(\cdot)$. Pseudocode and architectural details are given in the appendix of the paper.

Generating Occlusions

Define a randomised occlusion mapping $o: \mathbb{P} \rightarrow \mathbb{P}$, where $\mathbb{P}$ is the space of point clouds. It maps a complete point cloud $\mathcal{P}$ to an occluded point cloud $\tilde{\mathcal{P}}$ by removing from $\mathcal{P}$ the points that cannot be seen from a given viewpoint. The steps are as follows:

Project the complete point cloud from the world coordinate system into the camera coordinate system of the chosen viewpoint
Determine which points are occluded from this viewpoint
Project the remaining points from the camera coordinate system back into the world coordinate system

Viewing the point cloud from a camera

A pinhole camera model defines the mapping from the 3D world coordinate system to the coordinate system of a specific camera:

$\left[x_{\mathrm{cam}}, y_{\mathrm{cam}}, z_{\mathrm{cam}}\right]^{\top}=\mathbf{K}\,[\mathbf{R} \mid \mathbf{t}]\,[x, y, z, 1]^{\top} \quad (1)$

where $(x, y, z)$ are the coordinates of a point of the original point cloud in the world coordinate system. The camera viewpoint is determined by the rotation matrix $\mathbf{R}$ and the translation vector $\mathbf{t}$. The camera intrinsic matrix $\mathbf{K}$ is determined by the focal length $f$, the skewness $\gamma$, and the image width $w$ and height $h$. Given these parameters, the coordinates of the point in the camera coordinate system, $\left(x_{\mathrm{cam}}, y_{\mathrm{cam}}, z_{\mathrm{cam}}\right)$, can be computed.
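
The mapping described above can be sketched in NumPy as follows. This is only an illustration, not the authors' code: the helper names `make_intrinsics` and `world_to_camera` are hypothetical, and placing the principal point at the image centre $(w/2, h/2)$ is an assumption, since the text only says that $\mathbf{K}$ depends on $f$, $\gamma$, $w$ and $h$.

```python
import numpy as np

def make_intrinsics(f, gamma, w, h):
    """Pinhole intrinsic matrix K (assumption: principal point at the image centre)."""
    return np.array([[f, gamma, w / 2.0],
                     [0.0, f, h / 2.0],
                     [0.0, 0.0, 1.0]])

def world_to_camera(points, K, R, t):
    """Apply K [R | t] to an (n, 3) array of world-frame points."""
    extrinsic = np.hstack([R, t.reshape(3, 1)])                     # (3, 4)
    homogeneous = np.hstack([points, np.ones((len(points), 1))])    # (n, 4)
    return (K @ extrinsic @ homogeneous.T).T                        # (n, 3)

K = make_intrinsics(f=1000.0, gamma=0.0, w=1600.0, h=1200.0)
R = np.eye(3)        # identity rotation: camera axes aligned with world axes
t = np.zeros(3)      # zero translation
points = np.array([[0.0, 0.0, 2.0], [0.1, -0.2, 4.0]])
cam = world_to_camera(points, K, R, t)
```

With an identity rotation and zero translation, the result is simply $\mathbf{K}$ applied to each point, which makes the sketch easy to check by hand.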

Determining occluded points

A point $\left(x_{\mathrm{cam}}, y_{\mathrm{cam}}, z_{\mathrm{cam}}\right)$ can be interpreted in two ways:

As a 3D point $\left(x_{\mathrm{cam}}, y_{\mathrm{cam}}, z_{\mathrm{cam}}\right)$ in the camera coordinate system
As a 2D pixel with coordinates $\left(f x_{\mathrm{cam}} / z_{\mathrm{cam}}, f y_{\mathrm{cam}} / z_{\mathrm{cam}}\right)$ and depth $z_{\mathrm{cam}}$

Consequently, if several projected points share the same pixel coordinates but have different depth values, some of them may occlude others. To determine which points are occluded, we first use Delaunay triangulation to reconstruct a polygon mesh, and then remove the points that belong to hidden surfaces, which are identified by z-buffering.
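
The depth test can be illustrated with a deliberately simplified, point-level z-buffer. Note the assumption: this sketch skips the Delaunay mesh reconstruction entirely and only compares points that fall into the same pixel, so it is not the procedure used in the paper. The function name `visible_mask` and the tolerance `depth_tol` are hypothetical.

```python
import numpy as np

def visible_mask(cam_points, f=1000.0, depth_tol=1e-6):
    """Keep, for every pixel, only the points at the smallest depth.

    Assumes all points lie in front of the camera (z > 0).
    """
    x, y, z = cam_points[:, 0], cam_points[:, 1], cam_points[:, 2]
    # 2D pixel coordinates (f * x / z, f * y / z), rounded to integer pixels
    u = np.rint(f * x / z).astype(np.int64)
    v = np.rint(f * y / z).astype(np.int64)
    zbuffer = {}
    for pixel, depth in zip(zip(u, v), z):
        zbuffer[pixel] = min(depth, zbuffer.get(pixel, np.inf))
    nearest = np.array([zbuffer[pixel] for pixel in zip(u, v)])
    return z <= nearest + depth_tol

cam_points = np.array([[0.0, 0.0, 1.0],    # nearest point in pixel (0, 0)
                       [0.0, 0.0, 2.0],    # same pixel, farther away: occluded
                       [0.5, 0.0, 1.0]])   # a different pixel: visible
mask = visible_mask(cam_points)
```

A mesh-based z-buffer would also remove points hidden behind a surface that has no sample in exactly the same pixel, which this point-level version cannot do.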

Mapping back from camera frame to world frame

Once the occluded points are removed, the remaining points are projected back into the original world coordinate system using the inverse of the transformation in formula 1. The randomised occlusion mapping $o(\cdot)$ is therefore constructed as follows:

Fix an initial point cloud $\mathcal{P}$
Take the camera intrinsic matrix $\mathbf{K}$ and the extrinsic parameters of multiple viewpoints $\left[\left[\mathbf{R}_{1} \mid \mathbf{t}_{1}\right], \ldots,\left[\mathbf{R}_{V} \mid \mathbf{t}_{V}\right]\right]$, where $V$ is the number of viewpoints
For each viewpoint $v \in[V]$, use formula 1 to project $\mathcal{P}$ into the corresponding camera coordinate system
Find the occluded points and remove them
Project the remaining points back into the world coordinate system, which gives the final occluded point cloud $\tilde{\mathcal{P}}_{v}$ for each viewpoint $v \in[V]$
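
The construction steps above can be strung together in a short sketch. Several assumptions are made for illustration only: the intrinsic matrix is dropped and only the rigid transform $[\mathbf{R} \mid \mathbf{t}]$ is applied, visibility is decided by a point-level depth test instead of a mesh-based z-buffer, and the translation $(0, 0, 3)$ is chosen so that all toy points lie in front of the camera. The names `rotation_about_y` and `occlude` are hypothetical.

```python
import numpy as np

def rotation_about_y(angle):
    """Rotation matrix for a rotation of `angle` radians about the y axis."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def occlude(points, R, t, f=1000.0, depth_tol=1e-6):
    """Return the points of `points` that stay visible from viewpoint [R | t]."""
    cam = points @ R.T + t                                  # world -> camera frame
    u = np.rint(f * cam[:, 0] / cam[:, 2]).astype(np.int64)
    v = np.rint(f * cam[:, 1] / cam[:, 2]).astype(np.int64)
    nearest = {}
    for key, depth in zip(zip(u, v), cam[:, 2]):
        nearest[key] = min(depth, nearest.get(key, np.inf))
    keep = np.array([cam[i, 2] <= nearest[(u[i], v[i])] + depth_tol
                     for i in range(len(cam))])
    return (cam[keep] - t) @ R                              # camera -> world frame

points = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
t = np.array([0.0, 0.0, 3.0])                    # assumption: keeps z_cam > 0
views = [rotation_about_y(0.0), rotation_about_y(np.pi / 2)]   # V = 2 viewpoints
occluded_clouds = [occlude(points, R, t) for R in views]
```

Each viewpoint hides a different point of the toy cloud, so one complete shape yields several distinct occluded inputs.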

The Completion Task

Given an occluded point cloud $\tilde{\mathcal{P}}$ produced by the occlusion mapping $o(\cdot)$, the goal of the completion task is to learn a completion mapping $c: \mathbb{P} \rightarrow \mathbb{P}$ that maps $\tilde{\mathcal{P}}$ to a completed point cloud $\hat{\mathcal{P}}$. The completion mapping is accurate if $\mathbb{E}_{\tilde{\mathcal{P}} \sim o(\mathcal{P})} \ell(c(\tilde{\mathcal{P}}), \mathcal{P}) \rightarrow 0$, where $\ell(\cdot, \cdot)$ is a loss function. The completion model is an encoder-decoder network: the encoder maps the occluded point cloud to a vector, and the decoder completes the point cloud. After pre-training, the encoder weights can be used as the initialization for downstream tasks.

Experiments

OcCo Pre-Training Setup

All experiments use ModelNet40 as the pre-training dataset. The camera intrinsic parameters are set to $\left\{ {f=1000,\gamma=0,w=1600,h=1200} \right\}$. For each point cloud, 10 viewpoints are selected at random; the viewpoints differ in rotation, and the translation is set to 0.
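
A minimal sketch of this viewpoint setup is shown below. How the rotations are actually sampled is not stated in these notes, so drawing them from the QR decomposition of a Gaussian matrix is an assumption, and `random_rotation` is a hypothetical helper.

```python
import numpy as np

def random_rotation(rng):
    """Sample a random 3x3 rotation matrix (assumed sampling scheme)."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))      # fix column signs so the factorisation is unique
    if np.linalg.det(q) < 0:         # flip one axis so that det(q) = +1
        q[:, 0] = -q[:, 0]
    return q

rng = np.random.default_rng(0)
intrinsics = {"f": 1000, "gamma": 0, "w": 1600, "h": 1200}   # values from the text
# 10 viewpoints per point cloud: random rotation, zero translation
viewpoints = [(random_rotation(rng), np.zeros(3)) for _ in range(10)]
```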

In the completion model, the encoder can be PointNet, PCN or DGCNN. The decoder uses a folding operation and reconstructs the shape in two steps. It first converts the 1024-dimensional vector encoding the occluded point cloud into a coarse point cloud $\hat{\mathcal{P}}_{\text {coarse }}$ of 1024 points. It then applies a $4 \times 4$ 2D grid to every point in $\hat{\mathcal{P}}_{\text {coarse }}$ to reconstruct a fine shape $\hat{\mathcal{P}}_{\text {fine }}$ with 16384 points ($1024 \times 16$). Chamfer Distance (CD) is used as the loss function between the prediction $\hat{\mathcal{P}}$ and the ground truth $\mathcal{P}$:
$\begin{aligned} \mathrm{CD}(\hat{\mathcal{P}}, \mathcal{P}) &= \frac{1}{|\hat{\mathcal{P}}|} \sum_{\hat{x} \in \hat{\mathcal{P}}} \min _{x \in \mathcal{P}}\|\hat{x}-x\|_{2}+\frac{1}{|\mathcal{P}|} \sum_{x \in \mathcal{P}} \min _{\hat{x} \in \hat{\mathcal{P}}}\|x-\hat{x}\|_{2} \end{aligned}$
The final model loss is the weighted sum of the Chamfer distances of the coarse and fine shapes:
$\ell:=\operatorname{CD}\left(\hat{\mathcal{P}}_{\text {coarse }}, \mathcal{P}_{\text {coarse }}\right)+\alpha \mathrm{CD}\left(\hat{\mathcal{P}}_{\text {fine }}, \mathcal{P}_{\text {fine }}\right)$
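
The two formulas above translate directly into NumPy. This is a brute-force sketch that builds the full pairwise distance matrix, which is fine for toy inputs but not for large point clouds. The function names are hypothetical, and the value of `alpha` used here is chosen arbitrarily, since the weight is not given in these notes.

```python
import numpy as np

def chamfer_distance(pred, target):
    """Symmetric Chamfer Distance between point sets of shape (n, 3) and (m, 3)."""
    diff = pred[:, None, :] - target[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)              # (n, m) pairwise L2 distances
    # mean nearest-neighbour distance in both directions
    return dist.min(axis=1).mean() + dist.min(axis=0).mean()

def completion_loss(pred_coarse, gt_coarse, pred_fine, gt_fine, alpha):
    """Coarse CD plus alpha times fine CD."""
    return (chamfer_distance(pred_coarse, gt_coarse)
            + alpha * chamfer_distance(pred_fine, gt_fine))

pred = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
target = np.array([[0.0, 0.0, 0.0]])
cd = chamfer_distance(pred, target)                   # (0 + 2) / 2 + 0 = 1.0
loss = completion_loss(pred, target, pred, target, alpha=0.5)   # 1.0 + 0.5 * 1.0
```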

Fine-Tuning Setup

Few-shot learning

The goal of few-shot learning is to train accurate models from very limited data. During training, $K$ classes are selected at random, and each class contains $N$ samples.
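
The sampling of such a $K$-way $N$-shot training set might look like the sketch below. The function `sample_episode` and the toy label array are hypothetical and only illustrate the selection of $K$ classes with $N$ samples each.

```python
import numpy as np

def sample_episode(labels, k_way, n_shot, rng):
    """Pick k_way classes at random, then n_shot sample indices for each class."""
    classes = rng.choice(np.unique(labels), size=k_way, replace=False)
    indices = [rng.choice(np.flatnonzero(labels == c), size=n_shot, replace=False)
               for c in classes]
    return classes, np.concatenate(indices)

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(40), 20)      # toy labels: 40 classes, 20 samples each
classes, support = sample_episode(labels, k_way=5, n_shot=10, rng=rng)
```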

Object classification

Part segmentation

Semantic segmentation

Analysis

Visualisation of optimisation landscape

Visualisation of learned features

Unsupervised mutual information probe

Detection of semantic concepts

Discussion

In the future, it would be interesting to design a completion model that explicitly takes the occlusion steps into account. Such a model might converge faster and need fewer parameters, and the explicit occlusion structure could also act as a strong bias during training.