MV3D: Multi-View 3D Object Detection Network
I. Preface
There are currently two main approaches to 3D detection on point clouds. The first takes the 3D point cloud as input directly, either feeding it straight into a convolutional network or voxelizing it first. The second maps the 3D point cloud to 2D, mainly a bird's-eye view or a front view. In general, the first approach preserves rich information for detection but is computationally expensive; the second, if handled properly, is relatively cheap but loses information.
MV3D adopts the second approach, but to compensate for the information loss it also fuses the front view and the RGB image as a correction.
The paper : https://arxiv.org/abs/1611.07759
II. Overall Idea
The overall architecture is shown in the figure below:
The region proposal network (RPN) has become a key component of high-accuracy object detectors, and MV3D is built on the RPN architecture. As the figure shows, the network consists of two main parts: the 3D Proposal Network and the Region-based Fusion Network. There are three types of network input: the bird's-eye view (BV), the front view (FV), and the RGB image. The convolutional features of each view are combined with the 3D proposals through ROI pooling and fused, and the fused features are then used to predict the 3D bounding boxes.
III. Algorithm Analysis
1. 3D Proposal Network

The network takes three types of input: the bird's-eye view (BV), the front view (FV), and the RGB image. The main ideas are:
- The convolutional backbone is VGG-16 with the last pooling layer removed, so the convolutional part downsamples by 8×.
- The 3D proposals are generated from the bird's-eye view, because projecting onto the bird's-eye view better preserves object sizes, and the small variation in the vertical direction makes it easier to obtain accurate 3D bounding boxes.
- In addition, to handle very small objects, high-resolution feature maps are obtained by feature approximation. Specifically, a 2× bilinear upsampling layer is inserted before the last convolutional feature map is fed into the 3D proposal network. Similarly, 4×/4×/2× upsampling layers are inserted before ROI pooling in the BV/FV/RGB branches.
- The generated 3D proposals are projected onto each view and fed, together with that view's convolutional features, into ROI pooling. Features from different views/modalities usually have different resolutions, so ROI pooling is applied per view to obtain feature vectors of equal length (see the sketch below).
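As a rough illustration of the per-view ROI pooling step, the sketch below (my own simplification, not the authors' code) pools a fixed-size feature vector for hypothetical proposals already projected onto one view, using torchvision's `roi_align` as a stand-in for ROI pooling; the function name and the `spatial_scale` value are assumptions.

```python
import torch
from torchvision.ops import roi_align


def pool_view_features(feature_map, boxes_2d, out_size=7, spatial_scale=0.25):
    """Pool an equal-length feature vector for each projected proposal.

    feature_map: (1, C, H, W) convolutional features of one view (BV/FV/RGB).
    boxes_2d:    (N, 4) float tensor of proposals projected to this view, (x1, y1, x2, y2) in pixels.
    spatial_scale: feature-map stride relative to the input (assumed 1/4 here).
    """
    n = boxes_2d.shape[0]
    batch_idx = torch.zeros((n, 1), dtype=boxes_2d.dtype)
    rois = torch.cat([batch_idx, boxes_2d], dim=1)            # (N, 5): (batch, x1, y1, x2, y2)
    pooled = roi_align(feature_map, rois, output_size=out_size,
                       spatial_scale=spatial_scale)            # (N, C, out_size, out_size)
    return pooled.flatten(start_dim=1)                         # equal-length vector per proposal


# Hypothetical usage: one call per view; the resulting equal-length vectors are then fused.
# bv_vec  = pool_view_features(bv_feat,  proposals_in_bv)
# fv_vec  = pool_view_features(fv_feat,  proposals_in_fv)
# rgb_vec = pool_view_features(rgb_feat, proposals_in_rgb)
```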
Bird's-eye view representation
The bird's-eye view is encoded by height, intensity, and density; the projected point cloud is discretized into a 2D grid with a resolution of 0.1 m.
- For each grid cell, the height feature is the maximum height of the points in that cell. To encode more detailed height information, the point cloud is divided equally into m slices, and a height map is computed for each slice, giving m height maps.
- The intensity feature is the reflectance value of the point with the maximum height in each cell, computed over the whole point cloud.
- The density feature is the number of points in each cell, also computed over the whole point cloud.
In total, the bird's-eye view is encoded as (m + 2)-channel features.
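To make the (m + 2)-channel encoding concrete, here is a minimal NumPy sketch; the detection ranges and the choice of m are assumed values, and the density normalization follows the log-based formula given in the paper.

```python
import numpy as np


def encode_birds_eye_view(points, m=4, res=0.1,
                          x_range=(0.0, 70.4), y_range=(-40.0, 40.0), z_range=(-2.0, 1.25)):
    """Encode a LIDAR point cloud as (m + 2)-channel BEV maps:
    m height-slice maps, one intensity map, one density map.

    points: (N, 4) array of (x, y, z, reflectance). Ranges above are assumptions.
    """
    H = int((x_range[1] - x_range[0]) / res)
    W = int((y_range[1] - y_range[0]) / res)
    height_maps = np.zeros((m, H, W), dtype=np.float32)
    intensity = np.zeros((H, W), dtype=np.float32)
    density = np.zeros((H, W), dtype=np.float32)
    top_height = np.full((H, W), -np.inf, dtype=np.float32)
    slice_h = (z_range[1] - z_range[0]) / m

    for x, y, z, refl in points:
        if not (x_range[0] <= x < x_range[1] and
                y_range[0] <= y < y_range[1] and
                z_range[0] <= z < z_range[1]):
            continue
        r = int((x - x_range[0]) / res)
        c = int((y - y_range[0]) / res)
        s = int((z - z_range[0]) / slice_h)                     # which of the m height slices
        height_maps[s, r, c] = max(height_maps[s, r, c], z - z_range[0])  # max height in slice
        density[r, c] += 1                                      # point count per cell
        if z > top_height[r, c]:                                # reflectance of the highest point
            top_height[r, c] = z
            intensity[r, c] = refl

    density = np.minimum(1.0, np.log(density + 1) / np.log(64))  # normalization used in the paper
    return np.concatenate([height_maps, intensity[None], density[None]], axis=0)
```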
Front view representation
The front view representation provides complementary information to the bird's-eye view. Because LIDAR point clouds are very sparse, projecting them onto the image plane produces a sparse 2D point map. Instead, the points are projected onto a cylindrical surface to generate a dense front view map.
Let a point in the point cloud be $p = (x, y, z)$, and let its front-view coordinates be $p_{fv} = (r, c)$. The mapping between the two is:

- $c = \left\lfloor \dfrac{\arctan(y, x)}{\Delta\theta} \right\rfloor$
- $r = \left\lfloor \dfrac{\arctan\!\big(z, \sqrt{x^2 + y^2}\big)}{\Delta\phi} \right\rfloor$

where $\Delta\theta$ and $\Delta\phi$ are the horizontal and vertical resolutions of the laser beams, respectively.
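A minimal sketch of the cylindrical projection above; the angular resolutions used here are assumed example values, since the actual $\Delta\theta$ and $\Delta\phi$ depend on the LIDAR.

```python
import numpy as np


def project_to_front_view(points, d_theta=np.radians(0.2), d_phi=np.radians(0.4)):
    """Map LIDAR points onto the cylindrical front-view grid using the formulas above.

    points: (N, 3) array of (x, y, z). Returns integer (row, col) bin indices,
    which may still need shifting to start from zero before rasterizing.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    c = np.floor(np.arctan2(y, x) / d_theta).astype(np.int64)                       # azimuth bin
    r = np.floor(np.arctan2(z, np.sqrt(x ** 2 + y ** 2)) / d_phi).astype(np.int64)  # elevation bin
    return r, c
```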
A rendering of the resulting front view map is shown below:
2. Region-based Fusion Network
A region-based fusion network is designed to effectively combine features from multiple views, jointly classifying the object proposals and regressing oriented 3D bounding boxes.
To combine information from the different features, a deep fusion scheme is adopted to fuse the multi-view features. The paper also compares this deep fusion network against early- and late-fusion architectures.
For a network with $L$ layers, early fusion combines the multi-view features $\{f_v\}$ at the input stage:

$$f_L = H_L\big(H_{L-1}\big(\cdots H_1(f_{BV} \oplus f_{FV} \oplus f_{RGB})\big)\big)$$

where $\{H_l,\ l = 1, \cdots, L\}$ are feature transformation functions and $\oplus$ is a join operation (e.g., concatenation or summation).
In comparison, late fusion uses separate subnetworks to learn the feature transformations independently and combines their outputs at the prediction stage:

$$f_L = f_L^{BV} \oplus f_L^{FV} \oplus f_L^{RGB}, \qquad f_L^{v} = H_L^{v}\big(H_{L-1}^{v}\big(\cdots H_1^{v}(f_v)\big)\big)$$
The deep fusion designed in the paper interleaves per-view transformations with the join:

$$f_0 = f_{BV} \oplus f_{FV} \oplus f_{RGB}$$
$$f_l = H_l^{BV}(f_{l-1}) \oplus H_l^{FV}(f_{l-1}) \oplus H_l^{RGB}(f_{l-1}), \quad \forall\, l = 1, \cdots, L$$

The element-wise mean is used as the join operation for deep fusion, because it is more flexible when combined with drop-path training.
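A minimal PyTorch sketch of one deep-fusion layer under these equations; the layer width and the use of fully connected transformations are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class DeepFusionBlock(nn.Module):
    """One deep-fusion layer: each view has its own transformation H_l, and the
    outputs are joined by an element-wise mean."""

    def __init__(self, dim):
        super().__init__()
        self.h_bv = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.h_fv = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.h_rgb = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())

    def forward(self, f):
        # f_l = H_l^BV(f_{l-1}) (+) H_l^FV(f_{l-1}) (+) H_l^RGB(f_{l-1}), with (+) = element-wise mean
        return (self.h_bv(f) + self.h_fv(f) + self.h_rgb(f)) / 3.0


# The fused input f_0 would be the element-wise mean of the three pooled view features,
# and several DeepFusionBlock layers can be stacked on top of it.
```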
3. 3D Bounding Box Regression
Given the fused features of the multi-view network, the network regresses from the 3D proposals to oriented 3D bounding boxes. In particular, the regression targets are the 8 corners of the 3D bounding box: $t = (\Delta x_0, \cdots, \Delta x_7, \Delta y_0, \cdots, \Delta y_7, \Delta z_0, \cdots, \Delta z_7)$, encoded as the corner offsets normalized by the diagonal length of the proposal box. Although this 24-D vector representation is redundant, it was found to work better than the center-and-size encoding.
The paper also notes that the object's orientation can be computed from the predicted 3D box corners (the calculation is not given).
A multi-task loss is used to jointly predict the object category and the 3D box. In the proposal network, the classification loss is cross entropy, and the 3D box loss is smooth L1.
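A minimal sketch of this multi-task objective, assuming label 0 denotes background, a 24-D corner-offset regression target, and an assumed weighting parameter `box_weight`.

```python
import torch
import torch.nn.functional as F


def multi_task_loss(cls_logits, labels, box_pred, box_target, box_weight=1.0):
    """Joint objective: cross entropy on the category plus smooth L1 on the
    24-D normalized corner offsets, applied only to positive proposals."""
    cls_loss = F.cross_entropy(cls_logits, labels)
    pos = labels > 0                                    # label 0 = background (assumption)
    if pos.any():
        reg_loss = F.smooth_l1_loss(box_pred[pos], box_target[pos])
    else:
        reg_loss = box_pred.sum() * 0.0                 # keeps the graph valid with no positives
    return cls_loss + box_weight * reg_loss
```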
During training, positive/negative ROIs are determined by the bird's-eye-view IoU overlap: a 3D proposal is considered positive if its BEV IoU overlap is greater than 0.5. At inference time, NMS is applied to the 3D boxes after 3D box regression: the 3D boxes are projected onto the bird's-eye view to compute their IoU overlap, and an IoU threshold of 0.05 is used to remove redundant boxes, which ensures that objects cannot occupy the same space in the bird's-eye view.
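A minimal sketch of the bird's-eye-view NMS step, assuming axis-aligned BEV boxes for simplicity (the regressed boxes are actually oriented, so the paper's IoU computation differs).

```python
import numpy as np


def bev_iou(box_a, boxes):
    """Axis-aligned IoU between one BEV box and many; boxes given as (x1, y1, x2, y2)."""
    x1 = np.maximum(box_a[0], boxes[:, 0]); y1 = np.maximum(box_a[1], boxes[:, 1])
    x2 = np.minimum(box_a[2], boxes[:, 2]); y2 = np.minimum(box_a[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)


def bev_nms(boxes, scores, iou_thresh=0.05):
    """Greedy NMS on boxes projected to the bird's-eye view; returns kept indices."""
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        ious = bev_iou(boxes[i], boxes[order[1:]])
        order = order[1:][ious <= iou_thresh]          # drop boxes overlapping the kept one
    return keep
```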
4. Network Regularization
Compared with 2D networks, regularization effectively prevents the network from overfitting and keeps training effective. Two methods are used to regularize the region-based fusion network: drop-path training and auxiliary losses.
At each iteration, either a global drop-path or a local drop-path is chosen, each with probability 50%. If the global drop-path is chosen, a single view is selected from the three views and used everywhere. If the local drop-path is chosen, each input path of every join node is dropped with probability 50%, while ensuring that at least one input path is kept for each join node.
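A minimal sketch of the drop-path selection described above; the per-join bookkeeping (`num_joins`) is an assumption about how it would be wired into the fusion layers.

```python
import random


def choose_drop_paths(num_joins, views=("bv", "fv", "rgb")):
    """For one training iteration, decide which input paths each fusion join keeps.

    With probability 0.5 a global drop-path is used: a single randomly chosen view
    feeds every join. Otherwise, each join drops each input path with probability 0.5,
    but always keeps at least one.
    """
    if random.random() < 0.5:                                   # global drop-path
        view = random.choice(views)
        return [{view} for _ in range(num_joins)]
    kept_per_join = []
    for _ in range(num_joins):                                  # local drop-path, per join node
        kept = {v for v in views if random.random() >= 0.5}
        if not kept:                                            # keep at least one input path
            kept.add(random.choice(views))
        kept_per_join.append(kept)
    return kept_per_join
```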
To further strengthen the representational power of each view, auxiliary paths and losses are added to the network.
During training, three auxiliary paths and losses are added at the bottom of the network to regularize it. The auxiliary layers share weights with the corresponding layers in the main network.
Note: during inference, these auxiliary paths are removed.
Results

References:
- Autonomous driving | 3D object detection: MV3D-Net (Part 1)
- [Paper notes] MV3D-Net: a multi-view 3D object detection network for autonomous driving