
YOLO3D: End-to-End Real-Time 3D Detection from Point Cloud Input

2022-07-02 03:27:00 【Fireworks at dawn in the city】


Preface
Algorithm analysis

Preface

YOLO3D applies YOLO to 3D object detection on point clouds. It is similar to Complex-YOLO. The difference is that YOLO3D extends the YOLO v2 loss function to include the yaw angle and the 3D box center in Cartesian coordinates, and regresses the box height directly.

Paper: https://arxiv.org/abs/1808.02350

Algorithm analysis

Model input

The paper projects the 3D point cloud onto a bird's-eye-view grid and creates two grid maps from it, as shown in the figure.
[Figure: the two bird's-eye-view grid maps]
The first map holds the maximum height: the value of each grid cell (pixel) is the height of the highest point associated with that cell. The second map holds the point density, computed in the same way as in MV3D.

Network structure

The network architecture is based on YOLO v2, with some changes.
[Figure: network architecture]

One max-pooling layer is modified so that the down-sampling factor drops from 32 to 16. The resulting larger output grid helps detect small objects such as pedestrians and cyclists.
The skip connection is removed from the model, because it led to less accurate results.
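To make the effect of the down-sampling change concrete, here is a minimal arithmetic sketch. It assumes the 608x608 input at 0.1 m per pixel described in the dataset section; the function name is hypothetical and not taken from the paper.

```python
def output_grid_size(input_size: int, downsample: int) -> int:
    """Number of output grid cells per side for a square input."""
    if input_size % downsample != 0:
        raise ValueError("input size must be divisible by the down-sampling factor")
    return input_size // downsample


INPUT_SIZE = 608   # pixels per side (from the dataset section)
RESOLUTION = 0.1   # metres per pixel (from the dataset section)

grid_32 = output_grid_size(INPUT_SIZE, 32)   # 19 cells per side
grid_16 = output_grid_size(INPUT_SIZE, 16)   # 38 cells per side

# Side length of one output cell in metres: 3.2 m versus 1.6 m.
cell_m_32 = 32 * RESOLUTION
cell_m_16 = 16 * RESOLUTION
```

Halving the down-sampling factor doubles the number of cells per side, so each cell covers a smaller patch of ground, which is consistent with the stated benefit for small objects.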

Regression loss

3D box regression

The paper adds two regression terms to the original YOLO v2 in order to produce 3D bounding boxes: the z coordinate of the box center and the box height. The z coordinate is regressed in the same way as x and y, through a sigmoid activation function.
Note that x and y are regressed by predicting a value between 0 and 1 that locates the point within its grid cell, whereas z is mapped to a single vertical grid cell only, as shown in the figure below. The reason for mapping z to one grid cell while x and y are mapped across multiple grid cells is that values along the z dimension vary much less than values along x and y (most objects have very similar box elevations).
[Figure: z mapped to a single vertical grid cell]
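The regression described above can be sketched as follows. The text only confirms that x, y, and z go through a sigmoid and that z uses a single vertical cell. The exponential scaling of anchor dimensions for width, length, and height is the YOLO v2 convention and is assumed here to carry over. The function names and anchor values are hypothetical.

```python
import math


def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(t, cell_xy, anchor_wlh):
    """Decode raw network outputs into a box, in grid units (illustrative only).

    t          : (tx, ty, tz, tw, tl, th) raw outputs for one box
    cell_xy    : (cx, cy) index of the grid cell that owns the box
    anchor_wlh : (pw, pl, ph) anchor dimensions (hypothetical values)
    """
    tx, ty, tz, tw, tl, th = t
    cx, cy = cell_xy
    pw, pl, ph = anchor_wlh
    x = cx + sigmoid(tx)   # offset in [0, 1] inside the owning cell
    y = cy + sigmoid(ty)
    z = sigmoid(tz)        # one vertical cell, so there is no cell offset
    w = pw * math.exp(tw)  # assumed YOLO v2 style anchor scaling
    l = pl * math.exp(tl)
    h = ph * math.exp(th)
    return x, y, z, w, l, h


# All-zero outputs place the box at the cell center with the anchor's size.
box = decode_box((0.0,) * 6, (3, 5), (1.6, 3.9, 1.5))
```

Because there is only one cell along the vertical axis, the z output is a fraction of the full height range rather than an offset inside one of many cells.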

Yaw angle regression

The paper defines the orientation of the bounding box over the range -π to π. This range is normalized to -1 to 1, and the model is set up to predict the box orientation directly as a single regressed number. In the loss function, the yaw term is the mean squared error between the ground-truth angle and the predicted angle, that is, the squared difference $(\phi_i - \hat{\phi}_i)^2$ averaged over boxes.
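A minimal sketch of the yaw normalization and its mean squared error, assuming the ground-truth angles are given in radians and the network already outputs values in the normalized range. The function names are hypothetical.

```python
import math


def normalize_yaw(yaw_rad: float) -> float:
    """Map an angle in [-pi, pi] to [-1, 1]."""
    if not -math.pi <= yaw_rad <= math.pi:
        raise ValueError("yaw must lie in [-pi, pi]")
    return yaw_rad / math.pi


def yaw_mse(targets_rad, predictions_norm) -> float:
    """Mean squared error between normalized targets and predictions."""
    if len(targets_rad) != len(predictions_norm) or not targets_rad:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [
        (normalize_yaw(t) - p) ** 2
        for t, p in zip(targets_rad, predictions_norm)
    ]
    return sum(errors) / len(errors)


# A target of 0 rad against a prediction of 0.5 gives an error of 0.25.
example_loss = yaw_mse([0.0], [0.5])
```

One known limitation of regressing a single number is that -π and π describe the same heading but sit at opposite ends of the normalized range, so predictions near the wrap-around can be penalized heavily.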

Bounding box loss function

The 3D box loss is an extension of the original YOLO 2D box loss. The yaw term is computed as described above. The height loss extends the loss on width and length, and likewise the loss on the z coordinate extends the loss on the x and y coordinates.
[Equation: 3D bounding-box loss function]

$\lambda_{coor}$: weight assigned to the coordinate loss,
$\lambda_{conf}$: weight assigned to the prediction confidence loss,
$\lambda_{yaw}$: weight assigned to the yaw (orientation) loss,
$\lambda_{classes}$: weight assigned to the class probability loss,
$L^{obj}_{ij}$: an indicator variable that takes the value 0 or 1 depending on whether a ground-truth box exists at location $(i, j)$: 1 if there is a box, 0 otherwise,
$L^{noobj}_{ij}$: the opposite of the previous variable: 1 if there is no object, 0 otherwise,
$x_i, y_i, z_i$: ground-truth coordinates,
$\hat{x}_i, \hat{y}_i, \hat{z}_i$: predicted coordinates,
$\phi_i, \hat{\phi}_i$: ground-truth and predicted orientation angle,
$C_i, \hat{C}_i$: ground-truth and predicted confidence,
$w_i, l_i, h_i$: ground-truth box width, length, and height,
$\hat{w}_i, \hat{l}_i, \hat{h}_i$: predicted box width, length, and height,
$p_i(c), \hat{p}_i(c)$: ground-truth and predicted class probabilities.
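For orientation, the terms listed above could combine as follows. This is a sketch, not a transcription of the paper's equation. It assumes the loss keeps the sum-of-squared-errors form of the original YOLO loss, including the square roots on box dimensions, and it assumes that $i$ runs over grid cells and $j$ over the boxes predicted per cell.

```latex
\begin{aligned}
L ={} & \lambda_{coor} \sum_{i} \sum_{j} L^{obj}_{ij}
        \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 + (z_i - \hat{z}_i)^2 \right] \\
      & + \lambda_{coor} \sum_{i} \sum_{j} L^{obj}_{ij}
        \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
             + \left(\sqrt{l_i} - \sqrt{\hat{l}_i}\right)^2
             + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
      & + \lambda_{yaw} \sum_{i} \sum_{j} L^{obj}_{ij} \, (\phi_i - \hat{\phi}_i)^2 \\
      & + \lambda_{conf} \sum_{i} \sum_{j} L^{obj}_{ij} \, (C_i - \hat{C}_i)^2
        + \lambda_{conf} \sum_{i} \sum_{j} L^{noobj}_{ij} \, (C_i - \hat{C}_i)^2 \\
      & + \lambda_{classes} \sum_{i} \sum_{j} L^{obj}_{ij}
        \sum_{c} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The z and height terms sit inside the same brackets as their 2D counterparts, which matches the description of the 3D loss as an extension of the 2D box loss.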

Dataset processing

The paper uses the KITTI benchmark dataset. The point cloud is projected into 2D space as a bird's-eye-view grid at a resolution of 0.1 m per pixel, the same resolution used by MV3D.

The grid maps cover the LiDAR space 30.4 m to the right, 30.4 m to the left, and 60.8 m forward. At the 0.1 m resolution above, this range gives each channel an input shape of 608x608.

Heights in LiDAR space are clipped to between -2 m and +2 m and scaled to the range 0 to 255, so that they can be stored as pixel values in the maximum-height channel.
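The preprocessing above can be sketched with NumPy as follows. Several details are assumptions rather than statements from the paper: the axis convention (x forward, y to the left, z up, as in KITTI Velodyne scans), clamping out-of-range heights instead of discarding those points, leaving empty cells at 0, and the MV3D density normalization min(1, log(N + 1) / log(64)). The function and constant names are hypothetical.

```python
import numpy as np

RES = 0.1                  # metres per pixel (from the text)
X_RANGE = (0.0, 60.8)      # forward extent in metres (from the text)
Y_RANGE = (-30.4, 30.4)    # right/left extent in metres (from the text)
Z_RANGE = (-2.0, 2.0)      # height clipping range in metres (from the text)
GRID = 608                 # cells per side: 60.8 m at 0.1 m per pixel


def point_cloud_to_bev(points):
    """Build (max_height, density) maps from an (N, 3) array of x, y, z."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    keep = ((x >= X_RANGE[0]) & (x < X_RANGE[1]) &
            (y >= Y_RANGE[0]) & (y < Y_RANGE[1]))
    x, y = x[keep], y[keep]
    z = np.clip(z[keep], Z_RANGE[0], Z_RANGE[1])  # assumption: clamp heights

    rows = np.minimum(((x - X_RANGE[0]) / RES).astype(np.int64), GRID - 1)
    cols = np.minimum(((y - Y_RANGE[0]) / RES).astype(np.int64), GRID - 1)

    # Maximum-height channel: scale [-2 m, 2 m] to [0, 255], keep the max per cell.
    scaled = (z - Z_RANGE[0]) / (Z_RANGE[1] - Z_RANGE[0]) * 255.0
    max_height = np.zeros((GRID, GRID), dtype=np.float64)
    np.maximum.at(max_height, (rows, cols), scaled)

    # Density channel: count points per cell, then normalize as in MV3D.
    counts = np.zeros((GRID, GRID), dtype=np.float64)
    np.add.at(counts, (rows, cols), 1.0)
    density = np.minimum(1.0, np.log(counts + 1.0) / np.log(64.0))
    return max_height, density


# Two points fall in the same cell; the third lies beyond the forward range.
demo_points = np.array([[10.05, 0.05, 0.0],
                        [10.05, 0.05, 2.0],
                        [70.00, 0.00, 0.0]])
demo_height, demo_density = point_cloud_to_bev(demo_points)
```

`np.maximum.at` and `np.add.at` are used because plain fancy-index assignment would only keep one of several points that land in the same cell.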

Training

The network is trained end to end using stochastic gradient descent with a momentum of 0.9 and a weight decay of 0.0005. Training runs for 150 epochs with a batch size of 4.
During the first few epochs, the learning rate is slowly raised from 0.00001 to 0.0001, because starting with a high learning rate usually makes the model diverge due to unstable gradients. Training then continues at 0.0001 for 90 epochs, at 0.0005 for 30 epochs, and finally at 0.00005 for 20 epochs.
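The schedule above can be written as a small function. The learning-rate values are taken as stated in the text. The warm-up length of 10 epochs is an assumption, inferred from 150 - 90 - 30 - 20, and the linear shape of the warm-up is also an assumption. The function name is hypothetical.

```python
def learning_rate(epoch: int, warmup_epochs: int = 10) -> float:
    """Learning rate for a 0-based epoch index, following the stated schedule."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    if epoch < warmup_epochs:
        # Assumed linear warm-up from 1e-5 towards 1e-4.
        start, end = 1e-5, 1e-4
        return start + (end - start) * epoch / warmup_epochs
    if epoch < warmup_epochs + 90:
        return 1e-4
    if epoch < warmup_epochs + 90 + 30:
        return 5e-4
    return 5e-5


# Learning rate for each of the 150 training epochs.
schedule = [learning_rate(e) for e in range(150)]
```

A function of this shape can be plugged into most training loops by setting the optimizer's learning rate at the start of every epoch.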