当前位置：网站首页>Learning notes 25 - multi sensor front fusion technology

Learning notes 25 - multi sensor front fusion technology

2022-07-02 01:17:00 【FUXI_ Willard】

This blog series includes 6 A column , Respectively ：《 Overview of autopilot Technology 》、《 Technical foundation of autopilot vehicle platform 》、《 Autopilot positioning technology 》、《 Self driving vehicle environment perception 》、《 Decision and control of autonomous driving vehicle 》、《 Design and application of automatic driving system 》, The author is not an expert in the field of automatic driving , Just a little white on the road of exploring automatic driving , This series has not been read , It is also thinking while reading and summarizing , Welcome to friends , Please give your suggestions in the comments area , Help the author pick out the mistakes , thank you ！
This column is about 《 Self driving vehicle environment perception 》 Book notes .

3. Multisensor front fusion technology

Pre fusion technology ： At the level of raw data , Directly fuse the data information of all sensors , Then realize the perception function according to the fused data information , Output a detection target of the result layer .

Common fusion methods based on Neural Network , Such as ：MV3D(Multi-View 3D Object Detection)、AVOD(Aggregate View Object Detection)、F-PointNet(Frustum PointNets for 3D Object Detection) etc. .

3.1 MV3D

MV3D Point cloud data detected by lidar and captured by visible light camera RGB Image fusion , The input data is an aerial view of the laser radar projection (LIDAR bird view)、 Front view (LIDAR front view) And two dimensions RGB Images , Its network structure mainly includes three-dimensional area generation network (3D proposal network) And region based converged networks (region-based fusion network), Use deep fusion (deep fusion) The way to integrate , Here's the picture ：

LIDAR point cloud data is a collection of disordered data points , Before processing point cloud data with the designed neural network model , In order to retain the information of 3D point cloud data more effectively , Easy to handle ,MV3D Project the point cloud data to a specific two-dimensional plane , Get an aerial view and front view .
3D proposal network, Be similar to Faster-RCNN Detect the area generation network applied in the model (Region Proposal Network,RPN), And promote it in three dimensions , One of the functions realized is to generate the three-dimensional candidate box of the target ; This part of the function is completed in the aerial view , There is less occlusion of each target in the aerial view , The efficiency of candidate box extraction is the best .
After extracting the candidate box , Map to three kinds of graphs respectively , Get their respective regions of interest (Region of Interest,ROI), Get into region-based fusion network To merge ; On the choice of fusion mode , Yes ： Early integration (early fusion)、 Later Integration (late fusion)、 Deep integration (deep fusion), The comparison of the three methods is shown in the following figure ：

Link to the original paper

3.2 AVOD

AVOD It is a kind of fusion of LIDAR point cloud data and RGB 3D object detection algorithm based on image information , Its input is only the aerial view generated by lidar (Bird’s Eye View,BEV)Map And the camera RGB Images , The laser radar forward graph is discarded (Front View,FV) and BEV Density characteristics in (intensity feature), As shown in the figure below ：

For input data ,AVOD First, feature extraction , Get two full resolution feature maps , Input to RPN Generate suggestions for areas without orientation , Finally, select the appropriate proposed candidates and send them to the detection network to generate a three-dimensional bounding box with orientation , Complete the target detection task ;AVOD There are two sensor data fusion ： Feature fusion and region suggestion fusion .

The illustration above ： Above, AVOD Feature extraction network , Using an encoder - decoder (encoder-decoder) structure , Each decoder first samples the input , Then it is connected in series with the output of the corresponding encoder , Finally through a 3×3 The convolution of ; This structure can extract the feature map of the resolution , It effectively avoids that small target objects occupy insufficient pixels in the output feature mapping due to down sampling 1 The problem of , The final output feature mapping contains both the underlying details , It also integrates high-level semantic information , It can effectively improve the detection results of small target objects .

The illustration above ： The above figure shows three bounding box coding methods , From left to right ：MV3D、 Shaft alignment, 、AVOD 3D bounding box coding method , And MV3D The encoding method of specifying eight vertex coordinates is compared ,AVOD The shape of the three-dimensional boundary box is constrained by a bottom and height , And only one 10 The vector representation of dimension is sufficient ,MV3D need 24 The vector representation of dimensions .

3.3 F-PointNet

F-PointNet Combine the mature target detection methods in two-dimensional images to locate the target , Get the visual cone in the corresponding 3D point cloud data (frustum), And perform bounding box regression to complete the detection task , As shown in the figure below ：

F-PointNet The whole network structure consists of three parts ： Cone of vision (frustum proposal)、 3D instance segmentation (3D instance segmentation)、 3D bounding box regression (amodal 3D box estimation); The network structure is shown in the figure below ：

F-PointNet utilize RGB The advantage of high image resolution , The adoption is based on FPN The detection model first obtains the boundary box of the target on the two-dimensional image , Then according to the known camera projection matrix , Lift the two-dimensional bounding box to the visual cone that defines the three-dimensional search space of the target , And collect all points in the truncated body to form a cone point cloud ;

The illustration above ： chart (a) Is the camera coordinate system , chart (b) Is the cone coordinate system , chart ( c ) It is a three-dimensional mask local coordinate system , chart (d) yes T-Net Predicted 3D Target coordinate system ; To avoid occlusion and blurring , For cone point cloud data ,F-PointNet Use PointNet( or PointNet++) Model for instance segmentation ; In 3D space , Objects are mostly separated , 3D segmentation is more reliable ; Split by instance , You can get the three-dimensional mask of the target object ( That is, all point clouds belonging to the target ), And calculate its centroid as the new coordinate origin , Pictured ( c ) Shown , Convert to local coordinate system , To improve the translation invariance of the algorithm ; Last , For target point cloud data ,F-PointNet By using with T-Net Of PointNet( or PointNet++) Regression operation of the model , Predict the center of the target 3D bounding box 、 Size and orientation , Pictured (d) Shown , Finally complete the detection task ;T-Net The function of is to predict the distance from the real center of the three-dimensional boundary box of the target to the centroid of the target , Then take the prediction center as the origin , Get the target coordinate system .

Summary ：
F-PointNet In order to ensure the invariance of the point cloud data in each step and finally more accurately return to the 3D boundary box , A total of three coordinate system transformations are required , They are visual cone conversion 、 Mask centroid conversion 、T-Net forecast .
Link to the original paper

原网站

版权声明
本文为[FUXI_ Willard]所创，转载请带上原文链接，感谢
https://yzsam.com/2022/183/202207020110134499.html