BEV instance prediction from monocular cameras (ICCV 2021)

2022-06-30 05:18:00 [3D Vision Workshop]

Author: Huang Yu @ Zhihu

Source: https://zhuanlan.zhihu.com/p/422992592

Editor: 3D Vision Workshop

The ICCV'21 paper "FIERY: Future Instance Prediction in Bird's-Eye View from Surround Monocular Cameras" comes from Wayve, an autonomous driving start-up in the UK, and the University of Cambridge.

Driving requires interacting with other road agents and predicting their future behavior in order to navigate safely. FIERY is a probabilistic future prediction model in bird's-eye view (BEV) from monocular cameras. It predicts the future instance segmentation and motion of dynamic agents, which can be converted into non-parametric future trajectories. It combines the perception, sensor fusion, and prediction components of a traditional autonomous driving stack, estimating BEV predictions directly from RGB monocular camera inputs.

FIERY learns to model the inherent stochasticity of the future from camera driving data in an end-to-end manner, without relying on HD maps, and predicts multimodal future trajectories.

Open-source code: https://github.com/wayveai/fiery

Blog: https://wayve.ai/blog/fiery-future-instance-prediction-birds-eye-view/

The following two figures illustrate the network's multimodal future predictions in BEV. Top two rows: RGB camera inputs; the predicted instance segmentation is projected onto the ground plane, and the mean future trajectory of each dynamic agent is visualized as a transparent path. Bottom row: future instance predictions in a 100m × 100m bird's-eye view around the ego vehicle, which is shown as a black rectangle in the center.

An overview of FIERY, a BEV future prediction model from camera inputs, is shown in the figure:

·1. At each past timestep {1, ..., t}, the camera inputs (O1, ..., Ot) are lifted to 3D by predicting a depth probability distribution over pixels and using the known camera intrinsics and extrinsics.

·2. The features are projected to BEV (x1, ..., xt). Using the past ego-motion (a1, ..., at−1), a Spatial Transformer module S warps the BEV features into the present reference frame (time t).

·3. A 3D convolutional temporal model learns a spatio-temporal state st.

·4. Two probability distributions are parameterized: the present distribution and the future distribution. The present distribution is conditioned on the current state st; the future distribution is conditioned on both the current state st and the future labels (yt+1, ..., yt+H).

·5. A latent code ηt is sampled from the future distribution during training and from the present distribution during inference. The current state st and the latent code ηt are the inputs to the future prediction model, which recursively predicts future states (ŝt+1, ..., ŝt+H).

·6. The states are decoded into future instance segmentation and future motion in BEV (ŷt, ..., ŷt+H).
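The geometry behind step 2 can be illustrated with a minimal sketch that moves a single BEV point from a past ego frame into the present ego frame using a 2D rigid transform. The function name `to_present_frame` and the pose convention (the past frame's origin and heading expressed in the present frame) are assumptions made for illustration; the model itself applies an equivalent warp to whole feature maps.

```python
import math

def to_present_frame(point, translation, yaw):
    """Map a 2D point from a past ego frame into the present ego frame.

    translation: (tx, ty), position of the past frame's origin in the present frame.
    yaw: heading of the past frame relative to the present frame, in radians.
    Both conventions are assumptions of this sketch.
    """
    x, y = point
    tx, ty = translation
    c, s = math.cos(yaw), math.sin(yaw)
    # Rotate into the present frame's orientation, then translate.
    return (c * x - s * y + tx, s * x + c * y + ty)

# A point 1 m ahead in a past frame located at (2, 3) and rotated by 90 degrees.
px, py = to_present_frame((1.0, 0.0), (2.0, 3.0), math.pi / 2)
```

With a zero translation and zero yaw the function returns the point unchanged, which is the case when the vehicle has not moved between frames.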

Here the depth probability acts as a form of self-attention: the features are modulated according to the depth plane they are predicted to belong to. Using the known camera intrinsics and extrinsics (the camera pose relative to the vehicle), the features from every camera (u1t, ..., unt) are lifted to 3D in a common reference frame (the inertial center of the ego vehicle at time t).
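One common way to realize this modulation, used in Lift-Splat style models, is an outer product between a per-pixel depth distribution and the pixel's feature vector. The sketch below shows it for a single pixel in plain Python; the function names, the list-based representation, and the example values are illustrative assumptions, not the FIERY implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def lift_pixel(feature, depth_logits):
    """Weight one pixel's feature vector by its depth distribution.

    Returns one scaled copy of the feature per depth bin, i.e. the outer
    product of the depth probabilities and the feature vector.
    """
    probs = softmax(depth_logits)
    return [[p * f for f in feature] for p in probs]

feature = [1.0, -2.0, 0.5]             # hypothetical 3-channel feature
depth_logits = [0.0, 2.0, 0.0, -1.0]   # hypothetical scores for 4 depth bins
lifted = lift_pixel(feature, depth_logits)
```

Because the depth probabilities sum to one, summing the lifted features over the depth bins recovers the original feature; the depth distribution only decides how the feature is spread along the camera ray.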

Following the ECCV'20 paper "Probabilistic future prediction for video scene understanding", a conditional variational approach is used to model the inherent stochasticity of future prediction. Two distributions are introduced: the present distribution P, which only has access to the current spatio-temporal state st, and the future distribution F, which additionally has access to the observed future labels (yt+1, ..., yt+H), where H is the future prediction horizon.

During training, the latent code ηt is sampled from the future distribution to enforce predictions consistent with the observed future, and a mode-covering KL divergence loss encourages the present distribution to cover the observed future. During inference, ηt is sampled from the present distribution, and each sample encodes one possible future.
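As a rough illustration of this loss, the sketch below computes KL(F || P) in closed form and draws a latent sample, assuming that both distributions are diagonal Gaussians. That parameterization, the function names, and the example numbers are assumptions of this sketch; the article does not spell them out.

```python
import math
import random

def kl_diag_gaussians(mu_f, sigma_f, mu_p, sigma_p):
    """KL(F || P) for two diagonal Gaussians given as per-dimension lists."""
    total = 0.0
    for mf, sf, mp, sp in zip(mu_f, sigma_f, mu_p, sigma_p):
        total += math.log(sp / sf) + (sf ** 2 + (mf - mp) ** 2) / (2.0 * sp ** 2) - 0.5
    return total

def sample_latent(mu, sigma, rng):
    """Draw one latent code with the reparameterization mu + sigma * eps."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

rng = random.Random(0)
# Training: sample from the future distribution and penalize KL(F || P).
eta_train = sample_latent([0.3, -0.1], [0.5, 0.5], rng)
loss_kl = kl_diag_gaussians([0.3, -0.1], [0.5, 0.5], [0.0, 0.0], [1.0, 1.0])
# Inference: sample from the present distribution instead.
eta_test = sample_latent([0.0, 0.0], [1.0, 1.0], rng)
```

Placing F as the first argument of the KL term penalizes the present distribution wherever the future distribution has mass that it does not cover, which is the mode-covering behavior described above.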

The future prediction model is a convolutional GRU network. It takes as input the current state st and a latent code ηt, sampled from the future distribution F during training or from the present distribution P during inference, and recursively predicts future states.
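The recursion can be sketched with a toy scalar GRU cell: the state plays the role of the hidden state, the latent code is fed as the input at every step, and each output becomes the next step's state. The real model is convolutional and operates on BEV feature maps; the scalar weights, their values, and the function names here are arbitrary assumptions chosen only to make the loop runnable.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(h, x, w):
    """One scalar GRU update of hidden state h given input x."""
    z = sigmoid(w["wz"] * x + w["uz"] * h)                 # update gate
    r = sigmoid(w["wr"] * x + w["ur"] * h)                 # reset gate
    h_tilde = math.tanh(w["wh"] * x + w["uh"] * (r * h))   # candidate state
    return (1.0 - z) * h + z * h_tilde

def rollout(state, latent, weights, horizon):
    """Recursively predict `horizon` future states from the present state."""
    states = []
    for _ in range(horizon):
        state = gru_step(state, latent, weights)
        states.append(state)
    return states

# Arbitrary toy weights; a trained model would learn these.
weights = {"wz": 0.5, "uz": 0.5, "wr": 0.5, "ur": 0.5, "wh": 1.0, "uh": 1.0}
future_a = rollout(state=0.2, latent=0.8, weights=weights, horizon=4)
future_b = rollout(state=0.2, latent=-0.8, weights=weights, horizon=4)
```

Starting from the same present state, the two latent codes produce two different state sequences, which is how different samples of ηt yield different futures.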

The resulting features are the input to a bird's-eye view decoder D, which has multiple output heads: semantic segmentation, instance centerness, instance offset (pointing to the center of the instance), and future instance flow (motion). The figure below illustrates the model outputs:

Instance segmentation is obtained as follows: (i) instance centers are found by non-maximum suppression; (ii) pixels are grouped to their nearest instance center using the offset vectors; (iii) future flow provides consistent instance identities over time, by warping the centers from t to t + 1 with the future flow and comparing them with the centers detected at time t + 1.
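Steps (i) and (ii) can be sketched on a tiny grid in plain Python. The 3×3 local-maximum test, the threshold, the function names, and the example values are illustrative assumptions rather than the settings used in the paper; step (iii) is not covered.

```python
def find_centers(heatmap, threshold=0.5):
    """Simple NMS: keep cells above threshold that are maximal in their 3x3 window."""
    h, w = len(heatmap), len(heatmap[0])
    centers = []
    for i in range(h):
        for j in range(w):
            v = heatmap[i][j]
            if v < threshold:
                continue
            window = [heatmap[a][b]
                      for a in range(max(0, i - 1), min(h, i + 2))
                      for b in range(max(0, j - 1), min(w, j + 2))]
            if v >= max(window):
                centers.append((i, j))
    return centers

def group_pixels(foreground, offsets, centers):
    """Assign each foreground pixel to the center nearest to (pixel + offset).

    Instance ids start at 1; 0 marks background.
    """
    h, w = len(foreground), len(foreground[0])
    ids = [[0] * w for _ in range(h)]
    if not centers:
        return ids
    for i in range(h):
        for j in range(w):
            if not foreground[i][j]:
                continue
            ti, tj = i + offsets[i][j][0], j + offsets[i][j][1]
            nearest = min(range(len(centers)),
                          key=lambda k: (centers[k][0] - ti) ** 2 + (centers[k][1] - tj) ** 2)
            ids[i][j] = nearest + 1
    return ids

heatmap = [[0.0, 0.2, 0.0, 0.0, 0.1, 0.0],
           [0.3, 0.9, 0.3, 0.2, 0.8, 0.2],
           [0.0, 0.2, 0.0, 0.0, 0.1, 0.0]]
foreground = [[False] * 6, [True] * 6, [False] * 6]
offsets = [[(0, 0)] * 6 for _ in range(3)]
offsets[1] = [(0, 1), (0, 0), (0, -1), (0, 1), (0, 0), (0, -1)]  # each points at its center

centers = find_centers(heatmap)
instance_ids = group_pixels(foreground, offsets, centers)
```

In this toy example the middle row is split into two instances, one per detected center, while the background rows keep id 0.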

The evaluation metrics are Video Panoptic Quality (VPQ) and Generalised Energy Distance (GED).
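For intuition, the usual form of the generalised energy distance compares a set of sampled predictions with the ground truth through a pairwise distance d: twice the mean prediction-to-target distance, minus the mean distance among predictions, minus the mean distance among targets. The sketch below estimates it over all pairs, self-pairs included, and uses a toy absolute difference on scalars as d; the function names and the choice of d are assumptions made for illustration, since this article does not state which distance the paper plugs in.

```python
def mean_pairwise(xs, ys, dist):
    """Average of dist(x, y) over all pairs, including self-pairs."""
    return sum(dist(x, y) for x in xs for y in ys) / (len(xs) * len(ys))

def generalised_energy_distance(preds, targets, dist):
    """2 E[d(pred, target)] - E[d(pred, pred')] - E[d(target, target')]."""
    return (2.0 * mean_pairwise(preds, targets, dist)
            - mean_pairwise(preds, preds, dist)
            - mean_pairwise(targets, targets, dist))

def abs_diff(a, b):
    """Toy distance between two scalar 'predictions'."""
    return abs(a - b)

ged = generalised_energy_distance([0.0, 2.0], [1.0], abs_diff)
```

The middle term rewards diversity among the samples, so a model cannot lower the score simply by repeating one prediction unless that prediction matches the ground truth.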

Baseline methods include:

·VPN ("Cross-view semantic segmentation for sensing surroundings", IEEE Robotics and Automation Letters, 2020)

·VED ("Monocular semantic occupancy grid mapping with convolutional variational encoder–decoder networks", IEEE Robotics and Automation Letters, 2019)

·PON ("Predicting semantic map representations from images using pyramid occupancy networks", CVPR 2020)

·Lift-Splat ("Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d", ECCV 2020)

·STA ("Enabling spatio-temporal aggregation in birds-eye-view vehicle estimation", ICRA 2021)

·Fishing Net ("Fishing net: Future inference of semantic heatmaps in grids", CVPR'20 workshop)

The experimental results are as follows:

where Settings 1, 2, and 3 are defined as:

·Setting 1: 100m × 50m at 25cm resolution. Prediction of the present timeframe.

·Setting 2: 100m × 100m at 50cm resolution. Prediction of the present timeframe.

·Setting 3: 32.0m × 19.2m at 10cm resolution. Prediction 2.0s into the future. Here the model is compared with two variants of Fishing Net, one using camera inputs and one using LiDAR inputs.
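To make the three settings concrete, the BEV grid size in cells follows directly from the extent divided by the resolution. The helper name `bev_grid_shape` is hypothetical and only carries out that arithmetic.

```python
def bev_grid_shape(length_m, width_m, resolution_m):
    """Number of BEV cells along each axis for a given extent and cell size."""
    return (round(length_m / resolution_m), round(width_m / resolution_m))

setting_1 = bev_grid_shape(100.0, 50.0, 0.25)   # (400, 200)
setting_2 = bev_grid_shape(100.0, 100.0, 0.50)  # (200, 200)
setting_3 = bev_grid_shape(32.0, 19.2, 0.10)    # (320, 192)
```

Setting 3 covers a much smaller area than the other two but at a finer resolution, so its grid is of comparable size in cells.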

The figure compares FIERY Static (no temporal context) and FIERY (1.0s of past context) on present-frame BEV instance segmentation on the nuScenes dataset: FIERY can predict partially observable and occluded elements, as highlighted by the blue ellipses.

(a) The vehicle parked on the left is correctly predicted even though it is occluded.

(b) The two cars parked on the left are heavily occluded by vehicles in the opposite lane, but by fusing past information the model predicts their positions accurately.

This article is shared for academic purposes only. If it infringes any rights, please contact us to have it removed.
