DETR3D: a 3D detection framework for multi-view 2D images
2022-06-26 03:51:00 【AI vision netqi】
Recently, camera-based object detection in BEV (Bird's Eye View) has become a trend in the autonomous driving community. One of the works that set off this trend is DETR3D, a CoRL 2021 paper from our MARS Lab in collaboration with MIT, TRI, and Li Auto.
Below, Chen Kuo introduces our paper, DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries [1].
3D object detection from the surround-view camera images of an autonomous vehicle is a thorny problem: how to predict 3D objects from the 2D information of a monocular camera, how to handle the way an object's shape and size change with its distance from the camera, how to fuse information across different cameras, how to deal with objects truncated at the boundary between adjacent cameras, and so on.
Converting the perspective view into a BEV representation is a good solution, mainly for the following reasons:
- BEV is a unified and complete representation of the global scene, in which the size and orientation of an object can be expressed directly;
- BEV makes temporal multi-frame fusion and multi-sensor fusion easier;
- BEV is better suited to downstream tasks such as object tracking and trajectory prediction.
The DETR3D approach

The DETR3D model design consists of three main parts: the encoder, the decoder, and the loss.
Encoder
In the nuScenes dataset, each sample contains 6 surround-view camera images. We encode each image with a ResNet to extract features, followed by an FPN that outputs 4 levels of multi-scale features.
Decoder
The detection head contains 6 transformer decoder layers. Similar to DETR, we preset 300/600/900 object queries, each of which is a 256-dimensional embedding. For every object query, a fully connected network predicts the coordinates (x, y, z) of a 3D reference point in BEV space. The coordinates are normalized by a sigmoid function and represent a relative position in the space.
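As a rough illustration of the sigmoid normalization described above, here is a minimal sketch in plain Python. The BEV extent `BEV_RANGE` and the helper name `to_reference_point` are assumptions made for this example and are not taken from the paper; the raw inputs stand in for the output of the fully connected network.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical BEV extent in metres: (x_min, y_min, z_min, x_max, y_max, z_max).
BEV_RANGE = (-50.0, -50.0, -5.0, 50.0, 50.0, 3.0)

def to_reference_point(raw_xyz, bev_range=BEV_RANGE):
    """Squash raw network outputs to [0, 1], then scale them to metric coordinates."""
    normalized = [sigmoid(v) for v in raw_xyz]
    mins, maxs = bev_range[:3], bev_range[3:]
    metric = [lo + n * (hi - lo) for n, lo, hi in zip(normalized, mins, maxs)]
    return normalized, metric

# Raw values standing in for the output of the fully connected network.
normalized, metric = to_reference_point([0.0, 2.0, -1.0])
```

A raw value of 0 maps to the middle of the range, and very large or very small raw values saturate toward the range boundaries, so every reference point stays inside the chosen extent.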
In each decoder layer, all object queries first perform self-attention among themselves, interacting to obtain global information and to avoid several queries converging to the same object. The object queries then perform cross-attention with the image features: the 3D reference point of each query is projected to image coordinates using the camera intrinsics and extrinsics, and the corresponding multi-scale image features are sampled by linear interpolation. If the projected coordinate falls outside the image, the feature is filled with zeros. The sampled image features are then used to update the object queries.
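The projection and sampling step can be sketched as follows for a single camera and a single feature level. This is a simplified illustration, not the authors' implementation: the function names, the 3x4 extrinsic layout, and the choice to treat points behind the camera as invalid are assumptions of this example. A multi-scale version would repeat the sampling on each FPN level with suitably scaled coordinates.

```python
import math

def project_to_image(point_xyz, extrinsic, intrinsic):
    """Project a 3D point to pixel coordinates (u, v).

    extrinsic: 3x4 matrix [R | t] mapping the 3D frame to the camera frame.
    intrinsic: 3x3 pinhole matrix.
    Returns None for points behind the camera (an assumption of this sketch).
    """
    x, y, z = point_xyz
    cam = [r[0] * x + r[1] * y + r[2] * z + r[3] for r in extrinsic]
    uvw = [sum(k * c for k, c in zip(row, cam)) for row in intrinsic]
    depth = uvw[2]
    if depth <= 1e-5:
        return None
    return uvw[0] / depth, uvw[1] / depth

def bilinear_sample(feature, u, v):
    """Bilinearly sample a 2D map (list of rows) at (u, v); zero outside the map."""
    h, w = len(feature), len(feature[0])
    if not (0.0 <= u <= w - 1 and 0.0 <= v <= h - 1):
        return 0.0
    u0, v0 = int(math.floor(u)), int(math.floor(v))
    u1, v1 = min(u0 + 1, w - 1), min(v0 + 1, h - 1)
    du, dv = u - u0, v - v0
    top = feature[v0][u0] * (1 - du) + feature[v0][u1] * du
    bottom = feature[v1][u0] * (1 - du) + feature[v1][u1] * du
    return top * (1 - dv) + bottom * dv

def sample_reference_point(point_xyz, extrinsic, intrinsic, feature):
    projected = project_to_image(point_xyz, extrinsic, intrinsic)
    if projected is None:
        return 0.0
    u, v = projected
    return bilinear_sample(feature, u, v)

# Toy setup: camera frame equals the 3D frame, 3x3 single-channel feature map.
EXTRINSIC = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
INTRINSIC = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [0.0, 0.0, 1.0]]
FEATURE = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]

sampled_inside = sample_reference_point((0.5, 0.0, 1.0), EXTRINSIC, INTRINSIC, FEATURE)
sampled_outside = sample_reference_point((5.0, 0.0, 1.0), EXTRINSIC, INTRINSIC, FEATURE)
sampled_behind = sample_reference_point((0.0, 0.0, -1.0), EXTRINSIC, INTRINSIC, FEATURE)
```

In the toy setup the first point lands halfway between two feature cells and receives their average, while the other two points contribute zeros, which mirrors the zero-filling rule for projections that miss the image.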
The object queries updated by attention pass through two MLP networks that predict the class and the bounding box parameters of the corresponding object. To help the network learn, we always predict the bounding box center as an offset (△x, △y, △z) relative to the reference point, and use it to update the reference point coordinates.
The object queries and reference points updated in each layer serve as the input to the next decoder layer, where they are computed and updated again, for a total of 6 iterations.
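The layer-by-layer refinement can be pictured with the toy loop below. Two things here are assumptions rather than details from the text: the offset is applied in logit (inverse-sigmoid) space so that the normalized coordinates stay inside (0, 1), and `toy_offset_head` is a hypothetical stand-in for the box MLP that simply nudges the point toward a fixed target.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def inverse_sigmoid(p, eps=1e-5):
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def refine(reference, offset):
    """Apply a predicted offset to a normalized reference point in logit space."""
    return [sigmoid(inverse_sigmoid(r) + d) for r, d in zip(reference, offset)]

def toy_offset_head(reference, target=(0.8, 0.2, 0.5)):
    """Hypothetical stand-in for the box MLP: moves halfway toward `target` in logit space."""
    return [0.5 * (inverse_sigmoid(t) - inverse_sigmoid(r))
            for t, r in zip(target, reference)]

NUM_LAYERS = 6  # the decoder iterates 6 times

reference = [0.5, 0.5, 0.5]
history = [reference]
for _ in range(NUM_LAYERS):
    offset = toy_offset_head(reference)
    reference = refine(reference, offset)  # becomes the input of the next layer
    history.append(reference)
```

Each pass starts from the previous layer's estimate, so later layers only need to predict small corrections.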
Loss
The design of the loss function is mainly inspired by DETR. We use the Hungarian algorithm to perform bipartite matching between the detection boxes predicted by all object queries and all ground-truth bounding boxes, find the optimal match that minimizes the loss, and then compute a classification focal loss and an L1 regression loss.
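To make the matching step concrete, here is a small sketch that pairs predictions with ground-truth boxes by minimizing a focal classification cost plus an L1 box cost. It swaps the Hungarian algorithm for a brute-force search over permutations, which gives the same optimum on toy inputs but does not scale; a real pipeline would use a Hungarian solver such as `scipy.optimize.linear_sum_assignment`. The cost weights, the focal parameters, and the reduction of boxes to 3D centers are simplifying assumptions.

```python
import math
from itertools import permutations

def focal_cost(prob, alpha=0.25, gamma=2.0):
    """Focal-loss value for a positive label predicted with probability `prob`."""
    return -alpha * (1.0 - prob) ** gamma * math.log(max(prob, 1e-8))

def l1_cost(pred_box, gt_box):
    return sum(abs(p - g) for p, g in zip(pred_box, gt_box))

def match(pred_probs, pred_boxes, gt_labels, gt_boxes):
    """Return (assignment, cost); assignment[j] is the prediction matched to ground truth j.

    Brute force over permutations: a stand-in for the Hungarian algorithm, toy sizes only.
    """
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(pred_boxes)), len(gt_boxes)):
        cost = sum(
            focal_cost(pred_probs[i][gt_labels[j]]) + l1_cost(pred_boxes[i], gt_boxes[j])
            for j, i in enumerate(perm)
        )
        if cost < best_cost:
            best, best_cost = perm, cost
    return best, best_cost

# Three queries, two ground-truth objects; boxes reduced to (x, y, z) centers.
pred_probs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
pred_boxes = [[1.0, 1.0, 0.0], [5.0, 5.0, 0.0], [9.0, 9.0, 0.0]]
gt_labels = [1, 0]
gt_boxes = [[5.2, 4.9, 0.0], [1.1, 0.9, 0.0]]

assignment, total_cost = match(pred_probs, pred_boxes, gt_labels, gt_boxes)
```

Queries left unmatched (the third one in this toy case) would be supervised as background, which is how a set-based loss removes the need for NMS.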
Experimental results
We train starting from a backbone pretrained by FCOS3D and, without using NMS or test-time augmentation, compare our results against those of FCOS3D.

We also train starting from a backbone pretrained by DD3D, which gave the best results on the nuScenes test set.

Detecting truncated objects in the overlap regions of adjacent surround-view cameras has always been a difficult problem. By predicting directly in BEV space, DETR3D avoids post-processing between cameras and effectively alleviates this problem. In the overlap regions, our mAP exceeds that of FCOS3D by about 4 points.

Recent related work
DETR3D reached first place on the nuScenes leaderboard in October of last year. In recent months, many works doing vision-based 3D object detection in BEV have appeared on the nuScenes leaderboard. It seems that our work has inspired many colleagues in the field, and everyone is exploring this direction.

Below we compare DETR3D with recent work and consider the following questions, in the hope of offering some inspiration.
- How to convert surround-view images into BEV?
In DETR3D and BEVFormer [2], features are sampled using reference points and the physical meaning of the camera parameters. The advantage is a low computational cost: with the multi-scale structure of the FPN and the learned offsets of Deformable DETR, even one or a few reference points can gather information from a sufficient receptive field. The disadvantage is that reference points lying on the same polar ray in BEV sample identical image features after projection, because the image lacks depth information. During subsequent feature aggregation, the network therefore has to determine whether the sampled information matches the position of the current reference point.
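The ray ambiguity mentioned above is easy to verify numerically. In the sketch below, two points on the same camera ray project to the same pixel, so they would sample identical image features. The pinhole intrinsics (`fx`, `fy`, `cx`, `cy`) are made-up values for illustration.

```python
def project(point_cam, fx=1000.0, fy=1000.0, cx=800.0, cy=450.0):
    """Pinhole projection of a camera-frame point (z pointing forward) to pixels."""
    x, y, z = point_cam
    return (fx * x / z + cx, fy * y / z + cy)

near = (1.0, 0.5, 10.0)
far = tuple(3.0 * c for c in near)  # same ray, three times farther away

pixel_near = project(near)
pixel_far = project(far)
```

Both points land on the same pixel, which is why the sampled feature alone cannot tell the network at which depth along the ray the object actually sits.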
In BEVDet [3], the transformation follows the lift-splat-shoot [4] method: a depth distribution is predicted for each position of the image feature map, and the feature is multiplied by the depth probabilities to lift it into BEV. This requires a lot of computation and GPU memory. Because there are no ground-truth depth labels, what is actually predicted is a probability without an exact physical meaning. In addition, a considerable part of the image contains no objects, so having every feature participate in the computation may be somewhat redundant.
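The "lift" step for a single pixel can be sketched as an outer product between a depth distribution and the pixel's feature vector. The number of depth bins, the feature size, and the function names are assumptions chosen for illustration; the subsequent "splat" step, which pools the lifted features into BEV grid cells, is omitted.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def lift_pixel(feature, depth_logits):
    """Outer product of a depth distribution (D bins) and a feature vector (C channels).

    Returns a D x C list: the feature placed at every depth bin, weighted by its probability.
    """
    probs = softmax(depth_logits)
    return [[p * f for f in feature] for p in probs]

feature = [1.0, -2.0, 0.5]            # C = 3 channels
depth_logits = [0.0, 2.0, 0.0, -1.0]  # D = 4 depth bins
lifted = lift_pixel(feature, depth_logits)
```

Every pixel produces D x C values, so a feature map of size H x W expands to H x W x D x C, which illustrates where the extra computation and memory come from.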
- How to choose the form of the BEV representation?
In DETR3D, we do not explicitly represent the whole BEV, but represent it through sparse object queries. The most significant benefit is the saving in memory and computation. BEVDet and BEVFormer, by contrast, build a dense BEV feature. Although this increases GPU memory usage, it has three advantages: first, data augmentation in BEV space is easier; second, as in BEVDet, an encoder over the BEV features can be added; third, it can accommodate a variety of 3D detection heads (BEVDet uses CenterPoint, BEVFormer uses Deformable DETR).