Notes on the paper "Cross-view Transformers for real-time Map-view Semantic Segmentation"
2022-07-04 04:56:00 【m_ buddy】
Reference code: cross_view_transformers
1. Summary
Introduction: This article proposes a new 2D BEV feature extraction scheme. It introduces camera priors (the camera intrinsics and extrinsics) to build a multi-view cross-attention mechanism that maps multi-view features to BEV features. The bridge between the multi-view features and the BEV features is built by computing attention among three ingredients: a BEV positional embedding (added to the original bev queries as a refinement; the "code" written here refers to what the paper calls an embedding), a camera-aware embedding computed from the camera calibration results (intrinsics and extrinsics), and the multi-view features themselves. This step is called cross-view attention. Overall, the front end of the network uses a CNN as the feature extractor, the middle stage takes the CNN's multi-level features as input to refine the BEV features across the multiple views (i.e. cascaded refinement), and the back end uses a CNN decoder to produce the output. The whole pipeline is simple and efficient, reaching real-time speed on a 2080Ti GPU.
This article proposes a transformer-based BEV feature extraction network (for 2D BEV). The bev queries are refined by adding a map-view embedding to obtain the final queries. Similarly, the multi-view features (obtained from the CNN network) are refined by adding a camera-view embedding to obtain the keys. In addition, to perceive 3D positional geometry, a camera-position embedding is also computed (in the code it is subtracted) and associated with the two embeddings above. Finally, the original multi-view features are also mapped as the values, and the attention is computed over these. The figure below serves as an auxiliary illustration of the components described above.
2. Method Design
2.1 Network Pipeline
The overall algorithm proposed in this paper is shown in the figure below:
The input in the figure above is the multi-view data $\{I_k\in R^{W\times H\times 3},\ R_k\in R^{3\times 3},\ K_k\in R^{3\times 3},\ t_k\in R^3\}_{k=1}^n$ (each tuple consists of the image, rotation matrix, intrinsic matrix, and translation vector). The final result is obtained through the following steps:
- 1) A CNN backbone extracts features $\phi_k,\ k=1,\dots,n$ from the multi-view data; a linear layer maps these features into the values of the attention.
- 2) The CNN features are refined by adding the camera-view embedding (the positional embedding in the figure above, which depends on the intrinsic and extrinsic matrices calibrated for each view), giving the keys of the attention.
- 3) The bev queries are refined under the map embedding (obtained by embedding the bev grid) to give the final queries of the attention.
- 4) The multi-level feature maps output by the CNN backbone refine the bev features in a cascaded manner; the result is then fed to the decoder to produce the output (see the sketch after this list).
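To make the data flow concrete, here is a minimal skeleton of the four steps above, written as an illustrative PyTorch module (class and argument names are hypothetical, not the repo's actual API):

```python
import torch.nn as nn

class CrossViewEncoderSketch(nn.Module):
    """Illustrative skeleton of the pipeline above; all names are hypothetical."""
    def __init__(self, backbone, cross_attn_layers, bev_queries):
        super().__init__()
        self.backbone = backbone                                   # step 1: CNN feature extractor
        self.cross_attn_layers = nn.ModuleList(cross_attn_layers)  # one attention block per feature level
        self.bev_queries = nn.Parameter(bev_queries)               # step 3: learned bev queries

    def forward(self, images, I_inv, E_inv):
        # images: (b, n, 3, H, W); I_inv / E_inv: inverse intrinsics / extrinsics per view
        features = self.backbone(images)        # multi-level, multi-view CNN features
        x = self.bev_queries[None].expand(images.shape[0], *self.bev_queries.shape)
        for feat, layer in zip(features, self.cross_attn_layers):
            # each block builds keys (step 2) and queries (step 3) from the
            # embeddings and attends into the view features as values
            x = layer(x, feat, I_inv, E_inv)    # step 4: cascaded refinement
        return x                                # bev features handed to the decoder
```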
2.2 Cross-view Attention
In the 3D setting, the mapping between a space point $x^{(W)}$ and an image point $x^{(I)}$ is:

$$x^{(I)}\simeq K_kR_k(x^{(W)}-t_k)$$
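A quick numeric sanity check of this relation and the scale ambiguity it implies: scaling the depth of a world point along the camera ray leaves its projected pixel unchanged. This toy example uses made-up calibration values and is not code from the repo:

```python
import torch

K = torch.tensor([[500., 0., 320.],
                  [0., 500., 240.],
                  [0., 0., 1.]])   # made-up intrinsics
R = torch.eye(3)                   # identity rotation for simplicity
t = torch.zeros(3)                 # camera at the world origin

def project(x_world):
    x = K @ (R @ (x_world - t))    # x^(I) ~= K R (x^(W) - t)
    return x[:2] / x[2]            # perspective division discards scale

p1 = project(torch.tensor([1., 2., 5.]))
p2 = project(torch.tensor([2., 4., 10.]))   # same ray, twice the depth
print(torch.allclose(p1, p2))               # True: depth cannot be recovered from x^(I)
```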
That is, the relation above holds only up to scale: the actual depth is unknown, so a scale ambiguity exists. This article neither uses depth information explicitly nor implicitly encodes a spatial distribution of depth; instead, the scale ambiguity is left to be learned and absorbed by the camera-view embedding, the map-view embedding, and the transformer network mentioned above. The similarity between the 3D space point $x^{(W)}$ and the image point $x^{(I)}$ is measured with cosine similarity:
$$sim_k(x^{(I)},x^{(W)})=\frac{(R_k^{-1}K_k^{-1}x^{(I)})\cdot(x^{(W)}-t_k)}{||R_k^{-1}K_k^{-1}x^{(I)}||\ ||x^{(W)}-t_k||}$$
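This similarity can likewise be checked numerically: unproject an image point into a world ray direction and compare it against the direction toward a candidate world point. A small sketch reusing the same made-up calibration (not repo code):

```python
import torch

def unproject(x_img, K, R):
    # homogeneous pixel coordinates -> ray direction in the world frame
    return torch.linalg.inv(R) @ torch.linalg.inv(K) @ x_img

def sim_k(x_img, x_world, K, R, t):
    d = unproject(x_img, K, R)                       # R^-1 K^-1 x^(I)
    v = x_world - t                                  # x^(W) - t_k
    return torch.dot(d, v) / (d.norm() * v.norm())   # cosine similarity

K = torch.tensor([[500., 0., 320.],
                  [0., 500., 240.],
                  [0., 0., 1.]])
R, t = torch.eye(3), torch.zeros(3)
x_world = torch.tensor([1., 2., 5.])
# project the point, then append 1 to get homogeneous pixel coordinates
x_img = torch.cat([(K @ x_world)[:2] / x_world[2], torch.ones(1)])
print(sim_k(x_img, x_world, K, R, t))   # ~1.0: the point lies on its own ray
```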
camera-view embedding:
Here, positional encoding is applied to the features of each view using that view's camera intrinsics and extrinsics, i.e. each pixel on the feature map is unprojected into 3D space as a ray direction:

$$d_{k,i}=R_k^{-1}K_k^{-1}x_i^{(I)}$$
The unprojected 3D directions are then encoded by a linear network into the camera-view embedding ($\delta_{k,i}\in R^D$); refer to the following code:
```python
# cross_view_transformer/model/encoder.py#L248
pixel_flat = rearrange(pixel, '... h w -> ... (h w)')          # 1 1 3 (h w)
cam = I_inv @ pixel_flat                                       # b n 3 (h w)
cam = F.pad(cam, (0, 0, 0, 1, 0, 0, 0, 0), value=1)            # b n 4 (h w)
d = E_inv @ cam                                                # b n 4 (h w)
d_flat = rearrange(d, 'b n d (h w) -> (b n) d h w', h=h, w=w)  # (b n) 4 h w
d_embed = self.img_embed(d_flat)                               # (b n) d h w
...
if self.feature_proj is not None:  # linear projection refine
    key_flat = img_embed + self.feature_proj(feature_flat)     # (b n) d h w
else:
    key_flat = img_embed                                       # (b n) d h w
```
Further on, this code subtracts a camera-position embedding $\tau_k\in R^D$ (computed in a similar way to the camera-view embedding above), so that the embedding also perceives the camera's position in 3D space. The influence of different ways of computing the camera-view embedding on performance is shown in the following table:
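For reference, the step elided by `...` in the snippet above can be reconstructed from this description and from the analogous bev code below; the following two lines are an inferred sketch (the name `c_embed` for the camera-position embedding is an assumption), not verbatim repo code:

```python
# inferred sketch: subtract the camera-position embedding (tau_k) so the
# direction embedding becomes relative to the camera, then L2-normalize
img_embed = d_embed - c_embed                                         # (b n) d h w
img_embed = img_embed / (img_embed.norm(dim=1, keepdim=True) + 1e-7)  # (b n) d h w
```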
map-view embedding:
This part subtracts the camera-position embedding $\tau_k\in R^D$ from the bev grid embedding (denoted $c_j^{(n)}$, obtained by passing the bev grid through an embedding layer); the result, together with the original bev queries, goes into the transformer to complete the attention. Its implementation can be referenced below:
```python
# cross_view_transformer/model/encoder.py#L258
world = bev.grid[:2]                                                  # 2 H W
w_embed = self.bev_embed(world[None])                                 # 1 d H W
bev_embed = w_embed - c_embed                                         # (b n) d H W
bev_embed = bev_embed / (bev_embed.norm(dim=1, keepdim=True) + 1e-7)  # (b n) d H W
query_pos = rearrange(bev_embed, '(b n) ... -> b n ...', b=b, n=n)    # b n d H W
feature_flat = rearrange(feature, 'b n ... -> (b n) ...')             # (b n) d h w
...
# Expand + refine the BEV embedding
query = query_pos + x[:, None]                                        # b n d H W
```
cross-view attention:
The mapped multi-view features $\phi_{k,i}$ (serving as the values) undergo the attention operation with the queries and keys above; the article uses a cosine similarity of the following form:
$$sim(\delta_{k,i},\phi_{k,i},c_j^{(n)},\tau_k)=\frac{(\delta_{k,i}+\phi_{k,i})\cdot(c_j^{(n)}-\tau_k)}{||\delta_{k,i}+\phi_{k,i}||\ ||c_j^{(n)}-\tau_k||}$$
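Putting the pieces together, the attention itself can be sketched as a single-head dot-product attention in which every bev cell attends over the pixels of all views (the repo's implementation is multi-head; shapes and names here are illustrative):

```python
import torch
from einops import rearrange

def cross_view_attention(q, k, v):
    """Single-head sketch; q and k are assumed already refined as above.
    q: (b, n, d, H, W)  per-view bev queries (map-view embedding + bev queries)
    k: (b, n, d, h, w)  camera-aware keys
    v: (b, n, d, h, w)  projected image features (values)
    """
    H, W = q.shape[-2:]
    q = rearrange(q, 'b n d H W -> b n (H W) d')
    k = rearrange(k, 'b n d h w -> b n (h w) d')
    v = rearrange(v, 'b n d h w -> b (n h w) d')
    dot = torch.einsum('b n Q d, b n K d -> b n Q K', q, k)  # per-view similarities
    dot = rearrange(dot, 'b n Q K -> b Q (n K)')             # pool keys across views
    att = dot.softmax(dim=-1)                                # attend over all views' pixels
    out = torch.einsum('b Q K, b K d -> b Q d', att, v)      # weighted sum of values
    return rearrange(out, 'b (H W) d -> b d H W', H=H, W=W)
```

With the embeddings L2-normalized beforehand (the `norm(...)` lines in the snippets above), this dot product plays the role of the cosine similarity in the formula.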
The impact of the above-mentioned variables on performance:
3. Experimental Results
Comparison of bev segmentation performance on the nuScenes and Argoverse datasets:
Comparison of different bev settings and FPS: