Using transformer for object detection and semantic segmentation
2022-07-02 07:59:00 【MezereonXP】
Introduction
This post covers the Facebook AI paper “End-to-End Object Detection with Transformers”.
Transformers happen to be popular right now, so here is how one can be used for object detection and semantic segmentation.
For more background on the Transformer, you can refer to my earlier article.
In brief, the Transformer is a model architecture for sequence-to-sequence modeling, widely used in machine translation and related fields. It abandons the RNN-style architectures previously used for sequence modeling and relies on the attention mechanism instead, which gives it strong sequence modeling and transformation capabilities.
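As a rough illustration of the attention mechanism mentioned above, here is a minimal sketch of scaled dot-product attention in plain Python. The function names `softmax` and `attention` are illustrative choices, and the sketch deliberately leaves out the learned projections, multiple heads, and positional encodings that a full Transformer uses.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention.

    queries: list of vectors of length d_k
    keys:    list of vectors of length d_k
    values:  list of vectors, one per key
    Returns one output vector per query.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs
```

When all keys are identical the weights are uniform, so the output is simply the mean of the value vectors.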
General structure and process

As shown in the figure above, the model is divided into two main parts:
- Backbone: mainly a CNN, used to extract high-level semantic features
- Encoder-Decoder: uses the high-level semantic features to produce the target predictions
In more detail, the architecture is as follows:

We walk through the process in order:
- Input an image of shape $(C_0, H_0, W_0)$, where $C_0 = 3$ is the number of channels
- After CNN feature extraction, we obtain a tensor of shape $(C, H, W)$, where $C = 2048$, $H = \frac{H_0}{32}$, $W = \frac{W_0}{32}$
- A 1x1 convolution reduces the feature dimension, giving a tensor of shape $(d, H, W)$, where $d \ll C$
- Flatten the tensor's spatial dimensions (squeeze), so the shape becomes $(d, HW)$
- This gives a sequence of $d$-dimensional vectors, which is fed into the Encoder as a sequence
- The Decoder produces a sequence of output vectors, and an FFN (Feed Forward Network) turns each of them into a bounding box prediction and a class prediction. The FFN is a simple 3-layer perceptron, and the bounding box prediction consists of the normalized center coordinates, width, and height.
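The steps above can be sketched as simple shape arithmetic, together with a helper that turns a normalized box prediction back into pixel coordinates. This is only an illustration under stated assumptions: the names `detr_shapes` and `box_to_corners` are hypothetical, the default `d=256` is an assumed value (the text only requires $d \ll C$), and integer division by 32 assumes the image sides are multiples of 32.

```python
def detr_shapes(h0, w0, c=2048, d=256, stride=32):
    """Trace tensor shapes from the input image to the encoder input.

    d=256 is an assumed reduced dimension; the text only says d << C.
    """
    h, w = h0 // stride, w0 // stride  # backbone downsamples by 32
    return {
        "image": (3, h0, w0),
        "backbone": (c, h, w),
        "reduced": (d, h, w),       # after the 1x1 convolution
        "flattened": (d, h * w),    # spatial dimensions merged
        "sequence_length": h * w,   # number of d-dimensional vectors
    }


def box_to_corners(cx, cy, bw, bh, img_w, img_h):
    """Convert a normalized (center x, center y, width, height) box
    into pixel corner coordinates (x0, y0, x1, y1)."""
    x0 = (cx - bw / 2) * img_w
    y0 = (cy - bh / 2) * img_h
    x1 = (cx + bw / 2) * img_w
    y1 = (cy + bh / 2) * img_h
    return (x0, y0, x1, y1)


shapes = detr_shapes(480, 640)
print(shapes["flattened"])  # (256, 300)
print(box_to_corners(0.5, 0.5, 0.25, 0.5, 640, 480))  # (240.0, 120.0, 400.0, 360.0)
```

For a 480x640 image, the encoder therefore sees a sequence of 300 vectors, which is why reducing the channel dimension before the Transformer matters for cost.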
Object detection results

As shown in the figure above, DETR does not require a large amount of computation, yet its FPS is not high either; overall its speed is only middling.
What about semantic segmentation?
The general framework for semantic segmentation is shown in the figure below:

Note that the box embedding (Box Embedding) depicted in the figure is essentially the decoder output (before the FFN).
A multi-head attention mechanism is then applied. This mechanism essentially applies several linear transformations to Q, K, and V. Here, Q is the decoder output, while K and V are taken from the Encoder side (the encoded image features).
Here M is the number of heads in the multi-head attention.
Afterwards, a simple CNN produces a mask matrix, which is used to generate the semantic segmentation result.
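To make the attention step concrete, the following is a minimal sketch of how one box embedding could be turned into M attention maps over the image features. It is an illustration only: `head_attention_maps` is a hypothetical name, the learned linear projections of Q and K are omitted, and the image features are passed as a list of HW vectors (the transpose of the $(d, HW)$ layout).

```python
import math


def head_attention_maps(query, memory, height, width, num_heads):
    """Compute one attention map per head for a single box embedding.

    query:  list of d floats (one decoder output / box embedding)
    memory: list of height*width vectors, each with d floats
    Returns num_heads maps, each a height x width nested list.
    Learned Q/K projections are omitted in this sketch.
    """
    d = len(query)
    if d % num_heads != 0:
        raise ValueError("d must be divisible by num_heads")
    if len(memory) != height * width:
        raise ValueError("memory must contain height*width vectors")
    head_dim = d // num_heads
    maps = []
    for m in range(num_heads):
        lo, hi = m * head_dim, (m + 1) * head_dim
        # Scaled dot product between this head's slice of the query
        # and the same slice of every spatial position.
        scores = [
            sum(query[i] * key[i] for i in range(lo, hi)) / math.sqrt(head_dim)
            for key in memory
        ]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Reshape the HW weights back into an H x W map.
        maps.append([weights[r * width:(r + 1) * width] for r in range(height)])
    return maps
```

Each box thus yields M small heatmaps of size H x W, and it is these maps that the subsequent CNN refines into a mask.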
Analysis of semantic segmentation results

We can see that, compared with PanopticFPN++, the improvement is limited. The AP in particular is not good, so overall performance is only average.
Conclusion
The paper applies the Transformer to object detection and semantic segmentation and achieves good results. Compared with FastRCNN-style architectures, however, there is no obvious performance improvement. What it does show is that this sequence model has good scalability. Using one architecture to solve multiple problems brings the goal of a unified model within reach.