Using transformer for object detection and semantic segmentation
2022-07-02 07:59:00 【MezereonXP】
Introduction
This post covers a paper from Facebook AI, "End-to-End Object Detection with Transformers" (DETR).
Transformers have recently become popular, so here we look at how a Transformer can be used for object detection and semantic segmentation.
For background on the Transformer, you can refer to my earlier article.
Briefly, the Transformer is a model architecture for sequence-to-sequence modeling, widely used in natural language translation and related fields. It abandons the RNN-style network architectures previously used for sequence modeling and instead introduces the attention mechanism, achieving strong sequence modeling and transformation capabilities.
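To make the attention mechanism at the heart of the Transformer concrete, here is a minimal NumPy sketch of scaled dot-product attention. All names and shapes are illustrative, not taken from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d), K: (n_k, d), V: (n_k, d_v)
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # (n_q, n_k) similarity scores
    weights = softmax(scores, axis=-1)  # each query's weights sum to 1
    return weights @ V                  # (n_q, d_v) weighted sum of values

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))    # 4 queries of dimension 8
K = rng.standard_normal((10, 8))   # 10 keys of dimension 8
V = rng.standard_normal((10, 16))  # 10 values of dimension 16
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 16)
```

Each output row is a weighted average of the value vectors, with weights determined by how well the query matches each key; multi-head attention simply runs several such maps in parallel on linearly projected inputs.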
General structure and process

As shown in the figure above, the model is mainly divided into two parts:
- Backbone: mainly a CNN, used to extract high-level semantic features
- Encoder-Decoder: takes the high-level semantic features and produces the object predictions
In more detail, the architecture is as follows:

The processing pipeline, step by step:
- Input image, of shape $(C_0, H_0, W_0)$, where $C_0 = 3$ is the number of channels
- After CNN feature extraction, we obtain a tensor of shape $(C, H, W)$, where $C = 2048$, $H = \frac{H_0}{32}$, $W = \frac{W_0}{32}$
- A 1x1 convolution reduces the feature dimension, giving a tensor of shape $(d, H, W)$, where $d \ll C$
- The tensor is flattened (squeezed) into shape $(d, HW)$
- This yields a sequence of $d$-dimensional vectors, which enters the Encoder as its input sequence
- The Decoder produces an output sequence of vectors; each is passed through an FFN (Feed-Forward Network) to obtain a bounding-box prediction and a class prediction. The FFN is a simple 3-layer perceptron, and the bounding-box prediction consists of the normalized center coordinates plus the width and height.
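The shape bookkeeping in the steps above can be sketched as follows. The 1x1 convolution is written as a matrix multiply over the channel axis; the input resolution and the choice $d = 256$ are illustrative assumptions, not values fixed by the post:

```python
import numpy as np

rng = np.random.default_rng(0)

H0, W0 = 640, 480                      # input image: (3, H0, W0), assumed size
C, d = 2048, 256                       # backbone channels, reduced dimension
H, W = H0 // 32, W0 // 32              # the backbone downsamples by 32

feat = rng.standard_normal((C, H, W))  # stand-in for the CNN backbone output

# a 1x1 convolution is a linear map applied at every spatial position
W1x1 = rng.standard_normal((d, C)) * 0.01
reduced = np.einsum('dc,chw->dhw', W1x1, feat)  # (d, H, W)

# flatten the spatial dimensions: (d, H, W) -> (d, HW)
seq = reduced.reshape(d, H * W)

# the Encoder consumes HW tokens, each a d-dimensional vector
tokens = seq.T                         # (HW, d)
print(feat.shape, reduced.shape, tokens.shape)
```

With a 640x480 input, the encoder thus sees a sequence of $20 \times 15 = 300$ tokens of dimension 256, which is what makes self-attention over the whole feature map affordable.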
Object detection results

As the table above shows, DETR's computational cost is not large, but its FPS is not high either, only on the same order as the baselines.
What about semantic segmentation?
The general framework for semantic segmentation is shown below:

Note that the box embedding depicted in the figure is essentially the decoder output (before the FFN).
A multi-head attention mechanism is then applied; it essentially performs multiple linear transformations on Q, K, and V. Here, K and V come from the Encoder's output, and Q is the Decoder's output.
M is the number of attention heads.
Afterwards, a simple CNN turns these into a mask matrix, which is used to generate the semantic segmentation result.
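A rough sketch of how per-object attention maps arise in this head follows. Every dimension here (number of queries, heads, feature-map size) is an illustrative assumption, and the FPN-style CNN that upsamples these maps into full-resolution masks is omitted:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, d, M = 5, 256, 8        # object queries, model dim, attention heads
H, W = 20, 15              # encoder feature-map resolution
dh = d // M                # per-head dimension

q = rng.standard_normal((N, d))      # decoder outputs (box embeddings)
kv = rng.standard_normal((d, H, W))  # encoder output feature map

# split channels into M heads and score each object query against every pixel
q_heads = q.reshape(N, M, dh)                   # (N, M, dh)
k_heads = kv.reshape(M, dh, H, W)               # (M, dh, H, W)
scores = np.einsum('nmd,mdhw->nmhw', q_heads, k_heads) / np.sqrt(dh)

# softmax over all H*W positions: one heatmap per object and per head
attn = softmax(scores.reshape(N, M, H * W)).reshape(N, M, H, W)
print(attn.shape)  # (5, 8, 20, 15)
```

Each object query thus produces M small attention heatmaps over the feature map; the subsequent CNN only has to refine and upsample these heatmaps into the final per-object masks.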
Semantic segmentation results analysis

We can see that, compared with PanopticFPN++, the improvement is limited; the AP in particular is unimpressive, and the overall performance is middling.
Conclusion
The paper applies the Transformer to object detection and semantic segmentation and achieves good results. Compared with FastRCNN-style architectures there is no obvious performance improvement, but it shows that this sequence model scales well to new tasks. Using one architecture to solve multiple problems, the goal of a unified model, may be just around the corner.