8 Visual Transformers Summarized (Part 2)
2022-07-27 08:59:00 【byzy】
1. Focal Transformer
Original paper: https://arxiv.org/pdf/2107.00641.pdf
Network structure

The image is first split into 4×4 patches by a Patch Embedding layer (a convolution whose kernel size and stride are both 4), and the result is fed into the Focal Transformer layers. At each later stage, the spatial size of the feature map is halved and the channel dimension is doubled.
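As a rough illustration, this patch embedding can be written as a strided convolution. The sketch below is PyTorch-style; the channel width `embed_dim=96` is an illustrative assumption, not a value taken from the text.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into 4x4 patches via a conv with kernel size and stride 4."""
    def __init__(self, in_chans=3, embed_dim=96):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=4, stride=4)

    def forward(self, x):                        # x: (B, 3, H, W)
        x = self.proj(x)                         # (B, embed_dim, H/4, W/4)
        return x.flatten(2).transpose(1, 2)      # (B, H/4 * W/4, embed_dim)
```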
Focal Self-Attention (FSA)

Conventional self-attention pays fine-grained attention to all tokens, which is time-consuming. The FSA proposed here pays fine-grained attention to information near the current token and coarse-grained attention to information far from the current token.
Attention is organized into several focal levels. Each level l has two parameters: the sub-window size s_w^l and the number of sub-windows kept horizontally and vertically s_r^l (l is the level index).
Sub-window pooling: for the feature map at each level, every sub-window is pooled to a single value through a linear layer. The pooled values are flattened into vectors, the vectors from the different levels are concatenated, and linear layers generate K and V; Q is generated by flattening the features of the original window and passing them through a linear layer.
Attention computation:

Attention(Q, K, V) = Softmax(QK^T / sqrt(d) + B) V

where B is a learnable relative position bias (similar to the one used in Swin Transformer below).
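To make the pooling step concrete, here is a minimal PyTorch-style sketch of pooling one focal level into sub-window tokens. Realizing "pooling through a linear layer" as an `nn.Linear` over the flattened sub-window positions is an assumption of this sketch, not a quote from the paper's code.

```python
import torch
import torch.nn as nn

class SubWindowPool(nn.Module):
    """Pool each sw x sw sub-window of a (B, H, W, C) feature map to one token."""
    def __init__(self, sw):
        super().__init__()
        self.sw = sw
        self.pool = nn.Linear(sw * sw, 1)        # learned pooling over window positions

    def forward(self, x):                        # x: (B, H, W, C), H and W divisible by sw
        B, H, W, C = x.shape
        sw = self.sw
        x = x.view(B, H // sw, sw, W // sw, sw, C)
        x = x.permute(0, 1, 3, 5, 2, 4).reshape(B, H // sw, W // sw, C, sw * sw)
        return self.pool(x).squeeze(-1)          # (B, H/sw, W/sw, C)
```

The pooled tokens from all levels would then be flattened, concatenated, and projected to K and V, while Q comes from the un-pooled window features.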

2. Swin Transformer
Original paper: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (IEEE Xplore)
Network structure

The image is divided into 4×4 patches, a linear embedding changes the feature dimension to C, and the result is fed into Swin-T blocks. Each later stage begins with a patch merging layer that merges every 2×2 group of patches into one and then doubles the feature dimension.
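A minimal sketch of patch merging, following the common PyTorch formulation (concatenate the 2×2 neighbors channel-wise, then project 4C to 2C with a linear layer; the LayerNorm placement is an assumption):

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Merge each 2x2 group of patches and double the channel dimension."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                  # x: (B, H, W, C) with H, W even
        x0 = x[:, 0::2, 0::2, :]           # top-left of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]           # bottom-left
        x2 = x[:, 0::2, 1::2, :]           # top-right
        x3 = x[:, 1::2, 1::2, :]           # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)    # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))        # (B, H/2, W/2, 2C)
```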
Swin-T block

The input feature map is divided into non-overlapping windows, each containing M×M patches, and self-attention is computed inside each window. Because this ignores relationships between windows, shifted windows are introduced (see the sketch below).
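A minimal sketch of window partitioning and of the cyclic shift that implements the shifted window (PyTorch-style; the attention mask that prevents attending across the wrapped-around boundary is omitted for brevity):

```python
import torch

def window_partition(x, M):
    """Split (B, H, W, C) into non-overlapping M x M windows: (num_windows * B, M, M, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M, M, C)

def cyclic_shift(x, M):
    """Roll the feature map by M // 2 so the next block sees shifted windows."""
    return torch.roll(x, shifts=(-(M // 2), -(M // 2)), dims=(1, 2))
```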


For two consecutive Swin-T blocks, the first uses regular windows (W-MSA) and the second uses shifted windows (SW-MSA):

z'_l = W-MSA(LN(z_{l-1})) + z_{l-1},  z_l = MLP(LN(z'_l)) + z'_l
z'_{l+1} = SW-MSA(LN(z_l)) + z_l,  z_{l+1} = MLP(LN(z'_{l+1})) + z'_{l+1}
Attention within each window is computed as Attention(Q, K, V) = SoftMax(QK^T / sqrt(d) + B) V, where B is the relative position bias. The paper does not explain how B is computed; the details are in the code (the article "图解 Swin Transformer" on the Tencent Cloud developer community gives an explanation).
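For reference, a minimal sketch of how the relative position bias is typically built in Swin-style code: a learnable table with (2M-1)^2 entries per head, indexed by the relative offset between every pair of positions inside a window. This follows the public implementation and is an illustration rather than a quote from the paper; M = 7 and num_heads = 3 are illustrative values.

```python
import torch
import torch.nn as nn

M, num_heads = 7, 3
# One learnable bias per relative offset (2M - 1 possible offsets per axis) and per head.
bias_table = nn.Parameter(torch.zeros((2 * M - 1) ** 2, num_heads))

coords = torch.stack(torch.meshgrid(torch.arange(M), torch.arange(M), indexing="ij"))  # (2, M, M)
coords = coords.flatten(1)                                # (2, M*M)
rel = coords[:, :, None] - coords[:, None, :]             # (2, M*M, M*M), offsets in [-(M-1), M-1]
rel = rel.permute(1, 2, 0) + (M - 1)                      # shift offsets to [0, 2M-2]
index = rel[:, :, 0] * (2 * M - 1) + rel[:, :, 1]         # (M*M, M*M) lookup indices

B = bias_table[index.view(-1)].view(M * M, M * M, num_heads)  # bias B added to QK^T / sqrt(d)
```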
3. ResT
Original paper: https://arxiv.org/pdf/2105.13677.pdf
Network structure

Each stage consists of a patch embedding module, a position encoding, and several efficient Transformer blocks. The multi-head self-attention in the efficient Transformer block is called EMSA.
EMSA

The input is first downsampled by a depth-wise convolution (kernel size, stride, and padding are s+1, s, and s/2 respectively, where s is the spatial reduction factor and h is the number of heads), and linear layers then generate K and V. The attention is computed as:

EMSA(Q, K, V) = IN(Softmax(Conv(QK^T / sqrt(d_k)))) V
Here Conv is a 1×1 convolution (applied across the attention heads so that they can interact) and IN is Instance Normalization.
Finally, the outputs of all heads are concatenated and passed through a linear layer.
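A minimal PyTorch-style sketch of EMSA along these lines; the dimensions (dim = 64, heads = 1, reduction s = 8) and the tensor bookkeeping are illustrative assumptions, not the official implementation.

```python
import torch
import torch.nn as nn

class EMSA(nn.Module):
    """Efficient multi-head self-attention: K and V come from a feature map
    downsampled by a depth-wise conv; a 1x1 conv mixes the heads of the
    attention map, followed by Softmax and Instance Normalization."""
    def __init__(self, dim=64, heads=1, s=8):
        super().__init__()
        self.heads, self.scale = heads, (dim // heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.down = nn.Conv2d(dim, dim, kernel_size=s + 1, stride=s, padding=s // 2, groups=dim)
        self.conv = nn.Conv2d(heads, heads, kernel_size=1)     # 1x1 conv across heads
        self.inorm = nn.InstanceNorm2d(heads)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):                                # x: (B, N, C) with N = H * W
        B, N, C = x.shape
        q = self.q(x).view(B, N, self.heads, C // self.heads).transpose(1, 2)
        xr = self.down(x.transpose(1, 2).view(B, C, H, W))     # depth-wise spatial reduction
        xr = xr.flatten(2).transpose(1, 2)                     # (B, N', C)
        k, v = self.kv(xr).chunk(2, dim=-1)
        k = k.view(B, -1, self.heads, C // self.heads).transpose(1, 2)
        v = v.view(B, -1, self.heads, C // self.heads).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) * self.scale          # (B, heads, N, N')
        attn = self.inorm(self.conv(attn).softmax(dim=-1))     # Conv -> Softmax -> IN, per the formula
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)      # concatenate heads
        return self.proj(out)
```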
The rest follows the conventional Transformer block with residual connections and LayerNorm, i.e.

x' = x + EMSA(LN(x)),  y = x' + FFN(LN(x'))
Stem
The stem uses three 3×3 convolution layers (strides 2, 1, 2, padding 1, with BN and ReLU in between) to reduce the spatial size to 1/4.
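A minimal sketch of such a stem (the channel widths are illustrative assumptions):

```python
import torch.nn as nn

def rest_stem(in_chans=3, out_chans=64):
    """Three 3x3 convs with strides 2, 1, 2 (padding 1), BN + ReLU in between: H, W -> H/4, W/4."""
    return nn.Sequential(
        nn.Conv2d(in_chans, out_chans // 2, 3, stride=2, padding=1),
        nn.BatchNorm2d(out_chans // 2), nn.ReLU(inplace=True),
        nn.Conv2d(out_chans // 2, out_chans // 2, 3, stride=1, padding=1),
        nn.BatchNorm2d(out_chans // 2), nn.ReLU(inplace=True),
        nn.Conv2d(out_chans // 2, out_chans, 3, stride=2, padding=1),
    )
```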
Patch embedding
This reduces the number of input tokens and increases the number of channels: a 3×3 convolution (stride 2, padding 1) halves the spatial size and doubles the channel count.
Position encoding
Pixel-wise attention is used to encode position: a 3×3 depth-wise convolution is applied and passed through a sigmoid, whose output gates the input element-wise (see the sketch below).
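A minimal sketch of this pixel-wise attention (PA) position encoding; the element-wise multiplication with the input is how pixel-wise attention is commonly realized and is an assumption of this sketch.

```python
import torch
import torch.nn as nn

class PAPositionEncoding(nn.Module):
    """Pixel-wise attention: gate each position with sigmoid(3x3 depth-wise conv)."""
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x):                  # x: (B, C, H, W)
        return x * torch.sigmoid(self.dw(x))
```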


4. VOLO
Original paper: https://arxiv.org/pdf/2106.13112.pdf
The image is first divided into patches. The network has two stages: the first stage consists of Outlookers and generates fine-grained features; the second stage uses Transformer blocks to aggregate global information. Each stage begins with a patch embedding that generates tokens.
Outlooker
Outlooker = outlook attention layer (spatial mixing) + MLP (channel mixing)

Outlook Attention

Outlook attention computes, for each spatial location (i, j), the similarity to the points in its K×K neighborhood.

Two linear layers produce the attention weights A and the values V, and A is then reshaped.

That is: given an input X of shape H×W×C, for each C-dimensional token two linear layers (with weights W_A and W_V) produce A (of shape H×W×K^4) and V (of shape H×W×C). Let V_Δ(i, j) denote all the values inside the K×K window centered at (i, j) (an Unfold operation).

The vector at position (i, j) is taken out of A and reshaped to K²×K²; after a Softmax it acts as the attention matrix over that window, giving the window output Y_Δ(i, j). Outputs that fall at the same position in different windows are then summed (a Fold operation).

Finally, the output is passed through a linear layer.
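A minimal single-head sketch of outlook attention using `nn.Unfold` / `F.fold` (PyTorch-style; the kernel size K = 3 and the projection shapes follow the description above, but treat the details as an illustrative reconstruction rather than the official implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OutlookAttention(nn.Module):
    """Single-head outlook attention: each location attends to its K x K neighborhood."""
    def __init__(self, dim, K=3):
        super().__init__()
        self.K = K
        self.v = nn.Linear(dim, dim)
        self.attn = nn.Linear(dim, K ** 4)        # per-location K^2 x K^2 attention weights
        self.proj = nn.Linear(dim, dim)
        self.unfold = nn.Unfold(K, padding=K // 2)

    def forward(self, x):                          # x: (B, H, W, C)
        B, H, W, C = x.shape
        v = self.v(x).permute(0, 3, 1, 2)          # (B, C, H, W)
        v = self.unfold(v).view(B, C, self.K ** 2, H * W)       # values in each K x K window
        v = v.permute(0, 3, 2, 1)                  # (B, H*W, K^2, C)

        a = self.attn(x).view(B, H * W, self.K ** 2, self.K ** 2)
        a = a.softmax(dim=-1)                      # attention matrix per location

        y = (a @ v).permute(0, 3, 2, 1).reshape(B, C * self.K ** 2, H * W)
        y = F.fold(y, (H, W), self.K, padding=self.K // 2)      # sum overlapping windows (Fold)
        return self.proj(y.permute(0, 2, 3, 1))    # back to (B, H, W, C)
```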
Multi-head Outlook Attention
With N heads, the weight of the linear layer that produces A grows N times, and the resulting A is split evenly into N parts (each still giving one K²×K² attention map per location). V is likewise split evenly along the channel dimension into N parts of C/N channels each. Outlook attention is then performed separately for each pair (A_n, V_n), and the N outputs are concatenated (see the sketch below).
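A rough sketch of the per-head bookkeeping (the sizes are illustrative assumptions, and the Unfold/Fold steps from the single-head sketch above are omitted):

```python
import torch

B, HW, C, N, K = 2, 196, 192, 3, 3         # illustrative sizes
A = torch.randn(B, HW, N, K**2, K**2)       # attention weights, one K^2 x K^2 map per head
V = torch.randn(B, HW, N, K**2, C // N)     # unfolded values split into N heads of C/N channels

Y = torch.matmul(A.softmax(dim=-1), V)      # (B, HW, N, K^2, C/N), one output per head
Y = Y.permute(0, 1, 3, 2, 4).reshape(B, HW, K**2, C)   # concatenate heads back to C channels
```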