8 Visual Transformers Summarized (Part 2)
2022-07-27 08:59:00 【byzy】
1. Focal Transformer
Original paper: https://arxiv.org/pdf/2107.00641.pdf
Network structure

The image is first split into 4×4 patches by a Patch Embedding layer (a convolution whose kernel size and stride are both 4), and the result is fed into the Focal Transformer layers. At each later stage, the spatial size of the feature map is halved and the channel dimension is doubled.
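As a rough illustration, this patch embedding can be written as a strided convolution. The sketch below is PyTorch-style; the channel width `embed_dim=96` is an illustrative assumption, not a value taken from the text.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into 4x4 patches via a conv with kernel size and stride 4."""
    def __init__(self, in_chans=3, embed_dim=96):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=4, stride=4)

    def forward(self, x):                        # x: (B, 3, H, W)
        x = self.proj(x)                         # (B, embed_dim, H/4, W/4)
        return x.flatten(2).transpose(1, 2)      # (B, H/4 * W/4, embed_dim)
```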
Focal Self-Attention (FSA)

Conventional self-attention pays fine-grained attention to all tokens, which is time-consuming. The FSA proposed here pays fine-grained attention to information near the current token and coarse-grained attention to information far from the current token.
Attention is organized into several focal levels. Each level l has two parameters: the sub-window size s_w^l and the number of sub-windows kept horizontally and vertically s_r^l (l is the level index).
Sub-window pooling: for the feature map at each level, every sub-window is pooled to a single value through a linear layer. The pooled values are flattened into vectors, the vectors from the different levels are concatenated, and linear layers generate K and V; Q is generated by flattening the features of the original window and passing them through a linear layer.
Attention computation:

Attention(Q, K, V) = Softmax(QK^T / sqrt(d) + B) V

where B is a learnable relative position bias (similar to the one used in Swin Transformer below).
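To make the pooling step concrete, here is a minimal PyTorch-style sketch of pooling one focal level into sub-window tokens. Realizing "pooling through a linear layer" as an `nn.Linear` over the flattened sub-window positions is an assumption of this sketch, not a quote from the paper's code.

```python
import torch
import torch.nn as nn

class SubWindowPool(nn.Module):
    """Pool each sw x sw sub-window of a (B, H, W, C) feature map to one token."""
    def __init__(self, sw):
        super().__init__()
        self.sw = sw
        self.pool = nn.Linear(sw * sw, 1)        # learned pooling over window positions

    def forward(self, x):                        # x: (B, H, W, C), H and W divisible by sw
        B, H, W, C = x.shape
        sw = self.sw
        x = x.view(B, H // sw, sw, W // sw, sw, C)
        x = x.permute(0, 1, 3, 5, 2, 4).reshape(B, H // sw, W // sw, C, sw * sw)
        return self.pool(x).squeeze(-1)          # (B, H/sw, W/sw, C)
```

The pooled tokens from all levels would then be flattened, concatenated, and projected to K and V, while Q comes from the un-pooled window features.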

2. Swin Transformer
Original paper: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows (IEEE Xplore)
Network structure

The image is divided into 4×4 patches, a linear embedding changes the feature dimension to C, and the result is fed into Swin-T blocks. Each later stage begins with a patch merging layer that merges every 2×2 group of patches into one and then doubles the feature dimension.
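A minimal sketch of patch merging, following the common PyTorch formulation (concatenate the 2×2 neighbors channel-wise, then project 4C to 2C with a linear layer; the LayerNorm placement is an assumption):

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Merge each 2x2 group of patches and double the channel dimension."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):                  # x: (B, H, W, C) with H, W even
        x0 = x[:, 0::2, 0::2, :]           # top-left of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]           # bottom-left
        x2 = x[:, 0::2, 1::2, :]           # top-right
        x3 = x[:, 1::2, 1::2, :]           # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)    # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))        # (B, H/2, W/2, 2C)
```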
Swin-T block

The input feature map is divided into non-overlapping windows, each containing M×M patches, and self-attention is computed inside each window. Because this ignores relationships between windows, shifted windows are introduced (see the sketch below).
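A minimal sketch of window partitioning and of the cyclic shift that implements the shifted window (PyTorch-style; the attention mask that prevents attending across the wrapped-around boundary is omitted for brevity):

```python
import torch

def window_partition(x, M):
    """Split (B, H, W, C) into non-overlapping M x M windows: (num_windows * B, M, M, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M, M, C)

def cyclic_shift(x, M):
    """Roll the feature map by M // 2 so the next block sees shifted windows."""
    return torch.roll(x, shifts=(-(M // 2), -(M // 2)), dims=(1, 2))
```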


For two consecutive Swin-T blocks, the first uses regular windows (W-MSA) and the second uses shifted windows (SW-MSA):

z'_l = W-MSA(LN(z_{l-1})) + z_{l-1},  z_l = MLP(LN(z'_l)) + z'_l
z'_{l+1} = SW-MSA(LN(z_l)) + z_l,  z_{l+1} = MLP(LN(z'_{l+1})) + z'_{l+1}
Attention within each window is computed as Attention(Q, K, V) = SoftMax(QK^T / sqrt(d) + B) V, where B is the relative position bias. The paper does not explain how B is computed; the details are in the code (the article "图解 Swin Transformer" on the Tencent Cloud developer community gives an explanation).
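For reference, a minimal sketch of how the relative position bias is typically built in Swin-style code: a learnable table with (2M-1)^2 entries per head, indexed by the relative offset between every pair of positions inside a window. This follows the public implementation and is an illustration rather than a quote from the paper; M = 7 and num_heads = 3 are illustrative values.

```python
import torch
import torch.nn as nn

M, num_heads = 7, 3
# One learnable bias per relative offset (2M - 1 possible offsets per axis) and per head.
bias_table = nn.Parameter(torch.zeros((2 * M - 1) ** 2, num_heads))

coords = torch.stack(torch.meshgrid(torch.arange(M), torch.arange(M), indexing="ij"))  # (2, M, M)
coords = coords.flatten(1)                                # (2, M*M)
rel = coords[:, :, None] - coords[:, None, :]             # (2, M*M, M*M), offsets in [-(M-1), M-1]
rel = rel.permute(1, 2, 0) + (M - 1)                      # shift offsets to [0, 2M-2]
index = rel[:, :, 0] * (2 * M - 1) + rel[:, :, 1]         # (M*M, M*M) lookup indices

B = bias_table[index.view(-1)].view(M * M, M * M, num_heads)  # bias B added to QK^T / sqrt(d)
```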
3. ResT
Original paper: https://arxiv.org/pdf/2105.13677.pdf
Network structure

Each stage consists of a patch embedding module, a position encoding, and several efficient Transformer blocks. The multi-head self-attention in the efficient Transformer block is called EMSA.
EMSA

The input is first downsampled by a depth-wise convolution (kernel size, stride, and padding are s+1, s, and s/2 respectively, where s is the spatial reduction factor and h is the number of heads), and linear layers then generate K and V. The attention is computed as:

EMSA(Q, K, V) = IN(Softmax(Conv(QK^T / sqrt(d_k)))) V
Here Conv is a 1×1 convolution (applied across the attention heads so that they can interact) and IN is Instance Normalization.
Finally, the outputs of all heads are concatenated and passed through a linear layer.
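A minimal PyTorch-style sketch of EMSA along these lines; the dimensions (dim = 64, heads = 1, reduction s = 8) and the tensor bookkeeping are illustrative assumptions, not the official implementation.

```python
import torch
import torch.nn as nn

class EMSA(nn.Module):
    """Efficient multi-head self-attention: K and V come from a feature map
    downsampled by a depth-wise conv; a 1x1 conv mixes the heads of the
    attention map, followed by Softmax and Instance Normalization."""
    def __init__(self, dim=64, heads=1, s=8):
        super().__init__()
        self.heads, self.scale = heads, (dim // heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.down = nn.Conv2d(dim, dim, kernel_size=s + 1, stride=s, padding=s // 2, groups=dim)
        self.conv = nn.Conv2d(heads, heads, kernel_size=1)     # 1x1 conv across heads
        self.inorm = nn.InstanceNorm2d(heads)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):                                # x: (B, N, C) with N = H * W
        B, N, C = x.shape
        q = self.q(x).view(B, N, self.heads, C // self.heads).transpose(1, 2)
        xr = self.down(x.transpose(1, 2).view(B, C, H, W))     # depth-wise spatial reduction
        xr = xr.flatten(2).transpose(1, 2)                     # (B, N', C)
        k, v = self.kv(xr).chunk(2, dim=-1)
        k = k.view(B, -1, self.heads, C // self.heads).transpose(1, 2)
        v = v.view(B, -1, self.heads, C // self.heads).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) * self.scale          # (B, heads, N, N')
        attn = self.inorm(self.conv(attn).softmax(dim=-1))     # Conv -> Softmax -> IN, per the formula
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)      # concatenate heads
        return self.proj(out)
```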
The rest follows the conventional Transformer block with residual connections and LayerNorm, i.e.

x' = x + EMSA(LN(x)),  y = x' + FFN(LN(x'))
Stem
The stem uses three 3×3 convolution layers (strides 2, 1, 2, padding 1, with BN and ReLU in between) to reduce the spatial size to 1/4.
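A minimal sketch of such a stem (the channel widths are illustrative assumptions):

```python
import torch.nn as nn

def rest_stem(in_chans=3, out_chans=64):
    """Three 3x3 convs with strides 2, 1, 2 (padding 1), BN + ReLU in between: H, W -> H/4, W/4."""
    return nn.Sequential(
        nn.Conv2d(in_chans, out_chans // 2, 3, stride=2, padding=1),
        nn.BatchNorm2d(out_chans // 2), nn.ReLU(inplace=True),
        nn.Conv2d(out_chans // 2, out_chans // 2, 3, stride=1, padding=1),
        nn.BatchNorm2d(out_chans // 2), nn.ReLU(inplace=True),
        nn.Conv2d(out_chans // 2, out_chans, 3, stride=2, padding=1),
    )
```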
Patch embedding
This reduces the number of input tokens and increases the number of channels: a 3×3 convolution (stride 2, padding 1) halves the spatial size and doubles the channel count.
Position encoding
Pixel-wise attention is used to encode position: a 3×3 depth-wise convolution is applied and passed through a sigmoid, whose output gates the input element-wise (see the sketch below).
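A minimal sketch of this pixel-wise attention (PA) position encoding; the element-wise multiplication with the input is how pixel-wise attention is commonly realized and is an assumption of this sketch.

```python
import torch
import torch.nn as nn

class PAPositionEncoding(nn.Module):
    """Pixel-wise attention: gate each position with sigmoid(3x3 depth-wise conv)."""
    def __init__(self, dim):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x):                  # x: (B, C, H, W)
        return x * torch.sigmoid(self.dw(x))
```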


4. VOLO
Original paper: https://arxiv.org/pdf/2106.13112.pdf
The image is first divided into patches. The network has two stages: the first stage consists of Outlookers and generates fine-grained features; the second stage uses Transformer blocks to aggregate global information. Each stage begins with a patch embedding that generates tokens.
Outlooker
Outlooker = outlook attention layer (spatial mixing) + MLP (channel mixing)

Outlook Attention

Outlook attention computes, for each spatial location (i, j), the similarity to the points in its K×K neighborhood.

Two linear layers produce the attention weights A and the values V, and A is then reshaped.

That is: given an input X of shape H×W×C, for each C-dimensional token two linear layers (with weights W_A and W_V) produce A (of shape H×W×K^4) and V (of shape H×W×C). Let V_Δ(i, j) denote all the values inside the K×K window centered at (i, j) (an Unfold operation).

The vector at position (i, j) is taken out of A and reshaped to K²×K²; after a Softmax it acts as the attention matrix over that window, giving the window output Y_Δ(i, j). Outputs that fall at the same position in different windows are then summed (a Fold operation).

Finally, the output is passed through a linear layer.
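A minimal single-head sketch of outlook attention using `nn.Unfold` / `F.fold` (PyTorch-style; the kernel size K = 3 and the projection shapes follow the description above, but treat the details as an illustrative reconstruction rather than the official implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OutlookAttention(nn.Module):
    """Single-head outlook attention: each location attends to its K x K neighborhood."""
    def __init__(self, dim, K=3):
        super().__init__()
        self.K = K
        self.v = nn.Linear(dim, dim)
        self.attn = nn.Linear(dim, K ** 4)        # per-location K^2 x K^2 attention weights
        self.proj = nn.Linear(dim, dim)
        self.unfold = nn.Unfold(K, padding=K // 2)

    def forward(self, x):                          # x: (B, H, W, C)
        B, H, W, C = x.shape
        v = self.v(x).permute(0, 3, 1, 2)          # (B, C, H, W)
        v = self.unfold(v).view(B, C, self.K ** 2, H * W)       # values in each K x K window
        v = v.permute(0, 3, 2, 1)                  # (B, H*W, K^2, C)

        a = self.attn(x).view(B, H * W, self.K ** 2, self.K ** 2)
        a = a.softmax(dim=-1)                      # attention matrix per location

        y = (a @ v).permute(0, 3, 2, 1).reshape(B, C * self.K ** 2, H * W)
        y = F.fold(y, (H, W), self.K, padding=self.K // 2)      # sum overlapping windows (Fold)
        return self.proj(y.permute(0, 2, 3, 1))    # back to (B, H, W, C)
```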
Multi-head Outlook Attention
With N heads, the weight of the linear layer that produces A grows N times, and the resulting A is split evenly into N parts (each still giving one K²×K² attention map per location). V is likewise split evenly along the channel dimension into N parts of C/N channels each. Outlook attention is then performed separately for each pair (A_n, V_n), and the N outputs are concatenated (see the sketch below).
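A rough sketch of the per-head bookkeeping (the sizes are illustrative assumptions, and the Unfold/Fold steps from the single-head sketch above are omitted):

```python
import torch

B, HW, C, N, K = 2, 196, 192, 3, 3         # illustrative sizes
A = torch.randn(B, HW, N, K**2, K**2)       # attention weights, one K^2 x K^2 map per head
V = torch.randn(B, HW, N, K**2, C // N)     # unfolded values split into N heads of C/N channels

Y = torch.matmul(A.softmax(dim=-1), V)      # (B, HW, N, K^2, C/N), one output per head
Y = Y.permute(0, 1, 3, 2, 4).reshape(B, HW, K**2, C)   # concatenate heads back to C channels
```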