
【ARXIV2203】SepViT: Separable Vision Transformer

2022-07-28 05:00:00 【AI frontier theory group @ouc】


1、Motivation

The authors point out that a major pain point of current Vision Transformer models is their huge resource demands. To address this, they propose the Separable Vision Transformer (SepViT), whose overall structure is shown in the figure below.

[Figure: overall structure of SepViT]

The main contributions are as follows:

Depthwise separable self-attention. It achieves local information communication within windows and global information exchange among windows in a single Transformer block.
Window token embedding. It helps model the attention relationship among windows at negligible computational cost.

2、Depthwise separable self-attention

This is very similar to the depthwise separable convolution proposed in MobileNet, and consists of two steps: Depthwise Self-Attention (DWA) and Pointwise Self-Attention (PWA). The first computes attention depthwise (within each window), and the second computes attention pointwise (across windows).

DWA is shown in the figure below. Attention is computed separately within each window, which is simple. However, computing attention pixel by pixel would make the computational complexity too high, so the authors use window token embedding. As shown in the figure, the input feature map is 6x6xC and is split into 2x2=4 windows. First, window tokens of size 4xCx1 are built. The four windows together have size 4xCx9. The two are concatenated into a 4xCx10 tensor, attention is computed separately within each of the four windows, and the final result has size 4xCx10 (containing the new window features and the window tokens).
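The shape bookkeeping in the 6x6xC example can be sketched in NumPy as follows. This is only a minimal illustration, not the authors' implementation: it assumes a single attention head, identity Q/K/V projections, and zero-initialized window tokens, and the function names (`window_partition`, `depthwise_self_attention`) are hypothetical. It also stores tensors as (windows, tokens, C) rather than the 4xCx10 layout used in the text.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_partition(x, win):
    """Split an (H, W, C) feature map into (num_windows, win*win, C)."""
    H, W, C = x.shape
    x = x.reshape(H // win, win, W // win, win, C)
    x = x.transpose(0, 2, 1, 3, 4)  # (rows of windows, cols of windows, win, win, C)
    return x.reshape(-1, win * win, C)

def depthwise_self_attention(x, window_tokens, win):
    """Attention inside each window, with one window token prepended per window."""
    windows = window_partition(x, win)                                   # (4, 9, C)
    seq = np.concatenate([window_tokens[:, None, :], windows], axis=1)   # (4, 10, C)
    # Assumption: Q = K = V = seq (no learned projections), single head.
    scores = seq @ seq.transpose(0, 2, 1) / np.sqrt(seq.shape[-1])       # (4, 10, 10)
    return softmax(scores, axis=-1) @ seq                                # (4, 10, C)

rng = np.random.default_rng(0)
C = 8
x = rng.standard_normal((6, 6, C))   # the 6x6xC input from the example
tokens = np.zeros((4, C))            # assumption: zero-initialized window tokens
out = depthwise_self_attention(x, tokens, win=3)
print(out.shape)  # (4, 10, 8)
```

Index 0 along the token axis of `out` holds the updated window token of each window, and indices 1 to 9 hold the updated pixel features.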

The computation of PWA is also interesting: the new window tokens are taken out and their pairwise similarity is computed, giving a 4x4 weight matrix. This weight matrix is then used to weight the features of the four windows, producing the final output features.
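The PWA step described above can be sketched in the same spirit. Again, this is a hedged illustration rather than the paper's code: it assumes the DWA output is stored as (windows, 1 + pixels, C) with the window token at index 0, uses scaled dot-product similarity with no learned projections or normalization layers, and the name `pointwise_self_attention` is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pointwise_self_attention(dwa_out):
    """Mix window features using attention computed among the window tokens."""
    tokens = dwa_out[:, 0, :]    # (N, C): one token per window
    feats = dwa_out[:, 1:, :]    # (N, P, C): pixel features of each window
    # Assumption: similarity is a scaled dot product of the raw tokens.
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])   # (N, N)
    weights = softmax(scores, axis=-1)                       # each row sums to 1
    # Output window i is a weighted sum of the features of all windows j.
    mixed = np.einsum("ij,jpc->ipc", weights, feats)         # (N, P, C)
    return weights, mixed

rng = np.random.default_rng(0)
dwa_out = rng.standard_normal((4, 10, 8))   # stand-in for a DWA result
weights, mixed = pointwise_self_attention(dwa_out)
print(weights.shape, mixed.shape)  # (4, 4) (4, 9, 8)
```

Because attention is computed only among the 4 window tokens, the weight matrix is 4x4 regardless of how many pixels each window contains, which is why this global exchange is cheap.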

3、Grouped Self-Attention

Borrowing the idea of group convolution, the authors extend depthwise separable self-attention and propose Grouped Self-Attention. As shown in the figure below, adjacent sub-windows are spliced together to form larger windows, which is similar to dividing the windows into groups and performing depthwise self-attention within each group of windows. In this way, Grouped Self-Attention can capture long-range visual dependencies across multiple windows. In terms of computational cost and performance gain, Grouped Self-Attention incurs some additional cost compared with depthwise separable self-attention, but also achieves better performance.
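One way to picture the grouping, and why it costs more, is the rough sketch below. It rests on assumptions not stated in the source: a 12x12 feature map, 3x3 sub-windows, groups of 2x2 adjacent sub-windows, and a cost measure that simply counts query-key pairs (window tokens and projections are ignored). The helper names are hypothetical.

```python
import numpy as np

def window_partition(x, win):
    """Split an (H, W, C) feature map into (num_windows, win*win, C)."""
    H, W, C = x.shape
    x = x.reshape(H // win, win, W // win, win, C)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, win * win, C)

def group_windows(x, win, group):
    """Splice group x group adjacent sub-windows into one larger window."""
    return window_partition(x, win * group)

def attention_pairs(windows):
    """Query-key pairs when attention runs independently inside each window."""
    n_windows, n_tokens, _ = windows.shape
    return n_windows * n_tokens ** 2

x = np.zeros((12, 12, 2))            # assumed feature-map size, for illustration
small = window_partition(x, 3)       # plain windows
grouped = group_windows(x, 3, 2)     # each group covers 2x2 sub-windows
print(small.shape, attention_pairs(small))      # (16, 9, 2) 1296
print(grouped.shape, attention_pairs(grouped))  # (4, 36, 2) 5184
```

Under these assumptions each group attends over four times as many pixels as a single sub-window, which matches the qualitative trade-off described above: a wider receptive field in exchange for more attention computation.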

[Figure: Grouped Self-Attention]