【CVPR2022】Lite Vision Transformer with Enhanced Self-Attention

2022-07-28 05:00:00 【AI frontier theory group @ouc】

Paper: https://readpaper.com/paper/633541619879256064
Code: https://github.com/Chenglin-Yang/LVT

1、 Research motivation

Although ViT models are effective across many vision tasks, current lightweight ViT models still handle local regions poorly. The authors argue that the self-attention mechanism is limited in shallower and thinner networks. They therefore propose Lite Vision Transformer (LVT), a light yet effective vision transformer that can run on mobile devices. It keeps the standard four-stage structure, but its parameter count is comparable to that of MobileNetV2 and PVTv2-B0. The main contributions are two new attention modules: Convolutional Self-Attention (CSA) and Recursive Atrous Self-Attention (RASA). Both are introduced below.

2、Convolutional Self-Attention (CSA)

The basic procedure is:

Compute the similarity (attn in the code): a 1x1 convolution maps the (hw/4, c) matrix to (hw/4, k^2, k^2).
Compute V: generate a (hw/4, c, k^2) matrix, then reshape it and apply a 1x1 convolution to change the number of channels (drawn as BMM in the paper's figure), giving a (hw/4, k^2, c_out) matrix.
Matrix multiplication: multiply the similarity by V to obtain a (hw/4, k^2, c_out) matrix.
Apply the fold operation to obtain the output.
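
To make the shapes in the four steps concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. Several details are assumptions of this sketch: the query for each window is taken from the window's centre pixel, a softmax is applied to each row of the similarity (as VOLO's outlook attention does), the kernel is 3x3 with stride 2 and padding 1, and the names csa_sketch, w_attn and w_v are hypothetical.

```python
import math
import random

def matvec(weight, vec):
    """Multiply an (out, in) nested-list matrix by an (in,) vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def csa_sketch(x, w_attn, w_v, k=3, stride=2, pad=1):
    """x is an (h, w, c) nested list; returns per-window attention and the folded output."""
    height, width, c_in = len(x), len(x[0]), len(x[0][0])
    c_out, k2 = len(w_v), k * k
    out = [[[0.0] * c_out for _ in range(width)] for _ in range(height)]
    attn_all = []
    for ci in range(0, height, stride):          # hw/4 windows when stride is 2
        for cj in range(0, width, stride):
            # Step 1: similarity. Project c -> k^4 logits, reshape to (k^2, k^2).
            # Assumption: the query is the window's centre pixel.
            logits = matvec(w_attn, x[ci][cj])
            attn = [softmax(logits[p * k2:(p + 1) * k2]) for p in range(k2)]
            attn_all.append(attn)
            # Step 2: V. Gather the k*k neighbours (zeros outside the map),
            # then project each one from c to c_out channels.
            coords = [(ci - pad + di, cj - pad + dj) for di in range(k) for dj in range(k)]
            v = []
            for row, col in coords:
                inside = 0 <= row < height and 0 <= col < width
                v.append(matvec(w_v, x[row][col] if inside else [0.0] * c_in))
            # Step 3: (k^2, k^2) @ (k^2, c_out) -> (k^2, c_out).
            y = [[sum(attn[p][q] * v[q][ch] for q in range(k2)) for ch in range(c_out)]
                 for p in range(k2)]
            # Step 4: fold. Scatter-add each row back to its pixel; overlaps are summed.
            for p, (row, col) in enumerate(coords):
                if 0 <= row < height and 0 <= col < width:
                    for ch in range(c_out):
                        out[row][col][ch] += y[p][ch]
    return attn_all, out

random.seed(0)
H, W, C, C_OUT, K = 4, 4, 8, 6, 3
x = [[[random.gauss(0, 1) for _ in range(C)] for _ in range(W)] for _ in range(H)]
w_attn = [[random.gauss(0, 0.1) for _ in range(C)] for _ in range(K ** 4)]
w_v = [[random.gauss(0, 0.1) for _ in range(C)] for _ in range(C_OUT)]
attn_all, out = csa_sketch(x, w_attn, w_v, k=K)
print(len(attn_all), len(attn_all[0]), len(attn_all[0][0]))   # 4 9 9
print(len(out), len(out[0]), len(out[0][0]))                  # 4 4 6
```

For a 4x4 input there are hw/4 = 4 windows, each with a (k^2, k^2) = (9, 9) similarity matrix, which matches the shapes listed above. A real implementation would use batched tensor operations (unfold, batched matrix multiply, fold) rather than Python loops.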

In terms of code, the CSA implementation is more complicated than VOLO's, but the two do not seem to differ in essence (I may be misunderstanding something). I also find CSA less concise than VOLO. Interested readers can refer to the paper "VOLO: Vision Outlooker for Visual Recognition" and its online code.

3、Recursive Atrous Self-Attention (RASA)

ASA is introduced first. It differs from ordinary attention in how Q is computed: the authors apply multi-scale atrous (dilated) convolutions. The convolution weights are shared across the scales, which reduces the number of parameters.
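
The weight sharing can be illustrated with a small single-channel sketch: one 3x3 kernel is reused at several dilation rates and the branch outputs are summed. The dilation rates (1, 3, 5), the SiLU activation, and the function names are assumptions of this sketch, not details stated above.

```python
import math

def dilated_conv3x3(x, kernel, rate):
    """3x3 convolution with dilation `rate` and zero padding, same-size output."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for a in range(3):
                for b in range(3):
                    r, c = i + (a - 1) * rate, j + (b - 1) * rate
                    if 0 <= r < h and 0 <= c < w:   # taps outside the map count as zero
                        acc += kernel[a][b] * x[r][c]
            out[i][j] = acc
    return out

def silu(t):
    return t / (1.0 + math.exp(-t))

def atrous_query(q, kernel, rates=(1, 3, 5), activation=silu):
    """Sum the activated outputs of one shared kernel applied at several dilation rates."""
    h, w = len(q), len(q[0])
    total = [[0.0] * w for _ in range(h)]
    for rate in rates:
        branch = dilated_conv3x3(q, kernel, rate)   # the same `kernel` for every rate
        for i in range(h):
            for j in range(w):
                total[i][j] += activation(branch[i][j])
    return total

q = [[float(i * 5 + j) for j in range(5)] for i in range(5)]
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
# With an identity kernel and no activation, every branch returns q, so the sum is 3 * q.
summed = atrous_query(q, identity, activation=lambda t: t)
shared_params = 9          # one 3x3 kernel reused by all three rates
unshared_params = 9 * 3    # what three independent kernels would cost
```

Because the kernel is shared, adding more dilation rates widens the receptive field without adding parameters.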

In addition, the authors apply the operation recursively: within each block, ASA is iterated twice.
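
The effect of the recursion on parameter count can be shown with a toy example: the same module is called twice, so the effective depth doubles while the weights stay the same. ScaleModule and recursive_apply are hypothetical names, and the sketch only feeds each output straight back in; how the paper combines the input with the previous state may differ.

```python
class ScaleModule:
    """Toy stand-in for an ASA layer: a single weight, and a counter of its calls."""

    def __init__(self, weight):
        self.weight = weight
        self.calls = 0

    def __call__(self, values):
        self.calls += 1
        return [self.weight * v for v in values]

def recursive_apply(module, values, steps=2):
    """Apply the same module `steps` times, feeding each output back in."""
    hidden = values
    for _ in range(steps):
        hidden = module(hidden)
    return hidden

layer = ScaleModule(weight=0.5)
result = recursive_apply(layer, [4.0, 8.0], steps=2)
print(result, layer.calls)  # [1.0, 2.0] 2
```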

4、 Experimental analysis

The network uses a four-stage structure. The first stage uses CSA, and the remaining stages use RASA.

The ImageNet experiments show that, when the parameter count is comparable to that of MobileNetV2 and PVTv2-B0, this method achieves significantly higher accuracy. Likewise, when the model is scaled up to a parameter count close to that of ResNet50, its performance significantly exceeds that of current methods.