AAAI2022 - ShiftViT: When Shift Operation Meets Vision Transformer
2022-06-11 04:54:00 【Shenlan Shenyan AI】

Paper: [AAAI 2022] When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism
Code: https://github.com/microsoft/SPACH
Author's walkthrough video (Bilibili): https://www.bilibili.com/video/BV1a3411h7su
Research motivation
This work replaces attention with an extremely simple operation and still achieves very good results. To motivate it, the authors argue that the Transformer's success hinges on two properties:
Global: fast global modeling; every token can interact with every other token.
Dynamic: a set of weights is learned dynamically for each sample.
Their motivating question: can attention be replaced in a simpler way? In the extreme, with no globality, no dynamics, and even no parameters and no arithmetic?
To this end, the authors propose the shift block. It is very simple: in essence, a plain shift operation applied to part of the feature channels takes the place of self-attention.
Method
As shown in the figure below, a standard Transformer block first applies attention and then an FFN. The authors propose replacing the attention with a shift block. The module is very simple: given an input feature of shape C×H×W, a portion of the channels is taken along the C dimension and split evenly into 4 groups, which are shifted left, right, up, and down respectively; the remaining channels stay unchanged.

In the authors' implementation, the shift step is set to 1 pixel, and 1/3 of the channels are shifted (1/12 of the channels move left by 1 pixel, 1/12 move right by 1 pixel, 1/12 move up by 1 pixel, and 1/12 move down by 1 pixel). The operation takes only a few lines of PyTorch; the computation is trivial and the module has essentially no parameters.
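The official implementation (in the SPACH repo) is in PyTorch; the following is an equivalent NumPy sketch of the shift operation, with vacated border positions zero-filled. The function name and the `gamma` parameter are illustrative, not the authors' exact code.

```python
import numpy as np

def shift_feat(x, gamma=1/12):
    """Shift a fraction gamma of the channels by 1 pixel in each of 4 directions.

    x: array of shape (B, C, H, W). Vacated positions are zero-filled;
    channels beyond the first 4*gamma*C are left untouched.
    """
    B, C, H, W = x.shape
    g = int(C * gamma)
    out = np.zeros_like(x)
    out[:, 0*g:1*g, :, :-1] = x[:, 0*g:1*g, :, 1:]   # shift left
    out[:, 1*g:2*g, :, 1:]  = x[:, 1*g:2*g, :, :-1]  # shift right
    out[:, 2*g:3*g, :-1, :] = x[:, 2*g:3*g, 1:, :]   # shift up
    out[:, 3*g:4*g, 1:, :]  = x[:, 3*g:4*g, :-1, :]  # shift down
    out[:, 4*g:] = x[:, 4*g:]                        # the rest unchanged
    return out
```

With gamma = 1/12, a total of 4 × 1/12 = 1/3 of the channels are shifted, matching the paper's default setting.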

At the architecture level, the method targets Swin Transformer: apart from replacing the attention module with the shift block, everything else is kept exactly the same.
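A minimal sketch of what one block looks like after the swap: the zero-parameter shift in place of attention, followed by the usual pre-norm FFN with a residual connection. The shift is repeated here so the snippet stands alone; the weight shapes, the non-affine LayerNorm, and the tanh-approximate GELU are simplifications for illustration, not the authors' exact code.

```python
import numpy as np

def shift(x, gamma=1/12):
    """Zero-padded 1-pixel shift applied to 4*gamma of the channels."""
    B, C, H, W = x.shape
    g = int(C * gamma)
    out = np.zeros_like(x)
    out[:, 0*g:1*g, :, :-1] = x[:, 0*g:1*g, :, 1:]   # left
    out[:, 1*g:2*g, :, 1:]  = x[:, 1*g:2*g, :, :-1]  # right
    out[:, 2*g:3*g, :-1, :] = x[:, 2*g:3*g, 1:, :]   # up
    out[:, 3*g:4*g, 1:, :]  = x[:, 3*g:4*g, :-1, :]  # down
    out[:, 4*g:] = x[:, 4*g:]
    return out

def layer_norm(x, eps=1e-6):
    # per-position LN over the channel axis (no learned affine, for brevity)
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def shift_block(x, w1, b1, w2, b2):
    """One ShiftViT-style block: shift replaces attention, then a pre-norm FFN."""
    x = shift(x)
    h = layer_norm(x)
    # pointwise MLP applied per spatial location: C -> D -> C
    h = np.einsum('bchw,cd->bdhw', h, w1) + b1[None, :, None, None]
    h = 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))  # GELU (tanh approx.)
    h = np.einsum('bdhw,dc->bchw', h, w2) + b2[None, :, None, None]
    return x + h  # residual connection
```

Note that the block has parameters only in the FFN; the token-mixing step itself is parameter-free, which is what makes the direct replacement lighter than Swin.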

The direct replacement yields ShiftViT/light, whose parameter count is significantly reduced. To stay comparable to Swin Transformer, the authors add 6 extra blocks to stage 3 and 1 to stage 4, obtaining Shift-T, a model with essentially the same number of parameters as Swin-T, as shown in the table below.

Experimental results
The table below lists only the ImageNet image-classification results. The direct replacement degrades performance, but with the added blocks the Shift-T model improves on the baseline, while the S and B models are slightly worse. The authors also run object-detection and semantic-segmentation experiments and conclude that performance is roughly on par with Swin, with ShiftViT showing a clearer advantage at smaller model sizes.

The ablation study analyzes many factors; here we only highlight the experiment on the shift block's single hyper-parameter, the proportion of shifted channels. When the proportion is too small, performance falls below Swin-T; it is best when set to 1/3.

The authors also run an interesting "training scheme" experiment, analyzing the tricks that may underlie the Transformer performance breakthrough. Replacing SGD with Adam, ReLU with GELU, and BN with LN, as well as increasing the number of epochs, all improve performance. This suggests these factors may also be key to ViT's success.

Summary
The authors draw two lessons: 1) self-attention may not be the key to ViT's success; a simple channel-shift operation can even outperform Swin Transformer at small model sizes. 2) ViT's training recipe (Adam, GELU, LN, etc.) is key to the performance gains.
Author: peak OUC