AAAI2022 - ShiftViT: When Shift Operation Meets Vision Transformer
2022-06-11 04:54:00 【Shenlan Shenyan AI】

Paper: [AAAI 2022] When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism
Code: https://github.com/microsoft/SPACH
Author's walkthrough video (Bilibili): https://www.bilibili.com/video/BV1a3411h7su
Research motivation
This work replaces attention with an extremely simple operation and still achieves very good results. To motivate it, the authors argue that the Transformer's success hinges on two properties:
Global: fast global modeling; every token can interact with every other token.
Dynamic: a set of weights is learned dynamically for each sample.
Their motivating question: can attention be replaced in a simpler way? In the extreme, with no globality, no dynamics, and even no parameters and no arithmetic?
To this end, the authors propose the shift block. It is very simple: in essence, a plain shift operation applied to part of the feature channels takes the place of self-attention.
Method
As shown in the figure below, a standard Transformer block first applies attention and then an FFN. The authors propose replacing the attention with a shift block. The module is very simple: given an input feature of shape C×H×W, a portion of the channels is taken along the C dimension and split evenly into 4 groups, which are shifted left, right, up, and down respectively; the remaining channels stay unchanged.

In the authors' implementation, the shift step is set to 1 pixel, and 1/3 of the channels are shifted (1/12 of the channels move left by 1 pixel, 1/12 move right by 1 pixel, 1/12 move up by 1 pixel, and 1/12 move down by 1 pixel). The operation takes only a few lines of PyTorch; the computation is trivial and the module has essentially no parameters.
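The official implementation (in the SPACH repo) is in PyTorch; the following is an equivalent NumPy sketch of the shift operation, with vacated border positions zero-filled. The function name and the `gamma` parameter are illustrative, not the authors' exact code.

```python
import numpy as np

def shift_feat(x, gamma=1/12):
    """Shift a fraction gamma of the channels by 1 pixel in each of 4 directions.

    x: array of shape (B, C, H, W). Vacated positions are zero-filled;
    channels beyond the first 4*gamma*C are left untouched.
    """
    B, C, H, W = x.shape
    g = int(C * gamma)
    out = np.zeros_like(x)
    out[:, 0*g:1*g, :, :-1] = x[:, 0*g:1*g, :, 1:]   # shift left
    out[:, 1*g:2*g, :, 1:]  = x[:, 1*g:2*g, :, :-1]  # shift right
    out[:, 2*g:3*g, :-1, :] = x[:, 2*g:3*g, 1:, :]   # shift up
    out[:, 3*g:4*g, 1:, :]  = x[:, 3*g:4*g, :-1, :]  # shift down
    out[:, 4*g:] = x[:, 4*g:]                        # the rest unchanged
    return out
```

With gamma = 1/12, a total of 4 × 1/12 = 1/3 of the channels are shifted, matching the paper's default setting.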

At the architecture level, the method targets Swin Transformer: apart from replacing the attention module with the shift block, everything else is kept exactly the same.
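A minimal sketch of what one block looks like after the swap: the zero-parameter shift in place of attention, followed by the usual pre-norm FFN with a residual connection. The shift is repeated here so the snippet stands alone; the weight shapes, the non-affine LayerNorm, and the tanh-approximate GELU are simplifications for illustration, not the authors' exact code.

```python
import numpy as np

def shift(x, gamma=1/12):
    """Zero-padded 1-pixel shift applied to 4*gamma of the channels."""
    B, C, H, W = x.shape
    g = int(C * gamma)
    out = np.zeros_like(x)
    out[:, 0*g:1*g, :, :-1] = x[:, 0*g:1*g, :, 1:]   # left
    out[:, 1*g:2*g, :, 1:]  = x[:, 1*g:2*g, :, :-1]  # right
    out[:, 2*g:3*g, :-1, :] = x[:, 2*g:3*g, 1:, :]   # up
    out[:, 3*g:4*g, 1:, :]  = x[:, 3*g:4*g, :-1, :]  # down
    out[:, 4*g:] = x[:, 4*g:]
    return out

def layer_norm(x, eps=1e-6):
    # per-position LN over the channel axis (no learned affine, for brevity)
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def shift_block(x, w1, b1, w2, b2):
    """One ShiftViT-style block: shift replaces attention, then a pre-norm FFN."""
    x = shift(x)
    h = layer_norm(x)
    # pointwise MLP applied per spatial location: C -> D -> C
    h = np.einsum('bchw,cd->bdhw', h, w1) + b1[None, :, None, None]
    h = 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))  # GELU (tanh approx.)
    h = np.einsum('bdhw,dc->bchw', h, w2) + b2[None, :, None, None]
    return x + h  # residual connection
```

Note that the block has parameters only in the FFN; the token-mixing step itself is parameter-free, which is what makes the direct replacement lighter than Swin.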

The direct replacement yields ShiftViT/light, whose parameter count is significantly reduced. To stay comparable to Swin Transformer, the authors add 6 extra blocks to stage 3 and 1 to stage 4, obtaining Shift-T, a model with essentially the same number of parameters as Swin-T, as shown in the table below.

Experimental results
The table below lists only the ImageNet image-classification results. The direct replacement degrades performance, but with the added blocks the Shift-T model improves on the baseline, while the S and B models are slightly worse. The authors also run object-detection and semantic-segmentation experiments and conclude that performance is roughly on par with Swin, with ShiftViT showing a clearer advantage at smaller model sizes.

The ablation study analyzes many factors; here we only highlight the experiment on the shift block's single hyper-parameter, the proportion of shifted channels. When the proportion is too small, performance falls below Swin-T; it is best when set to 1/3.

The authors also run an interesting "training scheme" experiment, analyzing the tricks that may underlie the Transformer performance breakthrough. Replacing SGD with Adam, ReLU with GELU, and BN with LN, as well as increasing the number of epochs, all improve performance. This suggests these factors may also be key to ViT's success.

Summary
The authors draw two lessons: 1) self-attention may not be the key to ViT's success; a simple channel-shift operation can even outperform Swin Transformer at small model sizes. 2) ViT's training recipe (Adam, GELU, LN, etc.) is key to the performance gains.
Author: peak OUC