【Transformer】AdaViT: Adaptive Vision Transformers for Efficient Image Recognition
2022-07-29 06:03:00 【Dull cat】

一、Background
Transformers have achieved strong results in many fields, but their computational cost grows substantially as the number of patches, self-attention heads, and transformer blocks increases.
The authors raise two questions:
Do all patches need to pass through the entire network to obtain good classification results?
Do all self-attention heads need to be used to capture the underlying relations in the whole image?
The authors argue that only difficult cases, such as images with complex backgrounds or severe occlusion, need many patches and self-attention blocks, while simple samples can reach sufficiently good results with only a few.
Based on this, the authors build a dynamic-computation framework that learns which patches and which self-attention heads/blocks need to be kept. The network therefore uses fewer patches and self-attention layers for simple samples and the full network for hard samples.
The proposed Adaptive Vision Transformer (AdaViT) is an end-to-end framework that dynamically decides which patches, self-attention heads, and transformer blocks should be kept.
AdaViT improves speed by about 2x at the cost of only a 0.8% drop in classification accuracy, striking a good balance between accuracy and efficiency.

二、Method

1、Decision Network
The authors insert a lightweight multi-head subnetwork, the decision network, into each transformer block. It learns binary decisions that determine whether each patch embedding, each self-attention head, and the block itself are used.
The decision network of the $l$-th block consists of three linear layers with parameters $W_l=\{W_l^p, W_l^h, W_l^b\}$, which predict whether to keep each patch, each attention head, and the transformer block, respectively.
Therefore, for the block input $Z_l$, the decision values are computed as $m_l = (m_l^p, m_l^h, m_l^b) = (W_l^p Z_l, W_l^h Z_l, W_l^b Z_l)$:
- $N$ and $H$ denote the number of patches and the number of self-attention heads in a transformer block, respectively. The three values in $m_l$ are passed through a sigmoid to give the probabilities of keeping each patch, each attention head, and the transformer block.
Because the decisions must be binary, keep/drop is determined by thresholding at inference time.
However, since the optimal threshold differs across samples, the authors define random variables $M_l^p, M_l^h, M_l^b$ sampled from $m_l^p, m_l^h, m_l^b$: if $M_{l,j}^p = 1$, the $j$-th patch embedding of the $l$-th block is kept, and if $M_{l,j}^p = 0$ it is discarded. The Gumbel-Softmax trick [25] is used so that the sampling remains differentiable during training.
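To make this concrete, here is a minimal PyTorch sketch of such a per-block decision network (hypothetical code, not the authors' implementation; the name `DecisionNet` and the use of the CLS token for the head/block decisions are assumptions of this sketch). It outputs binary keep decisions for the patch tokens, the attention heads, and the block's MSA/FFN sub-layers, sampled with the Gumbel-Softmax trick; at inference these samples would be replaced by the thresholding described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecisionNet(nn.Module):
    """Hypothetical per-block decision network in the spirit of AdaViT.

    Given the block input Z_l of shape (B, N, D) with the CLS token at
    index 0, it predicts binary keep/drop decisions for the N-1 patch
    tokens, the H attention heads, and the block's two sub-layers
    (MSA and FFN), using a straight-through Gumbel-Softmax so the
    discrete decisions stay differentiable during training.
    """

    def __init__(self, dim: int, num_heads: int, tau: float = 1.0):
        super().__init__()
        self.tau = tau
        self.patch_fc = nn.Linear(dim, 1)          # W_l^p: one logit per token
        self.head_fc = nn.Linear(dim, num_heads)   # W_l^h: one logit per head
        self.block_fc = nn.Linear(dim, 2)          # W_l^b: logits for MSA / FFN

    def _sample(self, logits: torch.Tensor) -> torch.Tensor:
        # Treat each keep-logit as a 2-class problem and draw a hard {0, 1}
        # sample with the straight-through Gumbel-Softmax estimator.
        two_class = torch.stack([logits, -logits], dim=-1)
        sample = F.gumbel_softmax(two_class, tau=self.tau, hard=True, dim=-1)
        return sample[..., 0]                      # 1 = keep, 0 = drop

    def forward(self, z: torch.Tensor):
        # Using the CLS token to drive the head/block decisions is an
        # assumption of this sketch, not necessarily the paper's choice.
        cls = z[:, 0]                                    # (B, D)
        m_patch = self.patch_fc(z[:, 1:]).squeeze(-1)    # (B, N-1)
        m_head = self.head_fc(cls)                       # (B, H)
        m_block = self.block_fc(cls)                     # (B, 2)
        return self._sample(m_patch), self._sample(m_head), self._sample(m_block)


if __name__ == "__main__":
    z = torch.randn(2, 197, 768)              # ViT-B/16: 196 patches + CLS
    keep_patch, keep_head, keep_block = DecisionNet(768, 12)(z)
    print(keep_patch.shape, keep_head.shape, keep_block.shape)  # (2,196) (2,12) (2,2)
```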
2、Patch Selection
At the input of each transformer block, the authors want to keep only the informative patch embeddings.
For the $l$-th block, the $j$-th patch embedding is discarded if $M_{l,j}^p = 0$.
- The class token $z_{l,cls}$ is always kept, since it is used for classification.
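A minimal sketch of applying such a patch mask, assuming the training-time behaviour is realized by multiplicative masking rather than by physically removing tokens (an assumption of this sketch, not a statement about the paper's code):

```python
import torch


def apply_patch_mask(z: torch.Tensor, keep_patch: torch.Tensor) -> torch.Tensor:
    """Zero out dropped patch embeddings while always keeping the CLS token.

    z:          (B, N, D) token embeddings, CLS token at index 0.
    keep_patch: (B, N-1) binary keep decisions M^p for the patch tokens.

    Multiplicative masking keeps the tensor shape fixed so gradients can
    flow during training; at inference the zeroed tokens could instead be
    gathered out so that later blocks really process fewer tokens.
    """
    cls_keep = torch.ones_like(keep_patch[:, :1])    # CLS is never dropped
    mask = torch.cat([cls_keep, keep_patch], dim=1)  # (B, N)
    return z * mask.unsqueeze(-1)                    # broadcast over D


if __name__ == "__main__":
    z = torch.randn(2, 197, 768)
    keep = (torch.rand(2, 196) > 0.5).float()
    print(apply_patch_mask(z, keep).shape)           # torch.Size([2, 197, 768])
```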
3、Head Selection
Different heads in multi-head attention focus on different regions and capture different kinds of information.
To improve inference speed, some heads are adaptively discarded, i.e., deactivated. The authors explore two deactivation strategies (a combined sketch is given after the second strategy):
1、Partial deactivation
The attention of the $i$-th head in the $l$-th block is computed as follows:

2、Full deactivation
In full deactivation, the deactivated heads are removed entirely, and the output embedding size of the MSA is reduced accordingly:
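Since the post omits the paper's exact equations, the sketch below only illustrates the general idea for a single sample: masking the output of deactivated heads (one plausible reading of partial deactivation) versus removing those heads entirely so the MSA output width shrinks (full deactivation). Treat the masking choice as an assumption, not the paper's formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def deactivate_heads(q, k, v, keep_head, out_proj: nn.Linear):
    """Illustrative head deactivation for a single sample.

    q, k, v:   (H, N, d) per-head queries / keys / values.
    keep_head: (H,) binary keep decisions M^h for the heads.

    "Partial" deactivation is shown here as zeroing the output of dropped
    heads while keeping the layout and the output projection unchanged;
    "full" deactivation gathers only the kept heads, so the concatenated
    MSA output (and its projection) would shrink.
    """
    attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    per_head = attn @ v                                   # (H, N, d)

    # Partial deactivation: mask each head's output with its keep decision.
    partial = per_head * keep_head.view(-1, 1, 1)
    n_tokens = q.shape[1]
    partial_out = out_proj(partial.transpose(0, 1).reshape(n_tokens, -1))

    # Full deactivation: drop the heads entirely; a narrower projection
    # matching the reduced width would be needed (omitted here).
    full = per_head[keep_head.bool()]                     # (H_kept, N, d)
    return partial_out, full


if __name__ == "__main__":
    H, N, d = 12, 197, 64
    q, k, v = (torch.randn(H, N, d) for _ in range(3))
    keep = (torch.rand(H) > 0.5).float()
    out, kept = deactivate_heads(q, k, v, keep, nn.Linear(H * d, 768))
    print(out.shape, kept.shape)
```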

4、Block Selection
Skipping unnecessary transformer blocks can also save a large amount of computation. To make skipping more flexible, the authors allow the MSA and the FFN inside a transformer block to be skipped independently rather than being tied together, as in the sketch below.
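A minimal sketch of such a block (hypothetical code): multiplying each residual branch by its binary decision is one simple way to realize the independent skipping of MSA and FFN described above.

```python
import torch
import torch.nn as nn


class SkippableBlock(nn.Module):
    """Transformer block whose MSA and FFN can be skipped independently.

    Names and details are illustrative; gating each residual branch with
    its binary decision (M^b for MSA / FFN) is one simple way to realize
    the independent skipping described above.
    """

    def __init__(self, dim: int = 768, num_heads: int = 12, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, z, keep_msa, keep_ffn):
        # keep_msa / keep_ffn: (B,) binary decisions; 0 disables the branch,
        # which is equivalent to skipping that sub-layer for the sample.
        h = self.norm1(z)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        z = z + keep_msa.view(-1, 1, 1) * attn_out
        z = z + keep_ffn.view(-1, 1, 1) * self.ffn(self.norm2(z))
        return z


if __name__ == "__main__":
    block = SkippableBlock()
    z = torch.randn(2, 197, 768)
    keep_msa = torch.tensor([1.0, 0.0])    # skip the MSA for the second sample
    keep_ffn = torch.tensor([1.0, 1.0])
    print(block(z, keep_msa, keep_ffn).shape)   # torch.Size([2, 197, 768])
```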

三、Results