【Transformer】AdaViT: Adaptive Vision Transformers for Efficient Image Recognition
2022-07-29 06:03:00 【Dull cat】

一、Background
Transformers have achieved strong results in many fields, but their computational cost grows substantially as the number of patches, self-attention heads, and transformer blocks increases.
The authors raise two questions:
Do all patches need to pass through the entire network to obtain good classification results?
Do all self-attention heads need to be used to capture the underlying relations in the whole image?
The authors argue that only difficult cases, such as images with complex backgrounds or severe occlusion, need many patches and self-attention blocks, while simple samples can reach sufficiently good results with only a few.
Based on this, the authors build a dynamic-computation framework that learns which patches and which self-attention heads/blocks need to be kept. The network therefore uses fewer patches and self-attention layers for simple samples and the full network for hard samples.
The proposed Adaptive Vision Transformer (AdaViT) is an end-to-end framework that dynamically decides which patches, self-attention heads, and transformer blocks should be kept.
AdaViT improves speed by about 2x at the cost of only a 0.8% drop in classification accuracy, striking a good balance between accuracy and efficiency.

二、Method

1、Decision Network
The authors insert a lightweight multi-head subnetwork, the decision network, into each transformer block. It learns binary decisions that determine whether each patch embedding, each self-attention head, and the block itself are used.
The decision network of the $l$-th block consists of three linear layers with parameters $W_l=\{W_l^p, W_l^h, W_l^b\}$, which predict whether to keep each patch, each attention head, and the transformer block, respectively.
Therefore, for the block input $Z_l$, the decision values are computed as $m_l = (m_l^p, m_l^h, m_l^b) = (W_l^p Z_l, W_l^h Z_l, W_l^b Z_l)$:
- $N$ and $H$ denote the number of patches and the number of self-attention heads in a transformer block, respectively. The three values in $m_l$ are passed through a sigmoid to give the probabilities of keeping each patch, each attention head, and the transformer block.
Because the decisions must be binary, keep/drop is determined by thresholding at inference time.
However, since the optimal threshold differs across samples, the authors define random variables $M_l^p, M_l^h, M_l^b$ sampled from $m_l^p, m_l^h, m_l^b$: if $M_{l,j}^p = 1$, the $j$-th patch embedding of the $l$-th block is kept, and if $M_{l,j}^p = 0$ it is discarded. The Gumbel-Softmax trick [25] is used so that the sampling remains differentiable during training.
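To make this concrete, here is a minimal PyTorch sketch of such a per-block decision network (hypothetical code, not the authors' implementation; the name `DecisionNet` and the use of the CLS token for the head/block decisions are assumptions of this sketch). It outputs binary keep decisions for the patch tokens, the attention heads, and the block's MSA/FFN sub-layers, sampled with the Gumbel-Softmax trick; at inference these samples would be replaced by the thresholding described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecisionNet(nn.Module):
    """Hypothetical per-block decision network in the spirit of AdaViT.

    Given the block input Z_l of shape (B, N, D) with the CLS token at
    index 0, it predicts binary keep/drop decisions for the N-1 patch
    tokens, the H attention heads, and the block's two sub-layers
    (MSA and FFN), using a straight-through Gumbel-Softmax so the
    discrete decisions stay differentiable during training.
    """

    def __init__(self, dim: int, num_heads: int, tau: float = 1.0):
        super().__init__()
        self.tau = tau
        self.patch_fc = nn.Linear(dim, 1)          # W_l^p: one logit per token
        self.head_fc = nn.Linear(dim, num_heads)   # W_l^h: one logit per head
        self.block_fc = nn.Linear(dim, 2)          # W_l^b: logits for MSA / FFN

    def _sample(self, logits: torch.Tensor) -> torch.Tensor:
        # Treat each keep-logit as a 2-class problem and draw a hard {0, 1}
        # sample with the straight-through Gumbel-Softmax estimator.
        two_class = torch.stack([logits, -logits], dim=-1)
        sample = F.gumbel_softmax(two_class, tau=self.tau, hard=True, dim=-1)
        return sample[..., 0]                      # 1 = keep, 0 = drop

    def forward(self, z: torch.Tensor):
        # Using the CLS token to drive the head/block decisions is an
        # assumption of this sketch, not necessarily the paper's choice.
        cls = z[:, 0]                                    # (B, D)
        m_patch = self.patch_fc(z[:, 1:]).squeeze(-1)    # (B, N-1)
        m_head = self.head_fc(cls)                       # (B, H)
        m_block = self.block_fc(cls)                     # (B, 2)
        return self._sample(m_patch), self._sample(m_head), self._sample(m_block)


if __name__ == "__main__":
    z = torch.randn(2, 197, 768)              # ViT-B/16: 196 patches + CLS
    keep_patch, keep_head, keep_block = DecisionNet(768, 12)(z)
    print(keep_patch.shape, keep_head.shape, keep_block.shape)  # (2,196) (2,12) (2,2)
```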
2、Patch Selection
At the input of each transformer block, the authors want to keep only the informative patch embeddings.
For the $l$-th block, the $j$-th patch embedding is discarded if $M_{l,j}^p = 0$.
- The class token $z_{l,cls}$ is always kept, since it is used for classification.
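A minimal sketch of applying such a patch mask, assuming the training-time behaviour is realized by multiplicative masking rather than by physically removing tokens (an assumption of this sketch, not a statement about the paper's code):

```python
import torch


def apply_patch_mask(z: torch.Tensor, keep_patch: torch.Tensor) -> torch.Tensor:
    """Zero out dropped patch embeddings while always keeping the CLS token.

    z:          (B, N, D) token embeddings, CLS token at index 0.
    keep_patch: (B, N-1) binary keep decisions M^p for the patch tokens.

    Multiplicative masking keeps the tensor shape fixed so gradients can
    flow during training; at inference the zeroed tokens could instead be
    gathered out so that later blocks really process fewer tokens.
    """
    cls_keep = torch.ones_like(keep_patch[:, :1])    # CLS is never dropped
    mask = torch.cat([cls_keep, keep_patch], dim=1)  # (B, N)
    return z * mask.unsqueeze(-1)                    # broadcast over D


if __name__ == "__main__":
    z = torch.randn(2, 197, 768)
    keep = (torch.rand(2, 196) > 0.5).float()
    print(apply_patch_mask(z, keep).shape)           # torch.Size([2, 197, 768])
```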
3、Head Selection
Different heads in multi-head attention focus on different regions and capture different kinds of information.
To improve inference speed, some heads are adaptively discarded, i.e., deactivated. The authors explore two deactivation strategies (a combined sketch is given after the second strategy):
1、Partial deactivation
The attention of the $i$-th head in the $l$-th block is computed as follows:

2、Full deactivation
In full deactivation, the deactivated heads are removed entirely, and the output embedding size of the MSA is reduced accordingly:
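Since the post omits the paper's exact equations, the sketch below only illustrates the general idea for a single sample: masking the output of deactivated heads (one plausible reading of partial deactivation) versus removing those heads entirely so the MSA output width shrinks (full deactivation). Treat the masking choice as an assumption, not the paper's formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def deactivate_heads(q, k, v, keep_head, out_proj: nn.Linear):
    """Illustrative head deactivation for a single sample.

    q, k, v:   (H, N, d) per-head queries / keys / values.
    keep_head: (H,) binary keep decisions M^h for the heads.

    "Partial" deactivation is shown here as zeroing the output of dropped
    heads while keeping the layout and the output projection unchanged;
    "full" deactivation gathers only the kept heads, so the concatenated
    MSA output (and its projection) would shrink.
    """
    attn = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    per_head = attn @ v                                   # (H, N, d)

    # Partial deactivation: mask each head's output with its keep decision.
    partial = per_head * keep_head.view(-1, 1, 1)
    n_tokens = q.shape[1]
    partial_out = out_proj(partial.transpose(0, 1).reshape(n_tokens, -1))

    # Full deactivation: drop the heads entirely; a narrower projection
    # matching the reduced width would be needed (omitted here).
    full = per_head[keep_head.bool()]                     # (H_kept, N, d)
    return partial_out, full


if __name__ == "__main__":
    H, N, d = 12, 197, 64
    q, k, v = (torch.randn(H, N, d) for _ in range(3))
    keep = (torch.rand(H) > 0.5).float()
    out, kept = deactivate_heads(q, k, v, keep, nn.Linear(H * d, 768))
    print(out.shape, kept.shape)
```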

4、Block Selection
Skipping unnecessary transformer blocks can also save a large amount of computation. To make skipping more flexible, the authors allow the MSA and the FFN inside a transformer block to be skipped independently rather than being tied together, as in the sketch below.
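A minimal sketch of such a block (hypothetical code): multiplying each residual branch by its binary decision is one simple way to realize the independent skipping of MSA and FFN described above.

```python
import torch
import torch.nn as nn


class SkippableBlock(nn.Module):
    """Transformer block whose MSA and FFN can be skipped independently.

    Names and details are illustrative; gating each residual branch with
    its binary decision (M^b for MSA / FFN) is one simple way to realize
    the independent skipping described above.
    """

    def __init__(self, dim: int = 768, num_heads: int = 12, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, z, keep_msa, keep_ffn):
        # keep_msa / keep_ffn: (B,) binary decisions; 0 disables the branch,
        # which is equivalent to skipping that sub-layer for the sample.
        h = self.norm1(z)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        z = z + keep_msa.view(-1, 1, 1) * attn_out
        z = z + keep_ffn.view(-1, 1, 1) * self.ffn(self.norm2(z))
        return z


if __name__ == "__main__":
    block = SkippableBlock()
    z = torch.randn(2, 197, 768)
    keep_msa = torch.tensor([1.0, 0.0])    # skip the MSA for the second sample
    keep_ffn = torch.tensor([1.0, 1.0])
    print(block(z, keep_msa, keep_ffn).shape)   # torch.Size([2, 197, 768])
```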

三、Results