【Transformer】AdaViT: Adaptive Tokens for Efficient Vision Transformer

2022-07-29 06:03:00 【Dull cat】

I. Background

Transformers have achieved strong performance on many tasks. In computer vision, the input image is usually split into patches, and self-attention is computed between the patches to serve downstream tasks.

However, the cost of self-attention grows quadratically with the number of patches (tokens), and the number of patches grows with the size of the input image. This makes Transformers hard to deploy on edge devices.
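
To make the quadratic growth concrete, the sketch below counts patch tokens and pairwise attention scores for two image sizes. The 16x16 patch size and the function name `attention_cost` are assumptions chosen for illustration; the post does not state them.

```python
from typing import Tuple


def attention_cost(image_size: int, patch_size: int = 16) -> Tuple[int, int]:
    """Return (number of patch tokens, number of pairwise attention scores)."""
    tokens_per_side = image_size // patch_size
    num_tokens = tokens_per_side ** 2
    # Self-attention compares every token with every token.
    return num_tokens, num_tokens ** 2


for size in (224, 448):
    n, pairs = attention_cost(size)
    print(f"{size}x{size} image -> {n} tokens, {pairs} attention scores")
# 224x224 image -> 196 tokens, 38416 attention scores
# 448x448 image -> 784 tokens, 614656 attention scores
```

Doubling the image side quadruples the token count and multiplies the number of attention scores by 16, which is why removing tokens early can save a lot of computation.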

The authors argue that the difficulty of prediction varies from one input image to another. A car or a person on a clean background is easy to recognize, whereas several different animals on a cluttered background are much harder.

Based on this, the authors propose a network that dynamically adjusts the number of tokens according to the difficulty of the input, and thereby controls the computational cost of the transformer.

II. Method

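A vision transformer maps an input image $x$ to a prediction $y$ as follows:

$$y = C \circ F^L \circ F^{L-1} \circ \cdots \circ F^1 \circ \epsilon(x)$$

$\epsilon(\cdot)$: the encoding network, which encodes the input image into positioned tokens
$C(\cdot)$: post-processing of the class token
$L$: the number of transformer blocks
$F(\cdot)$: a transformer block, which transforms the tokens via self-attention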
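
The composition of encoder, blocks and class-token head can be sketched with toy NumPy stand-ins. This is only an illustration under stated assumptions: all weights are random, the sizes `IMG`, `P`, `D` and `NUM_BLOCKS` are made up, and the same weight-free attention block is reused at every layer, which a real vision transformer does not do.

```python
import numpy as np

rng = np.random.default_rng(0)
IMG, P, D, NUM_BLOCKS = 16, 4, 8, 3          # toy sizes (assumptions)
NUM_PATCHES = (IMG // P) ** 2

W_embed = rng.normal(size=(P * P, D)) * 0.1
pos_embed = rng.normal(size=(1 + NUM_PATCHES, D)) * 0.1
cls_token = np.zeros((1, D))
W_head = rng.normal(size=(D, 10)) * 0.1      # 10 hypothetical classes


def encode(x):
    """Encoding network: patchify, embed, prepend class token, add positions."""
    patches = x.reshape(IMG // P, P, IMG // P, P).swapaxes(1, 2).reshape(-1, P * P)
    tokens = np.concatenate([cls_token, patches @ W_embed], axis=0)
    return tokens + pos_embed


def block(t):
    """Toy transformer block: single-head self-attention plus a residual."""
    scores = t @ t.T / np.sqrt(t.shape[1])
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    return t + attn @ t


def head(t):
    """Post-processing of the class token (row 0) into class logits."""
    return t[0] @ W_head


x = rng.normal(size=(IMG, IMG))              # toy grayscale "image"
t = encode(x)
for _ in range(NUM_BLOCKS):                  # apply the blocks in sequence
    t = block(t)
y = head(t)
print(t.shape, y.shape)                      # (17, 8) (10,)
```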

To halt tokens dynamically, the authors introduce an input-dependent halting score for each token:

$$h_k^l = H(t_k^l)$$

$H(\cdot)$ is the halting module
$k$ is the token index and $l$ is the layer index, so $t_k^l$ is token $k$ at layer $l$

$$H(t_k^l) = \sigma(\gamma \cdot t_{k,e}^l + \beta)$$

$t_{k,e}^l$ is the $e$-th dimension of $t_k^l$
$\sigma$ is the logistic sigmoid function
$\beta$ and $\gamma$ are the shift and scaling factors applied before the nonlinearity
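
Read literally, these definitions suggest that the halting module is a sigmoid applied to a single embedding dimension after scaling by $\gamma$ and shifting by $\beta$. A minimal sketch under that assumption follows; the dimension index `e=0` and the values of `gamma` and `beta` are illustrative choices, not values given in this post.

```python
import numpy as np


def halting_score(tokens, e=0, gamma=5.0, beta=-10.0):
    """Sigmoid of (gamma * t[k, e] + beta) for every token k.

    tokens: array of shape (K, D), one row per token.
    """
    z = gamma * tokens[:, e] + beta
    return 1.0 / (1.0 + np.exp(-z))


tokens = np.array([[0.0, 1.0],
                   [2.0, -1.0],
                   [4.0, 0.5]])
h = halting_score(tokens)
print(h)  # one score in (0, 1) per token, increasing with tokens[:, 0]
```

Because the score reuses one dimension of the token itself, this form adds almost no parameters or computation on top of the transformer block.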

To track the halting probabilities across layers, an auxiliary variable is also computed for each token.

The halting probability of each token at each layer is then derived from its halting scores.
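
A minimal sketch of how such halting probabilities can be computed, assuming an adaptive-computation-time (ACT) style rule that this post does not spell out: a token halts at the first layer where its cumulative halting score reaches `1 - eps`, the auxiliary variable (called the remainder here) is the probability mass left over before that layer, and the halting probability equals the halting score before the halting layer, the remainder at the halting layer, and zero afterwards. The function name, `eps` and the toy scores are hypothetical.

```python
import numpy as np


def halting_probabilities(h, eps=0.01):
    """h: array of shape (L, K), halting score of token k at layer l.

    Returns (N, r, p): halting layer (1-indexed), remainder, and an
    (L, K) array of halting probabilities.
    """
    num_layers, num_tokens = h.shape
    reached = np.cumsum(h, axis=0) >= 1.0 - eps
    # First layer that crosses the threshold; tokens that never cross
    # are assigned the last layer.
    N = np.where(reached.any(axis=0), reached.argmax(axis=0) + 1, num_layers)
    # Keep only the scores from layers strictly before the halting layer.
    layer_ids = np.arange(1, num_layers + 1)[:, None]
    p = np.where(layer_ids < N[None, :], h, 0.0)
    r = 1.0 - p.sum(axis=0)
    p[N - 1, np.arange(num_tokens)] = r
    return N, r, p


h = np.array([[0.2, 0.9],
              [0.3, 0.5],
              [0.6, 0.1]])
N, r, p = halting_probabilities(h)
print(N)              # [3 2]
print(p.sum(axis=0))  # each token's probabilities sum to 1
```

In this toy input the second token halts one layer earlier than the first, so it would not need to be processed by the last layer.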

Ponder loss: the ponder losses of the individual tokens are averaged.
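
As a hedged illustration of this averaging, the sketch below assumes that each token's ponder cost is its halting layer plus its remainder, as in ACT-style formulations. The post itself only states that the per-token values are averaged, so the per-token definition and the input numbers are assumptions.

```python
import numpy as np


def ponder_loss(halting_layers, remainders):
    """Mean over tokens of (halting layer + remainder)."""
    rho = np.asarray(halting_layers, dtype=float) + np.asarray(remainders, dtype=float)
    return float(rho.mean())


# Two hypothetical tokens: one halts at layer 3, the other at layer 2.
loss = ponder_loss(halting_layers=[3, 2], remainders=[0.5, 0.1])
print(loss)
```

Under this assumption, minimizing the loss pushes tokens to halt at earlier layers, which is what reduces computation.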

A task loss is computed for the classification task.

A halting score distribution is also computed. KL divergence is then used to measure how far the predicted distribution deviates from the reference ("real") distribution.
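
The sketch below shows one way such a distribution term could be computed. It assumes that the halting score distribution is the average halting score per layer, normalized to sum to 1, and that the divergence is taken from the predicted distribution to the reference one; the reference values are hypothetical and none of these choices are confirmed by the post.

```python
import numpy as np


def halting_distribution(h):
    """Average halting score per layer, normalized to sum to 1.

    h: array of shape (L, K), halting score of token k at layer l.
    """
    per_layer = h.mean(axis=1)
    return per_layer / per_layer.sum()


def kl_divergence(p, q, tiny=1e-12):
    """KL(p || q) for two discrete distributions over layers."""
    p = np.clip(p, tiny, None)
    q = np.clip(q, tiny, None)
    return float(np.sum(p * np.log(p / q)))


h = np.array([[0.2, 0.9],
              [0.3, 0.5],
              [0.6, 0.1]])
predicted = halting_distribution(h)
reference = np.array([0.2, 0.5, 0.3])   # hypothetical reference distribution
distr_loss = kl_divergence(predicted, reference)
print(predicted, distr_loss)
```

The divergence is zero only when the two distributions match, so this term steers where in the network tokens tend to halt.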

The total loss combines the terms above.
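
One plausible form of this combination, written as a weighted sum, is given below. The weights $\alpha_d$ and $\alpha_p$ and the subscript names are hypothetical notation introduced here for illustration; the post does not give them.

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \alpha_d \, \mathcal{L}_{\text{distr}} + \alpha_p \, \mathcal{L}_{\text{ponder}}$$

Here $\mathcal{L}_{\text{task}}$ is the classification loss, $\mathcal{L}_{\text{distr}}$ is the KL term on the halting score distribution, and $\mathcal{L}_{\text{ponder}}$ is the averaged ponder loss.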

III. Results

As Figure 3 shows, adaptive token selection responds strongly to regions that are highly salient and vary a lot, which are usually the regions related to the class.

1. Token depth (color) distribution

Plotting the token colors from the visualizations, as in Figure 4, gives what is essentially a 2D Gaussian-like distribution centered on the image. This also indicates that the objects in most ImageNet samples sit in the middle of the image. Most of the computation is spent on the central region, and few edge tokens take part in the computation.

2. Halting score distribution

Figure 5 plots the halting score at every layer for each image, using 5k randomly sampled validation images. Over the first few layers the halting score increases with depth, and in the later layers it slowly decreases.

3. Hard and easy samples

Figure 6 shows hard and easy examples together with the amount of computation each requires.

Easy examples are classified correctly, and AdaViT also processes them faster than hard ones.

4. Category sensitivity

Samples on which the model was initially very confident or very uncertain are barely affected by adaptive inference. Adaptive inference helps most on categories with distinctive shapes, such as standalone furniture or animals.