【Transformer】ATS: Adaptive Token Sampling For Efficient Vision Transformers

2022-07-29 08:19:00 【Dull cat】


1. Background

Although existing Transformer models achieve good results on classification and other tasks, their computational cost is still very high. They require many GFLOPs, which makes them unsuitable for many edge devices. The GFLOPs can be lowered by reducing the number of tokens in the network. DynamicViT, for example, uses a network to predict a score for each token and decide which tokens are redundant. This lowers the network's GFLOPs, but the score-prediction network introduces additional parameters, and a different reduction ratio requires retraining.

2. Motivation

The authors argue that, for classification, not all of the information in an image is needed, because image information is redundant for this task. The paper therefore proposes a method for reducing the number of tokens that can be applied to any Transformer, is not tied to a fixed reduction ratio, and is more efficient.

3. Method

The authors propose a new module called the Adaptive Token Sampler (ATS), which dynamically selects the important tokens from the input tokens. It is also parameter-free. The overall structure is shown in Figure 2.

Convolutional networks usually use pooling to reduce computation: the deeper the stage, the lower the resolution. This cannot be applied directly to a Transformer, because tokens are not tied to spatial position, that is, changing their order does not affect the final result. Plain downsampling also has two drawbacks: details of the target are lost, and a lot of background information that contributes little to classification may be kept. The authors therefore propose selecting the number of tokens at each stage dynamically.

The ATS process:

First, assign a score to each of the N input tokens; the scores determine which tokens are kept.
Then, set K as the maximum number of tokens to keep; K determines the upper bound on GFLOPs.
The number of sampled tokens K' is generally smaller than K and depends on the input image, as shown in Figure 6.


Figure 7 shows that, depending on the instance, the correct classification is obtained using either only a few patches or most of them, and Figure 3 shows the number of tokens used at each stage. The authors also propose a way to choose a suitable number of tokens for each image. As Figure 6 shows, the number of tokens at each stage differs from image to image.


3.1 Token Scoring

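In a standard self-attention layer, the queries Q, keys K, and values V are all computed from the input tokens, which gives the attention matrix A:

$$A = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)$$

where d is the dimension of the query and key vectors.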

Because of the softmax, each row of A sums to 1. The output tokens are obtained by applying the attention matrix to V, so each output token is a weighted sum of the value vectors.
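As a minimal sketch of the computation described above, the following NumPy snippet builds a single-head attention matrix and checks that each row sums to 1. The toy sizes, the random projection matrices `W_q`, `W_k`, `W_v`, and the function name `attention_matrix` are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def attention_matrix(Q, K):
    """Return A = softmax(Q K^T / sqrt(d)), with the softmax taken over each row."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)
    logits = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d = 5, 8                      # toy sizes: 1 cls token + 4 patch tokens
X = rng.normal(size=(n_tokens, d))      # input tokens
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v     # Q, K, V all come from the same input tokens

A = attention_matrix(Q, K)              # shape (n_tokens, n_tokens)
O = A @ V                               # each output token is a weighted sum of rows of V

print(A.sum(axis=1))                    # every row of A sums to 1 (up to rounding)
```

Row 0 of `A` here plays the role of the cls token's row, which is the row used for scoring in the next step.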

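Each row of A contains the attention weights of the input tokens for one output token, that is, how much each input token contributes to that output token. The first row of A corresponds to the cls token, so it indicates how much each input token contributes to the output classification token. The authors therefore use the elements of the first row as the basis for pruning A, as shown in Figure 2. The scores are also normalized. In the ATS paper, each weight is additionally multiplied by the norm of the corresponding value vector, giving the significance score of token j:

$$S_j = \frac{A_{1,j}\,\lVert V_j \rVert}{\sum_{i \geq 2} A_{1,i}\,\lVert V_i \rVert}$$

where i and j index the non-cls tokens (index 1 is the cls token). For multi-head attention, the scores are computed for each head separately and then summed over the heads.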
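The scoring step can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: following the ATS paper, each cls attention weight is multiplied by the norm of its value vector before normalization; the final renormalization after summing over heads is an assumption added here so that the scores can be read as probabilities; the function names and toy shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def significance_scores(A, V):
    """A: (heads, N+1, N+1) attention matrices, V: (heads, N+1, d_head) values.
    Index 0 along the token axis is the cls token.
    Returns one score per non-cls token, shape (N,)."""
    cls_attn = A[:, 0, 1:]                          # cls row, cls column dropped
    v_norm = np.linalg.norm(V[:, 1:, :], axis=-1)   # norm of each value vector
    weighted = cls_attn * v_norm
    per_head = weighted / weighted.sum(axis=-1, keepdims=True)  # normalize per head
    summed = per_head.sum(axis=0)                   # sum the scores over heads
    return summed / summed.sum()                    # assumption: renormalize to sum to 1

rng = np.random.default_rng(0)
heads, n_tokens, d_head = 3, 6, 4                   # toy sizes: 1 cls + 5 patch tokens
A = softmax(rng.normal(size=(heads, n_tokens, n_tokens)))
V = rng.normal(size=(heads, n_tokens, d_head))

scores = significance_scores(A, V)                  # shape (5,)
```

Because no learned parameters appear anywhere in `significance_scores`, the sketch is consistent with the description of ATS as parameter-free.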

3.2 Token Sampling

Once every token has a score, the tokens can be pruned from the attention matrix A based on those scores.

A basic approach is to select the top-K tokens directly, but the experiments show that this performs worse than dynamically selecting K' tokens. The reason is that it discards all low-scoring tokens outright, even though some of them may actually be useful in the shallow layers.
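For comparison, the top-K baseline mentioned above can be sketched in a few lines. The function name `top_k_tokens` and the example scores are made up for illustration.

```python
import numpy as np

def top_k_tokens(scores, k):
    """Return the indices of the k highest-scoring tokens, in ascending index order."""
    order = np.argsort(scores)[::-1]   # token indices from highest to lowest score
    return np.sort(order[:k])

scores = np.array([0.05, 0.40, 0.10, 0.30, 0.15])  # toy normalized scores
kept = top_k_tokens(scores, k=3)
dropped = np.setdiff1d(np.arange(scores.size), kept)
```

With these toy scores, tokens 0 and 2 are always dropped no matter which layer the selection runs in, which is the behavior the paragraph above criticizes, and the number of kept tokens is always exactly K rather than an image-dependent K'.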

In the authors' sampling method, the probability of sampling one of several similar tokens equals the sum of their scores. Figure 3 also shows that this sampling mechanism keeps somewhat more tokens in the shallow layers than in the deep ones.

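Method:

Because the token scores are normalized, they can be treated as probabilities, and the cumulative distribution function (CDF) can be computed:

$$\mathrm{CDF}_i = \sum_{j=2}^{i} S_j$$

where S_j is the normalized score of token j and the sum starts at 2 because index 1 is the cls token.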
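The CDF step amounts to a running sum over the normalized scores. A minimal sketch, with a hypothetical helper name `token_cdf` and made-up scores for the non-cls tokens:

```python
import numpy as np

def token_cdf(scores):
    """Cumulative sum of normalized token scores (cls token already excluded)."""
    scores = np.asarray(scores, dtype=float)
    if not np.isclose(scores.sum(), 1.0):
        raise ValueError("scores must be normalized to sum to 1")
    return np.cumsum(scores)

scores = np.array([0.05, 0.40, 0.10, 0.30, 0.15])  # toy normalized scores
cdf = token_cdf(scores)

# The step between consecutive CDF values is exactly that token's score,
# so a high-scoring token occupies a wider slice of the [0, 1] interval.
steps = np.diff(cdf, prepend=0.0)
```

The CDF is non-decreasing and ends at 1, and a group of adjacent tokens occupies a slice whose width is the sum of their scores.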