AdaViT: a dynamic network with adaptive selection of computing structure

2022-07-06 22:29:00 【Law-Yao】

Paper: https://arxiv.org/abs/2111.15668

GitHub: MengLcool/AdaViT (official implementation of AdaViT)

Methods

Thanks to its architecture, ViT has strong abstract semantic expression and feature representation ability:

Attention computation encodes global correlation information;
Multi-head attention further abstracts and fuses features from different representation subspaces;
Deeply stacked Transformer layers abstract the features further.

However, for samples of different difficulty, the number of patches, the number of attention heads, or the number of network layers that ViT actually needs can differ. This makes it possible to build sample-driven conditional computation.

AdaViT designs a dynamic network structure that adaptively selects the best computing structure according to the difficulty of the input sample. It covers patch selection, attention head selection, and block selection. The method is described below.

Decision network: every Transformer layer has a decision network consisting of three linear layers. Its input is the input features of the current Transformer layer, and its output is a set of structural parameters used for patch selection, attention head selection, and block selection respectively. The structural parameters are then sampled with Gumbel-softmax to generate binary masks.
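The sampling step above can be sketched in plain Python. This is only an illustration, not the official implementation: the function names, the two-class {keep, drop} formulation with the drop logit fixed at 0, and the temperature `tau` are all assumptions made for this example.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample by inverse transform sampling."""
    # Clamp u away from 0 and 1 so both logarithms stay finite.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gumbel_binary_mask(logits, tau=1.0, rng=None):
    """Return (soft, hard): relaxed keep-probabilities and 0/1 keep decisions.

    Hypothetical helper: each logit is treated as the 'keep' score of a
    two-class Gumbel-softmax whose 'drop' score is fixed at 0 (an assumption).
    """
    rng = rng or random.Random()
    soft, hard = [], []
    for logit in logits:
        keep_score = (logit + sample_gumbel(rng)) / tau
        drop_score = (0.0 + sample_gumbel(rng)) / tau
        # A softmax over two scores equals a sigmoid of their difference.
        p_keep = sigmoid(keep_score - drop_score)
        soft.append(p_keep)
        hard.append(1 if p_keep > 0.5 else 0)
    return soft, hard


if __name__ == "__main__":
    soft, hard = gumbel_binary_mask([2.0, -1.5, 0.1], tau=1.0, rng=random.Random(0))
    print(hard)
```

In a differentiable framework, the hard mask would typically be used in the forward pass while gradients flow through the soft values (the straight-through trick). This sketch only shows the sampling itself.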

Patch selection: except for the class token, all remaining tokens are selected adaptively (keep the most informative tokens).
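As a toy illustration of the rule above, the following sketch always keeps the class token and filters the remaining tokens with a binary mask. The function name `select_patches` and the convention that the class token sits at index 0 are assumptions made for this example.

```python
def select_patches(tokens, patch_mask):
    """Keep the class token plus every patch token whose mask entry is 1.

    Assumes tokens[0] is the class token and patch_mask holds one 0/1
    entry for each of the remaining tokens (hypothetical convention).
    """
    if len(patch_mask) != len(tokens) - 1:
        raise ValueError("patch_mask needs one entry per non-class token")
    cls_token, patches = tokens[0], tokens[1:]
    kept = [patch for patch, keep in zip(patches, patch_mask) if keep == 1]
    return [cls_token] + kept


if __name__ == "__main__":
    tokens = ["cls", "p1", "p2", "p3", "p4"]
    print(select_patches(tokens, [1, 0, 0, 1]))
```

Even with an all-zero mask the class token survives, so the classifier always has a token to read from.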

Head selection: complex scenes or noisy backgrounds usually need better subspace feature expression and multi-head information fusion to capture diverse information, whereas simple samples do not need such diversity. Head selection can be implemented in two ways. The first replaces the output of each head whose mask is 0 with an all-ones tensor (partial deactivation). The second removes the corresponding attention head entirely (full deactivation).

The results show that full deactivation saves more computation than partial deactivation, but it has a greater impact on recognition accuracy.
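The difference between the two modes can be seen on toy data. The sketch below follows the description in this post literally (a masked head's output becomes all ones under partial deactivation). What exactly gets substituted should be checked against the paper, and the function name and data layout are assumptions made for this example.

```python
def apply_head_mask(head_outputs, head_mask, mode="partial"):
    """Apply a 0/1 head mask to a list of per-head output vectors.

    mode="partial": masked heads keep their slot, but their output is
                    replaced by an all-ones vector (as described in the post).
    mode="full":    masked heads are removed, so fewer heads remain.
    """
    if len(head_outputs) != len(head_mask):
        raise ValueError("need one mask entry per head")
    if mode == "partial":
        return [
            out if keep == 1 else [1.0] * len(out)
            for out, keep in zip(head_outputs, head_mask)
        ]
    if mode == "full":
        return [out for out, keep in zip(head_outputs, head_mask) if keep == 1]
    raise ValueError("mode must be 'partial' or 'full'")


if __name__ == "__main__":
    heads = [[0.5, 0.2], [0.1, 0.9], [0.3, 0.3]]
    mask = [1, 0, 1]
    print(apply_head_mask(heads, mask, mode="partial"))
    print(apply_head_mask(heads, mask, mode="full"))
```

Partial deactivation keeps the concatenated width unchanged, while full deactivation shrinks it. That is consistent with the observation that full deactivation saves more computation but changes the model more.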

Block selection: this mainly covers the conditional selection of MSA and FFN, which compresses the structure along the depth dimension (simple samples do not need deep, repeated information encoding).
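A minimal sketch of how block selection might gate the two sub-blocks of a Transformer layer. It assumes the usual residual form of a ViT block (x + MSA(x), then x + FFN(x)) and leaves out layer normalization for brevity. `block_forward` and the toy `msa` / `ffn` callables are hypothetical stand-ins, not the real modules.

```python
def block_forward(x, msa, ffn, keep_msa, keep_ffn):
    """Run one Transformer layer in which MSA and FFN can each be skipped.

    x is a flat list of floats; msa and ffn map such a list to a list of
    the same length. A skipped sub-block reduces to the identity because
    only the residual path remains.
    """
    if keep_msa:
        x = [xi + yi for xi, yi in zip(x, msa(x))]
    if keep_ffn:
        x = [xi + yi for xi, yi in zip(x, ffn(x))]
    return x


if __name__ == "__main__":
    toy_msa = lambda v: [2.0 * a for a in v]   # stand-in for self-attention
    toy_ffn = lambda v: [1.0 for _ in v]       # stand-in for the feed-forward net
    x = [1.0, 2.0]
    print(block_forward(x, toy_msa, toy_ffn, keep_msa=True, keep_ffn=True))
    print(block_forward(x, toy_msa, toy_ffn, keep_msa=False, keep_ffn=True))
```

When both flags are 0 the layer does nothing, which is how easy samples can effectively run through a shallower network.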

Objective function: the first term is the task loss, for example cross-entropy (CE) loss for classification. The second is a term that constrains the structural parameters, where the parameter gamma is the target computation budget used to constrain the computational cost.

The overall optimization objective then combines the task loss with the term that constrains the structural parameters.
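One plausible shape for such an objective is sketched below, since the exact formula is not reproduced in this post. The sketch assumes the constraint term is the squared gap between the fraction of kept units and a budget gamma, computed separately for patches, heads, and blocks, and that the two terms are simply added. The function names, the per-type budgets, and the unweighted sum are all assumptions.

```python
def usage_loss(masks, budgets):
    """Sum of squared gaps between actual usage and target budget.

    masks:   dict mapping a selection type to a flat list of 0/1 decisions
             gathered over all layers (hypothetical layout).
    budgets: dict mapping the same types to a target keep ratio gamma in [0, 1].
    """
    total = 0.0
    for name, mask in masks.items():
        usage = sum(mask) / len(mask)          # fraction of units kept
        total += (usage - budgets[name]) ** 2  # penalize missing the budget
    return total


def total_loss(task_loss, masks, budgets):
    """Task loss plus the usage constraint (unweighted sum, an assumption)."""
    return task_loss + usage_loss(masks, budgets)


if __name__ == "__main__":
    masks = {
        "patch": [1, 0, 1, 0],
        "head": [1, 1, 0, 0],
        "block": [1, 0],
    }
    budgets = {"patch": 0.25, "head": 0.5, "block": 0.5}
    print(total_loss(1.0, masks, budgets))
```

With this form the penalty is zero only when the kept fraction matches the budget exactly, so it pushes usage toward gamma rather than merely below it.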

Experimental results

The experiments compare AdaViT with other network structures in terms of computational efficiency and recognition accuracy, among other metrics. See the experimental section of the paper for details.