当前位置：网站首页>Accuracy improvement method: efficient visual transformer framework of adaptive tokens (open source)

Accuracy improvement method: efficient visual transformer framework of adaptive tokens (open source)

2022-07-27 13:53:00 【Computer Vision Research Institute】

Pay attention to the parallel stars

Never get lost

Institute of computer vision

official account ID｜ComputerVisionGzq

Study Group ｜ Scan the code to get the join mode on the homepage

Address of thesis ：https://openaccess.thecvf.com/content/CVPR2022/papers/Yin_A-ViT_Adaptive_Tokens_for_Efficient_Vision_Transformer_CVPR_2022_paper.pdf

Code address ：https://github.com/NVlabs/A-ViT

Computer Vision Institute column

author ：Edison_G

YOLOv7 Under the same volume YOLOv5 Higher accuracy , Fast 120%(FPS), Than YOLOX fast 180%(FPS), Than Dual-Swin-T fast 1200%(FPS), Than ConvNext fast 550%(FPS), Than SWIN-L fast 500%(FPS).

summary

Introduced today , It is the researchers who put forward A-ViT, An image adaptive adjustment for different complexity vision transformers (ViT) The method of reasoning cost .A-ViT By automatically reducing the number of... In the visual converter processed in the network when reasoning is in progress tokens Quantity to achieve this .

The researchers redefined the adaptive computing time for this task （ACT[Adaptive computation time for recurrent neural networks]）, Extended stop to discard redundant space tags .vision transformers Attractive architectural features make our The adaptive tokens The reduction mechanism can speed up reasoning without modifying the network architecture or reasoning hardware .

A-ViT No additional parameters or subnets are required to stop , Because the adaptive stop learning is based on the original network parameters . And previous ACT Methods compared , Distributed a priori regularization is further introduced , Can train stably . In the image classification task (ImageNet1K) in , Shows the proposed A-ViT Efficiency in filtering spatial features of information and reducing overall computing . The proposed method will DeiT-Tiny The throughput of 62%, take DeiT-Small The throughput of 38%, The accuracy rate has only decreased 0.3%, It is much better than the existing technology .

background

Transformers It has become a popular neural network architecture , It uses a highly expressed attention mechanism to calculate network output . They originated from naturallanguageprocessing (NLP) Community , It has been proved that it can effectively solve NLP A wide range of issues in , For example, machine translation 、 It means learning and question and answer .

lately ,vision transformers More and more popular in the visual community , They have been successfully applied to a wide range of visual applications , For example, image classification 、 object detection 、 Image generation and semantic segmentation . The most popular paradigm is still vision transformers It is formed by splitting the image into a series of orderly patches tokens And in tokens In between inter-/intra-calculations To solve basic tasks . Use vision transformers Processing images is still computationally expensive , This is mainly due to tokens The square of the number of interactions between . therefore , In the case of a large number of computing and memory resources , Deploy on a data processing cluster or edge device vision transformers Challenging .

New framework analysis

First look at the figure below ：

The above figure is a kind of vision transformers Enable adaptation tokens The method of calculation . Use the adaptive stop module to increase vision transformers block , This module calculates each tokens Stopping probability of . This module reuses the parameters of the existing block , And a single neuron is borrowed from the last dense layer of each block to calculate the stopping probability , No additional parameters or calculations are imposed . Once the stop condition is reached ,tokens Will be discarded . Stop by adaptation tokens, We only work on activities that are considered useful for the task tokens Perform Intensive Computing . result ,vision transformers Successive blocks in gradually receive less tokens, This leads to faster reasoning . Learned tokens Stop varies by image , But it is very consistent with image semantics （ See the example above ）. This will immediately accelerate reasoning out of the box on an off the shelf computing platform .

A-ViT An example of ： In Visualization , For the sake of simplicity , omitted (i) Other patch marks ,(ii) Attention between classes and patch tags as well (iii) Residual connection .

The first element of each tag is reserved for stopping the score calculation , Do not increase computing overhead . We use subscripts c Represents a class tag , Because it has special treatment . from k Each of the indexes token There is a single Nk accumulator , And stop at different depths . With the standard ACT Different , The mean field formula is only applicable to classification marks , Other markers contribute to category markers through attention . This allows images to be aggregated without / Patch token In the case of adaptive tokens Calculation .

Experimental analysis and visualization

Original image (left) and the dynamic token depth (right) of A-ViT-T on the ImageNet-1K validation set. Distribution of token computation highly aligns with visual features. Tokens associated with informative regions are adaptively processed deeper, robust to repeating objects with complex backgrounds. Best viewed in color.

(a) ImageNet-1K On validation set A-ViT-T The average location of each image patch tokens depth .(b) Stop fraction distribution through transformer block . Each point is associated with a randomly sampled image , Represents the average of this layer tokens fraction .

By average tokens The depth is certain ImageNet-1K Visual comparison of difficult samples in the validation set . Please note that , All the images above are correctly classified —— The only difference is that difficult samples need more depth to process their semantic information . Compared with the image on the right , The mark in the left image exits about 5 layer .

THE END

Please contact the official account for authorization.

The learning group of computer vision research institute is waiting for you to join ！

ABOUT

Institute of computer vision

The Institute of computer vision is mainly involved in the field of deep learning , Mainly devoted to face detection 、 Face recognition , Multi target detection 、 Target tracking 、 Image segmentation and other research directions . The Research Institute will continue to share the latest paper algorithm new framework , The difference of our reform this time is , We need to focus on ” Research “. After that, we will share the practice process for the corresponding fields , Let us really experience the real scene of getting rid of the theory , Develop the habit of hands-on programming and brain thinking ！

VX：2311123606

Previous recommendation

原网站

版权声明
本文为[Computer Vision Research Institute]所创，转载请带上原文链接，感谢
https://yzsam.com/2022/199/202207160904029308.html

当前位置：网站首页>Accuracy improvement method: efficient visual transformer framework of adaptive tokens (open source)

Accuracy improvement method: efficient visual transformer framework of adaptive tokens (open source)

边栏推荐

猜你喜欢

随机推荐