Accuracy improvement method: an efficient vision transformer framework with adaptive tokens (open source)
2022-07-27 13:53:00 【Computer Vision Research Institute】
Paper: https://openaccess.thecvf.com/content/CVPR2022/papers/Yin_A-ViT_Adaptive_Tokens_for_Efficient_Vision_Transformer_CVPR_2022_paper.pdf
Code: https://github.com/NVlabs/A-ViT
Computer Vision Institute column
Author: Edison_G
At the same model size, YOLOv7 is more accurate than YOLOv5 and 120% faster (FPS). It is also 180% faster than YOLOX, 1200% faster than Dual-Swin-T, 550% faster than ConvNext, and 500% faster than SWIN-L (all in FPS).
01
Summary
Today we introduce A-ViT, a method proposed by researchers to adaptively adjust the inference cost of vision transformers (ViT) for images of different complexity. A-ViT achieves this by automatically reducing the number of tokens processed in the network as inference proceeds.

The researchers reformulate Adaptive Computation Time (ACT, from "Adaptive computation time for recurrent neural networks") for this task, extending halting so that redundant spatial tokens are discarded. The appealing architectural properties of vision transformers allow the adaptive token reduction mechanism to speed up inference without modifying the network architecture or the inference hardware.
A-ViT requires no extra parameters or sub-networks for halting, because adaptive halting is learned on top of the original network parameters. Compared with previous ACT methods, it further introduces a distributional prior regularization that stabilizes training. On the image classification task (ImageNet1K), the proposed A-ViT proves highly effective at filtering informative spatial features and cutting overall compute. The method improves the throughput of DeiT-Tiny by 62% and of DeiT-Small by 38% with an accuracy drop of only 0.3%, outperforming existing techniques by a large margin.
02
Background
Transformers have become a popular class of neural network architecture that computes network outputs with highly expressive attention mechanisms. They originated in the natural language processing (NLP) community, where they have been shown to solve a wide range of problems effectively, such as machine translation, representation learning, and question answering.
More recently, vision transformers have grown increasingly popular in the vision community and have been applied successfully to a broad range of visual tasks, such as image classification, object detection, image generation, and semantic segmentation. The most popular paradigm is still for a vision transformer to split the image into an ordered sequence of patches, form tokens from them, and solve the underlying task through inter- and intra-token computation. Processing images with vision transformers remains computationally expensive, mainly because the number of interactions between tokens grows quadratically with the number of tokens. As a result, deploying vision transformers on data processing clusters or edge devices is challenging, given the large amount of compute and memory they require.
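To make the quadratic cost concrete, here is a small sketch that counts tokens and pairwise attention scores for a single attention head. The 224x224 input and 16x16 patch size are assumed, commonly used ViT settings rather than values stated in this article, and `vit_token_stats` is a hypothetical helper name.

```python
from typing import Tuple


def vit_token_stats(image_size: int = 224, patch_size: int = 16) -> Tuple[int, int]:
    """Return (number of tokens, pairwise attention scores per head)."""
    patches_per_side = image_size // patch_size
    # Patch tokens plus one class token.
    num_tokens = patches_per_side ** 2 + 1
    # Self-attention scores every token against every token.
    return num_tokens, num_tokens ** 2


full_tokens, full_pairs = vit_token_stats()
print(full_tokens, full_pairs)  # 197 38809

# Keeping only half of the patch tokens (plus the class token)
# cuts the attention scores to roughly a quarter.
kept_tokens = (full_tokens - 1) // 2 + 1
print(kept_tokens, kept_tokens ** 2)  # 99 9801
```

This is why discarding tokens early pays off: the attention cost falls faster than the token count does.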
03
New framework analysis
First look at the figure below :

The figure shows how A-ViT enables adaptive token computation for vision transformers. Each vision transformer block is augmented with an adaptive halting module that computes a halting probability for every token. The module reuses the parameters of the existing block and borrows a single neuron from the last dense layer of each block to compute the halting probability, so it adds no extra parameters or computation. A token is discarded once it reaches the halting condition. By halting tokens adaptively, dense computation is performed only on the active tokens that are considered informative for the task. As a result, successive blocks in the vision transformer receive fewer and fewer tokens, which leads to faster inference. The learned token halting varies from image to image, yet it aligns very well with image semantics (see the examples in the figure). This gives an immediate, out-of-the-box inference speedup on off-the-shelf computing platforms.
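The halting rule described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code from the A-ViT repository: the sigmoid form of the halting score, the constants `gamma`, `beta`, and `eps`, and both function names are assumptions made for this example. For simplicity the reserved first element of each token is given for every block up front, whereas a real model would stop computing a token once it halts.

```python
import numpy as np


def halting_depths(first_elems, gamma=5.0, beta=-10.0, eps=0.01):
    """Return the 1-indexed block at which each token halts.

    first_elems: array of shape (num_blocks, num_tokens) holding the
    reserved first embedding element of each token at each block
    (hypothetical input for illustration).
    """
    first_elems = np.asarray(first_elems, dtype=float)
    # Halting score per token and block: a sigmoid of the reserved element.
    scores = 1.0 / (1.0 + np.exp(-(gamma * first_elems + beta)))
    # Force every token to halt by the last block at the latest.
    scores[-1, :] = 1.0
    cumulative = np.cumsum(scores, axis=0)
    # A token halts at the first block where its accumulated score
    # reaches 1 - eps; argmax returns the first True along the axis.
    return (cumulative >= 1.0 - eps).argmax(axis=0) + 1


def active_tokens_per_block(depths, num_blocks):
    """Count how many tokens each block still has to process."""
    return [int((depths >= block).sum()) for block in range(1, num_blocks + 1)]


# Three toy tokens over 12 blocks: one halts at once, one after two
# blocks, and one never accumulates enough score and runs to the end.
first_elems = np.tile(np.array([3.0, 2.0, 0.0]), (12, 1))
depths = halting_depths(first_elems)
print(depths.tolist())                      # [1, 2, 12]
print(active_tokens_per_block(depths, 12))  # [3, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```

The shrinking counts in the second line of output correspond to later blocks receiving fewer tokens.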

An example of A-ViT: for simplicity, the visualization omits (i) the other patch tokens, (ii) the attention between the class token and the patch tokens, and (iii) the residual connections.
The first element of every token is reserved for computing the halting score, which adds no computational overhead. The class token is denoted with a subscript c because it receives special treatment. Each token indexed by k has its own accumulator N_k and halts at a different depth. Unlike standard ACT, the mean-field formulation is applied only to the classification token, while the other tokens contribute to the class token through attention. This allows adaptive token computation without aggregating image/patch tokens.
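For readers unfamiliar with ACT, the mean-field output for the class token can be sketched as a halting-score-weighted sum of its states across blocks. The sketch below follows the generic ACT recipe (scores as weights, with the remainder assigned to the halting block); exactly how A-ViT applies it is not spelled out in this article, so treat the function name `class_token_output`, the `eps` value, and the ponder cost definition as assumptions.

```python
import numpy as np


def class_token_output(class_tokens, halting_scores, eps=0.01):
    """ACT-style weighted sum of the class token over blocks.

    class_tokens: (num_blocks, dim) class token state after each block.
    halting_scores: (num_blocks,) halting score of the class token per block.
    Returns (output vector, halting block N_c, ponder cost N_c + remainder).
    """
    class_tokens = np.asarray(class_tokens, dtype=float)
    h = np.asarray(halting_scores, dtype=float)
    reached = np.cumsum(h) >= 1.0 - eps
    # Halt at the first block crossing the threshold, else at the last block.
    n_halt = int(reached.argmax()) + 1 if reached.any() else len(h)
    weights = np.zeros_like(h)
    # Blocks before the halting block contribute their halting score ...
    weights[: n_halt - 1] = h[: n_halt - 1]
    # ... and the halting block receives the remainder, so weights sum to 1.
    remainder = 1.0 - h[: n_halt - 1].sum()
    weights[n_halt - 1] = remainder
    return weights @ class_tokens, n_halt, n_halt + remainder


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, n_halt, ponder = class_token_output(tokens, [0.3, 0.5, 0.6])
print(n_halt)  # 3
print(np.round(output, 3), round(ponder, 3))
```

In this toy case the weights are 0.3, 0.5, and 0.2, so the output is a blend of all three states and the ponder cost is the halting depth plus the remainder.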

04
Experimental analysis and visualization

Original image (left) and the dynamic token depth (right) of A-ViT-T on the ImageNet-1K validation set. Distribution of token computation highly aligns with visual features. Tokens associated with informative regions are adaptively processed deeper, robust to repeating objects with complex backgrounds. Best viewed in color.

(a) Average token depth at each image patch position for A-ViT-T on the ImageNet-1K validation set. (b) Distribution of halting scores across the transformer blocks. Each point corresponds to one randomly sampled image and represents the average token score at that layer.

Visual comparison of samples from the ImageNet-1K validation set, with difficulty determined by average token depth. Note that all of the images above are classified correctly; the only difference is that difficult samples need more depth for their tokens to process the semantic information. Compared with the images on the right, tokens in the images on the left exit about 5 layers earlier.


