当前位置:网站首页>Accuracy improvement method: efficient visual transformer framework of adaptive tokens (open source)
Accuracy improvement method: efficient visual transformer framework of adaptive tokens (open source)
2022-07-27 13:53:00 【Computer Vision Research Institute】
Pay attention to the parallel stars
Never get lost
Institute of computer vision



official account ID|ComputerVisionGzq
Study Group | Scan the code to get the join mode on the homepage

Address of thesis :https://openaccess.thecvf.com/content/CVPR2022/papers/Yin_A-ViT_Adaptive_Tokens_for_Efficient_Vision_Transformer_CVPR_2022_paper.pdf
Code address :https://github.com/NVlabs/A-ViT
Computer Vision Institute column
author :Edison_G
YOLOv7 Under the same volume YOLOv5 Higher accuracy , Fast 120%(FPS), Than YOLOX fast 180%(FPS), Than Dual-Swin-T fast 1200%(FPS), Than ConvNext fast 550%(FPS), Than SWIN-L fast 500%(FPS).
01
summary
Introduced today , It is the researchers who put forward A-ViT, An image adaptive adjustment for different complexity vision transformers (ViT) The method of reasoning cost .A-ViT By automatically reducing the number of... In the visual converter processed in the network when reasoning is in progress tokens Quantity to achieve this .

The researchers redefined the adaptive computing time for this task (ACT[Adaptive computation time for recurrent neural networks]), Extended stop to discard redundant space tags .vision transformers Attractive architectural features make our The adaptive tokens The reduction mechanism can speed up reasoning without modifying the network architecture or reasoning hardware .
A-ViT No additional parameters or subnets are required to stop , Because the adaptive stop learning is based on the original network parameters . And previous ACT Methods compared , Distributed a priori regularization is further introduced , Can train stably . In the image classification task (ImageNet1K) in , Shows the proposed A-ViT Efficiency in filtering spatial features of information and reducing overall computing . The proposed method will DeiT-Tiny The throughput of 62%, take DeiT-Small The throughput of 38%, The accuracy rate has only decreased 0.3%, It is much better than the existing technology .
02
background
Transformers It has become a popular neural network architecture , It uses a highly expressed attention mechanism to calculate network output . They originated from naturallanguageprocessing (NLP) Community , It has been proved that it can effectively solve NLP A wide range of issues in , For example, machine translation 、 It means learning and question and answer .
lately ,vision transformers More and more popular in the visual community , They have been successfully applied to a wide range of visual applications , For example, image classification 、 object detection 、 Image generation and semantic segmentation . The most popular paradigm is still vision transformers It is formed by splitting the image into a series of orderly patches tokens And in tokens In between inter-/intra-calculations To solve basic tasks . Use vision transformers Processing images is still computationally expensive , This is mainly due to tokens The square of the number of interactions between . therefore , In the case of a large number of computing and memory resources , Deploy on a data processing cluster or edge device vision transformers Challenging .
03
New framework analysis
First look at the figure below :

The above figure is a kind of vision transformers Enable adaptation tokens The method of calculation . Use the adaptive stop module to increase vision transformers block , This module calculates each tokens Stopping probability of . This module reuses the parameters of the existing block , And a single neuron is borrowed from the last dense layer of each block to calculate the stopping probability , No additional parameters or calculations are imposed . Once the stop condition is reached ,tokens Will be discarded . Stop by adaptation tokens, We only work on activities that are considered useful for the task tokens Perform Intensive Computing . result ,vision transformers Successive blocks in gradually receive less tokens, This leads to faster reasoning . Learned tokens Stop varies by image , But it is very consistent with image semantics ( See the example above ). This will immediately accelerate reasoning out of the box on an off the shelf computing platform .

A-ViT An example of : In Visualization , For the sake of simplicity , omitted (i) Other patch marks ,(ii) Attention between classes and patch tags as well (iii) Residual connection .
The first element of each tag is reserved for stopping the score calculation , Do not increase computing overhead . We use subscripts c Represents a class tag , Because it has special treatment . from k Each of the indexes token There is a single Nk accumulator , And stop at different depths . With the standard ACT Different , The mean field formula is only applicable to classification marks , Other markers contribute to category markers through attention . This allows images to be aggregated without / Patch token In the case of adaptive tokens Calculation .

04
Experimental analysis and visualization

Original image (left) and the dynamic token depth (right) of A-ViT-T on the ImageNet-1K validation set. Distribution of token computation highly aligns with visual features. Tokens associated with informative regions are adaptively processed deeper, robust to repeating objects with complex backgrounds. Best viewed in color.

(a) ImageNet-1K On validation set A-ViT-T The average location of each image patch tokens depth .(b) Stop fraction distribution through transformer block . Each point is associated with a randomly sampled image , Represents the average of this layer tokens fraction .

By average tokens The depth is certain ImageNet-1K Visual comparison of difficult samples in the validation set . Please note that , All the images above are correctly classified —— The only difference is that difficult samples need more depth to process their semantic information . Compared with the image on the right , The mark in the left image exits about 5 layer .



THE END
Please contact the official account for authorization.

The learning group of computer vision research institute is waiting for you to join !
ABOUT
Institute of computer vision
The Institute of computer vision is mainly involved in the field of deep learning , Mainly devoted to face detection 、 Face recognition , Multi target detection 、 Target tracking 、 Image segmentation and other research directions . The Research Institute will continue to share the latest paper algorithm new framework , The difference of our reform this time is , We need to focus on ” Research “. After that, we will share the practice process for the corresponding fields , Let us really experience the real scene of getting rid of the theory , Develop the habit of hands-on programming and brain thinking !
VX:2311123606

Previous recommendation
AI Help social security , The latest video abnormal behavior detection method framework
ONNX elementary analysis : How to accelerate the engineering of deep learning algorithm ?
Improved shadow suppression for illumination robust face recognition
Text driven for creating and editing images ( With source code )
Based on hierarchical self - supervised learning, vision Transformer Scale to gigapixel images
边栏推荐
- Using ebpf to detect rootkit vulnerabilities
- English grammar_ Definite article the_ Small details
- opencv图像的缩放平移及旋转
- idea Gradle7.0+ :Could not find method compile()
- Huiliang technology app is a good place to go to sea: after more than ten years of popularity, why should the United States still choose to go to sea for gold
- 在“元宇宙空间”UTONMOS将打开虚实结合的数字世界
- 【每日一题】1206. 设计跳表
- 16-VMware Horizon 2203 虚拟桌面-Win10 自动桌面池完整克隆专用(十六)
- 特征工程中的缩放和编码的方法总结
- JS module, closure application
猜你喜欢

Is it still time to take the PMP Exam in September?

Egg swagger doc graphic verification code solution

Data enhancement in image processing

Figure 8 shows you how to configure SNMP

在“元宇宙空间”UTONMOS将打开虚实结合的数字世界

Interviewers often ask: how to set up a "message queue" and "delayed message queue"?

Fiddler抓包工具+夜神模拟器

Wechat campus laundry applet graduation design finished product (4) opening report

Evconnlistener of libevent_ new_ bind

腾讯云联合中国工联院发布工业AI质检标准化研究成果加速制造业智能化转型
随机推荐
Fiddler bag capturing Tool + night God simulator
宇宙没有尽头,UTONMOS能否将元宇宙照进现实?
JS divides the array into two-dimensional arrays according to the specified attribute values
JS module, closure application
Construction and application of industrial knowledge atlas (2): representation and modeling of commodity knowledge
Tencent cloud and the China Federation of industry released the research results of industrial AI quality inspection standardization to accelerate the intelligent transformation of manufacturing indus
Data enhancement in image processing
Wechat campus laundry applet graduation design finished product of applet completion work (3) background function
Unapp prevents continuous click errors
Sword finger offer II 041. Average value of sliding window
网络异常流量分析系统设计
Brief tutorial for soft exam system architecture designer | system design
We should learn to check the documented instructions of technical details
Huiliang technology app is a good place to go to sea: after more than ten years of popularity, why should the United States still choose to go to sea for gold
Small program completion work wechat campus laundry small program graduation design finished product (2) small program function
Dat.gui control custom placement and dragging
redis集群搭建-使用docker快速搭建一个测试redis集群
Training in the second week of summer vacation on July 24, 2022
《C语言》函数栈帧的创建与销毁--(内功)
Double material first!