【Transformer】AdaViT: Adaptive Tokens for Efficient Vision Transformer
2022-07-29 06:03:00 【Dull cat】

I. Background
Transformers have achieved strong performance on many tasks. In computer vision, the input image is generally split into patches, and self-attention is computed between the patches to serve downstream tasks.
However, the cost of self-attention grows quadratically with the number of patches (tokens), which in turn grows with the input image size, so running Transformers on edge devices is a problem.
The authors argue that input images differ in how hard they are for the network to predict. A car or a person against a clean background is easy to recognize; many different animals against a cluttered background are harder to recognize.
Based on this, the authors design a network that dynamically adjusts the number of tokens according to the difficulty of the input, and thereby controls the computational cost of the transformer.
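To make the quadratic cost concrete, here is a minimal Python sketch that counts tokens and token pairs for a ViT-style model. The patch size of 16 and the single extra class token are assumptions chosen for illustration, and `attention_pairs` is a hypothetical helper, not part of any library.

```python
def attention_pairs(image_size, patch_size=16, extra_tokens=1):
    """Return (number of tokens, number of token pairs scored by one
    self-attention map) for a square image split into square patches.

    patch_size=16 and extra_tokens=1 (a class token) are illustrative
    assumptions, not values taken from the text.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    num_tokens = patches_per_side ** 2 + extra_tokens
    # Every token attends to every token, so the map has num_tokens^2 entries.
    return num_tokens, num_tokens ** 2


if __name__ == "__main__":
    for size in (224, 384):
        tokens, pairs = attention_pairs(size)
        print(f"{size}x{size}: {tokens} tokens, {pairs} attention pairs")
```

Because the pair count is the square of the token count, halting half of the tokens cuts the attention map to roughly a quarter of its size, which is the saving adaptive token counts aim for.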

II. Method

The vision transformer pipeline is as follows:
- $\epsilon(\cdot)$: the encoding network, which encodes the input image into tokens with positional information
- $C(\cdot)$: post-processing of the class token
- $L$: the number of transformer blocks
- $F(\cdot)$: a self-attention block
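One way to write the pipeline these symbols describe is as a composition of the encoder, the $L$ blocks, and the class-token head. This is a sketch built from the definitions above; $x$ (the input image) and $y$ (the prediction) are assumed names.

$$
y = C \circ F^{L} \circ F^{L-1} \circ \cdots \circ F^{1} \circ \epsilon(x)
$$

Here $F^{l}$ denotes the $l$-th block; the superscript is a layer index, not a power.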
To halt tokens dynamically, the authors introduce an input-dependent halting score for each token:
- $H(\cdot)$ is the halting module
- $k$ is the token index and $l$ is the layer index

- $t_{k,e}^l$ is the $e$-th dimension of $t_k^l$
- $\sigma$ is the logistic sigmoid function
- $\beta$ and $\gamma$ are the shift and scale factors applied before the nonlinearity
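Read together, these definitions suggest a halting module of the following form. This is a sketch under assumptions: $h_k^l$ is an assumed name for the halting score of token $k$ at layer $l$, and the placement of $\gamma$ and $\beta$ is inferred from their description as a scale and a shift applied before the sigmoid.

$$
h_k^l = H(t_k^l) = \sigma\left(\gamma \cdot t_{k,e}^l + \beta\right)
$$

Because of the sigmoid, each score lies strictly between 0 and 1, so the scores of one token can be accumulated layer by layer and compared against a threshold close to 1.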
To track halting probabilities across layers, an auxiliary variable (a remainder) is computed for each token:

The halting probabilities are then:
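In adaptive-computation-time style halting, which this description appears to follow, the remainder and the halting probabilities are usually defined as below. All symbols here are assumed names: $h_k^l$ is the halting score of token $k$ at layer $l$, $N_k$ is the layer at which token $k$ halts, $r_k$ is its remainder, $p_k^l$ is its halting probability, and $\delta$ is a small tolerance constant.

$$
N_k = \min\Big\{ n \le L \;:\; \sum_{l=1}^{n} h_k^l \ge 1 - \delta \Big\}, \qquad
r_k = 1 - \sum_{l=1}^{N_k - 1} h_k^l
$$

$$
p_k^l =
\begin{cases}
h_k^l, & l < N_k \\
r_k, & l = N_k \\
0, & l > N_k
\end{cases}
$$

With these definitions $\sum_{l} p_k^l = 1$ for every token, so $p_k^l$ behaves like a probability distribution over the layers at which the token may stop.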
Ponder loss: the ponder losses of the individual tokens are averaged.
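The halting bookkeeping and the averaged ponder loss can be sketched in plain Python. This is a minimal illustration under assumptions, not the authors' implementation: the function names, the tolerance `delta`, the forced halt at the last layer, and the per-token cost "halting layer plus remainder" follow the usual adaptive-computation-time recipe rather than anything stated in the text above.

```python
from typing import List, Sequence, Tuple


def halt_token(scores: Sequence[float],
               delta: float = 0.01) -> Tuple[int, float, List[float]]:
    """For one token, given its per-layer halting scores in [0, 1], return
    (halting layer N, remainder r, per-layer halting probabilities).

    Assumption: the token halts at the first layer where the cumulative
    score reaches 1 - delta, and is forced to halt at the last layer.
    """
    if not scores:
        raise ValueError("need at least one layer")
    num_layers = len(scores)
    cumulative = 0.0
    for layer, h in enumerate(scores, start=1):
        if cumulative + h >= 1.0 - delta or layer == num_layers:
            remainder = 1.0 - cumulative  # whatever probability mass is left
            probs = (list(scores[: layer - 1]) + [remainder]
                     + [0.0] * (num_layers - layer))
            return layer, remainder, probs
        cumulative += h
    raise AssertionError("unreachable: the last layer always halts")


def ponder_loss(all_scores: Sequence[Sequence[float]],
                delta: float = 0.01) -> float:
    """Average over tokens of (halting layer + remainder)."""
    costs = []
    for scores in all_scores:
        layer, remainder, _ = halt_token(scores, delta)
        costs.append(layer + remainder)
    return sum(costs) / len(costs)


if __name__ == "__main__":
    easy = [0.6, 0.5, 0.1, 0.1]   # accumulates quickly, halts early
    hard = [0.1, 0.1, 0.1, 0.1]   # never reaches the threshold, runs all layers
    print(halt_token(easy))
    print(halt_token(hard))
    print(ponder_loss([easy, hard]))
```

Minimizing this quantity pushes tokens to halt earlier, which is why it acts as a compute penalty alongside the task loss.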

The loss for the classification task is:
The halting score distribution is:
KL divergence is then used to measure how far the predicted distribution deviates from the expected (target) one:
The total loss is then:
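Put together, the losses described above might be written as follows. This is a hedged sketch: every symbol is an assumed name ($K$ tokens, $N_k$ the layer at which token $k$ halts, $r_k$ its remainder, $\mathcal{H}^{\mathrm{pred}}$ and $\mathcal{H}^{\mathrm{target}}$ the predicted and target halting score distributions, $\alpha_d$ and $\alpha_p$ weighting coefficients), and the direction of the KL term is an assumption.

$$
\mathcal{L}_{\mathrm{ponder}} = \frac{1}{K} \sum_{k=1}^{K} \left( N_k + r_k \right)
$$

$$
\mathcal{L}_{\mathrm{distr}} = \mathrm{KL}\left( \mathcal{H}^{\mathrm{pred}} \,\|\, \mathcal{H}^{\mathrm{target}} \right)
$$

$$
\mathcal{L}_{\mathrm{overall}} = \mathcal{L}_{\mathrm{task}} + \alpha_d \, \mathcal{L}_{\mathrm{distr}} + \alpha_p \, \mathcal{L}_{\mathrm{ponder}}
$$

Here $\mathcal{L}_{\mathrm{task}}$ stands for the classification loss, so the two extra terms act as regularizers that trade accuracy against computation.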

III. Results

As Figure 3 shows, adaptive token selection responds strongly to highly salient regions with large variation, which are usually the regions related to the class.
1. Token depth distribution:
Plotting the depth (color) of each token, as in Figure 4, gives a roughly 2D Gaussian-like distribution centered on the image. This also suggests that the objects in most ImageNet samples sit near the center. Most of the computation is spent on the central region, and few edge tokens take part.
2. Halting score distribution:
Figure 5 plots the halting score at each layer for each image.
On 5k randomly sampled validation images, the halting score increases with depth over the first few layers and then slowly decreases in the later layers.

3. Hard and easy samples
Figure 6 shows hard and easy examples and the amount of computation each requires.
Easy examples are classified correctly, and AdaViT also processes them faster than hard ones.

4. Class sensitivity
Samples the model was initially very sure or very unsure about are barely affected by adaptive inference. Adaptive inference mainly helps classes with distinctive shapes, such as standalone furniture or animals.
