Replacing Self-Attention with an MLP
2022-07-02 07:51:00 【MezereonXP】
Using an MLP Instead of Self-Attention
This post looks at a work from Tsinghua University, “Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks”.
It replaces the self-attention mechanism with two linear layers and, in the end, improves speed while maintaining accuracy.
What is surprising about this work is that an MLP can substitute for the attention mechanism, which forces us to reconsider where attention's performance gains actually come from.
Self-Attention in the Transformer
First, consider the standard self-attention module.
We give its formalization:
$$A = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right), \qquad F_{out} = AV$$
where $Q, K \in \mathbb{R}^{N\times d'}$ and $V \in \mathbb{R}^{N\times d}$.
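As a quick reference, here is a minimal PyTorch sketch of this computation (my own illustration, not code from the paper; the function name is mine, and it assumes single-head attention on 2-D tensors without a batch dimension):

```python
import torch

def self_attention(Q, K, V):
    # Q, K: (N, d') queries and keys; V: (N, d) values.
    d_k = K.shape[-1]
    A = torch.softmax(Q @ K.T / d_k ** 0.5, dim=-1)  # (N, N) attention map
    return A @ V  # F_out = A V, shape (N, d)
```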
Here, the paper also gives a simplified version.
That is, $Q$, $K$, and $V$ are all replaced by the input feature $F$, which is formalized as:
$$A = \text{softmax}(FF^T), \qquad F_{out} = AF$$
However, the computational complexity is $O(dN^2)$, which is a major drawback of the attention mechanism.
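A sketch of this simplified variant (again my own illustration), which makes the source of the $O(dN^2)$ cost visible in the $N \times N$ attention map:

```python
import torch

def simplified_self_attention(F_in):
    # F_in: (N, d) input features; here Q = K = V = F_in.
    A = torch.softmax(F_in @ F_in.T, dim=-1)  # (N, N) map -> the O(d N^2) term
    return A @ F_in  # F_out = A F
```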
External Attention
Two matrices, $M_k \in \mathbb{R}^{S\times d}$ and $M_v \in \mathbb{R}^{S\times d}$, are introduced to replace the original $K$ and $V$.
We give its formalization directly:
$$A = \text{Norm}(FM_k^T), \qquad F_{out} = AM_v$$
This design reduces the complexity to $O(dSN)$, and the work finds that even when $S \ll N$, sufficient accuracy is still maintained.
Here, the $\text{Norm}(\cdot)$ operation first applies a softmax over the columns and then normalizes each row by its sum.
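Putting this together, a minimal PyTorch sketch of an external-attention layer might look as follows (an illustration based on the formulas above, not the authors' released code; the class and parameter names are mine, and the two `nn.Linear` layers stand in for $M_k$ and $M_v$):

```python
import torch
import torch.nn as nn

class ExternalAttention(nn.Module):
    def __init__(self, d, S):
        super().__init__()
        # mk's weight is M_k in R^{S x d}; mv's weight (transposed) is M_v in R^{S x d}.
        self.mk = nn.Linear(d, S, bias=False)  # computes F @ M_k^T
        self.mv = nn.Linear(S, d, bias=False)  # computes A @ M_v

    def forward(self, F_in):
        # F_in: (N, d) input features.
        attn = self.mk(F_in)                         # (N, S)
        attn = torch.softmax(attn, dim=0)            # softmax over the columns
        attn = attn / attn.sum(dim=1, keepdim=True)  # then normalize each row
        return self.mv(attn)                         # (N, d), overall O(dSN)
```

For example, `ExternalAttention(d=64, S=8)(torch.randn(1024, 64))` returns a `(1024, 64)` tensor while only ever forming a `1024 × 8` attention map, rather than a `1024 × 1024` one.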
Experimental Analysis
First, the paper replaces the attention mechanism in the Transformer with external attention and then evaluates on a variety of tasks, including:
- Image classification
- Semantic segmentation
- Image generation
- Point cloud classification
- Point cloud segmentation
Only partial results are given here, to briefly show the accuracy impact of the replacement.
Image Classification

Semantic Segmentation

Image Generation

As you can see, across the different tasks there is essentially no loss of accuracy.