Replacing Self-Attention with an MLP
2022-07-02 07:51:00 【MezereonXP】
Using an MLP Instead of Self-Attention
This is a work from Tsinghua University, “Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks”.
It replaces the self-attention mechanism with two linear layers and, in the end, improves speed while maintaining accuracy.
What is surprising about this work is that an MLP can stand in for the attention mechanism, which forces us to reconsider where attention's performance gains actually come from.
The Self-Attention Mechanism in the Transformer
First, recall the standard self-attention of the Transformer. We give its formal definition:
$$A = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right), \qquad F_{out} = AV$$
where $Q, K \in \mathbb{R}^{N\times d'}$ and $V \in \mathbb{R}^{N\times d}$.
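As a concrete illustration, here is a minimal PyTorch sketch of this formula (tensor shapes follow the notation above; this is only an illustrative sketch without multi-head attention or masking, not the paper's code):

```python
import torch
import torch.nn.functional as F

def self_attention(Q, K, V):
    """Standard self-attention: A = softmax(Q K^T / sqrt(d_k)), F_out = A V.

    Q, K: (N, d'), V: (N, d). Minimal single-head sketch.
    """
    d_k = Q.shape[-1]
    A = F.softmax(Q @ K.transpose(-2, -1) / d_k ** 0.5, dim=-1)  # (N, N)
    return A @ V                                                  # (N, d)
```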
Here, the paper also considers a simplified version, in which $Q$, $K$, and $V$ are all replaced by the input feature $F$. It is formalized as:
$$A = \text{softmax}(FF^T), \qquad F_{out} = AF$$
However, the computational complexity is $O(dN^2)$, which is a major drawback of the attention mechanism.
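The simplified variant can be sketched the same way (again only illustrative); building the $N \times N$ attention map is what makes the cost quadratic in $N$:

```python
import torch.nn.functional as F

def simplified_self_attention(feat):
    """Simplified self-attention where Q = K = V = F (the input feature).

    feat: (N, d). Forming the (N, N) map A costs O(d N^2).
    """
    A = F.softmax(feat @ feat.transpose(-2, -1), dim=-1)  # (N, N)
    return A @ feat                                        # (N, d)
```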
External Attention
Two matrices $M_k \in \mathbb{R}^{S\times d}$ and $M_v \in \mathbb{R}^{S\times d}$ are introduced to replace the original $K$ and $V$. Its formalization is:
$$A = \text{Norm}(FM_k^T), \qquad F_{out} = AM_v$$
This design reduces the complexity to $O(dSN)$, and the work finds that even when $S \ll N$, sufficient accuracy can still be maintained.
Here, the $\text{Norm}(\cdot)$ operation first applies a softmax over the columns and then normalizes the rows.
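Below is a minimal PyTorch sketch of external attention, assuming $M_k$ and $M_v$ are realized as bias-free linear layers and taking $S = 64$ as an illustrative memory size (not necessarily the paper's exact configuration):

```python
import torch
import torch.nn as nn

class ExternalAttention(nn.Module):
    """External attention with two linear layers (sketch).

    M_k and M_v are learnable S x d memories shared across all inputs,
    implemented here as bias-free nn.Linear layers.
    """
    def __init__(self, d, S=64):
        super().__init__()
        self.mk = nn.Linear(d, S, bias=False)   # computes F M_k^T
        self.mv = nn.Linear(S, d, bias=False)   # computes A M_v

    def forward(self, feat):                    # feat: (B, N, d)
        attn = self.mk(feat)                    # (B, N, S)
        attn = torch.softmax(attn, dim=1)       # softmax over the columns (N)
        attn = attn / (attn.sum(dim=2, keepdim=True) + 1e-9)  # row normalization
        return self.mv(attn)                    # (B, N, d)
```

Because $M_k$ and $M_v$ do not depend on the input length $N$, both matrix products scale linearly in $N$, which is where the $O(dSN)$ complexity comes from.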
Experimental Analysis
First, the paper replaces the attention mechanism in the Transformer with external attention and then evaluates it on a variety of tasks, including:
- Image classification
- Semantic segmentation
- Image generation
- Point cloud classification
- Point cloud segmentation
Only partial results are given here, to briefly illustrate the accuracy change after the replacement.
Image Classification
Semantic Segmentation
Image Generation
As the results show, there is essentially no loss of accuracy across the different tasks.