Transformer variants (Routing Transformer, Linformer, Big Bird)
2022-07-25 12:02:00 【Shangshanxianger】
This post continues the previous two articles in the series:
- Transformer variants (Sparse Transformer, Longformer, Switch Transformer)
- Transformer variants (Star-Transformer, Transformer-XL)

Efficient Content-Based Sparse Attention with Routing Transformers
The goal is the same as in the previous two posts: reducing the time complexity of the standard Transformer. The Routing Transformer models attention as a routing problem: the model learns to select sparse clusters of tokens, where cluster membership is a function of the content of each key and query, not just their absolute or relative positions. In a nutshell, tokens with similar representations are clustered together and attend only within their cluster, which speeds up the computation.
The figure in the paper compares this attention pattern with other models: each row of a diagram represents an output position and each column an input position. In panels (a) and (b), the shaded squares mark the elements attended to by each output row; for routing attention, the colors mark the members of each output token's cluster. Concretely, the keys and queries are projected with a shared random weight matrix, $R = [Q, K][W_R, W_R]^T$, the vectors in $R$ are grouped into $k$ clusters with k-means, and within each cluster $C_k$ the context embedding is the weighted sum $X'_i = \sum_{j \in C_k} A_{ij} V_j$.
Finally, the authors use $\sqrt{n}$ clusters, so the time complexity drops to $O(n\sqrt{n})$. See the original paper and code for details; a minimal sketch of the routing idea follows the links below:
- paper:https://arxiv.org/abs/2003.05997
- code:https://storage.googleapis.com/routing_transformers_iclr20/
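To make the routing idea concrete, below is a minimal single-head sketch in PyTorch (my own illustration, not the authors' implementation; the function name, the plain k-means loop, and the per-cluster Python loop are simplifications for clarity):

```python
import torch
import torch.nn.functional as F

def routing_attention(q, k, v, n_clusters, kmeans_iters=3):
    """Sketch of content-based routing attention for a single head.

    q, k, v: (n, d) tensors. Queries and keys are projected with a shared
    random matrix, the projected vectors are clustered with a few k-means
    steps, and attention is computed only within each cluster.
    """
    n, d = q.shape
    w_r = torch.randn(d, d)              # shared random projection W_R
    r = torch.cat([q, k], dim=0) @ w_r   # routing space for queries and keys

    # a few k-means iterations on the routing vectors
    centroids = r[torch.randperm(2 * n)[:n_clusters]]
    for _ in range(kmeans_iters):
        assign = torch.cdist(r, centroids).argmin(dim=1)   # (2n,)
        for c in range(n_clusters):
            members = r[assign == c]
            if len(members) > 0:
                centroids[c] = members.mean(dim=0)

    q_assign, k_assign = assign[:n], assign[n:]
    out = torch.zeros_like(v)
    for c in range(n_clusters):
        qi = (q_assign == c).nonzero(as_tuple=True)[0]
        ki = (k_assign == c).nonzero(as_tuple=True)[0]
        if len(qi) == 0 or len(ki) == 0:
            continue
        # standard scaled dot-product attention, restricted to cluster c
        attn = F.softmax(q[qi] @ k[ki].T / d ** 0.5, dim=-1)
        out[qi] = attn @ v[ki]
    return out

# usage: n = 64 tokens, d = 32, sqrt(n) = 8 clusters
q, k, v = (torch.randn(64, 32) for _ in range(3))
print(routing_attention(q, k, v, n_clusters=8).shape)  # torch.Size([64, 32])
```

With $\sqrt{n}$ clusters each query attends to roughly $n/\sqrt{n} = \sqrt{n}$ keys on average, which is where the $O(n\sqrt{n})$ cost comes from.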

Linformer: Self-Attention with Linear Complexity
From $O(n^2)$ to $O(n)$! The authors first show, theoretically and empirically, that the matrix formed by the self-attention mechanism is approximately low rank. They therefore introduce linear projections that decompose the original scaled dot-product attention into several smaller attentions, whose combination is a low-rank factorization of standard attention. As shown in the figure of the paper, two linear projection matrices $E_i$ and $F_i$ are inserted when computing the keys $K$ and values $V$:
$$\mathrm{head}_i = \mathrm{Attention}(QW_i^Q, E_iKW_i^K, F_iVW_i^V) = \mathrm{softmax}\!\left(\frac{QW_i^Q (E_iKW_i^K)^T}{\sqrt{d_k}}\right) \cdot F_iVW_i^V$$
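A minimal sketch of one such head (assuming the per-head input projections $W_i^Q, W_i^K, W_i^V$ have already been applied; the names `linformer_head` and `proj_k` are mine, not the paper's):

```python
import torch
import torch.nn.functional as F

def linformer_head(q, k, v, e, f):
    """Sketch of one Linformer attention head.

    q, k, v: (n, d) tensors; e, f: (proj_k, n) projection matrices that
    compress the n keys/values down to proj_k rows, so the attention map
    is (n, proj_k) instead of (n, n) -- linear in the sequence length.
    """
    d = q.shape[-1]
    k_proj = e @ k                                      # (proj_k, d): E_i K
    v_proj = f @ v                                      # (proj_k, d): F_i V
    attn = F.softmax(q @ k_proj.T / d ** 0.5, dim=-1)   # (n, proj_k)
    return attn @ v_proj                                # (n, d)

n, d, proj_k = 1024, 64, 128
q, k, v = (torch.randn(n, d) for _ in range(3))
e, f = torch.randn(proj_k, n), torch.randn(proj_k, n)
print(linformer_head(q, k, v, e, f).shape)  # torch.Size([1024, 64])
```

Note that `proj_k` is fixed, so memory and time grow linearly with $n$ rather than quadratically.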
The paper also offers three levels of parameter sharing (a sketch of the three options follows the list):
- Headwise: all attention heads of a layer share the same projection parameters, i.e. $E_i = E$, $F_i = F$.
- Key-Value: the key and value projections of all heads additionally share one matrix, i.e. $E_i = F_i = E$.
- Layerwise: parameters are shared across all layers, i.e. every layer uses the same projection matrix $E$.
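A rough sketch of how the three sharing levels could be declared (a hypothetical module layout of my own, not the paper's code):

```python
import torch.nn as nn

n, proj_k, n_heads, n_layers = 1024, 128, 8, 12

# Headwise: one (E, F) pair per layer, reused by all heads of that layer
headwise = [(nn.Linear(n, proj_k, bias=False),
             nn.Linear(n, proj_k, bias=False)) for _ in range(n_layers)]

# Key-Value: a single matrix per layer serves as both E_i and F_i
key_value = [nn.Linear(n, proj_k, bias=False) for _ in range(n_layers)]

# Layerwise: one projection E shared by every head of every layer
layerwise = nn.Linear(n, proj_k, bias=False)
```

Layerwise sharing is the most aggressive: a single $proj_k \times n$ matrix serves the whole model, so the extra parameter cost of the projections becomes negligible.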
See the original paper for the full details, including the theoretical proof and analysis of the low-rank property:
- paper:https://arxiv.org/abs/2006.04768
- code:https://github.com/tatp22/linformer-pytorch

Big Bird: Transformers for Longer Sequences
Big Bird also uses a sparse attention mechanism to reduce the complexity to linear, i.e. $O(n)$. As illustrated in the figure of the paper, Big Bird combines three kinds of attention:
- Random Attention. As in panel (a), each token $i$ attends to $r$ randomly selected tokens.
- Window Attention (local attention). As in panel (b), a sliding window captures each token's local information.
- Global Attention. As in panel (c), a few tokens attend to, and are attended by, every position to capture global information. This pattern also appears in Longformer; see the corresponding paper.
Combining the three parts gives the attention matrix $A$; panel (d) shows the resulting BIGBIRD pattern. The formula is
$$\mathrm{ATTN}(X)_i = x_i + \sum_{h=1}^{H} \sigma\!\left(Q_h(x_i) K_h(X_{N(i)})^T\right) \cdot V_h(X_{N(i)})$$
where $H$ is the number of heads, $N(i)$ is the set of tokens that token $i$ attends to (the sparse union of the three parts above), and $Q$, $K$, $V$ are our old friends, the query, key, and value projections. A toy sketch of the combined sparse mask follows the paper link.
- paper:https://arxiv.org/abs/2007.14062
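As a toy illustration (my own dense-mask sketch; the real implementation uses block-sparse kernels), the three attention patterns can be combined into one boolean mask:

```python
import torch

def bigbird_mask(n, window=3, n_random=2, n_global=2, seed=0):
    """Sketch of a BigBird-style sparse attention mask.

    mask[i, j] = True means query token i may attend to key token j; the
    mask is the union of sliding-window, random, and global attention.
    """
    g = torch.Generator().manual_seed(seed)
    idx = torch.arange(n)

    # window attention: neighbours within the sliding window
    mask = (idx[:, None] - idx[None, :]).abs() <= window // 2

    # random attention: r random keys for every query row
    rand_keys = torch.randint(0, n, (n, n_random), generator=g)
    mask.scatter_(1, rand_keys, True)

    # global attention: the first n_global tokens see and are seen by all
    mask[:n_global, :] = True
    mask[:, :n_global] = True
    return mask

mask = bigbird_mask(16)
# in attention: scores.masked_fill(~mask, float('-inf')) before the softmax
print(mask.int())
```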