Transformer variants (Routing Transformer, Linformer, Big Bird)
2022-07-25 12:02:00 【Shangshanxianger】
This post continues the previous two articles; portals to both:
- Transformer variant (Sparse Transformer, Longformer, Switch Transformer)
- Transformer variant (Star-Transformer, Transformer-XL)

Efficient Content-Based Sparse Attention with Routing Transformers
The goal is the same as in the previous two posts: reducing the time complexity of the standard Transformer. Routing Transformer models the problem as a routing problem. The aim is to let the model learn to select sparse clusters of tokens, where a cluster is a function of the content of each key and query, not only of their absolute or relative positions. In short, tokens with similar functions can be grouped into one representation, which speeds up the computation.
The figure above compares routing attention with other models. Each row in the diagram represents an output position and each column an input position. In panels (a) and (b), the shaded squares mark the elements attended to by each output position; for the routing attention mechanism, the colors mark the members of the cluster that the output token belongs to. Concretely, the keys and queries are projected with a shared random weight matrix, $R=[Q,K][W_R,W_R]^T$. The vectors in $R$ are then grouped into $k$ clusters with k-means, and within each cluster $C_k$ the context embedding is a weighted sum: $X'_i=\sum_{j \in C_k} A_{ij}V_j$
Finally, the authors use $\sqrt{n}$ clusters, so the time complexity drops to $O(n\sqrt{n})$. See the original paper and code for details (a toy sketch follows the links below):
- paper:https://arxiv.org/abs/2003.05997
- code:https://storage.googleapis.com/routing_transformers_iclr20/
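
To make the routing idea concrete, here is a minimal PyTorch sketch (not the authors' implementation) of content-based clustered attention: queries and keys are projected with a shared random matrix, assigned to roughly $\sqrt{n}$ clusters by nearest centroid (a crude stand-in for the paper's online k-means), and attention is computed only inside each cluster. The function name, tensor shapes, and centroid initialization are illustrative assumptions.

```python
import math
import torch
import torch.nn.functional as F

def routing_attention(Q, K, V, num_clusters=None):
    """Toy content-based routing attention (single head, no batch dimension).

    Q, K, V: tensors of shape (n, d). Attention is restricted to tokens
    whose routed representations fall in the same cluster.
    """
    n, d = Q.shape
    if num_clusters is None:
        num_clusters = max(1, int(math.sqrt(n)))  # paper uses about sqrt(n) clusters

    # Shared random projection W_R applied to both queries and keys.
    W_R = torch.randn(d, d) / math.sqrt(d)
    R_q = Q @ W_R
    R_k = K @ W_R

    # Stand-in for online k-means: fixed random centroids, nearest-centroid assignment.
    centroids = torch.randn(num_clusters, d)
    q_cluster = torch.cdist(R_q, centroids).argmin(dim=-1)  # (n,)
    k_cluster = torch.cdist(R_k, centroids).argmin(dim=-1)  # (n,)

    out = torch.zeros_like(V)
    for c in range(num_clusters):
        qi = (q_cluster == c).nonzero(as_tuple=True)[0]
        ki = (k_cluster == c).nonzero(as_tuple=True)[0]
        if qi.numel() == 0 or ki.numel() == 0:
            continue
        # Standard scaled dot-product attention, but only inside cluster c.
        scores = (Q[qi] @ K[ki].T) / math.sqrt(d)
        A = F.softmax(scores, dim=-1)
        out[qi] = A @ V[ki]  # X'_i = sum_{j in C_c} A_ij V_j
    return out

# Example: 64 tokens, model dim 32 -> about 8 clusters
x = torch.randn(64, 32)
y = routing_attention(x, x, x)
print(y.shape)  # torch.Size([64, 32])
```

Each query only scores the keys routed to its own cluster, which is where the $O(n\sqrt{n})$ saving comes from.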

Linformer: Self-Attention with Linear Complexity
From O(n^2) to O(n)! The authors first show, theoretically and empirically, that the matrix formed by the self-attention mechanism can be approximated by a low-rank matrix. They therefore introduce linear projections that decompose the original scaled dot-product attention into several smaller attentions, so that the combination of these small attentions is a low-rank factorization of standard attention. As shown in the figure above, two linear projection matrices $E_i$ and $F_i$ are added when computing the keys $K$ and values $V$, namely $head_i=\mathrm{Attention}(QW^Q_i,E_iKW^K_i,F_iVW^V_i)=\mathrm{softmax}\left(\frac{QW^Q_i(E_iKW^K_i)^T}{\sqrt{d_k}}\right)\cdot F_iVW^V_i$
It also provides three levels of parameter sharing:
- Headwise: all attention heads share the same projection parameters, i.e. $E_i=E$, $F_i=F$.
- Key-Value: the key and value projection matrices of all attention heads share one parameter, i.e. $E_i=F_i=E$.
- Layerwise: parameters are shared across all layers, i.e. every layer uses the same projection matrix $E$.
The full details, including the theoretical proof and analysis of the low-rank property, are in the original paper (a toy sketch follows the links below):
- paper:https://arxiv.org/abs/2006.04768
- code:https://github.com/tatp22/linformer-pytorch
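
As a rough illustration (not the official implementation), the sketch below shows a single Linformer-style attention head in PyTorch: learnable matrices $E$ and $F$ project the length-$n$ key and value sequences down to $k$ rows, so the attention map is $n \times k$ rather than $n \times n$. The class name, dimensions, and initialization are assumptions.

```python
import math
import torch
import torch.nn as nn

class LinformerHead(nn.Module):
    """Toy single-head Linformer attention (no batch dimension): E and F
    project the length-n key/value sequences down to k rows, so the
    attention map is n x k instead of n x n."""

    def __init__(self, d_model, d_head, n, k):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_head, bias=False)
        self.W_k = nn.Linear(d_model, d_head, bias=False)
        self.W_v = nn.Linear(d_model, d_head, bias=False)
        # Projections over the sequence dimension; sharing E = F here would
        # correspond to the "Key-Value" sharing level described above.
        self.E = nn.Parameter(torch.randn(k, n) / math.sqrt(n))
        self.F = nn.Parameter(torch.randn(k, n) / math.sqrt(n))
        self.scale = math.sqrt(d_head)

    def forward(self, x):                      # x: (n, d_model)
        Q = self.W_q(x)                        # (n, d_head)
        K = self.E @ self.W_k(x)               # (k, d_head): E_i K W_i^K
        V = self.F @ self.W_v(x)               # (k, d_head): F_i V W_i^V
        A = torch.softmax(Q @ K.T / self.scale, dim=-1)  # (n, k): linear in n
        return A @ V                           # (n, d_head)

head = LinformerHead(d_model=64, d_head=16, n=128, k=32)
out = head(torch.randn(128, 64))
print(out.shape)  # torch.Size([128, 16])
```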

Big Bird: Transformers for Longer Sequences
Big Bird also uses a sparse attention mechanism and reduces the complexity to linear, i.e. O(n). As shown in the figure above, Big Bird combines three parts of attention:
- Random attention. As in panel (a), each token $i$ attends to $r$ randomly selected tokens.
- Window attention (local attention). As in panel (b), a sliding window captures each token's local information.
- Global attention. As in panel (c), a few tokens compute global information. These components are also discussed in Longformer; refer to the corresponding paper.
Finally, the three parts are combined into the attention matrix $A$; panel (d) shows the resulting BIGBIRD pattern. The formula is $ATTN(X)_i=x_i+\sum^H_{h=1} \sigma\left(Q_h(x_i)K_h(X_{N(i)})^T\right)\cdot V_h(X_{N(i)})$, where $H$ is the number of heads, $N(i)$ is the set of tokens that token $i$ attends to (the sparse set formed by the three parts above), and $Q$, $K$, $V$ are our old friends, the query, key, and value projections. A toy sketch of the sparse pattern follows the paper link below.
- paper:https://arxiv.org/abs/2007.14062
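
Below is a minimal sketch (not the paper's blocked implementation) of how the three patterns can be combined into a boolean attention mask and used in masked scaled dot-product attention. The window size, the number of random and global tokens, and the function names are assumptions chosen for illustration.

```python
import math
import torch

def bigbird_mask(n, window=3, num_random=2, num_global=2, seed=0):
    """Build a BIGBIRD-style boolean attention mask (True = attend),
    combining window, random, and global attention."""
    g = torch.Generator().manual_seed(seed)
    mask = torch.zeros(n, n, dtype=torch.bool)

    # Window attention: each token attends to its local neighbourhood.
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True

    # Random attention: each token attends to a few random tokens.
    for i in range(n):
        idx = torch.randperm(n, generator=g)[:num_random]
        mask[i, idx] = True

    # Global attention: a few tokens attend to, and are attended by, everyone.
    mask[:num_global, :] = True
    mask[:, :num_global] = True
    return mask

def sparse_attention(Q, K, V, mask):
    """Scaled dot-product attention restricted to the given mask."""
    d = Q.shape[-1]
    scores = Q @ K.T / math.sqrt(d)
    scores = scores.masked_fill(~mask, float("-inf"))
    A = torch.softmax(scores, dim=-1)
    return A @ V

n, d = 32, 16
x = torch.randn(n, d)
out = sparse_attention(x, x, x, bigbird_mask(n))
print(out.shape)  # torch.Size([32, 16])
```

Because each row of the mask keeps only O(window + num_random + num_global) entries, the attention cost per token is constant, which is where the overall linear complexity comes from (the paper achieves it with a blocked, batched implementation rather than a dense mask).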