Transformer Variants (Sparse Transformer, Longformer, Switch Transformer)
2022-07-25 12:02:00 【Shangshanxianger】

Before you knew it, the Transformer had gradually penetrated into every field, and it has spawned a considerable number of variants along the way, as pictured above. A previous post covered the variants Star-Transformer and Transformer-XL; this one sorts out several more important Transformer variants, namely Sparse Transformer, Longformer, and Switch Transformer.

Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection
The complexity of the standard Transformer is O(n^2). But is it really necessary to attend to every element in the sequence, and is there a way to simplify the mechanism? The "Sparse" in this paper means that only a few tokens participate in the attention distribution, which makes the attention mechanism more concentrated. In other words, a word is usually related to only a few other words, yet standard self-attention assigns weight to all words and then aggregates; a natural idea is therefore to let the model focus on just a few elements through explicit selection.
The model diagram is shown in the figure above. On the far left is the standard route for computing attention, in the middle is the Sparse implementation, where the difference is a manually selected Sparsification step, and on the far right is its execution diagram. Simply put, the top-k most important elements are selected before the softmax scores are computed. Specifically, first compute the inner products:

$$P = \frac{QK^T}{\sqrt{d}}$$

Then the top-k elements are filtered according to the inner-product scores, i.e. the M operation in the formula below; the remaining entries of P are set directly to negative infinity, which forces attention onto only k elements:

$$A = \mathrm{softmax}(M(P, k))$$

Finally, the scores are multiplied back onto V:

$$C = AV$$

This operation makes attention more concentrated. The same computation-reducing idea also helps models like GPT-3 achieve good results when scaled up to be bigger and more brute-force.
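Below is a minimal PyTorch sketch of this top-k masking idea (not the authors' released implementation; the tensor shapes and the default k are illustrative assumptions):

```python
import math
import torch
import torch.nn.functional as F

def explicit_sparse_attention(Q, K, V, k=8):
    """Top-k sparse attention sketch: keep only the k largest scores per query.

    Q, K, V: tensors of shape (batch, seq_len, d); k is the number of tokens
    each query is allowed to attend to (illustrative default).
    """
    d = Q.size(-1)
    # P = Q K^T / sqrt(d)
    P = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d)
    # M(P, k): keep the top-k scores in each row, set the rest to -inf
    topk_vals, _ = P.topk(k, dim=-1)
    threshold = topk_vals[..., -1, None]            # k-th largest score per query
    P_masked = P.masked_fill(P < threshold, float("-inf"))
    # A = softmax(M(P, k)), C = A V
    A = F.softmax(P_masked, dim=-1)
    return torch.matmul(A, V)

# Toy usage: batch of 2 sequences, 16 tokens, dimension 64
Q = torch.randn(2, 16, 64); K = torch.randn(2, 16, 64); V = torch.randn(2, 16, 64)
C = explicit_sparse_attention(Q, K, V, k=4)
print(C.shape)  # torch.Size([2, 16, 64])
```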
- paper:https://arxiv.org/abs/1912.11637
- code:https://github.com/lancopku/Explicit-Sparse-Transformer

Longformer: The Long-Document Transformer
Longformer is another classic Sparse method. It uses a total of 3 strategies:
Sliding Window: As shown in figure (b) above, much like a CNN, given a fixed window size w, each token attends only to w/2 tokens on each side instead of the whole sequence. The computational complexity is reduced to O(n × w), i.e. linear in the sequence length. Setting a different window size for each layer also balances the efficiency and representational power of the model well (a small mask-construction sketch follows these three strategies).
Dilated sliding window: As shown in the figure above, similar to a dilated CNN, the receptive field can be enlarged further without changing the computational complexity. Likewise, setting a different dilation configuration on each head of multi-head attention lets different heads attend to different local contexts of the document, and the dilation in particular allows attention to reach tokens that are quite far away.
Global Attention: As shown in figure (d) above, a few pre-selected tokens attend globally, so they can represent the overall characteristics of the sequence, much like the [CLS] token in BERT. The overall complexity is thus reduced to O(n).
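As a rough illustration of how these patterns combine, here is a small sketch that builds a boolean attention mask from a dilated sliding window plus a few global positions (the window size, dilation, and choice of global token are illustrative assumptions; the real Longformer uses custom kernels and never materializes the full n × n mask):

```python
import torch

def longformer_style_mask(seq_len, window=4, dilation=1, global_idx=(0,)):
    """Boolean attention mask: True where attention is allowed.

    Combines a (dilated) sliding window with a few global positions
    (e.g. a [CLS]-like token at index 0). All values are illustrative.
    """
    idx = torch.arange(seq_len)
    offset = idx[None, :] - idx[:, None]                  # relative position j - i
    half = window // 2
    # Dilated sliding window: within half*dilation and aligned with the dilation step
    mask = (offset.abs() <= half * dilation) & (offset % dilation == 0)
    # Global attention: chosen tokens attend everywhere and are attended by everyone
    for g in global_idx:
        mask[g, :] = True
        mask[:, g] = True
    return mask

# Scores at disallowed positions would be set to -inf before the softmax.
mask = longformer_style_mask(seq_len=10, window=4, dilation=2, global_idx=(0,))
print(mask.int())
```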
The complete details can be found in the original paper:
- paper:https://arxiv.org/pdf/2004.05150.pdf
- code:https://github.com/allenai/longformer

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Compared with Sparse Attention, whose sparse operators struggle to exploit the performance of GPU and TPU hardware, Switch Transformer needs no sparse operators and adapts much better to dense hardware such as GPUs and TPUs. The main idea is to simplify sparse routing: in a natural-language MoE (Mixture of Experts) layer, sending a token's representation to a single expert instead of several actually performs better. The model architecture is shown in the figure above; the blue part in the middle is the key component. You can see that each time the router dispatches a token only to the single FFN with the largest routing score p, and this operation greatly reduces the amount of computation.
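A minimal sketch of this top-1 ("switch") routing in PyTorch, ignoring the expert-capacity limits and load-balancing auxiliary loss described in the paper (layer sizes and the number of experts are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchFFN(nn.Module):
    """Top-1 MoE layer sketch: each token is routed to a single expert FFN."""

    def __init__(self, d_model=64, d_ff=256, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)            # routing logits
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                      # x: (tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)              # routing probabilities
        p, expert_idx = probs.max(dim=-1)                      # pick the single best expert
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = expert_idx == e                              # tokens routed to expert e
            if sel.any():
                # scale by the gate value p so the router stays differentiable
                out[sel] = p[sel].unsqueeze(-1) * expert(x[sel])
        return out

layer = SwitchFFN()
tokens = torch.randn(10, 64)                                   # 10 tokens of dimension 64
print(layer(tokens).shape)                                     # torch.Size([10, 64])
```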
On the other hand, another reason for its success is a very good parallelization strategy, which combines data parallelism + model parallelism + expert parallelism. The details are as follows:
In fact, looking at the model architecture, expert parallelism is parallelism between operators: inside each expert's corresponding FFN there is operator-level (intra-operator) model parallelism, while on the computation graph the experts as a whole form multiple parallel FFN branches, which is inter-operator model parallelism. This arrangement gives lower communication overhead and improves parallel efficiency.
- paper:https://arxiv.org/abs/2101.03961
- code:https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/transformer/moe.py
A follow-up article will continue to sort out related content.