Transformer variants (Sparse Transformer, Longformer, Switch Transformer)
2022-07-25 12:02:00 【Shangshanxianger】

Almost without noticing it, the Transformer has gradually penetrated every field, and it has also spawned a considerable number of variants of its own, as pictured above. A previous post covered the Star-Transformer and Transformer-XL variants; this article sorts out several more very important Transformer variants, namely the Sparse Transformer, Longformer, and Switch Transformer.

Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection
The complexity of the standard Transformer is O(n^2). But is it really necessary to attend to every element in the sequence, and is there a way to simplify the mechanism? The "Sparse" in this paper means that only a few tokens participate in the attention distribution, which makes the attention mechanism more concentrated. In other words, a word is typically related to only a few other words, yet standard self-attention assigns a weight to every word before aggregating; a natural idea is therefore to explicitly select a handful of elements and let the model focus only on them.
The model diagram is shown in the figure above: the leftmost path is the standard attention computation, the middle path is the Sparse implementation, and the rightmost part is its execution diagram. The difference is the manually inserted Sparsification step in the middle: before computing the softmax scores, a top-k selection keeps only the most important elements. Concretely, first compute the inner products

$$P=\frac{QK^T}{\sqrt{d}}$$

then, based on these scores, manually keep the top-k elements (the M operation in the formula below) and set all other entries of P to negative infinity, which forces attention onto only k elements:

$$A=\mathrm{softmax}(M(P,k))$$

Finally, multiply the scores back onto V:

$$C=AV$$

This makes the attention more concentrated. The same kind of computation-saving operation is also what allows GPT-3 to scale the model up and get even more brute-force, and it achieved good results there as well.
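As a rough illustration, here is a minimal sketch of this top-k masking, assuming PyTorch and single-head tensors of shape (batch, n, d); it only illustrates the idea and is not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def explicit_sparse_attention(Q, K, V, k=8):
    """Top-k sparse attention: only the k largest scores per query survive the softmax."""
    d = Q.size(-1)
    P = Q @ K.transpose(-2, -1) / d ** 0.5           # P = QK^T / sqrt(d), shape (batch, n, n)
    topk_vals, _ = P.topk(k, dim=-1)                 # k largest scores for each query row
    threshold = topk_vals[..., -1:]                  # the k-th largest score per row
    P = P.masked_fill(P < threshold, float("-inf"))  # M(P, k): everything else -> -inf
    A = F.softmax(P, dim=-1)                         # A = softmax(M(P, k))
    return A @ V                                     # C = AV
```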
- paper:https://arxiv.org/abs/1912.11637
- code:https://github.com/lancopku/Explicit-Sparse-Transformer

Longformer: The Long-Document Transformer
Longformer is another classic sparse-attention method. It uses a total of three strategies:
Sliding Window: as shown in figure (b) above, this works much like a CNN: given a fixed window size w, each token only attends to the w/2 tokens on each side instead of the whole sequence. The computational complexity drops to O(n × w), i.e. it is linear in the sequence length. Using a different window size at each layer also gives a good balance between efficiency and representational power.
Dilated sliding window: as shown in figure (c) above, analogous to a dilated CNN, the receptive field can be enlarged further without changing the computational complexity. Likewise, giving each head of the multi-head attention a different dilation configuration lets the heads attend to different local contexts of the document, and the dilation in particular allows the window to reach tokens that are far away.
Global Attention: as shown in figure (d) above, a few tokens are given global attention so that they can represent the overall characteristics of the sequence, playing the same role as [CLS] in BERT. The complexity is reduced to O(n).
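To make the combination of sliding-window and global attention concrete, here is a small sketch of building such an attention mask (assuming PyTorch; the layout is a simplification of the patterns in the paper's figure, not the official implementation):

```python
import torch

def longformer_mask(seq_len, window=4, global_positions=(0,)):
    """Boolean (seq_len, seq_len) mask; True means the query may attend to the key."""
    idx = torch.arange(seq_len)
    # Sliding window: each token attends to window // 2 tokens on each side.
    mask = (idx[None, :] - idx[:, None]).abs() <= window // 2
    # Global attention: selected tokens (e.g. a [CLS]-like position) attend
    # everywhere and are attended to by every token.
    for g in global_positions:
        mask[g, :] = True
        mask[:, g] = True
    return mask
```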
The complete details can be found in the original paper:
- paper:https://arxiv.org/pdf/2004.05150.pdf
- code:https://github.com/allenai/longformer

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Compared with sparse attention, whose sparse operators struggle to exploit the performance of hardware such as GPUs and TPUs, the Switch Transformer needs no sparse operators and adapts much better to GPUs, TPUs, and other dense hardware. Its main idea is to simplify sparse routing: in a natural-language MoE (Mixture of Experts) layer, sending each token's representation to a single expert instead of several actually performs better. The model architecture is shown in the figure above; the blue block in the middle is the key component. You can see that the router sends each token only to the single FFN with the largest routing score p, and this operation greatly reduces the amount of computation.
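A minimal sketch of this top-1 ("switch") routing over a set of expert FFNs is shown below, assuming PyTorch; the module structure and sizes are illustrative only and are not the paper's Mesh TensorFlow implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchFFN(nn.Module):
    """Each token is routed to exactly one expert FFN (top-1 routing)."""
    def __init__(self, d_model, d_ff, num_experts):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                        # x: (num_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)
        p, expert_idx = probs.max(dim=-1)        # top-1: one expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            sel = expert_idx == i                # tokens routed to expert i
            if sel.any():
                # Scale by the router probability so gradients flow back to the router.
                out[sel] = p[sel].unsqueeze(-1) * expert(x[sel])
        return out
```

The actual Switch Transformer also adds an auxiliary load-balancing loss and an expert capacity factor, which this sketch omits.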
On the other hand, another reason for its success is a very good parallelization strategy that combines data parallelism, model parallelism, and expert parallelism. The details are as follows:
In terms of the model architecture, expert parallelism is parallelism between operators: inside each expert's FFN there is intra-operator model parallelism, while on the computational graph the set of experts forms multiple parallel FFN branches, which is inter-operator model parallelism. This gives lower communication overhead and improves the efficiency of parallel execution.
- paper:https://arxiv.org/abs/2101.03961
- code:https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/transformer/moe.py
The next article will continue to sort out related content.