Transformer Variants (Sparse Transformer, Longformer, Switch Transformer)
2022-07-25 12:02:00 【Shangshanxianger】

Before you knew it, the Transformer had gradually penetrated into every field, and it has spawned a considerable number of variants along the way, as pictured above. A previous post covered the variants Star-Transformer and Transformer-XL; this one sorts out several more important Transformer variants, namely Sparse Transformer, Longformer, and Switch Transformer.

Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection
The complexity of the standard Transformer is O(n^2). But is it really necessary to attend to every element in the sequence, and is there a way to simplify the mechanism? The "Sparse" in this paper means that only a few tokens participate in the attention distribution, which makes the attention mechanism more concentrated. In other words, a word is usually related to only a few other words, yet standard self-attention assigns weight to all words and then aggregates; a natural idea is therefore to let the model focus on just a few elements through explicit selection.
The model diagram is shown in the figure above. On the far left is the standard route for computing attention, in the middle is the Sparse implementation, where the difference is a manually selected Sparsification step, and on the far right is its execution diagram. Simply put, the top-k most important elements are selected before the softmax scores are computed. Specifically, first compute the inner products:

$$P = \frac{QK^T}{\sqrt{d}}$$

Then the top-k elements are filtered according to the inner-product scores, i.e. the M operation in the formula below; the remaining entries of P are set directly to negative infinity, which forces attention onto only k elements:

$$A = \mathrm{softmax}(M(P, k))$$

Finally, the scores are multiplied back onto V:

$$C = AV$$

This operation makes attention more concentrated. The same computation-reducing idea also helps models like GPT-3 achieve good results when scaled up to be bigger and more brute-force.
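Below is a minimal PyTorch sketch of this top-k masking idea (not the authors' released implementation; the tensor shapes and the default k are illustrative assumptions):

```python
import math
import torch
import torch.nn.functional as F

def explicit_sparse_attention(Q, K, V, k=8):
    """Top-k sparse attention sketch: keep only the k largest scores per query.

    Q, K, V: tensors of shape (batch, seq_len, d); k is the number of tokens
    each query is allowed to attend to (illustrative default).
    """
    d = Q.size(-1)
    # P = Q K^T / sqrt(d)
    P = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d)
    # M(P, k): keep the top-k scores in each row, set the rest to -inf
    topk_vals, _ = P.topk(k, dim=-1)
    threshold = topk_vals[..., -1, None]            # k-th largest score per query
    P_masked = P.masked_fill(P < threshold, float("-inf"))
    # A = softmax(M(P, k)), C = A V
    A = F.softmax(P_masked, dim=-1)
    return torch.matmul(A, V)

# Toy usage: batch of 2 sequences, 16 tokens, dimension 64
Q = torch.randn(2, 16, 64); K = torch.randn(2, 16, 64); V = torch.randn(2, 16, 64)
C = explicit_sparse_attention(Q, K, V, k=4)
print(C.shape)  # torch.Size([2, 16, 64])
```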
- paper:https://arxiv.org/abs/1912.11637
- code:https://github.com/lancopku/Explicit-Sparse-Transformer

Longformer: The Long-Document Transformer
Longformer is another classic Sparse method. It uses a total of 3 strategies:
Sliding Window: As shown in figure (b) above, much like a CNN, given a fixed window size w, each token attends only to w/2 tokens on each side instead of the whole sequence. The computational complexity is reduced to O(n × w), i.e. linear in the sequence length. Setting a different window size for each layer also balances the efficiency and representational power of the model well (a small mask-construction sketch follows these three strategies).
Dilated sliding window: As shown in the figure above, similar to a dilated CNN, the receptive field can be enlarged further without changing the computational complexity. Likewise, setting a different dilation configuration on each head of multi-head attention lets different heads attend to different local contexts of the document, and the dilation in particular allows attention to reach tokens that are quite far away.
Global Attention: As shown in figure (d) above, a few pre-selected tokens attend globally, so they can represent the overall characteristics of the sequence, much like the [CLS] token in BERT. The overall complexity is thus reduced to O(n).
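As a rough illustration of how these patterns combine, here is a small sketch that builds a boolean attention mask from a dilated sliding window plus a few global positions (the window size, dilation, and choice of global token are illustrative assumptions; the real Longformer uses custom kernels and never materializes the full n × n mask):

```python
import torch

def longformer_style_mask(seq_len, window=4, dilation=1, global_idx=(0,)):
    """Boolean attention mask: True where attention is allowed.

    Combines a (dilated) sliding window with a few global positions
    (e.g. a [CLS]-like token at index 0). All values are illustrative.
    """
    idx = torch.arange(seq_len)
    offset = idx[None, :] - idx[:, None]                  # relative position j - i
    half = window // 2
    # Dilated sliding window: within half*dilation and aligned with the dilation step
    mask = (offset.abs() <= half * dilation) & (offset % dilation == 0)
    # Global attention: chosen tokens attend everywhere and are attended by everyone
    for g in global_idx:
        mask[g, :] = True
        mask[:, g] = True
    return mask

# Scores at disallowed positions would be set to -inf before the softmax.
mask = longformer_style_mask(seq_len=10, window=4, dilation=2, global_idx=(0,))
print(mask.int())
```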
The complete details can be found in the original paper:
- paper:https://arxiv.org/pdf/2004.05150.pdf
- code:https://github.com/allenai/longformer

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Compared with Sparse Attention, whose sparse operators struggle to exploit the performance of GPU and TPU hardware, Switch Transformer needs no sparse operators and adapts much better to dense hardware such as GPUs and TPUs. The main idea is to simplify sparse routing: in a natural-language MoE (Mixture of Experts) layer, sending a token's representation to a single expert instead of several actually performs better. The model architecture is shown in the figure above; the blue part in the middle is the key component. You can see that each time the router dispatches a token only to the single FFN with the largest routing score p, and this operation greatly reduces the amount of computation.
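A minimal sketch of this top-1 ("switch") routing in PyTorch, ignoring the expert-capacity limits and load-balancing auxiliary loss described in the paper (layer sizes and the number of experts are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchFFN(nn.Module):
    """Top-1 MoE layer sketch: each token is routed to a single expert FFN."""

    def __init__(self, d_model=64, d_ff=256, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)            # routing logits
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                      # x: (tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)              # routing probabilities
        p, expert_idx = probs.max(dim=-1)                      # pick the single best expert
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = expert_idx == e                              # tokens routed to expert e
            if sel.any():
                # scale by the gate value p so the router stays differentiable
                out[sel] = p[sel].unsqueeze(-1) * expert(x[sel])
        return out

layer = SwitchFFN()
tokens = torch.randn(10, 64)                                   # 10 tokens of dimension 64
print(layer(tokens).shape)                                     # torch.Size([10, 64])
```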
On the other hand, another reason for its success is a very good parallelization strategy, which combines data parallelism + model parallelism + expert parallelism. The details are as follows:
In fact, looking at the model architecture, expert parallelism is parallelism between operators: inside each expert's corresponding FFN there is operator-level (intra-operator) model parallelism, while on the computation graph the experts as a whole form multiple parallel FFN branches, which is inter-operator model parallelism. This arrangement gives lower communication overhead and improves parallel efficiency.
- paper:https://arxiv.org/abs/2101.03961
- code:https://github.com/tensorflow/mesh/blob/master/mesh_tensorflow/transformer/moe.py
A follow-up article will continue to sort out related content.