Transformer -- Analysis and application of attention model
2022-07-28 04:56:00 【gongyuandaye】
1. Summary
Characteristics: the Transformer attention model addresses the low efficiency of attention in RNNs and the deformation problem during training.
Purpose: machine translation.
2. Basic composition
The encoder and the decoder are both stackable.
The decoder takes the features produced by the encoder and combines them with the words already translated to complete the translation.
In the architecture figure, the red box marks the encoder and the blue box marks the decoder; N = 6.
Input: the words to be translated (L one-hot codes) + the words already translated (M one-hot codes)
Output: the probability of each word
Embedding layer: a transformation maps each word's one-hot representation into a continuous space whose dimension matches the model dimension, 512. It can be implemented with nn.Embedding. This gives: the sentence to be translated (L × 512) + the words already translated (M × 512).
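The mapping from one-hot codes to a continuous space can be sketched in plain Python. This is a minimal illustration, not the author's code: the toy sizes (a vocabulary of 5 words, a model dimension of 4 instead of 512) and the helper names `one_hot` and `vec_mat` are assumptions made for the example. It shows that multiplying a one-hot vector by the embedding matrix is the same as selecting one row of that matrix, which is why a lookup table can implement this layer.

```python
import random

def one_hot(index, vocab_size):
    """Return a one-hot list with 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(vocab_size)]

def vec_mat(vec, mat):
    """Multiply a row vector of length n by an n x d matrix (list of rows)."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

random.seed(0)
vocab_size, d_model = 5, 4  # toy sizes (assumption); the article uses d_model = 512
embedding = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(vocab_size)]

token_ids = [3, 0, 3]  # a toy sentence of L = 3 word indices
via_matmul = [vec_mat(one_hot(t, vocab_size), embedding) for t in token_ids]
via_lookup = [embedding[t] for t in token_ids]  # row lookup, no multiplication needed
```

Both `via_matmul` and `via_lookup` have shape L × d_model, matching the L × 512 shape described above.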
Positional encoding: i denotes the dimension index and pos denotes the position of the word.
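The formula itself is not reproduced in the text. Assuming it is the sinusoidal encoding from the original Transformer paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), a small sketch looks like this. The function name and the toy sizes are illustrative.

```python
import math

def positional_encoding(num_positions, d_model):
    """Sinusoidal position codes: one d_model-dimensional vector per position."""
    table = []
    for pos in range(num_positions):
        row = []
        for k in range(d_model):
            i = k // 2  # index of the sin/cos pair that column k belongs to
            angle = pos / (10000 ** (2 * i / d_model))
            # Even columns use sin, odd columns use cos.
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(num_positions=4, d_model=8)  # toy sizes (assumption)
```

Each row of `pe` would be added to the embedding of the word at that position.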
3. Encoder

Unlike the attention mechanism in an RNN, each word vector (512-dimensional) is passed through three linear transformations to produce q, k, and v.
Let's first reduce the multi-head attention mechanism to a single head.
The calculation proceeds as follows:
a11 = q1 * k1, a12 = q1 * k2, ..., a1L = q1 * kL
Normalize these with softmax to get the new a1j, then sum a1j * vj over j to obtain the final output z.
Before softmax, the scores are divided by the square root of dk (the key dimension). This counters the large variance of the dot products, which would otherwise push softmax into regions where the gradient is very small.
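The single-head calculation above can be sketched for one query as follows. The helper names and the 2-dimensional toy vectors are assumptions for illustration; real models use much larger dimensions and tensor libraries.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_row(q, keys, values):
    """Return the output z and the weights a_1j for one query q."""
    d_k = len(keys[0])
    # a_1j = q . k_j, scaled by sqrt(d_k) before softmax
    scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
    weights = softmax(scores)
    # z = sum over j of a_1j * v_j
    z = [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]
    return z, weights

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
q1 = [1.0, 0.0]
z1, a1 = attention_row(q1, keys, values)
```

Here `q1` matches the first and third keys equally well, so they receive equal weight and the second key receives less.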
The resulting attention weights reflect the interdependence between features.
q, k, and v are produced by multiplying the 512-dimensional word vector by the W matrices.
Likewise, the output z can be computed with matrix operations.
The multi-head attention mechanism in the encoder therefore uses h = 8 groups of W matrices, concatenates the 8 resulting z outputs, and multiplies them by a weight matrix to produce the final output Z.
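A rough matrix-form sketch of multi-head attention follows. It uses toy sizes (d_model = 8, h = 2) rather than 512 and 8, random weights instead of trained ones, and a per-head dimension of d_model / h, which follows the original paper and is an assumption here since the article does not state it. All function names are illustrative.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for whole matrices."""
    d_k = len(K[0])
    K_t = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_o):
    """`heads` is a list of (W_q, W_k, W_v) triples, one per head."""
    outputs = [attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
               for W_q, W_k, W_v in heads]
    # Concatenate the per-head z along the feature axis, then project with W_o.
    concat = [sum((z[r] for z in outputs), []) for r in range(len(X))]
    return matmul(concat, W_o)

rng = random.Random(0)
L, d_model, h = 3, 8, 2  # toy sizes (assumption)
d_head = d_model // h
X = rand_matrix(L, d_model, rng)
heads = [tuple(rand_matrix(d_model, d_head, rng) for _ in range(3)) for _ in range(h)]
W_o = rand_matrix(h * d_head, d_model, rng)
Z = multi_head_attention(X, heads, W_o)
```

The output `Z` has the same L × d_model shape as the input `X`, which is what allows encoder layers to be stacked.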
The formula in the paper is as follows:
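Assuming "the paper" is the original Transformer paper ("Attention Is All You Need"), the scaled dot-product and multi-head attention formulas that correspond to the steps described above can be written as:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O},
\qquad
\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V})
```

Here the W_i matrices are the per-head W matrices mentioned above, h = 8, and W^O is the final weight matrix applied to the concatenated heads.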

A mask is applied before softmax to remove the influence of end-of-sentence padding during training.
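One common way to apply such a mask, sketched here under the assumption that masked scores are set to negative infinity before softmax, is shown below. The function name `masked_softmax` is illustrative, and the sketch assumes at least one position is kept.

```python
import math

NEG_INF = float("-inf")

def masked_softmax(scores, keep):
    """Softmax over `scores`, giving zero weight where keep[j] is False.

    Assumes at least one entry of `keep` is True.
    """
    masked = [s if k else NEG_INF for s, k in zip(scores, keep)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5, 0.5]
keep = [True, True, False, False]  # the last two positions are padding
weights = masked_softmax(scores, keep)
```

The padded positions end up with a weight of exactly zero, so they contribute nothing to the weighted sum over the value vectors.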
The Add & Norm step in the encoder applies LayerNorm after the residual connection: LayerNorm(x + Sublayer(x)).
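A minimal sketch of Add & Norm for a single vector is given below. It omits the learnable gain and bias that layer normalization normally carries, and the epsilon value is an assumption.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learnable parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

x = [1.0, 2.0, 3.0, 4.0]
sublayer_out = [0.5, -0.5, 0.5, -0.5]  # stand-in for the attention or FFN output
y = add_and_norm(x, sublayer_out)
```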
The FFN in the encoder is a two-layer network, 512 -> 2048 -> 512. The formula is: FFN(x) = max(0, x W1 + b1) W2 + b2.
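The two-layer network can be sketched as follows, assuming a ReLU between the two linear layers as in the original paper. Toy sizes (4 -> 16 -> 4) keep the same 1:4 ratio as 512 -> 2048; the weights are random and the helper names are illustrative.

```python
import random

def linear(x, W, b):
    """Compute x W + b, where W is a d_in x d_out matrix given as a list of rows."""
    return [sum(xi * w for xi, w in zip(x, col)) + bj for col, bj in zip(zip(*W), b)]

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: linear, ReLU, linear."""
    hidden = [max(0.0, v) for v in linear(x, W1, b1)]
    return linear(hidden, W2, b2)

rng = random.Random(0)
d_model, d_ff = 4, 16  # toy sizes (assumption); the article uses 512 and 2048
W1 = [[rng.uniform(-1, 1) for _ in range(d_ff)] for _ in range(d_model)]
b1 = [0.0] * d_ff
W2 = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
b2 = [0.0] * d_model

out = ffn([0.1, -0.2, 0.3, 0.4], W1, b1, W2, b2)
```

The same weights are applied independently to every position in the sequence, so the output keeps the d_model dimension.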
4. Decoder

The decoder also uses a mask. For the decoder's input, the mask must not only remove the influence of padding but also prevent the decoder from seeing future information, so an upper-triangular mask is applied to the input to preserve the autoregressive property.
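The two masking requirements can be combined as in this small sketch. Here `True` means "may attend", so the blocked entries form the strict upper triangle; the function names are illustrative.

```python
def causal_mask(size):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(size)] for i in range(size)]

def combine_with_padding(causal, not_padding):
    """Additionally block every column that corresponds to a padding token."""
    return [[allowed and not_padding[j] for j, allowed in enumerate(row)]
            for row in causal]

mask = causal_mask(4)
# Suppose the last of the four decoder input tokens is padding.
full_mask = combine_with_padding(mask, [True, True, True, False])
```

Positions where the mask is `False` would have their attention scores suppressed before softmax, so each position only sees itself and earlier, non-padding positions.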
5. Self-supervised learning

Here is a brief introduction to applications in the CV field.
6. Non-local
(To be added)
7. ViT
(To be added)
8. MAE
Based on ViT + BERT.
Mask more image patches so that little redundancy remains.
The encoder processes only the unmasked patches.
A Transformer is used for decoding.
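The patch-masking step can be sketched as a random split of patch indices into visible and masked sets. The 75% mask ratio and the 16-patch image are assumptions for illustration (the article only says that more patches are masked), and the function name is hypothetical.

```python
import random

def random_masking(num_patches, mask_ratio, rng):
    """Randomly split patch indices into (visible, masked) index lists."""
    num_keep = int(num_patches * (1 - mask_ratio))
    order = list(range(num_patches))
    rng.shuffle(order)
    visible = sorted(order[:num_keep])   # only these go through the encoder
    masked = sorted(order[num_keep:])    # these are reconstructed by the decoder
    return visible, masked

rng = random.Random(0)
visible, masked = random_masking(num_patches=16, mask_ratio=0.75, rng=rng)
```

Because the encoder only sees the `visible` indices, its sequence length shrinks to a quarter of the full patch count in this example.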