Classic model: Transformer
2022-06-28 15:23:00 【On the right is my goddess】
The Transformer is the fourth major model family after MLP, CNN, and RNN.
Abstract
Sequence transduction models mainly use RNNs or CNNs, and they typically contain an encoder-decoder structure.
The Transformer relies solely on attention mechanisms.
The paper mainly targets machine translation; the model was later applied in many other fields.
Introduction
Problems:
- An RNN computes step by step, so it cannot be parallelized and its computational performance is poor;
- Information from earlier time steps is gradually lost as the sequence moves forward.
Attention mechanisms had long been combined with RNNs to better realize the exchange of information between encoder and decoder.
But this paper discards the RNN structure entirely and relies on the attention mechanism alone.
Background
It is difficult for a convolutional neural network to model long sequences; many convolutional layers are needed to enlarge the receptive field. The advantage of convolution is that it has multiple output channels, and each channel can learn a different pattern.
Therefore, the paper proposes the multi-head attention model.
Model Architecture
For sequence models, the encoder-decoder structure performs well.
In the decoder of a recurrent neural network, words are output one by one, and the output of the previous time step is used as input for the current one; this is called autoregression.
The encoder, by contrast, can see the entire sentence at once. The whole sequence produced by the encoder is then handed to the decoder.
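To make the autoregressive behavior concrete, here is a minimal sketch of a greedy decoding loop. The names `model.encode`, `model.decode`, `bos_id`, and `eos_id` are hypothetical placeholders, not the paper's or any library's API.

```python
# Minimal sketch of autoregressive (greedy) decoding; the model/method
# names are hypothetical placeholders, not a real library API.
import torch

def greedy_decode(model, src_tokens, bos_id, eos_id, max_len=50):
    memory = model.encode(src_tokens)          # encoder sees the whole source at once
    out = [bos_id]                             # decoder starts from a begin-of-sentence token
    for _ in range(max_len):
        tgt = torch.tensor(out).unsqueeze(0)   # tokens generated so far
        logits = model.decode(tgt, memory)     # decoder may only look at past outputs
        next_id = logits[0, -1].argmax().item()
        out.append(next_id)
        if next_id == eos_id:                  # stop once end-of-sentence is produced
            break
    return out
```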
Encoder and Decoder Stacks
Encoder: a stack of six identical blocks, each containing two sub-layers. In each block, both sub-layers are wrapped in a residual connection and then layer-normalized (LayerNorm).
To keep the residual connections simple (mismatched channel sizes would require projections), the paper sets all dimensions uniformly to 512.
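A minimal sketch of one such encoder block in PyTorch, assuming d_model = 512 and the "sub-layer → residual → LayerNorm" order described above; this is an illustration, not the paper's reference implementation.

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One of the six stacked blocks: self-attention + feed-forward,
    each wrapped in a residual connection followed by LayerNorm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                        # x: [batch, seq_len, 512]
        attn_out, _ = self.attn(x, x, x)         # sub-layer 1: self-attention
        x = self.norm1(x + self.drop(attn_out))  # residual connection + LayerNorm
        ff_out = self.ff(x)                      # sub-layer 2: position-wise MLP
        x = self.norm2(x + self.drop(ff_out))    # residual connection + LayerNorm
        return x
```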
BatchNorm and LayerNorm
Internal covariate shift: during training, the distribution of the data flowing through the network keeps changing, which makes learning harder for the next layer.
During training, for a two-dimensional matrix where rows are samples and columns are features, BatchNorm standardizes each column (computing its mean and standard deviation and applying a Z-score).
Generally, learnable parameters $\gamma$ and $\beta$ are then used to apply a linear transformation to the standardized result, changing the mean and variance of the distribution.
The reason is that if everything were forced into a standard normal distribution, the feature distribution the model has learned would be wiped out, so the model must be given a chance to fine-tune it.
In my view, the role of a BN layer is, on the one hand, to constrain the distribution so that there is a basic prototype, and on the other hand, not to force every feature into exactly the same mold.
At test time, the mean and variance used are the values accumulated during training.
The update formula is $\mu = m\mu + (1-m)\mu_{batch}$; $\sigma$ is updated in the same way.
If the input has shape [B, C, H, W], BatchNorm computes one mean and one variance per channel, i.e., the statistics have shape [C] (the normalization runs over the B, H, and W dimensions).
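A small sketch showing how PyTorch's BatchNorm2d keeps these per-channel running statistics; note that PyTorch writes the update as (1 − momentum) · running + momentum · batch, which matches the formula above with m = 1 − momentum.

```python
import torch
import torch.nn as nn

x = torch.randn(8, 3, 32, 32)                 # [B, C, H, W]
bn = nn.BatchNorm2d(num_features=3, momentum=0.1)

bn.train()
_ = bn(x)                                     # uses batch statistics and updates the running ones
print(bn.running_mean.shape)                  # torch.Size([3]) -> one mean per channel
# running_mean <- (1 - momentum) * running_mean + momentum * batch_mean

bn.eval()
_ = bn(x)                                     # now uses the stored running mean/variance instead
```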

LayerNorm operates within a single sample of a batch. It computes the mean and variance over all the channels (features) of that sample and normalizes them, i.e., the normalization runs along the C dimension.
The formula is $y=\frac{x-E[x]}{\sqrt{Var[x]+\epsilon}}\cdot\gamma+\beta$.
LN is usually used in NLP, because in NLP a sample is a sentence and each position in the sentence is a word; words sitting at the same position in different sentences share no common feature relationship, and BN would also have to normalize over useless padding blocks, so BN works very poorly here.
Therefore, during training the object being normalized is a single word (its feature vector).
Because the normalization is done per sample, there is no need to maintain global mean and variance statistics during training.
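A minimal comparison on an NLP-shaped tensor, assuming a [batch, seq_len, d_model] layout: LayerNorm normalizes each word vector on its own, while BatchNorm would mix statistics across different sentences.

```python
import torch
import torch.nn as nn

x = torch.randn(4, 10, 512)                   # [batch, seq_len, d_model]: 4 sentences, 10 words each

ln = nn.LayerNorm(512)                        # normalizes each word vector independently
y = ln(x)
print(y.mean(dim=-1)[0, 0].item())            # ~0: mean over the 512 features of one word

# BatchNorm1d would instead normalize each feature across the whole batch,
# mixing statistics over different sentences (and over padding), which is
# why LayerNorm is preferred in NLP.
bn = nn.BatchNorm1d(512)
y_bn = bn(x.transpose(1, 2)).transpose(1, 2)  # BatchNorm1d expects [batch, features, seq_len]
```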
Decoder: masked, so that the behavior at training time matches the behavior at prediction time. Its input also includes all of the encoder's outputs.
Attention
An attention function maps a query and a set of key-value pairs to an output.
Scaled Dot-Product Attention
There are two common attention mechanisms: additive attention and dot-product attention. The dot-product form is used here because it is more efficient.
But there is an extra scaling factor $\sqrt{d_k}$, whose purpose is to keep a function like softmax from saturating during training (saturation appearing at the very end is fine).
This is also where the "scaled" in the name comes from.
Now for the attention function itself: $Attention(Q,K,V)=\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
Here Q, K, and V each have shape $(num\_samples, num\_features)$. As one can imagine, element $(i,j)$ of $QK^T$ means: how much attention my $i$-th sample (query) should pay to the $j$-th sample (key).
The shapes of key and value must be consistent with each other; the shape of the query can differ from theirs (its feature dimension still has to match the key's).

How is the mask applied?
In the decoder, a position must not see what comes after it, so the query at time $t$ may only attend to the keys at positions up to $t$. The computation still proceeds as usual; only the products at positions $t+1$ and later are replaced with a very large negative number, so that after softmax they become 0.
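A minimal sketch of scaled dot-product attention with this causal mask; shapes follow the text, and the helper name is my own.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, causal=False):
    # q: [..., n_q, d_k], k: [..., n_k, d_k], v: [..., n_k, d_v]
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)    # (i, j): how much query i attends to key j
    if causal:
        n_q, n_k = scores.shape[-2], scores.shape[-1]
        mask = torch.triu(torch.ones(n_q, n_k, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))  # future positions -> large negative value
    weights = torch.softmax(scores, dim=-1)               # masked positions become 0 after softmax
    return weights @ v

# Example: decoder self-attention, where position t only sees positions <= t.
q = k = v = torch.randn(2, 5, 64)
out = scaled_dot_product_attention(q, k, v, causal=True)
```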
Multi-Head Attention

To simulate multiple output channels, the input is split into several equal-sized heads (channels).
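A small sketch of this head split, assuming 512 dimensions and 8 heads, so each head works on a 64-dimensional slice; the helper names are illustrative.

```python
import torch

def split_heads(x, n_heads):
    # x: [batch, seq_len, d_model] -> [batch, n_heads, seq_len, d_model // n_heads]
    b, n, d = x.shape
    return x.view(b, n, n_heads, d // n_heads).transpose(1, 2)

def merge_heads(x):
    # inverse of split_heads: concatenate the heads back into d_model
    b, h, n, d_head = x.shape
    return x.transpose(1, 2).reshape(b, n, h * d_head)

x = torch.randn(2, 10, 512)
heads = split_heads(x, n_heads=8)    # each head attends over its own 64-dimensional slice
print(heads.shape)                   # torch.Size([2, 8, 10, 64])
print(merge_heads(heads).shape)      # torch.Size([2, 10, 512])
```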

Applications of Attention in our Model
Attention is used in three different places in the model.

In the encoder, the attention layer is plain self-attention: query, key, and value all come from the same input.
The first attention layer of the decoder additionally carries a mask.
In the second attention layer of the decoder, the key and value come from the encoder's output, while the query comes from the previous (masked) attention layer.
Position-wise Feed-Forward Networks
To put it bluntly, this is just an MLP applied to the last dimension (position-wise).
It is a single-hidden-layer MLP: the hidden layer expands the dimension to 2048, and the output layer projects it back to 512.
The formula is $FFN(x)=\max(0,\,xW_1+b_1)W_2+b_2$.
In PyTorch, if the input to a linear layer is 3-D, the computation is applied along the last dimension by default.
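A minimal sketch of the position-wise feed-forward network (512 → 2048 → 512), relying on the fact that nn.Linear acts on the last dimension of a 3-D input.

```python
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)       # expand to 2048
        self.w2 = nn.Linear(d_ff, d_model)       # project back to 512

    def forward(self, x):                        # x: [batch, seq_len, d_model]
        return self.w2(torch.relu(self.w1(x)))   # FFN(x) = max(0, xW1 + b1)W2 + b2

ffn = PositionwiseFFN()
out = ffn(torch.randn(2, 10, 512))               # applied independently to every position
```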
The role of attention is to gather the information in the sequence and aggregate it; the role of the MLP is to map to the semantic space we want. Because every word (position) already contains the complete sequence information after attention, the MLP can be applied to each position independently.
An RNN also uses an MLP for this transformation; to preserve sequence information, it feeds the output of the previous time step into the MLP of the next time step together with the new input.
Embedding and Softmax
The embedding maps each word into a vector.
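A minimal illustration, assuming an arbitrary vocabulary size of 32000; the sqrt(d_model) scaling of the embedding output follows the original paper.

```python
import math
import torch
import torch.nn as nn

vocab_size, d_model = 32000, 512              # vocabulary size is an arbitrary example value
emb = nn.Embedding(vocab_size, d_model)

tokens = torch.tensor([[5, 42, 7]])           # a batch with one 3-word sentence (token ids)
vectors = emb(tokens) * math.sqrt(d_model)    # the paper scales embeddings by sqrt(d_model)
print(vectors.shape)                          # torch.Size([1, 3, 512])
```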
Positional Encoding
Attention carries no order information by itself, so the positions have to be encoded explicitly.
$P_{(pos,2i)}=\sin \frac{pos}{10000^{2i/d_{model}}}$
$P_{(pos,2i+1)}=\cos \frac{pos}{10000^{2i/d_{model}}}$
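A direct implementation of the two formulas above, producing a [max_len, d_model] table that is added to the embeddings; the function name is my own.

```python
import torch

def positional_encoding(max_len, d_model):
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)   # [max_len, 1]
    two_i = torch.arange(0, d_model, 2, dtype=torch.float)        # the even indices 2i
    angle = pos / (10000 ** (two_i / d_model))                    # pos / 10000^(2i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)                                # even dimensions: sin
    pe[:, 1::2] = torch.cos(angle)                                # odd dimensions: cos
    return pe

pe = positional_encoding(max_len=100, d_model=512)
# x = token_embeddings + pe[:seq_len]   # added to the embeddings before the first block
```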
Why Self-Attention

The first comparison column is computational complexity; the second is sequential operations (a measure of parallelism); the third is maximum path length (how many steps it takes for information to travel from the first position in the sequence to the last, reflecting how well information is mixed).
For $QK^T$, $n$ samples are multiplied against $n$ samples, and each dot product takes $d$ operations, hence the $O(n^2\cdot d)$ complexity.
For a recurrent neural network, a $d$-dimensional sample arrives at each step, the MLP performs $d$ operations for each of the $d$ dimensions, and this is repeated $n$ times, hence $O(n\cdot d^2)$.
At this point there is little difference in raw computational complexity between the two. The main points are in the remaining columns: attention does not easily lose information over distance and its parallelism is high.
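A quick numeric check of the two complexity terms above, using n = d = 512 purely as illustrative values:

```python
# Rough operation counts for the two terms compared above (illustrative values only).
n, d = 512, 512                                # sequence length and feature dimension

self_attention = n * n * d                     # QK^T: n x n dot products of length d -> O(n^2 * d)
recurrent = n * d * d                          # n steps, each a d x d transformation  -> O(n * d^2)

print(self_attention, recurrent)               # equal when n == d; attention grows faster for longer sequences
```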
However, attention imposes fewer constraints (inductive biases) on the model, so a larger model and more data are needed to train it well.
Conclusion
The Transformer uses the encoder-decoder structure, but the recurrent layers are replaced with multi-head self-attention.