Classic model: Transformer
2022-06-28 15:23:00 【On the right is my goddess】
The Transformer is the fourth major model family after MLP, CNN, and RNN.
Abstract
Sequence transduction models mainly use RNNs or CNNs, and they typically contain an encoder-decoder structure.
The Transformer relies solely on attention mechanisms.
The paper mainly targets machine translation; the model was later applied in many other fields.
Introduction
Problems:
- An RNN computes step by step, so it cannot be parallelized and its computational performance is poor;
- Information from earlier time steps is gradually lost as the sequence moves forward.
Attention mechanisms had long been combined with RNNs to better realize the exchange of information between encoder and decoder.
But this paper discards the RNN structure entirely and relies on the attention mechanism alone.
Background
It is difficult for a convolutional neural network to model long sequences; many convolutional layers are needed to enlarge the receptive field. The advantage of convolution is that it has multiple output channels, and each channel can learn a different pattern.
Therefore, the paper proposes the multi-head attention model.
Model Architecture
For sequence models, the encoder-decoder structure performs well.
In the decoder of a recurrent neural network, words are output one by one, and the output of the previous time step is used as input for the current one; this is called autoregression.
The encoder, by contrast, can see the entire sentence at once. The whole sequence produced by the encoder is then handed to the decoder.
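To make the autoregressive behavior concrete, here is a minimal sketch of a greedy decoding loop. The names `model.encode`, `model.decode`, `bos_id`, and `eos_id` are hypothetical placeholders, not the paper's or any library's API.

```python
# Minimal sketch of autoregressive (greedy) decoding; the model/method
# names are hypothetical placeholders, not a real library API.
import torch

def greedy_decode(model, src_tokens, bos_id, eos_id, max_len=50):
    memory = model.encode(src_tokens)          # encoder sees the whole source at once
    out = [bos_id]                             # decoder starts from a begin-of-sentence token
    for _ in range(max_len):
        tgt = torch.tensor(out).unsqueeze(0)   # tokens generated so far
        logits = model.decode(tgt, memory)     # decoder may only look at past outputs
        next_id = logits[0, -1].argmax().item()
        out.append(next_id)
        if next_id == eos_id:                  # stop once end-of-sentence is produced
            break
    return out
```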
Encoder and Decoder Stacks
Encoder: a stack of six identical blocks, each containing two sub-layers. In each block, both sub-layers are wrapped in a residual connection and then layer-normalized (LayerNorm).
To keep the residual connections simple (mismatched channel sizes would require projections), the paper sets all dimensions uniformly to 512.
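A minimal sketch of one such encoder block in PyTorch, assuming d_model = 512 and the "sub-layer → residual → LayerNorm" order described above; this is an illustration, not the paper's reference implementation.

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One of the six stacked blocks: self-attention + feed-forward,
    each wrapped in a residual connection followed by LayerNorm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):                        # x: [batch, seq_len, 512]
        attn_out, _ = self.attn(x, x, x)         # sub-layer 1: self-attention
        x = self.norm1(x + self.drop(attn_out))  # residual connection + LayerNorm
        ff_out = self.ff(x)                      # sub-layer 2: position-wise MLP
        x = self.norm2(x + self.drop(ff_out))    # residual connection + LayerNorm
        return x
```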
BatchNorm and LayerNorm
Internal covariate shift: during training, the distribution of the data flowing through the network keeps changing, which makes learning harder for the next layer.
During training, for a two-dimensional matrix where rows are samples and columns are features, BatchNorm standardizes each column (computing its mean and standard deviation and applying a Z-score).
Generally, learnable parameters $\gamma$ and $\beta$ are then used to apply a linear transformation to the standardized result, changing the mean and variance of the distribution.
The reason is that if everything were forced into a standard normal distribution, the feature distribution the model has learned would be wiped out, so the model must be given a chance to fine-tune it.
In my view, the role of a BN layer is, on the one hand, to constrain the distribution so that there is a basic prototype, and on the other hand, not to force every feature into exactly the same mold.
At test time, the mean and variance used are the values accumulated during training.
The update formula is $\mu = m\mu + (1-m)\mu_{batch}$; $\sigma$ is updated in the same way.
If the input has shape [B, C, H, W], BatchNorm computes one mean and one variance per channel, i.e., the statistics have shape [C] (the normalization runs over the B, H, and W dimensions).
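A small sketch showing how PyTorch's BatchNorm2d keeps these per-channel running statistics; note that PyTorch writes the update as (1 − momentum) · running + momentum · batch, which matches the formula above with m = 1 − momentum.

```python
import torch
import torch.nn as nn

x = torch.randn(8, 3, 32, 32)                 # [B, C, H, W]
bn = nn.BatchNorm2d(num_features=3, momentum=0.1)

bn.train()
_ = bn(x)                                     # uses batch statistics and updates the running ones
print(bn.running_mean.shape)                  # torch.Size([3]) -> one mean per channel
# running_mean <- (1 - momentum) * running_mean + momentum * batch_mean

bn.eval()
_ = bn(x)                                     # now uses the stored running mean/variance instead
```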

LayerNorm operates within a single sample of a batch. It computes the mean and variance over all the channels (features) of that sample and normalizes them, i.e., the normalization runs along the C dimension.
The formula is $y=\frac{x-E[x]}{\sqrt{Var[x]+\epsilon}}\cdot\gamma+\beta$.
LN is usually used in NLP, because in NLP a sample is a sentence and each position in the sentence is a word; words sitting at the same position in different sentences share no common feature relationship, and BN would also have to normalize over useless padding blocks, so BN works very poorly here.
Therefore, during training the object being normalized is a single word (its feature vector).
Because the normalization is done per sample, there is no need to maintain global mean and variance statistics during training.
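A minimal comparison on an NLP-shaped tensor, assuming a [batch, seq_len, d_model] layout: LayerNorm normalizes each word vector on its own, while BatchNorm would mix statistics across different sentences.

```python
import torch
import torch.nn as nn

x = torch.randn(4, 10, 512)                   # [batch, seq_len, d_model]: 4 sentences, 10 words each

ln = nn.LayerNorm(512)                        # normalizes each word vector independently
y = ln(x)
print(y.mean(dim=-1)[0, 0].item())            # ~0: mean over the 512 features of one word

# BatchNorm1d would instead normalize each feature across the whole batch,
# mixing statistics over different sentences (and over padding), which is
# why LayerNorm is preferred in NLP.
bn = nn.BatchNorm1d(512)
y_bn = bn(x.transpose(1, 2)).transpose(1, 2)  # BatchNorm1d expects [batch, features, seq_len]
```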
Decoder: masked, so that the behavior at training time matches the behavior at prediction time. Its input also includes all of the encoder's outputs.
Attention
An attention function maps a query and a set of key-value pairs to an output.
Scaled Dot-Product Attention
There are two common attention mechanisms: additive attention and dot-product attention. The dot-product form is used here because it is more efficient.
But there is an extra scaling factor $\sqrt{d_k}$, whose purpose is to keep a function like softmax from saturating during training (saturation appearing at the very end is fine).
This is also where the "scaled" in the name comes from.
Now for the attention function itself: $Attention(Q,K,V)=\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
Here Q, K, and V each have shape $(num\_samples, num\_features)$. As one can imagine, element $(i,j)$ of $QK^T$ means: how much attention my $i$-th sample (query) should pay to the $j$-th sample (key).
The shapes of key and value must be consistent with each other; the shape of the query can differ from theirs (its feature dimension still has to match the key's).

How is the mask applied?
In the decoder, a position must not see what comes after it, so the query at time $t$ may only attend to the keys at positions up to $t$. The computation still proceeds as usual; only the products at positions $t+1$ and later are replaced with a very large negative number, so that after softmax they become 0.
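A minimal sketch of scaled dot-product attention with this causal mask; shapes follow the text, and the helper name is my own.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, causal=False):
    # q: [..., n_q, d_k], k: [..., n_k, d_k], v: [..., n_k, d_v]
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)    # (i, j): how much query i attends to key j
    if causal:
        n_q, n_k = scores.shape[-2], scores.shape[-1]
        mask = torch.triu(torch.ones(n_q, n_k, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))  # future positions -> large negative value
    weights = torch.softmax(scores, dim=-1)               # masked positions become 0 after softmax
    return weights @ v

# Example: decoder self-attention, where position t only sees positions <= t.
q = k = v = torch.randn(2, 5, 64)
out = scaled_dot_product_attention(q, k, v, causal=True)
```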
Multi-Head Attention

To simulate multiple output channels, the input is split into several equal-sized heads (channels).
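A small sketch of this head split, assuming 512 dimensions and 8 heads, so each head works on a 64-dimensional slice; the helper names are illustrative.

```python
import torch

def split_heads(x, n_heads):
    # x: [batch, seq_len, d_model] -> [batch, n_heads, seq_len, d_model // n_heads]
    b, n, d = x.shape
    return x.view(b, n, n_heads, d // n_heads).transpose(1, 2)

def merge_heads(x):
    # inverse of split_heads: concatenate the heads back into d_model
    b, h, n, d_head = x.shape
    return x.transpose(1, 2).reshape(b, n, h * d_head)

x = torch.randn(2, 10, 512)
heads = split_heads(x, n_heads=8)    # each head attends over its own 64-dimensional slice
print(heads.shape)                   # torch.Size([2, 8, 10, 64])
print(merge_heads(heads).shape)      # torch.Size([2, 10, 512])
```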

Applications of Attention in our Model
Attention is used in three different places in the model.

In the encoder, the attention layer is plain self-attention: query, key, and value all come from the same input.
The first attention layer of the decoder additionally carries a mask.
In the second attention layer of the decoder, the key and value come from the encoder's output, while the query comes from the previous (masked) attention layer.
Position-wise Feed-Forward Networks
To put it bluntly, this is just an MLP applied to the last dimension (position-wise).
It is a single-hidden-layer MLP: the hidden layer expands the dimension to 2048, and the output layer projects it back to 512.
The formula is $FFN(x)=\max(0,\,xW_1+b_1)W_2+b_2$.
In PyTorch, if the input to a linear layer is 3-D, the computation is applied along the last dimension by default.
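A minimal sketch of the position-wise feed-forward network (512 → 2048 → 512), relying on the fact that nn.Linear acts on the last dimension of a 3-D input.

```python
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)       # expand to 2048
        self.w2 = nn.Linear(d_ff, d_model)       # project back to 512

    def forward(self, x):                        # x: [batch, seq_len, d_model]
        return self.w2(torch.relu(self.w1(x)))   # FFN(x) = max(0, xW1 + b1)W2 + b2

ffn = PositionwiseFFN()
out = ffn(torch.randn(2, 10, 512))               # applied independently to every position
```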
The role of attention is to gather the information in the sequence and aggregate it; the role of the MLP is to map to the semantic space we want. Because every word (position) already contains the complete sequence information after attention, the MLP can be applied to each position independently.
An RNN also uses an MLP for this transformation; to preserve sequence information, it feeds the output of the previous time step into the MLP of the next time step together with the new input.
Embedding and Softmax
The embedding maps each word into a vector.
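A minimal illustration, assuming an arbitrary vocabulary size of 32000; the sqrt(d_model) scaling of the embedding output follows the original paper.

```python
import math
import torch
import torch.nn as nn

vocab_size, d_model = 32000, 512              # vocabulary size is an arbitrary example value
emb = nn.Embedding(vocab_size, d_model)

tokens = torch.tensor([[5, 42, 7]])           # a batch with one 3-word sentence (token ids)
vectors = emb(tokens) * math.sqrt(d_model)    # the paper scales embeddings by sqrt(d_model)
print(vectors.shape)                          # torch.Size([1, 3, 512])
```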
Positional Encoding
Attention carries no order information by itself, so the positions have to be encoded explicitly.
$P_{(pos,2i)}=\sin \frac{pos}{10000^{2i/d_{model}}}$
$P_{(pos,2i+1)}=\cos \frac{pos}{10000^{2i/d_{model}}}$
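A direct implementation of the two formulas above, producing a [max_len, d_model] table that is added to the embeddings; the function name is my own.

```python
import torch

def positional_encoding(max_len, d_model):
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)   # [max_len, 1]
    two_i = torch.arange(0, d_model, 2, dtype=torch.float)        # the even indices 2i
    angle = pos / (10000 ** (two_i / d_model))                    # pos / 10000^(2i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)                                # even dimensions: sin
    pe[:, 1::2] = torch.cos(angle)                                # odd dimensions: cos
    return pe

pe = positional_encoding(max_len=100, d_model=512)
# x = token_embeddings + pe[:seq_len]   # added to the embeddings before the first block
```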
Why Self-Attention

The first comparison column is computational complexity; the second is sequential operations (a measure of parallelism); the third is maximum path length (how many steps it takes for information to travel from the first position in the sequence to the last, reflecting how well information is mixed).
For $QK^T$, $n$ samples are multiplied against $n$ samples, and each dot product takes $d$ operations, hence the $O(n^2\cdot d)$ complexity.
For a recurrent neural network, a $d$-dimensional sample arrives at each step, the MLP performs $d$ operations for each of the $d$ dimensions, and this is repeated $n$ times, hence $O(n\cdot d^2)$.
At this point there is little difference in raw computational complexity between the two. The main points are in the remaining columns: attention does not easily lose information over distance and its parallelism is high.
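A quick numeric check of the two complexity terms above, using n = d = 512 purely as illustrative values:

```python
# Rough operation counts for the two terms compared above (illustrative values only).
n, d = 512, 512                                # sequence length and feature dimension

self_attention = n * n * d                     # QK^T: n x n dot products of length d -> O(n^2 * d)
recurrent = n * d * d                          # n steps, each a d x d transformation  -> O(n * d^2)

print(self_attention, recurrent)               # equal when n == d; attention grows faster for longer sequences
```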
However, attention imposes fewer constraints (inductive biases) on the model, so a larger model and more data are needed to train it well.
Conclusion
The Transformer uses the encoder-decoder structure, but the recurrent layers are replaced with multi-head self-attention.