
A Nanny-Level (Step-by-Step) Explanation of the Transformer

2022-08-03 07:39:00 WGS.

Original content takes effort; please credit the source when reposting.

一、Model background

  • Paper: Attention Is All You Need

  • The paper describes the Transformer as follows:

    Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.

  • Motivation:

    • In sequence modeling and transduction problems such as language modeling and machine translation, the mainstream architecture is the Encoder-Decoder framework. Traditional Encoder-Decoder models are usually built on RNNs, and the RNN variants LSTM and GRU were long considered the most advanced methods for these tasks. However, RNN computation is inherently sequential. This hinders parallelization, and information is lost along the way, which causes the long-term dependency problem. The main weakness of RNNs and their variants is therefore speed: each hidden state depends on the previous one, so the computation cannot be parallelized.
  • Why is the Transformer better than the RNN and its variants?

    • The Transformer drops the traditional CNN and RNN entirely; the whole network is built from attention mechanisms. The authors chose attention because the computation of an RNN (or LSTM, GRU, etc.) is sequential: it can only proceed left to right or right to left. This causes two problems:
    • The computation at time step $t$ depends on the result at time step $t-1$, which limits the model's ability to run in parallel.
    • Information is lost during sequential computation. Gated structures such as LSTM alleviate the long-term dependency problem to some extent, but they still struggle with very long-range dependencies.

[Figure]
[Figure]

As the figures above show, whether an RNN or a GRU is used, one thing cannot be avoided: the steps depend on each other (the arrows in the figure). Each later output depends on the previous state and the current input.

This means that to obtain an output you must first obtain the previous state; you have to finish one step before taking the next. That is why the computation is said to be sequential, and why it hinders parallel training.
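The dependency described above can be sketched with a few lines of NumPy. This is only an illustrative toy, not the article's model: the function name `rnn_forward`, the tanh cell, and the weight shapes are all assumptions made for the example.

```python
import numpy as np

def rnn_forward(inputs, w_x, w_h):
    """Run a minimal tanh RNN and return the hidden state at every time step."""
    hidden = np.zeros(w_h.shape[0])
    states = []
    for x_t in inputs:                               # steps run strictly one after another
        hidden = np.tanh(x_t @ w_x + hidden @ w_h)   # needs the previous hidden state
        states.append(hidden)
    return np.stack(states)

rng = np.random.default_rng(0)
inputs = rng.normal(size=(4, 3))        # 4 time steps, 3 features each (toy sizes)
w_x = rng.normal(size=(3, 5)) * 0.5     # hypothetical input-to-hidden weights
w_h = rng.normal(size=(5, 5)) * 0.5     # hypothetical hidden-to-hidden weights
states = rnn_forward(inputs, w_x, w_h)
print(states.shape)
```

The `for` loop is the point: step t cannot start until step t-1 has produced `hidden`, so the time dimension cannot be computed in one batched matrix operation.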

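  • Transformer features
    • Because it is based on attention, the distance between any two positions in the sequence is reduced to a constant.
    • Self-attention enables fast parallel computation, fixing the slow training that RNN/LSTM models are most criticized for, and it fits the matrix-oriented computation of existing GPU frameworks.
    • With skip connections the Transformer can be stacked very deep, which exploits the capacity of deep neural networks and improves accuracy.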

二、Task scenario used in this walkthrough

For ease of explanation, and to match the original paper, this article uses machine translation as the example task.

[Figure]

三、Overall model structure

3.1 Model structure diagram

[Figure]

  • This is the original figure from the paper. Both the left and right halves are marked with a multiplier N: the left half is the stack of encoders and the right half is the stack of decoders.
  • Reading the figure from the bottom up:
    • 1. First come the encoder input and the decoder input, each passed through an Embedding;
    • 2. then Positional Encoding;
    • 3. then Multi-Head Attention;
    • 4. then Add&Norm (skip connection & LayerNorm);
    • 5. then Feed Forward (feed-forward network);
    • 6. and finally the output layer (Linear, Softmax).

3.2 Macro-level understanding

For example, a French sentence goes into the Transformer and comes out translated into the corresponding English.

[Figure]

[Figure]

[Figure]

3.3 Micro-level breakdown

[Figure]

3.4 An example machine translation workflow

[Figure]

  • Step 1: feed in the English sentence to translate, i.e. the encoder input: "Why do we work?";
  • Step 2: the encoder processes it and passes its hidden-layer output to the decoder;
  • Step 3: feed the decoder input. Because prediction happens one token at a time, the first input is the start token "start";
  • Step 4: the decoder outputs "为";
  • Step 5: with "为" available, "start 为" is used to predict "什";
  • This repeats until an end token is produced, which marks the end of prediction.

Readers who are new to machine translation may find this confusing, so here is a Chinese-to-English example:

[Figure]

  • Each sentence carries a start token s and an end token e.
  • The process is explained step by step below:

[Figure]

  • The source sentence is processed in parallel by the encoder. The encoder's semantic information and the start token that initializes the decoder are used to predict the next word to generate, for example "I".

[Figure]

  • In step 2 the decoder input has two tokens, i.e. the output of the previous step becomes part of the next input. 'start' and 'I' are used to predict 'like'.

[Figure]

  • The previous output 'like' becomes part of the next input. Likewise, I and like are masked first: start predicts I, 'start I' predicts like, and 'start I like' predicts you.

[Figure]
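The word-by-word loop described in this section can be sketched as follows. This is a minimal illustration, not the article's implementation: `greedy_decode` and `toy_model` are hypothetical names, and the lookup table merely stands in for a real encoder plus decoder.

```python
def greedy_decode(next_token, start="S", end="E", max_len=10):
    """Generate tokens one at a time, feeding each output back in as input."""
    decoded = [start]
    while len(decoded) < max_len:
        token = next_token(decoded)   # predict from everything generated so far
        if token == end:              # the end token stops the loop
            break
        decoded.append(token)
    return decoded[1:]                # drop the start token

# Hypothetical stand-in for encoder + decoder: a lookup table keyed on the prefix.
toy_model = {
    ("S",): "I",
    ("S", "I"): "like",
    ("S", "I", "like"): "you",
    ("S", "I", "like", "you"): "E",
}
result = greedy_decode(lambda prefix: toy_model[tuple(prefix)])
print(result)  # ['I', 'like', 'you']
```

A real decoder would return a probability distribution over the vocabulary at each step; picking the most likely token each time is the greedy strategy assumed here.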

四、Input & Embedding

Data preprocessing for machine translation generally includes the following steps:

  • 1. Convert strings to numeric IDs;
  • 2. Filter sentences by length;
  • 3. Add the start and end tokens;
  • 4. Build mini-batches and pad.

There is not much to explain here, so let's go straight to the sample code.

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import matplotlib.pyplot as plt
import math

# Sentence inputs: encoder input, decoder input, and decoder ground-truth labels.
# P is the padding token, S the start token, E the end token.
sentences = [['我 喜 欢 你 P', 'S i like you', 'i like you E'],
             ['我 喜 欢 你 P', 'S i like you', 'i like you E'],
             ['我 喜 欢 你 P', 'S i like you', 'i like you E']]

# Build the vocabularies. The encoder and decoder could share one vocabulary;
# they are built separately here to keep the demo simple.
# Encoder vocabulary and its size
src_vocab = {'P': 0, '我': 1, '喜': 2, '欢': 3, '你': 4}
src_vocab_size = len(src_vocab)

# Decoder vocabulary and its size
tgt_vocab = {'P': 0, 'i': 1, 'like': 2, 'you': 3, 'S': 4, 'E': 5}
tgt_vocab_size = len(tgt_vocab)

src_len = 5  # length of source
tgt_len = 4  # length of target

def make_batch(sentences):
    """Map each token to its vocabulary ID (label encoding), turning word sequences into ID sequences."""
    input_batch, output_batch, target_batch = [], [], []
    for sentence in sentences:
        input_batch.append([src_vocab[n] for n in sentence[0].split()])
        output_batch.append([tgt_vocab[n] for n in sentence[1].split()])
        target_batch.append([tgt_vocab[n] for n in sentence[2].split()])

    return torch.LongTensor(input_batch), torch.LongTensor(output_batch), torch.LongTensor(target_batch)

# make_batch is defined before it is called so the script runs top to bottom
enc_inputs, dec_inputs, target_batch = make_batch(sentences)

print(enc_inputs)
print(dec_inputs)
print(target_batch)
# Output:
# tensor([[1, 2, 3, 4, 0],
#         [1, 2, 3, 4, 0],
#         [1, 2, 3, 4, 0]])
# tensor([[4, 1, 2, 3],
#         [4, 1, 2, 3],
#         [4, 1, 2, 3]])
# tensor([[1, 2, 3, 5],
#         [1, 2, 3, 5],
#         [1, 2, 3, 5]])

A brief note on what an embedding is: representing a word or a sentence as a dense vector is called embedding. It solves the sparsity problem caused by the discrete one-hot representation. The embedding is the first layer of the network.

Take the example above, "我喜欢你": giving this sentence a 512-dimensional embedding looks like the figure below:

[Figure]

# src_vocab_size is the vocabulary size, d_model is the embedding dimension
self.src_emb = nn.Embedding(num_embeddings=src_vocab_size, embedding_dim=d_model)

In forward, note that the shape after embedding is [batch_size, src_len, d_model], which is [3, 5, 512] in the example above.

enc_outputs = self.src_emb(enc_inputs)
tensor([[[-0.6629,  0.7175, -1.2013,  ...,  1.3913, -0.7109,  0.4084],
         [-0.0628, -0.0403, -0.0125,  ..., -0.7531,  1.2500, -0.6480],
         [-1.0983, -0.2127, -0.1055,  ..., -0.2792, -1.1022, -1.6856],
         [ 0.7872, -0.0841, -0.0297,  ...,  0.1816, -0.4747, -0.6163],
         [ 0.8541, -0.0226,  0.3261,  ..., -0.8943, -1.6848,  1.0269]],

        [[-0.6629,  0.7175, -1.2013,  ...,  1.3913, -0.7109,  0.4084],
         [-0.0628, -0.0403, -0.0125,  ..., -0.7531,  1.2500, -0.6480],
         [-1.0983, -0.2127, -0.1055,  ..., -0.2792, -1.1022, -1.6856],
         [ 0.7872, -0.0841, -0.0297,  ...,  0.1816, -0.4747, -0.6163],
         [ 0.8541, -0.0226,  0.3261,  ..., -0.8943, -1.6848,  1.0269]],

        [[-0.6629,  0.7175, -1.2013,  ...,  1.3913, -0.7109,  0.4084],
         [-0.0628, -0.0403, -0.0125,  ..., -0.7531,  1.2500, -0.6480],
         [-1.0983, -0.2127, -0.1055,  ..., -0.2792, -1.1022, -1.6856],
         [ 0.7872, -0.0841, -0.0297,  ...,  0.1816, -0.4747, -0.6163],
         [ 0.8541, -0.0226,  0.3261,  ..., -0.8943, -1.6848,  1.0269]]],
       grad_fn=<EmbeddingBackward0>) torch.Size([3, 5, 512])
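To see why an embedding avoids the sparsity of one-hot vectors, it helps to notice that an embedding lookup gives the same result as multiplying a one-hot matrix by the embedding table, just without building the sparse matrix. The sketch below uses NumPy rather than `nn.Embedding`, with toy sizes and a randomly initialized table; all names in it are assumptions made for illustration.

```python
import numpy as np

vocab_size, d_model = 5, 8                     # toy sizes; the article uses d_model = 512
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, d_model))   # learned in a real model

token_ids = np.array([[1, 2, 3, 4, 0]])        # "我 喜 欢 你 P" from the example above
looked_up = embedding_table[token_ids]         # row lookup: [batch, src_len, d_model]

one_hot = np.eye(vocab_size)[token_ids]        # [batch, src_len, vocab_size], mostly zeros
via_matmul = one_hot @ embedding_table         # same vectors, via a dense matmul

print(looked_up.shape)
print(np.allclose(looked_up, via_matmul))
```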

五、Positional Encoding

[Figure]

5.1 Why introduce positional encoding?

The figure shows that a positional encoding is added after the embedding. Why is it needed?

The answer lies in a problem introduced by parallel computation, and it is easiest to see by contrast with a traditional recurrent network (RNN):

[Figure]

As mentioned at the beginning of this article, an RNN carries a built-in before/after dependency: "我" must be generated before "爱", and "你" can only be generated after "爱".

Specifically, because the words in a sentence are ordered, the input of each timestep has to include the output of the previous timestep, so the steps cannot be processed in parallel.

In addition, every timestep of an RNN shares one set of parameters $u, w, v$, which leads to vanishing or exploding gradients.

Attention, in contrast, can process positions in parallel, but it has no explicit notion of position.

ps: here is an interview question: what is different about gradient vanishing in an RNN?
Saying that the repeated multiplication makes the RNN gradient vanish is not quite accurate. The RNN gradient is a sum of gradients over time steps. "Vanishing" does not mean that this sum becomes 0; it means that the total is dominated by the gradients from nearby steps while the gradients from distant steps become negligible. That is the real meaning of gradient vanishing in an RNN.

  • Because attention in the Transformer is computed in parallel, it carries no positional information, so the model cannot tell that "我" comes before "爱".
  • For example, take the input "你礼貌吗". If the positions are scrambled into a different word order, the model cannot recognize the change in meaning because it has no positional information, even though to a human the two are completely different sentences.

Hence positional encoding. The solution introduces both absolute and relative position information.

[Figure]

The formula in the figure above uses the sin function at even dimensions and the cos function at odd dimensions:

$$
PE(pos, 2i) = \sin\left(pos / 10000^{2i / d_{model}}\right) \\
PE(pos, 2i+1) = \cos\left(pos / 10000^{2i / d_{model}}\right)
$$

With this formula, every value in the embedding vector can be told apart, and so can every position in the sentence; for example, it tells the model that "我" comes before "喜".

The positional encoding vector is added element-wise to the embedding, and the sum is the input to the whole Transformer.

[Figure]

[Figure]

  • pos is the position of a token in the sentence: for example, the start token s has pos 1 and "我" has pos 2 (the code in section 5.4 indexes positions from 0 instead). It can also be thought of as the order of the vector.
  • 2i denotes the even dimensions and 2i+1 the odd dimensions.
    • i is the embedding-vector index floor-divided by 2. With a 512-dimensional embedding as in the figure, index 0 gives $i = 0 // 2 = 0$, index 1 gives $i = 1 // 2 = 0$, and so on.
  • d_model is the dimension of the embedding vector, a fixed value.

So the positional encoding tensor for the embedding can be computed directly from the formula.

Two observations:
The positional encoding depends on the sentence length seq_len, which determines the range of pos; it also depends on the embedding dimension.
Once these two are fixed, the positional encoding is uniquely determined by the formula. It does not depend on the values inside the embedding vectors, only on seq_len and d_model.

In short, once d_model and seq_length are fixed, the positional encoding is fixed.
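The formula and the "index floor-divided by 2" rule can be written out directly. The sketch below is one possible NumPy version, not the author's code; the function name `positional_encoding` and the leading batch axis of size 1 are assumptions, and positions are counted from 0.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal position table of shape [1, seq_len, d_model]."""
    pos = np.arange(seq_len)[:, None]                    # [seq_len, 1], positions from 0
    i = np.arange(d_model)[None, :] // 2                 # dimension index -> i
    angles = pos / np.power(10000.0, 2 * i / d_model)    # pos / 10000^(2i/d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                # even dimensions use sin
    pe[:, 1::2] = np.cos(angles[:, 1::2])                # odd dimensions use cos
    return pe[None, ...]                                 # add a batch axis

pe = positional_encoding(50, 512)
print(pe.shape)
```

Nothing in the function reads the embedding values: the table depends only on `seq_len` and `d_model`, which is exactly the observation made above.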

Introducing positional encoding brings in both absolute and relative position information.

5.2 How is absolute position represented?

As shown below:

[Figure]

def plot_position_embedding(position_embedding):
    # Plot the positional encoding as a heat map
    plt.pcolormesh(position_embedding[0], cmap='RdBu')  # [50, 512]
    plt.xlabel('Depth')
    plt.xlim((0, 512))
    plt.ylabel('Position')
    plt.colorbar()
    plt.show()

# positional_encoding(seq_len, d_model) is assumed to return an array of shape [1, seq_len, d_model]
position_embedding = positional_encoding(50, 512)
plot_position_embedding(position_embedding)
  • d_model=512, seq_length=50;
  • The vertical axis is pos and the horizontal axis is the embedding dimension;
  • The color bar on the right maps color intensity to value.

[Figure]

Reading from the bottom up:

  • The first stripe is the positional encoding of the start token;
  • the second is the positional encoding of "我";
  • the third is the positional encoding of "喜", and so on;
  • Every row is different, so this is absolute position information: the positional encoding vector of each position is unique. Look closely and you will also see that the stripes vary periodically.

5.3 How is relative position represented?

From the trigonometric identities:

$$
\sin(\alpha + \beta) = \sin(\alpha)\cos(\beta) + \cos(\alpha)\sin(\beta) \\
\cos(\alpha + \beta) = \cos(\alpha)\cos(\beta) - \sin(\alpha)\sin(\beta)
$$

it follows that for an offset $k$ between tokens, $PE(pos+k)$ can be written as a combination of $PE(pos)$ and $PE(k)$, which is how the encoding expresses relative position:

$$
\begin{cases}
PE(pos + k, 2i) = PE(pos, 2i) * PE(k, 2i + 1) + PE(pos, 2i + 1) * PE(k, 2i) \\
PE(pos + k, 2i + 1) = PE(pos, 2i + 1) * PE(k, 2i + 1) - PE(pos, 2i) * PE(k, 2i)
\end{cases}
$$

[Figure]

Put more plainly: in the figure above, the positional-encoding value in the red box can be derived from the values in the green, purple, pink, and blue boxes. That is the relative position effect.
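The two identities can be checked numerically. The snippet below is a small sanity check under assumed values (`pos = 7`, `k = 3`, `i = 5`, `d_model = 512` are arbitrary choices, and `pe_pair` is a helper name invented for this example).

```python
import numpy as np

def pe_pair(pos, i, d_model=512):
    """Return (PE(pos, 2i), PE(pos, 2i+1)), i.e. (sin, cos) of the shared angle."""
    angle = pos / 10000.0 ** (2 * i / d_model)
    return np.sin(angle), np.cos(angle)

pos, k, i = 7, 3, 5
sin_p, cos_p = pe_pair(pos, i)
sin_k, cos_k = pe_pair(k, i)
sin_pk, cos_pk = pe_pair(pos + k, i)

even_rhs = sin_p * cos_k + cos_p * sin_k     # right-hand side for PE(pos+k, 2i)
odd_rhs = cos_p * cos_k - sin_p * sin_k      # right-hand side for PE(pos+k, 2i+1)

print(np.isclose(sin_pk, even_rhs), np.isclose(cos_pk, odd_rhs))
```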

ps: one small trick is to scale the embedding before adding the positional encoding:
After the positional encoding is added, we want the embedding to dominate. If the embedding is scaled up, say by a factor of 10, it accounts for most of the resulting tensor after the addition (the original paper multiplies the embedding by $\sqrt{d_{model}}$).
The main semantic information lives in the embedding, and the positional encoding should not outweigh it. So the embedding is scaled first and then added to the positional encoding.

[Figure]

5.4 Positional encoding code sample

[Figure]

From the positional encoding formula, the sin and cos terms share a common part: $pos / 10000^{2i / d_{model}}$

We use the logarithm to bring the exponent down, which makes it easier to compute:

$$
pos / 10000^{2i / d_{model}} \\
= pos * 10000^{-2i / d_{model}} \\
= pos * e^{-2i / d_{model} * \ln(10000)}
$$

# 3. PositionalEncoding implementation
class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding, written directly from the formula.

    This is only one possible implementation. The even and odd dimensions share
    a common term, and log/exp is used to bring the exponent down for easier
    computation. pos is the index of a token in the sentence: with max_len = 128
    the indices are 0, 1, 2, ..., 127. With d_model = 512, i runs from 0 to 255,
    so 2i takes the values 0, 2, 4, ..., 510.
    """
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()

        self.dropout = nn.Dropout(p=dropout)

        pe = torch.zeros(max_len, d_model)
        # Position indices 0 .. max_len-1, [max_len] -> [max_len, 1]
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        # pe[:, 0::2] starts at 0 with step 2, i.e. the even dimensions
        pe[:, 0::2] = torch.sin(position * div_term)
        # pe[:, 1::2] starts at 1 with step 2, i.e. the odd dimensions
        pe[:, 1::2] = torch.cos(position * div_term)
        # At this point pe has shape [max_len, d_model];
        # after the next line it has shape [max_len, 1, d_model]
        pe = pe.unsqueeze(0).transpose(0, 1)

        # Register as a buffer: stored with the module but not updated during training
        self.register_buffer('pe', pe)

    def forward(self, x):
        """ x: [seq_len, batch_size, d_model] """
        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)

Because of length and layout limits, the remaining code is not reproduced in this article. A Git link will be provided at the end of the article for readers who want to download it.

六、Encoder

[Figure]

The encoder takes the word embeddings plus the positional encoding as input and feeds them into a uniform block that is repeated N times, i.e. there are N layers. Each layer consists of an attention layer and a fully connected layer, with some extra processing: a skip connection and a normalization layer. The model itself is actually quite simple.

6.1 Multi-Head Attention

Attention (Attention, Self-Attention, Multi-Head Attention) has been covered in an earlier article and is not explained in full detail here. Readers who have not seen it can find it at: https://mp.weixin.qq.com/s?__biz=Mzk0MzIzODM5MA==&mid=2247484067&idx=1&sn=cae143a546985413507d3bc750f5f7d6&chksm=c337bf3af440362c67f9ac26e82a5a537c1ea09c9041dfc7cfeae35fe93a9b797700bafe7db4#rd

6.1.1 Understanding self-attention

  • Without self-attention, each word vector carries only its own meaning and no contextual information.
  • With self-attention, each vector is rebuilt so that it represents not only the word itself but also the global context.

This is only a brief illustration of attention; the link above covers it in more depth.

Here we use three vectors as an example:

[Figure]

  • Suppose we have three word vectors: "我", "宣", "你";
  • Each embedding vector is linearly transformed by $w_q$, $w_k$, $w_v$ to produce its own q, k, v;
    • The linear transformation is a matrix multiplication. For example, if $X_1$ has shape (1, 4) and $W_Q$ is initialized with shape (4, 3), then $Q$ has shape (1, 3);
    • Note that this article uses $Q$ and $q$ to distinguish a matrix from a vector; lowercase is used here because the example works with individual vectors;
  • The w matrices are learned parameters;

[Figure]

  • First, q and k are dot-multiplied to compute a correlation score, i.e. how strongly the two are related;
  • In the figure above, every q is matched against every k. To understand what a word means in this sentence we have to consider its role in the whole sentence, so it has to interact with every other word;
  • Note that the relation from "我" to "宣" is not equal to the relation from "宣" to "我".
    • Say I see a girl smiling at me. I might think I am looking sharp and that she likes me, so $q_我 k_宣$ would get a high score;
    • but the girl may in fact be laughing at me for being fat, so the score she gives me, $q_宣 k_我$, may be low;
  • The scores are then normalized with softmax, which maps them into the range 0-1 and produces the attention weights,
    • for example, "我" may be most related to "我" itself, so $a_{我我}$ may be relatively large,
    • suppose $a_{我我}, a_{我宣}, a_{我你} = [0.8, 0.1, 0.1]$;
  • The softmax weights are then used to weight v. For the vector "我", this decides how much semantic information to take from "我", how much from "宣", and how much from "你". The weighted pieces are summed into a new vector that represents "我". The vector for "我" now blends in the semantic information of its context, instead of being the original embedding that only encodes the character "我" itself.
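The steps in this section can be traced with three random vectors. This is a rough NumPy sketch under assumed toy shapes (three 4-dimensional word vectors, 4x3 projection matrices drawn at random instead of learned); scaling is left out here because it is introduced in the next section.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))   # subtract the max for stability
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))             # stand-ins for the vectors of "我", "宣", "你"
w_q, w_k, w_v = (rng.normal(size=(4, 3)) for _ in range(3))   # learned in a real model

q, k, v = x @ w_q, x @ w_k, x @ w_v     # each has shape [3, 3]
scores = q @ k.T                        # scores[i, j] = q_i . k_j
weights = softmax(scores)               # each row sums to 1
new_x = weights @ v                     # context-aware vector for each word

print(weights.sum(axis=1))
print(np.allclose(scores, scores.T))    # the score matrix is generally not symmetric
```

Because q and k come from different projections, `scores[0, 1]` and `scores[1, 0]` differ in general, which is the asymmetry between "我" and "宣" described above.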

6.1.2 Scaled dot-product attention

The formula is:

$$Attention(Q, K, V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

To repeat: the examples above were vectors and used lowercase letters. Here the quantities are matrices (tensors), so uppercase is used.

Scaling means dividing by $\sqrt{d_k}$, where $d_k$ is the dimension of the key vectors (the embedding dimension in this example).

[Figure]

6.1.3 Why divide by $\sqrt{d_k}$: the forward-propagation view

The first thing to understand is that softmax has a strong Matthew effect: the strong get stronger and the weak get weaker.

After scaling, the attention values are spread out more evenly, which gives better generalization.

For example:

import tensorflow as tf

print(tf.math.softmax([[1.0, 2.0, 3.0]]))
print(tf.math.softmax([[10.0, 20.0, 30.0]]))
print(tf.math.softmax([[100.0, 200.0, 300.0]]))
# Output:
# tf.Tensor([[0.09003057 0.24472848 0.66524094]], shape=(1, 3), dtype=float32)
# tf.Tensor([[2.061060e-09 4.539787e-05 9.999546e-01]], shape=(1, 3), dtype=float32)
# tf.Tensor([[0.0e+00 3.8e-44 1.0e+00]], shape=(1, 3), dtype=float32)

[Figure]

  • If the vectors are high-dimensional, the dot products of q and k, such as $q_我 k_我$, $q_我 k_宣$, ..., tend to be large in magnitude;
  • without scaling, softmax is likely to return something like [0, 0, 1]. That would say "我" has no relation at all to "我" or "宣" and a 100% relation to "你". This is clearly unreasonable and defeats the purpose of bringing in contextual information.

The reason is also given here:

[Figure]

Why must the mean and variance be 0 and 1?

This relates to internal covariate shift (ICS):

  • The analysis starts from the assumption that the data follow a standard normal distribution,
  • that the samples still follow it after the embedding,
  • and that the linear transformations producing q and k preserve it.
  • At the $qk$ step things change: the result of the dot product no longer follows a standard normal distribution, and that affects the distribution of everything downstream.
So the question is:

  • Given two vectors:

$$
q = [q_1, q_2, ..., q_{d_k}] \\
k = [k_1, k_2, ..., k_{d_k}]
$$

    where the random variables $q_i$, $k_i$ are independent, each follows a standard normal distribution, and $d_k$ is the embedding dimension.

  • Find:
    (1) the expectation and variance of the random variable $q_ik_i$ in $q \odot k = [q_1k_1, q_2k_2, ..., q_{d_k}k_{d_k}]$;
    (2) $E(Z)$ and $D(Z)$ for $Z = q \cdot k^T = q_1k_1 + q_2k_2 + ... + q_{d_k}k_{d_k}$.

  • 1. Set up the conditions and definitions

For every $i \in \{1, ..., d_k\}$, let $X = q_i$, $Y = k_i$, $XY = q_ik_i$. Then:

$$
\begin{cases} E(X) = 0 \\ E(Y) = 0 \end{cases} \quad \quad \begin{cases} D(X) = 1 \\ D(Y) = 1 \end{cases}
$$

  • 2. Mean

Since $X$ and $Y$ are independent:

$$E(XY) = E(X) \cdot E(Y) = 0$$

Meaning: the random variable $q_ik_i$ has mean 0. In other words, for $q \odot k = [q_1k_1, q_2k_2, ..., q_{d_k}k_{d_k}]$, as long as $d_k$ is large enough, $mean(q \odot k) \approx 0$.

  • 3. Variance

From the formula $D(X) = E(X^2) - E^2(X)$:

$$
\begin{aligned}
D(XY) =& E(X^2Y^2) - [E(XY)]^2 \\
=& E(X^2)E(Y^2) - [E(X)E(Y)]^2 \\
=& [E(X^2) - 0][E(Y^2) - 0] - [E(X)E(Y)]^2 \\
=& [E(X^2) - E^2(X)][E(Y^2) - E^2(Y)] - [E(X)E(Y)]^2 \\
=& D(X)D(Y) - [E(X)E(Y)]^2 \\
=& 1 * 1 - 0 \\
=& 1
\end{aligned}
$$

Meaning: the random variable $q_ik_i$ has variance 1. In other words, for $q \odot k = [q_1k_1, q_2k_2, ..., q_{d_k}k_{d_k}]$, as long as $d_k$ is large enough, the variance of $q \odot k$ is approximately 1.

  • 4. E(Z) and D(Z)

$$
\begin{aligned}
E(Z) =& E(q_1k_1 + q_2k_2 + ... + q_{d_k}k_{d_k}) \\
=& E(q_1k_1) + E(q_2k_2) + ... + E(q_{d_k}k_{d_k}) \\
=& 0 + 0 + ... + 0 \\
=& 0 \\
D(Z) =& D(q_1k_1 + q_2k_2 + ... + q_{d_k}k_{d_k}) \\
=& D(q_1k_1) + D(q_2k_2) + ... + D(q_{d_k}k_{d_k}) \\
=& 1 + 1 + ... + 1 \\
=& d_k \quad \text{(there are } d_k \text{ independent terms)}
\end{aligned}
$$

So after the derivation, the dot product has mean 0 and variance $d_k$.

[Figure]

We need to bring the variance back to 1.

  • 5. Bring $QK^T$ back to $E(X)=0$, $D(X)=1$

$$
\begin{aligned}
& \text{From} \quad D(Z) = d_k \\
& \text{let } \alpha \text{ be a linear scaling factor (a constant) such that} \quad D(\alpha Z) = 1 \\
& \text{Then} \quad D(\alpha Z) = \alpha^2 D(Z) = \alpha^2 d_k = 1 \\
& \text{so} \quad \alpha = \frac{1}{\sqrt{d_k}} \\
& \text{Therefore it is enough to multiply } Z = QK^T \text{ by } \frac{1}{\sqrt{d_k}}.
\end{aligned}
$$

That is the forward-propagation argument for dividing by $\sqrt{d_k}$. Formulas are laborious to type in Markdown and the layout is not ideal; thank you for bearing with it.
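The result of the derivation can be checked with a quick simulation. The sketch below is only a numerical sanity check under assumed settings (`d_k = 64`, 20,000 random samples, a fixed seed); it draws independent standard-normal q and k and measures the mean and variance of their dot product before and after scaling.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, n_samples = 64, 20_000                  # assumed sizes for the experiment
q = rng.standard_normal((n_samples, d_k))
k = rng.standard_normal((n_samples, d_k))

z = (q * k).sum(axis=1)                      # one dot product per sample
scaled = z / np.sqrt(d_k)                    # multiply by 1 / sqrt(d_k)

print(z.mean(), z.var())                     # mean near 0, variance near d_k
print(scaled.var())                          # variance near 1
```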

6.1.4 Why divide by $\sqrt{d_k}$: the back-propagation view

This is easy to see: without the division, the attention scores are large in magnitude, and softmax then tends to cause vanishing gradients during back-propagation.

Take the Matthew-effect example from above:

print(tf.math.softmax([[100.0, 200.0, 300.0]]))
# tf.Tensor([[0.0e+00 3.8e-44 1.0e+00]], shape=(1, 3), dtype=float32)

Taking the partial derivative of the softmax formula gives:

$$
\begin{aligned}
& a_j = softmax(x_j) = \frac{e^{x_j}}{\sum_{i} e^{x_i}} \\
& \frac{\partial a_j}{\partial x_j} = a_j(1-a_j)
\end{aligned}
$$

  • When $x_j$ is the maximum, softmax gives $a_j = 1$ and the gradient is $1*(1-1)=0$;
  • when $x_j$ is any other value, softmax gives $a_j = 0$ and the gradient is $0*(1-0)=0$.

So without the division, the Matthew effect of softmax drives the partial derivatives to 0, the parameters cannot be updated, and the gradient vanishes.

After scaling, $a_j$ is no longer 0 or 1, and the gradients are large enough to update the parameters normally.
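The effect on the gradient can be seen by evaluating $a_j(1-a_j)$ directly. This is a small NumPy illustration, not part of the original code; `softmax_diag_grad` is a helper name made up here, and dividing the logits by 100 is an arbitrary choice to mimic scaling, not the $\sqrt{d_k}$ of a real model.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))        # subtract the max for numerical stability
    return e / e.sum()

def softmax_diag_grad(x):
    """Diagonal of the softmax Jacobian: d a_j / d x_j = a_j * (1 - a_j)."""
    a = softmax(x)
    return a * (1.0 - a)

logits = np.array([100.0, 200.0, 300.0])
unscaled_grad = softmax_diag_grad(logits)
scaled_grad = softmax_diag_grad(logits / 100.0)   # divisor chosen for illustration only

print(unscaled_grad)   # every entry is essentially 0
print(scaled_grad)     # entries are clearly nonzero
```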

6.1.5 Parallel processing (matrix operations)

This is simply matrix arithmetic. The example above used vectors for convenience; the figure below shows the corresponding matrix operation:

[Figure]

Here is an even clearer figure:

[Figure]
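As a rough sketch of the matrix form, the function below computes scaled dot-product attention for all positions in one call and compares it with a row-by-row loop. It is a NumPy illustration under assumed toy shapes, not the article's PyTorch code, and the function name is chosen for this example.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V for 2-D arrays of shape [seq_len, d]."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(3, 4)) for _ in range(3))
output, weights = scaled_dot_product_attention(q, k, v)

# Processing one query vector at a time gives the same result as the single matrix call.
row_by_row = np.stack(
    [scaled_dot_product_attention(q[i:i + 1], k, v)[0][0] for i in range(3)]
)
print(np.allclose(output, row_by_row))
```

The loop and the matrix call are mathematically identical; the matrix version simply lets the hardware handle every position at once.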

6.1.6 Multi-head

Once self-attention is understood, multi-head attention is more than half understood. The encoder and decoder use it in the same way. The paper is titled Attention Is All You Need, after all; attention is the essence.

[Figure]

In theory, multi-head means multiple sets of Q, K, V. The original paper uses 8 heads, i.e. 8 sets of Q, K, V.

  • Multiple parameter sets project the original information into multiple subspaces, so multiple kinds of information are captured. Multi-head attention lets the Transformer attend to information in different subspaces and capture richer features.
  • In other words, after attention each head's matrix carries the semantics as that head understands it, so the final 8 Z matrices represent 8 different understandings.

[Figure]

Multiple heads produce multiple outputs:

[Figure]

Multiple parameter sets yield multiple outputs, but we still need a single one, so the outputs are combined into one by some method (for example, a matrix multiplication):

[Figure]

Readers who have read the source code will know that this is not how it is actually implemented.

With the approach above, 8 heads need 8 sets of $W$, and each set contains $W_Q$, $W_K$, $W_V$. The training cost would be high and online inference would be slow.

So in practice the matrices are split by the number of heads. The example below uses 2 heads.

  • First generate Q, K, V. With 2 heads, Q is split into 2 parts.

[Figure]

  • That is, a single set of $W_Q$, $W_K$, $W_V$ parameters generates $Q$, $K$, $V$, which are then split.

[Figure]

  • Suppose Q has shape (seq_len, d_model), i.e. sentence length by embedding dimension;
  • With two heads, cut it down the middle: the left half is head $q_1$ and the right half is head $q_2$;
    • Note: the embedding dimension must be divisible by the number of heads.

[Figure]

  • Each head then works on its own;
  • each performs the dot product and scaling separately;
  • afterwards, as in the figure above, the left side holds the semantic information of the first head and the right side that of the second head;

[Figure]

  • The results of the heads are concatenated to restore the original dimension;
  • this merges the features extracted by the different heads.

The example above was simplified for demonstration. The figure below shows the actual split:
Suppose the input has shape (batch_size, seq_len, d_model)=(2, 6, 4).
After splitting into two heads the shape is (batch_size, head_num, seq_len, depth)=(2, 2, 6, 2).

[Figure]
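The split from (2, 6, 4) to (2, 2, 6, 2) and the concatenation back can be written with a reshape and a transpose. The sketch below is a NumPy illustration; `split_heads` and `merge_heads` are names chosen for this example rather than functions from the article's source code.

```python
import numpy as np

def split_heads(x, head_num):
    """[batch, seq_len, d_model] -> [batch, head_num, seq_len, depth]."""
    batch_size, seq_len, d_model = x.shape
    assert d_model % head_num == 0, "d_model must be divisible by head_num"
    depth = d_model // head_num
    return x.reshape(batch_size, seq_len, head_num, depth).transpose(0, 2, 1, 3)

def merge_heads(x):
    """Inverse of split_heads: concatenate the heads back along the last axis."""
    batch_size, head_num, seq_len, depth = x.shape
    return x.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, head_num * depth)

q = np.arange(2 * 6 * 4, dtype=float).reshape(2, 6, 4)   # (batch_size, seq_len, d_model)
heads = split_heads(q, head_num=2)
restored = merge_heads(heads)

print(heads.shape)                        # (2, 2, 6, 2)
print(np.array_equal(restored, q))        # True
```

Head 0 receives the left half of the last dimension and head 1 the right half, matching the "cut down the middle" picture above.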

6.1.7 Why use multi-head attention?

This was mentioned briefly above; to summarize:

  • Multi-head attention lets the Transformer attend to information in different subspaces and capture richer features.
  • The authors found this to be effective in practice.
  • It captures diverse features. In plain words: "a long sentence has to be understood from several angles". An example:
    • 学姐查寝
    • It can be read as: 学姐-查寝
    • or as: 学-姐-查寝
  • Multi-head attention can interpret the semantic information of the context more fully; in other words, it understands the sentence by placing it fully in its scene.

6.1.8 Padding mask

Put simply, the padding mask marks the positions of the padding items.

Its purpose is to remove the influence of padding.

Let's return to the attention example above, append a padding item at the end, and see what it affects:

[Figure]

The green parts are those related to padding.

  • Generating Q, K, V is not a problem; the padding item also gets its own Q, K, V.
  • The QK multiplication is not a problem either, as long as padding does not affect the computation on the valid items.
    • For example, when computing QK for "我", the values $q_我k_我$, $q_我k_宣$, $q_我k_你$ are unaffected. There is just one extra padding term, which does not change the others.
  • The problem appears at the softmax step:
    • From the softmax formula
      $softmax(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$
    • the padding item takes part in the softmax as one of the $x_i$;
    • so the green parts also receive weights!

[Figure]

  • In the figure above, for instance, the model would think "我" has some relation to the padding, which makes no sense: padding carries no meaning and is only filler that we added.
  • So these parts should become 0.
  • Tracing back, the padding entries have to be dealt with at the QK multiplication, i.e. the padding has to be removed before the softmax.

[Figure]

  • This is where the padding mask comes in: set the padding items to 1 and everything else to 0.
  • At the QK step, the 1s locate the padding entries.

[Figure]

  • Why not set the padding items to 0 and multiply with the QK result?

    • Padding only causes trouble at the softmax, where it gives $a_{我0}$ a nonzero value;
    • if the padding items were set to 0 and multiplied with QK, then $q_{我}k_0$ would be 0;
    • but in the softmax, $e^0$ is 1, so the resulting $a_{我0}$ is still nonzero and the influence is not removed;
    • so forcing $q_{我}k_0$ to 0 achieves nothing.
  • Instead of multiplying, we add: the 1s are turned into a very large negative number, such as $-10^9$, and added to the scores.

[Figure]

  • After the addition, the values related to padding become very large negative numbers. The other entries are unchanged because 0 is added to them.

[Figure]

  • When $x_i$ is a very large negative number, $e^{x_i}$ is extremely close to 0 and can be treated as 0.
  • The resulting a then becomes 0, which removes the influence of padding.

[Figure]

  • As for the blue part, the QK values of the padding row itself do not matter, because they do not affect the earlier computation.
  • Since the padding weight a is 0, its product with V is 0, i.e. $a_{我0}V_0 = 0$, which means no information is extracted from the padding item.

That covers the padding mask: its job is to remove the influence of padding.
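The difference between adding a large negative number and multiplying by 0 is easy to verify numerically. The following NumPy sketch uses assumed toy values: a 4-token sequence whose last token is padding (id 0) and a random score matrix standing in for $QK^T/\sqrt{d_k}$.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

token_ids = np.array([1, 2, 3, 0])              # the last token is padding (id 0)
pad_mask = (token_ids == 0).astype(float)       # 1 marks padding, 0 marks real tokens

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))                # stand-in for Q K^T / sqrt(d_k)

masked_weights = softmax(scores + pad_mask * -1e9)    # add a huge negative number
zeroed_weights = softmax(scores * (1.0 - pad_mask))   # multiply the padding score by 0

print(masked_weights[:, 3])   # weights on the padding column are 0
print(zeroed_weights[:, 3])   # still positive, because e^0 = 1
```

The mask is broadcast across rows, so it blanks out the padding column (the key side) for every query, which is the green region in the figures.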

6.2 Add&Norm

6.2.1 Add(Skip Connection)

There is not much to say about Add: it is the skip-connection idea from residual networks.

Its purpose is to make deeper networks trainable, since skip connections mitigate vanishing gradients.

[Image]

  • Because $y = x_A + x_C$, and $x_C$ is itself computed from $x_A$ by the intermediate layers, $\frac{\partial y}{\partial x_A} = 1 + \frac{\partial x_C}{\partial x_A}$;
  • This mitigates vanishing gradients: no matter how small the chained product in $\frac{\partial x_C}{\partial x_A}$ becomes, there is always a 1 in front, so the gradient can still flow back and $\frac{\partial y}{\partial x_A}$ does not collapse to 0.
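A rough numerical check of this claim, assuming a hypothetical sub-layer made of ten `tanh` layers with small weights (chosen here only so that its gradient nearly vanishes), and using a finite-difference gradient:

```python
import numpy as np

def block(x, weights):
    """Hypothetical sub-layer: a chain of tanh layers whose gradient shrinks with depth."""
    for w in weights:
        x = np.tanh(w * x)
    return x

def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference estimate of df/dx at a scalar x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

weights = [0.1] * 10  # small weights, so the chained product of derivatives is tiny

# Without the skip connection the gradient has almost vanished.
plain_grad = numeric_grad(lambda x: block(x, weights), 0.5)

# With y = x + block(x), the "1" from the skip path keeps the gradient alive.
residual_grad = numeric_grad(lambda x: x + block(x, weights), 0.5)

print(plain_grad, residual_grad)
```

The scalar setting is a simplification; in the real model the same reasoning applies to the Jacobian of each sub-layer.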

6.2.2 What is BatchNorm?

  • BatchNorm works on a batch of samples: each feature is normalized separately across the samples in the batch.
  • A simple example: suppose I have a batch of samples, each with three features,
    • namely height, weight, and age. Normalizing then means normalizing height on its own, weight on its own, and age on its own,
    • with no cross-influence between the three.

[Image]

  • This is intuitive: it can be seen as removing the effect of each feature's scale. BatchNorm is also common in the MLP part of deep models such as CTR models.
  • For the same reason,
    • BatchNorm is affected by the batch size;
    • when the batch size is small, the results are often unstable.
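A minimal NumPy sketch of the per-feature normalization described above. The height, weight, and age values are invented, and the learnable scale/shift and running statistics of a real BatchNorm layer are left out on purpose.

```python
import numpy as np

# Hypothetical batch: 4 samples x 3 features (height in cm, weight in kg, age in years).
batch = np.array([[170.0, 60.0, 25.0],
                  [180.0, 80.0, 35.0],
                  [160.0, 50.0, 20.0],
                  [175.0, 70.0, 40.0]])

eps = 1e-5  # small constant to avoid division by zero

# axis=0: statistics are computed per feature (column), across the batch.
mean = batch.mean(axis=0)
var = batch.var(axis=0)
batch_normed = (batch - mean) / np.sqrt(var + eps)

print(batch_normed.mean(axis=0))  # each column is centred at about 0
print(batch_normed.std(axis=0))   # each column has spread of about 1
```

Because `mean` and `var` come from only the samples in the batch, a very small batch gives noisy statistics, which is the instability mentioned above.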

6.2.3 What is LayerNorm?

  • LayerNorm works on a single sample:
  • it normalizes all the features of that one sample. At first glance this does not look reasonable,
  • because a mean and variance computed over height, weight, and age together have no obvious meaning. But in some scenarios it is very effective, notably in NLP.

[Image]

  • In NLP, the N features may correspond to different words. If we still used BatchNorm and normalized, say, the first word across the batch, that would clearly not mean much,
  • because any word can appear in the first position, and word order often does not affect a sentence that strongly,
  • whereas normalizing across the N words of one sample reflects the distribution of that sentence well.
  • (LN is usually applied along the third dimension of [batchsize, seq_len, dims]. The features along that dimension share the same scale, so normalizing them together is unproblematic.)
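As a sketch of LN applied along the last axis of a `[batchsize, seq_len, dims]` tensor: the function below is a simplified stand-in (no learnable gain or bias), and the input is random data used only to show the shapes.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each vector along the last axis (dims) independently."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(2, 4, 8))  # [batchsize, seq_len, dims]
out = layer_norm(x)

print(out.shape)                     # same shape as the input
print(out.mean(axis=-1).round(6))    # every word vector is centred at about 0
```

Each word vector is normalized using only its own values, so neither the other samples in the batch nor padding positions influence the statistics.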

6.2.4 Why LayerNorm rather than BatchNorm?

This was already explained above; here is the same example again to reinforce it:

[Image]

LayerNorm is abbreviated LN and BatchNorm is abbreviated BN.
Both remove the effect of differing feature scales and speed up convergence.

  • BN works within a batch: it takes the mean and standard deviation of each feature separately, then subtracts the mean and divides by the standard deviation, so that the feature has mean 0 and variance 1.
  • LN computes the mean and standard deviation within a single sample.

[Image]

  • BN performs poorly in NLP. It computes the mean and variance from the sample data inside one batch.
    • That computation is affected by padding, and the result does not represent the mean and variance of the whole data set. In the Xiaoming and Xiaohong example, the column holding the height feature means the same thing for every sample. In NLP, however, the first row is the embedding of "我" and the second row is the embedding of "宣"; after attention they are vectors carrying semantic information, and a given column does not necessarily carry the same meaning from one word to the next.
  • We cannot claim that each dimension means the same thing across all word vectors.
  • From this angle, BN is not appropriate here.
  • That is why LN is used more often. It normalizes each word vector on its own, so irrelevant padding information is not mixed in.
  • When batch_size is too small, the mean and variance of one batch do not adequately represent those of the overall data, which is another reason BN is a poor fit for NLP.

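6.3 Feed-forward network (Feed Forward)

There is not much to explain here: it is two fully connected layers. The activation function introduces a nonlinear transformation that maps the attention output into a different space, which increases the expressive power of the model.

The model still works with the FFN removed, but the results are much worse.

[Image]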
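The two-layer structure can be sketched in NumPy as follows. This assumes ReLU as the activation, i.e. $\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$ as in the paper; the sizes (`d_model = 8`, `d_ff = 32`) and the random weights are toy values chosen for illustration.

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    """Position-wise FFN: linear -> ReLU -> linear, applied to every position."""
    hidden = np.maximum(0.0, x @ w1 + b1)  # ReLU supplies the nonlinearity
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32          # toy sizes, not the paper's 512 / 2048

x = rng.normal(size=(seq_len, d_model))    # stand-in for the attention output
w1 = rng.normal(size=(d_model, d_ff)) * 0.1
b1 = np.zeros(d_ff)
w2 = rng.normal(size=(d_ff, d_model)) * 0.1
b2 = np.zeros(d_model)

ffn_out = feed_forward(x, w1, b1, w2, b2)
print(ffn_out.shape)  # same shape as the input, so Add&Norm can follow
```

The output keeps the input shape, which is what lets the residual connection add it back to `x`.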

七、The Decoder

[Image]

The encoder and decoder have similar structures, so we only need to look at the attention in the decoder; the rest is not repeated.

The Decoder can be explained in 2 parts:

  • Masked Multi-Head Attention
    • Multi-head attention with a mask. Its purpose is to keep the model from seeing the data it is trying to predict, preventing leakage.
    • If this is unclear, scroll up and reread the Decoder part of the machine translation workflow.
  • Encoder-Decoder Multi-Head Attention
    • Handles the interaction between the Encoder and the Decoder.
    • Attentive readers will notice that two of the arrows here come from the Encoder and one comes from the Decoder.

7.1 Masked Multi-Head Attention

[Image]

Both the Encoder and the Decoder use a padding mask. The Decoder additionally uses a look ahead mask, which is what "masked" in Masked Multi-Head Attention refers to.

This article is already long and I would rather not split it in two, so let's briefly review the decoder workflow:

[Image]

  • First we predict the first position (red box in the figure). To make sure that prediction depends only on the start token `start`, everything after `start` has to be covered.

[Image]

  • To predict the second position, we must make sure it depends only on `start` and `I`, so everything after them is masked out.
  • This is easy to understand: predict the next value based on the information seen so far.

Suppose this is the multi-head attention in the decoder. Let's see what happens without the mask:

[Image]

  • Without the mask, "我" also interacts with the other words in the qk product. But when the decoder is predicting "宣", it must not see the information that comes later.

[Image]

  • So $q_{我}k_{宣}$, $q_{我}k_{你}$, and $q_{我}k_0$ are not legitimate and must be masked out.
  • If these values are not masked, they produce values in the softmax and affect $a_{我我}$.
  • Likewise, the terms produced when "宣", "你", and the padding item interact with later positions must be masked.

[Image]

The padding item does not affect the earlier computations, so it is not considered here.

  • Masking here is done the same way as for the padding mask: the masked scores are set to a very large negative number, $-10^9$;
  • So the question becomes how to locate the positions that need masking;
  • They form an inverted triangle: 3 positions to mask, then 2, then 1.

[Image]

  • In the matrix, as in the figure above, the first row has 3 positions to mask, the 2nd row has 2, and the 3rd row has 1, which gives the inverted-triangle shape.

  • So to generate the look ahead mask, set the positions that need covering to 1 and all other positions to 0.
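One way to build such a mask is with `np.triu`, which keeps the entries above the diagonal. The function name `create_look_ahead_mask` is a name chosen for this sketch, not necessarily the one used in the author's repository.

```python
import numpy as np

def create_look_ahead_mask(size):
    """Return a (size, size) matrix with 1 above the main diagonal and 0 elsewhere.

    Row i is the query position; a 1 in column j > i marks a future position to hide.
    """
    return np.triu(np.ones((size, size)), k=1)

mask = create_look_ahead_mask(4)
print(mask)
# [[0. 1. 1. 1.]
#  [0. 0. 1. 1.]
#  [0. 0. 0. 1.]
#  [0. 0. 0. 0.]]
```

The rows contain 3, 2, 1, and 0 masked positions, matching the inverted triangle described above. As with the padding mask, the matrix would be multiplied by $-10^9$ and added to the scores before the softmax.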

[Image]

  • We also have to take the earlier padding mask into account. For example:

[Image]

  • Suppose "你" is also a padding item; everything related to padding is shown in green;
    • the inverted-triangle part is masked out.
  • At the same time the padding mask must be considered;
  • in the matrix, the padding mask now covers the last two columns, as shown below:

[Image]

  • What we need is the union of the two masks. If we only applied the triangle, the entries in the last two columns that lie outside the triangle would be missed, such as $q_{你}k_{你}$, $q_0k_{你}$, and $q_0k_0$;
  • these entries are related to padding and need the padding mask.
  • Conversely, if we only applied the padding mask, the entries outside the last two columns would not be covered, such as $q_{我}k_{宣}$, which is handled by the look ahead mask.

[Image]

  • So we compute the union of the two: the resulting mask is 1 in the region shown in the figure and 0 elsewhere.
  • This removes the negative effects of both padding and seeing future positions.

[Image]

  • First locate the padding mask, and also generate a look ahead mask.
  • Generating the look ahead mask has nothing to do with padding: only the sentence length is needed. First create a square matrix, then turn it into the inverted triangle. Then take the union of the two masks, which gives the following:

This step relies on broadcasting. Readers who are unfamiliar with it can look it up; it is not covered here.

[Image]

The difference between the multi-head attention in the Decoder and the one in the Encoder lies in the mask that is used.
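The union of the two masks can be sketched with `np.maximum`, which is where broadcasting comes in. The token ids, the choice of 0 as the padding id, and both function names are assumptions made for this example.

```python
import numpy as np

PAD_ID = 0  # assumption: padding tokens are encoded as id 0

def create_padding_mask(token_ids):
    """(batch, seq_len) -> (batch, 1, seq_len); 1 marks a padding key."""
    return (token_ids == PAD_ID).astype(np.float32)[:, np.newaxis, :]

def create_look_ahead_mask(size):
    """(size, size) matrix with 1 above the diagonal (future positions)."""
    return np.triu(np.ones((size, size), dtype=np.float32), k=1)

# Hypothetical ids for "我", "宣", followed by two padding items.
token_ids = np.array([[5, 7, 0, 0]])

padding_mask = create_padding_mask(token_ids)   # shape (1, 1, 4)
look_ahead_mask = create_look_ahead_mask(4)     # shape (4, 4)

# Element-wise maximum of 0/1 masks is their union; (1, 1, 4) broadcasts against (4, 4).
combined_mask = np.maximum(padding_mask, look_ahead_mask)  # shape (1, 4, 4)
print(combined_mask[0])
# [[0. 1. 1. 1.]
#  [0. 0. 1. 1.]
#  [0. 0. 1. 1.]
#  [0. 0. 1. 1.]]
```

The last two columns are fully masked by the padding mask, while the entry in row 1, column 2 comes only from the look ahead mask.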

7.2 Multi-head attention between the Encoder and the Decoder (Multi-Head Attention)

[Image]

In the other multi-head attention blocks, Q, K, and V all come from the same source. In the red box, however, Q, K, and V no longer share a source.

  • Its Q is the output of the Decoder's Add&Norm after a linear transformation.
  • K and V are the output of the Encoder after linear transformations.
  • A padding mask is also applied to handle padding, and it is the same as the Encoder's padding mask.

Let's work through the example below:

[Image]

[Image]

Note that:

  • the example does not include the start and end tokens;
  • padding items are shown in green;
  • q is provided by the decoder, while k and v are the encoder output.

  • Compute qk, taking $q_I$ as an example:

    • This can be understood as measuring the correlation between two different languages, which is what produces the translation.
    • In other words, the earlier attention was computed within the sentence "我宣你", measuring how each word relates to the other words.
    • Here q comes from the Decoder. Taking $q_I$ as an example, qk measures how "I" relates to "我", "宣", and "你" respectively.
  • The softmax then normalizes the scores into attention weights.

  • These weights are multiplied with v. The meaning is the same as before: how much information to extract from "我", "宣", and "你".

  • In this way, matching correlations across the two languages lets the model extract the information needed for translation.

  • Why is the padding mask here the same as the one in the encoder?

    • Look at the qk result in the figure above: for every word vector, the padding influence is the last entry. That is, the influence comes from $k_0$, which is the k derived from the Encoder output.
    • So to remove that influence, the padding mask must be generated from the Encoder side.

[Image]

  • As before, 1 marks a padding item and 0 marks every other item, which locates the padding.
  • The processing is the same: before the softmax, a very large negative number is added to the padding items, as shown below.

[Image]

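A question may come up here: why does this block use the padding mask and not the look ahead mask?

  • In the decoder's self-attention, when the start token is fed in, `I` is the prediction target. Nothing after the start token has been generated yet, so it has to be covered.

[Image]

  • So in the qk computation, again taking "我" as an example, "宣" and "你" have not appeared yet, which makes $q_{我}k_{宣}$ and $q_{我}k_{你}$ illegitimate.

[Image]

  • Because they have not been generated yet, they must be masked out, as shown below:

[Image]

  • Here, however, the attention is an interaction between the Encoder and the Decoder.

[Image]

  • We can see that everything `I` queries (via $q_I$) is Chinese, namely "我", "宣", and "你".
  • In other words, from the point of view of q, every k is visible, because k comes directly from the Encoder.
    • That is, whatever comes from the Encoder is visible to the Decoder.
  • So no look ahead mask is needed; a padding mask for the Encoder side is enough.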
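The encoder-decoder attention described in this section can be sketched for a single head as below. The vectors are random placeholders, the linear projections that produce q, k, and v are omitted, and the $\sqrt{d_k}$ scaling follows the scaled dot-product attention of the paper; `cross_attention` is a name invented for this example.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(q, k, v, enc_padding_mask):
    """q: (tgt_len, d_k) from the decoder; k, v: (src_len, d_k) from the encoder.

    enc_padding_mask: (src_len,), 1 where the encoder token is padding.
    Only the padding mask is applied: every real encoder position is visible.
    """
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)               # (tgt_len, src_len)
    scores = scores + enc_padding_mask * -1e9     # hide encoder padding keys
    weights = softmax(scores)
    return weights @ v, weights

rng = np.random.default_rng(0)
d_k = 4
q = rng.normal(size=(3, d_k))   # decoder side, e.g. three target tokens
k = rng.normal(size=(4, d_k))   # encoder side: "我", "宣", "你", padding
v = rng.normal(size=(4, d_k))
enc_padding_mask = np.array([0.0, 0.0, 0.0, 1.0])

output, weights = cross_attention(q, k, v, enc_padding_mask)
print(weights.round(3))  # last column (padding) is 0 for every query
```

The mask depends only on the encoder sequence, which is why it is built from the Encoder input rather than from the Decoder input.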

If you have read this far, I believe you have learned a lot. Feel free to follow the blog so we can improve together.

八、Output

That covers the encoding and decoding modules of the transformer. So how does "我喜欢你" get translated into "I Like You"?

Put differently: the output of the Decoder is a tensor of floats, so how is it turned into the words "I Like You"?

In summary:

  • The final output passes through a Linear layer (a fully connected layer), which projects the vector produced by the Decoder onto a higher-dimensional vector (the logits);
  • If our vocabulary has 10,000 words, the logits have 10,000 dimensions, and each dimension holds the score of one unique word;
  • A softmax then converts these scores into probabilities;
  • The dimension with the highest probability is selected, and the word associated with it is emitted as the output of that time step.

[Image]
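The Linear, softmax, and argmax steps can be sketched as follows. The 6-word vocabulary, the sizes, and the random weights are all made up, so the printed words are arbitrary; only the mechanics are meant to match the description above.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical tiny vocabulary; a real one would have e.g. 10,000 entries.
vocab = ["<pad>", "<start>", "<end>", "I", "Like", "You"]
d_model = 8

rng = np.random.default_rng(0)
decoder_output = rng.normal(size=(3, d_model))   # one vector per time step
w = rng.normal(size=(d_model, len(vocab)))       # untrained Linear layer weights
b = np.zeros(len(vocab))

logits = decoder_output @ w + b      # (3, vocab_size): one score per word
probs = softmax(logits)              # scores -> probabilities
token_ids = probs.argmax(axis=-1)    # most probable word at each time step
words = [vocab[i] for i in token_ids]
print(words)
```

Picking the argmax at every step is greedy decoding; other strategies exist, but this is the one the text describes.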

九、Training

9.1 Loss computation

The loss is still cross entropy. It is worth discussing here because of padding.

The point to note is that the loss computation also needs a mask, to remove the loss contributed by padding items.

When computing the loss we need to:

  • get the label values (the ground truth);
  • remove the influence of padding.

The predictions matrix holds the predicted probabilities after the softmax.

[Image]

  • The loss of padding items should not be added to the overall loss.
  • So a mask is needed to remove it.

[Image]

  • Unlike before, padding items are marked 0 here and non-padding items are marked 1.
  • Then multiply element-wise with the mask, so that the terms at padding positions become 0.
    • Without the mask, padding would also contribute to the loss, even though padding is just filler with no real meaning.

[Image]
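A minimal sketch of the masked loss, assuming integer labels with 0 as the padding id and averaging over the non-padding positions (the averaging choice is an assumption of this example; the article only states that padding loss must be excluded). The probabilities are invented.

```python
import numpy as np

PAD_ID = 0  # assumption: label id 0 means padding

def masked_cross_entropy(labels, probs):
    """labels: (seq_len,) integer ids; probs: (seq_len, vocab_size) softmax outputs."""
    # Cross entropy per position: -log of the probability given to the true label.
    per_token = -np.log(probs[np.arange(len(labels)), labels])
    # Loss mask: 0 for padding positions, 1 for real tokens.
    mask = (labels != PAD_ID).astype(probs.dtype)
    # Zero out padding positions, then average over the real tokens only.
    return (per_token * mask).sum() / mask.sum()

labels = np.array([3, 4, 5, 0])  # three real tokens followed by one padding item
probs = np.array([[0.1, 0.1, 0.1, 0.5, 0.1, 0.1],
                  [0.1, 0.1, 0.1, 0.1, 0.5, 0.1],
                  [0.1, 0.1, 0.1, 0.1, 0.1, 0.5],
                  [0.2, 0.2, 0.2, 0.2, 0.1, 0.1]])

loss = masked_cross_entropy(labels, probs)
print(loss)  # equals -log(0.5); the padding row contributes nothing
```

Whatever the model predicts at the padding position, the loss stays the same, which is the behaviour the mask is meant to guarantee.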

9.2 Custom learning rate

This is the custom learning-rate schedule described in the paper:

[Image]

  • d_model: the embedding dimension
  • step_num: the current training step
  • warm_up_steps: a user-defined value

The net effect is a learning rate that first increases and then decreases (in line with the idea of learning-rate decay).

[Image]
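The schedule from the paper is $lrate = d_{model}^{-0.5} \cdot \min(step\_num^{-0.5},\ step\_num \cdot warm\_up\_steps^{-1.5})$. A small sketch in plain Python follows; the function name `transformer_lr` is chosen here, and the defaults 512 and 4000 are the values used in the paper.

```python
def transformer_lr(step_num, d_model=512, warm_up_steps=4000):
    """Learning rate at a given training step (step_num must be >= 1).

    Grows linearly during the first warm_up_steps steps, then decays
    proportionally to the inverse square root of the step number.
    """
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warm_up_steps ** -1.5)

for step in (1, 1000, 4000, 16000, 64000):
    print(step, transformer_lr(step))
```

The two arguments of `min` are equal exactly at `step_num == warm_up_steps`, which is where the learning rate peaks.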

十、All code and data

https://github.com/WGS-note/transformer-note

If you found this article helpful, please give the repository a star~

References

https://s3.us-west-2.amazonaws.com/secure.notion-static.com/501fb338-a6b0-484a-8a16-713dd40251de/Attention_is_All_You_Need.pdf?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Content-Sha256=UNSIGNED-PAYLOAD&X-Amz-Credential=AKIAT73L2G45EIPT3X45%2F20220522%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Date=20220522T014504Z&X-Amz-Expires=86400&X-Amz-Signature=180db501219c968fdd116b27d6b44bed0eed6e912755d300cd7db8e957937e1b&X-Amz-SignedHeaders=host&response-content-disposition=filename%20%3D%22Attention%2520is%2520All%2520You%2520Need.pdf%22&x-id=GetObject

https://ugirc.blog.csdn.net/article/details/120394042

https://luweikxy.gitbook.io/machine-learning-notes/self-attention-and-transformer#multi-headed-attention

https://zhuanlan.zhihu.com/p/353381965

https://www.bilibili.com/video/BV1pu411o7BE?spm_id_from=333.337.search-card.all.click

https://www.bilibili.com/video/BV1Kq4y1H7FL?spm_id_from=333.337.search-card.all.click

https://zhuanlan.zhihu.com/p/153183322





Original site

Copyright notice
This article was written by [WGS.]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/215/202208030527206338.html