Some attention code explanations
2022-07-28 17:13:00 【Name filling】
While working on an experiment I came across a library that implements several attention mechanisms. Here is a summary so that it is easy to find later.
1. Linear layer
Let's start with a linear layer. Weight matrices such as W in papers are usually implemented as fully connected layers.
import torch
import torch.nn as nn
from torch.nn import init

class Linear(nn.Module):
    ''' Simple Linear layer with xavier init '''
    def __init__(self, d_in, d_out, bias=True):
        super(Linear, self).__init__()
        self.linear = nn.Linear(d_in, d_out, bias=bias)
        init.xavier_normal(self.linear.weight)

    def forward(self, x):
        return self.linear(x)
init.xavier_normal(self.linear.weight) is one way of initializing the weights; the parameters to be learned in the paper are exactly these weights.
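A quick usage sketch of the Linear wrapper above (the sizes are just illustrative assumptions):

layer = Linear(d_in=16, d_out=8)     # weight initialized with xavier_normal
x = torch.randn(4, 16)               # a batch of 4 input vectors
y = layer(x)                         # y has shape (4, 8): y = x @ W^T + b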
2. DotProductAttention layer
Scaled dot-product attention. It was proposed in the Transformer, which introduced the concepts of Q, K and V in dot-product attention.
The Q matrix is multiplied with the K matrix to obtain the attention-score weight matrix. Here the matrix elements are assumed to have mean 0 and variance 1, and the matrix multiplication performed is matmul(q (mb_size × len_q × d_k), k (mb_size × len_k × d_k)).
To keep the variance of each layer at 1, the result is divided by sqrt(d_k).
Note: the variance is kept at 1 because, during back-propagation through softmax, when some element x_i is much larger or much smaller than the others, the gradient of the softmax function becomes very small (approaches zero). The model's error then cannot be propagated back through the softmax to the parameters in the earlier part of the model, those parameters are not updated, and the training efficiency of the model suffers.
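As a quick sanity check of that note (my own sketch, not part of the library): for random q and k whose entries have mean 0 and variance 1, the dot product over d_k dimensions has variance roughly d_k, and dividing by sqrt(d_k) brings it back to roughly 1.

d_k = 64
q = torch.randn(10000, d_k)                 # entries with mean 0, variance 1
k = torch.randn(10000, d_k)
scores = (q * k).sum(dim=1)                 # 10000 independent dot products
print(scores.var())                         # roughly d_k, i.e. about 64
print((scores / d_k ** 0.5).var())          # roughly 1 after scaling by 1/sqrt(d_k)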
class DotProductAttention(nn.Module):
    ''' Dot-Product Attention '''
    def __init__(self, d_model, attn_dropout=0.1):
        super(DotProductAttention, self).__init__()
        self.temper = 1  # np.power(d_model, 0.5)
        self.dropout = nn.Dropout(attn_dropout)
        self.softmax = nn.Softmax(dim=2)

    def forward(self, q, k, v, attn_mask=None):
        # q (mb_size x len_q x d_k)
        # k (mb_size x len_k x d_k)
        # v (mb_size x len_v x d_v)
        attn = torch.bmm(q, k.transpose(1, 2)) / self.temper
        if attn_mask is not None:
            assert attn_mask.size() == attn.size(), \
                'Attention mask shape {} mismatch with Attention logit tensor shape ' \
                '{}.'.format(attn_mask.size(), attn.size())
            attn.data.masked_fill_(attn_mask, -float('inf'))
        attn = self.softmax(attn)
        attn = self.dropout(attn)
        output = torch.bmm(attn, v)
        return output, attn
Q is multiplied with K to obtain the attention-score weight matrix attn. After the mask is applied and the weights are normalized with softmax, the result is multiplied with the V matrix, i.e. attention score × word vectors, which gives the weighted word vectors.
Return values: the weighted word vectors and the attention weights.
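A minimal usage sketch of DotProductAttention (the sizes and the all-False mask are illustrative assumptions; a True entry in attn_mask marks a position to be filled with -inf):

mb_size, len_q, len_k, d_k, d_v = 2, 5, 7, 64, 64
attention = DotProductAttention(d_model=d_k)
q = torch.randn(mb_size, len_q, d_k)
k = torch.randn(mb_size, len_k, d_k)
v = torch.randn(mb_size, len_k, d_v)                          # len_v must equal len_k
mask = torch.zeros(mb_size, len_q, len_k, dtype=torch.bool)   # nothing masked here
output, attn = attention(q, k, v, attn_mask=mask)
# output: mb_size x len_q x d_v, attn: mb_size x len_q x len_k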
3. SingleLayerAttention layer
As the name suggests, this is a single-layer attention mechanism. This layer is not entirely clear to me, for instance why the author does not divide by sqrt(d_k). I have not come across this form of attention anywhere else; I will fill in this gap when I run into it.
class SingleLayerAttention(nn.Module):
    def __init__(self, d_model, d_k, attn_dropout=0.1):
        super(SingleLayerAttention, self).__init__()
        self.dropout = nn.Dropout(attn_dropout)
        self.softmax = nn.Softmax(dim=2)
        # self.linear = nn.Linear(2*d_k, d_k)
        self.weight = nn.Parameter(torch.FloatTensor(d_k, 1))  # scores all (q_i, k_j) pairs in one batched matmul
        self.act = nn.LeakyReLU()
        init.xavier_normal(self.weight)

    def forward(self, q, k, v, attn_mask=None):
        # q (mb_size x len_q x d_k)
        # k (mb_size x len_k x d_k)
        # v (mb_size x len_v x d_v)
        mb_size, len_q, d_k = q.size()
        mb_size, len_k, d_k = k.size()
        q = q.unsqueeze(2).expand(-1, -1, len_k, -1)
        k = k.unsqueeze(1).expand(-1, len_q, -1, -1)
        x = q - k
        attn = self.act(torch.matmul(x, self.weight).squeeze(3))  # mb_size x len_q x len_k
        if attn_mask is not None:                                 # mb_size x len_q x len_k
            assert attn_mask.size() == attn.size()
            attn_mask = attn_mask.eq(0).data
            attn.data.masked_fill_(attn_mask, -float('inf'))      # broadcast mask
        attn = self.softmax(attn)
        if attn_mask is not None:
            attn.data.masked_fill_(attn_mask, 0)                  # zero out masked positions after softmax
        attn = self.dropout(attn)
        output = torch.bmm(attn, v)
        return output, attn
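Reading the code, the score here is LeakyReLU((q_i - k_j)^T w), a learned projection of the difference between query and key rather than their dot product. A minimal usage sketch (sizes are illustrative assumptions; note that in this module a 0 in attn_mask marks a position to mask out, since the mask is inverted with eq(0)):

mb_size, len_q, len_k, d_k = 2, 5, 7, 64
attention = SingleLayerAttention(d_model=d_k, d_k=d_k)
q = torch.randn(mb_size, len_q, d_k)
k = torch.randn(mb_size, len_k, d_k)
v = torch.randn(mb_size, len_k, d_k)
mask = torch.ones(mb_size, len_q, len_k)   # 1 = keep, 0 = mask out
output, attn = attention(q, k, v, attn_mask=mask)
# output: mb_size x len_q x d_k, attn: mb_size x len_q x len_k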
4. MultiHeadAttention layer
Multi-head attention, which calls the dot-product attention mechanism above.
class MultiHeadAttention(nn.Module):
    ''' Multi-Head Attention module '''
    def __init__(self, n_head, d_input, d_model, d_input_v=None, dropout=0.1):
        super(MultiHeadAttention, self).__init__()
        self.n_head = n_head
        d_k, d_v = d_model//n_head, d_model//n_head
        self.d_k = d_k
        self.d_v = d_v
        if d_input_v is None:
            d_input_v = d_input
        self.w_qs = nn.Parameter(torch.FloatTensor(n_head, d_input, d_k))
        self.w_ks = nn.Parameter(torch.FloatTensor(n_head, d_input, d_k))
        self.w_vs = nn.Parameter(torch.FloatTensor(n_head, d_input_v, d_v))
        self.attention = DotProductAttention(d_model)
        self.proj = Linear(n_head*d_v, d_model)
        self.dropout = nn.Dropout(dropout)
        init.xavier_normal(self.w_qs)
        init.xavier_normal(self.w_ks)
        init.xavier_normal(self.w_vs)

    def forward(self, q, k, v, attn_mask=None):
        d_k, d_v = self.d_k, self.d_v
        n_head = self.n_head
        # residual = q
        mb_size, len_q, d_input = q.size()
        mb_size, len_k, d_input = k.size()
        mb_size, len_v, d_input_v = v.size()
        # treat as a batch of size n_head: replicate the data once per head -> n_head x (mb_size*len_q) x d_input
        q_s = q.repeat(n_head, 1, 1).view(n_head, -1, d_input)    # n_head x (mb_size*len_q) x d_input
        k_s = k.repeat(n_head, 1, 1).view(n_head, -1, d_input)    # n_head x (mb_size*len_k) x d_input
        v_s = v.repeat(n_head, 1, 1).view(n_head, -1, d_input_v)  # n_head x (mb_size*len_v) x d_input_v
        # treat the result as a batch of size (n_head * mb_size); the last dimension is d_k = d_model//n_head
        q_s = torch.bmm(q_s, self.w_qs).view(-1, len_q, d_k)  # (n_head*mb_size) x len_q x d_k
        k_s = torch.bmm(k_s, self.w_ks).view(-1, len_k, d_k)  # (n_head*mb_size) x len_k x d_k
        v_s = torch.bmm(v_s, self.w_vs).view(-1, len_v, d_v)  # (n_head*mb_size) x len_v x d_v
        # after the per-head projections, perform attention; result size = (n_head * mb_size) x len_q x d_v
        outputs, attns = self.attention(q_s, k_s, v_s, attn_mask=attn_mask.repeat(n_head, 1, 1))
        # back to the original mb_size batch, result size = mb_size x len_q x (n_head*d_v)
        outputs = torch.cat(torch.split(outputs, mb_size, dim=0), dim=-1)
        # project back to the residual size, i.e. the original dimension
        outputs = self.proj(outputs)
        outputs = self.dropout(outputs)
        # return self.layer_norm(outputs + residual), attns  # original code: layer norm over (outputs + residual)
        return outputs, attns
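A minimal usage sketch of MultiHeadAttention for self-attention (dimensions are illustrative assumptions; forward as written always calls attn_mask.repeat, so a mask has to be passed in):

n_head, d_input, d_model = 4, 128, 256
mb_size, seq_len = 2, 10
mha = MultiHeadAttention(n_head=n_head, d_input=d_input, d_model=d_model)
x = torch.randn(mb_size, seq_len, d_input)
mask = torch.zeros(mb_size, seq_len, seq_len, dtype=torch.bool)   # nothing masked here
outputs, attns = mha(x, x, x, attn_mask=mask)
# outputs: mb_size x seq_len x d_model
# attns:   (n_head*mb_size) x seq_len x seq_len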
5. BiAttention layer
class BiAttention(nn.Module):
    def __init__(self, input_size, dropout):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.input_linear = nn.Linear(input_size, 1, bias=False)
        self.memory_linear = nn.Linear(input_size, 1, bias=False)
        self.dot_scale = nn.Parameter(torch.Tensor(input_size).uniform_(1.0 / (input_size ** 0.5)))
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, input, memory, q_mask):
        '''
        Args:
            input:  batch_size * doc_word_len * emb_size
            memory: h_question_word, batch_size * ques_len * emb_size
            q_mask: question padding mask
        Returns:
        '''
        bsz, input_len, memory_len = input.size(0), input.size(1), memory.size(1)
        input = self.dropout(input)
        memory = self.dropout(memory)
        input_dot = self.input_linear(input)
        memory_dot = self.memory_linear(memory).view(bsz, 1, memory_len)
        cross_dot = torch.bmm(input * self.dot_scale, memory.permute(0, 2, 1).contiguous())
        # scale the input first --> [batch_size * doc_word_len * ques_len]
        att = input_dot + memory_dot + cross_dot  # attention matrix
        att = att - 1e30 * (1 - q_mask[:, None])  # None adds an extra dimension; masks out the question's padding tokens
        weight_one = self.softmax(att)              # normalize over the question: document-to-question attention weights
        output_one = torch.bmm(weight_one, memory)  # question vectors attended by each document word
        weight_two = self.softmax(att.max(dim=-1)[0]).view(bsz, 1, input_len)  # question-to-document attention weights
        output_two = torch.bmm(weight_two, input)   # attention-weighted sum of the document words
        # concatenate: original document, attended question vectors, and their element-wise products
        return torch.cat([input, output_one, input*output_one, output_two*output_one], dim=-1)
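This resembles the bi-directional (document-to-question and question-to-document) attention used in BiDAF-style reading-comprehension models. A minimal usage sketch (shapes are illustrative assumptions):

bsz, doc_len, ques_len, emb_size = 2, 50, 12, 128
bi_att = BiAttention(input_size=emb_size, dropout=0.1)
doc = torch.randn(bsz, doc_len, emb_size)     # document word representations
ques = torch.randn(bsz, ques_len, emb_size)   # question word representations
q_mask = torch.ones(bsz, ques_len)            # 1 = real question token, 0 = padding
out = bi_att(doc, ques, q_mask)
# out: bsz x doc_len x (4 * emb_size)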