Some attention code explanations
2022-07-28 17:13:00 【Name filling】
While working on an experiment I came across a library that implements several attention mechanisms. Here is a summary so that it is easy to find later.
1. Linear layer
Let's start with a linear layer. Weight matrices such as W in papers are usually implemented as fully connected layers.
import torch
import torch.nn as nn
from torch.nn import init

class Linear(nn.Module):
    ''' Simple Linear layer with xavier init '''
    def __init__(self, d_in, d_out, bias=True):
        super(Linear, self).__init__()
        self.linear = nn.Linear(d_in, d_out, bias=bias)
        init.xavier_normal(self.linear.weight)

    def forward(self, x):
        return self.linear(x)
init.xavier_normal(self.linear.weight) is one way of initializing the weights; the parameters to be learned in the paper are exactly these weights.
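A quick usage sketch of the Linear wrapper above (the sizes are just illustrative assumptions):

layer = Linear(d_in=16, d_out=8)     # weight initialized with xavier_normal
x = torch.randn(4, 16)               # a batch of 4 input vectors
y = layer(x)                         # y has shape (4, 8): y = x @ W^T + b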
2. DotProductAttention layer
Scaled dot-product attention. It was proposed in the Transformer, which introduced the concepts of Q, K and V in dot-product attention.
The Q matrix is multiplied with the K matrix to obtain the attention-score weight matrix. Here the matrix elements are assumed to have mean 0 and variance 1, and the matrix multiplication performed is matmul(q (mb_size × len_q × d_k), k (mb_size × len_k × d_k)).
To keep the variance of each layer at 1, the result is divided by sqrt(d_k).
Note: the variance is kept at 1 because, during back-propagation through softmax, when some element x_i is much larger or much smaller than the others, the gradient of the softmax function becomes very small (approaches zero). The model's error then cannot be propagated back through the softmax to the parameters in the earlier part of the model, those parameters are not updated, and the training efficiency of the model suffers.
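As a quick sanity check of that note (my own sketch, not part of the library): for random q and k whose entries have mean 0 and variance 1, the dot product over d_k dimensions has variance roughly d_k, and dividing by sqrt(d_k) brings it back to roughly 1.

d_k = 64
q = torch.randn(10000, d_k)                 # entries with mean 0, variance 1
k = torch.randn(10000, d_k)
scores = (q * k).sum(dim=1)                 # 10000 independent dot products
print(scores.var())                         # roughly d_k, i.e. about 64
print((scores / d_k ** 0.5).var())          # roughly 1 after scaling by 1/sqrt(d_k)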
class DotProductAttention(nn.Module):
    ''' Dot-Product Attention '''
    def __init__(self, d_model, attn_dropout=0.1):
        super(DotProductAttention, self).__init__()
        self.temper = 1  # np.power(d_model, 0.5)
        self.dropout = nn.Dropout(attn_dropout)
        self.softmax = nn.Softmax(dim=2)

    def forward(self, q, k, v, attn_mask=None):
        # q (mb_size x len_q x d_k)
        # k (mb_size x len_k x d_k)
        # v (mb_size x len_v x d_v)
        attn = torch.bmm(q, k.transpose(1, 2)) / self.temper
        if attn_mask is not None:
            assert attn_mask.size() == attn.size(), \
                'Attention mask shape {} mismatch with Attention logit tensor shape ' \
                '{}.'.format(attn_mask.size(), attn.size())
            attn.data.masked_fill_(attn_mask, -float('inf'))
        attn = self.softmax(attn)
        attn = self.dropout(attn)
        output = torch.bmm(attn, v)
        return output, attn
Q is multiplied with K to obtain the attention-score weight matrix attn. After the mask is applied and the weights are normalized with softmax, the result is multiplied with the V matrix, i.e. attention score × word vectors, which gives the weighted word vectors.
Return values: the weighted word vectors and the attention weights.
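A minimal usage sketch of DotProductAttention (the sizes and the all-False mask are illustrative assumptions; a True entry in attn_mask marks a position to be filled with -inf):

mb_size, len_q, len_k, d_k, d_v = 2, 5, 7, 64, 64
attention = DotProductAttention(d_model=d_k)
q = torch.randn(mb_size, len_q, d_k)
k = torch.randn(mb_size, len_k, d_k)
v = torch.randn(mb_size, len_k, d_v)                          # len_v must equal len_k
mask = torch.zeros(mb_size, len_q, len_k, dtype=torch.bool)   # nothing masked here
output, attn = attention(q, k, v, attn_mask=mask)
# output: mb_size x len_q x d_v, attn: mb_size x len_q x len_k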
3. SingleLayerAttention layer
As the name suggests, this is a single-layer attention mechanism. This layer is not entirely clear to me, for instance why the author does not divide by sqrt(d_k). I have not come across this form of attention anywhere else; I will fill in this gap when I run into it.
class SingleLayerAttention(nn.Module):
    def __init__(self, d_model, d_k, attn_dropout=0.1):
        super(SingleLayerAttention, self).__init__()
        self.dropout = nn.Dropout(attn_dropout)
        self.softmax = nn.Softmax(dim=2)
        # self.linear = nn.Linear(2*d_k, d_k)
        self.weight = nn.Parameter(torch.FloatTensor(d_k, 1))  # scores all (q_i, k_j) pairs in one batched matmul
        self.act = nn.LeakyReLU()
        init.xavier_normal(self.weight)

    def forward(self, q, k, v, attn_mask=None):
        # q (mb_size x len_q x d_k)
        # k (mb_size x len_k x d_k)
        # v (mb_size x len_v x d_v)
        mb_size, len_q, d_k = q.size()
        mb_size, len_k, d_k = k.size()
        q = q.unsqueeze(2).expand(-1, -1, len_k, -1)
        k = k.unsqueeze(1).expand(-1, len_q, -1, -1)
        x = q - k
        attn = self.act(torch.matmul(x, self.weight).squeeze(3))  # mb_size x len_q x len_k
        if attn_mask is not None:                                 # mb_size x len_q x len_k
            assert attn_mask.size() == attn.size()
            attn_mask = attn_mask.eq(0).data
            attn.data.masked_fill_(attn_mask, -float('inf'))      # broadcast mask
        attn = self.softmax(attn)
        if attn_mask is not None:
            attn.data.masked_fill_(attn_mask, 0)                  # zero out masked positions after softmax
        attn = self.dropout(attn)
        output = torch.bmm(attn, v)
        return output, attn
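Reading the code, the score here is LeakyReLU((q_i - k_j)^T w), a learned projection of the difference between query and key rather than their dot product. A minimal usage sketch (sizes are illustrative assumptions; note that in this module a 0 in attn_mask marks a position to mask out, since the mask is inverted with eq(0)):

mb_size, len_q, len_k, d_k = 2, 5, 7, 64
attention = SingleLayerAttention(d_model=d_k, d_k=d_k)
q = torch.randn(mb_size, len_q, d_k)
k = torch.randn(mb_size, len_k, d_k)
v = torch.randn(mb_size, len_k, d_k)
mask = torch.ones(mb_size, len_q, len_k)   # 1 = keep, 0 = mask out
output, attn = attention(q, k, v, attn_mask=mask)
# output: mb_size x len_q x d_k, attn: mb_size x len_q x len_k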
4. MultiHeadAttention layer
Multi-head attention, which calls the dot-product attention mechanism above.
class MultiHeadAttention(nn.Module):
    ''' Multi-Head Attention module '''
    def __init__(self, n_head, d_input, d_model, d_input_v=None, dropout=0.1):
        super(MultiHeadAttention, self).__init__()
        self.n_head = n_head
        d_k, d_v = d_model//n_head, d_model//n_head
        self.d_k = d_k
        self.d_v = d_v
        if d_input_v is None:
            d_input_v = d_input
        self.w_qs = nn.Parameter(torch.FloatTensor(n_head, d_input, d_k))
        self.w_ks = nn.Parameter(torch.FloatTensor(n_head, d_input, d_k))
        self.w_vs = nn.Parameter(torch.FloatTensor(n_head, d_input_v, d_v))
        self.attention = DotProductAttention(d_model)
        self.proj = Linear(n_head*d_v, d_model)
        self.dropout = nn.Dropout(dropout)
        init.xavier_normal(self.w_qs)
        init.xavier_normal(self.w_ks)
        init.xavier_normal(self.w_vs)

    def forward(self, q, k, v, attn_mask=None):
        d_k, d_v = self.d_k, self.d_v
        n_head = self.n_head
        # residual = q
        mb_size, len_q, d_input = q.size()
        mb_size, len_k, d_input = k.size()
        mb_size, len_v, d_input_v = v.size()
        # treat as a batch of size n_head: replicate the data once per head -> n_head x (mb_size*len_q) x d_input
        q_s = q.repeat(n_head, 1, 1).view(n_head, -1, d_input)    # n_head x (mb_size*len_q) x d_input
        k_s = k.repeat(n_head, 1, 1).view(n_head, -1, d_input)    # n_head x (mb_size*len_k) x d_input
        v_s = v.repeat(n_head, 1, 1).view(n_head, -1, d_input_v)  # n_head x (mb_size*len_v) x d_input_v
        # treat the result as a batch of size (n_head * mb_size); the last dimension is d_k = d_model//n_head
        q_s = torch.bmm(q_s, self.w_qs).view(-1, len_q, d_k)  # (n_head*mb_size) x len_q x d_k
        k_s = torch.bmm(k_s, self.w_ks).view(-1, len_k, d_k)  # (n_head*mb_size) x len_k x d_k
        v_s = torch.bmm(v_s, self.w_vs).view(-1, len_v, d_v)  # (n_head*mb_size) x len_v x d_v
        # after the per-head projections, perform attention; result size = (n_head * mb_size) x len_q x d_v
        outputs, attns = self.attention(q_s, k_s, v_s, attn_mask=attn_mask.repeat(n_head, 1, 1))
        # back to the original mb_size batch, result size = mb_size x len_q x (n_head*d_v)
        outputs = torch.cat(torch.split(outputs, mb_size, dim=0), dim=-1)
        # project back to the residual size, i.e. the original dimension
        outputs = self.proj(outputs)
        outputs = self.dropout(outputs)
        # return self.layer_norm(outputs + residual), attns  # original code: layer norm over (outputs + residual)
        return outputs, attns
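A minimal usage sketch of MultiHeadAttention for self-attention (dimensions are illustrative assumptions; forward as written always calls attn_mask.repeat, so a mask has to be passed in):

n_head, d_input, d_model = 4, 128, 256
mb_size, seq_len = 2, 10
mha = MultiHeadAttention(n_head=n_head, d_input=d_input, d_model=d_model)
x = torch.randn(mb_size, seq_len, d_input)
mask = torch.zeros(mb_size, seq_len, seq_len, dtype=torch.bool)   # nothing masked here
outputs, attns = mha(x, x, x, attn_mask=mask)
# outputs: mb_size x seq_len x d_model
# attns:   (n_head*mb_size) x seq_len x seq_len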
5. BiAttention layer
class BiAttention(nn.Module):
    def __init__(self, input_size, dropout):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.input_linear = nn.Linear(input_size, 1, bias=False)
        self.memory_linear = nn.Linear(input_size, 1, bias=False)
        self.dot_scale = nn.Parameter(torch.Tensor(input_size).uniform_(1.0 / (input_size ** 0.5)))
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, input, memory, q_mask):
        '''
        Args:
            input:  batch_size * doc_word_len * emb_size
            memory: h_question_word, batch_size * ques_len * emb_size
            q_mask: question padding mask
        Returns:
        '''
        bsz, input_len, memory_len = input.size(0), input.size(1), memory.size(1)
        input = self.dropout(input)
        memory = self.dropout(memory)
        input_dot = self.input_linear(input)
        memory_dot = self.memory_linear(memory).view(bsz, 1, memory_len)
        cross_dot = torch.bmm(input * self.dot_scale, memory.permute(0, 2, 1).contiguous())
        # scale the input first --> [batch_size * doc_word_len * ques_len]
        att = input_dot + memory_dot + cross_dot  # attention matrix
        att = att - 1e30 * (1 - q_mask[:, None])  # None adds an extra dimension; masks out the question's padding tokens
        weight_one = self.softmax(att)              # normalize over the question: document-to-question attention weights
        output_one = torch.bmm(weight_one, memory)  # question vectors attended by each document word
        weight_two = self.softmax(att.max(dim=-1)[0]).view(bsz, 1, input_len)  # question-to-document attention weights
        output_two = torch.bmm(weight_two, input)   # attention-weighted sum of the document words
        # concatenate: original document, attended question vectors, and their element-wise products
        return torch.cat([input, output_one, input*output_one, output_two*output_one], dim=-1)
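This resembles the bi-directional (document-to-question and question-to-document) attention used in BiDAF-style reading-comprehension models. A minimal usage sketch (shapes are illustrative assumptions):

bsz, doc_len, ques_len, emb_size = 2, 50, 12, 128
bi_att = BiAttention(input_size=emb_size, dropout=0.1)
doc = torch.randn(bsz, doc_len, emb_size)     # document word representations
ques = torch.randn(bsz, ques_len, emb_size)   # question word representations
q_mask = torch.ones(bsz, ques_len)            # 1 = real question token, 0 = padding
out = bi_att(doc, ques, q_mask)
# out: bsz x doc_len x (4 * emb_size)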