Code implementation: additive attention

2022-07-28 17:11:00 【InfoQ】

import math
import torch
from torch import nn
from d2l import torch as d2l

These are the usual Python imports; no explanation needed.

def masked_softmax(X, valid_lens):
    if valid_lens is None:
        return nn.functional.softmax(X, dim=-1)
    else:
        shape = X.shape
        if valid_lens.dim() == 1:
            valid_lens = torch.repeat_interleave(valid_lens, shape[1])
        else:
            valid_lens = valid_lens.reshape(-1)
        X = d2l.sequence_mask(X.reshape(-1, shape[-1]), valid_lens,
                              value=-1e6)
        return nn.functional.softmax(X.reshape(shape), dim=-1)

This is the masked softmax operation. We did a similar mask operation in

Nadaraya-Watson kernel regression code implementation

In the second-to-last code block there, each training input is computed against every training input other than itself, and we used

X_tile[(1 - torch.eye(n_train)).type(torch.bool)]

to mask out the input itself. That is a mask operation.

What this function does: only part of the tensor we pass in may be useful, so the useless part is masked out and softmax is computed over the remaining part only. For example, if we pass in a vector of length 5 and only need the first two entries, then after this function the last three outputs are (effectively) 0 and the first two sum to 1.
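The length-5 example above can be sketched in plain Python. This is only an illustration: `masked_softmax_1d` and the score values are made up for this sketch, and it works on a list rather than a tensor. The fill value -1e6 is the one the article's code passes to the masking function.

```python
import math

def masked_softmax_1d(scores, valid_len, fill=-1e6):
    """Softmax over a plain list, ignoring positions at index >= valid_len."""
    # Overwrite the masked positions with a very negative number.
    masked = [s if i < valid_len else fill for i, s in enumerate(scores)]
    # Subtract the maximum before exponentiating, for numerical stability.
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

weights = masked_softmax_1d([0.3, 1.2, -0.5, 2.0, 0.1], valid_len=2)
print(weights)
```

The masked positions are not removed; they are pushed so far down that their exponentials underflow to zero, which is why only the first two weights carry any probability mass.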

The function takes two parameters,
X
and
valid_lens
. X is the tensor to apply softmax to, and valid_lens stores the valid length of each row. Whether valid_lens is one-dimensional or two-dimensional, its shape must line up with X so that every row ends up with exactly one length.

The function starts with an if statement.
if valid_lens is None
means that if
valid_lens
is not given, the whole tensor is valid and nothing needs to be masked before the softmax, so the if branch directly returns an ordinary softmax and the function finishes.

When
valid_lens
is passed in, we enter the else branch.

First,
shape
stores the shape of the tensor
X
that is to be masked.

Then comes another if-else statement, which handles the length of
valid_lens
: it expands
valid_lens
so that its length equals the total number of matrix rows.

When
valid_lens
is one-dimensional we enter the if branch, which converts it into a mask vector with one length per row. To explain: because of minibatching, the
X
passed in is usually three-dimensional. The first dimension is the batch size, and the second and third dimensions are the size of each matrix. Earlier we used
shape
to store the shape of
X
; now
shape[1]
gives the number of rows in each matrix of
X
, and each entry of
valid_lens
is repeated that many times, so that every row of a matrix gets the valid length of the batch element it belongs to.

To learn more about
torch.repeat_interleave
, see →
Comparison of repeat operations in PyTorch
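As a small illustration of what the one-dimensional branch does, the sketch below uses made-up shapes (a batch of 2 matrices with 3 rows each); only `torch.repeat_interleave` itself comes from the article's code.

```python
import torch

X = torch.zeros(2, 3, 4)           # (batch_size, rows per matrix, columns)
valid_lens = torch.tensor([2, 3])  # one valid length per batch element

# Repeat each length once for every row of its matrix.
per_row = torch.repeat_interleave(valid_lens, X.shape[1])
print(per_row)  # tensor([2, 2, 2, 3, 3, 3])
```

The result has batch_size * rows entries, which matches the number of rows of X after it is reshaped to two dimensions for masking.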

When
valid_lens
is not one-dimensional we enter the else branch, which simply flattens it from a matrix into a vector.

The masking itself uses a function from d2l, so I won't go into it. For the handling of dimensions, remember:

If the
valid_lens
passed in is one-dimensional, then the length of
valid_lens
should equal the batch size, that is, the first dimension of
X
(
shape[0]
); each entry is then repeated
shape[1]
times, once per row.

If the
valid_lens
passed in is two-dimensional, then the first dimension of
valid_lens
should equal the batch size, and the second dimension should equal the number of rows in each matrix of
X
.
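To make the two accepted shapes of `valid_lens` concrete, here is a hedged, self-contained sketch. `masked_softmax_sketch` is a hypothetical stand-in for the function above: it builds the mask with `torch.arange` and `masked_fill` instead of calling `d2l.sequence_mask`, so it is not the d2l implementation, only something with the same intent that runs without d2l.

```python
import torch

def masked_softmax_sketch(X, valid_lens):
    """Masked softmax over the last axis of a 3-D tensor, without d2l."""
    shape = X.shape
    if valid_lens.dim() == 1:
        # One length per batch element: repeat it for every row of that matrix.
        valid_lens = torch.repeat_interleave(valid_lens, shape[1])
    else:
        # Already one length per row: just flatten.
        valid_lens = valid_lens.reshape(-1)
    flat = X.reshape(-1, shape[-1])
    # keep[i, j] is True when column j lies inside the valid length of row i.
    keep = torch.arange(shape[-1])[None, :] < valid_lens[:, None]
    flat = flat.masked_fill(~keep, -1e6)
    return torch.softmax(flat.reshape(shape), dim=-1)

X = torch.rand(2, 2, 4)
a = masked_softmax_sketch(X, torch.tensor([2, 3]))            # shape (batch_size,)
b = masked_softmax_sketch(X, torch.tensor([[1, 3], [2, 4]]))  # shape (batch_size, rows)
print(a)
print(b)
```

In `a`, both rows of the first matrix keep 2 columns and both rows of the second keep 3; in `b`, every row has its own length.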

Concrete examples can be found in
Code implementation: scaled dot-product attention

class AdditiveAttention(nn.Module):
    def __init__(self, key_size, query_size, num_hiddens, dropout, **kwargs):
        super(AdditiveAttention, self).__init__(**kwargs)
        self.W_k = nn.Linear(key_size, num_hiddens, bias=False)
        self.W_q = nn.Linear(query_size, num_hiddens, bias=False)
        self.w_v = nn.Linear(num_hiddens, 1, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, queries, keys, values, valid_lens):
        queries, keys = self.W_q(queries), self.W_k(keys)
        features = queries.unsqueeze(2) + keys.unsqueeze(1)
        features = torch.tanh(features)
        scores = self.w_v(features).squeeze(-1)
        self.attention_weights = masked_softmax(scores, valid_lens)
        return torch.bmm(self.dropout(self.attention_weights), values)

The additive attention code:

Because this involves tensors being expanded to four dimensions, be sure to work through the shapes yourself.

There are three main parameters:
key_size
is the length of each key,
query_size
is the length of each query, and
num_hiddens
is the size of the hidden layer. They are separate because additive attention is meant for the case where keys and queries have different lengths.

There are three small linear layers.
self.W_k
and
self.W_q
project the keys and queries into the hidden layer, and
self.w_v
maps the hidden layer to a single output.

None of these layers uses a bias.

Finally, a dropout layer is defined.
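As a quick check of the three layers, the sketch below builds them on their own with the sizes used in the article's test call (key_size=2, query_size=20, num_hiddens=8) and inspects their parameters. The standalone variable names are for illustration only; in the class they are attributes of the module.

```python
from torch import nn

W_k = nn.Linear(2, 8, bias=False)   # keys:    key_size    -> num_hiddens
W_q = nn.Linear(20, 8, bias=False)  # queries: query_size  -> num_hiddens
w_v = nn.Linear(8, 1, bias=False)   # hidden:  num_hiddens -> 1 score

# nn.Linear stores its weight as (out_features, in_features).
print(W_k.weight.shape, W_q.weight.shape, w_v.weight.shape)
print(W_k.bias)  # None, because bias=False

total = sum(p.numel() for layer in (W_k, W_q, w_v) for p in layer.parameters())
print(total)  # 2*8 + 20*8 + 8*1 = 184
```

Both projections end in the same hidden size, which is what allows keys and queries of different lengths to be added together later.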

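Then comes the forward function, which computes the additive attention scoring function

$a(\mathbf{q}, \mathbf{k}) = \mathbf{w}_v^\top \tanh(\mathbf{W}_q \mathbf{q} + \mathbf{W}_k \mathbf{k})$

step by step:

Pass queries and keys through the first two linear layers to get the projected queries and keys, then adjust their dimensions so that they can be added with broadcasting.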

Shape of queries: (batch_size, number of queries, 1, num_hiddens)

Shape of keys: (batch_size, 1, number of key-value pairs, num_hiddens)

Calculate the formula .

To compute scores,
self.w_v
has only one output, so the last dimension is removed from the shape.
Shape of scores: (batch_size, number of queries, number of key-value pairs)

Finally, the shape of values: (batch_size, number of key-value pairs, value dimension)
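The shape bookkeeping listed above can be traced with a short sketch. The tensors here are random stand-ins for the already projected queries and keys, and the sizes (2, 1, 10, 8) are assumptions taken from the article's test call; the operations themselves mirror the forward function.

```python
import torch

batch_size, num_queries, num_kv, num_hiddens = 2, 1, 10, 8
q = torch.randn(batch_size, num_queries, num_hiddens)  # projected queries
k = torch.randn(batch_size, num_kv, num_hiddens)       # projected keys

# (2, 1, 1, 8) + (2, 1, 10, 8) broadcasts to (2, 1, 10, 8):
# every query is added to every key.
features = torch.tanh(q.unsqueeze(2) + k.unsqueeze(1))
print(features.shape)  # torch.Size([2, 1, 10, 8])

# A single-output linear layer, then drop the trailing dimension of size 1.
w_v = torch.nn.Linear(num_hiddens, 1, bias=False)
scores = w_v(features).squeeze(-1)
print(scores.shape)  # torch.Size([2, 1, 10])
```

Each entry `scores[b, i, j]` is the score of query i against key j in batch element b, which is exactly the layout the masked softmax expects.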

queries, keys = torch.normal(0, 1, (2, 1, 20)), torch.ones((2, 10, 2))
# A minibatch of `values`; the two value matrices are identical
values = torch.arange(40, dtype=torch.float32).reshape(1, 10, 4).repeat(
 2, 1, 1)
valid_lens = torch.tensor([2, 6])

attention = AdditiveAttention(key_size=2, query_size=20, num_hiddens=8, dropout=0.1)
attention.eval()
attention(queries, keys, values, valid_lens)

Run a quick test on a sample.
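One way to sanity-check this test by hand: every key in the sample is the same vector of ones, so every valid key receives the same score, and the attention weights should be uniform over the first valid_lens positions. Under that reasoning the output is just the average of the valid value rows. The sketch below builds such uniform weights directly (it does not run the model; `weights` is constructed by hand for illustration) and applies the same `torch.bmm` step.

```python
import torch

values = torch.arange(40, dtype=torch.float32).reshape(1, 10, 4).repeat(2, 1, 1)
valid_lens = torch.tensor([2, 6])

# Hand-built uniform weights over the valid keys: (batch_size, 1 query, 10 keys).
positions = torch.arange(10)
weights = (positions[None, :] < valid_lens[:, None]).float()
weights = (weights / weights.sum(dim=-1, keepdim=True)).unsqueeze(1)

# (2, 1, 10) x (2, 10, 4) -> (2, 1, 4): a weighted average of the value rows.
out = torch.bmm(weights, values)
print(out)
```

The first batch element averages value rows 0 and 1, giving values close to [2, 3, 4, 5]; the second averages rows 0 through 5, giving values close to [10, 11, 12, 13].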

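Note that

.eval()

is used here. It puts the module into evaluation mode, so Dropout is disabled (and BatchNormalization layers, if there were any, would use their running statistics).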
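The effect of evaluation mode on dropout can be seen in isolation with a minimal sketch. The layer and input below are made up for illustration and are not part of the attention module.

```python
import torch
from torch import nn

drop = nn.Dropout(0.5)
x = torch.ones(1000)

drop.train()
y_train = drop(x)  # entries are randomly zeroed; survivors are scaled by 1 / (1 - 0.5)

drop.eval()
y_eval = drop(x)   # evaluation mode: the input passes through unchanged

print(y_train[:10])
print(torch.equal(y_eval, x))
```

This is why the test calls `.eval()` before the forward pass: otherwise the attention weights would be randomly dropped and the output would change from run to run.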

d2l.show_heatmaps(attention.attention_weights.reshape((1, 1, 2, 10)),
 xlabel='Keys', ylabel='Queries')

Because the data used here are the same as in

Code implementation: scaled dot-product attention

I won't analyze the heat map in detail. If anything is unclear, read the heat map analysis in that article.

Original site

Copyright notice
This article was created by [InfoQ]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/196/202207130915458169.html
