
Self-attention and multi-head attention

2022-06-10 20:50:00 Binary artificial intelligence


In a previous article we introduced the general attention model. This article covers self-attention and multi-head attention, laying the groundwork for a later introduction to the Transformer.

Self-attention

If the attention in an attention model is computed entirely from the feature vectors themselves, it is called self-attention:

Figure: the self-attention mechanism (adapted from [1]).

For example, we can use weight matrices $\boldsymbol{W}_K \in \mathbb{R}^{d_k \times d_f}$, $\boldsymbol{W}_V \in \mathbb{R}^{d_v \times d_f}$, and $\boldsymbol{W}_Q \in \mathbb{R}^{d_q \times d_f}$ to linearly transform the feature matrix $\boldsymbol{F} = [\boldsymbol{f}_1, \ldots, \boldsymbol{f}_{n_f}] \in \mathbb{R}^{d_f \times n_f}$, obtaining

the key (Key) matrix
$$\begin{aligned} \boldsymbol{K} &= \boldsymbol{W}_K \boldsymbol{F} \\ &= \boldsymbol{W}_K [\boldsymbol{f}_1, \ldots, \boldsymbol{f}_{n_f}] \\ &= [\boldsymbol{k}_1, \ldots, \boldsymbol{k}_{n_f}] \in \mathbb{R}^{d_k \times n_f} \end{aligned}$$

the value (Value) matrix
$$\begin{aligned} \boldsymbol{V} &= \boldsymbol{W}_V \boldsymbol{F} \\ &= \boldsymbol{W}_V [\boldsymbol{f}_1, \ldots, \boldsymbol{f}_{n_f}] \\ &= [\boldsymbol{v}_1, \ldots, \boldsymbol{v}_{n_f}] \in \mathbb{R}^{d_v \times n_f} \end{aligned}$$

the query (Query) matrix
$$\begin{aligned} \boldsymbol{Q} &= \boldsymbol{W}_Q \boldsymbol{F} \\ &= \boldsymbol{W}_Q [\boldsymbol{f}_1, \ldots, \boldsymbol{f}_{n_f}] \\ &= [\boldsymbol{q}_1, \ldots, \boldsymbol{q}_{n_f}] \in \mathbb{R}^{d_q \times n_f} \end{aligned}$$
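As a concrete illustration of these projections, here is a minimal NumPy sketch; the dimensions `d_f`, `n_f`, `d_k`, `d_q`, `d_v` and the random feature matrix are made-up example values, not anything prescribed by the text:

```python
import numpy as np

# Hypothetical example dimensions (illustrative only).
d_f, n_f = 8, 5          # feature dimension, number of feature vectors
d_k = d_q = 4            # key/query dimension (equal so dot-product scores are defined later)
d_v = 6                  # value dimension

rng = np.random.default_rng(0)
F = rng.normal(size=(d_f, n_f))      # feature matrix F in R^{d_f x n_f}

# In a real model these weights are learned; here they are random for illustration.
W_K = rng.normal(size=(d_k, d_f))
W_V = rng.normal(size=(d_v, d_f))
W_Q = rng.normal(size=(d_q, d_f))

K = W_K @ F    # keys,    shape (d_k, n_f)
V = W_V @ F    # values,  shape (d_v, n_f)
Q = W_Q @ F    # queries, shape (d_q, n_f)
```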

Each column $\boldsymbol{q}_i$ of $\boldsymbol{Q}$ is used as a query for the attention model. When attention is computed with query vector $\boldsymbol{q}_i$, the resulting context vector $\boldsymbol{c}_i$ summarizes the information in the feature vectors that is important for the query $\boldsymbol{q}_i$.

First, for the query $\boldsymbol{q}_i$, $i = 1, 2, \ldots, n_f$, compute the attention score of each key vector $\boldsymbol{k}_j$:

$$\underset{1 \times 1}{e_{i,j}} = \operatorname{score}\left(\underset{d_q \times 1}{\boldsymbol{q}_i}, \underset{d_k \times 1}{\boldsymbol{k}_j}\right), \quad j = 1, 2, \ldots, n_f$$

The query $\boldsymbol{q}_i$ expresses a request for information. The attention score $e_{i,j}$ indicates how important the information contained in the key vector $\boldsymbol{k}_j$ is for the query $\boldsymbol{q}_i$. Computing a score for every key vector gives the attention score vector for the query $\boldsymbol{q}_i$:

$$\boldsymbol{e}_i = [e_{i,1}, e_{i,2}, \ldots, e_{i,n_f}]^T$$

Then, the alignment function $\operatorname{align}()$ is applied:
$$\underset{1 \times 1}{a_{i,j}} = \operatorname{align}\left(\underset{1 \times 1}{e_{i,j}}; \underset{n_f \times 1}{\boldsymbol{e}_i}\right), \quad j = 1, 2, \ldots, n_f$$

which yields the attention weight vector $\boldsymbol{a}_i = [a_{i,1}, a_{i,2}, \ldots, a_{i,n_f}]^T$.

Finally, the context vector is computed:
$$\underset{d_v \times 1}{\boldsymbol{c}_i} = \sum_{j=1}^{n_f} \underset{1 \times 1}{a_{i,j}} \times \underset{d_v \times 1}{\boldsymbol{v}_j}$$

Summarizing the steps above, the self-attention computation can be written as:

$$\boldsymbol{c}_i = \text{self-att}(\boldsymbol{q}_i, \boldsymbol{K}, \boldsymbol{V}) \in \mathbb{R}^{d_v}$$
Since $\boldsymbol{q}_i = \boldsymbol{W}_Q \boldsymbol{f}_i$, the context vector $\boldsymbol{c}_i$ summarizes the information from all feature vectors (including $\boldsymbol{f}_i$ itself) that is important for the particular feature vector $\boldsymbol{f}_i$. For language, this means self-attention can extract relationships between word features (verbs and nouns, pronouns and nouns, and so on): if $\boldsymbol{f}_i$ is the feature vector of a word, self-attention can gather from the other words the information that is important for $\boldsymbol{f}_i$. For images, self-attention captures relationships between the features of different image regions.

Computing the context vector for every query vector in $\boldsymbol{Q}$ gives the output of the self-attention layer:
$$\boldsymbol{C} = \text{self-att}(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}) = [\boldsymbol{c}_1, \boldsymbol{c}_2, \ldots, \boldsymbol{c}_{n_f}] \in \mathbb{R}^{d_v \times n_f}$$
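Continuing the projection sketch above, the following function is one possible instantiation of this computation. The article leaves $\operatorname{score}()$ and $\operatorname{align}()$ generic; here a scaled dot-product score and a softmax alignment are assumed, which are common choices but not the only ones:

```python
def self_attention(Q, K, V):
    """Columns of Q are queries; returns the context matrix C of shape (d_v, n_f)."""
    d_k = K.shape[0]
    E = Q.T @ K / np.sqrt(d_k)                    # scores e_{i,j} = score(q_i, k_j), shape (n_f, n_f)
    A = np.exp(E - E.max(axis=1, keepdims=True))  # softmax alignment over the keys:
    A = A / A.sum(axis=1, keepdims=True)          #   row i of A is the weight vector a_i
    return V @ A.T                                # column i is c_i = sum_j a_{i,j} v_j

C = self_attention(Q, K, V)   # uses Q, K, V from the sketch above
print(C.shape)                # (6, 5), i.e. (d_v, n_f) with the example dimensions
```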

Multi-head attention


Multi-head attention runs several attention modules in parallel on different versions of the same query. The idea is to produce multiple queries by linearly transforming the query $\boldsymbol{q}$ with different weight matrices. Each newly formed query essentially asks for a different type of relevant information, which allows the attention model to bring more information into the computation of the context vector.

Multi-head attention has $d$ heads, and each head has its own query vector, key matrix, and value matrix: $\boldsymbol{q}^{(l)}$, $\boldsymbol{K}^{(l)}$, and $\boldsymbol{V}^{(l)}$, $l = 1, \ldots, d$.

The query $\boldsymbol{q}^{(l)}$ is obtained by a linear transformation of the original query $\boldsymbol{q}$, while $\boldsymbol{K}^{(l)}$ and $\boldsymbol{V}^{(l)}$ are obtained by linear transformations of $\boldsymbol{F}$. Each attention head has its own learnable weight matrices $\boldsymbol{W}_q^{(l)}$, $\boldsymbol{W}_K^{(l)}$, and $\boldsymbol{W}_V^{(l)}$. The query, keys, and values of the $l$-th head are computed as follows:

$$\underset{d_q \times 1}{\boldsymbol{q}^{(l)}} = \underset{d_q \times d_q}{\boldsymbol{W}_q^{(l)}} \times \underset{d_q \times 1}{\boldsymbol{q}}$$

$$\underset{d_k \times n_f}{\boldsymbol{K}^{(l)}} = \underset{d_k \times d_f}{\boldsymbol{W}_K^{(l)}} \times \underset{d_f \times n_f}{\boldsymbol{F}}$$

$$\underset{d_v \times n_f}{\boldsymbol{V}^{(l)}} = \underset{d_v \times d_f}{\boldsymbol{W}_V^{(l)}} \times \underset{d_f \times n_f}{\boldsymbol{F}}$$

Each head forms its own representation of the query $\boldsymbol{q}$ and the input matrix $\boldsymbol{F}$, which allows the model to learn more. For example, when training a language model, one attention head can learn to attend to the relationships between certain verbs (e.g., walk, drive, buy) and nouns (e.g., student, car, apple), while another head learns to attend to the relationships between pronouns (e.g., he, she, it) and nouns.

Each head also produces its own attention score vector $\boldsymbol{e}_i^{(l)} = [e_{i,1}^{(l)}, \ldots, e_{i,n_f}^{(l)}]^T \in \mathbb{R}^{n_f}$ and the corresponding attention weight vector $\boldsymbol{a}_i^{(l)} = [a_{i,1}^{(l)}, \ldots, a_{i,n_f}^{(l)}]^T \in \mathbb{R}^{n_f}$.

Then, each head produces its own context vector $\boldsymbol{c}_i^{(l)} \in \mathbb{R}^{d_v}$:

$$\underset{d_v \times 1}{\boldsymbol{c}_i^{(l)}} = \sum_{j=1}^{n_f} \underset{1 \times 1}{a_{i,j}^{(l)}} \times \underset{d_v \times 1}{\boldsymbol{v}_j^{(l)}}$$

The goal is still to produce a single context vector as the output of the attention model. Therefore, the context vectors produced by the individual heads are concatenated into one vector, which is then linearly transformed with the weight matrix $\boldsymbol{W}_O \in \mathbb{R}^{d_c \times d_v d}$:
$$\underset{d_c \times 1}{\boldsymbol{c}_i} = \underset{d_c \times d_v d}{\boldsymbol{W}_O} \times \operatorname{concat}\left(\underset{d_v \times 1}{\boldsymbol{c}_i^{(1)}}; \ldots; \underset{d_v \times 1}{\boldsymbol{c}_i^{(d)}}\right)$$

This ensures that the final context vector $\boldsymbol{c}_i \in \mathbb{R}^{d_c}$ has the desired dimension.
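To make the multi-head computation concrete, here is a hedged NumPy sketch that follows the formulation above for a single original query vector $\boldsymbol{q}$, reusing `F`, `rng`, and the example dimensions from the earlier sketches; the number of heads `d`, the output dimension `d_c`, and the softmax alignment are illustrative assumptions:

```python
def multi_head_attention(q, F, heads, W_O):
    """heads is a list of (W_q, W_K, W_V) tuples, one tuple per attention head."""
    contexts = []
    for W_q, W_K, W_V in heads:
        q_l = W_q @ q                               # per-head query q^(l),  shape (d_q,)
        K_l = W_K @ F                               # per-head keys K^(l),   shape (d_k, n_f)
        V_l = W_V @ F                               # per-head values V^(l), shape (d_v, n_f)
        e = K_l.T @ q_l / np.sqrt(K_l.shape[0])     # scores e^(l), shape (n_f,)
        a = np.exp(e - e.max())                     # softmax alignment over the keys
        a = a / a.sum()                             # attention weights a^(l)
        contexts.append(V_l @ a)                    # context c^(l) = sum_j a_j^(l) v_j^(l)
    return W_O @ np.concatenate(contexts)           # concatenate heads, then project to R^{d_c}

d, d_c = 2, 7                                        # hypothetical: 2 heads, output dimension 7
heads = [(rng.normal(size=(d_q, d_q)),
          rng.normal(size=(d_k, d_f)),
          rng.normal(size=(d_v, d_f))) for _ in range(d)]
W_O = rng.normal(size=(d_c, d_v * d))
q = rng.normal(size=d_q)                             # an original query vector
c = multi_head_attention(q, F, heads, W_O)
print(c.shape)                                       # (7,), i.e. (d_c,)
```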

References:

[1] A General Survey on Attention Mechanisms in Deep Learning https://arxiv.org/pdf/2203.14263v1.pdf

