How to understand query, key and value in transformer

2022-06-28 01:33:00 【coast_ s】

-------------------------------------

Reprinted. Original author: yafee123

-------------------------------------

The Transformer comes from "Attention Is All You Need", a landmark 2017 paper from Google Brain. Since then it has become a major research focus in both NLP and CV.

One of the most important contributions of the Transformer is self-attention, which builds an attention model from the relationships among the input samples.

Self-attention introduces three very important elements: Query, Key, and Value.

Suppose $\bf{X} \in \mathbb{R}^{n \times d}$ is the feature matrix of an input sequence, where n is the number of input samples (the sequence length) and d is the dimension of a single sample.

Query, Key, and Value are defined as follows:

Query: $\bf{Q} = \bf{X} \cdot \bf{W}^Q$, where $\bf{W}^Q \in \mathbb{R}^{d \times d_q}$. This matrix can be regarded as a spatial transformation matrix; the same holds for the two below.

Key: $\bf{K} = \bf{X} \cdot \bf{W}^K$, where $\bf{W}^K \in \mathbb{R}^{d \times d_k}$

Value: $\bf{V} = \bf{X} \cdot \bf{W}^V$, where $\bf{W}^V \in \mathbb{R}^{d \times d_v}$
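To make the shapes concrete, here is a minimal NumPy sketch of the three projections. The sizes and the random matrices are assumptions chosen only for illustration; in a real model the weight matrices are learned, and nothing here comes from the original paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not values from the post).
n, d = 4, 6        # sequence length, dimension of one sample
d_q = d_k = 3      # query/key dimensions, equal so that q_i · K^T is defined
d_v = 5            # value dimension

X = rng.normal(size=(n, d))        # each ROW is one input sample
W_Q = rng.normal(size=(d, d_q))    # random stand-ins for learned weights
W_K = rng.normal(size=(d, d_k))
W_V = rng.normal(size=(d, d_v))

Q = X @ W_Q    # shape (n, d_q)
K = X @ W_K    # shape (n, d_k)
V = X @ W_V    # shape (n, d_v)

print(Q.shape, K.shape, V.shape)   # (4, 3) (4, 3) (4, 5)
```

Because samples are stored as rows, row i of Q, K, and V depends only on row i of X.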

Many people are confused when they first meet these three concepts. How do they relate to self-attention, and why were these names chosen?

[Note: here each row of X, Q, K, and V represents one input sample. This differs from the convention in which each column of a sample matrix is a sample, and it matters a great deal for understanding what follows.]

This post briefly explains the reasoning behind the three names.

To understand what the three concepts mean, first ask: what is self-attention ultimately trying to produce?

The answer: given the current input sample $\bf{x}_i \in \mathbb{R}^{1 \times d}$ (we take the input apart only to make it easier to follow), produce an output that is a weighted sum of all samples in the sequence. The assumption is that this output can see the information of every input sample and then chooses where to focus according to the different weights.

If you accept this answer, the rest is easy to explain.

The concepts of query, key, and value actually come from recommendation systems. The basic principle is: given a query, compute the relevance between the query and each key, then use that relevance to find the most suitable value. For example, in movie recommendation, the query is a person's preference profile (interests, age, gender, and so on), the key is the type of film (comedy, era, and so on), and the value is the movie to be recommended. In this example, the attributes of query, key, and value each live in a different space, yet they are latently related: through some transformation, the attributes of all three can be brought into a similar space.
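The retrieval process described above can be sketched as a toy lookup. Everything in this example is hypothetical: the three-dimensional feature space, the genre labels, and the movie names are invented for illustration, and the vectors are assumed to have already been transformed into one shared space.

```python
import numpy as np

# Hypothetical keys: one row per movie type, in a shared 3-d feature space.
keys = np.array([
    [1.0, 0.0, 0.0],   # comedy
    [0.0, 1.0, 0.0],   # action
    [0.0, 0.0, 1.0],   # documentary
])
values = ["Movie A", "Movie B", "Movie C"]   # one value per key

# Hypothetical query: a viewer who mostly likes action films.
query = np.array([0.2, 0.9, 0.1])

relevance = keys @ query            # dot product of the query with every key
best = int(np.argmax(relevance))    # index of the most relevant key
recommended = values[best]

print(recommended)   # Movie B
```

This is a hard lookup: exactly one value is returned. Self-attention keeps the same query-key relevance step but blends all values instead of picking one.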

In self-attention, the current input sample $\bf{x}_i$ becomes a query through a spatial transformation: $\bf{q}_i = \bf{x}_i \cdot \bf{W}^Q$, with $\bf{q}_i \in \mathbb{R}^{1 \times d_q}$. By analogy with retrieval in a recommendation system, we rely on the relevance between the query and the keys to retrieve the values we need. So why is $\bf{K} = \bf{X} \cdot \bf{W}^K$ called the key?

Following the recommendation-system process, we need the relevance between the query and the keys, and the simplest way to get it is a dot product, which yields a vector of relations between the current sample and the samples in the sequence. Self-attention performs exactly this operation: $\bf{r}_i = \bf{q}_i \cdot \bf{K}^T$ (which requires $d_q = d_k$). Each element of $\bf{r}_i \in \mathbb{R}^{1 \times n}$ can then be regarded as the relation between the current sample $\bf{x}_i$ and one sample in the sequence, including $\bf{x}_i$ itself.
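As a small check of the relation vector, the sketch below computes $\bf{r}_i$ for one sample and confirms that element j is the dot product of $\bf{q}_i$ with the j-th row of K. The sizes, the index i, and the random weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d, d_k = 4, 6, 3                 # illustrative sizes, with d_q = d_k
X = rng.normal(size=(n, d))         # each row is one input sample
W_Q = rng.normal(size=(d, d_k))     # random stand-ins for learned weights
W_K = rng.normal(size=(d, d_k))

i = 2                               # index of the "current" sample
q_i = X[i:i + 1] @ W_Q              # shape (1, d_k), kept as a row vector
K = X @ W_K                         # shape (n, d_k)

r_i = q_i @ K.T                     # shape (1, n): one score per sample

# Element j of r_i is the dot product between q_i and the j-th key.
per_key = np.array([q_i[0] @ K[j] for j in range(n)])
print(r_i.shape, np.allclose(r_i[0], per_key))
```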

Once the relations between samples are available, the rest follows naturally: normalize $\bf{r}_i$ and multiply it by the matrix V to obtain the final weighted output of self-attention: $\bf{O}_i = \mathrm{softmax}(\bf{r}_i) \cdot \bf{V}$.

Every row of V is one sample of the sequence. $\bf{O}_i \in \mathbb{R}^{1 \times d_v}$, and each dimension of $\bf{O}_i$ is the weighted sum of the corresponding dimension across all samples in V, where the weights are the relation vector $\mathrm{softmax}(\bf{r}_i)$. (It helps to draw this matrix multiplication out by hand.)
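The weighted-sum reading of this matrix product can be verified numerically. The relation vector and the matrix V below are made-up values for a sequence of three samples, and `softmax` is a small helper written for this sketch rather than a library function.

```python
import numpy as np

def softmax(r):
    """Numerically stable softmax over the last axis."""
    e = np.exp(r - r.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical relation vector for the current sample (n = 3).
r_i = np.array([[2.0, 0.0, 1.0]])

# Hypothetical V: n = 3 rows (samples), d_v = 2 columns.
V = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])

w = softmax(r_i)    # shape (1, n): non-negative weights that sum to 1
O_i = w @ V         # shape (1, d_v)

# The same result written as an explicit weighted sum of the rows of V.
explicit = sum(w[0, j] * V[j] for j in range(V.shape[0]))
print(np.allclose(O_i[0], explicit))
```

The largest score in `r_i` gets the largest weight, so the first row of V contributes most to `O_i`.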

From this, we can conclude:

1. Self-attention borrows the three concepts of query, key, and value from recommendation systems because it follows a similar process. However, self-attention does not look up a single value for the query; it computes a weighted sum of the values according to the current query. That is the task of self-attention: to find a better weighted output for the current input, one that contains the information of all visible input samples, with attention controlled by the weights.

2. In self-attention, key and value are both transformations of the input sequence itself, which may be another meaning of the "self" in self-attention: the same input acts as both key and value. This is quite reasonable, because in a recommendation system the original feature spaces of key and value differ, yet the two are strongly related, so suitable spatial transformations can unify them into one feature space. That is one reason self-attention multiplies by the W matrices.
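Putting the steps together for every sample at once gives the matrix form, sketched below under the same row-per-sample convention. The function name `self_attention` and the `scale` flag are assumptions made for this example. The flag is included because "Attention Is All You Need" divides the scores by $\sqrt{d_k}$ before the softmax, a factor this post leaves out; with `scale=False` the function matches the formulas used here.

```python
import numpy as np

def softmax(r):
    """Numerically stable softmax over the last axis."""
    e = np.exp(r - r.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_Q, W_K, W_V, scale=False):
    """Single-head self-attention; each row of X is one sample."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    R = Q @ K.T                          # (n, n); row i is r_i
    if scale:
        R = R / np.sqrt(K.shape[-1])     # 1/sqrt(d_k) factor from the paper
    return softmax(R) @ V                # (n, d_v); row i is O_i

rng = np.random.default_rng(0)
n, d, d_k, d_v = 4, 6, 3, 5              # illustrative sizes
X = rng.normal(size=(n, d))
W_Q = rng.normal(size=(d, d_k))          # random stand-ins for learned weights
W_K = rng.normal(size=(d, d_k))
W_V = rng.normal(size=(d, d_v))

O = self_attention(X, W_Q, W_K, W_V)

# Row i of the matrix form equals the per-sample computation for x_i.
i = 1
q_i = X[i:i + 1] @ W_Q
O_i = softmax(q_i @ (X @ W_K).T) @ (X @ W_V)
print(O.shape, np.allclose(O[i:i + 1], O_i))
```

The keys and values are both computed from the same X, which is the sense in which the input attends to itself.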

This post will continue to be revised and improved. Comments and discussion are welcome.

---

References:

Attention Is All You Need: https://arxiv.org/pdf/1706.03762.pdf

Transformers in Vision: A Survey: https://arxiv.org/abs/2101.01169 [Note: the dimensions this paper gives for W^Q, W^K, and W^V are wrong; do not be misled.]

A Survey on Visual Transformer: https://arxiv.org/abs/2012.12556

Recommendation Systems and the Attention Mechanism: A Detailed Explanation of the Attention Mechanism (caizd2009's blog, CSDN)

neural networks - What exactly are keys, queries, and values in attention mechanisms? - Cross Validated

Copyright notice
This article was created by [coast_ s]. Please include a link to the original when reprinting. Thank you.
https://yzsam.com/2022/179/202206272310043428.html