当前位置:网站首页>How to understand query, key and value in transformer
How to understand query, key and value in transformer
2022-06-28 01:33:00 【coast_ s】
-------------------------------------
Reprint : The original author yafee123
-------------------------------------
Transformer Come of 2017 An article in google brain Another divine article of 《Attention is all you need》, So far, you have led the way in NLP and CV Another research hotspot .
stay Transformer One of the most critical contributions of the is self-attention. It is to use the relationship between input samples to build an attention model .
self-attention Three very important elements have been introduced in : Query 、Key and Value.
hypothesis
Is the characteristic of an input sample sequence , among n Enter the number of samples for ( Sequence length ),d Is the latitude of a single sample .
Query、Key & Value The definition is as follows :
Query:
, among
, This matrix can be considered as proof of spatial transformation , The same below
Key:
, among 
Value:
, among 
For many people , Seeing these three concepts, I was confused . What are the three concepts and self-attention By what relationship , Why did you choose this name ?
【 Be careful : It's important to be careful here X 、Q、K、V Each line of represents an input sample , This is different from the definition that each column of a sample matrix is a sample , This is very important for understanding the following content .】
So this blog is to briefly explain the reasons for these three names .
To understand the meaning of these three concepts , First of all, understand self-attention What do you want in the end ?
The answer is : Given the current input sample
( Just to understand , We disassemble the input ), Produce an output , This output is the weighted sum of all samples in the sequence . Because it is assumed that this output can see all the input sample information , Then choose your own attention points according to different weights .
If you agree with this answer , Then it's easy to explain .
query 、 key & value The concept of is actually derived from the recommendation system . The basic principle is : Given a query, Calculation query And key The relevance of , And then according to query And key To find the most appropriate value. for instance : In the movie recommendation .query It's someone's preference for movies ( For example, points of interest 、 Age 、 Gender, etc )、key It's the type of film ( comedy 、 Age, etc )、value It's the movie to be recommended . In this case ,query, key and value Each attribute of the is in a different space , In fact, they have a certain potential relationship , That is to say, through some kind of transformation , It can make the attributes of the three in a similar space .
stay self-attention In the principle of , Current input sample
, Through spatial transformation, it becomes a query,
,
. Search items in analogy and recommendation system , We have to rely on query And key The relevance to retrieve what is needed value. that
Why key Well ?
Because according to the process of the recommended system , We are going to find query and key The relevance of , The simplest way is to dot product , Get the current sample and relation vector . And in the self-attention In operation , Will do the following
, such
Each element can be regarded as the current sample
And other samples in the sequence .
After obtaining the relationship between samples , It's natural , Only need to
Normalized and multiplied by V matrix , You can get self-attention The final weighted output of :
.
V Medium Every line Is a sample of the sequence .
, among O Output per dimension of , It is equivalent to the weighted sum of the corresponding latitudes of all input sequence samples , And weight is the relation vector
.( This matrix multiplication can be drawn by yourself ).
Since then , It can be concluded that :
1. self-attention The reason for this is that... In the recommendation system query、key 、value Three concepts , It uses a process similar to the recommendation system . however self-attention Not for query Look for value, But according to the present query obtain value Weighted sum of . This is a self-attention The task of , Want to find a better weighted output for the current input , The output should contain all visible input sequence information , Attention is controlled by weight .
2. self-attention Middle here key and value Is a transformation of the input sequence itself , Maybe it's also self-attention Another meaning of : Act at the same time as key and value. Actually, it's very reasonable , Because in the recommendation system , although key and value The original feature space of attributes is different , But they are strongly related , So they go through certain spatial transformations , Can be unified into a feature space . That's why self-attention To multiply by W One of the reasons .
--
Above contents , Continuous modification and optimization , Welcome to exchange and discussion
---
Reference material :
Attention is all you need:https://arxiv.org/pdf/1706.03762.pdf
Transformers in Vision: A Survey: https://arxiv.org/abs/2101.01169 [ Note in this article , About W^Q , W^K and W^V The definition of latitude is wrong , Don't be misled ]
A Survey on Visual Transformer:https://arxiv.org/abs/2012.12556
边栏推荐
- #795 Div.2 E. Number of Groups set *
- Is it safe for Xiaobai in the stock market to open an account on the Internet?
- 美团动态线程池实践思路已开源
- 【开源】开源系统整理-考试问卷等
- Is it safe to open an account online now? Novice is just on the road, ask for the answer
- Is it reliable to invest in exchange traded ETF funds? Is it safe to invest in exchange traded ETF funds
- 最新MySQL高级SQL语句大全
- JVM的内存模型简介
- Informatics Olympiad all in one 1359: enclosed area
- 现在炒股网上开户安全吗?新手刚上路,求答案
猜你喜欢

Interviewer asked: Inheritance of JS

N methods of data De duplication using SQL

深入解析kubernetes controller-runtime

electron窗口背景透明无边框(可用于启动页面)

MySQL 18: execution of write statements

GFS 分布式文件系统概述与部署

Meituan dynamic thread pool practice idea has been open source

Squid proxy server (Web cache layer for cache acceleration)

美团动态线程池实践思路已开源

【无标题】
随机推荐
面试官问:JS的继承
How about the market application strength of large-size conductive slip rings
The contents of the latex table are left, middle and right
#795 Div.2 E. Number of Groups set *
国外LEAD赚钱的一些习惯
Transformer论文逐段精读
MySQL十种锁,一篇文章带你全解析
golang 猴子吃桃子,求第一天桃子的数量
N methods of data De duplication using SQL
给女朋友看的消息中间件
Is it reliable to invest in exchange traded ETF funds? Is it safe to invest in exchange traded ETF funds
Ten thousand words long article understanding business intelligence (BI) | recommended collection
Download, configuration and installation of MySQL
Solon 1.8.3 发布,云原生微服务开发框架
Why stainless steel swivel
SQL Server 2016 detailed installation tutorial (with registration code and resources)
大尺寸导电滑环市场应用强度如何
什么是过孔式导电滑环?
Is the stock investment exchange group safe? Is it reliable to open an account for free?
796 div.2 C. managing history thinking