Self attention learning notes
2022-07-28 06:06:00 【Alan and fish】
1. Why Self-Attention

In natural language processing, when an RNN (here meaning an LSTM) is used to process the input and output sequences, the LSTM can handle long-range dependencies because each step can draw on the text that came before it, but the computation cannot be parallelized, which makes it very slow.
Many researchers therefore replace the RNN with a CNN. A CNN can be computed in parallel, but a single layer only sees a local window, so many layers must be stacked before the whole sequence is visible, which again makes it inefficient.
The self-attention mechanism is introduced to solve both problems:
- 1. Every output can attend to every node in the sequence, so all dependencies are visible in a single layer.
- 2. The computation can be carried out in parallel.
As shown in the figure, b1 can depend on a1, a2, a3 and a4, and the same holds for b2.
2. How self-attention works
2.1 General principle

- 1. Compute $a$
  The inputs $x^1, x^2, x^3, x^4$ are each multiplied by an embedding matrix $W$ to obtain $a^1, a^2, a^3, a^4$.
- 2. Compute $q, k, v$
  Each $a^i$ is multiplied by three matrices to produce the vectors $q^i$, $k^i$, $v^i$. Their roles and computation:
  - $q$: query (used to match the others), $q^i = W^q a^i$
  - $k$: key (used to be matched), $k^i = W^k a^i$
  - $v$: value (the information to be extracted), $v^i = W^v a^i$
- 3. Compute $\alpha$
  Every query $q$ does attention with every key $k$; concretely, $q^1$ is dot-multiplied with each $k^i$ and scaled by $\sqrt{d}$, where $d$ is the dimension of $q$ and $k$:
  $\alpha_{1,i} = \dfrac{q^1 \cdot k^i}{\sqrt{d}}$
- 4. Compute $\hat{\alpha}$
  All the $\alpha_{1,i}$ are passed through a soft-max, which turns them into a probability distribution over the positions:
  $\hat{\alpha}_{1,i} = \dfrac{\exp(\alpha_{1,i})}{\sum_j \exp(\alpha_{1,j})}$
- 5. Compute $b$
  Each $\hat{\alpha}_{1,i}$ is multiplied with the corresponding $v^i$ and the results are summed, giving $b^1$, the final output for position 1:
  $b^1 = \sum_i \hat{\alpha}_{1,i}\, v^i$

The whole procedure above is the self-attention mechanism: it computes the dependency between each node and every other node (a short code sketch of the five steps follows).
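To make the five steps concrete, here is a minimal NumPy sketch that computes the output $b^1$ for the first position. The toy dimensions, the random inputs, and the helper `softmax` are assumptions made for illustration (vectors are kept as rows, the usual NumPy convention); this is not code from the original notes.

```python
import numpy as np

def softmax(x):
    """Numerically stable soft-max over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 8                        # assumed dimension of q and k
a = rng.normal(size=(4, d))  # rows play the role of a^1 ... a^4
                             # (step 1, x -> a, is skipped: we start from a)

# Step 2: q^i = W^q a^i, k^i = W^k a^i, v^i = W^v a^i
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(d, d))
q, k, v = a @ W_q, a @ W_k, a @ W_v

# Step 3: alpha_{1,i} = q^1 . k^i / sqrt(d)
alpha_1 = k @ q[0] / np.sqrt(d)

# Step 4: soft-max turns the scores into a probability distribution
alpha_hat_1 = softmax(alpha_1)

# Step 5: b^1 = sum_i alpha_hat_{1,i} * v^i
b_1 = alpha_hat_1 @ v
print(b_1.shape)  # (8,): the output vector for position 1
```

Repeating the last three steps for $q^2, q^3, q^4$ gives every output; doing all of that in one shot is exactly what the matrix form in section 2.2 does.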
2.2 Mathematical calculation
- Computing $q, k, v$ with matrices
  Since each $q^i$ is $W^q$ multiplied by one $a^i$, all the $a^i$ can be stacked as the columns of a single matrix $I = [a^1\; a^2\; a^3\; a^4]$, so that $Q = W^q I$ computes every query in one matrix multiplication; this is what makes the computation parallel. $K = W^k I$ and $V = W^v I$ are obtained the same way.
- Computing $\alpha$
  Each $\alpha_{1,i}$ is the product of $q^1$ with one $k^i$ (the $\sqrt{d}$ scaling is omitted here), so all the $k^i$ can likewise be stacked into a matrix, and the whole table of scores is obtained as one product: $A = K^{\top} Q$.
- Computing $\hat{\alpha}$
  The score matrix $A$ from the previous step is passed through a soft-max to obtain $\hat{A}$.
- Computing $b$
  $\hat{A}$ is multiplied with the matrix $V$: each output $b^i$ is the sum of all the $v^j$ weighted by $\hat{\alpha}$, and all the outputs are collected at once as $O = V \hat{A}$.
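Putting the three matrix products together gives the compact form of the whole layer, derived directly from the steps above (column-vector convention, with the $\sqrt{d}$ scaling restored):

$$
\hat{A} = \operatorname{softmax}\!\left(\frac{K^{\top} Q}{\sqrt{d}}\right), \qquad O = V \hat{A},
\qquad \text{where } Q = W^{q} I,\; K = W^{k} I,\; V = W^{v} I .
$$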
The whole process can be abstracted as shown in the figure below; a vectorized sketch of the same computation follows.
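Below is a minimal vectorized sketch of the matrix form, again a NumPy illustration under assumed shapes rather than the notes' own code. Because the rows of `A_in` are the input vectors (row convention), the products appear transposed relative to the column-vector formulas above.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable soft-max along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(A_in, W_q, W_k, W_v):
    """Vectorized self-attention.

    A_in : (n, d_in) matrix whose rows are a^1 ... a^n.
    Returns the (n, d_v) matrix whose rows are b^1 ... b^n.
    """
    Q = A_in @ W_q                      # all queries at once
    K = A_in @ W_k                      # all keys at once
    V = A_in @ W_v                      # all values at once
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # scores[i, j] = q^i . k^j / sqrt(d)
    weights = softmax(scores, axis=-1)  # each row is a probability distribution
    return weights @ V                  # b^i = sum_j weights[i, j] * v^j

# Toy usage with assumed sizes
rng = np.random.default_rng(0)
n, d_in, d_k, d_v = 4, 8, 8, 8
B = self_attention(rng.normal(size=(n, d_in)),
                   rng.normal(size=(d_in, d_k)),
                   rng.normal(size=(d_in, d_k)),
                   rng.normal(size=(d_in, d_v)))
print(B.shape)  # (4, 8): one output row per input position
```

Every position's output is produced by the same few matrix multiplications, which is why self-attention parallelizes so well on modern hardware.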