Self attention learning notes
2022-07-28 06:06:00 【Alan and fish】
1. Why Self-Attention

In natural language processing, when an RNN (here meaning an LSTM) is used to process the input and output sequences, the LSTM can handle long-range dependencies because each step can draw on the text that came before it, but the computation cannot be parallelized, which makes it very slow.
Many researchers therefore replace the RNN with a CNN. A CNN can be computed in parallel, but a single layer only sees a local window, so many layers must be stacked before the whole sequence is visible, which again makes it inefficient.
The self-attention mechanism is introduced to solve both problems:
- 1. Every output can attend to every node in the sequence, so all dependencies are visible in a single layer.
- 2. The computation can be carried out in parallel.
As shown in the figure, b1 can depend on a1, a2, a3 and a4, and the same holds for b2.
2. How self-attention works
2.1 General principle

- 1. Compute $a$
  The inputs $x^1, x^2, x^3, x^4$ are each multiplied by an embedding matrix $W$ to obtain $a^1, a^2, a^3, a^4$.
- 2. Compute $q, k, v$
  Each $a^i$ is multiplied by three matrices to produce the vectors $q^i$, $k^i$, $v^i$. Their roles and computation:
  - $q$: query (used to match the others), $q^i = W^q a^i$
  - $k$: key (used to be matched), $k^i = W^k a^i$
  - $v$: value (the information to be extracted), $v^i = W^v a^i$
- 3. Compute $\alpha$
  Every query $q$ does attention with every key $k$; concretely, $q^1$ is dot-multiplied with each $k^i$ and scaled by $\sqrt{d}$, where $d$ is the dimension of $q$ and $k$:
  $\alpha_{1,i} = \dfrac{q^1 \cdot k^i}{\sqrt{d}}$
- 4. Compute $\hat{\alpha}$
  All the $\alpha_{1,i}$ are passed through a soft-max, which turns them into a probability distribution over the positions:
  $\hat{\alpha}_{1,i} = \dfrac{\exp(\alpha_{1,i})}{\sum_j \exp(\alpha_{1,j})}$
- 5. Compute $b$
  Each $\hat{\alpha}_{1,i}$ is multiplied with the corresponding $v^i$ and the results are summed, giving $b^1$, the final output for position 1:
  $b^1 = \sum_i \hat{\alpha}_{1,i}\, v^i$

The whole procedure above is the self-attention mechanism: it computes the dependency between each node and every other node (a short code sketch of the five steps follows).
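To make the five steps concrete, here is a minimal NumPy sketch that computes the output $b^1$ for the first position. The toy dimensions, the random inputs, and the helper `softmax` are assumptions made for illustration (vectors are kept as rows, the usual NumPy convention); this is not code from the original notes.

```python
import numpy as np

def softmax(x):
    """Numerically stable soft-max over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 8                        # assumed dimension of q and k
a = rng.normal(size=(4, d))  # rows play the role of a^1 ... a^4
                             # (step 1, x -> a, is skipped: we start from a)

# Step 2: q^i = W^q a^i, k^i = W^k a^i, v^i = W^v a^i
W_q = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(d, d))
q, k, v = a @ W_q, a @ W_k, a @ W_v

# Step 3: alpha_{1,i} = q^1 . k^i / sqrt(d)
alpha_1 = k @ q[0] / np.sqrt(d)

# Step 4: soft-max turns the scores into a probability distribution
alpha_hat_1 = softmax(alpha_1)

# Step 5: b^1 = sum_i alpha_hat_{1,i} * v^i
b_1 = alpha_hat_1 @ v
print(b_1.shape)  # (8,): the output vector for position 1
```

Repeating the last three steps for $q^2, q^3, q^4$ gives every output; doing all of that in one shot is exactly what the matrix form in section 2.2 does.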
2.2 Mathematical calculation
- Computing $q, k, v$ with matrices
  Since each $q^i$ is $W^q$ multiplied by one $a^i$, all the $a^i$ can be stacked as the columns of a single matrix $I = [a^1\; a^2\; a^3\; a^4]$, so that $Q = W^q I$ computes every query in one matrix multiplication; this is what makes the computation parallel. $K = W^k I$ and $V = W^v I$ are obtained the same way.
- Computing $\alpha$
  Each $\alpha_{1,i}$ is the product of $q^1$ with one $k^i$ (the $\sqrt{d}$ scaling is omitted here), so all the $k^i$ can likewise be stacked into a matrix, and the whole table of scores is obtained as one product: $A = K^{\top} Q$.
- Computing $\hat{\alpha}$
  The score matrix $A$ from the previous step is passed through a soft-max to obtain $\hat{A}$.
- Computing $b$
  $\hat{A}$ is multiplied with the matrix $V$: each output $b^i$ is the sum of all the $v^j$ weighted by $\hat{\alpha}$, and all the outputs are collected at once as $O = V \hat{A}$.
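Putting the three matrix products together gives the compact form of the whole layer, derived directly from the steps above (column-vector convention, with the $\sqrt{d}$ scaling restored):

$$
\hat{A} = \operatorname{softmax}\!\left(\frac{K^{\top} Q}{\sqrt{d}}\right), \qquad O = V \hat{A},
\qquad \text{where } Q = W^{q} I,\; K = W^{k} I,\; V = W^{v} I .
$$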
The whole process can be abstracted as shown in the figure below; a vectorized sketch of the same computation follows.
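Below is a minimal vectorized sketch of the matrix form, again a NumPy illustration under assumed shapes rather than the notes' own code. Because the rows of `A_in` are the input vectors (row convention), the products appear transposed relative to the column-vector formulas above.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable soft-max along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(A_in, W_q, W_k, W_v):
    """Vectorized self-attention.

    A_in : (n, d_in) matrix whose rows are a^1 ... a^n.
    Returns the (n, d_v) matrix whose rows are b^1 ... b^n.
    """
    Q = A_in @ W_q                      # all queries at once
    K = A_in @ W_k                      # all keys at once
    V = A_in @ W_v                      # all values at once
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # scores[i, j] = q^i . k^j / sqrt(d)
    weights = softmax(scores, axis=-1)  # each row is a probability distribution
    return weights @ V                  # b^i = sum_j weights[i, j] * v^j

# Toy usage with assumed sizes
rng = np.random.default_rng(0)
n, d_in, d_k, d_v = 4, 8, 8, 8
B = self_attention(rng.normal(size=(n, d_in)),
                   rng.normal(size=(d_in, d_k)),
                   rng.normal(size=(d_in, d_k)),
                   rng.normal(size=(d_in, d_v)))
print(B.shape)  # (4, 8): one output row per input position
```

Every position's output is produced by the same few matrix multiplications, which is why self-attention parallelizes so well on modern hardware.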