A Convolutional Substitute for the Attention Mechanism
2022-07-06 08:57:00 【cyz0202】
Reference: fairseq
Background

The common attention mechanism is computed as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$$

and multi-heads are generally used to perform this calculation. Since every position attends to every other position, the computation is expensive and time-consuming.
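To make the cost concrete, here is a minimal sketch of single-head scaled dot-product attention in PyTorch (shapes and names are illustrative, not the fairseq implementation):

```python
import torch

def attention(q, k, v):
    # q, k, v: (n, d) -- n positions, model dimension d
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)  # (n, n): every position scores every other
    weights = torch.softmax(scores, dim=-1)                 # each row is a distribution over positions
    return weights @ v                                      # weighted average of values, (n, d)
```

The (n, n) score matrix is what makes attention quadratic in sequence length.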
Convolution has long been used in NLP, but the computational cost of the usual approaches is not small, and the results fall short of attention.
Can the way convolution is applied in NLP be improved so that it is both fast and effective?
One idea: depthwise convolution (to reduce computation), combined with imitating attention's softmax mechanism (weighted averaging).
The Improved Scheme
A common depthwise convolution is computed as follows:

$$\mathrm{DepthwiseConv}(X, W_{c,:}, i, c) = \sum_{j=1}^{k} W_{c,j} \cdot X_{\left(i + j - \left\lceil \frac{k+1}{2} \right\rceil\right),\, c} \tag{1}$$

where $W \in \mathbb{R}^{d \times k}$ is the kernel, so the parameter count is $d \cdot k$: with d=1024 and k=7, that is 7168 parameters.
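A minimal PyTorch sketch of this parameter count, using `nn.Conv1d` with `groups=d` so that each channel gets its own length-k kernel row:

```python
import torch.nn as nn

d, k = 1024, 7
# groups=d: channel c is convolved only with its own kernel row W[c]
conv = nn.Conv1d(d, d, kernel_size=k, groups=d, padding=k // 2, bias=False)
print(conv.weight.shape)    # torch.Size([1024, 1, 7])
print(conv.weight.numel())  # 7168 == d * k
```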
-------
To further reduce the parameter count, we can let $W \in \mathbb{R}^{H \times k}$ with $H \ll d$; typically H=16 while d=1024.
Because the H dimension is smaller than d, performing a complete convolution over x requires reusing W along the original d dimension; that is, different channels may use the same row of parameters in W.
There are two ways to reuse. One is to repeat W by translation along the d dimension, i.e., the whole of W is used once every H rows.
The other is to repeat each row of W along the d dimension, i.e., during the convolution every d/H rows of x (along the d dimension) correspond to one row of W; for example, rows [0, d/H) of x use row 0 of W for their depthwise convolution. (This is a little convoluted; formula (3) below makes it more intuitive.)
The second way, immediate row repetition, is used here. Thinking of it as imitating attention's multi-heads, each head (of size d/H) can be computed independently; the second way is equivalent to every d/H rows of x (one head along the d dimension) using its own kernel parameters. The two schemes are contrasted in the sketch below.
Careful readers will notice that the H here is in fact imitating the H of multi-heads in attention.
Through this design, the parameter count of W drops to H·k (e.g., with H=16 and k=7, only 112 parameters).
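The two reuse schemes are easy to contrast in code; a small sketch (illustrative tensors, not fairseq code):

```python
import torch

d, H, k = 1024, 16, 7
W = torch.randn(H, k)                      # only H * k = 112 parameters

# Scheme 1: tile W along d -- the whole W repeats every H rows
W_tiled = W.repeat(d // H, 1)              # rows: 0,1,...,15, 0,1,...,15, ...

# Scheme 2 (used here): each row of W repeats d/H times in a row
W_rowrep = W.repeat_interleave(d // H, 0)  # rows: 0,0,...,0, 1,1,...,1, ...

# With scheme 2, channels [0, d/H) all use W[0], matching formula (3)
print(W_tiled.shape, W_rowrep.shape)       # both (1024, 7)
```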
-------
Besides reducing the amount of computation, we can also let the convolution imitate attention's softmax step.
Looking at formula (1), its effect is very similar to the output of attention's softmax, namely a weighted average.
So why not make W a distribution as well?
So we normalize each kernel row with a softmax along the kernel dimension k:

$$\mathrm{softmax}(W)_{h,j} = \frac{\exp(W_{h,j})}{\sum_{j'=1}^{k} \exp(W_{h,j'})} \tag{2}$$
-------
The final convolution is then computed as follows:

$$\mathrm{LightConv}(X, W, i, c) = \sum_{j=1}^{k} \mathrm{softmax}(W)_{\left\lceil \frac{cH}{d} \right\rceil,\, j} \cdot X_{\left(i + j - \left\lceil \frac{k+1}{2} \right\rceil\right),\, c} \tag{3}$$

Note: the rounding in W's subscript defaults to rounding up ($\lceil cH/d \rceil$, with channels indexed from 1); it can also be changed to rounding down, $\lfloor cH/d \rfloor$, with channels indexed from 0.
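Putting the pieces together, here is a minimal sketch of formula (3) (channels 0-indexed, i.e., the floor-rounding variant from the note; a readable reference, not the optimized fairseq kernel):

```python
import torch
import torch.nn.functional as F

def light_conv(x, W, H):
    # x: (B, d, n) channel-first sequence; W: (H, k) shared kernel rows
    B, d, n = x.shape
    k = W.size(1)
    W = torch.softmax(W, dim=-1)            # formula (2): each kernel row becomes a distribution
    W = W.repeat_interleave(d // H, dim=0)  # row repetition: (H, k) -> (d, k)
    # depthwise convolution: groups=d, one (softmaxed) kernel row per channel
    return F.conv1d(x, W.unsqueeze(1), padding=k // 2, groups=d)

out = light_conv(torch.randn(1, 1024, 50), torch.randn(16, 7), H=16)
print(out.shape)  # (1, 1024, 50)
```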
Concrete Realization
The row-repetition scheme above cannot be computed directly with existing convolution operators.
To still use matrix computation, consider a compromise:
Let $x' \in \mathbb{R}^{BH \times n \times d/H}$ and $W' \in \mathbb{R}^{BH \times n \times n}$, then perform a BMM (Batch MatMul) to obtain

$$\mathrm{Out} = \mathrm{bmm}(W', x') \in \mathbb{R}^{BH \times n \times d/H}$$
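In shapes (a sketch; B is the batch size, and how W' is filled is derived below):

```python
import torch

B, H, n, d = 1, 16, 50, 1024
x_prime = torch.randn(B * H, n, d // H)  # reshaped input, one head per batch slot
W_prime = torch.randn(B * H, n, n)       # per-head "fake kernel" matrices (values derived below)
out = torch.bmm(W_prime, x_prime)        # (B*H, n, d/H): the shape we need
```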
This design guarantees that the result Out has the right shape; how do we also guarantee that Out's values are correct, i.e., how do we design the specific values of W' and x'?
It's not hard: just start from the previous section's ideas of row-repeating W and computing x head by head.
Consider x' first: it is a simple reshape of x. In dimension 1, BH stands for batch·H (assuming batch=1, this is the H computation heads); in dimension 2, n is the sequence length, i.e., n positions; the last dimension (the channels) represents one head, i.e., the size d/H of a computation head.
In the convolution computation, each head of x' (last dimension of size d/H, H heads in total) corresponds, in order, to one row of parameters in W (each row of size k; the H heads correspond to the H rows of W). Meanwhile, the 2nd dimension of x' says the convolution is computed at n positions, and the computation at each position has the same form: a weighted sum over a window of size k with fixed parameter values.
Therefore, in W''s dimension 1, BH stands for the convolution parameters in one-to-one correspondence with the BH heads of x' (with B=1, the H heads correspond to the H rows of kernel parameters).
W''s dimension 2, n, stands for the n positions at which convolution is computed.
W''s dimension 3, n, holds the k parameters of the current head's kernel row extended to n parameters; the extension is zero filling.
The reason the kernel row of size k must be extended to length n is that the convolution above slides a fixed window of size k over a sequence of length n.
The convolution parameter values stay constant while their position changes, which looks like an irregular computation. But think about it: although the window slides, it always slides over a sequence of length n. Can we construct a fake kernel of length n, holding the real kernel parameter values inside the window and zeros at all other positions? That gives a fixed-length-n fake kernel, and the computation takes a unified form.
To illustrate: suppose the current position is 10; then the window is centered at position 10, covering, say, [7,13]. Filling every place outside the window with 0 yields a parameter sequence of length n, which can be combined with the length-n input x in a unified n×n dot-product fashion.
Of course, the fill sequence (fake kernel) obtained this way differs at every position: the kernel row's parameter values are unchanged, but their location keeps shifting.
Readers who have looked into how deep learning frameworks implement convolution may find this filling method familiar.
If the above still reads as confusing, think of it as a matrix multiplication: the first n denotes the n positions; the 2nd and 3rd n describe the convolution happening at the current position; the computation is n×n precisely because of the zero filling that achieves the unified form. A concrete sketch follows.
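Here is a minimal sketch of the fake-kernel construction for a single head, plus a check that the n×n matrix product reproduces a direct depthwise convolution (a readable reference; fairseq builds this band matrix with strided-view tricks instead of loops):

```python
import torch
import torch.nn.functional as F

def fake_kernel_matrix(w, n):
    # w: (k,) one kernel row; returns (n, n) where row i is the length-n
    # fake kernel for position i: w inside the window centered at i, 0 elsewhere
    k = w.size(0)
    M = torch.zeros(n, n)
    for i in range(n):
        for j in range(k):
            pos = i + j - k // 2      # input position covered by the j-th kernel tap
            if 0 <= pos < n:
                M[i, pos] = w[j]
    return M

n, k = 10, 3
w = torch.softmax(torch.randn(k), dim=-1)     # one softmax-normalized kernel row
x = torch.randn(n)                            # one channel of one head
out_bmm = fake_kernel_matrix(w, n) @ x
out_conv = F.conv1d(x.view(1, 1, n), w.view(1, 1, k), padding=k // 2).view(n)
print(torch.allclose(out_bmm, out_conv))      # True
```

Stacking one such matrix per head (BH of them) yields exactly the W' used in the BMM above.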
(A better illustration will be added later...)
CUDA Implementation
The computation above is resource-hungry (the n×n fake kernels are mostly zeros), so consider writing a custom CUDA operator.
I will cover that in a separate article: CUDA implementation of the custom convolutional attention operator.
Experimental Results
To be continued
Summary
- This article introduced a convolutional substitute for the attention mechanism.
- Through careful design the convolution is made lightweight, while imitating attention's softmax to achieve good results.