A Convolutional Substitute for the Attention Mechanism
2022-07-06 08:57:00 【cyz0202】
Reference: fairseq
Background
The commonly used attention mechanism is computed as follows:
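The original formula image is not reproduced here; for reference, the standard scaled dot-product attention it refers to is

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$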
Multiple heads (multi-head attention) are generally used to perform the above computation, which is time-consuming.
Convolution has long been used in NLP as well; it is just that the commonly used variants are not cheap computationally, and their results are not as good as attention.
Can the way convolution is applied in NLP be improved so that it is both fast and effective?
One idea is depthwise convolution (to cut the computation), combined with an imitation of attention's softmax mechanism (a weighted average).
The improved scheme
A common depthwise convolution is computed as follows:
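The original formula image is missing; written in the notation of fairseq's LightConv (and numbered (1) to match the reference below), the depthwise convolution at position i and channel c is

$$\mathrm{DepthwiseConv}(X, W_{c,:}, i, c)=\sum_{j=1}^{k} W_{c,j}\cdot X_{\left(i+j-\lceil\frac{k+1}{2}\rceil\right),\,c} \qquad (1)$$

where $X\in\mathbb{R}^{n\times d}$ is the input sequence.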
W is the kernel; its parameter count is d*k, e.g., with d=1024 and k=7 that is d*k = 7168 parameters.
-------
To further reduce the computation, we can let W be of size H×k, with H much smaller than d in general, e.g., H=16 and d=1024.
Since H is smaller than d, to still carry out a complete convolution over x, W has to be reused along the original d dimension; that is, different channels may use the same row of parameters in W.
There are two ways to do this reuse. One is to tile W along the d dimension, i.e., the whole W is reused once for every H channels.
The other is to repeat each row of W along the d dimension, i.e., during the convolution every d/H rows of x (along the d dimension) correspond to one row of W; for example, rows [0, d/H) of x along the d dimension use row 0 of W in the depthwise convolution. (This is a bit convoluted in words; formula (3) below is more intuitive.)
The second way, consecutive repetition, is the one used here. Thinking of it as imitating attention's multi-heads, each head (of size d/H) can be computed independently; the second way is exactly equivalent to letting every d/H rows of x (one head along the d dimension) use its own kernel row.
Careful readers will notice that the H here is in fact imitating the H of multi-head attention.
With this design, the parameter count of W drops to H*k (e.g., H=16 and k=7 give only 112 parameters).
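The image for formula (3) referenced above is missing; in the fairseq LightConv notation, the row-sharing scheme maps channel c to kernel row ⌈cH/d⌉:

$$O_{i,c}=\sum_{j=1}^{k} W_{\lceil\frac{cH}{d}\rceil,\,j}\cdot X_{\left(i+j-\lceil\frac{k+1}{2}\rceil\right),\,c} \qquad (3)$$

with $W\in\mathbb{R}^{H\times k}$.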
-------
Besides reducing the computation, we can also let the convolution imitate attention's softmax step.
Looking at formula (1), its role is very similar to the output of attention's softmax, namely a weighted average.
So why not make each row of W a distribution as well?
We therefore normalize W along the kernel dimension k.
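Concretely, this normalization is just the standard softmax over the k positions of each kernel row:

$$\mathrm{softmax}(W)_{h,j}=\frac{\exp(W_{h,j})}{\sum_{j'=1}^{k}\exp(W_{h,j'})}$$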
-------
The final convolution is then computed as follows:
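The original formula image is missing; following the fairseq LightConv definition, the final form combines the row-sharing of formula (3) with the softmax normalization:

$$\mathrm{LightConv}(X, W_{\lceil\frac{cH}{d}\rceil,:}, i, c)=\mathrm{DepthwiseConv}\!\left(X,\ \mathrm{softmax}(W_{\lceil\frac{cH}{d}\rceil,:}),\ i,\ c\right)$$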
Note: rounding up (the ceiling in W's subscript) is the default; you can also change it to rounding down, adjusting the indexing accordingly.
Concrete implementation
The row-repetition scheme above cannot be computed directly with existing convolution operators.
To fall back on matrix computation, a compromise comes to mind:
Let W' have shape (BH, n, n) and let x' have shape (BH, n, d/H); performing BMM (batch matrix multiplication) then gives Out = BMM(W', x') of shape (BH, n, d/H).
This design guarantees that the result Out has the shape we need. So how do we also guarantee that the values in Out are correct, i.e., how do we design the concrete values of W' and x'?
It is not hard: just follow the idea of the previous section, namely repeating the rows of W and carrying out the multi-head computation on x.
Consider x' first: it is a simple reshape of x. Dimension 1, of size BH, stands for batch*H; assuming batch=1, this is just the H computation heads. Dimension 2, of size n, is the sequence length n, i.e., n positions. The last dimension (the channel dimension) represents one head, i.e., the size d/H of a computation head.
In the convolution computation, each head of x' (last dimension of size d/H, H heads in total) corresponds to one row of W, taken in order (each row of size k; the H heads correspond to the H rows of W). Meanwhile, the 2nd dimension of x' says the convolution must be computed at n positions, and the computation at every position is the same: a weighted sum over a window of size k using the fixed kernel values.
Therefore, dimension 1 of W', of size BH, holds the convolution parameters corresponding one-to-one to the BH heads of x' (with B=1, the H heads correspond to H rows of kernel parameters).
Dimension 2 of W', of size n, stands for the n positions at which the convolution is computed.
Dimension 3 of W', of size n, is the k parameters of the kernel row for the current head, extended to n parameters by zero filling.
The reason the size-k kernel row has to be padded to length n is mainly that the convolution above slides a fixed window of size k over a sequence of length n.
Since the number of convolution parameters stays constant while the position keeps changing, this looks like an irregular computation. But think about it: although the position slides, it slides over a sequence of length n. Can we construct a fake kernel of length n, where the window holds the real kernel parameter values and every position outside the window is filled with 0? That gives a fake kernel of fixed length n, which can be computed in a unified form.
To illustrate with an example: suppose the position being computed is 10; then the center of the window is also position 10, and the window may span [7, 13]. Fill every position outside the window with 0, and you get a parameter sequence of length n that can be used in a unified n*n matrix product against the length-n input x.
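Continuing this example (assuming n=20 and k=7 purely for concreteness), the fake kernel row for position 10 looks like

$$\tilde{W}_{10,:}=[\underbrace{0,\dots,0}_{\text{positions }0\text{--}6},\ \underbrace{w_1,\dots,w_7}_{\text{positions }7\text{--}13},\ \underbrace{0,\dots,0}_{\text{positions }14\text{--}19}]$$

where $w_1,\dots,w_7$ are the real kernel values of the current head.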
Of course, the fill sequence (fake kernel) produced this way is different for every position, because the kernel-row parameter values stay the same while the window position keeps moving.
Readers who have looked at how deep learning frameworks implement convolution may find this padding approach familiar.
If the above is still a bit confusing, think of it as matrix multiplication: the first n stands for the n positions, while the second and third n (the pair contracted in the matmul) stand for the convolution happening at one given position; it ends up n*n because of the zero filling used to obtain a unified computation form.
A better diagram will be added later...
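To make the above concrete, here is a minimal PyTorch sketch of the BMM-based computation described in this section (fairseq's own implementation differs in details). The function name light_conv_bmm, the names w_prime/x_prime, and the centered padding pad=(k-1)//2 are assumptions for illustration, not the exact fairseq code.

```python
# Minimal sketch (assumed helper, not fairseq's exact code) of the BMM-based
# lightweight convolution described above.
import torch
import torch.nn.functional as F

def light_conv_bmm(x, weight):
    """x: (B, n, d) input; weight: (H, k) shared kernel; returns (B, n, d)."""
    B, n, d = x.shape
    H, k = weight.shape
    R = d // H                       # size of one head (d/H)
    pad = (k - 1) // 2               # assumption: center the window at each position

    # Normalize each kernel row into a distribution (imitating attention's softmax).
    w = F.softmax(weight, dim=-1)    # (H, k)

    # x': (B*H, n, d/H) -- dim 1 = batch*heads, dim 2 = positions, dim 3 = one head.
    x_prime = x.view(B, n, H, R).permute(0, 2, 1, 3).reshape(B * H, n, R)

    # Build the "fake kernel" W': (B*H, n, n).  Row i holds the k real kernel
    # values placed around position i, and zeros everywhere else.
    w_rows = w.repeat(B, 1).unsqueeze(1).expand(B * H, n, k)           # (B*H, n, k)
    w_expanded = x.new_zeros(B * H, n, n + k - 1)
    # Writing through a strided view shifts row i's window to start at column i.
    w_expanded.as_strided((B * H, n, k), (n * (n + k - 1), n + k, 1)).copy_(w_rows)
    w_prime = w_expanded.narrow(2, pad, n)                             # (B*H, n, n)

    # One batched matmul replaces the sliding-window convolution.
    out = torch.bmm(w_prime, x_prime)                                  # (B*H, n, d/H)
    return out.view(B, H, n, R).permute(0, 2, 1, 3).reshape(B, n, d)

if __name__ == "__main__":
    B, n, d, H, k = 2, 10, 16, 4, 3
    out = light_conv_bmm(torch.randn(B, n, d), torch.randn(H, k))
    print(out.shape)   # torch.Size([2, 10, 16])
```

The as_strided/copy_ trick builds all the zero-filled fake kernels in one shot: every one of the n rows receives the same k kernel values, just shifted one column further to the right, which is exactly the sliding window written out as an n*n matrix.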
CUDA implementation
The computation above is quite resource-hungry, so a custom CUDA operator is worth considering.
I will cover that in a separate article: CUDA implementation of a custom convolutional attention operator.
Experimental results
To be continued
Summary
- This article introduced a convolutional substitute for the attention mechanism.
- With the right design, the convolution becomes lightweight while imitating attention, achieving good results.