A Convolutional Substitute for the Attention Mechanism
2022-07-06 08:57:00 【cyz0202】
Reference: fairseq
Background
The common attention mechanism computes the following:

    Attention(Q, K, V) = softmax(QK^T / √d) · V

and this computation is usually performed with multiple heads. It is expensive and time-consuming: the score matrix QK^T alone costs O(n²·d) for a sequence of length n.
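For concreteness, here is a minimal single-head sketch of the computation being replaced (shapes and names are illustrative, not fairseq's code):

```python
# A minimal single-head sketch of scaled dot-product attention;
# multi-head attention splits d into H such computations.
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    """Q, K, V: (batch, n, d). Cost is O(n^2 * d) from the two matmuls."""
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d ** 0.5   # (batch, n, n)
    return F.softmax(scores, dim=-1) @ V          # weighted average of V rows

out = attention(*(torch.randn(2, 50, 64) for _ in range(3)))
print(out.shape)   # torch.Size([2, 50, 64])
```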
Convolution has long been used in NLP as well; the problem is that with common usage the computation is not small either, and the results are not as good as Attention.
Can we improve how convolution is applied in NLP, so that it is both fast and effective?
One idea is depthwise convolution (to reduce computation), combined with imitating attention's softmax mechanism (weighted averaging).
The Improved Scheme
A common depthwise convolution is computed as follows:

    DepthwiseConv(X, W_{c,:}, i, c) = Σ_{j=1}^{k} W_{c,j} · X_{i+j−⌈(k+1)/2⌉, c}    (1)

where W ∈ R^{d×k} is the kernel, so the parameter count is d·k: with d=1024 and k=7, d·k = 7168.
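As a quick sanity check, standard depthwise convolution is available in PyTorch via the groups argument; the following sketch (using the d, k values from the text) confirms the d·k parameter count:

```python
# A minimal sketch of standard depthwise convolution in PyTorch:
# groups=d gives each channel its own k-tap filter.
import torch
import torch.nn as nn

d, k, n = 1024, 7, 50
conv = nn.Conv1d(d, d, kernel_size=k, groups=d, padding=k // 2, bias=False)

x = torch.randn(1, d, n)        # (batch, channels, length)
out = conv(x)                   # (1, d, n)

print(conv.weight.shape)                          # torch.Size([1024, 1, 7])
print(sum(p.numel() for p in conv.parameters()))  # 7168 = d*k
```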
-------
To further reduce the computation, we can tie the kernel weights: let W ∈ R^{H×k}, where in general H ≪ d, e.g., H=16 and d=1024.
Since H is smaller than d, performing a complete convolution over x requires reusing W along the d dimension; that is, different channels may use the same row of parameters in W.
There are two ways to reuse it. One is to tile W along the d dimension, i.e., a full copy of W is used once every H rows.
The other is to repeat each row of W along the d dimension, i.e., during the convolution every d/H rows of x (along the d dimension) correspond to one row of W; for example, rows [0, d/H) of x use row 0 of W for the depthwise computation. (This is a bit convoluted; formula (3) below makes it more intuitive.)
The second way, row repetition, is used here. Thinking of imitating Attention's multi-heads, each head (of size d/H) can be computed independently; the second way is equivalent to giving every d/H rows of x (each head along the d dimension) its own kernel parameters.
Careful readers will notice that this H is in fact imitating the H of multi-heads in Attention.
With this design, the parameter count of W drops to H·k (e.g., with H=16 and k=7, only 112 parameters).
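A minimal sketch of this row-repetition scheme (illustrative code, not fairseq's exact implementation): the kernel stores only H rows, and each row is shared by d/H consecutive channels:

```python
# Head-sharing sketch: channel c (0-indexed) reuses kernel row
# floor(c*H/d), i.e. channels form H contiguous blocks of size d/H;
# this matches the 1-indexed ceil(cH/d) of formula (3) below.
import torch

d, k, H = 1024, 7, 16
W = torch.randn(H, k)                         # only H*k = 112 parameters

# expand to a full depthwise kernel: each row serves d/H consecutive channels
W_full = W.repeat_interleave(d // H, dim=0)   # (d, k)
print(W.numel(), W_full.shape)                # 112 torch.Size([1024, 7])
```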
-------
Besides reducing computation, we can also let the convolution imitate Attention's softmax step.
Looking at formula (1), its effect is very similar to the softmax output in Attention, namely a weighted average over positions.
So why not make the rows of W a distribution as well?
We therefore normalize each kernel row by softmax over the kernel dimension:

    W'_{h,:} = softmax(W_{h,:})    (2)
-------
The final convolution is computed as follows:

    LightConv(X, W_{⌈cH/d⌉,:}, i, c) = DepthwiseConv(X, softmax(W_{⌈cH/d⌉,:}), i, c)    (3)

Note: the subscript of W rounds up by default; it could instead round down, with the channel-to-row mapping adjusted accordingly.
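Putting the pieces together, here is a minimal sketch of formula (3) using F.conv1d. fairseq's lightweight convolution takes a different execution path (the band-matrix BMM described in the next section), but the computation is the same:

```python
# Lightweight convolution, formula (3): softmax-normalized,
# head-shared depthwise convolution, via F.conv1d for clarity.
import torch
import torch.nn.functional as F

def light_conv(x, W, H):
    """x: (batch, d, n); W: (H, k) shared kernel rows."""
    batch, d, n = x.shape
    k = W.size(1)
    W = F.softmax(W, dim=1)                  # each row becomes a distribution
    W = W.repeat_interleave(d // H, dim=0)   # (d, k): rows reused across channels
    return F.conv1d(x, W.unsqueeze(1), padding=k // 2, groups=d)

x = torch.randn(2, 1024, 50)
out = light_conv(x, torch.randn(16, 7), H=16)
print(out.shape)                              # torch.Size([2, 1024, 50])
```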
Concrete Implementation
The row-repetition scheme above cannot be computed directly with existing convolution operators.
To leverage matrix computation, a workaround is as follows. Let

    x' = reshape(x) ∈ R^{BH×n×(d/H)},    W' ∈ R^{BH×n×n},

then perform a BMM (batched matrix multiplication) to obtain

    Out = BMM(W', x') ∈ R^{BH×n×(d/H)}.
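A quick shape check of this BMM formulation (values from the running example; assuming batch B=1):

```python
# Shape check: x' is (B*H, n, d/H), W' is (B*H, n, n);
# torch.bmm yields Out of shape (B*H, n, d/H).
import torch

B, H, n, d = 1, 16, 50, 1024
x_prime = torch.randn(B * H, n, d // H)
W_prime = torch.randn(B * H, n, n)
out = torch.bmm(W_prime, x_prime)   # reshape back to (B, n, d) afterwards
print(out.shape)                    # torch.Size([16, 50, 64])
```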
The design above guarantees that Out has the shape we need. But how do we guarantee that its values are also correct, i.e., how do we design the specific values of W' and x'?
It is not hard: just follow the previous section's ideas of repeating the rows of W and computing over x multi-head style.
Consider x' ∈ R^{BH×n×(d/H)}: it is a simple reshape of x. In dimension 1, BH stands for batch·H; assuming batch=1, these are the H computation heads. In dimension 2, n stands for the sequence length, i.e., n positions. The last dimension (the channels) represents one head, i.e., the size of a single computation head.
In the convolution, each head of x' (last dimension of size d/H, H heads in total) corresponds to one row of parameters in W, taken in order (each row has size k; the H heads correspond to the H rows of W). Meanwhile, the 2nd dimension of x' says that the convolution is computed at n positions, and the computation at each position is identical: a weighted sum over a window of size k with fixed parameter values.
Therefore, dimension 1 of W', BH, holds the convolution parameters corresponding one-to-one with the BH heads of x' (with B=1, the H heads correspond to the H rows of kernel parameters);
dimension 2 of W', n, represents the n positions at which the convolution is computed;
dimension 3 of W', also n, is the current head's row of k kernel parameters extended to n parameters, the extension being zero filling.
The reason a row of k kernel parameters is extended to length n is that the convolution above slides a window of fixed size k over a sequence of length n.
Since the number of convolution parameters is fixed but their position changes, this looks like an irregular computation. Note, however, that although the window slides, it always slides over a sequence of length n. So can we construct a fake kernel of length n that holds the real kernel parameter values inside the window and zeros at every position outside it? That yields a fixed-length-n fake kernel, and the computation takes a unified form.
To illustrate: suppose the current position is 10; the window is then centered at position 10 and might span [7, 13]. Filling every position outside the window with 0 gives a length-n parameter sequence, which can be combined with the length-n input x through a unified n×n dot-product computation.
Of course, the fill sequence (fake kernel) obtained this way is different at every position: the kernel row's parameter values stay the same, but their location keeps shifting.
Readers who have studied how deep learning frameworks implement convolution may find this filling scheme familiar.
If the above is still confusing, think in terms of matrix multiplication: the 1st n stands for the n positions; the 2nd and 3rd n describe the convolution happening at one current position; and the n×n shape, with its zero filling, exists to achieve the unified computation form.
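The following sketch builds the fake kernels for one head as an n×n band matrix (a plain double loop for clarity; fairseq fills the same band structure with a strided copy):

```python
# "Fake kernel" construction for one head: each of the n rows holds
# the k real kernel values placed at that position's window, zero
# elsewhere, giving an n x n band matrix.
import torch

n, k = 8, 3
w = torch.tensor([0.2, 0.5, 0.3])      # one softmax-normalized kernel row

W_band = torch.zeros(n, n)
for i in range(n):                      # position i: window centered at i
    for j in range(k):
        pos = i + j - k // 2            # column the j-th tap lands on
        if 0 <= pos < n:
            W_band[i, pos] = w[j]       # real value inside the window, 0 outside

x = torch.randn(n)                      # a length-n input (single channel)
out = W_band @ x                        # all n positions in one matmul
print(out.shape)                        # torch.Size([8])
```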
(A clearer diagram will be added later...)
CUDA Implementation
The computation above is resource-hungry, so a custom CUDA operator is worth considering.
I will cover that in a separate article: a CUDA implementation of the custom convolutional attention operator.
Experimental Results
To be continued
Summary
- This article introduced a convolutional substitute for the attention mechanism;
- With careful design, the convolution becomes lightweight while imitating Attention, achieving good results.