A convolutional substitute for the attention mechanism
2022-07-06 08:57:00 【cyz0202】
Reference: fairseq
Background
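The usual attention mechanism (scaled dot-product attention) is computed as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
It is generally computed with multiple heads, which is time-consuming.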
Convolution has long been used in NLP, but the commonly used variants are not cheap to compute, and their results are not as good as those of attention.
Can the way convolution is applied in NLP be improved so that it is both fast and accurate?
One idea is to use depthwise convolution (to reduce computation) and to imitate the softmax mechanism of attention (a weighted average).
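The improved scheme
The common depthwise convolution is as follows:
$$O_{i,c} = \mathrm{DepthwiseConv}(x, W_{c,:}, i, c) = \sum_{j=1}^{k} W_{c,j} \cdot x_{\left(i+j-\lceil \frac{k+1}{2} \rceil\right),\,c} \tag{1}$$
Here $W \in \mathbb{R}^{d \times k}$ is the kernel, so the parameter count is $d \cdot k$: when d=1024 and k=7, d*k=7168.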
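Formula (1) can be sketched directly in NumPy. This is only an illustrative reference loop, not the fairseq implementation; the function name `depthwise_conv` and the choice to treat out-of-range positions as zeros are assumptions made for the example.

```python
import numpy as np

def depthwise_conv(x, W):
    """Naive depthwise convolution following formula (1).

    x: (n, d) input sequence; W: (d, k) kernel, one row per channel.
    Positions outside the sequence are treated as zeros (an assumption).
    """
    n, d = x.shape
    k = W.shape[1]
    offset = -(-(k + 1) // 2)          # ceil((k + 1) / 2)
    out = np.zeros((n, d))
    for i in range(1, n + 1):          # 1-based position, as in formula (1)
        for j in range(1, k + 1):      # 1-based kernel index
            src = i + j - offset       # 1-based source position
            if 1 <= src <= n:
                # Every channel c uses its own kernel row W[c, :].
                out[i - 1] += W[:, j - 1] * x[src - 1]
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))            # n=5 positions, d=4 channels
W = rng.normal(size=(4, 3))            # d=4 rows, kernel width k=3
out = depthwise_conv(x, W)

param_count = 1024 * 7                 # d * k for d=1024, k=7
```

Each channel is convolved with its own row of `W` and channels never mix, which is why the parameter count is d*k rather than the d*d*k of an ordinary convolution.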
-------
To further reduce the amount of computation, we can let $W \in \mathbb{R}^{H \times k}$, where in general $H \ll d$, for example H=16 and d=1024.
Since the first dimension of W shrinks to H, a complete convolution over x requires reusing W along the original d dimension; that is, different channels may use the same row of W.
There are two ways to reuse it. One is to tile W along the d dimension, so that W is used once for every H rows.
The other is to repeat each row of W along the d dimension: during the convolution, every d/H rows of x along the d dimension correspond to one row of W. For example, rows [0, d/H) of x along the d dimension use row 0 of W for the depthwise convolution. (This is a bit convoluted; formula (3) below makes it more intuitive.)
The second way, row repetition, is used here. The idea imitates the multiple heads of attention, where every head (of size d/H) can be computed independently; the second way is equivalent to giving every d/H rows of x (every head along the d dimension) its own kernel parameters.
Careful readers will notice that H here is in fact modeled on the number of heads H in multi-head attention.
With this design, the parameter count of W is reduced to H*k (for example, with H=16 and k=7 there are only 112 parameters).
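The difference between the two reuse schemes is easy to see on a small example. The sketch below uses zero-based channel indices and toy sizes (d=8, H=2, k=3) chosen only for illustration; `W_tiled` and `W_repeated` are hypothetical names.

```python
import numpy as np

d, H, k = 8, 2, 3
W = np.arange(H * k, dtype=float).reshape(H, k)   # shared kernel, H rows

# Way 1: tile W along d, so channel c uses row c % H.
W_tiled = np.tile(W, (d // H, 1))                 # shape (d, k)

# Way 2 (the one used here): repeat each row d/H times,
# so channel c uses row c // (d / H), i.e. one row per head.
W_repeated = np.repeat(W, d // H, axis=0)         # shape (d, k)

params_full = 1024 * 7     # ordinary depthwise kernel, d * k
params_shared = 16 * 7     # shared kernel, H * k
```

With row repetition, channels 0 to 3 all use row 0 of `W` and channels 4 to 7 all use row 1, which matches the multi-head picture of contiguous blocks of size d/H.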
-------
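Besides reducing the amount of computation, the convolution can also imitate the softmax step of attention.
Looking at formula (1), the role of $W_{c,:}$ is very similar to that of the softmax output in attention: both act as the weights of a weighted average.
So why not make $W_{c,:}$ a distribution as well?
So let each kernel row be normalized with a softmax over the kernel width:
$$\mathrm{softmax}(W)_{h,j} = \frac{\exp W_{h,j}}{\sum_{j'=1}^{k} \exp W_{h,j'}}$$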
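As a small sketch of this normalization, the helper below applies a softmax to every row of an H×k kernel. The name `softmax_rows` and the max-subtraction for numerical stability are choices made for this example, not something stated in the original.

```python
import numpy as np

def softmax_rows(W):
    """Softmax over the kernel width (last axis) of an (H, k) kernel."""
    shifted = W - W.max(axis=1, keepdims=True)   # stabilizes exp, result unchanged
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 7))      # H=16 heads, k=7 kernel width
P = softmax_rows(W)               # every row is now a distribution over the window
```

After this step each output position is a convex combination of the k inputs in its window, which is the same kind of weighted average that attention produces.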
-------
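The final convolution is computed as follows:
$$\mathrm{LightConv}(x, W_{\lceil cH/d \rceil,:}, i, c) = \mathrm{DepthwiseConv}(x, \mathrm{softmax}(W_{\lceil cH/d \rceil,:}), i, c) \tag{3}$$
Note: rounding up in the subscript of W is the default (with the channel index c counted from 1). It can also be changed to rounding down, in which case c is counted from 0, as in the [0, d/H) example above.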
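Putting the pieces together, formula (3) can be written as a plain reference loop. This is a minimal sketch under stated assumptions (zero padding at the sequence borders, H dividing d, hypothetical names `softmax_rows` and `light_conv`); it is meant for checking values, not for speed.

```python
import numpy as np

def softmax_rows(W):
    e = np.exp(W - W.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def light_conv(x, W):
    """Reference loop for formula (3).

    x: (n, d) input; W: (H, k) raw shared kernel, with H dividing d.
    Positions outside the sequence are treated as zeros (an assumption).
    """
    n, d = x.shape
    H, k = W.shape
    P = softmax_rows(W)                      # normalized kernel rows
    offset = -(-(k + 1) // 2)                # ceil((k + 1) / 2)
    out = np.zeros((n, d))
    for c in range(1, d + 1):                # 1-based channel index
        h = -(-(c * H) // d)                 # ceil(c * H / d), 1-based kernel row
        for i in range(1, n + 1):
            for j in range(1, k + 1):
                src = i + j - offset
                if 1 <= src <= n:
                    out[i - 1, c - 1] += P[h - 1, j - 1] * x[src - 1, c - 1]
    return out

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))                  # H=2, k=3
out_ones = light_conv(np.ones((6, 8)), W)    # constant input, d=8
col = rng.normal(size=(6, 1))
out_same = light_conv(np.tile(col, (1, 8)), W)   # identical channels
```

Because every kernel row sums to 1, a constant input is reproduced exactly at interior positions; channels in the same head share one kernel row, so identical input channels within a head give identical outputs.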
Implementation
The row-repetition computation above cannot be carried out directly with existing convolution operators.
To use matrix computation instead, a compromise is possible:
let $x' \in \mathbb{R}^{BH \times n \times d/H}$ and $W' \in \mathbb{R}^{BH \times n \times n}$, then perform a BMM (batch MatMul) to obtain $Out = W' x' \in \mathbb{R}^{BH \times n \times d/H}$.
This design guarantees that the result Out has the shape we need. But how do we guarantee that the values of Out are also correct, that is, how should the specific values of W' and x' be designed?
It is not hard: start from the two ideas of the previous section, repeating the rows of W and computing x in multi-head fashion.
Consider x' first. It is a simple reshape: the BH in dimension 1 stands for batch*H, so with batch=1 there are H computation heads; the n in dimension 2 is the sequence length, i.e. n positions; the last dimension (the channel) is one head, i.e. the size of one computation head.
In the convolution, each head of x' (last dimension of size d/H, H heads in total) uses one row of W, taken in order (each row has size k, and the H heads correspond to the H rows of W). Meanwhile, dimension 2 of x' means that the convolution has to be computed at n positions, and at every position the computation is a weighted sum over a window of size k with a fixed sequence of weights.
Therefore the BH in dimension 1 of W' stands for the convolution parameters matched one-to-one with the BH heads of x' (with B=1, the H heads correspond to the H rows of kernel parameters).
The n in dimension 2 of W' stands for the n positions at which the convolution is computed.
The n in dimension 3 of W' is the k parameters of the kernel row for the current head, extended to n parameters by zero padding.
The reason for padding a kernel row of size k to length n is that the convolution above slides a fixed window of size k over a sequence of length n.
The number of convolution parameters is constant but their position changes, so the computation looks irregular. On reflection, though, the window always slides over a sequence of length n. Can we build a fake kernel of length n that holds the real kernel parameter values inside the window and zeros everywhere else? That gives a fake kernel of fixed length n, which can be computed in a uniform way.
For example, suppose the current position is 10. The window is then centered at position 10 and might cover [7,13]. Filling every position outside the window with 0 gives a parameter sequence of length n, which can be combined with the length-n input x through a uniform n*n dot-product computation.
Of course, the padded sequence (fake kernel) obtained this way is different at every position, because the parameter values of the current kernel row stay the same while their position changes.
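The position-10 example can be made concrete. The sketch below builds the zero-padded fake kernel for one position; the helper name `fake_kernel_row`, the toy weights, and the sequence length n=20 are assumptions for illustration (k=7 gives the window [7,13] mentioned above).

```python
import numpy as np

def fake_kernel_row(weights, pos, n):
    """Length-n row holding the k kernel weights in the window centered at
    `pos` (0-based) and zeros elsewhere; weights falling outside [0, n) are dropped."""
    k = len(weights)
    row = np.zeros(n)
    for j, w in enumerate(weights):
        src = pos + j - k // 2
        if 0 <= src < n:
            row[src] = w
    return row

weights = np.array([0.05, 0.1, 0.2, 0.3, 0.2, 0.1, 0.05])   # one normalized kernel row, k=7
n = 20
row10 = fake_kernel_row(weights, pos=10, n=n)
row11 = fake_kernel_row(weights, pos=11, n=n)

x = np.arange(n, dtype=float)        # one channel of the input
y10 = row10 @ x                      # dot product over the full length n
```

`row10` is non-zero only at indices 7 to 13, and `row11` holds the same values shifted by one position; stacking such rows for all n positions yields the n×n matrix used in the BMM.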
Readers who have looked at how deep learning frameworks implement convolution may be more familiar with this padding approach.
The description above may still be somewhat confusing. Think of it as a matrix multiplication: the first n stands for the n positions, and the second and third n describe the convolution that takes place at one position; it is n*n because of the zero padding that gives the computation a uniform form.
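The whole BMM construction can be sketched and checked against a direct loop. This is a minimal NumPy illustration, not the fairseq code: it assumes an input layout of (B, n, d), zero-based head assignment c // (d/H), and zero padding at the borders, and the names `light_conv_bmm` and `light_conv_direct` are hypothetical. With this layout the reshape into heads needs a transpose so that each head holds a contiguous block of d/H channels.

```python
import numpy as np

def softmax_rows(W):
    e = np.exp(W - W.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def light_conv_bmm(x, W):
    """x: (B, n, d); W: (H, k) raw shared kernel. Returns (B, n, d)."""
    B, n, d = x.shape
    H, k = W.shape
    R = d // H                                   # head size d/H
    P = softmax_rows(W)
    # x': (B*H, n, d/H); head h holds channels [h*R, (h+1)*R).
    x_p = x.reshape(B, n, H, R).transpose(0, 2, 1, 3).reshape(B * H, n, R)
    # One banded (n, n) matrix per head: row i is the fake kernel for position i.
    band = np.zeros((H, n, n))
    for i in range(n):
        for j in range(k):
            src = i + j - k // 2
            if 0 <= src < n:
                band[:, i, src] = P[:, j]
    W_p = np.tile(band, (B, 1, 1))               # W': (B*H, n, n)
    out = np.matmul(W_p, x_p)                    # batched matmul: (B*H, n, d/H)
    return out.reshape(B, H, n, R).transpose(0, 2, 1, 3).reshape(B, n, d)

def light_conv_direct(x, W):
    """Direct sliding-window loop used only to check the BMM result."""
    B, n, d = x.shape
    H, k = W.shape
    P = softmax_rows(W)
    out = np.zeros_like(x)
    for c in range(d):
        h = c // (d // H)
        for i in range(n):
            for j in range(k):
                src = i + j - k // 2
                if 0 <= src < n:
                    out[:, i, c] += P[h, j] * x[:, src, c]
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 9, 8))                   # B=2, n=9, d=8
W = rng.normal(size=(4, 3))                      # H=4, k=3
out_bmm = light_conv_bmm(x, W)
out_ref = light_conv_direct(x, W)
```

The two results agree, but W' stores n*n values per head even though only about n*k of them are non-zero, which illustrates why this formulation is wasteful for long sequences.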
A better illustration will be added later.
CUDA implementation
The computation above is resource-intensive, so it is worth considering a custom CUDA operator.
I will cover this in a separate article: CUDA implementation of a custom convolution attention operator.
Experimental results
To be continued
Summary
- This article introduces a convolutional substitute for the attention mechanism.
- With a suitable design, the convolution can be made lightweight while imitating attention to achieve better results.