A Convolutional Substitute for the Attention Mechanism
2022-07-06 08:57:00 【cyz0202】
Reference: fairseq
Background

The common attention mechanism is computed as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$$

and multi-heads are generally used to perform this calculation. Since every position attends to every other position, the computation is expensive and time-consuming.
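To make the cost concrete, here is a minimal sketch of single-head scaled dot-product attention in PyTorch (shapes and names are illustrative, not the fairseq implementation):

```python
import torch

def attention(q, k, v):
    # q, k, v: (n, d) -- n positions, model dimension d
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)  # (n, n): every position scores every other
    weights = torch.softmax(scores, dim=-1)                 # each row is a distribution over positions
    return weights @ v                                      # weighted average of values, (n, d)
```

The (n, n) score matrix is what makes attention quadratic in sequence length.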
Convolution has long been used in NLP, but the computational cost of the usual approaches is not small, and the results fall short of attention.
Can the way convolution is applied in NLP be improved so that it is both fast and effective?
One idea: depthwise convolution (to reduce computation), combined with imitating attention's softmax mechanism (weighted averaging).
The Improved Scheme
A common depthwise convolution is computed as follows:

$$\mathrm{DepthwiseConv}(X, W_{c,:}, i, c) = \sum_{j=1}^{k} W_{c,j} \cdot X_{\left(i + j - \left\lceil \frac{k+1}{2} \right\rceil\right),\, c} \tag{1}$$

where $W \in \mathbb{R}^{d \times k}$ is the kernel, so the parameter count is $d \cdot k$: with d=1024 and k=7, that is 7168 parameters.
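A minimal PyTorch sketch of this parameter count, using `nn.Conv1d` with `groups=d` so that each channel gets its own length-k kernel row:

```python
import torch.nn as nn

d, k = 1024, 7
# groups=d: channel c is convolved only with its own kernel row W[c]
conv = nn.Conv1d(d, d, kernel_size=k, groups=d, padding=k // 2, bias=False)
print(conv.weight.shape)    # torch.Size([1024, 1, 7])
print(conv.weight.numel())  # 7168 == d * k
```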
-------
To further reduce the parameter count, we can let $W \in \mathbb{R}^{H \times k}$ with $H \ll d$; typically H=16 while d=1024.
Because the H dimension is smaller than d, performing a complete convolution over x requires reusing W along the original d dimension; that is, different channels may use the same row of parameters in W.
There are two ways to reuse. One is to repeat W by translation along the d dimension, i.e., the whole of W is used once every H rows.
The other is to repeat each row of W along the d dimension, i.e., during the convolution every d/H rows of x (along the d dimension) correspond to one row of W; for example, rows [0, d/H) of x use row 0 of W for their depthwise convolution. (This is a little convoluted; formula (3) below makes it more intuitive.)
The second way, immediate row repetition, is used here. Thinking of it as imitating attention's multi-heads, each head (of size d/H) can be computed independently; the second way is equivalent to every d/H rows of x (one head along the d dimension) using its own kernel parameters. The two schemes are contrasted in the sketch below.
Careful readers will notice that the H here is in fact imitating the H of multi-heads in attention.
Through this design, the parameter count of W drops to H·k (e.g., with H=16 and k=7, only 112 parameters).
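The two reuse schemes are easy to contrast in code; a small sketch (illustrative tensors, not fairseq code):

```python
import torch

d, H, k = 1024, 16, 7
W = torch.randn(H, k)                      # only H * k = 112 parameters

# Scheme 1: tile W along d -- the whole W repeats every H rows
W_tiled = W.repeat(d // H, 1)              # rows: 0,1,...,15, 0,1,...,15, ...

# Scheme 2 (used here): each row of W repeats d/H times in a row
W_rowrep = W.repeat_interleave(d // H, 0)  # rows: 0,0,...,0, 1,1,...,1, ...

# With scheme 2, channels [0, d/H) all use W[0], matching formula (3)
print(W_tiled.shape, W_rowrep.shape)       # both (1024, 7)
```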
-------
Besides reducing the amount of computation, we can also let the convolution imitate attention's softmax step.
Looking at formula (1), its effect is very similar to the output of attention's softmax, namely a weighted average.
So why not make W a distribution as well?
So we normalize each kernel row with a softmax along the kernel dimension k:

$$\mathrm{softmax}(W)_{h,j} = \frac{\exp(W_{h,j})}{\sum_{j'=1}^{k} \exp(W_{h,j'})} \tag{2}$$
-------
The final convolution is then computed as follows:

$$\mathrm{LightConv}(X, W, i, c) = \sum_{j=1}^{k} \mathrm{softmax}(W)_{\left\lceil \frac{cH}{d} \right\rceil,\, j} \cdot X_{\left(i + j - \left\lceil \frac{k+1}{2} \right\rceil\right),\, c} \tag{3}$$

Note: the rounding in W's subscript defaults to rounding up ($\lceil cH/d \rceil$, with channels indexed from 1); it can also be changed to rounding down, $\lfloor cH/d \rfloor$, with channels indexed from 0.
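Putting the pieces together, here is a minimal sketch of formula (3) (channels 0-indexed, i.e., the floor-rounding variant from the note; a readable reference, not the optimized fairseq kernel):

```python
import torch
import torch.nn.functional as F

def light_conv(x, W, H):
    # x: (B, d, n) channel-first sequence; W: (H, k) shared kernel rows
    B, d, n = x.shape
    k = W.size(1)
    W = torch.softmax(W, dim=-1)            # formula (2): each kernel row becomes a distribution
    W = W.repeat_interleave(d // H, dim=0)  # row repetition: (H, k) -> (d, k)
    # depthwise convolution: groups=d, one (softmaxed) kernel row per channel
    return F.conv1d(x, W.unsqueeze(1), padding=k // 2, groups=d)

out = light_conv(torch.randn(1, 1024, 50), torch.randn(16, 7), H=16)
print(out.shape)  # (1, 1024, 50)
```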
Concrete Realization
The row-repetition scheme above cannot be computed directly with existing convolution operators.
To still use matrix computation, consider a compromise:
Let $x' \in \mathbb{R}^{BH \times n \times d/H}$ and $W' \in \mathbb{R}^{BH \times n \times n}$, then perform a BMM (Batch MatMul) to obtain

$$\mathrm{Out} = \mathrm{bmm}(W', x') \in \mathbb{R}^{BH \times n \times d/H}$$
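In shapes (a sketch; B is the batch size, and how W' is filled is derived below):

```python
import torch

B, H, n, d = 1, 16, 50, 1024
x_prime = torch.randn(B * H, n, d // H)  # reshaped input, one head per batch slot
W_prime = torch.randn(B * H, n, n)       # per-head "fake kernel" matrices (values derived below)
out = torch.bmm(W_prime, x_prime)        # (B*H, n, d/H): the shape we need
```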
This design guarantees that the result Out has the right shape; how do we also guarantee that Out's values are correct, i.e., how do we design the specific values of W' and x'?
It's not hard: just start from the previous section's ideas of row-repeating W and computing x head by head.
Consider x' first: it is a simple reshape of x. In dimension 1, BH stands for batch·H (assuming batch=1, this is the H computation heads); in dimension 2, n is the sequence length, i.e., n positions; the last dimension (the channels) represents one head, i.e., the size d/H of a computation head.
In the convolution computation, each head of x' (last dimension of size d/H, H heads in total) corresponds, in order, to one row of parameters in W (each row of size k; the H heads correspond to the H rows of W). Meanwhile, the 2nd dimension of x' says the convolution is computed at n positions, and the computation at each position has the same form: a weighted sum over a window of size k with fixed parameter values.
Therefore, in W''s dimension 1, BH stands for the convolution parameters in one-to-one correspondence with the BH heads of x' (with B=1, the H heads correspond to the H rows of kernel parameters).
W''s dimension 2, n, stands for the n positions at which convolution is computed.
W''s dimension 3, n, holds the k parameters of the current head's kernel row extended to n parameters; the extension is zero filling.
The reason the kernel row of size k must be extended to length n is that the convolution above slides a fixed window of size k over a sequence of length n.
The convolution parameter values stay constant while their position changes, which looks like an irregular computation. But think about it: although the window slides, it always slides over a sequence of length n. Can we construct a fake kernel of length n, holding the real kernel parameter values inside the window and zeros at all other positions? That gives a fixed-length-n fake kernel, and the computation takes a unified form.
To illustrate: suppose the current position is 10; then the window is centered at position 10, covering, say, [7,13]. Filling every place outside the window with 0 yields a parameter sequence of length n, which can be combined with the length-n input x in a unified n×n dot-product fashion.
Of course, the fill sequence (fake kernel) obtained this way differs at every position: the kernel row's parameter values are unchanged, but their location keeps shifting.
Readers who have looked into how deep learning frameworks implement convolution may find this filling method familiar.
If the above still reads as confusing, think of it as a matrix multiplication: the first n denotes the n positions; the 2nd and 3rd n describe the convolution happening at the current position; the computation is n×n precisely because of the zero filling that achieves the unified form. A concrete sketch follows.
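Here is a minimal sketch of the fake-kernel construction for a single head, plus a check that the n×n matrix product reproduces a direct depthwise convolution (a readable reference; fairseq builds this band matrix with strided-view tricks instead of loops):

```python
import torch
import torch.nn.functional as F

def fake_kernel_matrix(w, n):
    # w: (k,) one kernel row; returns (n, n) where row i is the length-n
    # fake kernel for position i: w inside the window centered at i, 0 elsewhere
    k = w.size(0)
    M = torch.zeros(n, n)
    for i in range(n):
        for j in range(k):
            pos = i + j - k // 2      # input position covered by the j-th kernel tap
            if 0 <= pos < n:
                M[i, pos] = w[j]
    return M

n, k = 10, 3
w = torch.softmax(torch.randn(k), dim=-1)     # one softmax-normalized kernel row
x = torch.randn(n)                            # one channel of one head
out_bmm = fake_kernel_matrix(w, n) @ x
out_conv = F.conv1d(x.view(1, 1, n), w.view(1, 1, k), padding=k // 2).view(n)
print(torch.allclose(out_bmm, out_conv))      # True
```

Stacking one such matrix per head (BH of them) yields exactly the W' used in the BMM above.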
(A better illustration will be added later...)
CUDA Implementation
The computation above is resource-hungry (the n×n fake kernels are mostly zeros), so consider writing a custom CUDA operator.
I will cover that in a separate article: CUDA implementation of the custom convolutional attention operator.
Experimental Results
To be continued
Summary
- This article introduced a convolutional substitute for the attention mechanism.
- Through careful design the convolution is made lightweight, while imitating attention's softmax to achieve good results.