
A Convolutional Substitute for the Attention Mechanism

2022-07-06 08:57:00 cyz0202

Reference: fairseq

Background

The commonly used attention mechanism is computed as follows:

Attention(Q,K,V) = Softmax(\frac{QK^T}{\sqrt{d}})V

Multiple heads are generally used to perform the above computation; this is time-consuming, since the QK^T term alone scales quadratically with the sequence length.
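For reference, a minimal single-head sketch of the formula above (illustrative code, not the fairseq implementation); the n×n score matrix produced by QK^T is where the quadratic cost comes from.

```python
import torch

def attention(Q, K, V):
    """Q, K, V: (n, d). Returns (n, d), as in Softmax(QK^T / sqrt(d)) V."""
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d ** 0.5   # (n, n): quadratic in sequence length
    return torch.softmax(scores, dim=-1) @ V
```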

Convolution has long been used in NLP, but the usual ways of applying it are not cheap computationally, and the results are not as good as attention.

Can the way convolution is applied in NLP be improved so that it is both fast and effective?

One idea is depthwise convolution (to reduce computation), combined with imitating attention's softmax mechanism (weighted averaging).


The improved scheme

The common depthwise convolution is computed as follows:

O_{i,c} = \mathrm{DepthwiseConv}(X, W_{c,:}, i, c) = \sum_{j=1}^{k} X_{\left(i+j-\left\lfloor \frac{k+1}{2} \right\rfloor\right),\,c} \cdot W_{c,j} \qquad (1)

W is the kernel, W \in R^{d \times k}; the number of parameters is d \cdot k, e.g. with d=1024 and k=7, d \cdot k = 7168.
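As a concrete reference for formula (1), here is a minimal PyTorch sketch of ordinary depthwise convolution over a length-n sequence (illustrative names, not the fairseq code): with groups=d, each of the d channels is convolved only with its own k-tap filter, giving d*k parameters.

```python
import torch
import torch.nn.functional as F

d, k, n = 1024, 7, 128                 # channels, kernel size, sequence length
x = torch.randn(1, d, n)               # (batch, channels, length)
W = torch.randn(d, 1, k)               # one filter per channel: d * k = 7168 parameters

out = F.conv1d(x, W, padding=k // 2, groups=d)   # depthwise convolution: (1, d, n)
```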

-------

To further reduce the amount of computation, we can let W \in R^{H \times k}, where generally H \ll d, e.g. H=16 and d=1024.

Since H is smaller than d, the rows of W must be reused along the original d (channel) dimension in order to convolve all of x, i.e. different channels may share the same row of W.

There are two ways to reuse: one is to tile W along the d dimension, i.e. W is reused once for every H channels;

The other is to repeat each row of W along the d dimension, i.e. during the convolution every d/H consecutive channels of x correspond to one row of W; for example, channels [0, d/H) of x use row 0 of W for their depthwise convolution. (This is a bit convoluted; formula (3) below makes it more intuitive.)

The second scheme, consecutive repetition, is used here. Thinking in terms of Attention's multi-heads, each head (of size d/H) can be computed independently; the second scheme is equivalent to giving each block of d/H channels of x (each head along the d dimension) its own kernel row.

Careful readers will notice that this H is in fact imitating the H of multi-heads in Attention.

With the above design, the number of parameters in W drops to H \cdot k (e.g. with H=16 and k=7, only 112 parameters).
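A small sketch of the two reuse schemes described above (illustrative names): the first tiles the H rows across the d channels; the second, used here, repeats each row for a contiguous block of d/H channels, matching the multi-head layout.

```python
import torch

d, H, k = 1024, 16, 7
W = torch.randn(H, k)                          # shared kernel: only H * k = 112 parameters

# Scheme 1: tile W along the channel dimension (the H rows repeat every H channels)
W_tiled = W.repeat(d // H, 1)                  # (d, k)

# Scheme 2 (used here): each row covers d/H consecutive channels,
# so channels [0, d/H) use row 0, channels [d/H, 2d/H) use row 1, ...
W_heads = W.repeat_interleave(d // H, dim=0)   # (d, k)
```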

-------

Besides reducing computation, the convolution can also imitate Attention's softmax step.

Looking at formula (1), W_{c,j} plays a role very similar to the output of Attention's softmax, namely a weighted average.

So why not make W_{c,:} a distribution as well?

So let

W_{c,:} = \mathrm{Softmax}(W_{c,:}), \qquad \mathrm{Softmax}(W)_{c,j} = \frac{e^{W_{c,j}}}{\sum_{j'} e^{W_{c,j'}}} \qquad (2)

-------

The final convolution is calculated as follows

O_{i,c} = \mathrm{DepthwiseConv}\left(X, W_{\left\lceil \frac{cH}{d} \right\rceil,:}, i, c\right) = \sum_{j=1}^{k} X_{\left(i+j-\left\lfloor \frac{k+1}{2} \right\rfloor\right),\,c} \cdot W_{\left\lceil \frac{cH}{d} \right\rceil,\,j} \qquad (3)

Note: the ceiling in W's subscript assumes c \in [1, d]; it can also be changed to a floor with c \in [0, d-1].
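Putting formulas (2) and (3) together, here is a minimal sketch of the resulting lightweight depthwise convolution built on a standard conv1d (a sketch in the spirit of fairseq's LightweightConv, not its exact code; the names are illustrative).

```python
import torch
import torch.nn.functional as F

def light_conv(x, W, H):
    """x: (B, d, n) input; W: (H, k) shared kernel. Returns (B, d, n)."""
    B, d, n = x.shape
    k = W.size(1)
    W = torch.softmax(W, dim=-1)                         # formula (2): each row sums to 1
    W = W.repeat_interleave(d // H, dim=0).unsqueeze(1)  # (d, 1, k): channel c uses row floor(cH/d)
    return F.conv1d(x, W, padding=k // 2, groups=d)      # formula (3), depthwise

out = light_conv(torch.randn(2, 1024, 128), torch.randn(16, 7), H=16)   # (2, 1024, 128)
```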


Concrete implementation

The row-repetition scheme above cannot be computed directly with existing convolution operators;

To still use matrix computation, a compromise can be made:

Let W \Rightarrow W' \in R^{BH \times n \times n} and x \Rightarrow x' \in R^{BH \times n \times \frac{d}{H}}, then perform a BMM (batch matrix multiplication) to obtain

Out = W' \cdot x', \qquad Out \in R^{BH \times n \times \frac{d}{H}} \;\;(\text{equivalently } R^{B \times n \times d} \text{ after reshaping}) \qquad (4)

This design guarantees that Out has the shape we need; but how do we also guarantee that its values are correct, i.e. how should the concrete values of W' and x' be chosen?

This is not hard: just follow the ideas from the previous section, repeating the rows of W and performing the multi-head computation on x.

Consider x \Rightarrow x' \in R^{BH \times n \times \frac{d}{H}}: this is a simple reshape. The BH in dimension 1 stands for batch × H; assuming batch = 1, these are the H computation heads. The n in dimension 2 is the sequence length, i.e. n positions. The last dimension (the channel dimension) represents one head, i.e. the size d/H of a single computation head.

In the convolution computation, each head of x' (the last dimension, of size d/H; H heads in total) corresponds, in order, to one row of W (of size k; the H heads correspond to the H rows of W). Meanwhile, dimension 2 of x' says that the convolution is evaluated at n positions, and the computation at each position is a weighted sum over a window of size k with a fixed sequence of parameter values.

Therefore, the BH in dimension 1 of W' holds the convolution parameters corresponding one-to-one to the BH heads of x' (with B=1, the H heads correspond to the H rows of kernel parameters);

the n in dimension 2 of W' represents the n positions at which the convolution is evaluated;

the n in dimension 3 of W' is the k parameters of the kernel row for the current head, extended to n parameters by zero-filling.

The reason the row of k kernel parameters is padded out to length n is that the convolution above slides a window of fixed size k over a sequence of length n;

Because the number of convolution parameters is constant while their position changes, this looks like an irregular computation. However, although the window slides, it slides over a sequence of length n; so can we construct a fake kernel of length n that holds the real kernel parameter values inside the window and zeros at every position outside it? That gives a fixed-length-n fake kernel that can be computed in a unified form.

To illustrate with an example: suppose the position being computed is 10; the center of the window is then also position 10, and the window might span [7, 13]. Filling every position outside the window with 0 yields a parameter sequence of length n, which can be combined with the length-n input x using a uniform n×n dot-product computation.
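A tiny numeric sketch of this example (illustrative values): the k=7 kernel weights are written into positions 7..13 of an otherwise all-zero row of length n, so a plain dot product with the length-n input performs exactly the windowed weighted sum at position 10.

```python
import torch

n, k, i = 20, 7, 10
w = torch.softmax(torch.randn(k), dim=-1)    # one (normalized) kernel row
row = torch.zeros(n)                         # the "fake kernel" of length n
row[i - k // 2 : i - k // 2 + k] = w         # window [7, 13] gets the real weights, rest stays 0

x_chan = torch.randn(n)                      # one channel of the length-n input
out_i = row @ x_chan                         # convolution output at position 10 for this channel
```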

Of course, with this filling scheme the fill sequence (fake kernel) is different at every position, because the values of the current kernel row stay the same while their positions change;

readers who have looked at how deep learning frameworks implement convolution may find this filling scheme familiar.

The description above may still be confusing. Think of it as the matrix multiplication R^{n \times n} \times R^{n \times \frac{d}{H}}: the first n is the n positions, while the second/third n describe the convolution performed at the current position; the reason it is n \times n is the zero-filling used to obtain a unified computation form.
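Below is a minimal end-to-end sketch of this BMM formulation (in the spirit of fairseq's expanded lightweight convolution, not its exact code; all names are illustrative): the band matrix W' \in R^{BH \times n \times n} is built from the softmax-normalized shared kernel, and a single bmm against x' \in R^{BH \times n \times \frac{d}{H}} reproduces formula (3).

```python
import torch

def light_conv_bmm(x, W, H):
    """x: (B, n, d) input; W: (H, k) shared kernel. Returns (B, n, d)."""
    B, n, d = x.shape
    k = W.size(1)
    W = torch.softmax(W, dim=-1)                       # formula (2)

    # x' in R^{BH x n x d/H}: one batch entry per (sample, head)
    x_p = x.view(B, n, H, d // H).permute(0, 2, 1, 3).reshape(B * H, n, d // H)

    # W' in R^{BH x n x n}: row i holds the k kernel taps centered at position i,
    # zero everywhere else (the zero-filled "fake kernel" band matrix)
    band = torch.zeros(H, n, n + k - 1)
    idx = torch.arange(n)
    for j in range(k):
        band[:, idx, idx + j] = W[:, j].unsqueeze(1)
    band = band[:, :, k // 2 : k // 2 + n]             # crop the padding -> (H, n, n)
    W_p = band.repeat(B, 1, 1)                         # (B*H, n, n)

    out = torch.bmm(W_p, x_p)                          # formula (4): (B*H, n, d/H)
    return out.view(B, H, n, d // H).permute(0, 2, 1, 3).reshape(B, n, d)

out = light_conv_bmm(torch.randn(2, 128, 1024), torch.randn(16, 7), H=16)   # (2, 128, 1024)
```

Up to the (B, n, d) vs. (B, d, n) layout difference, this should agree with a conv1d-based depthwise computation of formula (3), which is a convenient way to sanity-check the band construction.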

A better illustration will be added later...


CUDA implementation

The computation scheme above consumes a lot of resources, so a custom CUDA operator is worth considering;

I will cover this in a separate article: a CUDA implementation of the custom convolutional attention operator.


Experimental results

To be continued


Summary

- This article introduces a convolutional substitute for the attention mechanism;

- With some careful design, the convolution can be made lightweight while imitating Attention, achieving good results.


Copyright notice
This article was written by [cyz0202]; please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/187/202207060850360766.html