【ARXIV2204】Neighborhood Attention Transformer
2022-07-28 05:01:00 【AI frontier theory group @ouc】

Thanks to the Bilibili channel "The Alchemy Workshop of Saury" for its explanation; the analysis here combines material from several explanations.
Paper: https://arxiv.org/abs/2204.07143
Code: https://github.com/SHI-Labs/Neighborhood-Attention-Transformer
This paper is quite simple, and its core idea has already appeared in earlier papers. Look at the figure below: in standard ViT, attention is computed globally, so the red token in the first picture attends to all of the blue tokens across the whole image. Swin corresponds to the two middle figures: in the first step, token feature interaction is restricted to local windows; in the second step the windows are shifted, but feature interaction still stays within local windows. The last figure shows the Neighborhood Attention Transformer (NAT) proposed in this paper, where all attention is computed within a 7×7 neighborhood around each token. This looks like convolution, which also operates within the extent of one kernel. The difference is that NAT computes attention, so the weight of each value is determined by the input, rather than being a fixed value learned during training as in a convolution kernel.
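For reference, the neighborhood attention of a single token can be written roughly as below. This is my paraphrase, not the paper's exact notation: ρ(i) denotes the k×k neighborhood of token i, d the head dimension, and B a relative positional bias term added before the softmax.

```latex
% Neighborhood attention for token i (paraphrased sketch):
%   rho(i) : the k x k spatial neighborhood of token i
%   B      : learned relative positional bias
\mathrm{NA}_k(i) \;=\; \operatorname{softmax}\!\left(
    \frac{Q_i \, K_{\rho(i)}^{\top} + B_{(i,\,\rho(i))}}{\sqrt{d}}
\right) V_{\rho(i)}
```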

The authors also illustrate the attention computation, as shown in the figure below. For a C×H×W input tensor, the Query at a given location is a 1×C vector and the Key is a 3×3×C matrix covering its neighborhood. The two are multiplied element-wise (broadcasting across the mismatched sizes), producing a 3×3×C result; summing over the C dimension gives a 3×3 similarity matrix. This matrix is then used to weight the Values, which are finally merged into a single 1×1×C vector, the attention output for that location.
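To make this concrete, here is a minimal NumPy sketch of the per-location computation described above, for a single query pixel and a 3×3 neighborhood. It only illustrates the arithmetic: the scaled softmax is my addition, and the positional bias and multi-head split are omitted; this is not the authors' CUDA kernel.

```python
import numpy as np

def neighborhood_attention_single(query, keys, values, use_softmax=True):
    """Attention output for ONE query location over its k x k neighborhood.

    query:  (C,)        feature vector at the query pixel
    keys:   (k, k, C)   key vectors of the k x k neighborhood
    values: (k, k, C)   value vectors of the same neighborhood
    """
    # Element-wise multiply with broadcasting: (1, 1, C) * (k, k, C) -> (k, k, C),
    # then sum over the channel dimension -> (k, k) similarity map (dot products).
    sim = (query[None, None, :] * keys).sum(axis=-1)

    if use_softmax:
        # Standard scaled-softmax normalization (my addition; the paper also adds
        # a relative positional bias before the softmax, omitted here).
        sim = sim / np.sqrt(query.shape[0])
        sim = np.exp(sim - sim.max())
        sim = sim / sim.sum()

    # Weight the values by the similarity map and sum the neighborhood
    # into a single (C,) output vector for this query location.
    return (sim[..., None] * values).sum(axis=(0, 1))

# Toy usage: C=8 channels, 3x3 neighborhood.
C, k = 8, 3
rng = np.random.default_rng(0)
out = neighborhood_attention_single(rng.normal(size=C),
                                    rng.normal(size=(k, k, C)),
                                    rng.normal(size=(k, k, C)))
print(out.shape)  # (8,)
```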

The authors also analyze the computational complexity. Because attention is computed only within a local neighborhood, the complexity is greatly reduced and is essentially on par with Swin.
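Roughly, for H×W tokens of dimension d and a window/neighborhood size k, the per-layer attention cost compares as below (a back-of-the-envelope restatement from memory, with the linear projection terms shared by all three methods omitted):

```latex
% Approximate attention cost per layer (projections omitted); H x W tokens,
% channel dimension d, window / neighborhood size k:
\begin{aligned}
\text{global self-attention (ViT)}  &: \ \mathcal{O}\!\left((HW)^2 d\right) \\
\text{window attention (Swin)}      &: \ \mathcal{O}\!\left(HW\,k^2 d\right) \\
\text{neighborhood attention (NAT)} &: \ \mathcal{O}\!\left(HW\,k^2 d\right)
\end{aligned}
```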

The overall architecture of the network follows current practice: there are 4 stages, and the resolution is halved at each stage. However, the downsampling uses a 3×3 convolution with stride 2. The initial overlapping tokenizer uses two 3×3 convolutions, each with stride 2.
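A rough PyTorch sketch of the stem and the between-stage downsampler as described above; the channel widths, padding, and LayerNorm placement are my assumptions for illustration, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class OverlappingTokenizer(nn.Module):
    """Stem: two 3x3 stride-2 convolutions -> overall 4x downsampling."""
    def __init__(self, in_chans=3, embed_dim=64):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_chans, embed_dim // 2, kernel_size=3, stride=2, padding=1),
            nn.Conv2d(embed_dim // 2, embed_dim, kernel_size=3, stride=2, padding=1),
        )
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, C, H/4, W/4)
        x = x.permute(0, 2, 3, 1)              # channels-last for LayerNorm
        return self.norm(x)

class Downsampler(nn.Module):
    """Between stages: one 3x3 stride-2 convolution halves the resolution."""
    def __init__(self, dim):
        super().__init__()
        self.reduction = nn.Conv2d(dim, 2 * dim, kernel_size=3, stride=2, padding=1)
        self.norm = nn.LayerNorm(2 * dim)

    def forward(self, x):                      # x: (B, H, W, C), channels-last
        x = self.reduction(x.permute(0, 3, 1, 2))
        return self.norm(x.permute(0, 2, 3, 1))

tokens = OverlappingTokenizer()(torch.randn(1, 3, 224, 224))
print(tokens.shape)                   # torch.Size([1, 56, 56, 64])
print(Downsampler(64)(tokens).shape)  # torch.Size([1, 28, 28, 128])
```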

The authors design four network variants, all with a 7×7 neighborhood, as follows:

On image classification, NAT performs very well, as shown in the table below:

In the ablation study, the authors compare the performance impact of the positional embedding and of the attention computation. However, the model here reaches 81.4%, which differs from the 83.2% in the table above; the reason is unclear to me.

Overall, the idea of this paper is very simple, and many earlier papers have already reflected it. But this work was done jointly with industry, and the hard part is presumably the CUDA implementation: the authors wrote a substantial amount of CUDA code to accelerate the neighborhood operation.