【Transformer】TransMix: Attend to Mix for Vision Transformers

2022-07-29 06:03:00 【Dull cat】

Code: https://github.com/Beckschen/TransMix

1. Background and motivation

Mixup-based data augmentation is very useful for ViT-style architectures, because these models overfit easily. However, previous mixup-based methods carry an implicit prior: the linear interpolation ratio of the targets is the same as the interpolation ratio of the whole input images. As a result, a mixed image may contain no valid object from one of the source images and still be assigned that image's label.

To address this problem, the authors propose TransMix, which mixes the labels based on the attention map.

2. Method

2.1 Mixup

Raw input:

Mixup takes a pair of images $x_A$ and $x_B$, together with their labels $y_A$ and $y_B$, as input.

Input and ground-truth processing:

The two images are combined into a virtual training sample $\lambda x_A + (1-\lambda)x_B$ with the target $\lambda y_A + (1-\lambda)y_B$, where $\lambda \in [0,1]$ is a random number drawn from a Beta distribution.
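The Mixup step above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `mixup`, the one-hot label encoding, and the symmetric `Beta(alpha, alpha)` with `alpha=1.0` are choices made here, not details given in the text.

```python
import numpy as np

def mixup(x_a, x_b, y_a, y_b, alpha=1.0, rng=None):
    """Blend two images and their one-hot labels with one shared ratio."""
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)            # lambda in [0, 1] (Beta parameters are an assumption)
    x_mix = lam * x_a + (1.0 - lam) * x_b   # every pixel is blended with the same ratio
    y_mix = lam * y_a + (1.0 - lam) * y_b   # the label uses exactly the same ratio
    return x_mix, y_mix, lam

rng = np.random.default_rng(0)
x_a, x_b = np.ones((3, 4, 4)), np.zeros((3, 4, 4))    # toy "images" (C, H, W)
y_a, y_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])  # one-hot labels
x_mix, y_mix, lam = mixup(x_a, x_b, y_a, y_b, rng=rng)
print(lam, y_mix)
```

Note that the label ratio is tied to the pixel ratio regardless of what the pixels actually show, which is the prior the article questions.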

As Figure 1 of the paper shows, background pixels cannot contribute to the label as much as foreground pixels do; in other words, not all pixels contribute equally to the label.

This paper therefore focuses on how to align the input space and the label space in a learned way.

The authors found that the attention map produced by a vision transformer is well suited to this task.

As Figure 1 of the paper shows, the authors use the attention map to determine $\lambda$, so the labels are re-weighted: each pixel gets its own weight, instead of all pixels in the image being combined with the same value. Because the method relies only on the attention map, it can be applied to any ViT-based model and adds no extra parameters.

2.2 TransMix

CutMix data augmentation:

CutMix is a simple augmentation method: it pastes a region of one image onto another and combines the two labels into a new label.

$M\in\{0, 1\}^{HW}$ is a binary mask that decides which positions are dropped and which are kept.
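As a rough sketch, CutMix can be written with the mask $M$ as follows, assuming the standard formulation $\tilde{x} = M \odot x_A + (1 - M) \odot x_B$ and $\tilde{y} = \lambda y_A + (1-\lambda) y_B$, with $\lambda$ equal to the area share of $x_A$. The function name `cutmix` and the fixed box argument are illustrative assumptions; in practice the box is sampled at random.

```python
import numpy as np

def cutmix(x_a, x_b, y_a, y_b, box):
    """Paste a box of x_b into x_a; M is 1 where x_a is kept, 0 where x_b fills in."""
    top, bottom, left, right = box
    H, W = x_a.shape[-2:]
    M = np.ones((H, W))
    M[top:bottom, left:right] = 0.0        # region taken from x_b
    x_mix = M * x_a + (1.0 - M) * x_b      # M broadcasts over the channel axis
    lam = M.mean()                         # area-based ratio: share of pixels from x_a
    y_mix = lam * y_a + (1.0 - lam) * y_b
    return x_mix, y_mix, M, lam

x_a, x_b = np.ones((3, 4, 4)), np.zeros((3, 4, 4))
y_a, y_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
x_mix, y_mix, M, lam = cutmix(x_a, x_b, y_a, y_b, box=(0, 2, 0, 2))
print(lam)  # 0.75: 12 of the 16 pixels come from x_a
```

Here the label weight depends only on the box area, not on whether the pasted box actually contains the object.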

TransMix
$A$ is the attention map from the cls token to the image patch tokens; it represents how important each patch is to the final classification result. For multi-head attention, the authors average the attention maps over the heads.
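To make the shape of $A$ concrete, here is a small NumPy sketch that extracts the cls-to-patch attention and averages it over heads. It assumes the attention tensor has shape `(heads, 1 + p, 1 + p)` with the cls token at index 0; the function name `cls_attention` and the final renormalisation (so the patch weights sum to 1) are assumptions made for this illustration.

```python
import numpy as np

def cls_attention(attn):
    """attn: (heads, 1 + p, 1 + p) softmax attention, cls token at index 0.
    Returns A of shape (p,): head-averaged attention from cls to each patch."""
    cls_to_patches = attn[:, 0, 1:]     # cls query row, patch key columns -> (heads, p)
    A = cls_to_patches.mean(axis=0)     # average over heads
    return A / A.sum()                  # renormalise over patches (assumption)

rng = np.random.default_rng(0)
heads, p = 3, 4
logits = rng.normal(size=(heads, 1 + p, 1 + p))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # row-wise softmax
A = cls_attention(attn)
print(A.shape, A.sum())
```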

Using the attention map to mix the labels:

$\lambda = A \cdot \downarrow(M)$

The down arrow $\downarrow(\cdot)$ denotes nearest-neighbor interpolation, which down-samples $M$ from $HW$ pixels to $p$ pixels.

In this way, the network dynamically assigns the label weight of each position based on the attention map.
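The attention-based weight can be sketched for a single sample as a dot product between the attention map $A$ and the down-sampled mask. The helper names `downsample_nearest` and `transmix_lambda`, the square patch grid, and the floor-based nearest-neighbor indexing are assumptions of this sketch, and $A$ is assumed to sum to 1 over the patches.

```python
import numpy as np

def downsample_nearest(M, grid):
    """Nearest-neighbor down-sampling of an (H, W) mask to grid*grid = p values."""
    H, W = M.shape
    rows = (np.arange(grid) * H) // grid   # source row picked for each output row
    cols = (np.arange(grid) * W) // grid   # source column picked for each output column
    return M[np.ix_(rows, cols)].reshape(-1)

def transmix_lambda(A, M, grid):
    """lambda = A . down(M): attention mass on the patches where M is 1."""
    return float(A @ downsample_nearest(M, grid))

M = np.ones((4, 4))
M[0:2, 0:2] = 0.0                                   # top-left quarter comes from x_B
A_uniform = np.full(4, 0.25)                        # attention spread evenly over 4 patches
A_focused = np.array([0.7, 0.1, 0.1, 0.1])          # attention concentrated on the pasted patch
lam_uniform = transmix_lambda(A_uniform, M, grid=2)  # equals the area ratio, 0.75
lam_focused = transmix_lambda(A_focused, M, grid=2)  # about 0.3: y_A gets less weight
print(lam_uniform, lam_focused)
```

With uniform attention the result falls back to the area-based CutMix ratio; when the attention concentrates on the pasted patch, the label weight shifts toward the pasted image even though it covers only a quarter of the pixels.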

Pseudocode:
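The following is a minimal NumPy sketch of the batch-level label mixing, not the authors' code. It assumes that each sample is paired with its mirror in the mini-batch, that the mask has already been down-sampled to $p$ values and is 1 where a patch comes from the paired sample, and that each row of the attention matrix sums to 1. The function name `transmix_labels` is hypothetical.

```python
import numpy as np

def transmix_labels(y, A, M_patch):
    """y: (b, classes) one-hot labels; A: (b, p) class attention, rows sum to 1;
    M_patch: (p,) down-sampled mask, 1 where the patch was pasted from the paired sample."""
    lam = (A @ M_patch)[:, None]       # (b, 1) attention mass on the pasted patches
    y_partner = y[::-1]                # labels of the paired samples (batch flipped)
    return (1.0 - lam) * y + lam * y_partner

y = np.array([[1.0, 0.0],
              [0.0, 1.0]])
A = np.array([[0.70, 0.10, 0.10, 0.10],    # sample 0 attends mostly to the pasted patch
              [0.25, 0.25, 0.25, 0.25]])   # sample 1 attends uniformly
M_patch = np.array([1.0, 0.0, 0.0, 0.0])   # only the first patch was pasted
y_mix = transmix_labels(y, A, M_patch)
print(y_mix)
```

In a training loop the attention matrix would come from the same forward pass that produces the logits, so the mixed labels are only available after the model has seen the mixed images.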

3. Results

In the TransMix visualization, the first row shows area-based label assignment, where a region of image A is pasted onto image B. TransMix uses the attention map to correct the labels, raising the label weight of the regions that the attention map highlights.