An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
2022-06-13 02:16:00 【zjuPeco】
1 Summary
This paper is a landmark work that introduced the transformer into the image domain. It is the first to discard all convolution modules when processing images and rely on attention alone, and its experiments show that attention by itself can outperform convolutional networks on image classification.
The body of the paper is not hard to follow, provided you are familiar with how the transformer works. If you don't know it, or want a refresher, you can read my other post, Understand Transformer.
The vision transformer proposed in the paper only adds some tricks around the transformer's input and output; the transformer itself is left unchanged.
Convolution modules had been all but irreplaceable in the image domain. The authors made this attempt because the transformer has been hugely successful in NLP and might work wonders on images as well, and because, at the same parameter count, the transformer is more computationally efficient than a convolution module.
Experimentally, the authors find that when the amount of training data is small (e.g. ImageNet), the vision transformer performs slightly worse than mainstream convolutional networks such as ResNet. But when a large dataset is available for pre-training (e.g. Google's internal JFT-300M), the vision transformer shows its advantage: after pre-training on the large dataset and then fine-tuning on a small one, it beats the other mainstream convolutional classification models.
Convolutional networks come with strong inductive biases about images. First, the convolution kernel tells the network that each pixel is strongly correlated with the pixels around it (locality); second, the kernel's weight-sharing mechanism tells the network that an object is still the same object after it moves within the image (translation equivariance). The vision transformer knows neither of these things, so it needs more data to learn them.
2 Method Overview
The structure of the vision transformer is not complicated and is clear from the schematic in Figure 2-1 below. If it still feels unclear, a look at the code in reference [4] or [5] makes everything obvious.
Broadly it splits into two parts: what happens before the encoder and what happens after it. The transformer encoder in the middle needs no introduction; it is a standard transformer encoder, except that it can be stacked L layers deep. You could of course also swap the encoder for some other transformer variant, such as BERT.
2.1 Before the Encoder
The input image is first cut into patches. Code implementations generally use a patch_size to specify each patch's height and width. After cutting, the patches are arranged into a sequence from left to right, top to bottom; each patch is flattened into an input of patch_size[0] × patch_size[1] values (times the number of channels for RGB images) and passed through an embedding layer, as sketched below. The output corresponds to the pink modules next to the numbers 1-9 below the transformer encoder in Figure 2-1.
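To make the patchify-and-embed step concrete, here is a minimal PyTorch sketch. The class and argument names are my own, not taken from the paper or from reference [4]; the defaults (224×224 images, 16×16 patches, 768-dim tokens) match the ViT-Base setup.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = (img_size // patch_size) ** 2
        # Each flattened patch (patch_size * patch_size * in_chans values)
        # is projected to a dim-dimensional token by a single linear layer.
        self.proj = nn.Linear(patch_size * patch_size * in_chans, dim)

    def forward(self, x):                          # x: (B, C, H, W)
        p = self.patch_size
        B, C, H, W = x.shape
        x = x.unfold(2, p, p).unfold(3, p, p)      # (B, C, H//p, W//p, p, p)
        # Arrange patches left to right, top to bottom, flatten each one.
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        return self.proj(x)                        # (B, num_patches, dim)
```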
Then an extra special token * is concatenated at the head of the sequence. Its dimension is the same as that of a patch after embedding, and it is learnable. This token plays the same role as BERT's class token.
At the same time, a position embedding, which is also learnable, is added to every token. Note that it is added, not concatenated. Both steps are sketched below.
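Continuing the sketch with names of my own choosing, the class token and position embedding could look like this. The two key points from the text are visible in the code: the class token is concatenated along the sequence axis, while the position embedding is added element-wise.

```python
class TokenPreparation(nn.Module):
    def __init__(self, num_patches, dim=768):
        super().__init__()
        # Learnable class token and position embeddings (zero init here
        # for brevity; real implementations use a truncated normal init).
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x):                          # x: (B, num_patches, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)             # class token is concatenated
        return x + self.pos_embed                  # position info is *added*
```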
2.2 After the Encoder
Each input to the transformer encoder has a corresponding output. For image classification, we only take the output corresponding to the 0th embedding (the class token) and pass it through a few fully connected layers to classify.
Seen this way, this 0th token acts like a housekeeper that the model learns by itself: it gathers up the information from all the patches (see the sketch below).
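A sketch of this post-processing step, under the same assumed names as above. Whether a LayerNorm sits before the head varies between implementations, so treat that as my choice rather than the paper's.

```python
class ClassificationHead(nn.Module):
    def __init__(self, dim=768, num_classes=1000):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, encoded):                    # encoded: (B, 1 + num_patches, dim)
        cls_out = encoded[:, 0]                    # keep only the 0th token's output
        return self.fc(self.norm(cls_out))         # (B, num_classes)
```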
There is really nothing else to it; in short, it is preprocessing + post-processing + a huge dataset.
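To show how little there really is, the three sketches above can be composed with PyTorch's stock nn.TransformerEncoder standing in for the middle part. The depth, head count, and width below match ViT-Base, but this is an illustrative assembly of my own, not the paper's reference implementation (I set norm_first=True since ViT uses pre-norm; this flag requires a recent PyTorch).

```python
class MiniViT(nn.Module):
    def __init__(self, img_size=224, patch_size=16, dim=768,
                 depth=12, heads=12, num_classes=1000):
        super().__init__()
        self.patches = PatchEmbedding(img_size, patch_size, 3, dim)
        self.tokens = TokenPreparation(self.patches.num_patches, dim)
        # The standard transformer encoder in the middle, stacked depth (= L) layers deep.
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = ClassificationHead(dim, num_classes)

    def forward(self, img):                        # img: (B, 3, H, W)
        x = self.tokens(self.patches(img))         # preprocessing
        return self.head(self.encoder(x))          # encoder + post-processing

model = MiniViT()
logits = model(torch.randn(2, 3, 224, 224))        # -> shape (2, 1000)
```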
3 Experimental Results
The authors compare the vision transformer against BiT (Big Transfer), which is also a model of their own. Overall it performs better on each dataset while also needing noticeably less training compute; the specific numbers are shown in Table 3-1 below.

Besides this, the authors also visualize what the model has learned, as shown in Figure 3-1 below.

The left panel of Figure 3-1 shows the principal components of the weights learned by the filters that embed each patch of the input image. You can see that different filters attend to different positions and different textures within a patch.
The middle panel of Figure 3-1 shows the similarity between each patch's position embedding and those of all other patches. You can see that the position embeddings did learn the distances between patches.
The right panel of Figure 3-1 shows the mean attention distance of the model's 16 heads at each layer. You can see that even in the shallow layers, some heads already attend to patches far away from themselves, which is impossible for a CNN, whose shallow layers have a small receptive field.
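The middle-panel analysis is easy to reproduce on a trained model. As a sketch, reusing the assumed names from the code above and assuming 224/16 = 14 patches per side:

```python
import torch.nn.functional as F

# Cosine similarity between one patch's position embedding and all others,
# reshaped back onto the 14x14 patch grid; only meaningful after training.
pe = model.tokens.pos_embed[0, 1:]               # (196, 768), class-token slot dropped
sim = F.cosine_similarity(pe[:1], pe, dim=-1)    # similarity of patch 0 to every patch
grid = sim.reshape(14, 14)                       # one small heat map of Figure 3-1 (middle)
```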
References
[1] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
[2] Yannic Kilcher's video explanation of the vision transformer
[3] The Alchemy Workshop of Saury: vision transformer
[4] https://github.com/lucidrains/vit-pytorch
[5] https://keras.io/examples/vision/image_classification_with_vision_transformer/