An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
2022-06-13 02:16:00 【zjuPeco】
1 Summary
This paper is a landmark work that introduced the transformer into the image domain. It is the first to discard all convolution modules when processing images and rely on attention alone, and its experiments show that attention by itself can outperform convolutional networks on image classification.
The body of the paper is not hard to follow, provided you are familiar with how the transformer works. If you don't know it, or want a refresher, you can read my other post, Understand Transformer.
The vision transformer proposed in the paper only adds some tricks around the transformer's input and output; the transformer itself is left unchanged.
Convolution modules had been all but irreplaceable in the image domain. The authors made this attempt because the transformer has been hugely successful in NLP and might work wonders on images as well, and because, at the same parameter count, the transformer is more computationally efficient than a convolution module.
Experimentally, the authors find that when the amount of training data is small (e.g. ImageNet), the vision transformer performs slightly worse than mainstream convolutional networks such as ResNet. But when a large dataset is available for pre-training (e.g. Google's internal JFT-300M), the vision transformer shows its advantage: after pre-training on the large dataset and then fine-tuning on a small one, it beats the other mainstream convolutional classification models.
Convolutional networks come with strong inductive biases about images. First, the convolution kernel tells the network that each pixel is strongly correlated with the pixels around it (locality); second, the kernel's weight-sharing mechanism tells the network that an object is still the same object after it moves within the image (translation equivariance). The vision transformer knows neither of these things, so it needs more data to learn them.
2 Method Overview
The structure of the vision transformer is not complicated and is clear from the schematic in Figure 2-1 below. If it still feels unclear, a look at the code in reference [4] or [5] makes everything obvious.
Broadly it splits into two parts: what happens before the encoder and what happens after it. The transformer encoder in the middle needs no introduction; it is a standard transformer encoder, except that it can be stacked L layers deep. You could of course also swap the encoder for some other transformer variant, such as BERT.
2.1 Before the Encoder
The input image is first cut into patches. Code implementations generally use a patch_size to specify each patch's height and width. After cutting, the patches are arranged into a sequence from left to right, top to bottom; each patch is flattened into an input of patch_size[0] × patch_size[1] values (times the number of channels for RGB images) and passed through an embedding layer, as sketched below. The output corresponds to the pink modules next to the numbers 1-9 below the transformer encoder in Figure 2-1.
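To make the patchify-and-embed step concrete, here is a minimal PyTorch sketch. The class and argument names are my own, not taken from the paper or from reference [4]; the defaults (224×224 images, 16×16 patches, 768-dim tokens) match the ViT-Base setup.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.patch_size = patch_size
        self.num_patches = (img_size // patch_size) ** 2
        # Each flattened patch (patch_size * patch_size * in_chans values)
        # is projected to a dim-dimensional token by a single linear layer.
        self.proj = nn.Linear(patch_size * patch_size * in_chans, dim)

    def forward(self, x):                          # x: (B, C, H, W)
        p = self.patch_size
        B, C, H, W = x.shape
        x = x.unfold(2, p, p).unfold(3, p, p)      # (B, C, H//p, W//p, p, p)
        # Arrange patches left to right, top to bottom, flatten each one.
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        return self.proj(x)                        # (B, num_patches, dim)
```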
Then an extra special token * is concatenated at the head of the sequence. Its dimension is the same as that of a patch after embedding, and it is learnable. This token plays the same role as BERT's class token.
At the same time, a position embedding, which is also learnable, is added to every token. Note that it is added, not concatenated. Both steps are sketched below.
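Continuing the sketch with names of my own choosing, the class token and position embedding could look like this. The two key points from the text are visible in the code: the class token is concatenated along the sequence axis, while the position embedding is added element-wise.

```python
class TokenPreparation(nn.Module):
    def __init__(self, num_patches, dim=768):
        super().__init__()
        # Learnable class token and position embeddings (zero init here
        # for brevity; real implementations use a truncated normal init).
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x):                          # x: (B, num_patches, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)             # class token is concatenated
        return x + self.pos_embed                  # position info is *added*
```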
2.2 After the Encoder
Each input to the transformer encoder has a corresponding output. For image classification, we only take the output corresponding to the 0th embedding (the class token) and pass it through a few fully connected layers to classify.
Seen this way, this 0th token acts like a housekeeper that the model learns by itself: it gathers up the information from all the patches (see the sketch below).
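A sketch of this post-processing step, under the same assumed names as above. Whether a LayerNorm sits before the head varies between implementations, so treat that as my choice rather than the paper's.

```python
class ClassificationHead(nn.Module):
    def __init__(self, dim=768, num_classes=1000):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, encoded):                    # encoded: (B, 1 + num_patches, dim)
        cls_out = encoded[:, 0]                    # keep only the 0th token's output
        return self.fc(self.norm(cls_out))         # (B, num_classes)
```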
There is really nothing else to it; in short, it is preprocessing + post-processing + a huge dataset.
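To show how little there really is, the three sketches above can be composed with PyTorch's stock nn.TransformerEncoder standing in for the middle part. The depth, head count, and width below match ViT-Base, but this is an illustrative assembly of my own, not the paper's reference implementation (I set norm_first=True since ViT uses pre-norm; this flag requires a recent PyTorch).

```python
class MiniViT(nn.Module):
    def __init__(self, img_size=224, patch_size=16, dim=768,
                 depth=12, heads=12, num_classes=1000):
        super().__init__()
        self.patches = PatchEmbedding(img_size, patch_size, 3, dim)
        self.tokens = TokenPreparation(self.patches.num_patches, dim)
        # The standard transformer encoder in the middle, stacked depth (= L) layers deep.
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = ClassificationHead(dim, num_classes)

    def forward(self, img):                        # img: (B, 3, H, W)
        x = self.tokens(self.patches(img))         # preprocessing
        return self.head(self.encoder(x))          # encoder + post-processing

model = MiniViT()
logits = model(torch.randn(2, 3, 224, 224))        # -> shape (2, 1000)
```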
3 Experimental Results
The authors compare the vision transformer against BiT (Big Transfer), which is also a model of their own. Overall it performs better on each dataset while also needing noticeably less training compute; the specific numbers are shown in Table 3-1 below.

Besides this, the authors also visualize what the model has learned, as shown in Figure 3-1 below.

The left panel of Figure 3-1 shows the principal components of the weights learned by the filters that embed each patch of the input image. You can see that different filters attend to different positions and different textures within a patch.
The middle panel of Figure 3-1 shows the similarity between each patch's position embedding and those of all other patches. You can see that the position embeddings did learn the distances between patches.
The right panel of Figure 3-1 shows the mean attention distance of the model's 16 heads at each layer. You can see that even in the shallow layers, some heads already attend to patches far away from themselves, which is impossible for a CNN, whose shallow layers have a small receptive field.
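The middle-panel analysis is easy to reproduce on a trained model. As a sketch, reusing the assumed names from the code above and assuming 224/16 = 14 patches per side:

```python
import torch.nn.functional as F

# Cosine similarity between one patch's position embedding and all others,
# reshaped back onto the 14x14 patch grid; only meaningful after training.
pe = model.tokens.pos_embed[0, 1:]               # (196, 768), class-token slot dropped
sim = F.cosine_similarity(pe[:1], pe, dim=-1)    # similarity of patch 0 to every patch
grid = sim.reshape(14, 14)                       # one small heat map of Figure 3-1 (middle)
```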
References
[1] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
[2] Yannic Kilcher's video explanation of the vision transformer
[3] The Alchemy Workshop of Saury: vision transformer
[4] https://github.com/lucidrains/vit-pytorch
[5] https://keras.io/examples/vision/image_classification_with_vision_transformer/