Deep learning - patches are all you need

2022-07-28 06:09:00 【Food to doubt life】

Contents

Preface
ConvMixer Structure
Some ideas for designing network structure
Reflection

Preface

The paper was submitted to ICLR 2022; at the time of writing, there is no record of it being accepted.

Since ViT appeared in 2020, papers on transformers have been coming out one after another. Does ViT perform so well because of its network structure, or because of its input format? The paper designs the ConvMixer network and uses it to argue that ViT's strong performance may come from its patch-based input format.

This post briefly introduces the structure of ConvMixer, summarizes several ideas for designing network structures that ViT has inspired, and closes with my own views on the paper.

ConvMixer Structure

Like ViT, ConvMixer cuts the input image into non-overlapping patches, each of size $p \times p$. It then applies $h$ convolution kernels of size $p \times p$ with stride $p$, which yields a feature map of size $h \times n/p \times n/p$ for an input image of size $n \times n$. This step is the patch embedding used in ViT (patch embedding is itself equivalent to this convolution).
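
The equivalence between patch embedding and a strided convolution can be checked numerically. The following is a minimal NumPy sketch, not code from the paper: the function names `patch_embed_conv` and `patch_embed_linear` are made up for illustration, biases are left out, and the image is assumed to be square with a side length divisible by $p$.

```python
import numpy as np

def patch_embed_conv(image, weight, p):
    """Convolution with kernel p x p and stride p (no padding).

    image:  (c, n, n) input with c channels
    weight: (h, c, p, p) bank of h kernels
    returns a feature map of shape (h, n // p, n // p)
    """
    c, n, _ = image.shape
    h = weight.shape[0]
    m = n // p
    out = np.zeros((h, m, m))
    for i in range(m):
        for j in range(m):
            patch = image[:, i * p:(i + 1) * p, j * p:(j + 1) * p]  # (c, p, p)
            # Dot product of every kernel with the current patch.
            out[:, i, j] = np.tensordot(weight, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

def patch_embed_linear(image, weight, p):
    """ViT-style view: flatten each patch, apply one shared linear layer."""
    c, n, _ = image.shape
    h = weight.shape[0]
    m = n // p
    # (c, m, p, m, p) -> (m, m, c, p, p) -> one flattened row per patch
    patches = image.reshape(c, m, p, m, p).transpose(1, 3, 0, 2, 4).reshape(m * m, c * p * p)
    tokens = patches @ weight.reshape(h, c * p * p).T  # (m * m, h)
    return tokens.T.reshape(h, m, m)

rng = np.random.default_rng(0)
image = rng.normal(size=(3, 8, 8))       # n = 8
weight = rng.normal(size=(5, 3, 4, 4))   # h = 5, p = 4
conv_out = patch_embed_conv(image, weight, p=4)
linear_out = patch_embed_linear(image, weight, p=4)
print(conv_out.shape, np.allclose(conv_out, linear_out))
```

Both functions use the same weights, only reshaped, so the two views should agree up to floating-point error.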

The data then passes through $d$ ConvMixer layers. In each layer, a depthwise convolution plays the role of the spatial mixing in MLP-Mixer and mixes spatial information, while a pointwise convolution plays the role of the channel-wise mixing and mixes channel information. ConvMixer layers do no downsampling, so the feature maps output by different layers all have the same resolution and number of channels.
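
As a rough illustration of the two mixing steps, here is a simplified NumPy sketch of one such layer. It is an assumption-laden toy, not the authors' implementation: it keeps only the depthwise and pointwise convolutions described above, leaves out any activation, normalization, or skip connection, and the names `depthwise_conv`, `pointwise_conv`, and `convmixer_layer` are hypothetical.

```python
import numpy as np

def depthwise_conv(x, w):
    """Each channel is filtered by its own k x k kernel (zero 'same' padding).

    x: (h, m, m) feature map, w: (h, k, k) with odd k.
    Channels never interact, so only spatial information is mixed.
    """
    h, m, _ = x.shape
    k = w.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros(x.shape)
    for a in range(k):
        for b in range(k):
            out += w[:, a, b][:, None, None] * xp[:, a:a + m, b:b + m]
    return out

def pointwise_conv(x, w):
    """1 x 1 convolution: x (h, m, m), w (h_out, h).

    Each spatial position is handled independently, so only channels are mixed.
    """
    return np.einsum("oc,cij->oij", w, x)

def convmixer_layer(x, w_dw, w_pw):
    """Spatial mixing followed by channel mixing, with no downsampling."""
    return pointwise_conv(depthwise_conv(x, w_dw), w_pw)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6, 6))
y = convmixer_layer(x, rng.normal(size=(4, 3, 3)), rng.normal(size=(4, 4)))
print(x.shape == y.shape)  # resolution and channel count are preserved
```

Because the padding keeps the spatial size and the pointwise weight is square, stacking $d$ of these layers leaves the feature map shape unchanged.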

The feature map output by the last layer goes through global pooling and is then fed into a classifier.
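
The classification head can be sketched in a few lines. This is a minimal example under two assumptions the text above does not spell out: the global pooling is taken to be average pooling, and the classifier is taken to be a single linear layer. The name `global_pool_classify` is hypothetical.

```python
import numpy as np

def global_pool_classify(feature_map, w_cls, b_cls):
    """feature_map: (h, m, m); w_cls: (num_classes, h); b_cls: (num_classes,)."""
    pooled = feature_map.mean(axis=(1, 2))  # global average pooling -> (h,)
    return w_cls @ pooled + b_cls           # class logits

rng = np.random.default_rng(0)
logits = global_pool_classify(
    rng.normal(size=(4, 6, 6)), rng.normal(size=(10, 4)), np.zeros(10)
)
print(logits.shape, int(np.argmax(logits)))
```

Pooling collapses the spatial dimensions, so the head works for any feature map resolution.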

The paper does not contain many experiments; it reports results on ImageNet-1k. ConvMixer-1536/20 denotes a network whose feature maps have 1536 channels and whose depth is 20, and so on for the other variants.
According to the reported results, ConvMixer-1536/20 performs better than the plain Mixer-B/16 while using fewer parameters.
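
The naming convention described above (channels before the slash, depth after it) is easy to decode. The helper below is a hypothetical illustration, not part of any ConvMixer codebase.

```python
import re

def parse_convmixer_name(name):
    """Split a name like 'ConvMixer-1536/20' into channel count and depth."""
    match = re.fullmatch(r"ConvMixer-(\d+)/(\d+)", name)
    if match is None:
        raise ValueError(f"not a ConvMixer-<channels>/<depth> name: {name!r}")
    return {"channels": int(match.group(1)), "depth": int(match.group(2))}

print(parse_convmixer_name("ConvMixer-1536/20"))
```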

Some ideas for designing network structure

After ViT came out, a family of isotropic architectures emerged: they start with patch embeddings, and the feature maps output by different layers have exactly the same number of channels and resolution. The ConvMixer proposed in this paper is one example.

There is also ongoing work that combines ViT with CNNs.

Reflection

This paper is very similar to MLP-Mixer. Many operations in MLP-Mixer are themselves equivalent to depthwise and pointwise convolutions, so the two models do not differ much in structure.
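
One instance of this equivalence can be verified directly: a linear layer applied to each token's channel vector gives the same result as a pointwise (1 x 1) convolution over the feature map. The sketch below is a NumPy illustration that covers a single linear layer only, without the nonlinearity of a full MLP block; the function names are hypothetical.

```python
import numpy as np

def channel_mixing_linear(tokens, w):
    """tokens: (num_patches, h); the same linear map is applied to every token."""
    return tokens @ w.T

def pointwise_conv(x, w):
    """1 x 1 convolution on a feature map x of shape (h, m, m); w is (h_out, h)."""
    return np.einsum("oc,cij->oij", w, x)

rng = np.random.default_rng(0)
h, m, h_out = 4, 5, 6
x = rng.normal(size=(h, m, m))
w = rng.normal(size=(h_out, h))

tokens = x.reshape(h, m * m).T                        # one row per patch position
via_linear = channel_mixing_linear(tokens, w).T.reshape(h_out, m, m)
via_conv = pointwise_conv(x, w)
print(np.allclose(via_linear, via_conv))
```

The two paths differ only in how the data is laid out (a token sequence versus a spatial grid), not in the arithmetic.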

The ConvMixer designed by the authors contains two ingredients of ViT: the isotropic design and patch embedding. The paper does not discuss which of the two has the greater impact on performance. Patch embedding is essentially a convolution of the input image with a large kernel, so could we introduce patch embedding into ResNet? If that improved the model's performance, it would suggest that ViT's performance really does come from its input format, which would be more in line with the paper's title.
