Deep learning - MetaFormer Is Actually What You Need for Vision
2022-07-28 06:10:00 【Food to doubt life】
Preface
This post summarizes the CVPR 2022 oral paper "MetaFormer Is Actually What You Need for Vision". The paper studies ViT-style and MLP-style models, abstracts the part they have in common into what it calls the MetaFormer structure, and argues that the performance of both families largely comes from this structure. On that basis it proposes the PoolFormer architecture.
MetaFormer structure

The left side of the figure shows the MetaFormer structure. The Token Mixer module in MetaFormer is responsible for mixing information across tokens. In ViT-style models this module is instantiated as Attention (e.g. DeiT), while in MLP-style models it is instantiated as a Spatial MLP (e.g. gMLP, ResMLP).
The authors argue that the MetaFormer structure is the main source of performance for both ViT-style and MLP-style models. To test this, they replace the Token Mixer module in MetaFormer with an identity mapping (in effect this amounts to a large convolutional model, with a single-channel large convolution processing one feature map (token)); even so, the model still reaches 74.3% top-1 accuracy on ImageNet.
In addition, the authors found that removing any of the Norm, Channel MLP, or Shortcut modules makes the model hard to converge, which verifies that every component of the MetaFormer structure is indispensable.
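To make the description above concrete, here is a minimal PyTorch sketch of a generic MetaFormer block. It is my own illustration rather than the authors' released code: the names MetaFormerBlock and token_mixer, the LayerNorm choice, and the MLP ratio of 4 are assumptions.

```python
import torch
import torch.nn as nn

class MetaFormerBlock(nn.Module):
    """Generic MetaFormer block: Norm -> Token Mixer -> shortcut,
    then Norm -> Channel MLP -> shortcut."""
    def __init__(self, dim, token_mixer, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mixer = token_mixer           # Attention / Spatial MLP / Pooling / nn.Identity()
        self.norm2 = nn.LayerNorm(dim)
        hidden = int(dim * mlp_ratio)
        self.channel_mlp = nn.Sequential(        # two-layer MLP applied to each token independently
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x):                        # x: (batch, num_tokens, dim)
        x = x + self.token_mixer(self.norm1(x))  # mix information across tokens, with shortcut
        x = x + self.channel_mlp(self.norm2(x))  # mix information across channels, with shortcut
        return x

# Passing nn.Identity() as the token mixer corresponds to the identity-mapping ablation above.
block = MetaFormerBlock(dim=64, token_mixer=nn.Identity())
out = block(torch.randn(2, 196, 64))
```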
PoolFormer structure
Building on MetaFormer, the authors introduce the PoolFormer structure. PoolFormer mixes information across tokens with average pooling; compared with Attention and Spatial MLP, pooling introduces no extra learnable parameters and requires less computation. The PoolFormer structure is shown in the figure below:
The PyTorch code for the Pooling operation is roughly as follows.
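This is a sketch reconstructed from the paper's description (stride-1 average pooling whose output has the input subtracted); the default pool_size of 3 and the count_include_pad=False flag are assumptions on my part.

```python
import torch
import torch.nn as nn

class Pooling(nn.Module):
    """PoolFormer token mixer: stride-1 average pooling minus the input."""
    def __init__(self, pool_size=3):
        super().__init__()
        self.pool = nn.AvgPool2d(
            pool_size, stride=1, padding=pool_size // 2, count_include_pad=False)

    def forward(self, x):   # x: (batch, channels, height, width)
        # The paper's stated reason for subtracting the input: the surrounding
        # block already has a residual (shortcut) connection.
        return self.pool(x) - x

y = Pooling()(torch.randn(1, 64, 14, 14))  # output has the same shape as the input
```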
Note the subtraction after the pooling operation. I do not entirely agree with the authors' explanation for it (see the comment in the code above); the operation feels more like a trick, and I could not find in the paper any result showing how performance changes when this subtraction is removed.
PoolFormer's accuracy on ImageNet is shown in the figure below; none of the models use pretrained weights.
For personal thoughts, see my previous blog post. The paper also reports experiments on object detection and other downstream tasks; backbone research is getting more and more competitive these days.