
Benchmarking Detection Transfer Learning with Vision Transformers (2021-11)

2022-07-07 23:47:00 Gy Zhao

This post covers a paper from Kaiming He's group, written after MAE, that explores using a pure Transformer architecture for the downstream task of object detection. It was foreshadowed at the end of the MAE paper and was later followed by ViTDet. It offers plenty of inspiration for applying ViT architectures to vision tasks.

Introduction

Object detection is a central downstream task that is often used to test the performance of pre-trained models, for example in terms of training speed or accuracy. When a new architecture such as ViT appears, the complexity of the detection task makes this benchmark even more important. In practice, several difficulties (architectural incompatibility, slow training, high memory consumption, unknown training recipes, and so on) have hindered research on transferring ViT to object detection. The paper proposes using ViT as the Mask R-CNN backbone and achieves its stated research goal: the authors compare five ViT initializations, including self-supervised methods, supervised initialization, and a strong random-initialization baseline.

Unsupervised/self-supervised learning is a common pre-training approach for initializing model parameters, which are then transferred to downstream tasks such as image classification and object detection for fine-tuning. The effectiveness of an unsupervised algorithm is usually measured with downstream-task metrics (accuracy, convergence speed, etc.) and compared against baselines such as supervised pre-training or training from scratch (no pre-training).

Unsupervised learning in vision has usually been studied with CNN models. Because CNNs are widely used in most downstream tasks, the benchmarking protocol is easy to define: an unsupervised CNN algorithm can be treated as a plug-and-play parameter initializer.

The paper uses the Mask R-CNN framework to evaluate ViT models on the COCO dataset for object detection and instance segmentation, making only minimal modifications to ViT so that it stays simple and flexible.

Conclusion

The paper presents effective ways to use ViT-based models as the backbone in the Mask R-CNN framework. These methods keep training memory and time acceptable while achieving strong results on COCO without relying on many complex extensions.

  • An effective training recipe is obtained that can benchmark five different ViT initialization methods. Experiments show that random initialization takes about 4× longer to train than any pre-trained initialization, but reaches higher AP than ImageNet-1k supervised pre-training. MoCo v3, a representative contrastive self-supervised method, performs almost the same as supervised pre-training (i.e., worse than random initialization).
  • More importantly, there is an exciting new result: masked-image-modeling methods (BEiT and MAE) show considerable gains over both supervised and random initialization, and these gains grow as the model size increases. Supervised and MoCo v3 initializations do not show this kind of scaling behavior.

ViT backbone

[Figure: ViT backbone with the four scale-adapting modules (green) feeding FPN]
Using ViT as the Mask R-CNN backbone raises two questions:

  1. How to make it work with FPN (ViT produces a single-scale feature map)
  2. How to reduce memory consumption and running time

For point 1:
To adapt to FPN's multi-scale input, the feature maps produced by intermediate ViT layers are upsampled or downsampled by four modules at different resolutions (the green modules in the figure above) to generate multi-scale features. The four taps are spaced every $\frac{d}{4}$ blocks, where $d$ is the number of transformer blocks, i.e., the blocks are divided at equal intervals.

  • The first green module upsamples by 4× using two stride-2 $2 \times 2$ transposed convolutions: first a $2 \times 2$ transposed convolution, then Group Normalization, then a GELU nonlinearity, then another stride-2 $2 \times 2$ transposed convolution.
  • The module after the next $\frac{d}{4}$ blocks upsamples by 2× with a single stride-2 $2 \times 2$ transposed convolution, without normalization or nonlinearity.
  • The third module leaves its output unchanged.
  • The last module downsamples by 2× with stride-2 $2 \times 2$ max pooling.
    Each of these modules preserves ViT's embedding/channel dimension. For a patch size of 16, they produce feature maps with strides 4, 8, 16, and 32, which are then fed into FPN. (The original ViT feature map is $\frac{1}{16}$ the input size; after the up/down-sampling it matches the strides stated in the paper.) A rough sketch of these modules follows below.
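To make the description above concrete, here is a minimal PyTorch sketch of the four scale-adapting modules. Module names, the 32-group GroupNorm, and the exact layer arrangement are assumptions for illustration, not the authors' actual code.

```python
import torch.nn as nn

embed_dim = 768  # ViT-B channel dimension; kept unchanged by every module

# 4x upsampling: two stride-2 2x2 transposed convs with GroupNorm + GELU in between
up4 = nn.Sequential(
    nn.ConvTranspose2d(embed_dim, embed_dim, kernel_size=2, stride=2),
    nn.GroupNorm(32, embed_dim),  # group count is an assumption
    nn.GELU(),
    nn.ConvTranspose2d(embed_dim, embed_dim, kernel_size=2, stride=2),
)
# 2x upsampling: one stride-2 2x2 transposed conv, no norm or nonlinearity
up2 = nn.ConvTranspose2d(embed_dim, embed_dim, kernel_size=2, stride=2)
# identity: the stride-16 feature map is passed through unchanged
keep = nn.Identity()
# 2x downsampling: stride-2 2x2 max pooling
down2 = nn.MaxPool2d(kernel_size=2, stride=2)

# Applied to the ViT feature maps tapped every d/4 blocks (stride 16 each),
# these produce feature maps of stride 4, 8, 16 and 32 for FPN.
scale_adapters = nn.ModuleList([up4, up2, keep, down2])
```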

For point 2:
Each self-attention layer in ViT has $O(h^2 w^2)$ space and time complexity for an image split into $h \times w$ non-overlapping patches.
This complexity is manageable during pre-training, because images are typically $224 \times 224$ with a patch size of 16, so $h = w = 14$ is the typical setting. But in the downstream detection task the standard image size is $1024 \times 1024$, roughly 20× the pixels and patches used in pre-training; such a large resolution is needed to detect both larger and smaller objects. In this setting, even ViT-Base as the Mask R-CNN backbone usually consumes 20-30 GB of GPU memory with small batches and half-precision floating point.
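A quick back-of-the-envelope check (plain Python, using the image and patch sizes quoted above) shows why memory explodes:

```python
# Token counts for pre-training (224x224) vs. detection fine-tuning (1024x1024),
# both with 16x16 patches.
pretrain_side = 224 // 16    # 14 patches per side -> 196 tokens
detect_side = 1024 // 16     # 64 patches per side -> 4096 tokens

token_ratio = (detect_side ** 2) / (pretrain_side ** 2)   # ~20.9x more tokens
attn_ratio = token_ratio ** 2                             # attention is quadratic in tokens

print(f"tokens: {pretrain_side**2} -> {detect_side**2} ({token_ratio:.1f}x)")
print(f"global self-attention cost grows by ~{attn_ratio:.0f}x")
```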

To reduce the space and time complexity, restricted self-attention (also called windowed self-attention) is used (it originates from "Attention Is All You Need"; the first thing it brings to mind is Swin's window attention). The $h \times w$ patchified image is divided into non-overlapping $r \times r$ windows, and self-attention is computed independently within each window. This windowed self-attention has $O(r^2 hw)$ space and time complexity. $r$ is set to the global self-attention size used in pre-training, typically 14. (There are $\frac{h}{r} \times \frac{w}{r}$ windows; each window holds $r^2$ tokens, so its attention costs $O(r^4)$, giving $\frac{hw}{r^2} \cdot O(r^4) = O(r^2 hw)$ in total.)
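A minimal sketch of the window partitioning, assuming a (B, H, W, C) token layout and a generic `attn` module; this is not the paper's exact implementation:

```python
def windowed_self_attention(x, attn, r=14):
    """Apply self-attention independently inside non-overlapping r x r windows.

    x:    (B, H, W, C) patch tokens; H and W are assumed divisible by r for brevity
    attn: any module mapping (N, L, C) -> (N, L, C), e.g. a standard multi-head
          self-attention layer wrapped to return only the output tensor
    """
    B, H, W, C = x.shape
    # Partition into (H/r)*(W/r) windows of r*r tokens each.
    x = x.view(B, H // r, r, W // r, r, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, r * r, C)
    # Each window attends only to its own r*r tokens: O(r^2 * h * w) total cost.
    x = attn(x)
    # Undo the window partition.
    x = x.view(B, H // r, W // r, r, r, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
    return x
```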

A side effect of this approach is that there is no information exchange across windows, so a hybrid scheme is adopted: as shown in the figure, four global self-attention blocks are kept, placed consistently with the FPN taps. These are the extra pieces the authors add to let ViT generate multi-scale features.

Modifications to the Mask R-CNN modules

  1. The convolutions in FPN use Batch Normalization.
  2. The RPN uses two convolution layers instead of one.
  3. The region-of-interest (RoI) classification and box regression head uses four convolution layers with Batch Normalization followed by one fully connected layer, instead of the original two-layer MLP without normalization (a sketch follows after this list).
  4. The convolutions in the standard mask head also use Batch Normalization.
    The corresponding code is at: https://github.com/facebookresearch/detectron2/blob/main/MODEL_ZOO.md#new-baselines-using-large-scale-jitter-and-longer-training-schedule
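As a rough illustration of modification 3, here is a sketch of a conv+BN box head in PyTorch. Layer widths and the 7×7 pooled size are assumptions, not the exact detectron2 configuration linked above.

```python
import torch.nn as nn

class ConvFCBoxHead(nn.Module):
    """RoI box head: four 3x3 convs with BatchNorm, then one fully connected layer
    (replacing the original two-FC MLP without normalization)."""

    def __init__(self, in_channels=256, conv_dim=256, fc_dim=1024, pooled_size=7):
        super().__init__()
        layers = []
        for i in range(4):
            layers += [
                nn.Conv2d(in_channels if i == 0 else conv_dim, conv_dim,
                          kernel_size=3, padding=1, bias=False),
                nn.BatchNorm2d(conv_dim),
                nn.ReLU(inplace=True),
            ]
        self.convs = nn.Sequential(*layers)
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(conv_dim * pooled_size * pooled_size, fc_dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # x: (N, in_channels, pooled_size, pooled_size) RoIAlign features
        return self.fc(self.convs(x))  # (N, fc_dim), fed to the cls/box predictors
```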

Differences between pre-training methods

  1. Different pre-training methods use different numbers of epochs, and the cost of a single epoch also differs between methods. The paper adopts each method's default epochs, so the methods are obviously not directly comparable in pre-training cost.
  2. BEiT adds a learnable relative position bias to the self-attention logits in every block, while the other methods use absolute position embeddings. To account for this, the authors include both relative position biases and absolute position embeddings in all detection models, whether or not they were used in pre-training. For BEiT, the pre-trained biases are transferred and the absolute position embeddings are randomly initialized. For the other methods, the relative position biases are initialized to zero and the pre-trained absolute position embeddings are transferred. The relative position biases are shared across windowed attention blocks and (separately) across global attention blocks. When the spatial dimensions in pre-training and fine-tuning do not match, the pre-trained parameters are resized to the required resolution (a resizing sketch is given after this list).
  3. BEiT uses layer scale during pre-training; the other methods do not. During fine-tuning, BEiT-initialized models must therefore also be parameterized with layer scale, while the other models use no layer scale in either pre-training or fine-tuning.
  4. The authors try to standardize the pre-training data to ImageNet-1k. However, BEiT uses the DALL·E discrete VAE (dVAE), which was trained on roughly 250 million proprietary, unreleased images. The impact of this additional training data is not fully understood.
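For the resolution-mismatch case in point 2, a common way to resize pre-trained absolute position embeddings is 2D interpolation over the patch grid. A sketch (bicubic interpolation assumed; the class-token embedding, if any, is assumed to be handled separately):

```python
import torch.nn.functional as F

def resize_abs_pos_embed(pos_embed, new_hw, old_hw=(14, 14)):
    """Resize absolute position embeddings from the pre-training patch grid
    to the fine-tuning patch grid.

    pos_embed: (1, old_h * old_w, C) pre-trained position embeddings
    new_hw:    (new_h, new_w) target patch grid, e.g. (64, 64) for 1024x1024 inputs
    """
    old_h, old_w = old_hw
    new_h, new_w = new_hw
    C = pos_embed.shape[-1]
    # (1, L, C) -> (1, C, old_h, old_w) so that 2D interpolation can be applied
    grid = pos_embed.reshape(1, old_h, old_w, C).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_h, new_w), mode="bicubic", align_corners=False)
    # back to (1, new_h * new_w, C)
    return grid.permute(0, 2, 3, 1).reshape(1, new_h * new_w, C)
```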

Experiments

[Figure]

  • Comparison of initialization methods; random initialization uses no pre-training.
    [Figure]
  • Convergence comparison when fine-tuning for 25-400 epochs. Every pre-trained initialization converges faster than random initialization, and the final AP differs between methods. MAE and BEiT are similar with ViT-B, but with ViT-L a clearly larger gap opens up over BEiT, and the gap keeps widening as the model gets bigger, showing that these methods scale well.

After long training most methods show signs of overfitting and AP drops. At the same time, the randomly initialized model ends up with higher AP than MoCo v3 and supervised IN1k pre-training; COCO is a challenging setting for transfer learning, so with a long enough schedule random initialization can perform better. MAE and BEiT provide the first convincing results where pre-training greatly improves COCO AP, and the improvement keeps growing with model size, leaving plenty of headroom.

Ablation experiments

[Figure]

  • Comparison of the FPN-based multi-scale variant and single-scale variants.

[Figure]

  • Comparison of strategies for reducing time and memory consumption; there are four options.
  1. Replace all global attention with $14 \times 14$ windowed self-attention.
  2. Use a mixture of windowed and global attention (the hybrid scheme above).
  3. Use global attention everywhere, with activation checkpointing (a sketch is given after this list).
  4. Use none of the above strategies, which triggers an out-of-memory (OOM) error and blocks training.
    The second option strikes a good balance between accuracy, training time, and memory. Surprisingly, using windowed attention everywhere does not perform that badly; this is probably because the convolutions in the scale-adapting modules and RoIAlign in Mask R-CNN already introduce some cross-window computation.
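Option 3 relies on activation checkpointing, which recomputes activations during the backward pass instead of storing them. A generic sketch using PyTorch's built-in utility (not the authors' training code):

```python
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

def run_vit_blocks(blocks: nn.ModuleList, x, use_checkpointing: bool = True):
    """Run a stack of transformer blocks, optionally checkpointing each block
    to trade extra backward-pass compute for much lower activation memory."""
    for block in blocks:
        if use_checkpointing and x.requires_grad:
            x = checkpoint(block, x)   # activations recomputed during backward
        else:
            x = block(x)
    return x
```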

[Figure]

  • The influence of positional information.
    - In BEiT, ViT is modified to use relative position biases in each transformer block instead of adding absolute position embeddings to the patch embeddings. This choice is an orthogonal enhancement that the other pre-training methods do not use.
    For a fair comparison, all fine-tuned models include relative position biases (and absolute position embeddings) by default.

Meaning of the entries in the table:
relative position biases (rel)
absolute position embeddings (abs)
pt: initialized with pre-trained values
rand: random initialization
zero: initialized at zero
no: this positional information is not used in the fine-tuned model

[Figure]

  • The influence of the number of pre-training epochs: as the epochs increase, AP also increases.

[Figure]

  • A plot of the $\Delta AP^{box}@0.5$ metric. Each bar shows how much $AP$ would improve if a given type of error were fixed:
    cls: localization is correct (IoU > 0.5) but the classification is wrong
    loc: the classification is correct but the localization is wrong (IoU in [0.1, 0.5))
    cls+loc: both the classification and the localization are wrong
    dup: the detection would be correct were it not for a higher-scoring correct detection
    bg: the detection is on background (IoU < 0.1)
    miss: all undetected ground-truth objects not covered by the other error types
    Masked-image-modeling methods (MAE/BEiT) make fewer localization errors and miss fewer objects than MoCo v3 and supervised initialization.
     [Figure]
  • Model complexity. When trained from scratch, ResNet-101 and ViT-B both reach 48.9 $AP^{box}$, but ViT-B peaks at 200 training epochs while ResNet-101 needs 400. ResNet-101 runs at a noticeably higher fps.