[Transformer]MViTv2:Improved Multiscale Vision Transformers for Classification and Detection
2022-06-11 04:41:00 【Yellow millet acridine】
MViTv2: an improved multiscale Transformer for classification and detection
From FAIR and UC Berkeley
Abstract
This paper presents an improved Multiscale Vision Transformer (MViT) as a general architecture for image and video classification and for object detection. The improvements are decomposed relative position embeddings and residual pooling connections. Five MViT variants of different sizes are evaluated on ImageNet, COCO, and Kinetics, outperforming prior work.
The paper further compares MViT's pooling attention with window attention, finding that pooling attention is better in both accuracy and computation. MViT currently reaches state of the art in three areas: 88.8% on ImageNet classification, 56.1 AP on COCO object detection, and 86.1% on Kinetics.
Section I Introduction
Different visual tasks have generally been tackled with different architectures, usually assembled from the simplest effective networks such as VGGNet and ResNet. Recently, ViT has shown performance comparable to CNNs in many areas; although ViT performs well in image classification, applying it to high-resolution object detection and spatio-temporal video understanding remains challenging, because the computational complexity of self-attention is quadratic in resolution. The two current mainstream optimizations are:
(1) restricting attention to local windows;
(2) using pooling attention to aggregate local features, e.g. in video tasks.
The latter inspired the MViT framework proposed in this paper, a simple and extensible design: instead of a fixed resolution throughout the network, different stages operate at resolutions from high to low.
This paper introduces two simple designs to further improve MViT's performance, and studies it on image classification, object detection, and video classification to assess whether MViT can serve as a general backbone for vision and spatio-temporal recognition tasks.
The contributions of this paper are summarized as follows :
(1) A decomposed relative position embedding is proposed to inject position information; this embedding is shift-invariant;
(2) Residual pooling connections are used to compensate for the effect of stride in the attention computation;
(3) The optimized MViT is applied to dense prediction tasks: its multiscale features are combined with FPN for object detection and instance segmentation.
The paper also explores whether MViT's pooling attention can handle high-resolution input while keeping compute and memory costs under control. Experiments show that pooling attention is more effective than local window attention (as in Swin Transformer); a simple but effective hybrid window attention scheme is also proposed as a complement to pooling attention, to further balance accuracy and computational efficiency.
(4) Five MViT variants of different sizes are built and evaluated on image classification, object detection, and instance segmentation.
With these improvements, MViT reaches state of the art in three areas: 88.8% on ImageNet classification, 56.1 AP on COCO object detection, and 86.1% on Kinetics.
Section II Related Work
CNNs are currently the mainstream backbone for visual tasks, but many recent Transformers applied to image classification achieve performance comparable to CNNs. A series of works has therefore improved the Transformer framework, e.g. better training strategies, multiscale Transformer architectures, and more advanced attention mechanisms. The multiscale Transformer of this paper (MViT) is intended as a backbone for different visual tasks.
ViT for object detection
Object detection requires high-resolution input for accurate localization, while the complexity of self-attention is quadratic in resolution. Recent work mainly reduces computation via shifted windows or Longformer-style attention; the pooling attention in MViT is also an efficient way to compute attention.
ViT for video classification
ViT also shows excellent performance in video recognition, but most approaches rely on pre-training on large datasets. The MViT presented here is simple but effective, and the effect of ImageNet pre-training is also studied.
Section III Revisiting Multiscale Vision Transformers
MViTv1 builds modules at different resolutions in different stages, unlike the original Transformer where all blocks share the same resolution. In MViT, the channel dimension D gradually increases while the resolution L (the sequence length) decreases.
To implement downsampling inside a Transformer module, MViT uses Pooling Attention: any input sequence is linearly mapped to Q, K, V, which are then pooled, mainly to shorten the K and V sequences, and attention is computed on the pooled tensors.
The pooling strides for K and V can differ from the pooling stride used for Q.


Pooling Attention can pool in every stage, which greatly reduces the memory and compute cost of the Q-K-V attention computation.
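The pooling attention described above can be sketched as follows. This is a minimal single-head NumPy illustration, not the paper's multi-head PyTorch implementation; the square token grid, average pooling, and the stride values are illustrative assumptions.

```python
import numpy as np

def pool_seq(x, stride):
    """Average-pool a flattened (H*W, d) sequence laid out on a square grid."""
    L, d = x.shape
    side = int(np.sqrt(L))
    g = x.reshape(side // stride, stride, side // stride, stride, d)
    return g.mean(axis=(1, 3)).reshape(-1, d)

def pooled_attention(x, Wq, Wk, Wv, stride_q=1, stride_kv=2):
    """Project to Q, K, V, pool them (K/V more aggressively), then attend."""
    q = pool_seq(x @ Wq, stride_q)
    k = pool_seq(x @ Wk, stride_kv)
    v = pool_seq(x @ Wv, stride_kv)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v  # output length = len(q), i.e. downsampled by stride_q**2
```

Pooling K and V shrinks the attention matrix from L x L to (L / stride_q^2) x (L / stride_kv^2), which is where the memory and compute savings come from.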
Section IV Improved MViT
Part 1 Improved Pooling Attention

Decomposed relative position embedding
Although MViT has demonstrated excellent performance at capturing relationships between tokens, its attention models content rather than spatial structure. Absolute position encoding provides position information but ignores a key property of images: shift invariance. In the original MViT, if two patches change their absolute positions while their relative position stays the same, the modeled interaction between them changes.

To address this, this paper uses relative position encodings: the relative position of two patches i and j is computed and embedded into the attention. But the number of embeddings is O(TWH), so the paper decomposes the distance between i and j along the spatio-temporal axes, i.e. computes it separately along the height, width, and temporal axes, reducing the complexity to O(T + W + H).
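A minimal sketch of the decomposition: instead of one learned bias per joint 3-D offset, keep one small table per axis and sum the three lookups. The grid sizes and random tables below are illustrative stand-ins for learned parameters.

```python
import numpy as np

T, H, W = 4, 8, 8          # illustrative space-time grid
rng = np.random.default_rng(0)
# One entry per relative offset along each axis: O(T + H + W) parameters,
# versus O(T * H * W)-scale growth for a joint 3-D offset table.
Rt = rng.standard_normal(2 * T - 1)
Rh = rng.standard_normal(2 * H - 1)
Rw = rng.standard_normal(2 * W - 1)

def rel_bias(pi, pj):
    """Decomposed relative-position bias between tokens at 3-D positions pi, pj."""
    (ti, hi, wi), (tj, hj, wj) = pi, pj
    return Rt[ti - tj + T - 1] + Rh[hi - hj + H - 1] + Rw[wi - wj + W - 1]
```

The bias is added to the attention logit between tokens i and j; because it depends only on the offset, shifting both tokens by the same amount leaves it unchanged.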


Residual pooling connection
The pooling attention introduced in MViTv1 greatly reduces the cost of self-attention, mainly by pooling Q, K, V after the linear projections. But in v1, K and V use large strides, while Q is downsampled only when the output sequence resolution changes; this motivates adding residual connections inside the pooling attention module to increase information flow. This paper (MViTv2) therefore introduces a residual pooling connection in the attention module, expressed as the following formula:
Z := Attn(Q, K, V) + Q
After the attention computation, a residual connection with the pooled Q gives the final output; note that the output Z has the same length as Q. Ablations show that Q pooling and the residual connection are complementary: pooling reduces the cost of self-attention on one hand, and the residual improves performance on the other.
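The formula above can be sketched in a few lines of NumPy; this is an illustrative single-head version in which q is assumed to be the already-pooled query.

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def residual_pooled_attention(q, k, v):
    """Residual pooling connection: add the pooled query Q back to the
    attention output, so Z = Attn(Q, K, V) + Q keeps Q's length."""
    z = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    return z + q  # residual path (assumes v and q share the head dimension)
```

Because the residual is taken against the pooled Q rather than the unpooled input, the shapes match even when the attention output is downsampled.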
Part 2 MViT for Object Detection
This section describes how MViT is applied to object detection.
Fig 3 shows MViT as a backbone combined with FPN for object detection. MViT produces feature maps at four stages, which are integrated into the feature pyramid network; the FPN uses lateral connections to exploit MViT's semantically strong output feature maps.
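The backbone-plus-FPN wiring can be sketched as below. This is a hypothetical minimal NumPy version: the random lateral projection matrices stand in for learned 1x1 convolutions, nearest-neighbor upsampling stands in for the interpolation a real FPN would use, and the stage shapes are illustrative.

```python
import numpy as np

def simple_fpn(features, d=256, seed=0):
    """Top-down FPN over multiscale backbone features.

    features: list of (H_i, W_i, C_i) maps ordered fine -> coarse, each map
    half the spatial size of the previous one. Lateral projections map every
    stage to a common dimension d; coarser levels are upsampled 2x and added.
    """
    rng = np.random.default_rng(seed)
    laterals = [f @ rng.standard_normal((f.shape[-1], d)) * 0.01 for f in features]
    outs = [laterals[-1]]                                 # start from coarsest
    for lat in reversed(laterals[:-1]):
        up = outs[0].repeat(2, axis=0).repeat(2, axis=1)  # nearest 2x upsample
        outs.insert(0, lat + up)
    return outs                                           # fine -> coarse, all dim d
```

Each pyramid level thus mixes a stage's own features with semantically stronger, coarser features from above, which is what the detection heads consume.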

Hybrid window Attention
The computational complexity of self-attention in a Transformer scales with the number of tokens, while object detection usually requires high-resolution inputs and feature maps. Two ways to reduce compute and memory are studied:
one is the pooling attention used in MViT's attention module;
the other is the window attention used in Swin Transformer.
Both reduce computation by restricting the Q, K, V vectors, but they are essentially different:
pooling attention aggregates features by downsampling local neighborhoods while still computing global attention;
window attention keeps the tensor resolution but splits the input into windows and computes attention locally. The inherent difference between the two motivates exploring whether they can complement each other in object detection.
Because window attention computes attention within each window, it lacks connections across windows by default. Unlike Swin Transformer, which uses shifted windows to solve this, this paper proposes a simple hybrid window attention (HWin) for cross-window connection:
HWin computes windowed attention in every block of a stage except the last, so that the features fed into FPN contain global information.
Ablations show that this simple HWin consistently outperforms the shifted-window scheme on classification and object detection, and that combining pooling attention with HWin achieves the best detection performance.
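The windowing step that HWin applies in all but a stage's last block can be sketched as a partition/reverse pair; attention would then run independently inside each window, while the last block attends over the full map. The function names and shapes below are illustrative, not the paper's code.

```python
import numpy as np

def window_partition(x, win):
    """Split an (H, W, C) map into non-overlapping (win*win, C) windows."""
    H, W, C = x.shape
    x = x.reshape(H // win, win, W // win, win, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, win * win, C)

def window_reverse(wins, win, H, W):
    """Invert window_partition back to an (H, W, C) map."""
    C = wins.shape[-1]
    x = wins.reshape(H // win, W // win, win, win, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(H, W, C)
```

Restricting attention to win*win tokens per window makes the cost linear in the number of windows instead of quadratic in the full token count.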
Positional embedding in detection
Unlike image classification, object detection usually involves objects of varying sizes. Therefore, when MViT's position embeddings are used for detection, they are first initialized at 224x224 resolution and then interpolated to the target resolution for detection training.
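The interpolation step can be sketched with a plain bilinear resize of the embedding grid; this hand-rolled NumPy version is an assumption-laden stand-in for the framework interpolation call an actual implementation would use.

```python
import numpy as np

def resize_pos_embed(pe, new_h, new_w):
    """Bilinearly resize an (H, W, C) position-embedding grid to (new_h, new_w, C)."""
    H, W, C = pe.shape
    ys = np.linspace(0, H - 1, new_h)
    xs = np.linspace(0, W - 1, new_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    wy = (ys - y0)[:, None, None]   # vertical interpolation weights
    wx = (xs - x0)[None, :, None]   # horizontal interpolation weights
    top = pe[y0][:, x0] * (1 - wx) + pe[y0][:, x1] * wx
    bot = pe[y1][:, x0] * (1 - wx) + pe[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy
```

E.g. a 14x14 grid learned at 224x224 (patch size 16) would be resized to 56x56 or larger for high-resolution detection inputs.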
Part 3 MViT for Video Recognition
MViT can easily be applied to video classification, because its modules transfer naturally to the spatio-temporal domain; the effect of pre-training is also explored.
The differences from MViT for image classification are:
(1) the input projection maps space-time cubes rather than 2D patches;
(2) pooling operations aggregate over spatio-temporal feature maps;
(3) relative position embeddings also account for the temporal dimension.
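Point (1) above can be sketched as follows: a clip is flattened into space-time cubes, the video analogue of ViT's 2-D patch embedding (the learned linear projection is omitted, and the cube sizes are illustrative assumptions).

```python
import numpy as np

def cube_embed(video, pt=2, ph=4, pw=4):
    """Flatten a (T, H, W, C) clip into space-time cubes of size pt x ph x pw."""
    T, H, W, C = video.shape
    v = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)   # group the cube dims together
    return v.reshape(-1, pt * ph * pw * C)  # (num_cubes, cube_dim)
```

Each token then carries a short temporal extent as well as a spatial patch, so the downstream pooling and position embeddings operate over T, H, and W jointly.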
Part 4 MViT Architecture Variants
Different scales of MViT are built for fair comparison with other frameworks, mainly by varying the number of channels, the number of blocks, the number of heads, and other parameters, giving five model sizes: Tiny, Small, Base, Large, and Huge. Detailed architecture parameters are given in Table 1.
The pooling stride is set to 4 and decays adaptively across stages.

Section V Experiments Image Classification
Part 1 Comparison results
The paper first evaluates MViTv2 on ImageNet image classification and COCO object detection. Table 2 shows the performance comparison with other frameworks and with MViTv1.

In Table 2, the networks under comparison are grouped by compute. Compared with MViTv1, MViTv2 achieves higher classification accuracy, and MViTv2-B is even more lightweight than MViTv1-B.
MViTv2 also outperforms other Transformer models such as DeiT and Swin, especially as model scale grows: MViTv2-B reaches 84.8% accuracy, surpassing DeiT-B and Swin-B by 2.6% and 1.1% respectively, while using 33% fewer parameters.
Besides center-crop testing, the paper also evaluates full-resolution input, improving accuracy from 86% to 86.3%, the highest accuracy to date (without external data or model distillation).

Table 3 shows results after ImageNet-21K pre-training; pre-training improves MViT-L accuracy by 2.2%.
Part 2 Object Detection
Object detection is evaluated on the COCO dataset, using Mask R-CNN and Cascade R-CNN as detection frameworks. For fair comparison, the same settings as Swin are followed: pre-training on ImageNet and then fine-tuning on COCO for 36 epochs. The tests use HWin, with window sizes set to [56, 28, 14, 7].
Table 5 shows the results of the two detection frameworks on COCO. MViT outperforms both CNN and Transformer baselines; for example, MViTv2-B improves over Swin-B by 2.5 APbox with a smaller model and fewer parameters, and results with Cascade R-CNN are similar.
Part 3 Ablation Experiments
Different attention mechanisms
The effects of different attention mechanisms, namely pooling attention and (hybrid) window self-attention, are studied; Table 4 shows the comparison. The experiments show:
(1) For the ViT-B model, the default window-based approach does reduce compute and memory, but the lack of cross-window interaction drops top-1 accuracy by 2.0%; the Swin shifted window recovers 0.4% accuracy. The HWin proposed here matches full attention and improves over the Swin window by 1.7%; combined with pooling attention, it achieves the best accuracy while reducing compute by 38%.
(2) MViT-S uses pooling attention by default; adding Swin or HWin windows reduces model complexity but slightly degrades performance. The best accuracy/compute trade-off is reached by further increasing the pooling stride.


Positional embeddings
Table 6 compares different position embeddings. Absolute position encoding is only marginally better than no position embedding at all, because the pooling operation already models location information; relative position encoding clearly improves performance, because it introduces shift invariance. Using the decomposed relative position embedding proposed here additionally gives a 3.9x speedup.
Residual pooling connection
Table 7 shows the effect of the residual pooling connection. Introducing the residual connection alone already improves performance, with negligible extra compute; applying residual pooling together with Q pooling in all layers gives a large improvement, notably +1.4 AP on COCO, showing that using Q pooling blocks with residual connections throughout MViT is essential.
Runtime comparison
Table 8 compares runtime; MViT achieves higher throughput than Swin.
Table 9 compares single-scale and multiscale detectors. FPN significantly improves both backbones, and MViT-B is significantly better than ViT-B, showing that the multiscale hierarchical design is well suited to the dense prediction required by object detection.

Section VI Conclusion
This paper presents an improved Multiscale Vision Transformer (MViTv2) that can serve as a general framework for visual tasks, with excellent performance in image classification, instance segmentation, video classification, and other areas. The authors hope this framework will be studied further on other visual tasks in the future.