[Transformer]MViTv2:Improved Multiscale Vision Transformers for Classification and Detection
2022-06-11 04:41:00 【Yellow millet acridine】
MViTv2: an improved multiscale Transformer for classification and detection
From FAIR and UC Berkeley
Abstract
This paper presents an improved Multiscale Vision Transformer (MViT) as a general architecture for image and video classification and for object detection. The improvements are decomposed relative position embeddings and residual pooling connections. Five MViT variants of different sizes are evaluated on ImageNet, COCO, and Kinetics, outperforming prior work.
The paper further compares MViT's pooling attention with window attention, finding that pooling attention is better in both accuracy and computation. MViT currently reaches state of the art in three areas: 88.8% on ImageNet classification, 56.1 AP on COCO object detection, and 86.1% on Kinetics.
Section I Introduction
Different visual tasks have generally been tackled with different architectures, usually assembled from the simplest effective networks such as VGGNet and ResNet. Recently, ViT has shown performance comparable to CNNs in many areas; although ViT performs well in image classification, applying it to high-resolution object detection and spatio-temporal video understanding remains challenging, because the computational complexity of self-attention is quadratic in resolution. The two current mainstream optimizations are:
(1) restricting attention to local windows;
(2) using pooling attention to aggregate local features, e.g. in video tasks.
The latter inspired the MViT framework proposed in this paper, a simple and extensible design: instead of a fixed resolution throughout the network, different stages operate at resolutions from high to low.
This paper introduces two simple designs to further improve MViT's performance, and studies it on image classification, object detection, and video classification to assess whether MViT can serve as a general backbone for vision and spatio-temporal recognition tasks.
The contributions of this paper are summarized as follows :
(1) A decomposed relative position embedding is proposed to inject position information; this embedding is shift-invariant;
(2) Residual pooling connections are used to compensate for the effect of stride in the attention computation;
(3) The optimized MViT is applied to dense prediction tasks: its multiscale features are combined with FPN for object detection and instance segmentation.
The paper also explores whether MViT's pooling attention can handle high-resolution input while keeping compute and memory costs under control. Experiments show that pooling attention is more effective than local window attention (as in Swin Transformer); a simple but effective hybrid window attention scheme is also proposed as a complement to pooling attention, to further balance accuracy and computational efficiency.
(4) Five MViT variants of different sizes are built and evaluated on image classification, object detection, and instance segmentation.
With these improvements, MViT reaches state of the art in three areas: 88.8% on ImageNet classification, 56.1 AP on COCO object detection, and 86.1% on Kinetics.
Section II Related Work
CNNs are currently the mainstream backbone for visual tasks, but many recent Transformers applied to image classification achieve performance comparable to CNNs. A series of works has therefore improved the Transformer framework, e.g. better training strategies, multiscale Transformer architectures, and more advanced attention mechanisms. The multiscale Transformer of this paper (MViT) is intended as a backbone for different visual tasks.
ViT for object detection
Object detection requires high-resolution input for accurate localization, while the complexity of self-attention is quadratic in resolution. Recent work mainly reduces computation via shifted windows or Longformer-style attention; the pooling attention in MViT is also an efficient way to compute attention.
ViT for video classification
ViT also shows excellent performance in video recognition, but most approaches rely on pre-training on large datasets. The MViT presented here is simple but effective, and the effect of ImageNet pre-training is also studied.
Section III Revisiting Multiscale Vision Transformers
MViTv1 builds modules at different resolutions in different stages, unlike the original Transformer where all blocks share the same resolution. In MViT, the channel dimension D gradually increases while the resolution L (the sequence length) decreases.
To implement downsampling inside a Transformer module, MViT uses Pooling Attention: any input sequence is linearly mapped to Q, K, V, which are then pooled, mainly to shorten the K and V sequences, and attention is computed on the pooled tensors.
The pooling strides for K and V can differ from the pooling stride used for Q.


Pooling Attention can pool in every stage, which greatly reduces the memory and compute cost of the Q-K-V attention computation.
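The pooling attention described above can be sketched as follows. This is a minimal single-head NumPy illustration, not the paper's multi-head PyTorch implementation; the square token grid, average pooling, and the stride values are illustrative assumptions.

```python
import numpy as np

def pool_seq(x, stride):
    """Average-pool a flattened (H*W, d) sequence laid out on a square grid."""
    L, d = x.shape
    side = int(np.sqrt(L))
    g = x.reshape(side // stride, stride, side // stride, stride, d)
    return g.mean(axis=(1, 3)).reshape(-1, d)

def pooled_attention(x, Wq, Wk, Wv, stride_q=1, stride_kv=2):
    """Project to Q, K, V, pool them (K/V more aggressively), then attend."""
    q = pool_seq(x @ Wq, stride_q)
    k = pool_seq(x @ Wk, stride_kv)
    v = pool_seq(x @ Wv, stride_kv)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v  # output length = len(q), i.e. downsampled by stride_q**2
```

Pooling K and V shrinks the attention matrix from L x L to (L / stride_q^2) x (L / stride_kv^2), which is where the memory and compute savings come from.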
Section IV Improved MViT
Part 1 Improved Pooling Attention

Decomposed relative position embedding
Although MViT has demonstrated excellent performance at capturing relationships between tokens, its attention models content rather than spatial structure. Absolute position encoding provides position information but ignores a key property of images: shift invariance. In the original MViT, if two patches change their absolute positions while their relative position stays the same, the modeled interaction between them changes.

To address this, this paper uses relative position encodings: the relative position of two patches i and j is computed and embedded into the attention. But the number of embeddings is O(TWH), so the paper decomposes the distance between i and j along the spatio-temporal axes, i.e. computes it separately along the height, width, and temporal axes, reducing the complexity to O(T + W + H).
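A minimal sketch of the decomposition: instead of one learned bias per joint 3-D offset, keep one small table per axis and sum the three lookups. The grid sizes and random tables below are illustrative stand-ins for learned parameters.

```python
import numpy as np

T, H, W = 4, 8, 8          # illustrative space-time grid
rng = np.random.default_rng(0)
# One entry per relative offset along each axis: O(T + H + W) parameters,
# versus O(T * H * W)-scale growth for a joint 3-D offset table.
Rt = rng.standard_normal(2 * T - 1)
Rh = rng.standard_normal(2 * H - 1)
Rw = rng.standard_normal(2 * W - 1)

def rel_bias(pi, pj):
    """Decomposed relative-position bias between tokens at 3-D positions pi, pj."""
    (ti, hi, wi), (tj, hj, wj) = pi, pj
    return Rt[ti - tj + T - 1] + Rh[hi - hj + H - 1] + Rw[wi - wj + W - 1]
```

The bias is added to the attention logit between tokens i and j; because it depends only on the offset, shifting both tokens by the same amount leaves it unchanged.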


Residual pooling connection
The pooling attention introduced in MViTv1 greatly reduces the cost of self-attention, mainly by pooling Q, K, V after the linear projections. But in v1, K and V use large strides, while Q is downsampled only when the output sequence resolution changes; this motivates adding residual connections inside the pooling attention module to increase information flow. This paper (MViTv2) therefore introduces a residual pooling connection in the attention module, expressed as the following formula:
Z := Attn(Q, K, V) + Q
After the attention computation, a residual connection with the pooled Q gives the final output; note that the output Z has the same length as Q. Ablations show that Q pooling and the residual connection are complementary: pooling reduces the cost of self-attention on one hand, and the residual improves performance on the other.
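The formula above can be sketched in a few lines of NumPy; this is an illustrative single-head version in which q is assumed to be the already-pooled query.

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def residual_pooled_attention(q, k, v):
    """Residual pooling connection: add the pooled query Q back to the
    attention output, so Z = Attn(Q, K, V) + Q keeps Q's length."""
    z = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    return z + q  # residual path (assumes v and q share the head dimension)
```

Because the residual is taken against the pooled Q rather than the unpooled input, the shapes match even when the attention output is downsampled.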
Part 2 MViT for Object Detection
This section describes how MViT is applied to object detection.
Fig 3 shows MViT as a backbone combined with FPN for object detection. MViT produces feature maps at four stages, which are integrated into the feature pyramid network; the FPN uses lateral connections to exploit MViT's semantically strong output feature maps.
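The backbone-plus-FPN wiring can be sketched as below. This is a hypothetical minimal NumPy version: the random lateral projection matrices stand in for learned 1x1 convolutions, nearest-neighbor upsampling stands in for the interpolation a real FPN would use, and the stage shapes are illustrative.

```python
import numpy as np

def simple_fpn(features, d=256, seed=0):
    """Top-down FPN over multiscale backbone features.

    features: list of (H_i, W_i, C_i) maps ordered fine -> coarse, each map
    half the spatial size of the previous one. Lateral projections map every
    stage to a common dimension d; coarser levels are upsampled 2x and added.
    """
    rng = np.random.default_rng(seed)
    laterals = [f @ rng.standard_normal((f.shape[-1], d)) * 0.01 for f in features]
    outs = [laterals[-1]]                                 # start from coarsest
    for lat in reversed(laterals[:-1]):
        up = outs[0].repeat(2, axis=0).repeat(2, axis=1)  # nearest 2x upsample
        outs.insert(0, lat + up)
    return outs                                           # fine -> coarse, all dim d
```

Each pyramid level thus mixes a stage's own features with semantically stronger, coarser features from above, which is what the detection heads consume.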

Hybrid window Attention
The computational complexity of self-attention in a Transformer scales with the number of tokens, while object detection usually requires high-resolution inputs and feature maps. Two ways to reduce compute and memory are studied:
one is the pooling attention used in MViT's attention module;
the other is the window attention used in Swin Transformer.
Both reduce computation by restricting the Q, K, V vectors, but they are essentially different:
pooling attention aggregates features by downsampling local neighborhoods while still computing global attention;
window attention keeps the tensor resolution but splits the input into windows and computes attention locally. The inherent difference between the two motivates exploring whether they can complement each other in object detection.
Because window attention computes attention within each window, it lacks connections across windows by default. Unlike Swin Transformer, which uses shifted windows to solve this, this paper proposes a simple hybrid window attention (HWin) for cross-window connection:
HWin computes windowed attention in every block of a stage except the last, so that the features fed into FPN contain global information.
Ablations show that this simple HWin consistently outperforms the shifted-window scheme on classification and object detection, and that combining pooling attention with HWin achieves the best detection performance.
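The windowing step that HWin applies in all but a stage's last block can be sketched as a partition/reverse pair; attention would then run independently inside each window, while the last block attends over the full map. The function names and shapes below are illustrative, not the paper's code.

```python
import numpy as np

def window_partition(x, win):
    """Split an (H, W, C) map into non-overlapping (win*win, C) windows."""
    H, W, C = x.shape
    x = x.reshape(H // win, win, W // win, win, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, win * win, C)

def window_reverse(wins, win, H, W):
    """Invert window_partition back to an (H, W, C) map."""
    C = wins.shape[-1]
    x = wins.reshape(H // win, W // win, win, win, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(H, W, C)
```

Restricting attention to win*win tokens per window makes the cost linear in the number of windows instead of quadratic in the full token count.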
Positional embedding in detection
Unlike image classification, object detection usually involves objects of varying sizes. Therefore, when MViT's position embeddings are used for detection, they are first initialized at 224x224 resolution and then interpolated to the target resolution for detection training.
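The interpolation step can be sketched with a plain bilinear resize of the embedding grid; this hand-rolled NumPy version is an assumption-laden stand-in for the framework interpolation call an actual implementation would use.

```python
import numpy as np

def resize_pos_embed(pe, new_h, new_w):
    """Bilinearly resize an (H, W, C) position-embedding grid to (new_h, new_w, C)."""
    H, W, C = pe.shape
    ys = np.linspace(0, H - 1, new_h)
    xs = np.linspace(0, W - 1, new_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    wy = (ys - y0)[:, None, None]   # vertical interpolation weights
    wx = (xs - x0)[None, :, None]   # horizontal interpolation weights
    top = pe[y0][:, x0] * (1 - wx) + pe[y0][:, x1] * wx
    bot = pe[y1][:, x0] * (1 - wx) + pe[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy
```

E.g. a 14x14 grid learned at 224x224 (patch size 16) would be resized to 56x56 or larger for high-resolution detection inputs.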
Part 3 MViT for Video Recognition
MViT can easily be applied to video classification, because its modules transfer naturally to the spatio-temporal domain; the effect of pre-training is also explored.
The differences from MViT for image classification are:
(1) the input projection maps space-time cubes rather than 2D patches;
(2) pooling operations aggregate over spatio-temporal feature maps;
(3) relative position embeddings also account for the temporal dimension.
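Point (1) above can be sketched as follows: a clip is flattened into space-time cubes, the video analogue of ViT's 2-D patch embedding (the learned linear projection is omitted, and the cube sizes are illustrative assumptions).

```python
import numpy as np

def cube_embed(video, pt=2, ph=4, pw=4):
    """Flatten a (T, H, W, C) clip into space-time cubes of size pt x ph x pw."""
    T, H, W, C = video.shape
    v = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)   # group the cube dims together
    return v.reshape(-1, pt * ph * pw * C)  # (num_cubes, cube_dim)
```

Each token then carries a short temporal extent as well as a spatial patch, so the downstream pooling and position embeddings operate over T, H, and W jointly.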
Part 4 MViT Architecture Variants
Different scales of MViT are built for fair comparison with other frameworks, mainly by varying the number of channels, the number of blocks, the number of heads, and other parameters, giving five model sizes: Tiny, Small, Base, Large, and Huge. Detailed architecture parameters are given in Table 1.
The pooling stride is set to 4 and decays adaptively across stages.

Section V Experiments Image Classification
Part 1 Comparison results
The paper first evaluates MViTv2 on ImageNet image classification and COCO object detection. Table 2 shows the performance comparison with other frameworks and with MViTv1.

In Table 2, the networks under comparison are grouped by compute. Compared with MViTv1, MViTv2 achieves higher classification accuracy, and MViTv2-B is even more lightweight than MViTv1-B.
MViTv2 also outperforms other Transformer models such as DeiT and Swin, especially as model scale grows: MViTv2-B reaches 84.8% accuracy, surpassing DeiT-B and Swin-B by 2.6% and 1.1% respectively, while using 33% fewer parameters.
Besides center-crop testing, the paper also evaluates full-resolution input, improving accuracy from 86% to 86.3%, the highest accuracy to date (without external data or model distillation).

Table 3 shows results after ImageNet-21K pre-training; pre-training improves MViT-L accuracy by 2.2%.
Part 2 Object Detection
Object detection is evaluated on the COCO dataset, using Mask R-CNN and Cascade R-CNN as detection frameworks. For fair comparison, the same settings as Swin are followed: pre-training on ImageNet and then fine-tuning on COCO for 36 epochs. The tests use HWin, with window sizes set to [56, 28, 14, 7].
Table 5 shows the results of the two detection frameworks on COCO. MViT outperforms both CNN and Transformer baselines; for example, MViTv2-B improves over Swin-B by 2.5 APbox with a smaller model and fewer parameters, and results with Cascade R-CNN are similar.
Part 3 Ablation Experiments
Different attention mechanisms
The effects of different attention mechanisms, namely pooling attention and (hybrid) window self-attention, are studied; Table 4 shows the comparison. The experiments show:
(1) For the ViT-B model, the default window-based approach does reduce compute and memory, but the lack of cross-window interaction drops top-1 accuracy by 2.0%; the Swin shifted window recovers 0.4% accuracy. The HWin proposed here matches full attention and improves over the Swin window by 1.7%; combined with pooling attention, it achieves the best accuracy while reducing compute by 38%.
(2) MViT-S uses pooling attention by default; adding Swin or HWin windows reduces model complexity but slightly degrades performance. The best accuracy/compute trade-off is reached by further increasing the pooling stride.


Positional embeddings
Table 6 compares different position embeddings. Absolute position encoding is only marginally better than no position embedding at all, because the pooling operation already models location information; relative position encoding clearly improves performance, because it introduces shift invariance. Using the decomposed relative position embedding proposed here additionally gives a 3.9x speedup.
Residual pooling connection
Table 7 shows the effect of the residual pooling connection. Introducing the residual connection alone already improves performance, with negligible extra compute; applying residual pooling together with Q pooling in all layers gives a large improvement, notably +1.4 AP on COCO, showing that using Q pooling blocks with residual connections throughout MViT is essential.
Runtime comparison
Table 8 compares runtime; MViT achieves higher throughput than Swin.
Table 9 compares single-scale and multiscale detectors. FPN significantly improves both backbones, and MViT-B is significantly better than ViT-B, showing that the multiscale hierarchical design is well suited to the dense prediction required by object detection.

Section VI Conclusion
This paper presents an improved Multiscale Vision Transformer (MViTv2) that can serve as a general framework for visual tasks, with excellent performance in image classification, instance segmentation, video classification, and other areas. The authors hope this framework will be studied further on other visual tasks in the future.