[Transformer] CoAtNet: Marrying Convolution and Attention for All Data Sizes
2022-06-11 04:41:00 【Yellow millet acridine】
CoAtNet: combining convolution and attention for classification at any data scale
from Google Research Brain Team
Paper
Abstract
Transformers have attracted increasing interest in computer vision, but on some tasks they still fall short of CNN models. The paper argues that although Transformers have larger model capacity, their generalization is worse than that of CNNs because they lack the right inductive biases.
To combine the strengths of the two architectures effectively, the paper proposes CoAtNet, a hybrid family of models built on two key insights:
(1) depthwise convolution and self-attention can be naturally combined;
(2) vertically stacking convolution layers and attention layers in an appropriate way significantly improves generalization and model capacity.
Experiments show that CoAtNet reaches state-of-the-art performance under different datasets and resource constraints: 86.0% ImageNet top-1 accuracy without extra data; 88.56% with ImageNet-21K pre-training; and, with pre-training on the much larger JFT-3B dataset, the accuracy rises to 90.88%.
Section I Introduction
Since the breakthrough of AlexNet, convolutional neural networks have been the dominant models in computer vision. Meanwhile, following the success of Transformers in NLP, more and more work has applied them to vision. Recent research shows that a model built almost entirely from vanilla Transformer layers can achieve reasonable performance on ImageNet alone, and after pre-training on the large-scale JFT-300M dataset it matches state-of-the-art ConvNets, suggesting that Transformers may have a higher capacity ceiling than ConvNets.
Although ViT performs impressively, it still lags behind ConvNets when data is limited. Some work tries to compensate with specialized regularization and data augmentation, yet under the same amount of data and compute these ViT variants do not surpass SOTA convolutional models, mainly because vanilla Transformers lack the inductive biases of ConvNets and therefore need far more data and compute to make up for it.
Consequently, many recent works try to inject the locality of convolution into Transformer modules, either by imposing local receptive fields on attention or by adding implicit or explicit convolution operations to the attention or FFN layers. However, these efforts focus on one aspect or the other and lack a holistic understanding of how convolution and attention should be combined.
This paper systematically studies how to mix convolution and attention from the two perspectives of generalization and model capacity. The analysis shows that convolution layers, with their strong inductive biases, converge faster and generalize better, whereas attention layers have larger capacity and benefit more from large datasets. Combining the two promises faster convergence and better generalization from the convolutional inductive biases, and larger model capacity from attention. The key question is how to combine them effectively to balance accuracy and efficiency.
The paper makes two key observations:
first, the widely used depthwise convolution can be effectively merged into attention layers;
second, stacking convolution layers and attention layers in an appropriate way yields surprisingly good performance and generalization.
Based on these observations, the paper proposes a simple yet effective hybrid architecture, CoAtNet (pronounced like "coat" net).
CoAtNet reaches SOTA under different dataset sizes and resource constraints. In the low-data regime it achieves high accuracy thanks to its favorable inductive biases and good generalization; on large datasets it not only enjoys the capacity of Transformers but also converges faster, improving efficiency.
Using only ImageNet-1K, CoAtNet reaches 86.0% top-1 accuracy; with ImageNet-21K pre-training, the accuracy improves to 88.56%, matching a ViT model pre-trained on JFT-300M while using 23x less data. When pre-trained on JFT-3B, CoAtNet performs even better: it matches ViT-G/14 with about 1.5x less computation, and its largest variant raises top-1 accuracy to 90.88%.
Section II Model
This section focuses on how best to combine convolution and Transformer layers. The problem is split into two parts:
(1) How can convolution and self-attention be merged inside one basic computational block?
(2) How should different types of blocks be stacked vertically to form a complete network?
The paper addresses these questions in turn.
Part 1 Merging Convolution and Self-Attention
For the convolution side, the paper focuses on the MBConv block, which relies on depthwise convolution to capture spatial interactions. MBConv is chosen mainly because both the FFN in Transformers and MBConv adopt an "inverted bottleneck" design: the channel dimension is first expanded 4x and then projected back to the original width.
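Below is a minimal PyTorch sketch of this inverted-bottleneck pattern (1x1 expansion by 4x, 3x3 depthwise convolution, 1x1 projection back, plus a residual connection). It omits the SE module and the paper's exact normalization and stride choices; the class name and hyper-parameters are illustrative, not CoAtNet's actual configuration.

```python
import torch
import torch.nn as nn

class InvertedBottleneck(nn.Module):
    """Sketch of an MBConv-style block: 1x1 expand -> 3x3 depthwise -> 1x1 project."""
    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        hidden = channels * expansion
        self.norm = nn.BatchNorm2d(channels)
        self.expand = nn.Conv2d(channels, hidden, kernel_size=1, bias=False)
        self.depthwise = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                                   groups=hidden, bias=False)  # one filter per channel
        self.project = nn.Conv2d(hidden, channels, kernel_size=1, bias=False)
        self.act = nn.GELU()

    def forward(self, x):
        out = self.act(self.expand(self.norm(x)))   # widen the channels 4x
        out = self.act(self.depthwise(out))         # spatial mixing via depthwise conv
        out = self.project(out)                     # shrink back to the input width
        return x + out                              # residual connection

x = torch.randn(1, 64, 32, 32)
print(InvertedBottleneck(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```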
Beyond the inverted bottleneck, the paper notes that both depthwise convolution and self-attention can be expressed as a per-position weighted sum over a predefined receptive field. Depthwise convolution, for example, aggregates features from a local window:

$y_i = \sum_{j \in \mathcal{L}(i)} w_{i-j} \odot x_j$

where $\mathcal{L}(i)$ denotes the local receptive field around position $i$.
Self-attention can be viewed in the same way, except that the receptive field is the entire image: the output is still a weighted sum of all values, with the weights obtained by softmax normalization:

$y_i = \sum_{j \in \mathcal{G}} \frac{\exp(x_i^\top x_j)}{\sum_{k \in \mathcal{G}} \exp(x_i^\top x_k)}\, x_j \;=\; \sum_{j \in \mathcal{G}} A_{i,j}\, x_j$
where $\mathcal{G}$ denotes the global receptive field, i.e. all spatial positions.
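The two formulas above can be written out directly in a few lines of PyTorch. This is only an illustration of the weighted-sum view (single head, no scaling or learned projections), not CoAtNet code:

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 49, 8)                          # [batch, L = H*W positions, channels]

# Self-attention: weights A_{i,j} = softmax_j(x_i . x_j) over the global field G.
A = F.softmax(x @ x.transpose(1, 2), dim=-1)       # [2, 49, 49], depends on the input
y_attn = A @ x                                     # weighted sum over every position

# Depthwise convolution: static per-channel weights w_{i-j} over a local window L(i).
x_img = x.transpose(1, 2).reshape(2, 8, 7, 7)      # back to [batch, C, H, W]
w = torch.randn(8, 1, 3, 3)                        # one 3x3 kernel per channel, input-independent
y_conv = F.conv2d(x_img, w, padding=1, groups=8)   # weighted sum over each 3x3 neighborhood
```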
Next, the two operations are compared so that their respective strengths can be combined:
(1) The kernel of a depthwise convolution is independent of the input, whereas the attention weights A depend on the input. Input-dependent weights allow self-attention to capture more complicated relationships between different spatial positions, which is what we want when processing high-level concepts, but they also bring a higher risk of overfitting, especially when data is limited.
(2) For any pair of positions (i, j), convolution only cares about the relative shift i − j rather than the absolute values of i and j. This property, usually referred to as translation equivariance, helps improve generalization. ViT uses absolute positional embeddings and therefore lacks this property, which partly explains why ViT underperforms ConvNets on small datasets.
(3) Receptive-field size is one of the most important differences between convolution and self-attention. Generally, a larger receptive field provides richer context and thus larger model capacity; the global receptive field is a key reason self-attention is adopted by Transformers across many tasks. The price, however, is that the computation grows quadratically with the spatial size of the input (a small numeric illustration follows this list).
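A quick numeric illustration of that quadratic cost, assuming a plain (H·W)×(H·W) attention matrix:

```python
# Global attention forms an (H*W) x (H*W) matrix, so memory and compute grow
# quadratically with the number of pixels (fourth power of the side length).
for side in (14, 28, 56, 112):
    n = side * side
    print(f"{side:>3}x{side:<3} feature map -> attention matrix with {n * n:,} entries")
```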
Table 1 compares the properties of convolution and self-attention. Ideally, a single block would enjoy all three desirable properties. Since the two operations share the same weighted-sum form, a simple way to combine them is to add a global static convolution kernel to the attention matrix, either before or after the softmax normalization:

$y_i^{\text{pre}} = \sum_{j \in \mathcal{G}} \frac{\exp\!\big(x_i^\top x_j + w_{i-j}\big)}{\sum_{k \in \mathcal{G}} \exp\!\big(x_i^\top x_k + w_{i-k}\big)}\, x_j, \qquad y_i^{\text{post}} = \sum_{j \in \mathcal{G}} \left( \frac{\exp(x_i^\top x_j)}{\sum_{k \in \mathcal{G}} \exp(x_i^\top x_k)} + w_{i-j} \right) x_j$

Despite its simplicity, the pre-normalization version $y^{\text{pre}}$ corresponds to a particular variant of relative attention, in which the attention weight is decided jointly by the static relative shift $w_{i-j}$ and the input-dependent term $x_i^\top x_j$. The paper therefore adopts $y^{\text{pre}}$ as the key component of CoAtNet.
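A minimal sketch of this pre-softmax combination: a static, input-independent bias w_{i-j} is added to the attention logits before the softmax. For brevity it uses a single head and 1D relative positions, whereas the paper uses multi-head attention with 2D relative positions; the function and variable names are mine.

```python
import torch
import torch.nn.functional as F

def relative_attention_pre(x, w_rel):
    """y_i = sum_j softmax_j(x_i . x_j + w_{i-j}) x_j  (single head, 1D positions).

    x:     [N, L, C] input sequence
    w_rel: [2L-1] trainable biases, indexed by the shifted offset (i - j) + (L - 1)
    """
    N, L, C = x.shape
    logits = x @ x.transpose(1, 2) / C ** 0.5        # input-dependent term x_i . x_j
    idx = torch.arange(L)
    rel = idx[:, None] - idx[None, :] + (L - 1)      # offsets i - j mapped to [0, 2L-2]
    logits = logits + w_rel[rel]                     # add static bias before the softmax
    return F.softmax(logits, dim=-1) @ x

x = torch.randn(2, 16, 32)
w_rel = torch.zeros(2 * 16 - 1, requires_grad=True)  # one bias per relative offset
y = relative_attention_pre(x, w_rel)                 # [2, 16, 32]
```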
Part 2 Vertical Layout Design
Having decided how to combine convolution and attention within a single block, the next question is how to stack the blocks vertically into a complete network.
As mentioned above, the complexity of self-attention is quadratic in the spatial size of the input, so applying it directly to the raw image is computationally prohibitive. The paper therefore considers three options:
(1) Perform some downsampling first to reduce the spatial size, and apply global relative attention once the feature map is small enough;
(2) Enforce local attention, i.e. restrict the global receptive field G to a local window, just like the local receptive field of convolution;
(3) Replace quadratic softmax attention with a linear-complexity attention variant.
The paper briefly tried option (3) without getting reasonable results. Option (2) involves many shape-manipulation operations that require intensive memory access, which is unfriendly to hardware such as TPUs and also limits model capacity. The paper therefore focuses on option (1).
Downsampling can then be done either (a) aggressively with a strided convolution stem as in ViT, or (b) gradually with a multi-stage (multi-scale) network as in ConvNets.
With the ViT-style stem, L Transformer blocks with relative attention are stacked directly; this variant is denoted ViT_REL.
With the ConvNet-style stem, the network is organized into 5 stages, S0, S1, S2, S3 and S4, that gradually reduce the resolution: each stage halves the spatial resolution while doubling the number of channels.
S0 is a simple 2-layer convolutional stem;
S1 uses MBConv blocks with squeeze-and-excitation (SE). (Recall that MBConv relies on depthwise separable convolution mainly to reduce parameters and computation, and its inverted-bottleneck design first expands and then reduces the channel dimension.)
S2–S4 can use either convolution or Transformer blocks, which gives four variants: C-C-C-C, C-C-C-T, C-C-T-T and C-T-T-T, where C and T denote convolution and Transformer stages respectively (a rough sketch of the C-C-T-T layout follows this list).
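As a rough illustration of the C-C-T-T layout, the snippet below writes the five stages as a specification table. The block counts and channel widths are placeholders rather than the paper's exact configuration; only the halving of resolution per stage is actually computed.

```python
# Hypothetical stage specification for the C-C-T-T layout.
stages = [
    # (name, block type,                blocks, channels)  resolution halves at every stage
    ("S0", "plain 3x3 conv stem",        2,      64),
    ("S1", "MBConv (+SE)",               2,      96),
    ("S2", "MBConv (+SE)",               3,     192),
    ("S3", "Transformer, rel-attention", 5,     384),
    ("S4", "Transformer, rel-attention", 2,     768),
]

def stage_resolution(input_side=224, stage_index=0):
    """Each stage downsamples by 2, starting from the stride-2 stem S0."""
    return input_side // 2 ** (stage_index + 1)

for i, (name, kind, blocks, channels) in enumerate(stages):
    side = stage_resolution(224, i)
    print(f"{name}: {blocks} x {kind:<26} channels={channels:<4} output={side}x{side}")
```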
The variants are compared along two axes: generalization and model capacity.
Generalization: the paper looks at the gap between training loss and evaluation accuracy. If two models reach the same training loss, the one with higher evaluation accuracy generalizes better. Generalization is crucial when training data is limited.
Model capacity: capacity measures the ability to fit large datasets. When data is abundant and overfitting is not a concern, the model with larger capacity reaches better performance after sufficient training.

To compare the capacity of the 5 variants, they are trained on ImageNet-1K for 300 epochs and on JFT for 3 epochs, without any data augmentation or regularization.
Fig 1 plots the training and evaluation curves. In terms of generalization, the ranking is roughly:

C-C-C-C ≈ C-C-C-T ≥ C-C-T-T > C-T-T-T ≫ ViT_REL

ViT_REL generalizes much worse than the other variants, which the paper attributes to the aggressive downsampling in its stem discarding low-level information; the general trend is that the more convolution stages a model has, the better it generalizes. In terms of model capacity, the final ranking is:

C-C-T-T ≈ C-T-T-T > ViT_REL > C-C-C-T > C-C-C-C

This indicates that simply having more Transformer blocks does not necessarily mean better capacity for visual tasks. While the Transformer-heavy variants do show higher capacity than the convolution-only ones, confirming the modeling power of attention, the purely Transformer-based ViT_REL is still worse than the two variants that keep MBConv stages, suggesting that its overly aggressive stem discards too much information and limits the model's capacity.
Even more interesting, C-C-T-T and C-T-T-T reach almost the same capacity, showing that for processing low-level information convolution can be as capable as self-attention while being much cheaper in memory and computation.
Finally, to decide between C-C-T-T and C-T-T-T, a transferability test is conducted: both JFT pre-trained models are fine-tuned on ImageNet-1K and their transfer performance is compared.
Table 2 shows that C-C-T-T transfers better. Weighing efficiency, model capacity and transferability together, CoAtNet finally adopts the C-C-T-T structure.
Section III Related Work
Convolutional network building blocks
Convolutional networks have long been the dominant architecture for many computer vision tasks, e.g. ResNet. Depthwise convolution, on the other hand, is favored by lightweight and mobile networks for its low computational cost and small parameter count. MBConv, the inverted-residual block built on depthwise convolution, offers a better balance between accuracy and efficiency; given the close connection between MBConv and the Transformer FFN discussed earlier, this paper adopts MBConv as its basic convolution block.
Self-Attention and Transformer
Self-attention is the core component of the Transformer, which has been widely used for language modeling and understanding. Earlier work showed that networks built from self-attention alone can already perform reasonably well on vision tasks. ViT was the first to obtain good image-classification results with an (almost) pure Transformer, but it still trails ConvNets unless pre-trained on very large datasets, so much follow-up work focuses on improving the efficiency of vision Transformers.
Relative Attention
There are many variants of relative attention, which can roughly be divided into two categories:
(a) the input-dependent version, where the extra relative-attention score is a function of the input; and (b) the input-independent version.
CoAtNet's relative attention belongs to the second, input-independent category. Unlike some earlier variants, the relative-attention parameters here are not shared across layers and no bucketing is used, which keeps the computational cost relatively low; moreover, the bias values can be cached at inference time. A recent work adopts a similar input-independent parameterization but restricts the receptive field to a local window.
Combining convolution and self-attention
The idea of combining convolution with self-attention is not new. A common approach takes a ConvNet as the backbone and augments it with explicit self-attention or non-local modules, replaces some convolution layers with self-attention, or uses more flexible mixtures of linear attention and convolution; these approaches usually introduce extra computational cost.
Although such self-attention augmentations can be very accurate, the attention is usually treated as an add-on to the convolutional network, much like an SE module.
The other direction starts from a Transformer-based network and fuses in convolution or convolutional properties.
This paper also follows that direction, but it merges depthwise convolution and attention in a natural way and with relatively little extra cost.
More importantly, the paper approaches the design from the perspectives of generalization and model capacity, showing which type of layer each stage should prefer. For example, compared with ResNet-ViT hybrids, CoAtNet also scales up its convolution stages when the overall model size grows; and compared with models that use local attention, the full relative attention in S3 and S4 preserves model capacity, since these stages account for most of the compute and parameters.
Section IV Experiments
Experimental settings
CoAtNet family
Table 3 lists the CoAtNet networks at different scales; generally the number of channels doubles and the spatial resolution halves from one stage to the next.
Evaluation criteria
To evaluate the models, experiments are run on datasets of three different scales, ImageNet-1K, ImageNet-21K and JFT-300M, in increasing order of size.
Both training from scratch and pre-training followed by fine-tuning are evaluated.
Data augmentation and regularization
Only two widely used data augmentations are considered, RandAugment and MixUp, along with three regularization techniques: stochastic depth, label smoothing and weight decay.
An interesting observation is that if an augmentation is completely disabled during pre-training, turning it on during fine-tuning can actually hurt performance, which the paper attributes to a shift in data distribution.
Therefore, a small degree of stochastic depth is kept during ImageNet-21K and JFT pre-training, so that stronger regularization and augmentation can be used during fine-tuning to improve downstream performance.
Part 1 Main Results

ImageNet-1K
Table 4 shows the ImageNet-1K results. CoAtNet not only surpasses ViT variants but is also comparable to strong ConvNets such as EfficientNet-V2 and NFNets. Fig 2 further compares accuracy with 224x224 inputs, where CoAtNet's performance gradually improves as more attention modules are used.

ImageNet-21K
Table 4 also reports the ImageNet-21K results. After pre-training, CoAtNet improves dramatically and surpasses every network in the comparison, reaching 88.56% top-1 accuracy; ViT-H/14 only reaches a comparable 88.55% after pre-training on the much larger JFT dataset with a model about 2.3x the size of CoAtNet. This clearly demonstrates CoAtNet's gains in data and compute efficiency.
JFT
Table 5 shows the results on JFT. CoAtNet-4 roughly matches the previous best NFNet-F4+ while being about 2x better in training time and parameter count; when given a parameter and compute budget similar to NFNet, the accuracy reaches 89.77%.
When CoAtNet is scaled up to the regime of ViT-G/14 and trained on the JFT-3B dataset, CoAtNet-6 reaches 90.45% top-1 accuracy with about 1.5x less computation, and CoAtNet-7 reaches 90.88%, the highest accuracy reported at the time.
Part 2 Ablation Studies
The first ablation examines how important relative attention is for merging convolution and self-attention into a single computational block, by comparing two models: one with relative attention and one without.
Table 6 shows that relative attention outperforms standard attention and generalizes better. In the ImageNet-21K experiment in particular, the two models reach similar pre-training performance, yet the relative-attention model transfers noticeably better, indicating that the advantage of relative attention in vision lies less in higher modeling capacity than in better generalization.
Second, since S2 (MBConv) and S3 (Transformer with relative attention) account for most of the computation, a natural question is how to split compute between them for the best accuracy. In practice this comes down to how many blocks each stage receives, which the paper calls the layout design. Table 7 compares the accuracy of different layouts.


The results suggest that allocating more Transformer blocks to S3 improves performance, up to the point where S2 is left with too few blocks to generalize well.
To check whether the preferred trade-off also transfers, the paper further compares the candidate layouts under ImageNet-1K and ImageNet-21K pre-training; the Transformer-heavier layout degrades after transfer, again indicating that the convolution blocks are important for transferability and generalization.
Finally, the paper examines finer design details. Table 8 shows that increasing the head size from 32 to 64 degrades accuracy slightly even though it improves hardware (TPU) throughput, so the accuracy-speed trade-off must be weighed in practice. BatchNorm and LayerNorm perform about the same, but BatchNorm is more hardware friendly, roughly 10%-20% faster on TPU.
Section V Conclusion
This paper systematically studies the properties of convolution and Transformers and proposes CoAtNet, a new way of mixing the two. Extensive experiments show that CoAtNet combines the good generalization of ConvNets with the larger model capacity of Transformers, and reaches SOTA on datasets of different scales.
Notably, this work focuses on classification models for ImageNet, but the authors believe CoAtNet suits a much broader range of applications, such as object detection and semantic segmentation, which is left for future work.
Appendix
Part 1 Model Details
Fig 4 shows the detailed structure of CoAtNet.

2D Relative Attention
For an image of size [H, W], each attention head maintains a trainable parameter table P of size [2H-1, 2W-1], from which the relative-position bias between any two positions (i, j) and (i', j') is looked up according to their offset. During TPU training the biases are computed separately along the height and width axes, giving an overall complexity of O(HW(H+W)); on GPUs the same lookup can be done more efficiently with gather-style memory access.
At inference time the H²W² bias entries are cached to improve throughput; when a larger input resolution is needed, the bias table is bilinearly interpolated to the required size.
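The sketch below shows one common way to expand such a [2H-1, 2W-1] table into the full (H·W)×(H·W) matrix of pairwise biases that is added to the attention logits. The indexing convention is an assumption for illustration, not the paper's exact implementation:

```python
import torch

def relative_bias_2d(P, H, W):
    """Expand a [2H-1, 2W-1] bias table P into an [H*W, H*W] matrix of pairwise biases.

    The entry for positions (i, j) and (i', j') reads P[i - i' + H - 1, j - j' + W - 1].
    """
    rows, cols = torch.arange(H), torch.arange(W)
    di = rows[:, None] - rows[None, :] + (H - 1)            # [H, H] row offsets, shifted >= 0
    dj = cols[:, None] - cols[None, :] + (W - 1)            # [W, W] column offsets
    bias = P[di[:, None, :, None], dj[None, :, None, :]]    # [H, W, H, W] via broadcasting
    return bias.reshape(H * W, H * W)

H, W = 7, 7
P = torch.zeros(2 * H - 1, 2 * W - 1, requires_grad=True)   # one table per attention head
print(relative_bias_2d(P, H, W).shape)                       # torch.Size([49, 49])
```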
Pre-Activation
To keep the model homogeneous, pre-activation, i.e. x ← x + Module(Norm(x)), is used for both the MBConv and the Transformer blocks. Using LayerNorm inside MBConv was also tested and gives the same performance as BatchNorm, but LayerNorm is slower on TPU, so BatchNorm is kept in MBConv; GELU is used as the activation function in both block types.
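A minimal sketch of the pre-activation pattern x ← x + Module(Norm(x)). The wrapper name is mine, and LayerNorm is used here as the example normalizer (per the text above, MBConv keeps BatchNorm while the Transformer blocks use LayerNorm):

```python
import torch
import torch.nn as nn

class PreNormResidual(nn.Module):
    """x <- x + module(norm(x)): normalization is applied before the block, not after."""
    def __init__(self, dim, module, norm=nn.LayerNorm):
        super().__init__()
        self.norm = norm(dim)
        self.module = module

    def forward(self, x):
        return x + self.module(self.norm(x))

# Usage sketch with a Transformer-style FFN on [batch, tokens, channels] inputs.
ffn = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
block = PreNormResidual(64, ffn)
print(block(torch.randn(2, 49, 64)).shape)  # torch.Size([2, 49, 64])
```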
Down-Sampling
For the first block of each stage from S1 to S4, downsampling is applied to both the residual branch and the identity (shortcut) branch; in addition, the shortcut branch projects the channels to the larger hidden size of the new stage.
The downsampling self-attention block can therefore be written as:

$x \leftarrow \mathrm{Proj}\big(\mathrm{Pool}(x)\big) + \mathrm{Attention}\big(\mathrm{Pool}(\mathrm{Norm}(x))\big)$

and the downsampling MBConv block as:

$x \leftarrow \mathrm{Proj}\big(\mathrm{Pool}(x)\big) + \mathrm{Conv}\big(\mathrm{DepthConv}\big(\mathrm{Conv}(\mathrm{Norm}(x),\ \mathrm{stride}=2)\big)\big)$

This differs from the standard MBConv, where downsampling is done by a strided (depthwise) convolution; the paper found that approach to be slow, especially when the model is small (see Fig 9).

Hence, for a better speed-accuracy trade-off, the paper adopts the downsampling formulation above.
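A sketch of such a downsampling first block, following the self-attention formula above: both branches are pooled, and the shortcut is projected to the wider channel width of the new stage. The pooling operator and module names are assumptions; the MBConv version would instead downsample its residual branch with the strided convolution shown in the formula.

```python
import torch
import torch.nn as nn

class DownsampleBlock(nn.Module):
    """First block of a stage (sketch): x <- Proj(Pool(x)) + fn(Pool(Norm(x)))."""
    def __init__(self, in_ch, out_ch, fn):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)     # pooling choice is an assumption
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1)   # shortcut maps to the larger width
        self.norm = nn.BatchNorm2d(in_ch)
        self.fn = fn                                          # stand-in for attention / MBConv

    def forward(self, x):
        return self.proj(self.pool(x)) + self.fn(self.pool(self.norm(x)))

# fn must map in_ch -> out_ch at the pooled resolution; a plain conv stands in here.
block = DownsampleBlock(64, 128, nn.Conv2d(64, 128, kernel_size=3, padding=1))
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 128, 16, 16])
```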
Classification head
Unlike ViT, which appends an extra <cls> token for classification, CoAtNet simply applies global pooling to the output of the last stage.
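A minimal sketch of this pooling-based head (tensor shapes are illustrative):

```python
import torch
import torch.nn as nn

features = torch.randn(2, 768, 7, 7)       # output feature map of the last stage S4
pooled = features.mean(dim=(2, 3))         # global average pooling -> [2, 768]
logits = nn.Linear(768, 1000)(pooled)      # linear classifier over ImageNet-1K classes
print(logits.shape)                        # torch.Size([2, 1000])
```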



