[Transformer] CoAtNet: Marrying Convolution and Attention for All Data Sizes
2022-06-11 04:41:00 【Yellow millet acridine】
CoAtNet: combining convolution and attention for classification at any data scale
from Google Research Brain Team
Paper
Abstract
Transformers have attracted increasing interest in computer vision, but on some tasks they still fall short of CNN models. The paper argues that although Transformers have larger model capacity, their generalization is worse than that of CNNs because they lack the right inductive biases.
To combine the strengths of the two architectures effectively, the paper proposes CoAtNet, a hybrid family of models built on two key insights:
(1) depthwise convolution and self-attention can be naturally combined;
(2) vertically stacking convolution layers and attention layers in an appropriate way significantly improves generalization and model capacity.
Experiments show that CoAtNet reaches state-of-the-art performance under different datasets and resource constraints: 86.0% ImageNet top-1 accuracy without extra data; 88.56% with ImageNet-21K pre-training; and, with pre-training on the much larger JFT-3B dataset, the accuracy rises to 90.88%.
Section I Introduction
Since the breakthrough of AlexNet, convolutional neural networks have been the dominant models in computer vision. Meanwhile, following the success of Transformers in NLP, more and more work has applied them to vision. Recent research shows that a model built almost entirely from vanilla Transformer layers can achieve reasonable performance on ImageNet alone, and after pre-training on the large-scale JFT-300M dataset it matches state-of-the-art ConvNets, suggesting that Transformers may have a higher capacity ceiling than ConvNets.
Although ViT performs impressively, it still lags behind ConvNets when data is limited. Some work tries to compensate with specialized regularization and data augmentation, yet under the same amount of data and compute these ViT variants do not surpass SOTA convolutional models, mainly because vanilla Transformers lack the inductive biases of ConvNets and therefore need far more data and compute to make up for it.
Consequently, many recent works try to inject the locality of convolution into Transformer modules, either by imposing local receptive fields on attention or by adding implicit or explicit convolution operations to the attention or FFN layers. However, these efforts focus on one aspect or the other and lack a holistic understanding of how convolution and attention should be combined.
This paper systematically studies how to mix convolution and attention from the two perspectives of generalization and model capacity. The analysis shows that convolution layers, with their strong inductive biases, converge faster and generalize better, whereas attention layers have larger capacity and benefit more from large datasets. Combining the two promises faster convergence and better generalization from the convolutional inductive biases, and larger model capacity from attention. The key question is how to combine them effectively to balance accuracy and efficiency.
The paper makes two key observations:
first, the widely used depthwise convolution can be effectively merged into attention layers;
second, stacking convolution layers and attention layers in an appropriate way yields surprisingly good performance and generalization.
Based on these observations, the paper proposes a simple yet effective hybrid architecture, CoAtNet (pronounced like "coat" net).
CoAtNet reaches SOTA under different dataset sizes and resource constraints. In the low-data regime it achieves high accuracy thanks to its favorable inductive biases and good generalization; on large datasets it not only enjoys the capacity of Transformers but also converges faster, improving efficiency.
Using only ImageNet-1K, CoAtNet reaches 86.0% top-1 accuracy; with ImageNet-21K pre-training, the accuracy improves to 88.56%, matching a ViT model pre-trained on JFT-300M while using 23x less data. When pre-trained on JFT-3B, CoAtNet performs even better: it matches ViT-G/14 with about 1.5x less computation, and its largest variant raises top-1 accuracy to 90.88%.
Section II Model
This section focuses on how best to combine convolution and Transformer layers. The problem is split into two parts:
(1) How can convolution and self-attention be merged inside one basic computational block?
(2) How should different types of blocks be stacked vertically to form a complete network?
The paper addresses these questions in turn.
Part 1 Merging Convolution and Self-Attention
For the convolution side, the paper focuses on the MBConv block, which relies on depthwise convolution to capture spatial interactions. MBConv is chosen mainly because both the FFN in Transformers and MBConv adopt an "inverted bottleneck" design: the channel dimension is first expanded 4x and then projected back to the original width.
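Below is a minimal PyTorch sketch of this inverted-bottleneck pattern (1x1 expansion by 4x, 3x3 depthwise convolution, 1x1 projection back, plus a residual connection). It omits the SE module and the paper's exact normalization and stride choices; the class name and hyper-parameters are illustrative, not CoAtNet's actual configuration.

```python
import torch
import torch.nn as nn

class InvertedBottleneck(nn.Module):
    """Sketch of an MBConv-style block: 1x1 expand -> 3x3 depthwise -> 1x1 project."""
    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        hidden = channels * expansion
        self.norm = nn.BatchNorm2d(channels)
        self.expand = nn.Conv2d(channels, hidden, kernel_size=1, bias=False)
        self.depthwise = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                                   groups=hidden, bias=False)  # one filter per channel
        self.project = nn.Conv2d(hidden, channels, kernel_size=1, bias=False)
        self.act = nn.GELU()

    def forward(self, x):
        out = self.act(self.expand(self.norm(x)))   # widen the channels 4x
        out = self.act(self.depthwise(out))         # spatial mixing via depthwise conv
        out = self.project(out)                     # shrink back to the input width
        return x + out                              # residual connection

x = torch.randn(1, 64, 32, 32)
print(InvertedBottleneck(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```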
Beyond the inverted bottleneck, the paper notes that both depthwise convolution and self-attention can be expressed as a per-position weighted sum over a predefined receptive field. Depthwise convolution, for example, aggregates features from a local window:

$y_i = \sum_{j \in \mathcal{L}(i)} w_{i-j} \odot x_j$

where $\mathcal{L}(i)$ denotes the local receptive field around position $i$.
Self-attention can be viewed in the same way, except that the receptive field is the entire image: the output is still a weighted sum of all values, with the weights obtained by softmax normalization:

$y_i = \sum_{j \in \mathcal{G}} \frac{\exp(x_i^\top x_j)}{\sum_{k \in \mathcal{G}} \exp(x_i^\top x_k)}\, x_j \;=\; \sum_{j \in \mathcal{G}} A_{i,j}\, x_j$
where $\mathcal{G}$ denotes the global receptive field, i.e. all spatial positions.
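The two formulas above can be written out directly in a few lines of PyTorch. This is only an illustration of the weighted-sum view (single head, no scaling or learned projections), not CoAtNet code:

```python
import torch
import torch.nn.functional as F

x = torch.randn(2, 49, 8)                          # [batch, L = H*W positions, channels]

# Self-attention: weights A_{i,j} = softmax_j(x_i . x_j) over the global field G.
A = F.softmax(x @ x.transpose(1, 2), dim=-1)       # [2, 49, 49], depends on the input
y_attn = A @ x                                     # weighted sum over every position

# Depthwise convolution: static per-channel weights w_{i-j} over a local window L(i).
x_img = x.transpose(1, 2).reshape(2, 8, 7, 7)      # back to [batch, C, H, W]
w = torch.randn(8, 1, 3, 3)                        # one 3x3 kernel per channel, input-independent
y_conv = F.conv2d(x_img, w, padding=1, groups=8)   # weighted sum over each 3x3 neighborhood
```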
Next, the two operations are compared so that their respective strengths can be combined:
(1) The kernel of a depthwise convolution is independent of the input, whereas the attention weights A depend on the input. Input-dependent weights allow self-attention to capture more complicated relationships between different spatial positions, which is what we want when processing high-level concepts, but they also bring a higher risk of overfitting, especially when data is limited.
(2) For any pair of positions (i, j), convolution only cares about the relative shift i − j rather than the absolute values of i and j. This property, usually referred to as translation equivariance, helps improve generalization. ViT uses absolute positional embeddings and therefore lacks this property, which partly explains why ViT underperforms ConvNets on small datasets.
(3) Receptive-field size is one of the most important differences between convolution and self-attention. Generally, a larger receptive field provides richer context and thus larger model capacity; the global receptive field is a key reason self-attention is adopted by Transformers across many tasks. The price, however, is that the computation grows quadratically with the spatial size of the input (a small numeric illustration follows this list).
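A quick numeric illustration of that quadratic cost, assuming a plain (H·W)×(H·W) attention matrix:

```python
# Global attention forms an (H*W) x (H*W) matrix, so memory and compute grow
# quadratically with the number of pixels (fourth power of the side length).
for side in (14, 28, 56, 112):
    n = side * side
    print(f"{side:>3}x{side:<3} feature map -> attention matrix with {n * n:,} entries")
```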
Table 1 compares the properties of convolution and self-attention. Ideally, a single block would enjoy all three desirable properties. Since the two operations share the same weighted-sum form, a simple way to combine them is to add a global static convolution kernel to the attention matrix, either before or after the softmax normalization:

$y_i^{\text{pre}} = \sum_{j \in \mathcal{G}} \frac{\exp\!\big(x_i^\top x_j + w_{i-j}\big)}{\sum_{k \in \mathcal{G}} \exp\!\big(x_i^\top x_k + w_{i-k}\big)}\, x_j, \qquad y_i^{\text{post}} = \sum_{j \in \mathcal{G}} \left( \frac{\exp(x_i^\top x_j)}{\sum_{k \in \mathcal{G}} \exp(x_i^\top x_k)} + w_{i-j} \right) x_j$

Despite its simplicity, the pre-normalization version $y^{\text{pre}}$ corresponds to a particular variant of relative attention, in which the attention weight is decided jointly by the static relative shift $w_{i-j}$ and the input-dependent term $x_i^\top x_j$. The paper therefore adopts $y^{\text{pre}}$ as the key component of CoAtNet.
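A minimal sketch of this pre-softmax combination: a static, input-independent bias w_{i-j} is added to the attention logits before the softmax. For brevity it uses a single head and 1D relative positions, whereas the paper uses multi-head attention with 2D relative positions; the function and variable names are mine.

```python
import torch
import torch.nn.functional as F

def relative_attention_pre(x, w_rel):
    """y_i = sum_j softmax_j(x_i . x_j + w_{i-j}) x_j  (single head, 1D positions).

    x:     [N, L, C] input sequence
    w_rel: [2L-1] trainable biases, indexed by the shifted offset (i - j) + (L - 1)
    """
    N, L, C = x.shape
    logits = x @ x.transpose(1, 2) / C ** 0.5        # input-dependent term x_i . x_j
    idx = torch.arange(L)
    rel = idx[:, None] - idx[None, :] + (L - 1)      # offsets i - j mapped to [0, 2L-2]
    logits = logits + w_rel[rel]                     # add static bias before the softmax
    return F.softmax(logits, dim=-1) @ x

x = torch.randn(2, 16, 32)
w_rel = torch.zeros(2 * 16 - 1, requires_grad=True)  # one bias per relative offset
y = relative_attention_pre(x, w_rel)                 # [2, 16, 32]
```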
Part 2 Vertical Layout Design
Having decided how to combine convolution and attention within a single block, the next question is how to stack the blocks vertically into a complete network.
As mentioned above, the complexity of self-attention is quadratic in the spatial size of the input, so applying it directly to the raw image is computationally prohibitive. The paper therefore considers three options:
(1) Perform some downsampling first to reduce the spatial size, and apply global relative attention once the feature map is small enough;
(2) Enforce local attention, i.e. restrict the global receptive field G to a local window, just like the local receptive field of convolution;
(3) Replace quadratic softmax attention with a linear-complexity attention variant.
The paper briefly tried option (3) without getting reasonable results. Option (2) involves many shape-manipulation operations that require intensive memory access, which is unfriendly to hardware such as TPUs and also limits model capacity. The paper therefore focuses on option (1).
Downsampling can then be done either (a) aggressively with a strided convolution stem as in ViT, or (b) gradually with a multi-stage (multi-scale) network as in ConvNets.
With the ViT-style stem, L Transformer blocks with relative attention are stacked directly; this variant is denoted ViT_REL.
With the ConvNet-style stem, the network is organized into 5 stages, S0, S1, S2, S3 and S4, that gradually reduce the resolution: each stage halves the spatial resolution while doubling the number of channels.
S0 is a simple 2-layer convolutional stem;
S1 uses MBConv blocks with squeeze-and-excitation (SE). (Recall that MBConv relies on depthwise separable convolution mainly to reduce parameters and computation, and its inverted-bottleneck design first expands and then reduces the channel dimension.)
S2–S4 can use either convolution or Transformer blocks, which gives four variants: C-C-C-C, C-C-C-T, C-C-T-T and C-T-T-T, where C and T denote convolution and Transformer stages respectively (a rough sketch of the C-C-T-T layout follows this list).
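As a rough illustration of the C-C-T-T layout, the snippet below writes the five stages as a specification table. The block counts and channel widths are placeholders rather than the paper's exact configuration; only the halving of resolution per stage is actually computed.

```python
# Hypothetical stage specification for the C-C-T-T layout.
stages = [
    # (name, block type,                blocks, channels)  resolution halves at every stage
    ("S0", "plain 3x3 conv stem",        2,      64),
    ("S1", "MBConv (+SE)",               2,      96),
    ("S2", "MBConv (+SE)",               3,     192),
    ("S3", "Transformer, rel-attention", 5,     384),
    ("S4", "Transformer, rel-attention", 2,     768),
]

def stage_resolution(input_side=224, stage_index=0):
    """Each stage downsamples by 2, starting from the stride-2 stem S0."""
    return input_side // 2 ** (stage_index + 1)

for i, (name, kind, blocks, channels) in enumerate(stages):
    side = stage_resolution(224, i)
    print(f"{name}: {blocks} x {kind:<26} channels={channels:<4} output={side}x{side}")
```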
The variants are compared along two axes: generalization and model capacity.
Generalization: the paper looks at the gap between training loss and evaluation accuracy. If two models reach the same training loss, the one with higher evaluation accuracy generalizes better. Generalization is crucial when training data is limited.
Model capacity: capacity measures the ability to fit large datasets. When data is abundant and overfitting is not a concern, the model with larger capacity reaches better performance after sufficient training.

To compare the capacity of the 5 variants, they are trained on ImageNet-1K for 300 epochs and on JFT for 3 epochs, without any data augmentation or regularization.
Fig 1 plots the training and evaluation curves. In terms of generalization, the ranking is roughly:

C-C-C-C ≈ C-C-C-T ≥ C-C-T-T > C-T-T-T ≫ ViT_REL

ViT_REL generalizes much worse than the other variants, which the paper attributes to the aggressive downsampling in its stem discarding low-level information; the general trend is that the more convolution stages a model has, the better it generalizes. In terms of model capacity, the final ranking is:

C-C-T-T ≈ C-T-T-T > ViT_REL > C-C-C-T > C-C-C-C

This indicates that simply having more Transformer blocks does not necessarily mean better capacity for visual tasks. While the Transformer-heavy variants do show higher capacity than the convolution-only ones, confirming the modeling power of attention, the purely Transformer-based ViT_REL is still worse than the two variants that keep MBConv stages, suggesting that its overly aggressive stem discards too much information and limits the model's capacity.
Even more interesting, C-C-T-T and C-T-T-T reach almost the same capacity, showing that for processing low-level information convolution can be as capable as self-attention while being much cheaper in memory and computation.
Finally, to decide between C-C-T-T and C-T-T-T, a transferability test is conducted: both JFT pre-trained models are fine-tuned on ImageNet-1K and their transfer performance is compared.
Table 2 shows that C-C-T-T transfers better. Weighing efficiency, model capacity and transferability together, CoAtNet finally adopts the C-C-T-T structure.
Section III Related Work
Convolutional network building blocks
Convolutional networks have long been the dominant architecture for many computer vision tasks, e.g. ResNet. Depthwise convolution, on the other hand, is favored by lightweight and mobile networks for its low computational cost and small parameter count. MBConv, the inverted-residual block built on depthwise convolution, offers a better balance between accuracy and efficiency; given the close connection between MBConv and the Transformer FFN discussed earlier, this paper adopts MBConv as its basic convolution block.
Self-Attention and Transformer
Self-attention is the core component of the Transformer, which has been widely used for language modeling and understanding. Earlier work showed that networks built from self-attention alone can already perform reasonably well on vision tasks. ViT was the first to obtain good image-classification results with an (almost) pure Transformer, but it still trails ConvNets unless pre-trained on very large datasets, so much follow-up work focuses on improving the efficiency of vision Transformers.
Relative Attention
There are many variants of relative attention, which can roughly be divided into two categories:
(a) the input-dependent version, where the extra relative-attention score is a function of the input; and (b) the input-independent version.
CoAtNet's relative attention belongs to the second, input-independent category. Unlike some earlier variants, the relative-attention parameters here are not shared across layers and no bucketing is used, which keeps the computational cost relatively low; moreover, the bias values can be cached at inference time. A recent work adopts a similar input-independent parameterization but restricts the receptive field to a local window.
Combining convolution and self-attention
The idea of combining convolution with self-attention is not new. A common approach takes a ConvNet as the backbone and augments it with explicit self-attention or non-local modules, replaces some convolution layers with self-attention, or uses more flexible mixtures of linear attention and convolution; these approaches usually introduce extra computational cost.
Although such self-attention augmentations can be very accurate, the attention is usually treated as an add-on to the convolutional network, much like an SE module.
The other direction starts from a Transformer-based network and fuses in convolution or convolutional properties.
This paper also follows that direction, but it merges depthwise convolution and attention in a natural way and with relatively little extra cost.
More importantly, the paper approaches the design from the perspectives of generalization and model capacity, showing which type of layer each stage should prefer. For example, compared with ResNet-ViT hybrids, CoAtNet also scales up its convolution stages when the overall model size grows; and compared with models that use local attention, the full relative attention in S3 and S4 preserves model capacity, since these stages account for most of the compute and parameters.
Section IV Experiments
Experimental settings
CoAtNet family
Table 3 lists the CoAtNet networks at different scales; generally the number of channels doubles and the spatial resolution halves from one stage to the next.
Evaluation criteria
To evaluate the models, experiments are run on datasets of three different scales, ImageNet-1K, ImageNet-21K and JFT-300M, in increasing order of size.
Both training from scratch and pre-training followed by fine-tuning are evaluated.
Data augmentation and regularization
Only two widely used data augmentations are considered, RandAugment and MixUp, along with three regularization techniques: stochastic depth, label smoothing and weight decay.
An interesting observation is that if an augmentation is completely disabled during pre-training, turning it on during fine-tuning can actually hurt performance, which the paper attributes to a shift in data distribution.
Therefore, a small degree of stochastic depth is kept during ImageNet-21K and JFT pre-training, so that stronger regularization and augmentation can be used during fine-tuning to improve downstream performance.
Part 1 Main Results

ImageNet-1K
Table 4 shows the ImageNet-1K results. CoAtNet not only surpasses ViT variants but is also comparable to strong ConvNets such as EfficientNet-V2 and NFNets. Fig 2 further compares accuracy with 224x224 inputs, where CoAtNet's performance gradually improves as more attention modules are used.

ImageNet-21K
Table 4 also reports the ImageNet-21K results. After pre-training, CoAtNet improves dramatically and surpasses every network in the comparison, reaching 88.56% top-1 accuracy; ViT-H/14 only reaches a comparable 88.55% after pre-training on the much larger JFT dataset with a model about 2.3x the size of CoAtNet. This clearly demonstrates CoAtNet's gains in data and compute efficiency.
JFT
Table 5 shows the results on JFT. CoAtNet-4 roughly matches the previous best NFNet-F4+ while being about 2x better in training time and parameter count; when given a parameter and compute budget similar to NFNet, the accuracy reaches 89.77%.
When CoAtNet is scaled up to the regime of ViT-G/14 and trained on the JFT-3B dataset, CoAtNet-6 reaches 90.45% top-1 accuracy with about 1.5x less computation, and CoAtNet-7 reaches 90.88%, the highest accuracy reported at the time.
Part 2 Ablation Studies
The first ablation examines how important relative attention is for merging convolution and self-attention into a single computational block, by comparing two models: one with relative attention and one without.
Table 6 shows that relative attention outperforms standard attention and generalizes better. In the ImageNet-21K experiment in particular, the two models reach similar pre-training performance, yet the relative-attention model transfers noticeably better, indicating that the advantage of relative attention in vision lies less in higher modeling capacity than in better generalization.
Second, since S2 (MBConv) and S3 (Transformer with relative attention) account for most of the computation, a natural question is how to split compute between them for the best accuracy. In practice this comes down to how many blocks each stage receives, which the paper calls the layout design. Table 7 compares the accuracy of different layouts.


The results suggest that allocating more Transformer blocks to S3 improves performance, up to the point where S2 is left with too few blocks to generalize well.
To check whether the preferred trade-off also transfers, the paper further compares the candidate layouts under ImageNet-1K and ImageNet-21K pre-training; the Transformer-heavier layout degrades after transfer, again indicating that the convolution blocks are important for transferability and generalization.
Finally, the paper examines finer design details. Table 8 shows that increasing the head size from 32 to 64 degrades accuracy slightly even though it improves hardware (TPU) throughput, so the accuracy-speed trade-off must be weighed in practice. BatchNorm and LayerNorm perform about the same, but BatchNorm is more hardware friendly, roughly 10%-20% faster on TPU.
Section V Conclusion
This paper systematically studies the properties of convolution and Transformers and proposes CoAtNet, a new way of mixing the two. Extensive experiments show that CoAtNet combines the good generalization of ConvNets with the larger model capacity of Transformers, and reaches SOTA on datasets of different scales.
Notably, this work focuses on classification models for ImageNet, but the authors believe CoAtNet suits a much broader range of applications, such as object detection and semantic segmentation, which is left for future work.
Appendix
Part 1 Model Details
Fig 4 shows the detailed structure of CoAtNet.

2D Relative Attention
For an image of size [H, W], each attention head maintains a trainable parameter table P of size [2H-1, 2W-1], from which the relative-position bias between any two positions (i, j) and (i', j') is looked up according to their offset. During TPU training the biases are computed separately along the height and width axes, giving an overall complexity of O(HW(H+W)); on GPUs the same lookup can be done more efficiently with gather-style memory access.
At inference time the H²W² bias entries are cached to improve throughput; when a larger input resolution is needed, the bias table is bilinearly interpolated to the required size.
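The sketch below shows one common way to expand such a [2H-1, 2W-1] table into the full (H·W)×(H·W) matrix of pairwise biases that is added to the attention logits. The indexing convention is an assumption for illustration, not the paper's exact implementation:

```python
import torch

def relative_bias_2d(P, H, W):
    """Expand a [2H-1, 2W-1] bias table P into an [H*W, H*W] matrix of pairwise biases.

    The entry for positions (i, j) and (i', j') reads P[i - i' + H - 1, j - j' + W - 1].
    """
    rows, cols = torch.arange(H), torch.arange(W)
    di = rows[:, None] - rows[None, :] + (H - 1)            # [H, H] row offsets, shifted >= 0
    dj = cols[:, None] - cols[None, :] + (W - 1)            # [W, W] column offsets
    bias = P[di[:, None, :, None], dj[None, :, None, :]]    # [H, W, H, W] via broadcasting
    return bias.reshape(H * W, H * W)

H, W = 7, 7
P = torch.zeros(2 * H - 1, 2 * W - 1, requires_grad=True)   # one table per attention head
print(relative_bias_2d(P, H, W).shape)                       # torch.Size([49, 49])
```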
Pre-Activation
To keep the model homogeneous, pre-activation, i.e. x ← x + Module(Norm(x)), is used for both the MBConv and the Transformer blocks. Using LayerNorm inside MBConv was also tested and gives the same performance as BatchNorm, but LayerNorm is slower on TPU, so BatchNorm is kept in MBConv; GELU is used as the activation function in both block types.
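A minimal sketch of the pre-activation pattern x ← x + Module(Norm(x)). The wrapper name is mine, and LayerNorm is used here as the example normalizer (per the text above, MBConv keeps BatchNorm while the Transformer blocks use LayerNorm):

```python
import torch
import torch.nn as nn

class PreNormResidual(nn.Module):
    """x <- x + module(norm(x)): normalization is applied before the block, not after."""
    def __init__(self, dim, module, norm=nn.LayerNorm):
        super().__init__()
        self.norm = norm(dim)
        self.module = module

    def forward(self, x):
        return x + self.module(self.norm(x))

# Usage sketch with a Transformer-style FFN on [batch, tokens, channels] inputs.
ffn = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
block = PreNormResidual(64, ffn)
print(block(torch.randn(2, 49, 64)).shape)  # torch.Size([2, 49, 64])
```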
Down-Sampling
For the first block of each stage from S1 to S4, downsampling is applied to both the residual branch and the identity (shortcut) branch; in addition, the shortcut branch projects the channels to the larger hidden size of the new stage.
The downsampling self-attention block can therefore be written as:

$x \leftarrow \mathrm{Proj}\big(\mathrm{Pool}(x)\big) + \mathrm{Attention}\big(\mathrm{Pool}(\mathrm{Norm}(x))\big)$

and the downsampling MBConv block as:

$x \leftarrow \mathrm{Proj}\big(\mathrm{Pool}(x)\big) + \mathrm{Conv}\big(\mathrm{DepthConv}\big(\mathrm{Conv}(\mathrm{Norm}(x),\ \mathrm{stride}=2)\big)\big)$

This differs from the standard MBConv, where downsampling is done by a strided (depthwise) convolution; the paper found that approach to be slow, especially when the model is small (see Fig 9).

Hence, for a better speed-accuracy trade-off, the paper adopts the downsampling formulation above.
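A sketch of such a downsampling first block, following the self-attention formula above: both branches are pooled, and the shortcut is projected to the wider channel width of the new stage. The pooling operator and module names are assumptions; the MBConv version would instead downsample its residual branch with the strided convolution shown in the formula.

```python
import torch
import torch.nn as nn

class DownsampleBlock(nn.Module):
    """First block of a stage (sketch): x <- Proj(Pool(x)) + fn(Pool(Norm(x)))."""
    def __init__(self, in_ch, out_ch, fn):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)     # pooling choice is an assumption
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1)   # shortcut maps to the larger width
        self.norm = nn.BatchNorm2d(in_ch)
        self.fn = fn                                          # stand-in for attention / MBConv

    def forward(self, x):
        return self.proj(self.pool(x)) + self.fn(self.pool(self.norm(x)))

# fn must map in_ch -> out_ch at the pooled resolution; a plain conv stands in here.
block = DownsampleBlock(64, 128, nn.Conv2d(64, 128, kernel_size=3, padding=1))
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 128, 16, 16])
```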
Classification head
Unlike ViT, which appends an extra <cls> token for classification, CoAtNet simply applies global pooling to the output of the last stage.
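A minimal sketch of this pooling-based head (tensor shapes are illustrative):

```python
import torch
import torch.nn as nn

features = torch.randn(2, 768, 7, 7)       # output feature map of the last stage S4
pooled = features.mean(dim=(2, 3))         # global average pooling -> [2, 768]
logits = nn.Linear(768, 1000)(pooled)      # linear classifier over ImageNet-1K classes
print(logits.shape)                        # torch.Size([2, 1000])
```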



