When CNN meets Transformer — CMT: Convolutional Neural Networks Meet Vision Transformers
2022-07-28 19:26:00 【I'm Mr. rhubarb】
Original address
Original text and additional materials
Paper reading method
First impressions
ViT applies the Transformer architecture directly to vision, pre-training on large datasets and achieving good results. Many excellent follow-up works have improved on it and demonstrated the remarkable potential of Transformers, but they are still weaker than CNNs of comparable size (e.g. EfficientNet).
The authors argue that although the standard Transformer can capture long-range dependencies between patches, visual tasks, unlike NLP tasks, also rely heavily on the 2D structure of the image and on local spatial information within and between patches. In addition, because the Transformer uses a fixed patch size, it is difficult to capture low-level and multi-scale features, which is very harmful for dense prediction tasks (segmentation, detection). Moreover, the time and space complexity of the self-attention module with respect to the input resolution is $O(N^2C)$, while a convolutional network is $O(NC^2)$, where $N$ is the sequence length and $C$ is the dimension.
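As a rough, hypothetical illustration (the numbers below are not from the paper, just typical values), the quadratic term of self-attention quickly dominates as the feature-map resolution grows, which is exactly the dense-prediction regime:

```python
# Back-of-the-envelope comparison of the two complexity terms
# (illustrative values only, not figures from the paper).
def attn_term(n, c):
    return n ** 2 * c      # self-attention: O(N^2 * C)

def conv_term(n, c):
    return n * c ** 2      # convolution:    O(N * C^2)

C = 768
for side in (14, 28, 56):          # feature-map resolutions
    N = side * side                # sequence length
    print(f"{side}x{side}: attention ~{attn_term(N, C):.2e}, "
          f"convolution ~{conv_term(N, C):.2e}")
# Attention grows quadratically with N, so it dominates at the
# higher resolutions needed for dense prediction.
```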
The goal of this paper is to combine the advantages of CNNs into the Transformer to solve the problems above. The proposed architecture, CMT, introduces convolution for fine-grained feature extraction, adopts a stage-wise hierarchy to extract multi-scale features and reduce computation, and designs several dedicated components to capture both local and global features. Experiments on different datasets show improved performance with reduced computational cost.
Getting acquainted
Core techniques
This section skips the scaling strategy and the comparison of computational complexity; if you are interested, please refer to the original paper.

In the figure above, (a) is the classic CNN network ResNet and (b) is the current vision Transformer structure ViT. ViT directly cuts the input image into non-overlapping patches and applies a linear projection, which throws away the 2D spatial structure inside each patch. CMT therefore adopts a stem structure that uses 3x3 convolutions for downsampling and local feature extraction. In addition, to extract multi-scale features it uses a stage structure: before each stage, a 2x2 convolution + LayerNorm performs downsampling and doubles the number of channels (channels x 2). Inside each stage, a series of CMT blocks extract features, capturing both local information and long-range dependencies. Each component of the block is described in detail below.
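Before going into the block internals, here is a minimal PyTorch sketch of what such a stem and the per-stage 2x2 downsampling could look like; the module names and channel sizes are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class Stem(nn.Module):
    """3x3 convolutions: one stride-2 downsampling conv followed by
    two stride-1 convs for local feature extraction (illustrative sizes)."""
    def __init__(self, in_ch=3, out_ch=32):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1), nn.GELU(),
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1), nn.GELU(),
        )

    def forward(self, x):
        return self.convs(x)

class PatchAggregation(nn.Module):
    """2x2 stride-2 convolution + LayerNorm used before each stage:
    halves the spatial size and doubles the channels."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.norm = nn.LayerNorm(out_ch)

    def forward(self, x):
        x = self.proj(x)                  # (B, C', H/2, W/2)
        x = x.permute(0, 2, 3, 1)         # channels-last for LayerNorm
        x = self.norm(x)
        return x.permute(0, 3, 1, 2)      # back to (B, C', H/2, W/2)

x = torch.randn(1, 3, 224, 224)
feat = Stem()(x)                          # (1, 32, 112, 112)
feat = PatchAggregation(32, 64)(feat)     # (1, 64, 56, 56)
```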
① LPU (local perception unit):

Rotation and translation are common data augmentation methods in CNNs, used to increase the model's translation invariance. ViT, however, usually uses absolute positional encoding, where each patch corresponds to a unique position code, so translation invariance cannot be obtained through data augmentation, and local correlation and structural information are ignored. LPU therefore uses a 3x3 depth-wise convolution with a residual connection to further extract local features.
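A minimal PyTorch sketch of this idea, i.e. $\text{LPU}(X) = \text{DWConv}_{3\times3}(X) + X$; the class name and settings are my own, since the official code is not released:

```python
import torch
import torch.nn as nn

class LocalPerceptionUnit(nn.Module):
    """LPU(X) = DWConv_3x3(X) + X: the depth-wise conv adds local structure,
    the residual connection keeps the original features."""
    def __init__(self, dim):
        super().__init__()
        # groups=dim makes the 3x3 convolution depth-wise (per-channel)
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x):                   # x: (B, C, H, W)
        return x + self.dwconv(x)

lpu = LocalPerceptionUnit(64)
out = lpu(torch.randn(1, 64, 56, 56))       # same shape as the input
```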
② LMHSA (lightweight multi-head self-attention):

The original self-attention operation is: $\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$

LMHSA first uses a $k \times k$ depth-wise convolution with stride $k$ to reduce the spatial dimension of $K$ and $V$ (to $\frac{n}{k^2} \times d_k$), and adds a learnable relative position bias $B$ to the attention logits: $\text{LightAttn}(Q,K,V)=\text{softmax}\left(\frac{QK'^\top}{\sqrt{d_k}}+B\right)V'$
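A rough PyTorch sketch of this lightweight attention (my own reading of the description above; the class name, the fixed-resolution bias table, and the default values are assumptions, and the official implementation may differ):

```python
import torch
import torch.nn as nn

class LMHSA(nn.Module):
    """Lightweight MHSA sketch: K and V come from a feature map spatially
    reduced by a k x k depth-wise conv with stride k, and a learnable
    bias B is added to the attention logits."""
    def __init__(self, dim, heads=4, k=2, H=56, W=56):
        super().__init__()
        self.heads, self.scale = heads, (dim // heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        # depth-wise conv that shrinks H x W to (H/k) x (W/k)
        self.sr = nn.Conv2d(dim, dim, kernel_size=k, stride=k, groups=dim)
        n, n_r = H * W, (H // k) * (W // k)
        # learnable bias B: one (n, n/k^2) table per head (fixed resolution)
        self.bias = nn.Parameter(torch.zeros(heads, n, n_r))

    def forward(self, x, H, W):                      # x: (B, N, C), N = H*W
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.heads, C // self.heads).transpose(1, 2)
        x_ = x.transpose(1, 2).reshape(B, C, H, W)   # back to a feature map
        x_ = self.sr(x_).flatten(2).transpose(1, 2)  # (B, N/k^2, C)
        kv = self.kv(x_).reshape(B, -1, 2, self.heads, C // self.heads)
        k_, v = kv.permute(2, 0, 3, 1, 4)            # each: (B, heads, N/k^2, d)
        attn = (q @ k_.transpose(-2, -1)) * self.scale + self.bias
        attn = attn.softmax(dim=-1)
        return (attn @ v).transpose(1, 2).reshape(B, N, C)

m = LMHSA(dim=64, heads=4, k=2, H=56, W=56)
y = m(torch.randn(1, 56 * 56, 64), H=56, W=56)       # (1, 3136, 64)
```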

③ IRFFN (inverted residual feed-forward network):

This module is similar to the inverted residual block in MobileNetV2 (channel dimensions are narrow at both ends and wide in the middle); it only changes the position of the residual connection (the shortcut is placed around the depth-wise convolution) and uses GELU as the activation function.
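A minimal sketch of such an IRFFN, assuming a 4x expansion ratio (the exact ratio and any normalization layers in the paper may differ):

```python
import torch
import torch.nn as nn

class IRFFN(nn.Module):
    """Inverted-residual FFN sketch: 1x1 expansion, 3x3 depth-wise conv with
    a shortcut around it, then 1x1 projection back; GELU activations."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.expand = nn.Sequential(nn.Conv2d(dim, hidden, 1), nn.GELU())
        self.dwconv = nn.Sequential(
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden), nn.GELU())
        self.project = nn.Conv2d(hidden, dim, 1)

    def forward(self, x):                  # x: (B, C, H, W)
        h = self.expand(x)
        h = h + self.dwconv(h)             # shortcut around the depth-wise conv
        return self.project(h)

ffn = IRFFN(64)
out = ffn(torch.randn(1, 64, 56, 56))      # shape preserved
```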

Finally, the authors design several network variants of different scales, somewhat like a combination of ResNet and EfficientNet:
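For reference, the three components above can be chained into one CMT block roughly as X = LPU(X), then X = X + LMHSA(LN(X)), then X = X + IRFFN(LN(X)). The sketch below reuses the LPU / LMHSA / IRFFN classes defined earlier and reflects my reading of the block wiring, not the official implementation:

```python
import torch
import torch.nn as nn

class CMTBlock(nn.Module):
    """One CMT block: LPU, then pre-norm LMHSA, then pre-norm IRFFN,
    the latter two with block-level residual connections.
    Reuses the LPU / LMHSA / IRFFN sketches defined above."""
    def __init__(self, dim, heads=4, k=2, H=56, W=56):
        super().__init__()
        self.lpu = LocalPerceptionUnit(dim)
        self.norm1 = nn.LayerNorm(dim)
        self.attn = LMHSA(dim, heads=heads, k=k, H=H, W=W)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = IRFFN(dim)

    def forward(self, x):                        # x: (B, C, H, W)
        B, C, H, W = x.shape
        x = self.lpu(x)
        seq = x.flatten(2).transpose(1, 2)       # (B, N, C) for attention
        seq = seq + self.attn(self.norm1(seq), H, W)
        x = seq.transpose(1, 2).reshape(B, C, H, W)
        # LayerNorm over channels: move them last, normalize, move back
        x_norm = self.norm2(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        return x + self.ffn(x_norm)

block = CMTBlock(64)
out = block(torch.randn(1, 64, 56, 56))          # (1, 64, 56, 56)
```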
Experimental results
Only the ImageNet results are shown here; the ablation studies and transfer tasks are not covered.

The figure above compares CNN and Transformer methods. CMT improves performance while keeping fewer parameters and lower computational cost. Another point worth noting: all previous Transformer-based methods perform worse than EfficientNet, whereas CMT surpasses EfficientNet with even lower computational cost.
Review
CMT comes from Huawei's Noah's Ark Lab. The article analyzes the drawbacks of Transformer models in visual tasks: 1. loss of local spatial structure information; 2. many parameters and heavy computation; 3. inability to extract fine-grained and multi-scale features. These are exactly what CNNs are good at, so the authors introduce several classic CNN designs into the Transformer and add convolution modules to improve efficiency and performance, for example: the residual structure of ResNet, the depth-wise convolution of MobileNet, and the parameter scaling strategy of EfficientNet.
Although the method itself is not highly novel academically and leans toward engineering-oriented network design, it undoubtedly pushes forward the progress of Transformers in the CV field; after all, who would refuse a model that is both fast and good?
Code
The official code is not open source yet, but there are already many good reproductions on GitHub. Here is one:
https://github.com/FlyEgle/CMT-pytorch