When CNN meets Transformer — CMT: Convolutional Neural Networks Meet Vision Transformers
2022-07-28 19:26:00 【I'm Mr. rhubarb】
Original address
Original text and additional materials
Paper reading method
First impressions
ViT applies the Transformer architecture directly to vision, pre-training on large datasets and achieving good results. Many excellent follow-up works have improved on it and demonstrated the remarkable potential of Transformers, but they are still weaker than CNNs of comparable size (e.g. EfficientNet).
The authors argue that although the standard Transformer can capture long-range dependencies between patches, visual tasks, unlike NLP tasks, also rely heavily on the 2D structure of the image and on local spatial information within and between patches. In addition, because the Transformer uses a fixed patch size, it is difficult to capture low-level and multi-scale features, which is very harmful for dense prediction tasks (segmentation, detection). Moreover, the time and space complexity of the self-attention module with respect to the input resolution is $O(N^2C)$, while a convolutional network is $O(NC^2)$, where $N$ is the sequence length and $C$ is the dimension.
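As a rough, hypothetical illustration (the numbers below are not from the paper, just typical values), the quadratic term of self-attention quickly dominates as the feature-map resolution grows, which is exactly the dense-prediction regime:

```python
# Back-of-the-envelope comparison of the two complexity terms
# (illustrative values only, not figures from the paper).
def attn_term(n, c):
    return n ** 2 * c      # self-attention: O(N^2 * C)

def conv_term(n, c):
    return n * c ** 2      # convolution:    O(N * C^2)

C = 768
for side in (14, 28, 56):          # feature-map resolutions
    N = side * side                # sequence length
    print(f"{side}x{side}: attention ~{attn_term(N, C):.2e}, "
          f"convolution ~{conv_term(N, C):.2e}")
# Attention grows quadratically with N, so it dominates at the
# higher resolutions needed for dense prediction.
```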
The goal of this paper is to combine the advantages of CNNs into the Transformer to solve the problems above. The proposed architecture, CMT, introduces convolution for fine-grained feature extraction, adopts a stage-wise hierarchy to extract multi-scale features and reduce computation, and designs several dedicated components to capture both local and global features. Experiments on different datasets show improved performance with reduced computational cost.
Getting acquainted
Core techniques
This section skips the scaling strategy and the comparison of computational complexity; if you are interested, please refer to the original paper.

In the figure above, (a) is the classic CNN network ResNet and (b) is the current vision Transformer structure ViT. ViT directly cuts the input image into non-overlapping patches and applies a linear projection, which throws away the 2D spatial structure inside each patch. CMT therefore adopts a stem structure that uses 3x3 convolutions for downsampling and local feature extraction. In addition, to extract multi-scale features it uses a stage structure: before each stage, a 2x2 convolution + LayerNorm performs downsampling and doubles the number of channels (channels x 2). Inside each stage, a series of CMT blocks extract features, capturing both local information and long-range dependencies. Each component of the block is described in detail below.
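Before going into the block internals, here is a minimal PyTorch sketch of what such a stem and the per-stage 2x2 downsampling could look like; the module names and channel sizes are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class Stem(nn.Module):
    """3x3 convolutions: one stride-2 downsampling conv followed by
    two stride-1 convs for local feature extraction (illustrative sizes)."""
    def __init__(self, in_ch=3, out_ch=32):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1), nn.GELU(),
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1), nn.GELU(),
        )

    def forward(self, x):
        return self.convs(x)

class PatchAggregation(nn.Module):
    """2x2 stride-2 convolution + LayerNorm used before each stage:
    halves the spatial size and doubles the channels."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.norm = nn.LayerNorm(out_ch)

    def forward(self, x):
        x = self.proj(x)                  # (B, C', H/2, W/2)
        x = x.permute(0, 2, 3, 1)         # channels-last for LayerNorm
        x = self.norm(x)
        return x.permute(0, 3, 1, 2)      # back to (B, C', H/2, W/2)

x = torch.randn(1, 3, 224, 224)
feat = Stem()(x)                          # (1, 32, 112, 112)
feat = PatchAggregation(32, 64)(feat)     # (1, 64, 56, 56)
```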
① LPU (local perception unit):

Rotation and translation are common data augmentation methods in CNNs, used to increase the model's translation invariance. ViT, however, usually uses absolute positional encoding, where each patch corresponds to a unique position code, so translation invariance cannot be obtained through data augmentation, and local correlation and structural information are ignored. LPU therefore uses a 3x3 depth-wise convolution with a residual connection to further extract local features.
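A minimal PyTorch sketch of this idea, i.e. $\text{LPU}(X) = \text{DWConv}_{3\times3}(X) + X$; the class name and settings are my own, since the official code is not released:

```python
import torch
import torch.nn as nn

class LocalPerceptionUnit(nn.Module):
    """LPU(X) = DWConv_3x3(X) + X: the depth-wise conv adds local structure,
    the residual connection keeps the original features."""
    def __init__(self, dim):
        super().__init__()
        # groups=dim makes the 3x3 convolution depth-wise (per-channel)
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x):                   # x: (B, C, H, W)
        return x + self.dwconv(x)

lpu = LocalPerceptionUnit(64)
out = lpu(torch.randn(1, 64, 56, 56))       # same shape as the input
```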
② LMHSA (lightweight multi-head self-attention):

The original self-attention operation is: $\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$

LMHSA first uses a $k \times k$ depth-wise convolution with stride $k$ to reduce the spatial dimension of $K$ and $V$ (to $\frac{n}{k^2} \times d_k$), and adds a learnable relative position bias $B$ to the attention logits: $\text{LightAttn}(Q,K,V)=\text{softmax}\left(\frac{QK'^\top}{\sqrt{d_k}}+B\right)V'$
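A rough PyTorch sketch of this lightweight attention (my own reading of the description above; the class name, the fixed-resolution bias table, and the default values are assumptions, and the official implementation may differ):

```python
import torch
import torch.nn as nn

class LMHSA(nn.Module):
    """Lightweight MHSA sketch: K and V come from a feature map spatially
    reduced by a k x k depth-wise conv with stride k, and a learnable
    bias B is added to the attention logits."""
    def __init__(self, dim, heads=4, k=2, H=56, W=56):
        super().__init__()
        self.heads, self.scale = heads, (dim // heads) ** -0.5
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        # depth-wise conv that shrinks H x W to (H/k) x (W/k)
        self.sr = nn.Conv2d(dim, dim, kernel_size=k, stride=k, groups=dim)
        n, n_r = H * W, (H // k) * (W // k)
        # learnable bias B: one (n, n/k^2) table per head (fixed resolution)
        self.bias = nn.Parameter(torch.zeros(heads, n, n_r))

    def forward(self, x, H, W):                      # x: (B, N, C), N = H*W
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.heads, C // self.heads).transpose(1, 2)
        x_ = x.transpose(1, 2).reshape(B, C, H, W)   # back to a feature map
        x_ = self.sr(x_).flatten(2).transpose(1, 2)  # (B, N/k^2, C)
        kv = self.kv(x_).reshape(B, -1, 2, self.heads, C // self.heads)
        k_, v = kv.permute(2, 0, 3, 1, 4)            # each: (B, heads, N/k^2, d)
        attn = (q @ k_.transpose(-2, -1)) * self.scale + self.bias
        attn = attn.softmax(dim=-1)
        return (attn @ v).transpose(1, 2).reshape(B, N, C)

m = LMHSA(dim=64, heads=4, k=2, H=56, W=56)
y = m(torch.randn(1, 56 * 56, 64), H=56, W=56)       # (1, 3136, 64)
```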

③ IRFFN (inverted residual feed-forward network):

This module is similar to the inverted residual block in MobileNetV2 (channel dimensions are narrow at both ends and wide in the middle); it only changes the position of the residual connection (the shortcut is placed around the depth-wise convolution) and uses GELU as the activation function.
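A minimal sketch of such an IRFFN, assuming a 4x expansion ratio (the exact ratio and any normalization layers in the paper may differ):

```python
import torch
import torch.nn as nn

class IRFFN(nn.Module):
    """Inverted-residual FFN sketch: 1x1 expansion, 3x3 depth-wise conv with
    a shortcut around it, then 1x1 projection back; GELU activations."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.expand = nn.Sequential(nn.Conv2d(dim, hidden, 1), nn.GELU())
        self.dwconv = nn.Sequential(
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden), nn.GELU())
        self.project = nn.Conv2d(hidden, dim, 1)

    def forward(self, x):                  # x: (B, C, H, W)
        h = self.expand(x)
        h = h + self.dwconv(h)             # shortcut around the depth-wise conv
        return self.project(h)

ffn = IRFFN(64)
out = ffn(torch.randn(1, 64, 56, 56))      # shape preserved
```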

Finally, the authors design several network variants of different scales, somewhat like a combination of ResNet and EfficientNet:
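For reference, the three components above can be chained into one CMT block roughly as X = LPU(X), then X = X + LMHSA(LN(X)), then X = X + IRFFN(LN(X)). The sketch below reuses the LPU / LMHSA / IRFFN classes defined earlier and reflects my reading of the block wiring, not the official implementation:

```python
import torch
import torch.nn as nn

class CMTBlock(nn.Module):
    """One CMT block: LPU, then pre-norm LMHSA, then pre-norm IRFFN,
    the latter two with block-level residual connections.
    Reuses the LPU / LMHSA / IRFFN sketches defined above."""
    def __init__(self, dim, heads=4, k=2, H=56, W=56):
        super().__init__()
        self.lpu = LocalPerceptionUnit(dim)
        self.norm1 = nn.LayerNorm(dim)
        self.attn = LMHSA(dim, heads=heads, k=k, H=H, W=W)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = IRFFN(dim)

    def forward(self, x):                        # x: (B, C, H, W)
        B, C, H, W = x.shape
        x = self.lpu(x)
        seq = x.flatten(2).transpose(1, 2)       # (B, N, C) for attention
        seq = seq + self.attn(self.norm1(seq), H, W)
        x = seq.transpose(1, 2).reshape(B, C, H, W)
        # LayerNorm over channels: move them last, normalize, move back
        x_norm = self.norm2(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        return x + self.ffn(x_norm)

block = CMTBlock(64)
out = block(torch.randn(1, 64, 56, 56))          # (1, 64, 56, 56)
```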
Experimental results
Only the ImageNet results are shown here; the ablation studies and transfer tasks are not covered.

The figure above compares CNN and Transformer methods. CMT improves performance while keeping fewer parameters and lower computational cost. Another point worth noting: all previous Transformer-based methods perform worse than EfficientNet, whereas CMT surpasses EfficientNet with even lower computational cost.
Review
CMT comes from Huawei's Noah's Ark Lab. The article analyzes the drawbacks of Transformer models in visual tasks: 1. loss of local spatial structure information; 2. many parameters and heavy computation; 3. inability to extract fine-grained and multi-scale features. These are exactly what CNNs are good at, so the authors introduce several classic CNN designs into the Transformer and add convolution modules to improve efficiency and performance, for example: the residual structure of ResNet, the depth-wise convolution of MobileNet, and the parameter scaling strategy of EfficientNet.
Although the method itself is not highly novel academically and leans toward engineering-oriented network design, it undoubtedly pushes forward the progress of Transformers in the CV field; after all, who would refuse a model that is both fast and good?
Code
The official code is not open source yet, but there are already many good reproductions on GitHub. Here is one:
https://github.com/FlyEgle/CMT-pytorch