【ARXIV2205】Inception Transformer
2022-07-28 05:01:00 【AI frontier theory group @ouc】

Paper: https://arxiv.org/abs/2205.12956
Code: https://github.com/sail-sg/iFormer
1、 Research motivation
The core idea of this paper is still to combine attention with CNNs (in the style of Google's Inception), but the starting point is different. The authors examine the Fourier spectrum of ViT and find that the energy is concentrated mostly in the low-frequency part, so high-frequency information such as edges and corners in the image is under-extracted. (This is easy to understand: attention is essentially a global weighted sum, so local information is lost.)
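The low-pass intuition can be checked on a toy example. The sketch below is not the authors' analysis: it assumes a white-noise "feature map" and replaces real attention with a softmax-weighted sum whose weights depend only on spatial distance (real attention weights depend on queries and keys). The helper names `low_freq_energy_ratio` and `smooth_token_mixing` are made up for illustration.

```python
import numpy as np

def low_freq_energy_ratio(feature_map, radius):
    """Fraction of spectral energy within `radius` bins of the zero frequency."""
    spectrum = np.fft.fftshift(np.fft.fft2(feature_map))  # zero frequency moved to the centre
    energy = np.abs(spectrum) ** 2
    h, w = feature_map.shape
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    return float(energy[dist <= radius].sum() / energy.sum())

def smooth_token_mixing(feature_map, sigma):
    """Toy stand-in for attention: a softmax-weighted sum over ALL positions.

    Assumption: weights depend only on spatial distance, not on content.
    """
    h, w = feature_map.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)                  # (N, 2)
    sq_dist = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)   # (N, N)
    logits = -sq_dist / (2.0 * sigma ** 2)
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)                        # row-wise softmax
    return (weights @ feature_map.ravel()).reshape(h, w)

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))      # white noise: energy spread over all frequencies
y = smooth_token_mixing(x, sigma=2.0)  # every output token is a global weighted sum

ratio_before = low_freq_energy_ratio(x, radius=3)
ratio_after = low_freq_energy_ratio(y, radius=3)
print(f"low-frequency energy share before mixing: {ratio_before:.2f}")
print(f"low-frequency energy share after mixing:  {ratio_after:.2f}")
```

After the mixing step the share of energy near the zero frequency rises, which is the toy analogue of the spectrum the authors report for ViT.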

2、 Inception mixer
The main contribution of this paper is an improved attention module, the Inception mixer. The authors' idea is very direct: as shown in the figure below, add high-frequency branches to the existing ViT structure.

(1) High-frequency branch. The high-frequency branch comes from the classic Inception module (as shown in the figure below), where each linear layer is essentially a 1x1 convolution.

The input feature has $C$ channels, which are divided into $C_h$ and $C_l$ channels, used to extract high-frequency and low-frequency features respectively. The high-frequency features are split evenly into two parts, $X_{h1}$ and $X_{h2}$ (each with $C_h/2$ channels), and processed as follows:
$$Y_{h1}=\text{FC}(\text{MaxPool}(X_{h1}))$$
$$Y_{h2}=\text{DwConv}(\text{FC}(X_{h2}))$$
(2) Low-frequency branch. The low-frequency branch is conventional MHSA. Because the other branches bring extra computation, this branch first applies average pooling, then computes MHSA, and finally upsamples the result.
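To see why pooling before MHSA saves computation, it helps to count the entries of the attention matrix, which grows with the square of the token count. The snippet below is plain arithmetic; the 56x56 feature map and the stride of 2 are assumed values for illustration, not figures taken from the paper.

```python
def attention_matrix_entries(height, width, pool_stride=1):
    """Number of entries in the N x N attention matrix after optional pooling."""
    tokens = (height // pool_stride) * (width // pool_stride)
    return tokens * tokens

# Assumed example: a 56x56 feature map, with and without stride-2 average pooling.
full = attention_matrix_entries(56, 56)
pooled = attention_matrix_entries(56, 56, pool_stride=2)
print(f"without pooling: {full} entries")
print(f"with stride-2 pooling: {pooled} entries")
print(f"reduction factor: {full // pooled}")
```

In general, pooling with stride $s$ divides the token count by $s^2$ and the attention matrix size by $s^4$.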
Finally, the high-frequency and low-frequency outputs are concatenated directly: $Y_c=\text{Concat}(Y_l, Y_{h1}, Y_{h2})$.
However, the direct interpolation used in the low-frequency upsampling makes adjacent tokens overly smooth and similar. To solve this, the authors add a DwConv: $Y=\text{FC}(Y_c+\text{DwConv}(Y_c))$
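The three paths and the fusion step can be put together in a short PyTorch sketch. This is a reading of the formulas in this post, not the official implementation (see the repository linked at the top for that). The class name `InceptionMixerSketch`, the 3x3 kernel sizes, the nearest-neighbour upsampling, and the use of `nn.MultiheadAttention` for MHSA are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InceptionMixerSketch(nn.Module):
    """Illustrative mixer: two high-frequency paths, one low-frequency path, then fusion."""

    def __init__(self, dim, high_dim, num_heads=2, pool_stride=2):
        super().__init__()
        assert 0 < high_dim < dim and high_dim % 2 == 0
        self.half = high_dim // 2            # channels of X_h1 and of X_h2
        self.low_dim = dim - high_dim        # channels of X_l
        assert self.low_dim % num_heads == 0
        # Y_h1 = FC(MaxPool(X_h1)); FC is a 1x1 convolution. Kernel size 3 is assumed.
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.fc_h1 = nn.Conv2d(self.half, self.half, kernel_size=1)
        # Y_h2 = DwConv(FC(X_h2)); depthwise means groups == channels.
        self.fc_h2 = nn.Conv2d(self.half, self.half, kernel_size=1)
        self.dw_h2 = nn.Conv2d(self.half, self.half, 3, padding=1, groups=self.half)
        # Low-frequency path: average pooling -> MHSA -> upsampling.
        self.pool = nn.AvgPool2d(pool_stride) if pool_stride > 1 else nn.Identity()
        self.attn = nn.MultiheadAttention(self.low_dim, num_heads, batch_first=True)
        # Fusion: Y = FC(Y_c + DwConv(Y_c)).
        self.fuse_dw = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.fuse_fc = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):                    # x: (B, C, H, W)
        b, _, h, w = x.shape
        x_h1, x_h2, x_l = torch.split(x, [self.half, self.half, self.low_dim], dim=1)
        y_h1 = self.fc_h1(self.maxpool(x_h1))
        y_h2 = self.dw_h2(self.fc_h2(x_h2))
        pooled = self.pool(x_l)
        hp, wp = pooled.shape[-2:]
        tokens = pooled.flatten(2).transpose(1, 2)               # (B, N, C_l)
        attended, _ = self.attn(tokens, tokens, tokens, need_weights=False)
        y_l = attended.transpose(1, 2).reshape(b, self.low_dim, hp, wp)
        y_l = F.interpolate(y_l, size=(h, w), mode="nearest")    # back to H x W
        y_c = torch.cat([y_l, y_h1, y_h2], dim=1)                # Y_c
        return self.fuse_fc(y_c + self.fuse_dw(y_c))             # Y

mixer = InceptionMixerSketch(dim=64, high_dim=32, num_heads=2, pool_stride=2)
x = torch.randn(1, 64, 14, 14)
y = mixer(x)
print(y.shape)
```

The output keeps the input shape, so a block like this could stand in for the attention layer of a Transformer block; normalization, residual connections, and the feed-forward network are left out here.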
3、 Overall framework

The authors adopt the current mainstream four-stage Transformer framework and build three models: small, base, and large. The details are shown in the table below. As the table shows, in the shallow stages of the network the high-frequency (conv) channels account for a large proportion and the low-frequency (MHSA) channels for a small proportion; in the deep stages it is the other way around. This is broadly consistent with the conclusions of current mainstream methods that combine convolution with Transformers. The authors also acknowledge in their conclusion that the ratio between high-frequency and low-frequency channels has to be set empirically, which is a limitation of the method.
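The channel split per stage can be sketched as a small helper. The stage widths and ratios below are placeholders chosen only to show the trend described above (high-frequency share shrinking with depth); they are not the values from the paper's table, and `split_channels` is a hypothetical name.

```python
def split_channels(total_channels, high_ratio):
    """Return (C_h, C_l), keeping C_h even so it can be halved into X_h1 and X_h2."""
    high = int(round(total_channels * high_ratio))
    high -= high % 2                      # force an even number of high-frequency channels
    return high, total_channels - high

# Placeholder values for illustration only, NOT the paper's configuration.
stage_widths = [96, 192, 320, 384]
high_ratios = [0.75, 0.5, 0.25, 0.125]

splits = [split_channels(c, r) for c, r in zip(stage_widths, high_ratios)]
for stage, (c_h, c_l) in enumerate(splits, start=1):
    print(f"stage {stage}: C_h={c_h}, C_l={c_l}")
```

Since the ratios are set by hand, a helper of this kind makes the dependence on that manual choice explicit, which is the limitation the authors point out.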

The method achieves very good performance on image classification. The authors also ran experiments on object detection and semantic segmentation; for details, please refer to the paper.
