Deep Learning: Pay Attention to MLPs
2022-07-28 06:10:00 【Food to doubt life】
Preface
I joined a large tech company not long ago, and I already miss the carefree days at school. I received my first optimization task soon after joining, and the high uncertainty of an algorithm role does make one a little anxious. From my experience so far, the performance of existing deep learning models depends heavily on data quality; only once the data quality is sufficient do the various modifications to the model come into play.
This article summarizes gMLP, a ViT-like network structure, from the paper Pay Attention to MLPs.
This article is a personal summary; if you spot a mistake, please point it out. It assumes the reader is already familiar with ViT.
gMLP
Input and output
gMLP is a ViT-like structure. Its input is still a set of image patches (an image is cut into several patches), and its output is a matrix formed by stacking several vectors (tokens). For example, if each token has dimension $L$ and there are $N$ tokens, the output is an $N \times L$ matrix. Pooling or a similar operation then converts this matrix into the final feature vector.
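The shapes described above can be sketched with NumPy as follows. This is only an illustration: the image size, the patch size, the token dimension, and the random projection standing in for the network are all assumptions and do not come from the paper.

```python
import numpy as np

def patchify(image, patch):
    """Cut an (H, W, C) image into non-overlapping patches and flatten each one."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    rows, cols = h // patch, w // patch
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    blocks = image.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    return blocks.reshape(rows * cols, patch * patch * c)

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))    # hypothetical input image
patches = patchify(image, patch=8)          # N = 16 patches, each of length 8*8*3 = 192

L = 64                                      # hypothetical token dimension
proj = rng.standard_normal((patches.shape[1], L)) * 0.02  # stand-in for the network
tokens = patches @ proj                     # N x L output matrix
feature = tokens.mean(axis=0)               # average pooling over tokens

print(patches.shape, tokens.shape, feature.shape)  # (16, 192) (16, 64) (64,)
```

Average pooling is used here only as one possible choice for the "pooling and other operations" mentioned in the text.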
Structure
Compared with ViT, gMLP removes the position embedding and self-attention. It is built by stacking several basic units; the structure of one unit is shown in the figure below.
Let the input matrix (the Input Embeddings in the figure) be an $n \times d$ matrix $X$. The gMLP unit can then be simplified as follows (normalization and similar operations are omitted):
$$\begin{aligned} Z&=\delta(XU)\\ \hat Z&=s(Z)\\ Y&=\hat Z V+X \end{aligned}$$
$U$ and $V$ are learnable matrices (i.e. FC layers, whose dimensions you choose yourself), $\delta$ is the activation function, and $s(Z)$ is the Spatial Gating Unit in the figure. Its structure can be expressed as
$$\begin{aligned} f_{W,b}(Z)&=WZ+b\\ s(Z)&=Z\odot f_{W,b}(Z) \end{aligned}$$
Let $Z$ be an $n \times d$ matrix. Then $W$ is an $n \times n$ matrix (so the shape of the input matrix is unchanged) and $b$ is an $n$-dimensional vector ($WZ+b$ means that $b$ is added along the first dimension of $WZ$, i.e. $b_i$ is added to every element of row $i$). To keep training stable, $W$ is initialized close to 0 (it seems to be initialized from a uniform distribution over $[-1, 1]$) and $b$ is initialized to 1. At initialization we therefore have $f_{W,b}(Z)\approx 1$, so the Spatial Gating Unit is approximately an identity mapping.
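The identity-at-initialization behavior can be checked numerically. Below is a minimal NumPy sketch of $s(Z)=Z\odot(WZ+b)$; the matrix sizes and the near-zero range used for $W$ are assumptions chosen for illustration, not values taken from the paper.

```python
import numpy as np

def spatial_gating_unit(Z, W, b):
    """s(Z) = Z * (W Z + b), with b[i] added to every element of row i."""
    gate = W @ Z + b[:, None]    # (n, n) @ (n, d) plus a per-token bias -> (n, d)
    return Z * gate

rng = np.random.default_rng(0)
n, d = 6, 4                                  # hypothetical sizes
Z = rng.standard_normal((n, d))
W = rng.uniform(-1e-4, 1e-4, size=(n, n))    # assumed near-zero initialization range
b = np.ones(n)                               # bias initialized to 1

out = spatial_gating_unit(Z, W, b)
print(np.max(np.abs(out - Z)))               # very small: the unit is nearly an identity map
```

With $W$ exactly zero the gate is exactly 1 and the output equals $Z$; as training moves $W$ away from zero, the gate starts to mix information across tokens.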
Furthermore, the authors found it more effective to split $Z$ along the channel dimension into two parts $Z_1$ and $Z_2$ (with shapes $n \times d_1$ and $n \times d_2$, where $d_1+d_2=d$; the element-wise product below requires $d_1=d_2$). The operation $s(Z)$ then becomes
$$s(Z)=Z_1\odot f_{W,b}(Z_2)$$
The output of the Spatial Gating Unit is passed through the matrix $V$ to adjust its dimension and is then added to the input of the unit.
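Putting the pieces together, one simplified unit (with the channel split, and with normalization omitted as in the equations) might look like the NumPy sketch below. The GELU activation, the matrix sizes, and the hidden width `d_z` are assumptions made for illustration.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU; the choice of activation is an assumption
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def gmlp_unit(X, U, V, W, b):
    """One simplified gMLP unit: Z = act(XU), gate with the split SGU, Y = Z_hat V + X."""
    Z = gelu(X @ U)                        # channel projection: (n, d) -> (n, d_z)
    Z1, Z2 = np.split(Z, 2, axis=1)        # split along channels into two equal halves
    Z_hat = Z1 * (W @ Z2 + b[:, None])     # mix tokens in Z2, then gate Z1 element-wise
    return Z_hat @ V + X                   # project back to d and add the residual

rng = np.random.default_rng(0)
n, d, d_z = 6, 8, 16                       # hypothetical sizes
X = rng.standard_normal((n, d))
U = rng.standard_normal((d, d_z)) * 0.1
V = rng.standard_normal((d_z // 2, d)) * 0.1
W = np.zeros((n, n))                       # zero W and unit b make the gate exactly 1
b = np.ones(n)

Y = gmlp_unit(X, U, V, W, b)
print(Y.shape)  # (6, 8)
```

Because the gate multiplies only $Z_1$, the matrix $V$ takes `d_z // 2` input channels rather than `d_z`.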
Personal understanding
Here are the points I consider important. First, stepping back from gMLP itself to the broader direction: ever since reading about the structure of ViT, I have felt that self-attention is not the main source of its performance, and gMLP, MLP-Mixer, and MetaFormer (which I will summarize later) all support this view. Furthermore, these works have one thing in common: a ViT-like model first uses an FC layer to mix the information within a single token, and then uses a specific operation to mix information across multiple tokens (MLP-Mixer uses its token mixer, MetaFormer uses pooling, and gMLP uses the Spatial Gating Unit).
These two operations can be seen as a decomposition of the convolution operation in a CNN. In a CNN, when multiple feature maps are fed to a convolution kernel, the kernel fuses the spatial and channel information of those feature maps. In a ViT-like structure, if each token is viewed as a feature map, an FC layer first mixes the spatial information within a single feature map (token), and the information across multiple feature maps (tokens), i.e. the channel information, is fused afterwards. In addition, the matrix operations in a ViT-like structure are equivalent to convolutions (with a kernel whose resolution equals the size of the feature map). Combining the analysis above, ViT-like structures look more like variants of the CNN model; we seem to have found a more effective CNN structure.
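The claimed equivalence between a matrix operation and a convolution whose kernel is as large as the feature map can be checked on a toy example. The sketch below uses a naive "valid" cross-correlation written for this purpose; the sizes and variable names are hypothetical.

```python
import numpy as np

def conv2d_valid(x, kernels):
    """Naive 'valid' cross-correlation of one (H, W) map with (K, kh, kw) kernels."""
    H, W = x.shape
    K, kh, kw = kernels.shape
    out = np.empty((K, H - kh + 1, W - kw + 1))
    for k in range(K):
        for i in range(H - kh + 1):
            for j in range(W - kw + 1):
                out[k, i, j] = np.sum(x[i:i + kh, j:j + kw] * kernels[k])
    return out

rng = np.random.default_rng(0)
H, Wd, K = 4, 4, 5                           # hypothetical sizes
x = rng.standard_normal((H, Wd))             # one feature map (a token viewed in 2D)
fc_weight = rng.standard_normal((K, H * Wd))

fc_out = fc_weight @ x.reshape(-1)                        # FC layer on the flattened map
conv_out = conv2d_valid(x, fc_weight.reshape(K, H, Wd))   # kernels as large as the map

print(conv_out.shape)                          # (5, 1, 1)
print(np.allclose(fc_out, conv_out[:, 0, 0]))  # True
```

Each row of the FC weight acts as one full-size kernel, so the convolution has a single valid position and produces one value per output channel.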
Returning to gMLP: the Spatial Gating Unit formally resembles attention, but the authors did not introduce it to select which features are important. Its purpose is to mix information across multiple tokens. Still, this way of mixing is rather odd, and there should be more interpretable ways to do it.