Deep Learning: Pay Attention to MLPs

2022-07-28 06:10:00 【Food to doubt life】

Contents

Preface
gMLP

Preface

I joined a large tech company not long ago, and I already miss my carefree school days. I received my first optimization task soon after joining, and the high uncertainty of an algorithm role does make me a little anxious. From my experience so far, the performance of existing deep learning models depends heavily on data quality; only once the data quality is good enough does the range of model-side techniques come into play.

This article summarizes gMLP, a ViT-like network structure. The paper is titled Pay Attention to MLPs.

This article is a personal summary; if you find a mistake, please point it out. It assumes that the reader is already familiar with ViT.

gMLP

Input and output

gMLP is a ViT-like structure. Its input is still a set of image patches (that is, an image is cut into several patches), and its output is a matrix formed by stacking several vectors (tokens). For example, if each token has dimension $L$ and there are $N$ tokens, the output is an $N \times L$ matrix. This matrix is then converted into the final feature vector through pooling and similar operations.
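To make the shapes concrete, here is a minimal NumPy sketch of the input and output described above. The image size (32 x 32 x 3), the patch size (8), the token dimension $L = 64$, the random linear embedding `W_embed`, and the use of mean pooling are all illustrative assumptions rather than values from the paper. The gMLP units that would sit between the embedding and the pooling are omitted.

```python
import numpy as np

def patchify(image, patch):
    """Cut an (H, W, C) image into non-overlapping patch x patch blocks,
    and flatten each block into one row vector."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    blocks = image.reshape(h // patch, patch, w // patch, patch, c)
    blocks = blocks.transpose(0, 2, 1, 3, 4)      # (rows, cols, patch, patch, C)
    return blocks.reshape(-1, patch * patch * c)  # (N, patch * patch * C)

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))          # hypothetical input image
patches = patchify(image, 8)                      # N = 16 patches, 192 values each

L = 64                                            # assumed token dimension
W_embed = rng.standard_normal((patches.shape[1], L)) * 0.02  # assumed linear embedding
tokens = patches @ W_embed                        # N x L token matrix

feature = tokens.mean(axis=0)                     # mean pooling over tokens -> length-L vector
print(patches.shape, tokens.shape, feature.shape)
```

Each row of `tokens` corresponds to one image patch, so the pooled `feature` summarizes all $N$ tokens in a single $L$-dimensional vector.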

Structure

Compared with ViT, gMLP removes the Position Embedding and Self Attention. It is built by stacking several basic units, and the structure of one unit is shown in the figure below.
[Figure: structure of a gMLP unit]
Let the input matrix (the Input Embeddings in the figure) be an $n \times d$ matrix $X$. The structure of a gMLP unit can then be simplified as follows (omitting Norm and similar operations):
$\begin{aligned} Z&=\delta(XU)\\ \hat Z&=s(Z)\\ Y&=\hat Z V+X \end{aligned}$

$U$ and $V$ are learnable matrices (that is, FC layers; work out their dimensions yourself), $\delta$ is the activation function, and $s(Z)$ is the Spatial Gating Unit in the figure, whose structure can be expressed as

$\begin{aligned} f_{W,b}(Z)&=WZ+b\\ s(Z)&=Z\odot f_{W,b}(Z) \end{aligned}$
Let $Z$ be an $n \times d$ matrix. Then $W$ is an $n \times n$ matrix (so it does not change the shape of the input matrix) and $b$ is an $n$-dimensional vector ($WZ + b$ means that $b_i$ is added to every element of row $i$ of $WZ$, i.e. $b$ is added along the first dimension). To keep training stable, $W$ is initialized close to 0 (it seems a uniform distribution over [-1, 1] is used) and $b$ is initialized to 1. At the start of training, $f_{W,b}(Z)\approx 1$, so $s(Z) \approx Z$ and the Spatial Gating Unit is approximately an Identity Mapping.
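The effect of this initialization can be checked with a small NumPy sketch of $s(Z)=Z\odot(WZ+b)$. The sizes `n` and `d` and the near-zero range of $\pm 10^{-3}$ for $W$ are assumptions chosen for illustration, not the paper's settings.

```python
import numpy as np

def spatial_gating_unit(Z, W, b):
    """s(Z) = Z * (W @ Z + b), where b[i] is added to every element of row i."""
    gate = W @ Z + b[:, None]   # (n, n) @ (n, d) -> (n, d), b broadcast over channels
    return Z * gate             # element-wise product

rng = np.random.default_rng(0)
n, d = 6, 4                                   # assumed: 6 tokens, 4 channels
Z = rng.standard_normal((n, d))
W = rng.uniform(-1e-3, 1e-3, size=(n, n))     # assumed near-zero initialization
b = np.ones(n)                                # bias initialized to 1

out = spatial_gating_unit(Z, W, b)
max_dev = np.abs(out - Z).max()               # how far s(Z) is from the identity
print(out.shape, max_dev)
```

With $W$ exactly zero the gate is exactly 1 and the unit returns $Z$ unchanged; with small nonzero $W$ the deviation stays small, which is the behaviour the initialization is meant to give at the start of training.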

Further, the authors found that it is more effective to split $Z$ along the channel dimension into two parts $Z_1$ and $Z_2$ (with shapes $n \times d_1$ and $n \times d_2$, where $d_1+d_2=d$; the element-wise product below requires $d_1=d_2$). The operation $s(Z)$ then becomes
$s(Z)=Z_1\odot f_{W,b}(Z_2)$

The output of the Spatial Gating Unit then passes through the matrix $V$, which restores the dimension, and is added to the input of the unit.
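Putting the pieces together, one unit with the split gating can be sketched in NumPy as below. As in the simplified formulas, Norm is omitted. The tanh approximation of GELU as the activation, the sizes `n`, `d`, `d_ffn`, and the random scales of `U` and `V` are assumptions made for this sketch.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of GELU (the activation choice is an assumption here)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def gmlp_unit(X, U, V, W, b):
    """One simplified gMLP unit: Y = s(delta(X U)) V + X, with Z split in half."""
    Z = gelu(X @ U)                         # (n, d_ffn): mixes channels within each token
    Z1, Z2 = np.split(Z, 2, axis=-1)        # two (n, d_ffn / 2) halves along channels
    gated = Z1 * (W @ Z2 + b[:, None])      # W mixes information across the n tokens
    return gated @ V + X                    # V maps back to d channels, then shortcut

rng = np.random.default_rng(0)
n, d, d_ffn = 6, 8, 16                      # assumed sizes
X = rng.standard_normal((n, d))
U = rng.standard_normal((d, d_ffn)) * 0.1
V = rng.standard_normal((d_ffn // 2, d)) * 0.1
W = np.zeros((n, n))                        # near-zero start, taken to the limit
b = np.ones(n)

Y = gmlp_unit(X, U, V, W, b)
print(Y.shape)
```

Because only $Z_1$ reaches the output of the gating unit, $V$ has $d_{ffn}/2$ rows in this sketch rather than $d_{ffn}$.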

Personal understanding

Here are some points I consider important. First, stepping back from gMLP to the bigger picture: after reading about the structure of ViT, I always felt that Self Attention was not the main source of its performance, and gMLP, MLP-Mixer, and MetaFormer (which I will summarize later) have all borne this out.

Further, these works have one thing in common: ViT-like models first use an FC layer to mix the information within a single token, and then use a specific operation to mix the information of multiple tokens (MLP-Mixer does this with its token mixer, MetaFormer with pooling, and gMLP with the Spatial Gating Unit). These two operations can be seen as a decomposition of the convolution operation in a CNN. In a CNN, when multiple feature maps are fed into a convolution kernel, the kernel fuses the spatial and channel information of those feature maps at once. In a ViT-like structure, if each token is viewed as a feature map, an FC layer first mixes the spatial information within a single feature map (token), and the information of multiple feature maps (tokens), i.e. the channel information, is then combined.

In addition, the matrix operations in a ViT-like structure can be made equivalent to convolution operations (with a convolution kernel whose size equals the size of the feature map). Putting this analysis together, ViT-like structures look more like a variant of the CNN model; we seem to have found a more effective CNN structure.
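The equivalence between an FC layer and a convolution whose kernel covers the whole feature map can be checked numerically. In the sketch below, a token of length $h \cdot w$ is reshaped into a hypothetical $h \times w$ feature map, and each column of the FC weight matrix is reshaped into an $h \times w$ kernel. The helper `conv2d_valid` and all sizes are made up for illustration.

```python
import numpy as np

def conv2d_valid(feature_map, kernel):
    """Naive 'valid' 2-D cross-correlation (what deep learning calls convolution)."""
    H, W = feature_map.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(feature_map[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
h, w, d_out = 4, 4, 3                        # assumed sizes
token = rng.standard_normal(h * w)           # one token, viewed as an h x w feature map
U = rng.standard_normal((h * w, d_out))      # FC weights: one column per output channel

fc_out = token @ U                           # ordinary FC layer

# Same result from d_out convolutions whose kernel is as large as the feature map,
# so each convolution yields a single 1 x 1 output value.
conv_out = np.array([
    conv2d_valid(token.reshape(h, w), U[:, k].reshape(h, w))[0, 0]
    for k in range(d_out)
])
print(np.allclose(fc_out, conv_out))
```

Each output channel of the FC layer is one full-size kernel applied at a single position, which is why the two computations agree.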

Returning to gMLP: the Spatial Gating Unit formally resembles Attention, but the authors' intent in introducing it was not to select which features are important; it was to mix the information of multiple tokens. That said, this way of mixing is rather odd, and there should be more interpretable ways to do it.

Copyright notice
This article was written by [Food to doubt life]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/209/202207280518198681.html