How can ViT acquire 2D spatial priors? University of Mannheim, Hong Kong PolyU & Alibaba propose SP-ViT, which learns 2D spatial priors for vision Transformers
2022-08-03 15:44:00 【I love computer vision】
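This post covers the paper "SP-ViT: Learning 2D Spatial Priors for Vision Transformers", in which the University of Mannheim, The Hong Kong Polytechnic University, and Alibaba (Team hua) propose SP-ViT, a model that learns 2D spatial priors for vision Transformers.
Details:
Paper: https://arxiv.org/abs/2206.07662
Code: not open-sourced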
01
Abstract
Recently, Transformers have shown great potential in image classification and have established state-of-the-art results on the ImageNet benchmark. However, compared with CNNs, Transformers converge more slowly and, lacking spatial inductive biases, are prone to overfitting in low-data regimes. Such spatial inductive biases can be especially beneficial because the 2D structure of the input image is not well preserved in Transformers.
In this work, the authors propose Spatial Prior-enhanced Self-Attention (SP-SA), a new variant of vanilla self-attention (SA) tailored to vision Transformers. Spatial priors (SPs) are a family of inductive biases proposed in the paper that highlight certain groups of spatial relations. Unlike convolutional inductive biases, which are hard-coded to focus exclusively on local neighborhoods, the proposed SPs are learned by the model itself and take a variety of spatial relations into account.
Specifically, the attention score at each head is computed with emphasis on certain kinds of spatial relations, and these learned spatial foci can complement one another. Based on SP-SA, the authors propose the SP-ViT family, which consistently outperforms other ViT models with similar GFLOPs or parameter counts. The largest model, SP-ViT-L, achieves 86.3% Top-1 accuracy with nearly 50% fewer parameters than the previous state-of-the-art model.
02
Motivation
Following their dominance in natural language processing (NLP) tasks, Transformers have recently achieved exciting results in image classification. At the core of every Transformer is the so-called self-attention mechanism, which captures content relations between all input tokens globally and selectively attends to the relevant ones. Self-attention is more flexible than convolution, which is hard-coded to capture local dependencies only.
This may give Transformer models greater capacity and greater potential for computer vision tasks. As recent work reports, Transformers outperform convolutional neural networks (CNNs) when they are pre-trained on large datasets, or when they are aided by knowledge distillation or pseudo-labels from a pre-trained CNN.
Nevertheless, CNNs generalize better and converge faster than Vision Transformers (ViTs). This suggests that certain types of inductive bias used in convolution can still be beneficial for vision tasks. Many recent studies have therefore incorporated convolutional inductive biases into ViTs in various ways and demonstrated performance gains. The effectiveness of convolution rests on the fact that neighboring pixels in natural images are highly correlated, but other highly correlated content outside the local receptive field of a convolution filter may be ignored.
The authors therefore propose to exploit several kinds of inductive bias at once, as humans do in daily life: for example, when we see part of a horizontal object, we naturally look along its direction rather than confining our gaze to a local neighborhood.
In this work, the authors introduce a new family of inductive biases, called spatial priors (SPs), into ViT through an extension of vanilla self-attention (SA) called Spatial Prior-enhanced Self-Attention (SP-SA). At each attention head, SP-SA highlights a group of 2D spatial relations according to the relative positions of key and query patches, so that attention scores are computed in the context of those spatial relations. Because suitable spatial priors are hard to construct and validate by hand, the authors introduce the concept of learnable spatial priors.
Specifically, the authors impose only weak prior knowledge on the model: different relative distances should be treated differently. They do not force the model to favor any type of spatial relation in advance, whether local or non-local. Effective spatial priors should be discovered by the model itself during training. To this end, an SP is built from a family of mathematical functions, called spatial relation functions, that map relative coordinates to abstract scores.
To find the ideal spatial relation functions, the authors parameterize them with neural networks and optimize them jointly with the ViT. The model can therefore learn spatial priors similar to those induced by convolution, but it can also learn spatial relations over longer distances.
As shown in panel (b) of the figure above, different attention heads exhibit different, complementary patterns that handle different types of spatial relations. At the same time, a global receptive field is approximated by considering all heads together.
In fact, convolutional inductive bias can be regarded as a special kind of spatial prior: it first divides spatial relations by relative coordinates into two types, local-neighborhood and non-local, and then learns priors for the local neighborhood while ignoring non-local relations. For comparison, some existing methods that combine this convolutional bias with ViT are shown in panel (a) of the figure above. Building on local windows, CSWin-Transformer and Swin-Transformer propose new variants by changing the aspect ratio or shifting the window center, respectively. The window designs are intuitive, but the main idea of focusing on local relations remains unchanged.
In summary, the paper makes the following contributions:
The paper proposes a family of inductive biases for ViT, called spatial priors (SPs), that focus on different types of spatial relations. SPs generalize the locality-restricted convolutional bias to both local and non-local correlations. Parameterized by neural networks, SPs are learned automatically during training, without any hard-coded regional preference imposed in advance.
The authors propose SP-SA, a novel self-attention variant that automatically learns useful 2D spatial inductive biases. Based on SP-SA, they further build a ViT variant called SP-ViT, which achieves state-of-the-art results on the ImageNet benchmark without extra data.
The proposed SPs are compatible with various input sizes because they are derived from the relative coordinates between each pair of patches rather than from absolute positions. When fine-tuned at higher resolution, SP-ViT also performs better than the image classification baseline models.
03
Method
3.1 Revisiting Multi-Head Self-Attention
Self-attention takes an input sequence and outputs a new sequence of the same length, in which each element is computed as a weighted sum of linearly transformed input elements.
Each weight coefficient, or attention score, is based on the semantic dependency between two elements. It is obtained by applying the softmax function to the scaled dot product of the linearly transformed elements, which this article refers to as the content score.
Multi-head self-attention applies several such operations in parallel to learn different types of interdependencies. The final output is obtained by concatenating the outputs of all heads and applying a linear transformation.
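As a point of reference, the computation described above can be sketched in plain Python. This is a minimal illustration of standard scaled dot-product attention, not the authors' code; the helper names (`matmul`, `softmax`, `self_attention`, `multi_head_self_attention`) and the list-of-rows matrix layout are assumptions made for this example.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention; x has one row per token."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    # Content scores: scaled dot products between every query and every key.
    scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_k) for kj in k]
              for qi in q]
    attn = [softmax(row) for row in scores]
    # Each output token is a weighted sum of the value vectors.
    return matmul(attn, v), attn


def multi_head_self_attention(x, heads, w_o):
    """heads is a list of (w_q, w_k, w_v) triples, one per head."""
    outputs = [self_attention(x, *head)[0] for head in heads]
    # Concatenate the per-head outputs token by token, then project with w_o.
    concat = [sum((out[i] for out in outputs), []) for i in range(len(x))]
    return matmul(concat, w_o)
```

Each row of `attn` sums to one, so every output token is a convex combination of the value vectors. The per-head projections are the only thing that differs between heads in this sketch.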
3.2 Spatial Prior-enhanced Self-Attention
Having observed that inductive biases toward certain types of spatial relations are likely to benefit Transformers, the authors propose an extension of self-attention that incorporates learnable 2D spatial priors (SPs), called Spatial Prior-enhanced Self-Attention (SP-SA). Embedding SP-SA into ViT yields the paper's SP-ViT, whose structure is shown in the figure above.
Each SP forms a spatial context for computing attention scores. It is derived from the spatial relation between the coordinates of a pair of input elements, that is, the relative position between the key and query patches in ViT. An SP therefore has the same shape as the attention scores and is integrated into the attention computation simply by multiplication.
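The multiplicative integration described above can be illustrated with a short sketch. It assumes that the prior is multiplied element-wise with the content scores before the softmax, which is how this article describes it; the function name `sp_attention_weights` and the toy inputs are hypothetical and not taken from the paper.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def sp_attention_weights(content_scores, prior):
    """Scale each content score by the matching prior entry, then normalize.

    content_scores and prior are both N x N (queries by keys).
    """
    assert len(content_scores) == len(prior)
    weighted = [[e * w for e, w in zip(e_row, w_row)]
                for e_row, w_row in zip(content_scores, prior)]
    return [softmax(row) for row in weighted]


# Toy usage: identical content scores, with a prior that favors the diagonal.
scores = [[1.0] * 3 for _ in range(3)]
diagonal_prior = [[2.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
attention = sp_attention_weights(scores, diagonal_prior)
```

In this sketch a prior entry of zero sets the score to zero before the softmax; it does not mask the key out, so the key still receives some attention.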
3.2.1 2D Spatial Prior
Taking query patch i as the reference point, the relative coordinates of image patch j can be obtained. A shared mapping, called the spatial relation function, is then applied to every pair of query and key patches.
The outputs together form the so-called 2D SP matrix Ω.
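To make the construction of Ω concrete, the sketch below computes relative coordinates on a patch grid and applies one shared function to every pair. The row-major patch ordering, the helper names, and the `inverse_distance` relation function are illustrative assumptions; the paper learns its relation function rather than fixing one.

```python
def relative_coords(height, width):
    """rel[i][j] = (dx, dy) of key patch j relative to query patch i.

    Patches are indexed in row-major order on a height x width grid.
    """
    positions = [(x, y) for y in range(height) for x in range(width)]
    return [[(xj - xi, yj - yi) for (xj, yj) in positions]
            for (xi, yi) in positions]


def spatial_prior_matrix(height, width, relation_fn):
    """Apply one shared relation function to every query/key pair."""
    return [[relation_fn(dx, dy) for (dx, dy) in row]
            for row in relative_coords(height, width)]


def inverse_distance(dx, dy):
    """Hypothetical hand-written relation function, used only for illustration."""
    return 1.0 / (1.0 + abs(dx) + abs(dy))


omega = spatial_prior_matrix(2, 3, inverse_distance)  # 6 x 6 matrix
```

Because the function only sees relative coordinates, the same `relation_fn` can be evaluated on a grid of any size, which matches the input-size compatibility the article attributes to SPs.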
3.2.2 Learnable 2D Spatial Priors
So that the model can learn the required inductive biases automatically, the authors parameterize the mapping from 2D relative coordinates to Ω with a multilayer perceptron (MLP).
Ω thus learns to weight the attention between query and key, with a weight that depends only on their relative coordinates and is applied in a nonlinear manner, that is, before the softmax. The authors extend SP-SA to a multi-head version by adding a unique network to each head. This design follows the same motivation as multi-head self-attention and assumes that combining different SPs should improve performance.
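A rough sketch of a per-head MLP relation function follows. The two-layer shape and the hidden size of 32 follow the ablation reported later in this article; the ReLU activation, the random initialization, the output bias of 1.0, and every name in the code are assumptions made for illustration, and no training step is shown.

```python
import random


class SpatialRelationMLP:
    """Two-layer MLP mapping relative coordinates (dx, dy) to a scalar score."""

    def __init__(self, hidden_dim=32, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.gauss(0.0, 0.1) for _ in range(2)] for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        self.w2 = [rng.gauss(0.0, 0.1) for _ in range(hidden_dim)]
        # Assumption: an output bias of 1.0 keeps the initial prior close to 1.
        self.b2 = 1.0

    def __call__(self, dx, dy):
        # Hidden layer with ReLU (the activation is an assumption).
        hidden = [max(0.0, w[0] * dx + w[1] * dy + b)
                  for w, b in zip(self.w1, self.b1)]
        return sum(w * h for w, h in zip(self.w2, hidden)) + self.b2


def build_head_priors(height, width, num_heads, hidden_dim=32):
    """Evaluate one independent MLP per head on all pairwise relative coordinates."""
    positions = [(x, y) for y in range(height) for x in range(width)]
    mlps = [SpatialRelationMLP(hidden_dim, seed=head) for head in range(num_heads)]
    return [[[mlp(xj - xi, yj - yi) for (xj, yj) in positions]
             for (xi, yi) in positions]
            for mlp in mlps]


priors = build_head_priors(2, 2, num_heads=3)  # 3 heads, each a 4 x 4 matrix
```

Two patch pairs with the same relative offset always receive the same prior within a head, while different heads are free to produce different values for that offset.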
3.3 Relation to Other Methods
3.3.1 Relation to Local Windows
The square and cross-shaped windows used in previous work can in fact be seen as special cases of the spatial relation function proposed in the paper.
The parameters of this function control the shift, the window width, and the window height, respectively. Depending on how they are set, the function generates either a square window or a cross-shaped window. Both of those works adopt only a few hard-coded patterns in the network, whereas the proposed method benefits from a variety of useful 2D structures.
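A fixed window can be written as a binary function of relative coordinates, which is one way to read the comparison above. The exact formula from the paper is not reproduced in this article, so the function below, including its name and its `shift`, `half_width`, and `half_height` parameters, is a hypothetical stand-in.

```python
def window_relation(dx, dy, shift=0, half_width=1, half_height=1):
    """Return 1.0 if the key lies inside the window around the query, else 0.0.

    shift moves the window center; half_width and half_height set its extent.
    """
    inside = abs(dx - shift) <= half_width and abs(dy - shift) <= half_height
    return 1.0 if inside else 0.0


# Equal extents give a square window around the query patch.
square_hit = window_relation(1, 1)
# A wide, flat extent gives a horizontal stripe, the kind of elongated region
# that cross-shaped windows are built from.
stripe_hit = window_relation(3, 0, half_width=3, half_height=0)
```

Such a function has no learnable part: its shape is fixed by the chosen parameters, which is the contrast the article draws with learned spatial priors.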
3.3.2 Relation to PSA
Positional Self-Attention (PSA) can also be regarded as a hand-designed spatial relation function.
Its parameters are learnable and are initialized according to explicit rules so as to approximate the effect of convolution.
Its main contribution is the so-called local (convolutional) initialization, which restricts the number of heads to a perfect square and requires extra hyperparameter tuning for the initial values. To compare against this method, the authors adopt a ViT baseline with 9 heads for the ablation analysis.
3.3.3 Relation to Relative Positional Embeddings
Some earlier work also considers relative position encodings.
These methods keep a learnable embedding table from which the relative position embedding is looked up; the embedding then interacts with the query by multiplication.
Compared with 1D relative position embeddings, the proposed spatial relation function is better able to model the 2D structure of images. Extended to 2D, those methods are equivalent to applying a linear transformation to a one-hot representation of the relative distance. A one-hot representation ignores the magnitude of the distance, whereas relative coordinates do not.
More importantly, relative position encodings add the relative position embeddings to the keys and values in the attention equation. This change to self-attention is fairly straightforward but lacks a clear physical interpretation. In contrast to position embeddings, the proposed method not only provides neutral position information but also learns useful inductive biases and injects them into the model.
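The one-hot argument above can be checked with a small 1D example. The table values and helper names below are made up for illustration; the point is only that a table lookup equals a linear map applied to a one-hot vector, and that one-hot vectors carry no notion of how far apart two offsets are.

```python
def one_hot(index, size):
    """Vector of zeros with a single 1.0 at the given index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def embed_relative_offset(offset, table, max_offset):
    """Look up the learnable entry for an offset in [-max_offset, max_offset]."""
    return table[offset + max_offset]


def embed_via_one_hot(offset, table, max_offset):
    """The same lookup written as a linear map applied to a one-hot vector."""
    vec = one_hot(offset + max_offset, len(table))
    return sum(v * t for v, t in zip(vec, table))


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


max_offset = 3
table = [0.5, -0.2, 0.1, 1.0, 0.3, -0.7, 0.9]  # hypothetical learned scalars

# Offsets 1 and 2 are as far apart in one-hot space as offsets 1 and 3.
near = squared_distance(one_hot(1 + max_offset, 7), one_hot(2 + max_offset, 7))
far = squared_distance(one_hot(1 + max_offset, 7), one_hot(3 + max_offset, 7))
```

With raw relative coordinates as input, offset 2 is closer to offset 1 than offset 3 is, so a function of the coordinates can use that ordering, while a table indexed by one-hot offsets has to learn every entry independently.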
04
Experiments
The table above compares the proposed method with SOTA methods on ImageNet; the proposed method shows a performance advantage.
In the figure above, the authors use the recent Transformer Explainability method to visualize class activation maps of the target class for a few images, illustrating the behavior of SP-ViT. Whereas the DeiT model activates only a small part of the target-class region, for example the head of the "Australian parrot", the fur of the "Egyptian cat", or the jaw of the "American alligator", the proposed SP-ViT model activates a wider area of the target class.
As shown in panel (a) of the figure above, replacing multiple SA layers with SP-SA improves Top-1 accuracy over the DeiT baseline. In general, performance improves as more layers are replaced. For the model with 12 layers in total, the best performance is reached when 10 layers are replaced. When SP-SA replaces all SA layers except the last one, performance drops slightly.
The authors attribute this to the classification token being processed only by the last layer in that setting, so class-specific features are not fully extracted. They further study a deeper model (16 layers in total) in panel (b) and find a similar trend: performance is best when the layers from the first up to the second-to-last are replaced.
The table shows an ablation on ImageNet-100 of the effect of inserting the classification token after a given layer. Inserting the classification token late does have a positive effect on the classification results.
The authors find that a simple 2-layer MLP works well as the spatial relation function. In the table above, they study the effect of the hidden dimension on performance. Increasing the hidden dimension from 16 to 32 improves Top-1 accuracy by 0.4%, but increasing it further does not improve the results. A hidden dimension of 32 gives the best result.
The table above compares the proposed method with other methods based on relative position; the proposed method shows a clear advantage.
To further validate the importance of the proposed spatial relation function, the table above compares different ways of letting relative spatial information interact multiplicatively. The results show that a strongly nonlinear function is necessary for learning effective spatial priors.
To verify the benefit of combining multiple SPs, the authors compare SP-SA with two variants: one uses a single SP per layer, and the other learns one shared SP for the entire network. According to the results in the table above, sharing one SP across the entire network gives better results than learning a single SP per layer. However, the setting in which every head of every layer learns its own unique SP performs best, which demonstrates the benefit of combining different SPs.
05
Conclusion
In this paper, the authors propose a variant of vanilla self-attention (SA) named Spatial Prior-enhanced Self-Attention (SP-SA) to equip vision Transformers with automatically learned spatial priors. Based on SP-SA, they further propose SP-ViT and demonstrate the effectiveness of the method experimentally.
The proposed SP-ViT models of different sizes establish state-of-the-art results among models trained only on ImageNet-1K. For example, SP-ViT-M improves accuracy by 0.8% over the previous state-of-the-art LV-ViT-M. SP-SA may stimulate more research on designing and using appropriate inductive biases for vision Transformers.
Finally, the learned spatial priors can be binarized and used to design more efficient Transformers. Despite their excellent performance, vision Transformers rely on extensive data augmentation, and the attention mechanism is computationally less efficient than convolution, so further research is still needed to improve ViT.
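▊ About the author
Research area: operator of the FightingCV public account; works on multimodal content understanding, focusing on tasks that combine the visual and language modalities and on promoting real-world applications of vision-language models.
Zhihu / public account: FightingCV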