ViT Slimming -- Joint Structure Search and Patch Selection
2022-06-09 00:18:00 【Law-Yao】
Paper: https://arxiv.org/abs/2201.00814
GitHub: https://github.com/Arnav0400/ViT-Slim
Methods

ViT Slimming combines structure search with patch selection: on one hand it achieves multi-dimensional, multi-scale structural compression, and on the other it removes the length redundancy of patches (tokens), so both the parameter count and the computation cost are reduced effectively. Specifically, a learnable soft mask is attached to each tensor flowing through the ViT, the mask is multiplied with the tensor during the forward pass, and an L1 regularization constraint on the soft masks is added to the loss function:

$$\min_{W,\ \zeta}\ \mathcal{L}(W,\zeta)+\lambda\sum_{\zeta_i\in\zeta}\|\zeta_i\|_1$$

where $\zeta=\{\zeta_1,\zeta_2,\dots\}$ is the collection of soft-mask vectors, each applied multiplicatively to its corresponding intermediate tensor, and $\lambda$ controls the strength of the sparsity constraint.
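As a minimal sketch of this objective, assuming the mask parameters are all named `zeta` as in the code further below (the function name `search_loss` and the coefficient `l1_lambda` are illustrative, not from the repository):

```python
import torch.nn.functional as F

def search_loss(model, images, labels, l1_lambda=1e-4):
    # task loss computed with the soft masks active in the forward pass
    ce = F.cross_entropy(model(images), labels)
    # L1 penalty on all soft-mask parameters drives unimportant entries toward zero
    l1 = sum(p.abs().sum() for n, p in model.named_parameters() if 'zeta' in n)
    return ce + l1_lambda * l1
```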
- Structure search: first, a differentiable soft mask is introduced into the MHSA module to impose L1 sparsity on the feature size of each attention head (sparsity over the number of heads can be constructed in the same way). Consistent with the code below, the mask is multiplied element-wise onto the query, key and value of every head:

  $$\hat{q}^{\,h}_{l}=\zeta^{h}_{l}\odot q^{h}_{l},\qquad \hat{k}^{\,h}_{l}=\zeta^{h}_{l}\odot k^{h}_{l},\qquad \hat{v}^{\,h}_{l}=\zeta^{h}_{l}\odot v^{h}_{l}$$

  where $\zeta^{h}_{l}$ denotes the soft mask of the $h$-th head in the $l$-th layer. Second, a differentiable soft mask is introduced into the FFN to impose L1 sparsity on its intermediate (hidden) size:

  $$\mathrm{FFN}(X)=\big(\zeta_{l}\odot\sigma(XW_{1})\big)W_{2}$$

  where $\zeta_{l}$ denotes the corresponding soft mask (a minimal FFN sketch is given after this list).
- Patch selection: a soft mask is also defined on the input/output tensor of every Transformer layer to eliminate low-importance patches, and the mask values first pass through a Tanh to keep them from growing numerically. In addition, a patch eliminated in a shallow layer must also be eliminated in all deeper layers, otherwise the computation becomes inconsistent.
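The snippet below is a minimal sketch of the FFN-side masking described above, assuming a standard two-layer MLP; the class name `SparseMlp` and its constructor arguments are illustrative and not taken from the repository:

```python
import torch
import torch.nn as nn

class SparseMlp(nn.Module):
    """Two-layer FFN with a differentiable soft mask on the intermediate dimension."""
    def __init__(self, dim, hidden_dim, drop=0.0):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)
        self.drop = nn.Dropout(drop)
        # one mask value per hidden unit; its L1 norm is penalized during search
        self.zeta = nn.Parameter(torch.ones(1, 1, hidden_dim))

    def forward(self, x):              # x: (B, N, dim)
        x = self.act(self.fc1(x))      # (B, N, hidden_dim)
        x = x * self.zeta              # scale each hidden unit by its mask value
        x = self.drop(x)
        return self.drop(self.fc2(x))
```

Hidden units whose mask value ends up below the search threshold can then be dropped, shrinking the FFN's intermediate width.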
The `SparseAttention` module in the linked code base wires both the structural mask (`zeta`) and the patch mask (`patch_zeta`) into the attention block:

```python
import torch
import torch.nn as nn

# `Attention` is the base multi-head attention module of the code base
# (it provides qkv, proj, scale, attn_drop, proj_drop and num_patches).

class SparseAttention(Attention):
    def __init__(self, attn_module, head_search=False, uniform_search=False):
        super().__init__(attn_module.qkv.in_features, attn_module.num_heads, True,
                         attn_module.scale, attn_module.attn_drop.p, attn_module.proj_drop.p)
        self.is_searched = False
        self.num_gates = attn_module.qkv.in_features // self.num_heads
        if head_search:
            # one mask value per attention head
            self.zeta = nn.Parameter(torch.ones(1, 1, self.num_heads, 1, 1))
        elif uniform_search:
            # one mask value per feature dimension, shared across heads
            self.zeta = nn.Parameter(torch.ones(1, 1, 1, 1, self.num_gates))
        else:
            # an independent mask value for every head and every feature dimension
            self.zeta = nn.Parameter(torch.ones(1, 1, self.num_heads, 1, self.num_gates))
        self.searched_zeta = torch.ones_like(self.zeta)
        # patch mask initialized at 3 so that tanh(3) ≈ 1, i.e. all patches start "on"
        self.patch_zeta = nn.Parameter(torch.ones(1, self.num_patches, 1) * 3)
        self.searched_patch_zeta = torch.ones_like(self.patch_zeta)
        self.patch_activation = nn.Tanh()

    def forward(self, x):
        # during search the tanh-squashed soft mask is used; after compression the hard 0/1 mask
        z_patch = self.searched_patch_zeta if self.is_searched else self.patch_activation(self.patch_zeta)
        x *= z_patch
        B, N, C = x.shape
        z = self.searched_zeta if self.is_searched else self.zeta
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)  # 3, B, H, N, d(C/H)
        qkv *= z  # structural soft mask, broadcast onto q, k and v
        q, k, v = qkv[0], qkv[1], qkv[2]  # make torchscript happy (cannot use tensor as tuple)  # B, H, N, d
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)
        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        x = self.proj(x)
        x = self.proj_drop(x)
        return x

    def compress(self, threshold_attn):
        # binarize the structural mask: keep only dimensions whose zeta reaches the threshold
        self.is_searched = True
        self.searched_zeta = (self.zeta >= threshold_attn).float()
        self.zeta.requires_grad = False

    def compress_patch(self, threshold_patch=None, zetas=None):
        # patch decisions are computed outside (e.g. by global ranking) and passed in as a numpy array
        self.is_searched = True
        zetas = torch.from_numpy(zetas).reshape_as(self.patch_zeta)
        self.searched_patch_zeta = zetas.float().to(self.zeta.device)
        self.patch_zeta.requires_grad = False
```

The differentiable soft masks are trained jointly with the network weights, and the weights are initialized from pre-trained parameters, so the overall time cost of the search training is relatively low. After the search training, the soft-mask values are sorted by magnitude and the unimportant network weights or patches are eliminated, which realizes structure search and patch selection. Once the reduced structure has been extracted, an additional re-training step is needed to recover the model accuracy.
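A minimal sketch of this post-search step, assuming a global threshold is taken from the sorted structural mask values so that a fixed fraction of dimensions is kept (the helper name `compress_model` and the `keep_ratio` budget are illustrative, not part of the repository):

```python
import torch

@torch.no_grad()
def compress_model(model, keep_ratio=0.5):
    # collect all structural mask values and derive a global threshold from the sorted values
    all_zeta = torch.cat([m.zeta.detach().flatten()
                          for m in model.modules() if hasattr(m, 'zeta')])
    k = max(int(keep_ratio * all_zeta.numel()), 1)
    threshold = torch.sort(all_zeta, descending=True).values[k - 1]
    # binarize every module's mask against that threshold
    for m in model.modules():
        if hasattr(m, 'compress'):
            m.compress(threshold)
    # patch masks would be binarized analogously via compress_patch(...),
    # and the pruned model is then re-trained to recover accuracy
```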

Experimental results


For more discussion of Transformer model compression and optimization/acceleration, refer to the author's related articles.