[NLP] A brief overview of recent work on sparse neural networks
2022-07-03 04:06:00 【Muasci】
Preface
The ICLR 2019 best paper 《THE LOTTERY TICKET HYPOTHESIS: FINDING SPARSE, TRAINABLE NEURAL NETWORKS》 proposed the lottery ticket hypothesis: "dense, randomly-initialized, feed-forward networks contain subnetworks (winning tickets) that—when trained in isolation—reach test accuracy comparable to the original network in a similar number of iterations."
The author also (rather roughly) took notes on this line of work in [Reading Notes] Sparsity in Deep Learning: Pruning and growth for efficient inference and training in NN.
This post gives a further brief overview of the latest work in this area. In addition, by "when to sparsify", the work can be divided into: sparsify after training, sparsify during training, and sparse training. The author is more interested in the latter two (since they are end-to-end), so this post will (probably) focus more on those two sub-categories.
《THE LOTTERY TICKET HYPOTHESIS: FINDING SPARSE, TRAINABLE NEURAL NETWORKS》ICLR 19
Steps:
- Initialize a dense neural network θ and set the pruning rate p
- Train for a certain number of steps to obtain θ1
- Based on the magnitudes of the weights in θ1, prune the fraction p with the smallest magnitudes, and reset the remaining weights to their original initialization values
- Continue training (a minimal sketch of one such round is given below)
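Below is a minimal sketch of one pruning round under this procedure, assuming a PyTorch-style model; `lottery_ticket_round`, `train_fn`, and the global-threshold pruning are illustrative assumptions, not the paper's code.

```python
import copy
import torch

def lottery_ticket_round(model, train_fn, prune_rate=0.2):
    # Save the original initialization theta_0.
    init_state = copy.deepcopy(model.state_dict())

    # Train for a number of steps to obtain theta_1.
    train_fn(model)

    # Globally prune the `prune_rate` fraction of smallest-magnitude weights.
    all_weights = torch.cat([p.detach().abs().flatten() for p in model.parameters()])
    threshold = torch.quantile(all_weights, prune_rate)
    masks = {name: (p.detach().abs() > threshold).float()
             for name, p in model.named_parameters()}

    # Reset surviving weights to their original initialization, zero out the
    # pruned ones, and continue training on the resulting subnetwork.
    with torch.no_grad():
        for name, p in model.named_parameters():
            p.copy_(init_state[name] * masks[name])
    return masks
```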
Code:
- tf:https://github.com/google-research/lottery-ticket-hypothesis
- pt:https://github.com/rahulvigneswaran/Lottery-Ticket-Hypothesis-in-Pytorch
《Rigging the Lottery: Making All Tickets Winners》ICML 20
Steps:
- Initialize the neural network and prune it up front. Options for the initial sparsity distribution:
- uniform: every layer has the same sparsity;
- other schemes: layers with more parameters are sparsified more aggressively, so that the number of remaining parameters is roughly balanced across layers;
- During training, every ΔT steps, update the sparse connectivity. Two update operations are used, drop and grow (a sketch of one such update follows below):
- drop: prune a certain fraction of the smallest-magnitude active weights
- grow: among the pruned weights, revive the same fraction, choosing those with the largest gradient magnitudes
- Decay schedule for the drop/grow fraction: the update fraction is annealed with a cosine schedule, f_decay(t) = (α/2)(1 + cos(πt / T_end)),
where α is the initial update fraction, typically set to 0.3.
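A rough sketch of one drop/grow update for a single weight tensor is shown below; `rigl_update`, the per-tensor mask, and the surrounding training loop are assumptions rather than the authors' implementation.

```python
import torch

def rigl_update(weight, grad, mask, update_frac):
    """Drop the smallest-magnitude active weights and grow the
    largest-gradient inactive ones, keeping the sparsity level fixed."""
    n_update = int(update_frac * int(mask.sum().item()))

    # Remember which connections were inactive before this update, so that
    # freshly dropped connections are not immediately regrown.
    was_inactive = (mask == 0)

    # drop: among active weights, remove the smallest magnitudes.
    active_scores = torch.where(mask.bool(), weight.abs(),
                                torch.full_like(weight, float('inf')))
    drop_idx = torch.topk(active_scores.flatten(), n_update, largest=False).indices
    mask.view(-1)[drop_idx] = 0.0

    # grow: among previously inactive weights, revive those with the largest
    # gradient magnitudes, initializing the revived weights to zero.
    inactive_scores = torch.where(was_inactive, grad.abs(),
                                  torch.full_like(grad, -float('inf')))
    grow_idx = torch.topk(inactive_scores.flatten(), n_update, largest=True).indices
    mask.view(-1)[grow_idx] = 1.0
    weight.data.view(-1)[grow_idx] = 0.0
    return mask
```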
Features:
- end-to-end
- supports grow
Code:
- tf:https://github.com/google-research/rigl
- pt:https://github.com/varun19299/rigl-reproducibility
《Do We Actually Need Dense Over-Parameterization? In-Time Over-Parameterization in Sparse Training》ICML 21
Proposes the In-Time Over-Parameterization (ITOP) metric: the ratio of the number of parameters that have been reliably explored (i.e., activated at least once) during sparse training to the total number of parameters of the dense model.
The claim is that, given reliable exploration, the higher this metric (i.e., the more parameters the model explores during training), the better the final performance.
The training procedure is therefore similar to that of 《Rigging the Lottery: Making All Tickets Winners》, except that it uses Sparse Evolutionary Training (SET) from 《Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science》, whose grow step is random. This choice is made because:
SET activates new weights in a random fashion which naturally considers all possible parameters to explore. It also helps to avoid the dense over-parameterization bias introduced by the gradient-based methods e.g., The Rigged Lottery (RigL) (Evci et al., 2020a) and Sparse Networks from Scratch (SNFS) (Dettmers & Zettlemoyer, 2019), as the latter utilize dense gradients in the backward pass to explore new weights
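A minimal sketch (with assumed names) of tracking the ITOP rate, i.e., the fraction of all parameters that have been activated at least once during sparse training:

```python
import torch

class ITOPTracker:
    def __init__(self, masks):
        # `masks` maps parameter names to their current binary masks.
        self.explored = {name: m.bool().clone() for name, m in masks.items()}

    def update(self, masks):
        # Mark every parameter that is active under the current masks.
        for name, m in masks.items():
            self.explored[name] |= m.bool()

    def rate(self):
        explored = sum(e.sum().item() for e in self.explored.values())
        total = sum(e.numel() for e in self.explored.values())
        return explored / total
```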
Features:
- degree of exploration
Code: https://github.com/Shiweiliuiiiiiii/In-Time-Over-Parameterization
《EFFECTIVE MODEL SPARSIFICATION BY SCHEDULED GROW-AND-PRUNE METHODS》ICLR 2022
Approach:
CYCLIC GAP:
- Divide the model into k partitions; each partition(?) is randomly sparsified at ratio r.
- Initially, make one of the partitions dense.
- Run for k steps, with an interval of T epochs between steps. At each step, sparsify the currently dense partition (magnitude-based) and make the next partition dense.
- After k steps, sparsify the last remaining dense partition, then fine-tune.
PARALLEL GAP: k partitions, k nodes; each partition is dense on a different node. (A sketch of the cyclic schedule follows below.)
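A schematic sketch of the cyclic grow-and-prune schedule; the partition handling and the helpers `sparsify_by_magnitude`, `densify`, and `train_epochs` are hypothetical placeholders, not the paper's implementation.

```python
def cyclic_gap(model, partitions, sparsity_r, k, epochs_per_step,
               sparsify_by_magnitude, densify, train_epochs):
    # Assumes every partition has already been randomly sparsified at ratio r;
    # start by growing the first partition back to dense.
    densify(model, partitions[0])

    for step in range(k):
        train_epochs(model, epochs_per_step)
        dense_part = partitions[step % len(partitions)]
        next_part = partitions[(step + 1) % len(partitions)]
        # Sparsify the currently dense partition by weight magnitude,
        # then make the next partition dense.
        sparsify_by_magnitude(model, dense_part, sparsity_r)
        densify(model, next_part)

    # Finally, sparsify the last dense partition and fine-tune.
    sparsify_by_magnitude(model, partitions[k % len(partitions)], sparsity_r)
    train_epochs(model, epochs_per_step)
    return model
```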
Features:
- enlarges the exploration space
- includes experiments on WMT14 De-En translation
Code: https://github.com/Shiweiliuiiiiiii/In-Time-Over-Parameterization
《Learning to Win Lottery Tickets in BERT Transfer via Task-agnostic Mask Training》 NAACL 2022
Approach: instead of magnitude pruning, a binary mask over the pre-trained BERT weights is trained directly on task-agnostic (pre-training) objectives, and the resulting subnetwork is then transferred to downstream tasks.
Features:
- the mask is trainable, rather than magnitude-based
- Task-Agnostic Mask Training (a task-specific variant might be worth looking at next?)
Code: https://github.com/llyx97/TAMT
《How fine can fine-tuning be? Learning efficient language models》AISTATS 2020
Method:
- L0-close fine-tuning: preliminary experiments show that for certain layers and modules, the parameters after downstream fine-tuning barely differ from the original pre-trained parameters, so these are excluded from the fine-tuning procedure proposed in the paper
- Sparsification as fine-tuning: train a 0/1 mask for each task (a mask-training sketch follows below)
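A minimal sketch of learning a 0/1 mask over frozen pre-trained weights with a straight-through estimator; the module name, score initialization, and threshold are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    def __init__(self, linear: nn.Linear, threshold: float = 0.5):
        super().__init__()
        self.weight = linear.weight          # frozen pre-trained weights
        self.weight.requires_grad_(False)
        self.bias = linear.bias
        if self.bias is not None:
            self.bias.requires_grad_(False)
        # Real-valued mask scores; only these are trained.
        self.scores = nn.Parameter(torch.full_like(self.weight, 0.9))
        self.threshold = threshold

    def forward(self, x):
        probs = torch.sigmoid(self.scores)
        # Hard 0/1 mask in the forward pass, identity gradient to the scores
        # in the backward pass (straight-through estimator).
        hard = (probs > self.threshold).float()
        mask = hard + probs - probs.detach()
        return nn.functional.linear(x, self.weight * mask, self.bias)
```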
Features:
- trains a mask to achieve the effect of fine-tuning the pre-trained model
Code: none
However, https://github.com/llyx97/TAMT also contains mask-training code that may serve as a reference.