[NLP] A brief overview of recent work on sparse neural networks
2022-07-03 04:10:00 【Muasci】
Preface
The ICLR 2019 best paper 《THE LOTTERY TICKET HYPOTHESIS: FINDING SPARSE, TRAINABLE NEURAL NETWORKS》 proposed the lottery ticket hypothesis: "dense, randomly-initialized, feed-forward networks contain subnetworks (winning tickets) that—when trained in isolation—reach test accuracy comparable to the original network in a similar number of iterations."
I have also (loosely) catalogued work in this field in my notes on [Literature reading] Sparsity in Deep Learning: Pruning and growth for efficient inference and training in NN.
This post aims to further outline the latest work in the field. In addition, going by "when to sparsify", the work can be divided into three categories: sparsify after training, sparsify during training, and sparse training. I care more about the latter two (since they are end-to-end), so this post will (probably) focus on work in those two subcategories.
《THE LOTTERY TICKET HYPOTHESIS: FINDING SPARSE, TRAINABLE NEURAL NETWORKS》ICLR 19
Steps:
- Initialize a fully connected neural network θ and choose a pruning rate p.
- Train for a certain number of steps to obtain θ1.
- From θ1, prune the fraction p of weights with the smallest magnitudes, and reset the remaining weights to their original initialization values.
- Continue training.
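A minimal PyTorch sketch of one train-prune-rewind round (my own illustration; `train_steps` is a hypothetical training loop, and the official repositories below are the reference implementations):

```python
import torch
import torch.nn as nn

def lottery_ticket_round(model: nn.Module, train_steps, p: float = 0.2):
    """One lottery-ticket round: train, prune the fraction p of
    smallest-magnitude weights, rewind survivors to their init values."""
    init_state = {k: v.clone() for k, v in model.state_dict().items()}  # theta_0

    train_steps(model)  # train for a fixed number of steps -> theta_1

    masks = {}
    with torch.no_grad():
        for name, param in model.named_parameters():
            if param.dim() < 2:  # skip biases / norm parameters
                continue
            k = max(1, int(p * param.numel()))
            threshold = param.abs().flatten().kthvalue(k).values
            mask = (param.abs() > threshold).float()
            param.copy_(init_state[name] * mask)  # rewind surviving weights
            masks[name] = mask
    return masks  # reapply these masks after every optimizer step when retraining
```

Repeating the round with per-round rate p for n rounds reaches overall sparsity 1 − (1 − p)^n, which corresponds to the iterative variant in the paper.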
Code:
- tf:https://github.com/google-research/lottery-ticket-hypothesis
- pt:https://github.com/rahulvigneswaran/Lottery-Ticket-Hypothesis-in-Pytorch
《Rigging the Lottery: Making All Tickets Winners》ICML 20
Steps:
- Initialize the neural network and prune it up front. Pre-pruning schemes to consider:
- uniform: every layer gets the same sparsity;
- other schemes: the more parameters a layer has, the higher its sparsity, so that the remaining parameter counts stay roughly balanced across layers;
- During training, every ΔT steps, update the sparse connectivity. Two update operations, drop and grow, are considered:
- drop: prune a certain fraction of the smallest-magnitude active weights
- grow: among the currently pruned weights, reactivate the same number, chosen by largest gradient magnitude
- drop/grow fraction schedule: the fraction of weights updated at step t is annealed by cosine decay, f(t) = (α / 2) · (1 + cos(t·π / T_end)), where α is the initial update fraction, generally set to 0.3.
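A schematic sketch of one such connectivity update on a single layer (drop by weight magnitude, grow by dense-gradient magnitude); the tensor names are mine and this is not the official implementation:

```python
import torch

def rigl_update(weight, grad, mask, update_frac):
    """One RigL-style update: drop the smallest active weights, then grow
    the same number of inactive weights with the largest |gradient|."""
    n_update = int(update_frac * mask.sum().item())
    w, g, m = weight.data.view(-1), grad.view(-1), mask.view(-1)

    # drop: among active weights, deactivate the n_update smallest |w|
    active_mag = torch.where(m.bool(), w.abs(), torch.full_like(w, float("inf")))
    drop_idx = torch.topk(active_mag, n_update, largest=False).indices
    m[drop_idx] = 0.0

    # grow: among inactive weights, activate the n_update largest |grad|
    inactive_grad = torch.where(m.bool(), torch.full_like(g, -float("inf")), g.abs())
    grow_idx = torch.topk(inactive_grad, n_update).indices
    m[grow_idx] = 1.0
    w[grow_idx] = 0.0  # newly grown connections start from zero
    return mask
```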
Characteristics:
- End-to-end
- Supports growing connections back (not just pruning)
Code:
- tf:https://github.com/google-research/rigl
- pt:https://github.com/varun19299/rigl-reproducibility
《Do We Actually Need Dense Over-Parameterization? In-Time Over-Parameterization in Sparse Training》ICML 21
Proposes the In-Time Over-Parameterization rate: the fraction of all parameters that have been activated (explored) at least once during training.
The claim: provided the exploration is reliable, the higher this rate, i.e. the more parameters the model explores during training, the better the final performance.
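This rate can be tracked by maintaining the union of all connectivity masks seen so far; a small bookkeeping sketch (the names are mine):

```python
import torch

explored = {}  # running union of every per-layer mask seen during training

def itop_rate(current_masks: dict) -> float:
    """Fraction of all weights that have been active at least once."""
    total = seen = 0
    for name, mask in current_masks.items():
        prev = explored.get(name, torch.zeros_like(mask))
        explored[name] = torch.maximum(prev, mask)  # union over time
        total += mask.numel()
        seen += int(explored[name].sum().item())
    return seen / total
```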
The training procedure is accordingly similar to 《Rigging the Lottery: Making All Tickets Winners》, but it specifically uses Sparse Evolutionary Training (SET), from 《Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science》, whose grow step is random. This choice is made because:
SET activates new weights in a random fashion which naturally considers all possible parameters to explore. It also helps to avoid the dense over-parameterization bias introduced by the gradient-based methods e.g., The Rigged Lottery (RigL) (Evci et al., 2020a) and Sparse Networks from Scratch (SNFS) (Dettmers & Zettlemoyer, 2019), as the latter utilize dense gradients in the backward pass to explore new weights
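For contrast with RigL's gradient-based growth, a sketch of SET-style random regrowth (the drop step is the same magnitude-based one; this is my schematic, not the authors' code):

```python
import torch

def set_grow(weight, mask, n_grow):
    """SET-style grow: activate n_grow currently-inactive weights chosen
    uniformly at random; no gradient information is needed."""
    m, w = mask.view(-1), weight.data.view(-1)
    inactive = (m == 0).nonzero(as_tuple=True)[0]
    picked = inactive[torch.randperm(inactive.numel())[:n_grow]]
    m[picked] = 1.0
    w[picked] = 0.0  # newly grown connections start from zero
    return mask
```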
Characteristics:
- Emphasizes exploration of the parameter space
Code: https://github.com/Shiweiliuiiiiiii/In-Time-Over-Parameterization
《EFFECTIVE MODEL SPARSIFICATION BY SCHEDULED GROW-AND-PRUNE METHODS》ICLR 2022
Approach:
CYCLIC GAP:
- Divide the model into k parts, each (?) randomly sparsified at ratio r.
- At the start, make one of the parts dense.
- Run k steps, with an interval of T epochs between steps. Each step sparsifies the currently dense part (magnitude-based) and makes the next part dense.
- After the k steps, sparsify the remaining dense part, then fine-tune.
PARALLEL GAP: with k parts and k nodes, each part is made dense on a different node, in parallel.
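A schematic of the cyclic schedule; the partitioning and pruning helpers are hypothetical placeholders, not the paper's code:

```python
def cyclic_gap(model, parts, r, k, T, train_epochs, random_prune, magnitude_prune, densify):
    """Cyclic GaP sketch: rotate which of the k partitions is dense.
    `parts` is a list of k parameter groups; all helpers are placeholders."""
    for part in parts:
        random_prune(part, r)  # every part starts randomly sparsified at ratio r
    densify(parts[0])          # ...then one part is made dense

    for i in range(k):
        train_epochs(model, T)        # train for T epochs
        magnitude_prune(parts[i], r)  # sparsify the currently dense part
        if i + 1 < k:
            densify(parts[i + 1])     # grow the next part to dense

    train_epochs(model, T)  # finally, fine-tune with all parts sparse
```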
Characteristics:
- Expands the exploration space
- Includes a WMT14 De-En translation experiment
Code: https://github.com/Shiweiliuiiiiiii/In-Time-Over-Parameterization
《Learning to Win Lottery Tickets in BERT Transfer via Task-agnostic Mask Training》 NAACL 2022
Approach: train the binary pruning mask itself on a task-agnostic objective, rather than deriving it from weight magnitudes.
Characteristics:
- The mask is learned by training, rather than magnitude-based
- Task-agnostic mask training (perhaps task-specific variants are worth a look next?)
Code: https://github.com/llyx97/TAMT
《How fine can fine-tuning be? Learning efficient language models》AISTATS 2020
Methods:
- L0-close fine-tuning: preliminary experiments found that for some layers and modules, the parameters after downstream finetuning differ little from the original pretrained ones, so those parts are excluded from finetuning
- Sparsification as fine-tuning: train a 0/1 mask for each task
Characteristics:
- Trains a mask to achieve the effect of finetuning the pretrained model
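Both this paper and TAMT above learn a binary mask over frozen pretrained weights. A generic sketch of 0/1 mask training with a straight-through estimator (my illustration of the general technique, not either paper's exact recipe):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinear(nn.Module):
    """Linear layer with frozen pretrained weights and a trainable 0/1 mask.
    The hard threshold is not differentiable, so gradients reach the
    real-valued scores through a straight-through estimator."""
    def __init__(self, pretrained: nn.Linear):
        super().__init__()
        self.weight = nn.Parameter(pretrained.weight.detach().clone(), requires_grad=False)
        self.bias = nn.Parameter(pretrained.bias.detach().clone(), requires_grad=False)
        self.scores = nn.Parameter(torch.full_like(self.weight, 0.01))  # mask logits

    def forward(self, x):
        hard = (self.scores > 0).float()                  # binary mask
        mask = hard + self.scores - self.scores.detach()  # straight-through
        return F.linear(x, self.weight * mask, self.bias)
```

Only `scores` receives gradients; after training, the retained subnetwork is the set of weights whose score passed the threshold.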
Code: none released.
However, https://github.com/llyx97/TAMT also contains mask-training code, which could perhaps serve as a reference?