[NLP] A brief overview of recent work on sparse neural networks
2022-07-03 04:10:00 【Muasci】
Preface
The ICLR 2019 best paper 《THE LOTTERY TICKET HYPOTHESIS: FINDING SPARSE, TRAINABLE NEURAL NETWORKS》 put forward the lottery ticket hypothesis: "dense, randomly-initialized, feed-forward networks contain subnetworks (winning tickets) that—when trained in isolation—reach test accuracy comparable to the original network in a similar number of iterations."
I have also (roughly) catalogued work in this field in [Literature Reading] Sparsity in Deep Learning: Pruning and growth for efficient inference and training in NN.
This post aims to further outline the latest work in this field. In addition, by "when to sparsify", such work can be divided into: sparsify after training, sparsify during training, and sparse training. I pay more attention to the latter two (since they are end-to-end), so this post will (probably) focus more on work from those two subcategories.
《THE LOTTERY TICKET HYPOTHESIS: FINDING SPARSE, TRAINABLE NEURAL NETWORKS》ICLR 19

Steps:
- Initialize a fully-connected neural network with weights θ, and choose a pruning rate p
- Train for a certain number of steps, obtaining θ1
- Rank the weights of θ1 by magnitude, prune away the fraction p with the smallest magnitudes, and reset the remaining weights to their original initialization values
- Keep training the resulting sparse subnetwork (see the sketch below)
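Below is a minimal PyTorch sketch of one such prune-and-reset round, based on my reading of the steps above; the toy model, the 20% rate, and the helper `magnitude_masks` are illustrative, not taken from the official code:

```python
# One lottery-ticket round: train, magnitude-prune, reset survivors to init.
import copy
import torch
import torch.nn as nn

def magnitude_masks(model: nn.Module, prune_rate: float) -> dict:
    """Per-layer 0/1 masks keeping the (1 - prune_rate) largest-magnitude weights."""
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() < 2:                    # common practice: leave biases alone
            continue
        k = int(p.numel() * prune_rate)
        thr = p.abs().view(-1).kthvalue(k).values if k > 0 else -1.0
        masks[name] = (p.abs() > thr).float()
    return masks

model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))
theta_0 = copy.deepcopy(model.state_dict())     # remember the original init

# ... train `model` for some number of steps here, yielding theta_1 ...

masks = magnitude_masks(model, prune_rate=0.2)  # cut the p smallest weights
model.load_state_dict(theta_0)                  # reset survivors to their init
with torch.no_grad():
    for name, p in model.named_parameters():
        if name in masks:
            p.mul_(masks[name])                 # zero out the pruned weights

# ... keep training; re-apply the masks after every optimizer step so that
#     pruned weights stay at zero ...
```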
Code:
- tf:https://github.com/google-research/lottery-ticket-hypothesis
- pt:https://github.com/rahulvigneswaran/Lottery-Ticket-Hypothesis-in-Pytorch
《Rigging the Lottery: Making All Tickets Winners》ICML 20

Steps:
- Initialize the neural network and sparsify it up front. Candidate pre-pruning schemes:
  - uniform: every layer gets the same sparsity rate;
  - other schemes: the more parameters a layer has, the higher its sparsity, so that the number of remaining parameters is roughly balanced across layers;
- During training, every ΔT steps, update the sparse connectivity using two operations, drop and grow:
  - drop: prune a certain fraction of the smallest-magnitude active weights
  - grow: among the pruned weights, re-activate the same number of weights whose gradients have the largest magnitudes
- Schedule for the drop/grow fraction, cosine-annealed over training:

  $f_{\text{decay}}(t;\ \alpha, T_{\text{end}}) = \frac{\alpha}{2}\left(1 + \cos\left(\frac{t\pi}{T_{\text{end}}}\right)\right)$

where α is the initial update fraction, typically set to 0.3.
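A sketch of one such connectivity update on a single weight matrix, using the cosine-annealed fraction above; the exact treatment of just-dropped weights in the official implementation may differ (this is my simplification):

```python
import math
import torch

def update_fraction(t: int, t_end: int, alpha: float = 0.3) -> float:
    """Cosine-annealed drop/grow fraction; alpha is the initial update ratio."""
    return alpha / 2 * (1 + math.cos(t * math.pi / t_end))

@torch.no_grad()
def rigl_update(weight: torch.Tensor, grad: torch.Tensor,
                mask: torch.Tensor, frac: float) -> None:
    """Drop the smallest active weights; grow the largest-gradient inactive ones."""
    n_update = int(frac * mask.sum().item())
    if n_update == 0:
        return
    # drop: among active weights, deactivate the n_update smallest magnitudes
    drop_scores = weight.abs().masked_fill(mask == 0, float("inf"))
    drop_idx = drop_scores.view(-1).topk(n_update, largest=False).indices
    mask.view(-1)[drop_idx] = 0.0
    # grow: among inactive weights, activate the n_update largest gradients
    grow_scores = grad.abs().masked_fill(mask == 1, float("-inf"))
    grow_idx = grow_scores.view(-1).topk(n_update).indices
    mask.view(-1)[grow_idx] = 1.0
    weight.view(-1)[grow_idx] = 0.0     # newly grown connections start at zero
    weight.mul_(mask)                   # keep all inactive weights at zero
```

In training this would run every ΔT steps with `frac = update_fraction(t, t_end)`.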
Features:
- End-to-end
- Supports growing connections back (grow)
Code:
- tf:https://github.com/google-research/rigl
- pt:https://github.com/varun19299/rigl-reproducibility
《Do We Actually Need Dense Over-Parameterization? In-Time Over-Parameterization in Sparse Training》ICML 21
Proposes the In-Time Over-Parameterization (ITOP) rate as a metric (a rendering of it is given below).
The claim: provided the exploration is reliable, the higher this rate, i.e., the more parameters the model has explored during training, the better the final performance.
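A rough rendering of the metric (my notation, not the paper's verbatim definition): with $m^{(t)}$ the sparse mask at the $t$-th connectivity update, the rate is the fraction of the dense model's parameters that have been active at least once up to time $T$:

```latex
% In-time over-parameterization rate (notation mine)
R_{\mathrm{ITOP}}
  = \frac{\bigl|\,\bigcup_{t \le T} \{\, i : m_i^{(t)} = 1 \,\}\bigr|}
         {\bigl|\,\Theta_{\mathrm{dense}}\,\bigr|}
```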
So the training procedure is similar to 《Rigging the Lottery: Making All Tickets Winners》, but it specifically uses Sparse Evolutionary Training (SET), from 《Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science》, whose grow step is random. This choice is because:
"SET activates new weights in a random fashion which naturally considers all possible parameters to explore. It also helps to avoid the dense over-parameterization bias introduced by the gradient-based methods e.g., The Rigged Lottery (RigL) (Evci et al., 2020a) and Sparse Networks from Scratch (SNFS) (Dettmers & Zettlemoyer, 2019), as the latter utilize dense gradients in the backward pass to explore new weights."
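Swapping RigL's gradient-based grow for SET's random grow only changes the grow step; a minimal sketch (the function name is mine):

```python
import torch

@torch.no_grad()
def random_grow(mask: torch.Tensor, n_grow: int) -> None:
    """SET-style grow: re-activate n_grow currently-inactive weights at random."""
    inactive = (mask.view(-1) == 0).nonzero(as_tuple=True)[0]
    pick = inactive[torch.randperm(inactive.numel())[:n_grow]]
    mask.view(-1)[pick] = 1.0
```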
Features:
- Emphasizes exploration of the parameter space
Code: https://github.com/Shiweiliuiiiiiii/In-Time-Over-Parameterization
《EFFECTIVE MODEL SPARSIFICATION BY SCHEDULED GROW-AND-PRUNE METHODS》ICLR 2022
Approach:
CYCLIC GAP:
- Divide the model into k parts, each of which (?) is randomly sparsified at ratio r.
- At the start, make one of the parts dense.
- Run for k steps, with T epochs between consecutive steps. Each step sparsifies the currently dense part (magnitude-based) and makes the next part dense (see the sketch after this list).
- After the k steps, sparsify the remaining dense part, then fine-tune.
PARALLEL GAP: k parts and k nodes; a different part is kept dense on each node.
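An illustrative cyclic GaP loop; how layers are grouped into the k parts and where the training loop sits are my assumptions, not the paper's exact recipe:

```python
import torch
import torch.nn as nn

def magnitude_mask(p: torch.Tensor, rate: float) -> torch.Tensor:
    """0/1 mask keeping the (1 - rate) largest-magnitude entries of p."""
    k = int(p.numel() * rate)
    if k == 0:
        return torch.ones_like(p)
    thr = p.abs().view(-1).kthvalue(k).values
    return (p.abs() > thr).float()

model = nn.Sequential(*[nn.Linear(256, 256) for _ in range(8)])
k, r = 4, 0.75                        # k parts, sparsity ratio r
mods = list(model)
parts = [[p for m in mods[i::k] for p in m.parameters() if p.dim() >= 2]
         for i in range(k)]           # assumed round-robin partitioning

masks = {}                            # id(param) -> mask; dense params have none
with torch.no_grad():
    for part in parts[1:]:            # part 0 starts dense
        for p in part:
            masks[id(p)] = (torch.rand_like(p) > r).float()  # random sparsity
            p.mul_(masks[id(p)])

for step in range(k):
    dense, nxt = parts[step % k], parts[(step + 1) % k]
    # ... train for T epochs; after every optimizer step, re-apply each mask
    #     in `masks` so the sparse parts stay sparse ...
    with torch.no_grad():
        for p in dense:               # magnitude-prune the dense part
            masks[id(p)] = magnitude_mask(p, r)
            p.mul_(masks[id(p)])
    if step < k - 1:
        for p in nxt:                 # the next part becomes dense again
            masks.pop(id(p), None)
# every part is now sparse: fine-tune with all masks enforced
```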
Features:
- Expands the exploration space
- Includes a WMT14 De-En translation experiment
Code: https://github.com/Shiweiliuiiiiiii/In-Time-Over-Parameterization
《Learning to Win Lottery Tickets in BERT Transfer via Task-agnostic Mask Training》 NAACL 2022
Approach: train the pruning mask itself (rather than magnitude-pruning) on the pre-training task, so that the resulting subnetwork transfers across downstream tasks.
Features:
- The mask is trainable, rather than magnitude-based
- Task-Agnostic Mask Training (shall we look at task-specific mask training next?)
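Mechanically, "trainable mask" usually means frozen pretrained weights plus real-valued scores that are binarized in the forward pass, with a straight-through estimator carrying gradients to the scores. A minimal sketch, with all names and hyperparameters invented for illustration (not TAMT's actual code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinear(nn.Module):
    """Linear layer with frozen weights and a trainable 0/1 mask."""
    def __init__(self, in_f: int, out_f: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_f, in_f) * 0.02,
                                   requires_grad=False)  # pretrained, frozen
        self.bias = nn.Parameter(torch.zeros(out_f), requires_grad=False)
        # real-valued scores; positive init so the mask starts all-ones
        self.score = nn.Parameter(torch.full((out_f, in_f), 0.01))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        hard = (self.score > 0).float()                  # binarize the scores
        # straight-through: forward uses the hard mask, grads flow to score
        mask = hard + self.score - self.score.detach()
        return F.linear(x, self.weight * mask, self.bias)

layer = MaskedLinear(768, 768)
out = layer(torch.randn(4, 768))
out.pow(2).mean().backward()          # stand-in for the pre-training loss
assert layer.score.grad is not None   # only the mask scores receive gradients
```

In practice a sparsity penalty or per-layer budget is also needed to push the mask toward the target sparsity.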
Code: https://github.com/llyx97/TAMT
《How fine can fine-tuning be? Learning efficient language models》AISTATS 2020
Method:
- L0-close fine-tuning: preliminary experiments found that for some layers and modules, the parameters after downstream fine-tuning differ little from the original pre-trained ones, so they are excluded from fine-tuning (see the sketch after this list)
- Sparsification as fine-tuning: train a 0/1 mask for each task
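A hedged sketch of the first idea: run a short pilot fine-tune, measure how far each module drifted from its pretrained weights, and freeze those that barely moved. The drift measure and threshold here are invented for illustration:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def drift(module: nn.Module, ref: dict, prefix: str) -> float:
    """Mean absolute parameter change of a module vs. the pretrained weights."""
    deltas = [(p - ref[f"{prefix}.{n}"]).abs().mean().item()
              for n, p in module.named_parameters()]
    return sum(deltas) / max(len(deltas), 1)

model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 768))
ref = {k: v.clone() for k, v in model.state_dict().items()}

# ... pilot fine-tuning on the downstream task goes here ...

for name, module in model.named_children():
    if list(module.parameters()) and drift(module, ref, name) < 1e-4:
        for p in module.parameters():
            p.requires_grad_(False)   # barely moved: exclude from fine-tuning
```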
Features:
- Trains masks to achieve the effect of fine-tuning the pretrained model
Code: none
However, https://github.com/llyx97/TAMT also contains mask-training code, which may serve as a reference.