[Summary of PyTorch Optimizers]
2022-06-10 03:41:00 【Network sky (LUOC)】
Several types of optimizers in PyTorch
References:
1. https://pytorch.org/docs/stable/optim.html
2. https://ptorch.com/docs/1/optim
3. https://www.cntofu.com/book/169/docs/1.0/optim.md
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
- SGD: with momentum (Momentum) it is worth trying; with Nesterov acceleration (NAG) it is usually not worth the trouble
- ASGD: rarely used
- Adagrad: not recommended
- Adadelta: worth trying
- Rprop: not recommended
- RMSprop: recommended
- Adam: highly recommended
- Adamax: highly recommended
- NAdam: highly recommended
- SparseAdam: recommended
- AdamW: highly recommended, though still maturing
- L-BFGS: highly recommended, optional
- RAdam: highly recommended
1. SGD (Stochastic Gradient Descent)
torch.optim.SGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False)
Parameters:
--params (iterable) – an iterable of parameters to optimize, or a dict defining parameter groups
--lr (float) – learning rate
--momentum (float, optional) – momentum factor (default: 0; commonly set to 0.9 or 0.8)
--weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
--dampening (float, optional) – dampening for momentum (default: 0)
--nesterov (bool, optional) – enables Nesterov momentum (default: False)
Advantages: ① converges quickly when using mini-batches.
Disadvantages: ① randomly sampled gradients introduce noise, so the update direction is not always correct; ② cannot escape local optima.
a. SGD with momentum (Momentum):
Enabled by setting a non-zero momentum argument in torch.optim.SGD.
Advantages: speeds up convergence and gives some ability to escape local optima, alleviating the problems of plain SGD to a certain extent. Disadvantages: each update partially keeps the previous update direction, so it still inherits some of SGD's drawbacks.
b. SGD with Nesterov accelerated gradient (NAG):
Can be understood as adding a correction factor to standard momentum.
Advantages: the descent direction of the gradient is more accurate. Disadvantages: little effect on the convergence rate.
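To tie the parameters above together, here is a minimal, self-contained sketch of a training loop using SGD with momentum; the toy model, data, and hyperparameter values are illustrative assumptions, not part of the original article.

import torch
import torch.nn as nn

# Toy regression data and model, purely for illustration.
x = torch.randn(64, 10)
y = torch.randn(64, 1)
model = nn.Linear(10, 1)
criterion = nn.MSELoss()

# SGD with momentum; setting nesterov=True would switch to NAG.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for epoch in range(5):
    optimizer.zero_grad()          # clear gradients from the previous step
    loss = criterion(model(x), y)  # forward pass
    loss.backward()                # backward pass: compute gradients
    optimizer.step()               # update parameters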
2. ASGD (Averaged Stochastic Gradient Descent)
torch.optim.ASGD(params, lr=0.01, lambd=0.0001, alpha=0.75, t0=1000000.0, weight_decay=0)
--params (iterable) – an iterable of parameters to optimize, or a dict defining parameter groups
--lr (float, optional) – learning rate (default: 1e-2)
--lambd (float, optional) – decay term (default: 1e-4)
--alpha (float, optional) – power for the eta update (default: 0.75)
--t0 (float, optional) – point at which to start averaging (default: 1e6)
--weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
a. SGD with momentum (Momentum);
b. SGD with Nesterov accelerated gradient (NAG).
The advantages and disadvantages are similar to those of SGD above.
3. Adagrad (Adaptive Gradient)
torch.optim.Adagrad(params, lr=0.01, lr_decay=0, weight_decay=0, initial_accumulator_value=0, eps=1e-10)
Parameters:
--params (iterable) – an iterable of parameters to optimize, or a dict defining parameter groups
--lr (float, optional) – learning rate (default: 1e-2)
--lr_decay (float, optional) – learning rate decay (default: 0)
--weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
Adagrad adapts the learning rate independently for each model parameter: the larger the accumulated gradient, the lower the learning rate; the smaller the accumulated gradient, the higher the learning rate. It is well suited to sparse or unevenly distributed data.
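A minimal constructor-and-step sketch; the toy model and values are illustrative assumptions, not from the original text.

import torch
import torch.nn as nn

model = nn.Linear(10, 2)
# Each parameter accumulates its squared gradients, so its effective
# learning rate shrinks over time; this pairs well with sparse features.
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01, lr_decay=0, weight_decay=0)

loss = model(torch.randn(8, 10)).pow(2).mean()
loss.backward()
optimizer.step()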
4. Adadelta
torch.optim.Adadelta(params, lr=1.0, rho=0.9, eps=1e-06, weight_decay=0)
Parameters:
--params (iterable) – an iterable of parameters to optimize, or a dict defining parameter groups
--rho (float, optional) – coefficient used for computing the running average of squared gradients (default: 0.9)
--eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-6)
--lr (float, optional) – coefficient that scales the delta before it is applied to the parameters (default: 1.0)
--weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
Adadelta is an improved version of Adagrad: it adaptively constrains the learning rate while simplifying the computation, accelerates well, and trains quickly.
Advantages: avoids the very low learning rate of late training; in the early and middle stages it accelerates well and trains fast.
Disadvantages: the initial learning rate still has to be specified manually, and if the initial gradients are large the learning rate stays very small for the whole run; late in training, the model can jitter repeatedly around a local minimum, which lengthens training time.
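A short usage sketch: because Adadelta rescales each update from its own running averages, lr mostly acts as a final scale factor and the default of 1.0 is usually kept (toy model and values assumed for illustration).

import torch
import torch.nn as nn

model = nn.Linear(10, 2)
# rho controls the running average of squared gradients; lr only scales the delta.
optimizer = torch.optim.Adadelta(model.parameters(), rho=0.9, eps=1e-6)

loss = model(torch.randn(8, 10)).pow(2).mean()
loss.backward()
optimizer.step()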
5. Rprop (Resilient Backpropagation)
torch.optim.Rprop(params, lr=0.01, etas=(0.5, 1.2), step_sizes=(1e-06, 50))
Parameters:
--params (iterable) – an iterable of parameters to optimize, or a dict defining parameter groups
--lr (float, optional) – learning rate (default: 1e-2)
--etas (Tuple[float, float], optional) – a pair (etaminus, etaplus): the multiplicative decrease and increase factors (default: (0.5, 1.2))
--step_sizes (Tuple[float, float], optional) – a pair giving the minimum and maximum allowed step sizes (default: (1e-6, 50))
1. First assign an initial value to each weight's step size, and set an acceleration factor and a deceleration factor for the step-size changes.
2. During the feed-forward iterations of the network, when the sign of the error gradient stays the same across consecutive steps, apply the acceleration strategy to speed up training; when the sign changes, apply the deceleration strategy to stabilize convergence.
3. The network combines the sign of the current error gradient with a variable-step-size BP scheme; to avoid oscillation or overflow during learning, the algorithm also requires upper and lower bounds on the step size.
Disadvantage: this method is suited to full-batch training and does not apply to mini-batches, so it is rarely used in practice.
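A minimal sketch of Rprop on a full-batch toy problem, matching the sign-based acceleration/deceleration described above; the data, model, and values are assumptions for illustration.

import torch
import torch.nn as nn

# Rprop adapts each weight's step size from the sign of its gradient,
# so it expects full-batch gradients rather than noisy mini-batches.
x = torch.randn(128, 10)   # the entire toy dataset in one batch
y = torch.randn(128, 1)
model = nn.Linear(10, 1)
optimizer = torch.optim.Rprop(model.parameters(), lr=0.01,
                              etas=(0.5, 1.2), step_sizes=(1e-6, 50))

for _ in range(20):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()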
6. RMSprop (Root Mean Square Propagation)
torch.optim.RMSprop(params, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False)
Parameters:
--params (iterable) – an iterable of parameters to optimize, or a dict defining parameter groups
--lr (float, optional) – learning rate (default: 1e-2)
--momentum (float, optional) – momentum factor (default: 0)
--alpha (float, optional) – smoothing constant (default: 0.99)
--eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
--centered (bool, optional) – if True, compute the centered RMSprop, where the gradient is normalized by an estimate of its variance
--weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
RMSprop is an improved version of Rprop, and also of Adagrad.
Idea: for parameters whose gradients oscillate strongly, slow down their updates; for parameters with small oscillations, speed their updates up.
RMSprop uses the root mean square as the denominator, which alleviates Adagrad's problem of the learning rate dropping too quickly. It works well for RNNs.
Advantages: alleviates Adagrad's rapid learning-rate decay, and the root-mean-square term reduces oscillation; suitable for non-stationary objectives and works very well for RNNs.
Disadvantages: still depends on a global learning rate.
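Since the text above notes that RMSprop works well for recurrent networks, here is a minimal sketch training a small GRU with it; the model, data, and values are illustrative assumptions.

import torch
import torch.nn as nn

model = nn.GRU(input_size=8, hidden_size=16, batch_first=True)
# alpha is the smoothing constant of the squared-gradient running average.
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01, alpha=0.99, momentum=0.9)

out, _ = model(torch.randn(4, 12, 8))   # (batch, seq_len, features)
loss = out.pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()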
7. Adam (with optional AMSGrad variant)
torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)
Parameters:
--params (iterable) – an iterable of parameters to optimize, or a dict defining parameter groups
--lr (float, optional) – learning rate (default: 1e-3)
--betas (Tuple[float, float], optional) – coefficients used for computing the running averages of the gradient and its square (default: (0.9, 0.999))
--eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
--weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
Adam combines the Momentum and RMSprop algorithms: momentum is used to accumulate gradients, which makes convergence faster and oscillations smaller, and bias correction is applied.
Advantages:
1. There is no stationarity requirement on the objective function, i.e. the loss function may change over time.
2. Parameter updates are unaffected by rescaling of the gradient.
3. The update step size is independent of the gradient magnitude and depends only on alpha, beta_1, and beta_2, which determine its theoretical upper bound.
4. The update step size can be kept within a rough range (the initial learning rate).
5. It handles noisy samples well and naturally implements step-size annealing (automatic learning-rate adjustment).
6. It is well suited to large-scale data and parameter settings, unstable objective functions, and sparse or very noisy gradients.
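A short sketch of typical Adam usage; it also shows the "dict defining parameter groups" form of the params argument mentioned above, with different learning rates per layer (the model and values are assumptions for illustration).

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

# Two parameter groups with different learning rates; betas/eps are shared.
optimizer = torch.optim.Adam([
    {"params": model[0].parameters(), "lr": 1e-3},
    {"params": model[2].parameters(), "lr": 1e-4},
], betas=(0.9, 0.999), eps=1e-8)

loss = model(torch.randn(16, 10)).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()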
8. Adamax
torch.optim.Adamax(params, lr=0.002, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)
Parameters:
--params (iterable) – an iterable of parameters to optimize, or a dict defining parameter groups
--lr (float, optional) – learning rate (default: 2e-3)
--betas (Tuple[float, float], optional) – coefficients used for computing the running averages of the gradient and its square
--eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
--weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
Adamax is an improved version of Adam that adds the notion of an upper bound on the learning rate; it is a variant of Adam based on the infinity norm.
Advantage: the learning-rate upper bound gives a simpler range to work with.
9. NAdam
torch.optim.NAdam(params, lr=0.002, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, momentum_decay=0.004)
Parameters:
--params (iterable) – an iterable of parameters to optimize, or a dict defining parameter groups
--lr (float, optional) – learning rate (default: 2e-3)
--betas (Tuple[float, float], optional) – coefficients used for computing the running averages of the gradient and its square (default: (0.9, 0.999))
--eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
--weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
--momentum_decay (float, optional) – momentum decay (default: 4e-3)
NAdam is an improved version of Adam, essentially Adam with a Nesterov momentum term. It constrains the learning rate more strongly and also influences the gradient update more directly. In general, wherever RMSprop with momentum or Adam would be used, NAdam can usually achieve better results.
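Since NAdam is described above as a drop-in improvement over Adam, a minimal sketch follows; the toy model is assumed, and torch.optim.NAdam requires a reasonably recent PyTorch release.

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
# Same call pattern as Adam, plus the momentum_decay argument.
optimizer = torch.optim.NAdam(model.parameters(), lr=2e-3,
                              betas=(0.9, 0.999), momentum_decay=4e-3)

loss = model(torch.randn(8, 10)).pow(2).mean()
loss.backward()
optimizer.step()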
10. SparseAdam
torch.optim.SparseAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08)
Parameters:
--params (iterable) – an iterable of parameters to optimize, or a dict defining parameter groups
--lr (float, optional) – learning rate (default: 1e-3)
--betas (Tuple[float, float], optional) – coefficients used for computing the running averages of the gradient and its square (default: (0.9, 0.999))
--eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
SparseAdam is a "stripped-down" version of Adam for sparse tensors. Advantage: it is essentially a special version of Adam dedicated to sparse tensors.
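A minimal sketch of where SparseAdam fits: sparse gradients typically come from an embedding layer created with sparse=True (the sizes and values here are illustrative assumptions).

import torch
import torch.nn as nn

# nn.Embedding(..., sparse=True) produces sparse gradients, which is
# exactly what SparseAdam expects; only the touched rows get updated.
embedding = nn.Embedding(num_embeddings=1000, embedding_dim=16, sparse=True)
optimizer = torch.optim.SparseAdam(embedding.parameters(), lr=1e-3)

ids = torch.randint(0, 1000, (32,))
loss = embedding(ids).sum()
loss.backward()      # sparse gradient on embedding.weight
optimizer.step()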
11. AdamW
torch.optim.AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False)
Parameters:
--params (iterable) – an iterable of parameters to optimize, or a dict defining parameter groups
--lr (float, optional) – learning rate (default: 1e-3)
--betas (Tuple[float, float], optional) – coefficients used for computing the running averages of the gradient and its square (default: (0.9, 0.999))
--eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
--weight_decay (float, optional) – weight decay (L2 penalty) (default: 1e-2)
--amsgrad (bool, optional) – whether to use the AMSGrad variant of this algorithm from the paper "On the Convergence of Adam and Beyond" (default: False)
AdamW is an evolution of Adam and is currently one of the fastest ways to train a neural network.
Advantage: converges faster than Adam.
Disadvantage: for a time it was used mainly in fastai and lacked broad framework support, and its benefits are still debated.
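A brief sketch of the practical difference: Adam folds weight_decay into the gradient as an L2 term, while AdamW decays the weights directly (decoupled weight decay). The toy model and values below are assumptions for illustration.

import torch
import torch.nn as nn

model = nn.Linear(10, 1)

# Same nominal weight_decay as Adam would take, but AdamW shrinks the
# weights directly instead of adding an L2 term to the gradient.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

loss = model(torch.randn(16, 10)).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()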
12. L-BFGS
(Limited-memory Broyden–Fletcher–Goldfarb–Shanno)
torch.optim.LBFGS(params, lr=1, max_iter=20, max_eval=None, tolerance_grad=1e-05, tolerance_change=1e-09, history_size=100, line_search_fn=None)
Parameters:
--lr (float) – learning rate (default: 1)
--max_iter (int) – maximum number of iterations per optimization step (default: 20)
--max_eval (int) – maximum number of function evaluations per optimization step (default: max_iter * 1.25)
--tolerance_grad (float) – termination tolerance on first-order optimality (default: 1e-5)
--tolerance_change (float) – termination tolerance on function value / parameter changes (default: 1e-9)
--history_size (int) – update history size (default: 100)
L-BFGS is an algorithm based on Newton's method. Simply put, it does the same job as gradient descent and SGD, but in most cases it converges faster.
L-BFGS is an improvement on BFGS whose defining feature is reduced memory use.
It is one of the most commonly used methods for solving unconstrained nonlinear programming problems.
Warning:
This optimizer does not support per-parameter options or parameter groups (there can only be a single group).
Currently all parameters have to be on a single device. This will be improved in the future.
Note:
This is a very memory-intensive optimizer (it requires an additional param_bytes * (history_size + 1) bytes). If memory is insufficient, try reducing the history size or use a different algorithm.
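Unlike the other optimizers in this list, LBFGS re-evaluates the loss several times per step, so step() must be given a closure that recomputes the forward and backward pass. A minimal full-batch sketch (toy data and values assumed):

import torch
import torch.nn as nn

x = torch.randn(100, 10)   # full batch, as L-BFGS prefers
y = torch.randn(100, 1)
model = nn.Linear(10, 1)
optimizer = torch.optim.LBFGS(model.parameters(), lr=1.0,
                              max_iter=20, history_size=100)

def closure():
    # step() calls this repeatedly, so it must redo forward + backward.
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    return loss

for _ in range(10):
    optimizer.step(closure)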
13. RAdam
torch.optim.RAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)
Parameters:
--params (iterable) – an iterable of parameters to optimize, or a dict defining parameter groups
--lr (float, optional) – learning rate (default: 1e-3)
--betas (Tuple[float, float], optional) – coefficients used for computing the running averages of the gradient and its square (default: (0.9, 0.999))
--eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
--weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)