[Summary of PyTorch Optimizers]
2022-06-10 03:41:00 【Network sky (LUOC)】
Several types of optimizers in PyTorch
References:
1. https://pytorch.org/docs/stable/optim.html
2. https://ptorch.com/docs/1/optim
3. https://www.cntofu.com/book/169/docs/1.0/optim.md
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
- SGD: with momentum (Momentum) it is worth trying; with Nesterov acceleration (NAG) it is usually not worth the trouble
- ASGD: rarely used
- Adagrad: not recommended
- Adadelta: worth trying
- Rprop: not recommended
- RMSprop: recommended
- Adam: highly recommended
- Adamax: highly recommended
- NAdam: highly recommended
- SparseAdam: recommended
- AdamW: highly recommended, though still maturing
- L-BFGS: highly recommended, optional
- RAdam: highly recommended
1. SGD (Stochastic Gradient Descent)
torch.optim.SGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False)
Parameters:
--params (iterable) – an iterable of parameters to optimize, or a dict defining parameter groups
--lr (float) – learning rate
--momentum (float, optional) – momentum factor (default: 0; commonly set to 0.9 or 0.8)
--weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
--dampening (float, optional) – dampening for momentum (default: 0)
--nesterov (bool, optional) – enables Nesterov momentum (default: False)
Advantages: ① converges quickly when using mini-batches.
Disadvantages: ① randomly sampled gradients introduce noise, so the update direction is not always correct; ② cannot escape local optima.
a. SGD with momentum (Momentum):
Enabled by setting a non-zero momentum argument in torch.optim.SGD.
Advantages: speeds up convergence and gives some ability to escape local optima, alleviating the problems of plain SGD to a certain extent. Disadvantages: each update partially keeps the previous update direction, so it still inherits some of SGD's drawbacks.
b. SGD with Nesterov accelerated gradient (NAG):
Can be understood as adding a correction factor to standard momentum.
Advantages: the descent direction of the gradient is more accurate. Disadvantages: little effect on the convergence rate.
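To tie the parameters above together, here is a minimal, self-contained sketch of a training loop using SGD with momentum; the toy model, data, and hyperparameter values are illustrative assumptions, not part of the original article.

import torch
import torch.nn as nn

# Toy regression data and model, purely for illustration.
x = torch.randn(64, 10)
y = torch.randn(64, 1)
model = nn.Linear(10, 1)
criterion = nn.MSELoss()

# SGD with momentum; setting nesterov=True would switch to NAG.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

for epoch in range(5):
    optimizer.zero_grad()          # clear gradients from the previous step
    loss = criterion(model(x), y)  # forward pass
    loss.backward()                # backward pass: compute gradients
    optimizer.step()               # update parameters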
2. ASGD (Averaged Stochastic Gradient Descent)
torch.optim.ASGD(params, lr=0.01, lambd=0.0001, alpha=0.75, t0=1000000.0, weight_decay=0)
--params (iterable) – an iterable of parameters to optimize, or a dict defining parameter groups
--lr (float, optional) – learning rate (default: 1e-2)
--lambd (float, optional) – decay term (default: 1e-4)
--alpha (float, optional) – power for the eta update (default: 0.75)
--t0 (float, optional) – point at which to start averaging (default: 1e6)
--weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
a. SGD with momentum (Momentum);
b. SGD with Nesterov accelerated gradient (NAG).
The advantages and disadvantages are similar to those of SGD above.
3. Adagrad (Adaptive Gradient)
torch.optim.Adagrad(params, lr=0.01, lr_decay=0, weight_decay=0, initial_accumulator_value=0, eps=1e-10)
Parameters:
--params (iterable) – an iterable of parameters to optimize, or a dict defining parameter groups
--lr (float, optional) – learning rate (default: 1e-2)
--lr_decay (float, optional) – learning rate decay (default: 0)
--weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
Adagrad adapts the learning rate independently for each model parameter: the larger the accumulated gradient, the lower the learning rate; the smaller the accumulated gradient, the higher the learning rate. It is well suited to sparse or unevenly distributed data.
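A minimal constructor-and-step sketch; the toy model and values are illustrative assumptions, not from the original text.

import torch
import torch.nn as nn

model = nn.Linear(10, 2)
# Each parameter accumulates its squared gradients, so its effective
# learning rate shrinks over time; this pairs well with sparse features.
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01, lr_decay=0, weight_decay=0)

loss = model(torch.randn(8, 10)).pow(2).mean()
loss.backward()
optimizer.step()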
4. Adadelta
torch.optim.Adadelta(params, lr=1.0, rho=0.9, eps=1e-06, weight_decay=0)
Parameters:
--params (iterable) – an iterable of parameters to optimize, or a dict defining parameter groups
--rho (float, optional) – coefficient used for computing the running average of squared gradients (default: 0.9)
--eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-6)
--lr (float, optional) – coefficient that scales the delta before it is applied to the parameters (default: 1.0)
--weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
Adadelta is an improved version of Adagrad: it adaptively constrains the learning rate while simplifying the computation, accelerates well, and trains quickly.
Advantages: avoids the very low learning rate of late training; in the early and middle stages it accelerates well and trains fast.
Disadvantages: the initial learning rate still has to be specified manually, and if the initial gradients are large the learning rate stays very small for the whole run; late in training, the model can jitter repeatedly around a local minimum, which lengthens training time.
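A short usage sketch: because Adadelta rescales each update from its own running averages, lr mostly acts as a final scale factor and the default of 1.0 is usually kept (toy model and values assumed for illustration).

import torch
import torch.nn as nn

model = nn.Linear(10, 2)
# rho controls the running average of squared gradients; lr only scales the delta.
optimizer = torch.optim.Adadelta(model.parameters(), rho=0.9, eps=1e-6)

loss = model(torch.randn(8, 10)).pow(2).mean()
loss.backward()
optimizer.step()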
5. Rprop (Resilient Backpropagation)
torch.optim.Rprop(params, lr=0.01, etas=(0.5, 1.2), step_sizes=(1e-06, 50))
Parameters:
--params (iterable) – an iterable of parameters to optimize, or a dict defining parameter groups
--lr (float, optional) – learning rate (default: 1e-2)
--etas (Tuple[float, float], optional) – a pair (etaminus, etaplus): the multiplicative decrease and increase factors (default: (0.5, 1.2))
--step_sizes (Tuple[float, float], optional) – a pair giving the minimum and maximum allowed step sizes (default: (1e-6, 50))
1. First assign an initial value to each weight's step size, and set an acceleration factor and a deceleration factor for the step-size changes.
2. During the feed-forward iterations of the network, when the sign of the error gradient stays the same across consecutive steps, apply the acceleration strategy to speed up training; when the sign changes, apply the deceleration strategy to stabilize convergence.
3. The network combines the sign of the current error gradient with a variable-step-size BP scheme; to avoid oscillation or overflow during learning, the algorithm also requires upper and lower bounds on the step size.
Disadvantage: this method is suited to full-batch training and does not apply to mini-batches, so it is rarely used in practice.
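A minimal sketch of Rprop on a full-batch toy problem, matching the sign-based acceleration/deceleration described above; the data, model, and values are assumptions for illustration.

import torch
import torch.nn as nn

# Rprop adapts each weight's step size from the sign of its gradient,
# so it expects full-batch gradients rather than noisy mini-batches.
x = torch.randn(128, 10)   # the entire toy dataset in one batch
y = torch.randn(128, 1)
model = nn.Linear(10, 1)
optimizer = torch.optim.Rprop(model.parameters(), lr=0.01,
                              etas=(0.5, 1.2), step_sizes=(1e-6, 50))

for _ in range(20):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()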
6. RMSprop (Root Mean Square Propagation)
torch.optim.RMSprop(params, lr=0.01, alpha=0.99, eps=1e-08, weight_decay=0, momentum=0, centered=False)
Parameters:
--params (iterable) – an iterable of parameters to optimize, or a dict defining parameter groups
--lr (float, optional) – learning rate (default: 1e-2)
--momentum (float, optional) – momentum factor (default: 0)
--alpha (float, optional) – smoothing constant (default: 0.99)
--eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
--centered (bool, optional) – if True, compute the centered RMSprop, where the gradient is normalized by an estimate of its variance
--weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
RMSprop is an improved version of Rprop, and also of Adagrad.
Idea: for parameters whose gradients oscillate strongly, slow down their updates; for parameters with small oscillations, speed their updates up.
RMSprop uses the root mean square as the denominator, which alleviates Adagrad's problem of the learning rate dropping too quickly. It works well for RNNs.
Advantages: alleviates Adagrad's rapid learning-rate decay, and the root-mean-square term reduces oscillation; suitable for non-stationary objectives and works very well for RNNs.
Disadvantages: still depends on a global learning rate.
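Since the text above notes that RMSprop works well for recurrent networks, here is a minimal sketch training a small GRU with it; the model, data, and values are illustrative assumptions.

import torch
import torch.nn as nn

model = nn.GRU(input_size=8, hidden_size=16, batch_first=True)
# alpha is the smoothing constant of the squared-gradient running average.
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.01, alpha=0.99, momentum=0.9)

out, _ = model(torch.randn(4, 12, 8))   # (batch, seq_len, features)
loss = out.pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()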
7. Adam (with optional AMSGrad variant)
torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)
Parameters:
--params (iterable) – an iterable of parameters to optimize, or a dict defining parameter groups
--lr (float, optional) – learning rate (default: 1e-3)
--betas (Tuple[float, float], optional) – coefficients used for computing the running averages of the gradient and its square (default: (0.9, 0.999))
--eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
--weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
Adam combines the Momentum and RMSprop algorithms: momentum is used to accumulate gradients, which makes convergence faster and oscillations smaller, and bias correction is applied.
Advantages:
1. There is no stationarity requirement on the objective function, i.e. the loss function may change over time.
2. Parameter updates are unaffected by rescaling of the gradient.
3. The update step size is independent of the gradient magnitude and depends only on alpha, beta_1, and beta_2, which determine its theoretical upper bound.
4. The update step size can be kept within a rough range (the initial learning rate).
5. It handles noisy samples well and naturally implements step-size annealing (automatic learning-rate adjustment).
6. It is well suited to large-scale data and parameter settings, unstable objective functions, and sparse or very noisy gradients.
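A short sketch of typical Adam usage; it also shows the "dict defining parameter groups" form of the params argument mentioned above, with different learning rates per layer (the model and values are assumptions for illustration).

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

# Two parameter groups with different learning rates; betas/eps are shared.
optimizer = torch.optim.Adam([
    {"params": model[0].parameters(), "lr": 1e-3},
    {"params": model[2].parameters(), "lr": 1e-4},
], betas=(0.9, 0.999), eps=1e-8)

loss = model(torch.randn(16, 10)).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()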
8. Adamax
torch.optim.Adamax(params, lr=0.002, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)
Parameters:
--params (iterable) – an iterable of parameters to optimize, or a dict defining parameter groups
--lr (float, optional) – learning rate (default: 2e-3)
--betas (Tuple[float, float], optional) – coefficients used for computing the running averages of the gradient and its square
--eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
--weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
Adamax is an improved version of Adam that adds the notion of an upper bound on the learning rate; it is a variant of Adam based on the infinity norm.
Advantage: the learning-rate upper bound gives a simpler range to work with.
9. NAdam
torch.optim.NAdam(params, lr=0.002, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, momentum_decay=0.004)
Parameters:
--params (iterable) – an iterable of parameters to optimize, or a dict defining parameter groups
--lr (float, optional) – learning rate (default: 2e-3)
--betas (Tuple[float, float], optional) – coefficients used for computing the running averages of the gradient and its square (default: (0.9, 0.999))
--eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
--weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)
--momentum_decay (float, optional) – momentum decay (default: 4e-3)
NAdam is an improved version of Adam, essentially Adam with a Nesterov momentum term. It constrains the learning rate more strongly and also influences the gradient update more directly. In general, wherever RMSprop with momentum or Adam would be used, NAdam can usually achieve better results.
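Since NAdam is described above as a drop-in improvement over Adam, a minimal sketch follows; the toy model is assumed, and torch.optim.NAdam requires a reasonably recent PyTorch release.

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
# Same call pattern as Adam, plus the momentum_decay argument.
optimizer = torch.optim.NAdam(model.parameters(), lr=2e-3,
                              betas=(0.9, 0.999), momentum_decay=4e-3)

loss = model(torch.randn(8, 10)).pow(2).mean()
loss.backward()
optimizer.step()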
10. SparseAdam
torch.optim.SparseAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08)
Parameters:
--params (iterable) – an iterable of parameters to optimize, or a dict defining parameter groups
--lr (float, optional) – learning rate (default: 1e-3)
--betas (Tuple[float, float], optional) – coefficients used for computing the running averages of the gradient and its square (default: (0.9, 0.999))
--eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
SparseAdam is a "stripped-down" version of Adam for sparse tensors. Advantage: it is essentially a special version of Adam dedicated to sparse tensors.
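A minimal sketch of where SparseAdam fits: sparse gradients typically come from an embedding layer created with sparse=True (the sizes and values here are illustrative assumptions).

import torch
import torch.nn as nn

# nn.Embedding(..., sparse=True) produces sparse gradients, which is
# exactly what SparseAdam expects; only the touched rows get updated.
embedding = nn.Embedding(num_embeddings=1000, embedding_dim=16, sparse=True)
optimizer = torch.optim.SparseAdam(embedding.parameters(), lr=1e-3)

ids = torch.randint(0, 1000, (32,))
loss = embedding(ids).sum()
loss.backward()      # sparse gradient on embedding.weight
optimizer.step()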
11. AdamW
torch.optim.AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False)
Parameters:
--params (iterable) – an iterable of parameters to optimize, or a dict defining parameter groups
--lr (float, optional) – learning rate (default: 1e-3)
--betas (Tuple[float, float], optional) – coefficients used for computing the running averages of the gradient and its square (default: (0.9, 0.999))
--eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
--weight_decay (float, optional) – weight decay (L2 penalty) (default: 1e-2)
--amsgrad (bool, optional) – whether to use the AMSGrad variant of this algorithm from the paper "On the Convergence of Adam and Beyond" (default: False)
AdamW is an evolution of Adam and is currently one of the fastest ways to train a neural network.
Advantage: converges faster than Adam.
Disadvantage: for a time it was used mainly in fastai and lacked broad framework support, and its benefits are still debated.
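A brief sketch of the practical difference: Adam folds weight_decay into the gradient as an L2 term, while AdamW decays the weights directly (decoupled weight decay). The toy model and values below are assumptions for illustration.

import torch
import torch.nn as nn

model = nn.Linear(10, 1)

# Same nominal weight_decay as Adam would take, but AdamW shrinks the
# weights directly instead of adding an L2 term to the gradient.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

loss = model(torch.randn(16, 10)).pow(2).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()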
12. L-BFGS
(Limited-memory Broyden–Fletcher–Goldfarb–Shanno)
torch.optim.LBFGS(params, lr=1, max_iter=20, max_eval=None, tolerance_grad=1e-05, tolerance_change=1e-09, history_size=100, line_search_fn=None)
Parameters:
--lr (float) – learning rate (default: 1)
--max_iter (int) – maximum number of iterations per optimization step (default: 20)
--max_eval (int) – maximum number of function evaluations per optimization step (default: max_iter * 1.25)
--tolerance_grad (float) – termination tolerance on first-order optimality (default: 1e-5)
--tolerance_change (float) – termination tolerance on function value / parameter changes (default: 1e-9)
--history_size (int) – update history size (default: 100)
L-BFGS is an algorithm based on Newton's method. Simply put, it does the same job as gradient descent and SGD, but in most cases it converges faster.
L-BFGS is an improvement on BFGS whose defining feature is reduced memory use.
It is one of the most commonly used methods for solving unconstrained nonlinear programming problems.
Warning:
This optimizer does not support per-parameter options or parameter groups (there can only be a single group).
Currently all parameters have to be on a single device. This will be improved in the future.
Note:
This is a very memory-intensive optimizer (it requires an additional param_bytes * (history_size + 1) bytes). If memory is insufficient, try reducing the history size or use a different algorithm.
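Unlike the other optimizers in this list, LBFGS re-evaluates the loss several times per step, so step() must be given a closure that recomputes the forward and backward pass. A minimal full-batch sketch (toy data and values assumed):

import torch
import torch.nn as nn

x = torch.randn(100, 10)   # full batch, as L-BFGS prefers
y = torch.randn(100, 1)
model = nn.Linear(10, 1)
optimizer = torch.optim.LBFGS(model.parameters(), lr=1.0,
                              max_iter=20, history_size=100)

def closure():
    # step() calls this repeatedly, so it must redo forward + backward.
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    return loss

for _ in range(10):
    optimizer.step(closure)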
13. RAdam
torch.optim.RAdam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0)
Parameters:
--params (iterable) – an iterable of parameters to optimize, or a dict defining parameter groups
--lr (float, optional) – learning rate (default: 1e-3)
--betas (Tuple[float, float], optional) – coefficients used for computing the running averages of the gradient and its square (default: (0.9, 0.999))
--eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
--weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)