
A Summary of Common Deep Learning Optimizers


1. Background

Choosing a good optimizer for a machine learning project is not an easy task. Popular deep learning libraries such as PyTorch and TensorFlow offer a wide range of optimizers, each with its own advantages and disadvantages, and picking an unsuitable one can have a significant negative impact on the project. This makes optimizer selection a key step in building, testing, and deploying machine learning models.

2. Common Optimizers

Throughout this article, w denotes the parameters, g the gradient, α the global learning rate of each optimizer, t the time step, and ε a small constant added for numerical stability.

2.1 SGD (Stochastic Gradient Descent)

w_{t+1}=w_t-\alpha\cdot g_t

In stochastic gradient descent (SGD), the optimizer estimates the direction of steepest descent from a mini-batch and takes a step in that direction. Because the step size is fixed, SGD can quickly get stuck on plateaus or in local minima.
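To make the update rule concrete, here is a minimal NumPy sketch (the function name, defaults, and example values are illustrative, not from any library):

```python
import numpy as np

def sgd_step(w, g, lr=0.01):
    """One SGD step: move against the gradient with a fixed step size."""
    return w - lr * g

w = np.array([1.0, -2.0])
g = np.array([0.5, 0.1])
w = sgd_step(w, g)  # -> array([ 0.995, -2.001])
```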

2.2 SGD with Momentum

v_{t+1}=\beta\cdot v_t+g_t
w_{t+1}=w_t-\alpha\cdot v_{t+1}

where β < 1. With momentum, SGD accelerates in directions of consistent descent (which is why the approach is also called the "heavy ball method"). This acceleration helps the model escape plateaus and makes it less likely to get stuck in local minima.
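A minimal NumPy sketch of the two-line update above (names and defaults are illustrative):

```python
import numpy as np

def momentum_step(w, v, g, lr=0.01, beta=0.9):
    """One heavy-ball step: v accumulates past gradients, w moves along v."""
    v = beta * v + g   # velocity builds up along consistent descent directions
    w = w - lr * v
    return w, v
```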

2.3 AdaGrad

G_t=\sum^t_{i=1}g_i\cdot g_i^T
w_{t+1}=w_t-\alpha\cdot \mathrm{diag}(G_t)^{-\frac{1}{2}}\cdot g_t

AdaGrad was one of the first successful methods to use an adaptive learning rate. It scales each parameter's learning rate by the inverse square root of that parameter's accumulated squared gradients. This amplifies sparse gradient directions, allowing larger updates along them; as a result, AdaGrad tends to converge faster in scenarios with sparse features.
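A minimal NumPy sketch of the element-wise form of the update (equivalent to using the diagonal of G_t; the small eps term is an assumption added for numerical stability, as implementations typically do):

```python
import numpy as np

def adagrad_step(w, G, g, lr=0.01, eps=1e-10):
    """One AdaGrad step: each parameter's learning rate shrinks with its
    accumulated squared gradient (the diagonal of G_t in the formula)."""
    G = G + g * g                        # running sum of squared gradients
    w = w - lr * g / (np.sqrt(G) + eps)  # eps guards against division by zero
    return w, G
```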

2.4 RMSprop

v_{t+1}=\beta\cdot v_t+(1-\beta)\cdot g_t^2
w_{t+1}=w_t-\frac{\alpha}{\sqrt{v_{t+1}}+\epsilon}\cdot g_t

The idea behind RMSprop is similar to AdaGrad, but the rescaling of the gradient is less aggressive: the sum of squared gradients is replaced by a moving average of squared gradients. RMSprop is usually used together with momentum, and it can be understood as an adaptation of Rprop to the mini-batch setting.
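A minimal NumPy sketch of the update (momentum omitted; names and defaults are illustrative):

```python
import numpy as np

def rmsprop_step(w, v, g, lr=0.001, beta=0.9, eps=1e-8):
    """One RMSprop step: a moving average of squared gradients replaces
    AdaGrad's ever-growing sum, so the effective step size can recover."""
    v = beta * v + (1 - beta) * g * g
    w = w - lr * g / (np.sqrt(v) + eps)
    return w, v
```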

2.5 Adam

m_{t+1}=\beta_1\cdot m_t+(1-\beta_1)\cdot g_t
v_{t+1}=\beta_2\cdot v_t+(1-\beta_2)\cdot g_t^2
\hat{m}_{t+1}=\frac{m_{t+1}}{1-\beta_1^{t+1}},\quad \hat{v}_{t+1}=\frac{v_{t+1}}{1-\beta_2^{t+1}}
w_{t+1}=w_t-\alpha\cdot\left(\frac{\hat{m}_{t+1}}{\sqrt{\hat{v}_{t+1}}+\epsilon}+\lambda\cdot w_t\right)

Adam combines AdaGrad and RMSprop with the momentum method. The direction of the next step is determined by the moving average of the gradient, and the step size is roughly bounded by the global learning rate; λ in the last equation denotes an optional weight-decay coefficient. In addition, similar to RMSprop, Adam rescales each dimension of the gradient.

One of the main differences between Adam and RMSprop (or AdaGrad) is that the moment estimates m and v are corrected for their bias toward zero. Adam is known for achieving good performance with only a small amount of hyperparameter tuning.
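A minimal NumPy sketch of the update, with the weight-decay term λ omitted for brevity (names and defaults are illustrative):

```python
import numpy as np

def adam_step(w, m, v, g, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step (t counts from 1): momentum plus per-dimension
    rescaling, with bias correction of the moment estimates m and v."""
    m = beta1 * m + (1 - beta1) * g       # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * g * g   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)          # bias correction for m
    v_hat = v / (1 - beta2 ** t)          # bias correction for v
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```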

2.6 AdamW

Loshchilov and Hutter identified that L2 regularization and weight decay are not equivalent in adaptive gradient methods, and hypothesized that this inequivalence limits Adam's performance. They therefore proposed decoupling the weight decay from the learning rate. Experiments show that AdamW generalizes better than Adam (narrowing the gap with SGD with momentum), and that the range of well-performing hyperparameters is broader for AdamW.
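A minimal NumPy sketch of the decoupled variant (names and defaults are illustrative): the only change relative to the Adam sketch above is that the decay term is applied directly to the weights rather than being folded into the gradient.

```python
import numpy as np

def adamw_step(w, m, v, g, t, lr=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=0.01):
    """One AdamW step: as Adam, but the decay wd * w acts directly on
    the weights instead of entering the moment estimates m and v."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
    return w, m, v
```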

2.7 LARS

LARS is an extension of SGD with momentum that adapts a learning rate per layer. It has recently attracted attention in the research community: the steady growth of available data has made distributed training of machine learning models increasingly popular, and batch sizes have grown as a result, which can make training unstable. Researchers (You et al.) attribute these instabilities to an imbalance between the norm of the gradient and the norm of the weights in certain layers. They therefore proposed an optimizer that rescales each layer's learning rate based on a "trust" parameter η < 1 and the inverse norm of the layer's gradient.
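A simplified NumPy sketch of the per-layer rescaling idea for a single layer, with weight decay omitted (names, defaults, and the exact momentum form are assumptions; the full algorithm in You et al. also folds weight decay into the trust ratio):

```python
import numpy as np

def lars_step(w, v, g, lr=0.01, beta=0.9, trust=0.001):
    """One simplified LARS step for one layer: the global learning rate
    is rescaled by trust * ||w|| / ||g|| before the momentum update."""
    local_lr = trust * np.linalg.norm(w) / (np.linalg.norm(g) + 1e-12)
    v = beta * v + local_lr * lr * g   # momentum on the rescaled gradient
    w = w - v
    return w, v
```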

2.8 FTRL (Follow The Regularized Leader)

w_{t+1}=\arg\min_w\left(g_{1:t}\cdot w+\frac{1}{2}\sum_{s=1}^{t}\sigma_s\|w-w_s\|_2^2+\lambda_1\|w\|_1\right)

FTRL is mainly used for online training of CTR (click-through rate) prediction models, where very high-dimensional inputs yield a large number of sparse features. Sparse model parameters are usually desirable, but a plain L1 penalty does not produce truly sparse solutions in the online setting; methods such as truncated gradient (TG) were proposed to address this, and FTRL is an online learning method that achieves both accuracy and sparsity.

The basic idea behind FTRL's sparsity is to set weights whose accumulated updates stay close to 0 exactly to zero, so that they can be skipped in computation, reducing both model size and the amount of computation.

The FTRL-Proximal pseudocode has four adjustable parameters: α, β, λ_1, and λ_2.
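Below is a minimal NumPy sketch of the per-coordinate FTRL-Proximal update, following the algorithm of McMahan et al. (2013); the function name and default values are illustrative:

```python
import numpy as np

def ftrl_step(w, z, n, g, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
    """One FTRL-Proximal step over all coordinates; alpha, beta, l1, l2
    are the four tunable parameters."""
    sigma = (np.sqrt(n + g * g) - np.sqrt(n)) / alpha
    z = z + g - sigma * w       # accumulated adjusted gradients
    n = n + g * g               # accumulated squared gradients
    # Closed-form solution of the arg-min: coordinates whose accumulated
    # |z| stays below l1 are set exactly to zero, which yields sparsity.
    w = np.where(np.abs(z) <= l1, 0.0,
                 -(z - np.sign(z) * l1) / ((beta + np.sqrt(n)) / alpha + l2))
    return w, z, n
```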


3. Summary

If the data is sparse, use an adaptive method: AdaGrad, Adadelta, RMSprop, or Adam.

RMSprop, Adadelta, and Adam behave similarly in many cases.

Adam adds bias correction and momentum on top of RMSprop, so as gradients become sparse, Adam tends to work better than RMSprop.

Overall, Adam is the best choice.

Many papers use vanilla SGD without momentum or other extensions. SGD can reach a minimum, but it usually takes longer than the other algorithms and may get trapped at saddle points.

If faster convergence is needed, or when training deeper and more complex neural networks, an adaptive algorithm is the better choice.
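For practical use, all of the optimizers discussed above except LARS and FTRL ship with PyTorch's torch.optim. A minimal sketch of how they are instantiated (the model and hyperparameter values are placeholders; in practice you would create only the one you need):

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder model for illustration

sgd      = torch.optim.SGD(model.parameters(), lr=0.01)
momentum = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
adagrad  = torch.optim.Adagrad(model.parameters(), lr=0.01)
rmsprop  = torch.optim.RMSprop(model.parameters(), lr=0.001)
adam     = torch.optim.Adam(model.parameters(), lr=0.001)
adamw    = torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
```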

