
04 Automatically Adjusting the Learning Rate - Learning Notes - Li Hongyi's Deep Learning 2021

2022-06-11 23:16:00 iioSnail

Previous: 03 What If the Gradient Is Small (Local Minima and Saddle Points) - Learning Notes - Li Hongyi's Deep Learning 2021

Next: 05 Classification - Learning Notes - Li Hongyi's Deep Learning 2021

Contents of this section and related links

Common strategies for automatically adjusting the Learning Rate

Class notes

When training gets stuck at a bottleneck, it is not necessarily because the gradient is too small. It may be that the learning rate is too high, causing the parameters to oscillate back and forth between the walls of a valley so that the minimum can never be reached.

The corresponding plot of the gradient over training is shown in the lecture (figure omitted): the $x$-axis is the number of updates, and the $y$-axis is the gradient magnitude.


Based on the number of iterations, the current gradient, and other factors, the Learning Rate is adjusted automatically. The update formula for $\theta$ becomes: $\theta_i^{t+1} \leftarrow \theta_i^t - \frac{\eta}{\sigma_i^t} g_i^t$
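As a minimal sketch of this update rule (not code from the lecture), the step below applies $\theta_i^{t+1} \leftarrow \theta_i^t - \frac{\eta}{\sigma_i^t} g_i^t$ elementwise to NumPy arrays; the numbers and the fixed `sigma` values are placeholders, since how $\sigma$ is computed is exactly what the strategies below differ in.

```python
import numpy as np

def adaptive_step(theta, grad, sigma, eta=0.01):
    # theta_i^{t+1} <- theta_i^t - (eta / sigma_i^t) * g_i^t, applied elementwise
    return theta - eta / sigma * grad

# example: three parameters, each with its own sigma (placeholder values)
theta = np.array([0.5, -1.2, 3.0])
grad = np.array([0.1, -0.4, 0.02])
sigma = np.array([0.2, 0.8, 0.05])
theta = adaptive_step(theta, grad, sigma)
```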

All of the Learning Rate adjustments below are achieved by adjusting $\sigma$.

Common adjustment strategies are:

  • Root Mean Square: consider the current gradient and all past gradients equally
  • RMSProp: weight the current gradient more heavily and past gradients less
  • Adam: combines RMSProp and Momentum
  • Learning Rate Decay: as the number of updates increases, we get closer to the target, so the Learning Rate is made smaller and smaller
  • Warm Up: the Learning Rate starts small, increases as the number of iterations grows, and then, after a certain point, decreases again as iterations continue (figure omitted; see the schedule sketch after this list)
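To make Learning Rate Decay and Warm Up concrete, here is a small sketch of two schedules as plain functions of the update step. The specific shapes (exponential decay; linear ramp-up followed by cosine decay) and all constants are illustrative assumptions, not the exact schedules from the lecture.

```python
import math

def decayed_lr(step, base_lr=1e-3, decay_rate=0.96, decay_every=1000):
    # Learning Rate Decay: shrink the learning rate as the number of updates grows
    return base_lr * decay_rate ** (step / decay_every)

def warmup_lr(step, base_lr=1e-3, warmup_steps=1000, total_steps=10000):
    # Warm Up: small at first, ramp up, then decay again later
    if step < warmup_steps:
        return base_lr * step / warmup_steps                      # linear ramp-up
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))   # cosine decay
```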

The Root Mean Square formula is: $\sigma_i^t = \sqrt{\frac{1}{t+1} \sum_{k=0}^{t} \left(g_i^k\right)^2}$, i.e. the average is taken over all gradients of parameter $i$ from step $0$ up to the current step $t$.
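A minimal NumPy sketch of this accumulation, assuming the gradient of each step is kept in a list; the small `eps` term is an added assumption to avoid division by zero.

```python
import numpy as np

def rms_sigma(grad_history, eps=1e-8):
    # sigma_i^t = sqrt( (1/(t+1)) * sum_{k=0}^{t} (g_i^k)^2 ), per parameter
    g = np.stack(grad_history)               # shape: (t+1, num_params)
    return np.sqrt(np.mean(g ** 2, axis=0)) + eps

# sketch of use inside a training loop:
#   grad_history.append(grad)
#   sigma = rms_sigma(grad_history)
#   theta = theta - eta / sigma * grad
```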


The RMSProp formula is: $\sigma_i^t = \sqrt{\alpha \left(\sigma_i^{t-1}\right)^2 + (1-\alpha)\left(g_i^t\right)^2}$, where $\alpha$ is a hyperparameter to be tuned, with $0 < \alpha < 1$.
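The same idea written as a running update, again as a NumPy sketch; `alpha = 0.9` is only an illustrative default for the hyperparameter above.

```python
import numpy as np

def rmsprop_sigma(prev_sigma, grad, alpha=0.9):
    # sigma_i^t = sqrt( alpha * (sigma_i^{t-1})^2 + (1 - alpha) * (g_i^t)^2 )
    return np.sqrt(alpha * prev_sigma ** 2 + (1 - alpha) * grad ** 2)

# sketch of use: keep a running sigma and refresh it each step
#   sigma = rmsprop_sigma(sigma, grad)
#   theta = theta - eta / (sigma + 1e-8) * grad
```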


For Adam, it is suggested to simply use PyTorch's default parameters.
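Following that suggestion, using Adam in PyTorch is just constructing the optimizer with its defaults (lr=1e-3, betas=(0.9, 0.999), eps=1e-8); the tiny linear model and random data below are placeholders for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 1)                           # hypothetical placeholder model
optimizer = torch.optim.Adam(model.parameters())   # PyTorch defaults

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = F.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```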

Adam's adjustment strategy is shown in the lecture slides (figure omitted).



Copyright notice
This article was written by [iioSnail]. Please include a link to the original article when reposting. Thanks.
https://yzsam.com/2022/03/202203011620128411.html