
Common optimization methods

2022-07-05 05:33:00 Li Junfeng

Preface

In the training of a neural network, the essential task is to find parameters that make the loss function as small as possible. This minimum is hard to find directly because there are so many parameters. Several common optimization methods are used to deal with this problem.

Stochastic gradient descent (SGD)

This is the most classic algorithm and is still in common use. Compared with random search it already performs well, but it still has some shortcomings.
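As a minimal sketch of the update rule $W \leftarrow W - \eta\frac{\partial L}{\partial W}$ (assuming, as in many simple implementations, that parameters and gradients are stored as dictionaries of arrays keyed by name), SGD can be written as:

```python
class SGD:
    """Plain stochastic gradient descent: W <- W - lr * dL/dW."""

    def __init__(self, lr=0.01):
        self.lr = lr  # step size (learning rate)

    def update(self, params, grads):
        # params and grads map each parameter name to an array of the same shape
        for key in params:
            params[key] -= self.lr * grads[key]
```

The same `update(params, grads)` interface is reused in the sketches of the other optimizers below, so any of them can be dropped into the same training loop.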

Shortcomings

  1. The step size has a large influence on the result: if it is too large, training will not converge; if it is too small, training takes too long and may still fail to converge.
  2. For some functions the gradient does not point toward the minimum. For example, for $z=\frac{x^2}{100}+y^2$ the graph is a long, narrow, symmetric "valley" (see the sketch after this list).
  3. Because part of the gradient does not point toward the minimum, the path toward the optimal solution becomes very tortuous, and in relatively "flat" regions the method may not find the optimal solution at all.
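To make point 2 concrete, here is a small sketch (the function and its analytic gradient are written out by hand; the evaluation point $(-7, 2)$ is chosen arbitrarily) showing that the gradient is dominated by the $y$ component and barely points toward the minimum at the origin:

```python
import numpy as np

def f(x, y):
    # The long, narrow "valley" from the example above
    return x**2 / 100 + y**2

def grad_f(x, y):
    # Analytic gradient: (2x/100, 2y)
    return np.array([2 * x / 100, 2 * y])

g = grad_f(-7.0, 2.0)
print(g)                      # [-0.14  4.  ]  -> dominated by the y component
print(g / np.linalg.norm(g))  # almost parallel to the y-axis, not toward (0, 0)
```

Following the negative gradient therefore produces the characteristic zig-zag path across the valley instead of moving along it.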

AdaGrad

Because of this property of the gradient, the direction of each step is hard to improve in some places, but the step size can still be optimized.
A common technique is step-size decay: the initial step size is relatively large, because at that stage we are generally far from the optimal solution and can take larger steps to speed up training. As training proceeds, the step size is reduced, because we are now closer to the optimal solution, and a step that is too large could overshoot it or prevent convergence.
In AdaGrad, the amount of decay is tied to the gradients that have already been seen during training:
$$h \leftarrow h + \frac{\partial L}{\partial W} \cdot \frac{\partial L}{\partial W}$$
$$W \leftarrow W - \eta \frac{1}{\sqrt{h}} \cdot \frac{\partial L}{\partial W}$$
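Under the same dictionary-of-arrays assumption as the SGD sketch above, AdaGrad can be sketched as follows. All operations are applied element-wise per parameter, and the small constant `1e-7` is only there to avoid division by zero; it is not part of the formula above.

```python
import numpy as np

class AdaGrad:
    """AdaGrad: accumulate squared gradients in h and scale the step by 1/sqrt(h)."""

    def __init__(self, lr=0.01):
        self.lr = lr
        self.h = None  # running sum of squared gradients, one array per parameter

    def update(self, params, grads):
        if self.h is None:
            self.h = {key: np.zeros_like(val) for key, val in params.items()}
        for key in params:
            self.h[key] += grads[key] * grads[key]
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)
```

Parameters that have already received large gradients get a smaller effective step size, which is exactly the per-parameter decay described above.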

Momentum

The word means momentum; in physics it is defined through the impulse–momentum relation $F \cdot t = m \cdot v$.
To picture this method more concretely, consider a surface in three-dimensional space with a ball on it that needs to roll down to the lowest point.
For convenience, take the mass of the ball to be 1 and differentiate the relation above with respect to time, which gives $F = m \cdot \frac{dv}{dt}$: the force acting on the ball determines how its velocity changes.
Consider the "forces" acting on the ball: the component of gravity caused by the slope at the current position (the gradient), and friction that opposes the motion (so the velocity decays when there is no gradient).
From this, the velocity and position of the ball at the next moment can be written directly:
$$v \leftarrow \alpha \cdot v - \eta \frac{\partial L}{\partial W}$$
$$W \leftarrow W + v$$
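Here is a sketch of this update under the same dictionary-of-arrays assumption as the earlier sketches; `v` plays the role of the ball's velocity, and `momentum=0.9` corresponds to the friction coefficient $\alpha$:

```python
import numpy as np

class Momentum:
    """Momentum: v <- alpha * v - lr * dL/dW, then W <- W + v."""

    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum  # alpha, the "friction" coefficient
        self.v = None             # velocity, one array per parameter

    def update(self, params, grads):
        if self.v is None:
            self.v = {key: np.zeros_like(val) for key, val in params.items()}
        for key in params:
            self.v[key] = self.momentum * self.v[key] - self.lr * grads[key]
            params[key] += self.v[key]
```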

Advantages

This method handles the problem of the gradient not pointing toward the optimal solution quite well: even when a single gradient does not point toward the optimal solution, as long as there is still velocity toward it, the ball keeps approaching the optimal solution.


Copyright notice
This article was written by [Li Junfeng]. Please keep a link to the original when reposting. Thank you.
https://yzsam.com/2022/186/202207050527288392.html