Summary of gradient descent optimizer (rmsprop, momentum, Adam)

2022-06-30 15:35:00 【Zi Yan Ruoshui】

Recommended links:

Gradient descent optimizer visualization, RMSprop - search results - Zhihu

PyTorch API: the meaning of the parameters of torch.optim.Adam - Zi Yan Ruoshui's blog - CSDN Blog (weight_decay in Adam)

Main text:

The original gradient descent algorithm:

delta = - learning_rate * gradient

theta += delta

The problem is that the same learning_rate is used in particularly steep regions and in particularly flat regions. If learning_rate is too large, a single iteration in a steep region can overshoot far past the optimum; if learning_rate is too small, learning is very slow in flat regions.
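The trade-off described above can be sketched in a few lines of Python. This is only an illustration: the two one-dimensional objectives (steep_grad, flat_grad), the learning rate of 0.03 and the 20 steps are hypothetical choices that are not part of the original article.

```python
def gradient_descent(grad, theta, learning_rate, steps):
    """Plain gradient descent: theta += -learning_rate * gradient."""
    for _ in range(steps):
        delta = -learning_rate * grad(theta)
        theta += delta
    return theta


# Hypothetical objectives, chosen only to contrast a steep and a flat slope.
def steep_grad(x):
    return 100.0 * x  # gradient of f(x) = 50 * x**2


def flat_grad(x):
    return 0.01 * x  # gradient of f(x) = 0.005 * x**2


# The same learning rate is used for both functions.
steep_result = gradient_descent(steep_grad, 1.0, learning_rate=0.03, steps=20)
flat_result = gradient_descent(flat_grad, 1.0, learning_rate=0.03, steps=20)

# On the steep function every step overshoots the optimum at 0 and the
# iterate grows; on the flat function the iterate barely moves from 1.0.
print(steep_result, flat_result)
```

With these assumed values the steep case diverges while the flat case has hardly made progress, which is the problem a single fixed learning_rate runs into.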

This motivates the RMSProp optimizer:

sum_of_gradient_squared = previous_sum_of_gradient_squared * decay_rate + gradient² * (1 - decay_rate)

delta = -learning_rate * gradient / sqrt(sum_of_gradient_squared)

theta += delta

The advantage of RMSProp is that the learning rate $\eta$ is divided by a quantity $\delta$, the square root of sum_of_gradient_squared (not to be confused with the update delta in the pseudocode above). One component of $\delta$ is the magnitude of the current gradient, that is, the steepness of the current slope. As a result, the steeper the current slope, the smaller the step taken; the gentler the current slope, the larger the step.

The other component of $\delta$ is its value from the previous iteration, weighted by decay_rate. It records how steep the recently traversed slopes have been. It acts as a context value and can, to some extent, predict the steepness over the next several iterations.
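The RMSProp update above can be sketched as runnable Python. This is a minimal sketch, not a reference implementation: the small constant eps (added to avoid division by zero), the default values and the two test gradients are assumptions that do not appear in the original pseudocode.

```python
import math


def rmsprop(grad, theta, learning_rate=0.01, decay_rate=0.9, eps=1e-8, steps=1):
    """RMSProp following the pseudocode above; eps is an added assumption."""
    sum_of_gradient_squared = 0.0
    for _ in range(steps):
        g = grad(theta)
        # Exponentially decayed average of squared gradients.
        sum_of_gradient_squared = (sum_of_gradient_squared * decay_rate
                                   + g * g * (1 - decay_rate))
        # Divide the step by the recent gradient magnitude.
        delta = -learning_rate * g / (math.sqrt(sum_of_gradient_squared) + eps)
        theta += delta
    return theta


# Hypothetical steep and flat objectives, for illustration only.
def steep_grad(x):
    return 100.0 * x


def flat_grad(x):
    return 0.01 * x


steep_step = rmsprop(steep_grad, 1.0, steps=1)
flat_step = rmsprop(flat_grad, 1.0, steps=1)
print(steep_step, flat_step)
```

Under these assumptions the first step has almost the same length on both functions, even though their gradients differ by a factor of 10,000: the division by the recent gradient magnitude cancels the raw scale of the slope.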

The above takes only the history of gradient magnitudes into account. If the historical component is instead the direction and size of the previous move, the method is called the momentum method (momentum).

If you look closely at how v (the variable that holds the previous move) evolves, taking only the previous move's direction (and size) into account is in fact taking all past moves' directions (and sizes) into account, because each previous move already contains the ones before it.
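The momentum formula itself did not survive in this text, so the following is a sketch of the commonly used formulation, stated here as an assumption: v = momentum * v - learning_rate * gradient, followed by theta += v. The function name momentum_step and the coefficient values are hypothetical.

```python
def momentum_step(theta, v, gradient, learning_rate=0.1, momentum=0.9):
    """One update of the (assumed) classical momentum rule."""
    # The new move is the decayed previous move plus the current gradient step.
    v = momentum * v - learning_rate * gradient
    theta = theta + v
    return theta, v


theta, v = 0.0, 0.0
history = []
for gradient in [1.0, 1.0, 1.0]:
    theta, v = momentum_step(theta, v, gradient)
    history.append(v)

# Unrolling the recursion: the third move is a weighted sum of all three
# gradient steps, with older steps decayed by powers of the momentum factor.
unrolled = -0.1 * (0.9 ** 2 * 1.0 + 0.9 * 1.0 + 1.0)
print(history[-1], unrolled)
```

The last move matches the unrolled sum, which illustrates the point above: remembering only the previous move implicitly remembers every earlier move, with exponentially shrinking weight.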

If both the history of gradients and the history of moving directions are taken into account, the result is the Adam optimization method.
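The Adam formulas are missing from this text, so the sketch below follows the standard published form of Adam as an assumption rather than the author's exact notation. Here m is the decayed average of gradients (the momentum-like part) and v is the decayed average of squared gradients (the RMSProp-like part); note that this v is not the same quantity as the v of the momentum method. The bias-correction terms, eps and the test objective are likewise not taken from the original.

```python
import math


def adam(grad, theta, learning_rate=0.001, beta1=0.9, beta2=0.999,
         eps=1e-8, steps=1):
    """Standard Adam update on a scalar parameter (assumed formulation)."""
    m = 0.0  # decayed average of gradients (direction history)
    v = 0.0  # decayed average of squared gradients (magnitude history)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Bias correction compensates for m and v starting at zero.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        theta += -learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return theta


# Hypothetical objective f(x) = x**2, for illustration only.
def quadratic_grad(x):
    return 2.0 * x


one_step = adam(quadratic_grad, 1.0, learning_rate=0.001, steps=1)
ten_steps = adam(quadratic_grad, 1.0, learning_rate=0.01, steps=10)
print(one_step, ten_steps)
```

In this sketch the very first step has a length of roughly learning_rate regardless of the gradient's scale, because the bias-corrected m_hat and sqrt(v_hat) both equal the gradient magnitude at t = 1.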

Copyright notice
This article was written by [Zi Yan Ruoshui]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/02/202202160503330955.html