Introduction to Deep Learning: ANN (Neural Network) Parameter Optimization (SGD, Momentum, AdaGrad, RMSprop, Adam)
2022-07-04 06:57:00 【Mud stick】
1. Parameter optimization
This article summarizes the main methods for updating the parameters of an ANN, together with the mathematical formulas each method relies on.
1.1 Stochastic gradient descent (SGD)
We used stochastic gradient descent (SGD) to update the parameters in the previous chapter. Here we will introduce other ways to optimize the parameters and point out some of SGD's shortcomings.
Let's understand the process of finding the optimal parameters through a story: picture an explorer who, blindfolded and without a map, has to find the deepest valley in a vast terrain.
Under such harsh conditions, the slope of the ground becomes particularly important. Feeling the incline through the soles of your feet and always stepping downhill is exactly the SGD strategy. A brave explorer might believe that by simply repeating this strategy, they will one day reach the "deepest place".
1.1.2 SGD
The stochastic gradient descent update can be written as
    W ← W - η · ∂L/∂W
where W is the weight to be updated, ∂L/∂W is the gradient of the loss function with respect to W, and η is the learning rate.
Let's implement it as a class named SGD:
class SGD:
    def __init__(self, lr=0.01):
        self.lr = lr  # learning rate

    def update(self, params, grads):
        for key in params.keys():
            params[key] -= self.lr * grads[key]
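As a quick sanity check, here is a toy usage example (my own made-up numbers, assuming the SGD class above is defined in the same file): one update step on a dictionary of NumPy parameters and gradients.

import numpy as np

# One SGD step on made-up parameters and gradients.
params = {'W1': np.array([[1.0, 2.0], [3.0, 4.0]]), 'b1': np.array([0.5, -0.5])}
grads  = {'W1': np.full((2, 2), 0.1),               'b1': np.array([0.2, 0.2])}

optimizer = SGD(lr=0.01)
optimizer.update(params, grads)
print(params['W1'])  # every element decreased by lr * grad = 0.001
print(params['b1'])  # [ 0.498 -0.502]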
If you have studied the previous chapters, the SGD class itself is easy to understand: its update method is called over and over to update the parameters.
We can then train a neural network as follows (pseudocode; it will not run as-is):
network = TwoLayerNet(...)
optimizer = SGD()

for i in range(10000):
    ...
    x_batch, t_batch = get_mini_batch(...)  # mini-batch
    grads = network.gradient(x_batch, t_batch)
    params = network.params
    optimizer.update(params, grads)
Here optimizer means "the one who does the optimizing": it is responsible for updating the parameters. With this separation, each module stays very simple: the network only computes gradients, and the optimizer only decides how to apply them.
1.1.3 The shortcomings of SGD
SGD is simple and easy to implement, but it can be inefficient on some problems. As an example, consider finding the minimum of the function f(x, y) = x²/20 + y² (that is, 0.05·x² + y²).
We can visualize this function with matplotlib. The code is as follows:
import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import axes3d  # registers the 3-D projection on older matplotlib versions

def f(x, y):
    return 0.05 * (x ** 2) + y ** 2

# Wireframe plot
fig = plt.figure()
ax = plt.axes(projection='3d')
X = np.linspace(-10, 10)
Y = np.linspace(-10, 10)
print(f(0, 0))  # sanity check: the minimum value at the origin is 0
X, Y = np.meshgrid(X, Y)  # generate the grid-point coordinate matrices
Z = f(X, Y)
plt.xlabel("x")
plt.ylabel("y")
ax.set_zlabel("z")
ax.plot_wireframe(X, Y, Z, color='black')
ax.set_title('wireframe')

# Contour plot
fig2 = plt.figure()
level = np.array([0.0, 1, 2, 3])
C = plt.contour(X, Y, Z, levels=level)
plt.clabel(C, inline=True, fontsize=10)
plt.contour(X, Y, Z)  # add the default contour levels on top of the labelled ones
plt.xlabel("x")
plt.ylabel("y")
plt.show()
From the two plots it is easy to see that Z decreases toward 0 along both x and y, but the slope along y is much steeper than along x.
Next, let's visualize the gradient itself. The code is as follows; the function for computing the numerical gradient is the one you know from the previous chapter.
import numpy as np
import matplotlib.pyplot as plt
from deeplearning.fuction import numerical_gradient

def f(x):
    return np.sum(0.05 * x[0] ** 2 + x[1] ** 2)

x = np.arange(-10, 10)
y = np.arange(-5, 6)
X, Y = np.meshgrid(x, y)
X = X.flatten()
Y = Y.flatten()
a = np.array([X, Y], dtype=float)  # use floats so the tiny step h of the numerical gradient is not truncated

grad = numerical_gradient(f, a)

plt.figure()
plt.quiver(X, Y, -grad[0], -grad[1], angles="xy", color="#666666")
plt.xlim([-10, 10])
plt.ylim([-5, 5])
plt.xlabel('x0')
plt.ylabel('x1')
plt.grid()
plt.draw()
plt.show()
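If you do not have that helper module at hand, here is a minimal stand-in for numerical_gradient, consistent with how it is called above (central differences over every element of the input array); this is my own assumption about its behavior, since deeplearning.fuction is the author's own file.

import numpy as np

def numerical_gradient(f, x):
    # Central-difference gradient of f with respect to every element of the float array x.
    h = 1e-4
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        idx = it.multi_index
        tmp = x[idx]
        x[idx] = tmp + h
        fxh1 = f(x)                        # f with this element nudged up
        x[idx] = tmp - h
        fxh2 = f(x)                        # f with this element nudged down
        grad[idx] = (fxh1 - fxh2) / (2 * h)
        x[idx] = tmp                       # restore the original value
        it.iternext()
    return grad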
Next, let's use SGD to update the parameters. First we implement SGD as an optimizer class:

class SGD:
    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, params, grads):
        for key in params.keys():
            params[key] -= self.lr * grads[key]

Save this class in optimizer.py, then run the following script:

import numpy as np
import matplotlib.pyplot as plt
from optimizer import SGD

def f(x, y):
    return x**2 / 20.0 + y**2

def df(x, y):  # write the derivatives directly instead of using a numerical gradient
    return x / 10.0, 2.0 * y

init_pos = (-7.0, 2.0)
params = {}
params['x'], params['y'] = init_pos[0], init_pos[1]
grads = {}
grads['x'], grads['y'] = 0, 0

x_history = []
y_history = []
optimizer = SGD(0.95)

for i in range(30):
    x_history.append(params['x'])
    y_history.append(params['y'])
    grads['x'], grads['y'] = df(params['x'], params['y'])
    optimizer.update(params, grads)

for i in range(len(x_history)):
    print(x_history[i], y_history[i])

x = np.arange(-10, 10, 0.01)
y = np.arange(-5, 5, 0.01)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
mask = Z > 7   # mask the high region so the contours stay readable near the minimum
Z[mask] = 0

plt.plot(x_history, y_history, 'o-', color="red")
plt.contour(X, Y, Z)
plt.ylim(-10, 10)
plt.xlim(-10, 10)
plt.plot(0, 0, '+')
plt.xlabel("x")
plt.ylabel("y")
plt.title("SGD")
plt.show()
In the figure, SGD moves in a zigzag. This is a rather inefficient path. In other words, the drawback of SGD is that when the shape of the function is anisotropic, that is, stretched much more along one axis than the other, the search path becomes very inefficient. The loop in the middle of the code prints the trajectory, so let's look directly at the printed data.
-7.0 2.0
-6.335 -1.7999999999999998
-5.733175 1.6199999999999997
-5.188523375 -1.4579999999999997
-4.695613654375 1.3121999999999998
-4.249530357209375 -1.18098
-3.8458249732744845 1.0628819999999997
-3.4804716008134085 -0.9565937999999994
-3.1498267987361346 0.8609344199999993
-2.8505932528562017 -0.7748409779999994
-2.5797868938348625 0.6973568801999994
-2.3347071389205505 -0.6276211921799995
-2.112909960723098 0.5648590729619996
-1.9121835144544037 -0.5083731656657995
-1.7305260805812355 0.4575358490992195
-1.566126102926018 -0.41178226418929753
-1.4173441231480464 0.37060403777036777
-1.282696431448982 -0.333543633993331
-1.1608402704613288 0.3001892705939978
-1.0505604447675025 -0.270170343534598
-0.9507572025145898 0.2431533091811382
-0.8604352682757038 -0.21883797826302437
-0.778693917789512 0.19695418043672192
-0.7047179955995084 -0.17725876239304972
-0.6377697860175552 0.15953288615374472
-0.5771816563458875 -0.14357959753837024
-0.5223493989930281 0.12922163778453322
-0.4727262060886904 -0.11629947400607987
-0.42781721651026483 0.10466952660547188
-0.3871745809417897 -0.09420257394492468
It is easy to see that y does approach 0 but keeps jumping between positive and negative values, while x only creeps slowly toward 0.
In the vertical direction the gradient is very large, while in the horizontal direction it is relatively small. With lr = 0.95, each step multiplies y by 1 - 0.95·2 = -0.9 (it overshoots and flips sign every time) but multiplies x by only 1 - 0.95·0.1 = 0.905 (it shrinks very slowly), which is exactly the pattern in the printout. So we cannot set the learning rate much larger, or the updates in the vertical direction would start to diverge; yet such a small learning rate makes the horizontal updates far too slow, and overall convergence becomes very slow.
Therefore we need something smarter than SGD, which only ever steps along the current gradient. The root cause of SGD's inefficiency is that the direction of the gradient does not point toward the minimum: at the starting point (-7, 2), for example, the gradient is (-0.7, 4), dominated by the y component and pointing nowhere near the origin.
1.2 The Momentum method
Momentum means "momentum" in the physical sense. Its update rule is
    v ← αv - η · ∂L/∂W
    W ← W + v
The new variable v corresponds to velocity in physics; the update behaves like a ball rolling on the ground. Think about it: the ball speeds up on its way down to the lowest point; after passing the lowest point it climbs the other side more and more slowly, then rolls back down faster and faster, and this back-and-forth repeats until it finally settles at the bottom.
In the formula, α is the momentum coefficient, usually a positive number less than 1, for example 0.9, and v is the velocity carried over from the previous update. The remaining symbols are the familiar ones: η is the learning rate and ∂L/∂W is the gradient.
For example, suppose the lowest point of the bowl described above is at x = 0, with negative x to the left and positive x to the right. Initially x = -3 and v = 0, and suppose learning rate × gradient = -1 (with α = 0.9).
First update: v = 0.9·0 - (-1) = 1, x = x + v = -2.
Second update: suppose learning rate × gradient is still -1: v = 0.9·1 - (-1) = 1.9, x = x + v = -0.1.
Third update: the ball is now almost at the lowest point, so suppose learning rate × gradient ≈ 0: v = 0.9·1.9 = 1.71, x = x + v = 1.61.
Fourth update: the ball has overshot the minimum and the gradient now points the other way; suppose learning rate × gradient = 1: v = 0.9·1.71 - 1 = 0.539, x = x + v = 2.149.
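A few lines of Python reproduce this hand calculation (a minimal sketch; the learning-rate-times-gradient values are simply the ones assumed above, not computed from any real loss):

alpha = 0.9                              # momentum coefficient
x, v = -3.0, 0.0                         # initial position and velocity
lr_times_grad = [-1.0, -1.0, 0.0, 1.0]   # assumed (learning rate * gradient) at each step

for step, g in enumerate(lr_times_grad, start=1):
    v = alpha * v - g    # v <- alpha*v - lr*gradient
    x = x + v            # x <- x + v
    print(step, round(v, 3), round(x, 3))
# prints 1 1.0 -2.0, then 2 1.9 -0.1, then 3 1.71 1.61, then 4 0.539 2.149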
We can see clearly that as long as the direction of movement does not change, v keeps growing, so the update steps get larger and larger; once the direction changes, v shrinks and the steps get smaller. In other words, every update takes the previous velocity into account: how far a parameter moves in each direction depends not only on the current gradient but also on whether past gradients agreed in that direction. If the gradient keeps pointing the same way, the updates grow; if it keeps flipping direction, the updates are damped. This lets us use a larger learning rate and converge faster, while the direction with the large, oscillating gradient has its step size reduced by the momentum. (This explanation is paraphrased from another article that I found very clear.)
Now let's look at the code:

class Momentum:
    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None

    def update(self, params, grads):
        if self.v is None:  # initialize v with the same structure as params, all elements zero
            self.v = {}
            for key, val in params.items():
                self.v[key] = np.zeros_like(val)
        for key in params.keys():
            self.v[key] = self.momentum * self.v[key] - self.lr * grads[key]
            params[key] += self.v[key]
Now let's look at the optimization path with Momentum. The code is the same as the SGD demo; just set optimizer = Momentum(0.1) and change the title to plt.title("Momentum"). The resulting figure (enlarged here so the path is easier to see) shows that Momentum's path is clearly much "smoother" than SGD's zigzag.
1.3 AdaGrad
In neural network training, the learning rate (written η) matters a great deal. If it is too small, training wastes far too much time; conversely, if it is too large, training diverges and cannot proceed correctly.
One effective technique for dealing with the learning rate is learning rate decay: gradually lowering the learning rate as training progresses. Learning "a lot" at first and then gradually "less" is a trick that is used all the time in neural network training.
AdaGrad takes this idea further: it adapts the learning rate of each individual parameter element as training proceeds (the Ada in AdaGrad comes from Adaptive). Its update rule is
    h ← h + ∂L/∂W ⊙ ∂L/∂W
    W ← W - η · (1/√h) ⊙ ∂L/∂W
Compared with SGD there is one extra variable, h (here ⊙ denotes element-wise multiplication of the matrices, not a dot product). h accumulates the sum of squares of all past gradient values, and when the parameters are updated, each element's step is scaled by 1/√h. This means that elements which have already changed a lot (been updated by large amounts) get a smaller learning rate from then on. In other words, the learning rate decays per element: parameters that have moved a lot see their learning rates shrink.
Because AdaGrad keeps accumulating the squares of all past gradients, the further training goes, the smaller the updates become. In fact, if training went on forever, the update amount would approach 0 and the parameters would stop moving entirely. The RMSProp method improves on this: instead of adding up all past gradients with equal weight, it gradually forgets old gradients and gives more weight to the information in recent ones. Technically this is called an exponentially moving average: the influence of past gradients decays exponentially.
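Here is a small numeric illustration of that difference (my own sketch, not from the book): feed a constant gradient into both accumulation rules and watch the effective step size lr/√h.

import numpy as np

# Apply a constant gradient g repeatedly and compare the effective step lr/sqrt(h)
# under AdaGrad's unbounded accumulation and RMSprop's exponentially decayed one.
lr, g, decay = 0.1, 1.0, 0.99
h_ada, h_rms = 0.0, 0.0
for t in range(1, 1001):
    h_ada += g * g                               # AdaGrad: h only ever grows
    h_rms = decay * h_rms + (1 - decay) * g * g  # RMSprop: old squares decay away
    if t in (1, 10, 100, 1000):
        print(t, round(lr * g / (np.sqrt(h_ada) + 1e-7), 4),
                 round(lr * g / (np.sqrt(h_rms) + 1e-7), 4))
# AdaGrad's step keeps shrinking (0.1 -> 0.0316 -> 0.01 -> 0.0032), while RMSprop's
# settles near lr = 0.1 because h converges to g**2.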
Now let's implement AdaGrad in code:
class AdaGrad:
    def __init__(self, lr=0.01):
        self.lr = lr
        self.h = None

    def update(self, params, grads):
        if self.h is None:  # initialize h
            self.h = {}
            for key, val in params.items():
                self.h[key] = np.zeros_like(val)
        for key in params:
            self.h[key] += grads[key] * grads[key]
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)
            # the 1e-7 prevents division by zero
As with Momentum, let's draw the optimization path with the same demo script, swapping in the AdaGrad optimizer and changing the title:
Isn't the effect great? The function value moves toward the minimum efficiently. Because the gradient in the y direction is large, the first steps in y are large, but those large changes are fed back through h and the step size is scaled down accordingly. So the updates in the y direction are damped and the zigzag pattern dies away.
1.4 RMSprop
The book does not cover RMSprop, so I suggest watching Andrew Ng's deeplearning.ai lecture on exponentially weighted averages directly; it takes about half an hour in total and nobody explains it better, so rather than restating the theory I simply recommend his video. Pay attention to the part where he explains that β = 0.9 corresponds roughly to averaging the temperature over the last 10 days while β = 0.98 corresponds to roughly the last 50 days; once you see why, you have the idea. The exponentially weighted average is also called an exponentially moving average, and the "moving" is exactly the point. (On Bilibili the Andrew Ng course is split into many parts; the relevant section, "Exponentially weighted averages", starts around p63.) A small sketch of the idea follows.
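To make the 1/(1 - β) intuition concrete, here is a tiny sketch (my own example, not from the video): the exponentially weighted average v_t = β·v_{t-1} + (1 - β)·θ_t over a made-up daily temperature series.

import numpy as np

# Exponentially weighted average over fake daily temperatures. Roughly, beta = 0.9
# averages over about 1/(1-0.9) = 10 days and beta = 0.98 over about 50 days,
# so the beta = 0.98 curve is smoother but reacts more slowly.
rng = np.random.default_rng(0)
days = np.arange(365)
temps = 15 + 10 * np.sin(2 * np.pi * days / 365) + rng.normal(0, 3, size=365)

def ewa(theta, beta):
    v, out = 0.0, []
    for t in theta:
        v = beta * v + (1 - beta) * t
        out.append(v)
    return np.array(out)

v_fast = ewa(temps, beta=0.9)    # ~10-day effective window, follows the data closely
v_slow = ewa(temps, beta=0.98)   # ~50-day effective window, much smoother but lags
print(temps[100], v_fast[100], v_slow[100])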
Once the theory is clear, the code is a direct modification of AdaGrad; only a small part differs.
class RMSprop:
    """RMSprop"""
    def __init__(self, lr=0.01, decay_rate=0.99):
        self.lr = lr
        self.decay_rate = decay_rate
        self.h = None

    def update(self, params, grads):
        if self.h is None:
            self.h = {}
            for key, val in params.items():
                self.h[key] = np.zeros_like(val)
        for key in params.keys():
            self.h[key] *= self.decay_rate
            self.h[key] += (1 - self.decay_rate) * grads[key] * grads[key]
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)
For the path figure I used lr = 0.1 (the code in the book uses lr = 0.01, but with 0.01 the path stays too far from the center and does not converge enough) and a decay rate of 0.98. The resulting path is a little smoother than the AdaGrad one above.
1.5 Adam
Anyone who has studied computer networks or operating systems knows that algorithm designers love hybrids: combine two algorithms so that the result keeps the advantages of both.
Momentum moves according to the physics of a ball rolling in a bowl, while AdaGrad adapts the update step for each parameter element individually. What if we combine the two methods?
Adam is a method proposed in 2015. Its theory is a bit involved, but intuitively it fuses Momentum and AdaGrad. By combining the strengths of both it aims to search the parameter space efficiently. In addition, performing a "bias correction" is another characteristic of Adam. In Andrew Ng's course, Adam is covered right after RMSprop; I suggest watching that part of the theory first (about 7 minutes) and then coming back to the code.
class Adam:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999):
        self.lr = lr
        self.beta1 = beta1   # the Momentum-side parameter
        self.beta2 = beta2   # the RMSprop-side parameter
        self.iter = 0
        self.m = None        # Momentum-style first moment
        self.v = None        # RMSprop-style second moment

    def update(self, params, grads):
        if self.m is None:
            self.m, self.v = {}, {}
            for key, val in params.items():
                self.m[key] = np.zeros_like(val)
                self.v[key] = np.zeros_like(val)

        self.iter += 1
        # bias correction: lr_t folds the learning rate and both correction factors together
        lr_t = self.lr * np.sqrt(1.0 - self.beta2 ** self.iter) / (1.0 - self.beta1 ** self.iter)

        for key in params.keys():
            self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * grads[key]
            self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * (grads[key] ** 2)
            # self.m[key] += (1 - self.beta1) * (grads[key] - self.m[key])       # the form Andrew Ng writes
            # self.v[key] += (1 - self.beta2) * (grads[key] ** 2 - self.v[key])  # the form Andrew Ng writes
            params[key] -= lr_t * self.m[key] / (np.sqrt(self.v[key]) + 1e-7)
You may notice that the lines used here look different from the ones Andrew Ng writes (the two commented-out lines). In fact they are the same: his incremental form m += (1 - β1)(g - m) expands to exactly β1·m + (1 - β1)·g, so the two versions are algebraically identical, which is why the figures they produce are indistinguishable. Use whichever form you prefer. The only part that may be hard to follow is that the learning rate is multiplied by both correction factors at once; a few lines of algebra on paper make it clear, as shown below.
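Here is that short derivation (ignoring the tiny 1e-7 term). Write the bias-corrected moments as m̂ = m / (1 - β1^t) and v̂ = v / (1 - β2^t). Then Adam's update
    W ← W - lr · m̂ / √v̂
      = W - lr · [m / (1 - β1^t)] / √[v / (1 - β2^t)]
      = W - [lr · √(1 - β2^t) / (1 - β1^t)] · m / √v
      = W - lr_t · m / √v
is exactly the update the code performs with lr_t, so folding the corrections into the learning rate is equivalent to correcting m and v explicitly.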
We have now introduced five methods: SGD, Momentum, AdaGrad, RMSprop and Adam. So which one is best? Unfortunately, there is (so far) no method that performs well on every problem. Each of the five has its own character, with problems it handles well and problems it does not. SGD is still used in many studies, and Momentum and AdaGrad are also well worth trying. Recently, many researchers and engineers have come to favor Adam. In the rest of this series we will mainly use SGD and Adam.
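As a wrap-up, here is a sketch that runs all five optimizers on the same f(x, y) = x²/20 + y² demo and plots their paths side by side. It assumes the five classes above are saved in optimizer.py (with import numpy as np at the top of that file); the learning rates below are my own illustrative choices, not values prescribed by the book.

import numpy as np
import matplotlib.pyplot as plt
from optimizer import SGD, Momentum, AdaGrad, RMSprop, Adam

def f(x, y):
    return x**2 / 20.0 + y**2

def df(x, y):
    return x / 10.0, 2.0 * y

optimizers = {
    "SGD": SGD(lr=0.95),
    "Momentum": Momentum(lr=0.1),
    "AdaGrad": AdaGrad(lr=1.5),
    "RMSprop": RMSprop(lr=0.1, decay_rate=0.98),
    "Adam": Adam(lr=0.3),
}

# Contour background, reused for every subplot
x = np.arange(-10, 10, 0.01)
y = np.arange(-5, 5, 0.01)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
Z[Z > 7] = 0  # same masking trick as before, to keep the contours readable

plt.figure(figsize=(12, 8))
for idx, (name, optimizer) in enumerate(optimizers.items(), start=1):
    params = {'x': -7.0, 'y': 2.0}
    grads = {'x': 0.0, 'y': 0.0}
    x_history, y_history = [], []
    for i in range(30):
        x_history.append(params['x'])
        y_history.append(params['y'])
        grads['x'], grads['y'] = df(params['x'], params['y'])
        optimizer.update(params, grads)

    plt.subplot(2, 3, idx)
    plt.plot(x_history, y_history, 'o-', color="red")
    plt.contour(X, Y, Z)
    plt.plot(0, 0, '+')
    plt.xlim(-10, 10)
    plt.ylim(-10, 10)
    plt.title(name)
plt.show()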