
[Foundations of deep learning] Learning of neural networks (3)

2022-06-11 17:36:00 Programmer Xiao Li

From the previous posts we know how to compute partial derivatives of a function of several variables. Learning in a neural network is the process of continually searching for the optimal parameters, and computing partial derivatives and gradients is how we find the direction that moves the parameters toward the optimum fastest.

What is a gradient

A gradient is the vector made up of the partial derivatives of a function of several variables. Take the following function as an example:

f(x0, x1) = x0² + x1²

The partial derivative with respect to x0:

∂f/∂x0 = 2·x0

The partial derivative with respect to x1:

∂f/∂x1 = 2·x1

So the gradient is the vector formed by the two partial derivatives:

( ∂f/∂x0, ∂f/∂x1 ) = ( 2·x0, 2·x1 )
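For this particular function we can also write the gradient down by hand. The small helper below (analytic_gradient is just an illustrative name, not from the original post) serves only as a sanity check against the numerical results computed further down:

import numpy as np

def analytic_gradient(x):
    # For f(x0, x1) = x0**2 + x1**2 the partial derivatives are 2*x0 and 2*x1
    return 2 * x

print(analytic_gradient(np.array([3.0, 4.0])))  # [6. 8.]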

As we saw earlier, each partial derivative can be approximated with a central difference; when differentiating with respect to one variable, the variables of the other dimensions are held fixed as constants:

import numpy as np

def numerical_gradient(f, x):
    # Small step used for the central difference
    h = 1e-4  # 0.0001

    # Gradient array with the same shape as x
    grad = np.zeros_like(x)

    for idx in range(x.size):
        # Remember the original value of this coordinate
        tmp_val = x[idx]

        # f(x + h) for this coordinate
        x[idx] = tmp_val + h
        fxh1 = f(x)

        # f(x - h) for this coordinate
        x[idx] = tmp_val - h
        fxh2 = f(x)

        # Central difference
        grad[idx] = (fxh1 - fxh2) / (2 * h)

        # Restore the original value
        x[idx] = tmp_val

    return grad

With this in hand, computing the partial derivatives and the gradient of the function at any given point is straightforward:

def function_2(x):
    return x[0]**2 + x[1]**2

>>> numerical_gradient(function_2, np.array([3.0, 4.0]))
array([ 6.,  8.])
>>> numerical_gradient(function_2, np.array([0.0, 2.0]))
array([ 0.,  4.])
>>> numerical_gradient(function_2, np.array([3.0, 0.0]))
array([ 6.,  0.])

We can see that the partial derivatives of the function differ from point to point. So what do these partial derivatives, and the gradient, actually mean?

If we draw the negative gradient at each point as an arrow, we get a plot like the figure that originally appeared here: the arrows all point toward the minimum of the function, just as in the accompanying 3-D surface plot they all point toward the bottom of the bowl.

In fact, the direction of the negative gradient is the direction in which the function decreases fastest at that position.
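As an illustrative sketch of that picture (assuming matplotlib is available; the code below only reproduces the idea of the missing figure and uses the analytic gradient (2·x0, 2·x1) of this particular function):

import numpy as np
import matplotlib.pyplot as plt

# Grid of (x0, x1) points
x0 = np.arange(-2.0, 2.25, 0.25)
x1 = np.arange(-2.0, 2.25, 0.25)
X, Y = np.meshgrid(x0, x1)

# Negative gradient of f(x0, x1) = x0**2 + x1**2 at every grid point
U = -2 * X
V = -2 * Y

# Every arrow points toward the minimum at the origin
plt.quiver(X, Y, U, V, angles='xy')
plt.xlabel('x0')
plt.ylabel('x1')
plt.show()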

Gradient descent method

Since the negative gradient points in the direction in which the function decreases fastest at a given position, we can exploit this property: adjust the parameters little by little until they become optimal, i.e. reach the minimum of the loss function.

In other words, at every step we compute the gradient at the current position, then move a small amount in the direction of descent, and in this way approach the target step by step.

The update rule for each coordinate is

x0 ← x0 − η · ∂f/∂x0
x1 ← x1 − η · ∂f/∂x1

Here η is the learning rate, which controls how big a step we take each time. The learning rate should not be too large, otherwise the position jumps back and forth and never settles on the optimum; it should not be too small either, otherwise learning becomes very slow and inefficient.

(In practice the learning rate is often adjusted dynamically: a larger learning rate in the early steps for speed, and a smaller one in the later steps, which is slower but avoids overshooting the optimum.)
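As a small worked example (the numbers here are just an illustration, not from the original post), a single update step starting from (-3.0, 4.0) with η = 0.1 looks like this:

import numpy as np

eta = 0.1                   # learning rate
x = np.array([-3.0, 4.0])   # current position

grad = 2 * x                # gradient of f(x0, x1) = x0**2 + x1**2 is (2*x0, 2*x1)
x = x - eta * grad          # update rule: x <- x - eta * grad

print(x)                    # [-2.4  3.2]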

Let's first implement the position update loop of gradient descent:

def gradient_descent(f, init_x, lr=0.01, step_num=100):
    x = init_x
    # Iterate the update step_num times
    for i in range(step_num):
        # Compute the gradient at the current position
        grad = numerical_gradient(f, x)
        # Move a small step in the direction of the negative gradient
        x -= lr * grad

    return x

Now let's try to find the minimum:

>>> def function_2(x):
...     return x[0]**2 + x[1]**2
...
>>> init_x = np.array([-3.0, 4.0])
>>> gradient_descent(function_2, init_x=init_x, lr=0.1, step_num=100)
array([ -6.11110793e-10,   8.14814391e-10])

The result is very close to (0, 0), which shows that gradient descent really did find the minimum of the function.
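To see the learning-rate trade-off mentioned earlier in action, the same descent can be rerun with an overly large and an overly small η (a sketch reusing the functions defined above): with lr=10.0 the steps overshoot and the values blow up, while with lr=1e-10 the position barely moves away from the starting point after 100 steps.

# Learning rate too large: each step overshoots and the values blow up
init_x = np.array([-3.0, 4.0])
print(gradient_descent(function_2, init_x=init_x, lr=10.0, step_num=100))

# Learning rate too small: after 100 steps x has barely moved from (-3.0, 4.0)
init_x = np.array([-3.0, 4.0])
print(gradient_descent(function_2, init_x=init_x, lr=1e-10, step_num=100))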

In fact, the independent variable really does move toward the lowest point step by step along the direction of the negative gradient.
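A quick way to check this is to record every intermediate position during the descent and plot the path (gradient_descent_history is a name made up here; the sketch assumes numerical_gradient, function_2 and matplotlib are available):

import matplotlib.pyplot as plt

def gradient_descent_history(f, init_x, lr=0.1, step_num=20):
    # Same as gradient_descent, but also records every visited position
    x = init_x.copy()
    history = [x.copy()]
    for i in range(step_num):
        grad = numerical_gradient(f, x)
        x -= lr * grad
        history.append(x.copy())
    return x, np.array(history)

x, history = gradient_descent_history(function_2, np.array([-3.0, 4.0]))

# The recorded points march from (-3.0, 4.0) toward the minimum at (0, 0)
plt.plot(history[:, 0], history[:, 1], 'o-')
plt.xlabel('x0')
plt.ylabel('x1')
plt.show()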

Gradients in a neural network

The gradient used during learning is the set of partial derivatives of the loss function L with respect to the weight matrix W, written ∂L/∂W; it has the same shape as W.

Weight update

Each weight is then adjusted against its own gradient, just as before:

W ← W − η · ∂L/∂W
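As a minimal sketch of what this looks like in code (the helper names, the 2x3 weight shape, and the single training example below are all illustrative choices, not taken from the original post): a one-layer softmax classifier whose weight gradient ∂L/∂W is computed with the same central-difference idea, here rewritten with np.nditer so it can walk a 2-D matrix.

import numpy as np

def softmax(a):
    a = a - np.max(a)              # shift for numerical stability
    exp_a = np.exp(a)
    return exp_a / np.sum(exp_a)

def cross_entropy_error(y, t):
    # t is a one-hot label vector
    return -np.sum(t * np.log(y + 1e-7))

def numerical_gradient_2d(f, W):
    # Central-difference gradient for a 2-D weight matrix
    h = 1e-4
    grad = np.zeros_like(W)
    it = np.nditer(W, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        idx = it.multi_index
        tmp_val = W[idx]
        W[idx] = tmp_val + h
        fxh1 = f(W)
        W[idx] = tmp_val - h
        fxh2 = f(W)
        grad[idx] = (fxh1 - fxh2) / (2 * h)
        W[idx] = tmp_val           # restore the original value
        it.iternext()
    return grad

# A tiny one-layer network: 2 inputs, 3 output classes
W = np.random.randn(2, 3)          # weight matrix
x = np.array([0.6, 0.9])           # one input sample
t = np.array([0.0, 0.0, 1.0])      # one-hot label

def loss(W):
    y = softmax(np.dot(x, W))      # forward pass
    return cross_entropy_error(y, t)

dW = numerical_gradient_2d(loss, W)    # dL/dW has the same shape as W
lr = 0.1
W -= lr * dW                           # weight update: W <- W - eta * dL/dW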


Copyright notice
This article was written by [Programmer Xiao Li]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/162/202206111725570358.html