6_ Gradient descent method
2022-07-27 00:40:00 【Acowardintheworld】
6_ Gradient descent method (Gradient Descent)
The gradient descent method is an important search strategy in machine learning. In this chapter we explain its basic principle in detail and improve the gradient descent algorithm step by step, building an understanding of its parameters, especially the meaning of the learning rate.
At the same time, we extend it to the stochastic gradient descent method and the mini-batch gradient descent method, for a comprehensive view of the gradient descent family.
6-1 What is the gradient descent method



The derivative describes how much J changes when theta changes by one unit.
The derivative also carries a direction: it points in the direction in which J increases, so moving against it decreases J.
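As a concrete illustration (a minimal sketch of my own, assuming the simple convex loss J(θ) = (θ − 2.5)² + 1, which is not from the original post), gradient descent repeatedly steps against the derivative until the loss stops changing:

import numpy as np

def J(theta):
    return (theta - 2.5) ** 2 + 1.    # a simple convex loss

def dJ(theta):
    return 2 * (theta - 2.5)          # its derivative

theta = 0.0      # initial point
eta = 0.1        # learning rate
for _ in range(1000):
    gradient = dJ(theta)
    last_theta = theta
    theta -= eta * gradient           # move against the gradient
    if abs(J(theta) - J(last_theta)) < 1e-8:
        break
print(theta)     # converges to about 2.5, the minimizer of J

Here eta, the learning rate, controls the step size: too small and convergence is slow, too large and the iteration can diverge.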





- Not all functions have a unique extreme point (e.g., non-convex multivariate functions), so gradient descent may stop at a local minimum; a common remedy is to run it several times from random initial points and keep the best result, as sketched below.
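A minimal sketch of the random-restart idea (my own illustration; the non-convex function and all constants here are assumptions, not from the original post):

import numpy as np

def J2(theta):
    return np.sin(3 * theta) + 0.3 * theta ** 2   # non-convex: several local minima

def dJ2(theta):
    return 3 * np.cos(3 * theta) + 0.6 * theta

best_theta, best_loss = None, float("inf")
np.random.seed(666)
for _ in range(10):                        # 10 restarts from random points
    theta = np.random.uniform(-5., 5.)
    for _ in range(1000):
        theta -= 0.01 * dJ2(theta)         # plain gradient descent step
    if J2(theta) < best_loss:              # keep the restart with the lowest loss
        best_theta, best_loss = theta, J2(theta)
print(best_theta, best_loss)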



6-3 Gradient descent method in linear regression





Without a 1/m factor, the size of the gradient grows with the number of samples m, which is clearly unreasonable; dividing by the number of samples m makes the gradient independent of the sample count.
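Written out (standard formulas transcribed here for reference; X_b denotes X with a leading column of ones and m the number of samples, matching the code in section 6-8 below):

J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - X_b^{(i)} \theta \right)^2

\nabla J(\theta) = \frac{2}{m} X_b^T (X_b \theta - y)

The second formula is exactly what dJ_math computes in the code below.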
6-4 The gradient descent method in linear regression is realized
6-5 Vectorization and data standardization of gradient descent method




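Because gradient descent uses a single learning rate eta across all dimensions, features on very different scales make eta hard to choose, so the data should be standardized first. A minimal sketch assuming scikit-learn's StandardScaler (my own example, not from the original post):

import numpy as np
from sklearn.preprocessing import StandardScaler

np.random.seed(666)
X = np.random.random(size=(100, 2))
X[:, 1] *= 100.                     # make the two features differ strongly in scale

scaler = StandardScaler()
scaler.fit(X)                       # learn per-feature mean and standard deviation
X_std = scaler.transform(X)         # each feature now has mean 0 and std 1
print(X_std.mean(axis=0), X_std.std(axis=0))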
6-6 Random gradient descent method



- The idea of the simulated annealing algorithm: it imitates the annealing of solid matter in physics, exploiting the similarity between that cooling process and general optimization problems. Starting from some initial temperature, as the temperature keeps falling, probabilistic jumps let the search randomly find the global optimum in the solution space. In stochastic gradient descent the same idea motivates a learning rate that decays over the iterations; a common schedule is eta = t0 / (i_iter + t1).
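A minimal stochastic gradient descent sketch with that decaying learning rate (my own illustration; t0 = 5 and t1 = 50 are assumed values, a common choice rather than anything fixed by the original post):

import numpy as np

def dJ_sgd(theta, X_b_i, y_i):
    # gradient estimated from a single sample
    return X_b_i * (X_b_i.dot(theta) - y_i) * 2.

def sgd(X_b, y, initial_theta, n_iters, t0=5, t1=50):
    def learning_rate(t):
        return t0 / (t + t1)                   # annealing-style decay
    theta = initial_theta
    for cur_iter in range(n_iters):
        rand_i = np.random.randint(len(X_b))   # pick one sample at random
        gradient = dJ_sgd(theta, X_b[rand_i], y[rand_i])
        theta = theta - learning_rate(cur_iter) * gradient
    return theta

np.random.seed(666)
m = 1000
x = np.random.normal(size=m)
X_b = np.hstack([np.ones((m, 1)), x.reshape(-1, 1)])
y = 4. * x + 3. + np.random.normal(0, 1, size=m)
print(sgd(X_b, y, np.zeros(2), n_iters=3 * m))   # should be close to [3, 4]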
6-8 How to determine the accuracy of gradient calculation ? Debug gradient descent method


# Code from the ipynb notebook (there, results are displayed by evaluating an expression rather than print())
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(666)
X = np.random.random(size=(1000, 10))
true_theta = np.arange(1, 12, dtype=float)    # 11 true parameters: intercept + 10 coefficients
X_b = np.hstack([np.ones((len(X), 1)), X])    # add the leading column of ones
y = X_b.dot(true_theta) + np.random.normal(size=1000)
print(X.shape)
print(y.shape)
print(true_theta)

def J(theta, X_b, y):
    # the MSE loss function
    try:
        return np.sum((y - X_b.dot(theta)) ** 2) / len(X_b)
    except:
        return float("inf")

def dJ_math(theta, X_b, y):
    # gradient from the closed-form mathematical formula
    return X_b.T.dot(X_b.dot(theta) - y) * 2. / len(y)

def dJ_debug(theta, X_b, y, epsilon=0.01):
    # gradient approximated numerically for debugging:
    # a central difference along each dimension
    res = np.empty(len(theta))
    for i in range(len(theta)):
        theta_1 = theta.copy()
        theta_1[i] += epsilon
        theta_2 = theta.copy()
        theta_2[i] -= epsilon
        res[i] = (J(theta_1, X_b, y) - J(theta_2, X_b, y)) / (2 * epsilon)
    return res

def gradient_descent(dJ, X_b, y, initial_theta, eta=1e-2, epsilon=1e-8, n_iters=1e4):
    theta = initial_theta
    i_iters = 0
    while i_iters < n_iters:
        gradient = dJ(theta, X_b, y)
        last_theta = theta
        theta = theta - eta * gradient
        # stop when the loss barely changes between iterations
        if abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon:
            break
        i_iters += 1
    return theta

X_b = np.hstack((np.ones((len(X), 1)), X))
initial_theta = np.zeros(X_b.shape[1])
eta = 0.01
%time theta = gradient_descent(dJ_debug, X_b, y, initial_theta, eta)
theta
%time theta = gradient_descent(dJ_math, X_b, y, initial_theta, eta)
theta
Tip
dJ_debug is used to verify the gradient. It is slower, so take a small number of samples, use dJ_debug to get a trusted result, then derive the mathematical gradient by formula and compare the two.
dJ_debug does not depend on the analytic form of the current loss function J, so this way of computing the gradient is universal.
6-9 More in-depth discussion on gradient descent method

BGD: every step traverses the whole sample set, so each step is guaranteed to move in the direction of steepest descent; stable but slow.
SGD: every step uses only one sample, so the direction of each step is uncertain and may even point away from the minimum; fast but unstable.
MBGD: a compromise between the two extremes, using k samples per step; k becomes another hyperparameter. See the sketch below.
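A minimal mini-batch gradient descent sketch (my own illustration; the batch size k, the epoch count, and the fixed learning rate are assumptions, not from the original post):

import numpy as np

def dJ_mbgd(theta, X_batch, y_batch):
    # gradient estimated from one mini-batch of k samples
    return X_batch.T.dot(X_batch.dot(theta) - y_batch) * 2. / len(y_batch)

def mbgd(X_b, y, initial_theta, n_epochs=50, k=16, eta=0.01):
    theta = initial_theta
    m = len(X_b)
    for _ in range(n_epochs):
        indexes = np.random.permutation(m)     # reshuffle once per epoch
        for start in range(0, m, k):
            batch = indexes[start:start + k]
            theta = theta - eta * dJ_mbgd(theta, X_b[batch], y[batch])
    return theta

With the X_b and y built in section 6-8 above, mbgd(X_b, y, np.zeros(X_b.shape[1])) should land near true_theta.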



Summary: some of the related code only ran for me in VS Code and would not run in Jupyter, especially the hstack() call.