Neural network optimization

2022-07-28 06:14:00 【Jiyu Wangchuan】

1. Neural network (NN) complexity

NN complexity is usually expressed by the number of layers and the number of parameters of the network.

Space complexity
Number of layers = number of hidden layers + 1 output layer
Total parameters = total number of w + total number of b
For the network in the figure above (3 inputs, one hidden layer of 4 neurons, 2 outputs): 3x4+4 + 4x2+2 = 26
Time complexity
Number of multiply-add operations
For the network in the figure above: 3x4 + 4x2 = 20
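
The two counts above can be reproduced with a few lines of plain Python. This is only a sketch for fully connected layers; the helper name `count_complexity` and the list `[3, 4, 2]` (inputs, hidden neurons, outputs, as implied by the arithmetic above) are assumptions made for illustration.

```python
def count_complexity(layer_sizes):
    """Return (total_parameters, multiply_adds) for a fully connected network.

    layer_sizes lists the number of units per layer, input layer first.
    """
    total_params = 0
    total_mult_adds = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total_params += n_in * n_out + n_out  # weights w plus biases b
        total_mult_adds += n_in * n_out       # one multiply-add per weight
    return total_params, total_mult_adds


total_params, total_mult_adds = count_complexity([3, 4, 2])
print(total_params, total_mult_adds)  # 26 20
```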

2. Learning rate

$w_{t+1}=w_t-lr*\frac{\partial loss}{\partial w_t}$
$w_{t+1}$: updated parameter, $w_{t}$: current parameter, $lr$: learning rate, $\frac{\partial loss}{\partial w_t}$: gradient of the loss function with respect to the current parameter

Exponentially decaying learning rate

Start with a relatively large learning rate to reach a good solution quickly, then reduce it gradually so that the model stabilizes in the later stage of training.
Exponentially decayed learning rate = initial learning rate * decay rate$^{(\text{current epoch} / \text{number of epochs per decay})}$

import tensorflow as tf

w = tf.Variable(tf.constant(5, dtype=tf.float32))

EPOCHS = 40
LR_BASE = 0.2  # initial learning rate
LR_DECAY = 0.99  # learning rate decay rate
LR_STEP = 1  # number of rounds (each feeding BATCH_SIZE samples) between learning rate updates

# Top-level loop over the dataset. In this example the "dataset" is just the
# single parameter w, initialized to 5, and the loop runs EPOCHS (40) iterations.
for epoch in range(EPOCHS):
    lr = LR_BASE * LR_DECAY ** (epoch / LR_STEP)
    with tf.GradientTape() as tape:  # the with block records the operations needed to compute the gradient
        loss = tf.square(w + 1)
    grads = tape.gradient(loss, w)  # derivative of loss with respect to w

    w.assign_sub(lr * grads)  # in-place subtraction, i.e. w = w - lr * grads
    print("After %s epoch,w is %f,loss is %f,lr is %f" % (epoch, w.numpy(), loss, lr))

The output is as follows (first six epochs):

After 0 epoch,w is 2.600000,loss is 36.000000,lr is 0.200000
After 1 epoch,w is 1.174400,loss is 12.959999,lr is 0.198000
After 2 epoch,w is 0.321948,loss is 4.728015,lr is 0.196020
After 3 epoch,w is -0.191126,loss is 1.747547,lr is 0.194060
After 4 epoch,w is -0.501926,loss is 0.654277,lr is 0.192119
After 5 epoch,w is -0.691392,loss is 0.248077,lr is 0.190198
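
As a cross-check of the numbers above, the same update can be sketched in plain Python without TensorFlow. The sketch assumes the analytic gradient 2(w + 1) of loss = (w + 1)^2 in place of `tf.GradientTape`; the `history` list is added only for illustration. Because TensorFlow computes in float32, the last printed digits may differ slightly.

```python
LR_BASE = 0.2    # initial learning rate
LR_DECAY = 0.99  # learning rate decay rate
LR_STEP = 1      # number of epochs between learning rate updates

w = 5.0
history = []  # (epoch, w after update, loss before update, lr)
for epoch in range(6):
    lr = LR_BASE * LR_DECAY ** (epoch / LR_STEP)
    loss = (w + 1) ** 2
    grad = 2 * (w + 1)  # analytic derivative of (w + 1)^2 (assumption replacing GradientTape)
    w -= lr * grad
    history.append((epoch, w, loss, lr))
    print("After %s epoch,w is %f,loss is %f,lr is %f" % (epoch, w, loss, lr))
```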

3. Activation functions

3.1 Simplified model and MP model

Simplified model: $y = x * w + b$

MP model: $y = f (x * w + b)$
$f$ is the activation function

3.2 Properties of a good activation function

Nonlinearity: when the activation function is nonlinear, a multilayer neural network can approximate any function
Differentiability: most optimizers update parameters with gradient descent
Monotonicity: when the activation function is monotonic, the loss function of a single-layer network is guaranteed to be convex
Approximate identity: f(x)≈x; when the parameters are initialized to small random values, the neural network is more stable
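
To see why nonlinearity matters, here is a minimal sketch with scalar weights chosen arbitrarily for illustration (`w1`, `b1`, `w2`, `b2` and the helper names are assumptions, not from the original). Two stacked layers without an activation collapse into one linear layer; inserting relu between them breaks that equivalence.

```python
def linear(x, w, b):
    return x * w + b


def relu(z):
    return max(z, 0.0)


w1, b1 = 2.0, 1.0   # first layer (illustrative values)
w2, b2 = -3.0, 0.5  # second layer (illustrative values)


def two_linear_layers(x):
    # No activation between the layers
    return linear(linear(x, w1, b1), w2, b2)


def two_layers_with_relu(x):
    # relu applied to the hidden output
    return linear(relu(linear(x, w1, b1)), w2, b2)


# A single linear layer that reproduces the stacked linear layers exactly
w_eq, b_eq = w1 * w2, b1 * w2 + b2

for x in (-2.0, 0.0, 3.0):
    print(x, two_linear_layers(x), linear(x, w_eq, b_eq), two_layers_with_relu(x))
```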

Range of the activation function's output:

When the output of the activation function is bounded, gradient-based optimization is more stable
When the output of the activation function is unbounded, reducing the learning rate is recommended

3.3 Commonly used activation functions

3.3.1 Sigmoid function

tf.nn.sigmoid(x)
$f(x)=\frac{1}{1+e^{-x}}$

Characteristics:
(1) Prone to vanishing gradients
(2) The output is not zero-centered, so convergence is slow
(3) The exponentiation is expensive, so training takes longer
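
The vanishing-gradient behaviour can be made concrete with a small plain-Python sketch. It uses the standard identity f'(x) = f(x)(1 - f(x)) for the sigmoid; the helper names are illustrative. The derivative is at most 0.25 (at x = 0) and shrinks quickly as |x| grows, while the output itself is always positive.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # derivative of the sigmoid


for x in (0.0, 2.0, 5.0, 10.0):
    print(x, sigmoid(x), sigmoid_grad(x))
```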

3.3.2 Tanh function

tf.math.tanh(x)
$f(x)=\frac{1-e^{-2x}}{1+e^{-2x}}$

Characteristics:
(1) The output is zero-centered
(2) Prone to vanishing gradients
(3) The exponentiation is expensive, so training takes longer
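
As a quick sanity check, the formula above can be evaluated directly and compared with `math.tanh` from the Python standard library. This is only a sketch; the helper names are illustrative. The output is symmetric around 0, and the derivative 1 - f(x)^2 still vanishes for large |x|.

```python
import math


def tanh_from_formula(x):
    e = math.exp(-2.0 * x)
    return (1.0 - e) / (1.0 + e)


def tanh_grad(x):
    return 1.0 - tanh_from_formula(x) ** 2  # derivative of tanh


for x in (-1.5, 0.0, 0.7, 10.0):
    print(x, tanh_from_formula(x), math.tanh(x), tanh_grad(x))
```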

3.3.3 ReLU function

tf.nn.relu(x)
$f(x)=\max(x, 0)$

Advantages:
(1) Solves the vanishing gradient problem (in the positive range)
(2) Only needs to check whether the input is greater than 0, so it is fast to compute
(3) Converges much faster than sigmoid and tanh

Disadvantages:
(1) The output is not zero-centered, so convergence is slow
(2) Dead ReLU problem: some neurons may never be activated, so the corresponding parameters are never updated.

3.3.4 Leaky ReLU function

tf.nn.leaky_relu(x)
$f(x)=\max(\alpha x,x)$

In theory, Leaky ReLU has all the advantages of ReLU and also avoids the Dead ReLU problem, but in practice it has not been shown that Leaky ReLU is always better than ReLU.
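
The difference between the two functions shows up in the gradient for negative inputs. The following sketch is plain Python; the helper names and `alpha = 0.2` are illustrative assumptions, and the formula max(αx, x) is only valid for 0 < α < 1. ReLU passes no gradient when the input is negative, which is what allows a neuron to "die", while Leaky ReLU keeps a small slope α.

```python
def relu(x):
    return max(x, 0.0)


def relu_grad(x):
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, alpha=0.2):
    return max(alpha * x, x)  # valid for 0 < alpha < 1


def leaky_relu_grad(x, alpha=0.2):
    return 1.0 if x > 0 else alpha


for x in (-3.0, -0.5, 2.0):
    print(x, relu(x), relu_grad(x), leaky_relu(x), leaky_relu_grad(x))
```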

3.4 Advice for beginners

Prefer the relu activation function;
Set the learning rate to a relatively small value;
Standardize the input features, i.e. make them follow a normal distribution with mean 0 and standard deviation 1;
Center the initial parameters, i.e. draw the randomly generated parameters from a normal distribution with mean 0 and standard deviation $\sqrt{\frac{2}{\text{number of input features of the current layer}}}$.
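
The last two pieces of advice can be sketched in plain Python as follows. The helper names, the fixed seed, and the layer size 200 x 50 are assumptions for illustration; the standard deviation sqrt(2 / n_in) is the He-style initialization that this sketch assumes the advice refers to.

```python
import math
import random


def standardize(values):
    """Shift and scale values to mean 0 and standard deviation 1."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def init_weights(n_in, n_out, seed=0):
    """Draw an n_in x n_out weight matrix from N(0, sqrt(2 / n_in))."""
    rng = random.Random(seed)
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]


features = standardize([1.0, 2.0, 3.0, 4.0, 5.0])
weights = init_weights(200, 50)
print(features)
print(len(weights), len(weights[0]))  # 200 50
```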

4. Loss function

Loss function (loss): the gap between the predicted value (y) and the known answer (y_)
Optimization objective of a neural network: minimize the loss function loss

4.1 Mean squared error (MSE)

loss_mse = tf.reduce_mean(tf.square(y_ - y))
$MSE(y\_, y)=\frac{\sum_{i=1}^n(y-y\_)^2}{n}$
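
For readers without TensorFlow at hand, the MSE formula can be sketched in plain Python. The function name `mse` and the sample values are illustrative assumptions.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between known answers and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_ = [1.0, 2.0, 3.0]  # known answers (illustrative)
y = [1.5, 2.0, 2.0]   # predictions (illustrative)
loss_mse = mse(y_, y)  # (0.25 + 0 + 1) / 3
print(loss_mse)
```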

4.2 Cross entropy CE

Cross entropy loss function CE (Cross Entropy): characterizes the distance between two probability distributions
$H(y\_,y)=-\sum y\_*\ln y$
tf.losses.categorical_crossentropy(y_,y)
Example: for a two-class problem with known answer y_=(1, 0) and predictions y1=(0.6, 0.4) and y2=(0.8, 0.2), which prediction is closer to the standard answer?
$H_1((1,0),(0.6,0.4)) = -(1*\ln 0.6 + 0*\ln 0.4) \approx -(-0.511 + 0) = 0.511$
$H_2((1,0),(0.8,0.2)) = -(1*\ln 0.8 + 0*\ln 0.2) \approx -(-0.223 + 0) = 0.223$
Because $H_1 > H_2$, y2 is the more accurate prediction.
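
The comparison in this example can be checked numerically with a plain-Python sketch of the cross entropy formula. The function name `cross_entropy` is an illustrative assumption, and the predictions must contain no zero probability for the logarithm to be defined.

```python
import math


def cross_entropy(y_true, y_pred):
    """H(y_, y) = -sum(y_ * ln y); y_pred must not contain zeros."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))


h1 = cross_entropy((1, 0), (0.6, 0.4))
h2 = cross_entropy((1, 0), (0.8, 0.2))
print(h1, h2, "y2 is closer" if h2 < h1 else "y1 is closer")
```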

Softmax combined with cross entropy

The output first passes through the softmax function, and then the cross entropy loss between y and y_ is computed.
tf.nn.softmax_cross_entropy_with_logits(y_,y)
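
The two steps described above can be sketched in plain Python. The function names are illustrative assumptions, and subtracting the maximum logit before exponentiating is a common numerical-stability trick added here, not something stated in the original.

```python
import math


def softmax(logits):
    """Turn raw outputs into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability (assumption)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_cross_entropy(y_true, logits):
    """Apply softmax to the logits, then compute the cross entropy with y_true."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(y_true, probs))


print(softmax([1.0, 2.0, 3.0]))
print(softmax_cross_entropy((1, 0), (2.0, 0.0)))  # confident and correct: small loss
print(softmax_cross_entropy((1, 0), (0.0, 2.0)))  # confident and wrong: large loss
```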

5. Underfitting and overfitting

The figure above shows underfitting, a correct fit, and overfitting

Remedies for underfitting

Add input features
Add network parameters
Decrease the regularization parameter

Remedies for overfitting

Clean the data
Enlarge the training set
Use regularization
Increase the regularization parameter

6. Regularization to alleviate overfitting

Regularization introduces a model complexity term into the loss function: it adds a weighted penalty on W, which weakens the influence of noise in the training data (b is generally not regularized)

L1 regularization

$loss_{L1}(w)=\sum_i |w_i|$

L2 regularization

$loss_{L2}(w)=\sum_i |w_i^2|$

Choosing a regularizer

L1 regularization tends to drive many parameters to exactly zero, so it reduces complexity by making the parameters sparse, i.e. by reducing the number of nonzero parameters.
L2 regularization pushes parameters very close to zero but not exactly to zero, so it reduces complexity by shrinking the magnitude of the parameter values.

# x_train, y_train, w1, b1, w2 and b2 are assumed to be defined by the surrounding training script
with tf.GradientTape() as tape:  # record the operations for gradient computation
    h1 = tf.matmul(x_train, w1) + b1  # multiply-add of the first layer
    h1 = tf.nn.relu(h1)
    y = tf.matmul(h1, w2) + b2

    # mean squared error loss: mse = mean((y_train - y) ** 2)
    loss_mse = tf.reduce_mean(tf.square(y_train - y))
    # add L2 regularization
    loss_regularization = []
    # tf.nn.l2_loss(w) = sum(w ** 2) / 2
    loss_regularization.append(tf.nn.l2_loss(w1))
    loss_regularization.append(tf.nn.l2_loss(w2))

    loss_regularization = tf.reduce_sum(loss_regularization)
    loss = loss_mse + 0.03 * loss_regularization  # REGULARIZER = 0.03
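
The L1 and L2 penalties defined above can also be written out in plain Python. This is a sketch with illustrative helper names and weights; it follows the formulas literally, so unlike `tf.nn.l2_loss` it does not divide the L2 sum by 2.

```python
def l1_penalty(weights):
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    return sum(w * w for w in weights)


def regularized_loss(data_loss, weights, regularizer=0.03, penalty=l2_penalty):
    """loss = data loss + REGULARIZER * penalty(w); biases are left out."""
    return data_loss + regularizer * penalty(weights)


weights = [0.5, -2.0, 0.0, 1.5]  # illustrative values
print(l1_penalty(weights))  # 4.0
print(l2_penalty(weights))  # 6.5
print(regularized_loss(1.0, weights))
print(regularized_loss(1.0, weights, penalty=l1_penalty))
```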

7. Neural network parameter optimizers

Given the parameters to be optimized w, the loss function loss, and the learning rate lr, each iteration processes one batch, and t denotes the total number of batch iterations so far:

Compute the gradient of the loss function at step t with respect to the current parameters: $g_t=\nabla loss=\frac{\partial loss}{\partial w_t}$
Compute the first-order momentum $m_t$ and the second-order momentum $V_t$ at step t
Compute the descent step at step t: $\eta_t=lr*m_t/\sqrt{V_t}$
Compute the parameters at step t+1:
$w_{t+1}=w_t-\eta _t=w_t-lr*m_t/\sqrt{V_t}$

First-order momentum: a function of the gradient
Second-order momentum: a function of the squared gradient
7.1 SGD

SGD (without momentum), the commonly used gradient descent method
$m_t=g_t$
$V_t=1$
$\eta_t=lr*m_t/\sqrt{V_t}=lr*g_t$
$w_{t+1}=w_t-\eta_t=w_t-lr*g_t$

# compute the gradient of loss with respect to each parameter
grads = tape.gradient(loss, [w1, b1])

# gradient update: w1 = w1 - lr * w1_grad, b1 = b1 - lr * b1_grad
w1.assign_sub(lr * grads[0])  # update parameter w1 in place
b1.assign_sub(lr * grads[1])  # update parameter b1 in place

7.2 SGDM

SGDM (SGD with momentum) adds first-order momentum to SGD
$m_t=\beta \cdot m_{t-1} + (1-\beta)\cdot g_t$
$V_t=1$
$\eta_t=lr*m_t/\sqrt{V_t}=lr*(\beta \cdot m_{t-1} + (1-\beta)\cdot g_t)$
$w_{t+1}=w_t-\eta_t=w_t-lr*(\beta \cdot m_{t-1} + (1-\beta)\cdot g_t)$

# initialize once, before the training loop
m_w, m_b = 0, 0
beta = 0.9

# inside the training loop, after grads has been computed
m_w = beta * m_w + (1 - beta) * grads[0]
m_b = beta * m_b + (1 - beta) * grads[1]
w1.assign_sub(lr * m_w)
b1.assign_sub(lr * m_b)

7.3 Adagrad

Adagrad adds second-order momentum to SGD
$m_t=g_t$
$V_t=\sum^t_{\tau=1}g^2_{\tau}$
$\eta_t=lr*m_t/\sqrt{V_t}=lr*g_t/\sqrt{\sum^t_{\tau=1}g^2_{\tau}}$
$w_{t+1}=w_t-\eta_t=w_t-lr*g_t/\sqrt{\sum^t_{\tau=1}g^2_{\tau}}$

# initialize once, before the training loop
v_w, v_b = 0, 0

# inside the training loop, after grads has been computed
v_w += tf.square(grads[0])
v_b += tf.square(grads[1])
w1.assign_sub(lr * grads[0] / tf.sqrt(v_w))
b1.assign_sub(lr * grads[1] / tf.sqrt(v_b))

7.4 RMSProp

RMSProp also adds second-order momentum to SGD, using an exponential moving average of the squared gradients instead of their cumulative sum
$m_t=g_t$
$V_t=\beta \cdot V_{t-1}+(1-\beta)\cdot g_t^2$
$\eta_t = lr*m_t/\sqrt{V_t}=lr*g_t/\sqrt{\beta \cdot V_{t-1}+(1-\beta)\cdot g_t^2}$
$w_{t+1}=w_t-lr*m_t/\sqrt{V_t}=w_t-lr*g_t/\sqrt{\beta \cdot V_{t-1}+(1-\beta)\cdot g_t^2}$

# initialize once, before the training loop
v_w, v_b = 0, 0
beta = 0.9

# inside the training loop, after grads has been computed
v_w = beta * v_w + (1 - beta) * tf.square(grads[0])
v_b = beta * v_b + (1 - beta) * tf.square(grads[1])
w1.assign_sub(lr * grads[0] / tf.sqrt(v_w))
b1.assign_sub(lr * grads[1] / tf.sqrt(v_b))

7.5 Adam

Adam combines the first-order momentum of SGDM with the second-order momentum of RMSProp
$m_t=\beta_1 \cdot m_{t-1} + (1-\beta_1)\cdot g_t$
Bias-corrected first-order momentum: $\hat{m_t}=\frac{m_t}{1-\beta_1^t}$
$V_t=\beta_2 \cdot V_{t-1}+(1-\beta_2)\cdot g_t^2$
Bias-corrected second-order momentum: $\hat{V_t}=\frac{V_t}{1-\beta_2^t}$
$\eta_t = lr*\hat{m_t}/\sqrt{\hat{V_t}}=lr*\frac{m_t}{1-\beta_1^t}/\sqrt{\frac{V_t}{1-\beta_2^t}}$
$w_{t+1}=w_t-\eta_t=w_t-lr*\frac{m_t}{1-\beta_1^t}/\sqrt{\frac{V_t}{1-\beta_2^t}}$

# initialize once, before the training loop
m_w, m_b = 0, 0
v_w, v_b = 0, 0
beta1, beta2 = 0.9, 0.999
delta_w, delta_b = 0, 0
global_step = 0

# inside the training loop, after grads has been computed
global_step += 1  # count this batch first; with global_step = 0 the corrections below divide by zero
m_w = beta1 * m_w + (1 - beta1) * grads[0]
m_b = beta1 * m_b + (1 - beta1) * grads[1]
v_w = beta2 * v_w + (1 - beta2) * tf.square(grads[0])
v_b = beta2 * v_b + (1 - beta2) * tf.square(grads[1])

m_w_correction = m_w / (1 - tf.pow(beta1, int(global_step)))
m_b_correction = m_b / (1 - tf.pow(beta1, int(global_step)))
v_w_correction = v_w / (1 - tf.pow(beta2, int(global_step)))
v_b_correction = v_b / (1 - tf.pow(beta2, int(global_step)))

w1.assign_sub(lr * m_w_correction / tf.sqrt(v_w_correction))
b1.assign_sub(lr * m_b_correction / tf.sqrt(v_b_correction))
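
To compare the five update rules side by side, here is a minimal plain-Python sketch that applies each of them to the single-parameter loss (w + 1)^2 used earlier in this article. The function `run`, the step count, lr = 0.1, and the small constant `EPS` in the denominator are assumptions made for illustration; the snippets above do not use an epsilon term.

```python
import math

EPS = 1e-7  # assumption: small constant guarding against division by zero


def grad(w):
    return 2.0 * (w + 1.0)  # derivative of loss = (w + 1)^2


def run(optimizer, steps=200, lr=0.1, beta=0.9, beta1=0.9, beta2=0.999):
    """Minimize (w + 1)^2 from w = 5 and return (final w, final loss)."""
    w, m, v = 5.0, 0.0, 0.0
    for t in range(1, steps + 1):  # t starts at 1 so Adam's correction is defined
        g = grad(w)
        if optimizer == "sgd":
            m_t, v_t = g, 1.0
        elif optimizer == "sgdm":
            m = beta * m + (1 - beta) * g
            m_t, v_t = m, 1.0
        elif optimizer == "adagrad":
            v += g * g  # cumulative sum of squared gradients
            m_t, v_t = g, v
        elif optimizer == "rmsprop":
            v = beta * v + (1 - beta) * g * g  # moving average of squared gradients
            m_t, v_t = g, v
        elif optimizer == "adam":
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g * g
            m_t = m / (1 - beta1 ** t)  # bias-corrected first-order momentum
            v_t = v / (1 - beta2 ** t)  # bias-corrected second-order momentum
        else:
            raise ValueError("unknown optimizer: %s" % optimizer)
        w -= lr * m_t / (math.sqrt(v_t) + EPS)
    return w, (w + 1.0) ** 2


results = {name: run(name) for name in ("sgd", "sgdm", "adagrad", "rmsprop", "adam")}
for name, (w_final, loss_final) in results.items():
    print("%-8s w = %f, loss = %f" % (name, w_final, loss_final))
```

With these settings every optimizer ends below the initial loss of 36, but at very different speeds; Adagrad in particular slows down as its accumulated squared gradients grow. The numbers depend on the chosen lr and step count and should not be read as a general ranking.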