Neural network optimization
2022-07-28 06:14:00 【Jiyu Wangchuan】
1. Neural network (NN) complexity
NN complexity is usually expressed by the number of layers and the number of parameters.
Space complexity
Number of layers = number of hidden layers + 1 output layer
Total parameters = total w + total b
In the example figure (a network with 3 inputs, one hidden layer of 4 neurons, and 2 outputs): 3x4+4 + 4x2+2 = 26
Time complexity
Number of multiply-add operations
In the example figure: 3x4 + 4x2 = 20
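The two counts above can be checked with a few lines of plain Python. This is a minimal sketch that assumes a fully connected network; the helper name `nn_complexity` and the list-of-widths input format are illustrative and not part of the original article.

```python
def nn_complexity(layer_sizes):
    """Return (layers, total parameters, multiply-adds) for a fully connected net.

    layer_sizes lists the input width followed by the width of every later
    layer, e.g. [3, 4, 2] for 3 inputs, 4 hidden neurons and 2 outputs.
    """
    pairs = list(zip(layer_sizes[:-1], layer_sizes[1:]))
    n_layers = len(layer_sizes) - 1                      # hidden layers + 1 output layer
    n_weights = sum(n_in * n_out for n_in, n_out in pairs)
    n_biases = sum(n_out for _, n_out in pairs)
    # One multiply-add per weight, so the time complexity equals the weight count
    return n_layers, n_weights + n_biases, n_weights


layers, params, mult_adds = nn_complexity([3, 4, 2])
print(layers, params, mult_adds)  # 2 26 20
```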
2. Learning rate
$$w_{t+1}=w_t-lr*\frac{\partial loss}{\partial w_t}$$
$w_{t+1}$: updated parameter, $w_t$: current parameter, $lr$: learning rate, $\frac{\partial loss}{\partial w_t}$: gradient of the loss function with respect to the current parameter
Exponentially decaying learning rate
Start with a larger learning rate to reach a good solution quickly, then reduce it gradually so that the model stays stable in the later stage of training.
$$\text{decayed learning rate} = \text{initial learning rate} * \text{decay rate}^{(\text{current epoch} / \text{epochs per decay step})}$$
import tensorflow as tf

w = tf.Variable(tf.constant(5, dtype=tf.float32))
EPOCHS = 40
LR_BASE = 0.2  # initial learning rate
LR_DECAY = 0.99  # learning rate decay rate
LR_STEP = 1  # number of epochs between learning rate updates

# The "dataset" here is the single parameter w, initialized to 5; the loop runs EPOCHS (40) times.
for epoch in range(EPOCHS):
    lr = LR_BASE * LR_DECAY ** (epoch / LR_STEP)
    with tf.GradientTape() as tape:  # the with block records the operations needed for the gradient
        loss = tf.square(w + 1)
    grads = tape.gradient(loss, w)  # derivative of loss with respect to w

    w.assign_sub(lr * grads)  # in-place update: w = w - lr * grads
    print("After %s epoch,w is %f,loss is %f,lr is %f" % (epoch, w.numpy(), loss.numpy(), lr))
The output for the first six epochs is:
After 0 epoch,w is 2.600000,loss is 36.000000,lr is 0.200000
After 1 epoch,w is 1.174400,loss is 12.959999,lr is 0.198000
After 2 epoch,w is 0.321948,loss is 4.728015,lr is 0.196020
After 3 epoch,w is -0.191126,loss is 1.747547,lr is 0.194060
After 4 epoch,w is -0.501926,loss is 0.654277,lr is 0.192119
After 5 epoch,w is -0.691392,loss is 0.248077,lr is 0.190198
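The run above can be reproduced without TensorFlow, because the gradient of (w + 1)^2 is simply 2(w + 1). The following is a minimal sketch under that assumption; it uses Python floats (64-bit), so the last printed digit may differ slightly from the float32 TensorFlow output, and the `history` list is an illustrative addition.

```python
LR_BASE, LR_DECAY, LR_STEP = 0.2, 0.99, 1

w = 5.0
history = []
for epoch in range(40):
    lr = LR_BASE * LR_DECAY ** (epoch / LR_STEP)   # exponentially decayed learning rate
    loss = (w + 1) ** 2
    grad = 2 * (w + 1)                             # analytic derivative of (w + 1)^2
    w -= lr * grad
    history.append((epoch, w, loss, lr))

for epoch, w_e, loss_e, lr_e in history[:3]:
    print("After %s epoch,w is %f,loss is %f,lr is %f" % (epoch, w_e, loss_e, lr_e))
print("final w:", history[-1][1])
```

The final value of w approaches -1, the minimizer of the loss, while the learning rate keeps shrinking.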
3. Activation function
3.1 Simplified model and MP model
Simplified model:
$$y=x*w+b$$
MP model:
$$y=f(x*w+b)$$
$f$ is the activation function.
3.2 Properties of a good activation function
- Nonlinearity: only when the activation function is nonlinear can a multilayer neural network approximate arbitrary functions; with a linear activation, stacked layers collapse into a single linear map
- Differentiability: most optimizers update parameters by gradient descent
- Monotonicity: when the activation function is monotonic, the loss function of a single-layer network is guaranteed to be convex
- Approximate identity: f(x)≈x, so that the network is more stable when parameters are initialized to small random values
Output range of the activation function:
- When the output of the activation function is bounded, gradient-based optimization is more stable
- When the output of the activation function is unbounded, a smaller learning rate is recommended
3.3 Commonly used activation functions
3.3.1 Sigmoid function
tf.nn.sigmoid(x)
$$f(x)=\frac{1}{1+e^{-x}}$$


Characteristics:
(1) Prone to vanishing gradients
(2) Output is not zero-mean, so convergence is slow
(3) Involves exponentiation, which is costly to compute, so training takes longer
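To see why sigmoid is prone to vanishing gradients, it helps to look at its derivative, which is f(x)(1 - f(x)) and never exceeds 0.25. The sketch below uses only the standard library; the function names are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # derivative of the sigmoid


for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(x, sigmoid(x), sigmoid_grad(x))

# Upper bound on the gradient factor contributed by 10 stacked sigmoids
print(0.25 ** 10)
```

The derivative peaks at 0.25 for x = 0 and is nearly zero for large |x|, so the product of many such factors across layers shrinks quickly. All outputs are positive, which is the "not zero-mean" property noted above.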
3.3.2 Tanh function
tf.math.tanh(x)
$$f(x)=\frac{1-e^{-2x}}{1+e^{-2x}}$$


Characteristics:
(1) Output is zero-mean
(2) Prone to vanishing gradients
(3) Involves exponentiation, which is costly to compute, so training takes longer
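The tanh formula and its two main properties can be checked numerically. This is a small sketch using the standard library; the symmetric grid of inputs is an arbitrary choice made for illustration.

```python
import math


def tanh(x):
    # Same formula as in the text
    return (1 - math.exp(-2 * x)) / (1 + math.exp(-2 * x))


def tanh_grad(x):
    return 1 - tanh(x) ** 2   # derivative of tanh


xs = [i / 10 for i in range(-50, 51)]            # symmetric inputs in [-5, 5]
mean_out = sum(tanh(x) for x in xs) / len(xs)
print(mean_out)                                  # close to 0 for symmetric inputs
print(tanh_grad(0.0), tanh_grad(5.0))            # gradient is 1 at 0 and tiny at 5
```

For inputs that are symmetric around zero the outputs average to roughly zero, but the gradient still decays toward zero for large |x|, which is why vanishing gradients remain an issue.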
3.3.3 Relu function
tf.nn.relu(x)
$$f(x)=\max(x,0)$$


Advantages:
(1) Solves the vanishing gradient problem (in the positive range)
(2) Only needs to check whether the input is greater than 0, so it is fast to compute
(3) Converges much faster than sigmoid and tanh
Disadvantages:
(1) Output is not zero-mean, which slows convergence
(2) Dead ReLU problem: some neurons may never be activated, so their parameters are never updated.
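The Dead ReLU problem can be illustrated with a single hypothetical neuron y = relu(w * x + b). The weight, bias, and inputs below are made-up values chosen so that the pre-activation is negative for every input.

```python
def relu(x):
    return max(x, 0.0)


def relu_grad(x):
    return 1.0 if x > 0 else 0.0


# Hypothetical neuron whose pre-activation w * x + b is negative for all inputs
w, b = 0.5, -10.0
inputs = [0.0, 1.0, 2.0, 3.0]
grads_w = [relu_grad(w * x + b) * x for x in inputs]   # dy/dw for each input
print(grads_w)   # [0.0, 0.0, 0.0, 0.0]
```

Because every gradient is zero, gradient descent leaves w and b unchanged, and the neuron stays inactive for these inputs.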
3.3.4 Leaky Relu function
tf.nn.leaky_relu(x)
$$f(x)=\max(\alpha x,x)$$
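A plain-Python sketch of the formula, assuming 0 < α < 1. The default of 0.2 mirrors the default `alpha` of `tf.nn.leaky_relu`; the helper names are illustrative.

```python
def leaky_relu(x, alpha=0.2):
    return max(alpha * x, x)     # valid as long as 0 < alpha < 1


def leaky_relu_grad(x, alpha=0.2):
    return 1.0 if x > 0 else alpha


for x in (-5.0, -1.0, 0.5, 3.0):
    print(x, leaky_relu(x), leaky_relu_grad(x))
```

Unlike ReLU, the gradient in the negative range is α rather than 0, so a neuron with negative pre-activation can still be updated.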


3.4 Advice for beginners
- Prefer the relu activation function;
- Set the learning rate to a fairly small value;
- Standardize the input features, i.e. make them follow a normal distribution with mean 0 and standard deviation 1;
- Center the initial parameters, i.e. draw the random initial parameters from a normal distribution with mean 0 and standard deviation $\sqrt{\frac{2}{\text{number of input features of the current layer}}}$.
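The last two recommendations can be sketched with the standard library alone. The helper names, the feature values, and the layer sizes below are assumptions made for illustration; the standard deviation uses the square root of 2 divided by the number of input features.

```python
import math
import random


def standardize(values):
    """Rescale values to mean 0 and standard deviation 1."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def init_weights(n_in, n_out, rng):
    """Draw an n_in x n_out weight matrix from N(0, sqrt(2 / n_in))."""
    std = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]


feature = [5.1, 4.9, 6.3, 5.8, 7.0]          # made-up feature values
z = standardize(feature)
print(sum(z) / len(z))                       # approximately 0

rng = random.Random(0)                       # fixed seed for repeatability
weights = init_weights(n_in=200, n_out=100, rng=rng)
flat = [w for row in weights for w in row]
emp_std = math.sqrt(sum(w * w for w in flat) / len(flat))
print(emp_std, math.sqrt(2.0 / 200))         # empirical vs. target standard deviation
```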
4. Loss function
Loss function (loss): the gap between the predicted value (y) and the known answer (y_)
Optimization objective of a neural network: minimize the loss
4.1 Mean squared error (MSE)
loss_mse = tf.reduce_mean(tf.square(y_ - y))
$$MSE(y\_, y)=\frac{\sum_{i=1}^{n}(y-y\_)^2}{n}$$
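A short worked instance of the MSE formula in plain Python. The answer and prediction values are made up for illustration.

```python
def mse(y_true, y_pred):
    n = len(y_true)
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n


y_ = [1.0, 0.0, 2.0]      # known answers (made-up values)
y = [0.5, 0.0, 3.0]       # predictions (made-up values)
print(mse(y_, y))         # (0.25 + 0 + 1) / 3
```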
4.2 Cross entropy (CE)
Cross entropy loss function CE (Cross Entropy): characterizes the distance between two probability distributions
$$H(y\_,y)=-\sum y\_*\ln y$$
tf.losses.categorical_crossentropy(y_, y)
Example: in a two-class problem the known answer is y_=(1, 0), and the predictions are y1=(0.6, 0.4) and y2=(0.8, 0.2). Which one is closer to the standard answer?
$$H_1((1,0),(0.6,0.4)) = -(1*\ln 0.6 + 0*\ln 0.4) \approx -(-0.511 + 0) = 0.511$$
$$H_2((1,0),(0.8,0.2)) = -(1*\ln 0.8 + 0*\ln 0.2) \approx -(-0.223 + 0) = 0.223$$
Because $H_1 > H_2$, y2 is the more accurate prediction.
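The two values in the example can be verified with the standard library. The function name `cross_entropy` is illustrative; terms whose label is 0 are skipped because they contribute nothing to the sum.

```python
import math


def cross_entropy(y_true, y_pred):
    # Terms with y_true == 0 contribute nothing and are skipped
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)


h1 = cross_entropy((1, 0), (0.6, 0.4))
h2 = cross_entropy((1, 0), (0.8, 0.2))
print(round(h1, 3), round(h2, 3))   # 0.511 0.223
```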
Softmax combined with cross entropy
The output first passes through the softmax function so that it becomes a probability distribution, and then the cross entropy loss between y and y_ is computed.
tf.nn.softmax_cross_entropy_with_logits(y_, y)
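The two-step computation described above can be sketched in plain Python. This is an illustration of the idea rather than the TensorFlow implementation; the logits and labels are made-up values, and subtracting the maximum logit is a common numerical-stability trick that does not change the result.

```python
import math


def softmax(logits):
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_cross_entropy(y_true, logits):
    probs = softmax(logits)                   # step 1: logits -> probabilities
    # step 2: cross entropy between the labels and the probabilities
    return -sum(t * math.log(p) for t, p in zip(y_true, probs) if t > 0)


logits = [2.0, 1.0, 0.1]    # made-up raw network outputs
labels = [1, 0, 0]          # one-hot known answer
probs = softmax(logits)
print(probs, sum(probs))
print(softmax_cross_entropy(labels, logits))
```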
5. Underfitting and overfitting

The figure above shows underfitting, correct fitting and overfitting.
Remedies for underfitting
- Add input features
- Add network parameters
- Decrease the regularization coefficient
Remedies for overfitting
- Clean the data
- Enlarge the training set
- Use regularization
- Increase the regularization coefficient
6. Regularization to alleviate overfitting
Regularization adds a model complexity term to the loss function by putting a weight on W, which weakens the influence of noise in the training data (b is generally not regularized)
L1 regularization
$$loss_{L1}(w)=\sum_i |w_i|$$
L2 regularization
$$loss_{L2}(w)=\sum_i |w_i^2|$$
Choosing a regularizer
L1 regularization drives many parameters to exactly zero with high probability, so it sparsifies the parameters, i.e. it lowers complexity by reducing the number of nonzero parameters.
L2 regularization pushes parameters close to zero but not exactly to zero, so it lowers complexity by reducing the magnitude of the parameter values.
# x_train, y_train, w1, b1, w2 and b2 are assumed to be defined earlier in the training script
with tf.GradientTape() as tape:  # record the operations for the gradient
    h1 = tf.matmul(x_train, w1) + b1  # multiply-add of the first layer
    h1 = tf.nn.relu(h1)
    y = tf.matmul(h1, w2) + b2
    # mean squared error loss: mse = mean((y_train - y)^2)
    loss_mse = tf.reduce_mean(tf.square(y_train - y))
    # add L2 regularization
    loss_regularization = []
    # tf.nn.l2_loss(w) = sum(w ** 2) / 2
    loss_regularization.append(tf.nn.l2_loss(w1))
    loss_regularization.append(tf.nn.l2_loss(w2))
    loss_regularization = tf.reduce_sum(loss_regularization)
    loss = loss_mse + 0.03 * loss_regularization  # REGULARIZER = 0.03
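The two penalty terms and the combined loss can be worked through by hand in plain Python. The weights and the data loss below are made-up numbers; the division by 2 mirrors the `tf.nn.l2_loss` convention noted in the snippet above.

```python
def l1_penalty(weights):
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    return sum(w * w for w in weights)


weights = [0.5, -1.5, 0.0, 2.0]     # made-up weight values
loss_mse = 0.8                      # made-up data loss
REGULARIZER = 0.03

print(l1_penalty(weights), l2_penalty(weights))   # 4.0 6.5

# Same convention as tf.nn.l2_loss: sum(w ** 2) / 2
loss = loss_mse + REGULARIZER * l2_penalty(weights) / 2
print(loss)
```

A larger REGULARIZER puts more pressure on the weights relative to the data loss, which is the knob referred to in the underfitting and overfitting remedies.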


7. Neural network parameter optimizers
Given the parameter to be optimized w, the loss function loss, and the learning rate lr, with one batch per iteration and t denoting the total number of batch iterations so far:
- Compute the gradient of the loss with respect to the current parameter at step t: $g_t=\nabla loss=\frac{\partial loss}{\partial(w_t)}$
- Compute the first-order momentum $m_t$ and the second-order momentum $V_t$ at step t
- Compute the descent step at step t: $\eta_t=lr*m_t/\sqrt{V_t}$
- Compute the parameter at step t+1:
$$w_{t+1}=w_t-\eta_t=w_t-lr*m_t/\sqrt{V_t}$$
First-order momentum: a function of the gradient
Second-order momentum: a function of the squared gradient
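The four steps above can be sketched as a single update function. This is a minimal illustration on the toy loss (w + 1)^2 used earlier, with the simplest possible choice of momenta (m_t = g_t, V_t = 1); the function name `optimizer_step` is hypothetical.

```python
import math


def optimizer_step(w, m_t, v_t, lr):
    """Steps 3 and 4 of the recipe: eta_t = lr * m_t / sqrt(V_t), then w - eta_t."""
    eta_t = lr * m_t / math.sqrt(v_t)
    return w - eta_t


w = 5.0
for t in range(1, 4):
    g_t = 2 * (w + 1)          # step 1: gradient of the toy loss (w + 1)^2
    m_t, v_t = g_t, 1.0        # step 2: simplest momenta, m_t = g_t and V_t = 1
    w = optimizer_step(w, m_t, v_t, lr=0.2)
    print(t, w)
```

Every optimizer below differs only in how step 2 defines m_t and V_t.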
7.1 SGD
SGD (without momentum) is the commonly used gradient descent method
$$m_t=g_t$$
$$V_t=1$$
$$\eta_t=lr*m_t/\sqrt{V_t}=lr*g_t$$
$$w_{t+1}=w_t-\eta_t=w_t-lr*g_t$$
# Compute the gradient of loss with respect to each parameter
grads = tape.gradient(loss, [w1, b1])
# Gradient update: w1 = w1 - lr * w1_grad, b1 = b1 - lr * b1_grad
w1.assign_sub(lr * grads[0])  # update w1 in place
b1.assign_sub(lr * grads[1])  # update b1 in place
7.2 SGDM
SGDM (SGD with momentum) adds first-order momentum to SGD
$$m_t=\beta \cdot m_{t-1} + (1-\beta)\cdot g_t$$
$$V_t=1$$
$$\eta_t=lr*m_t/\sqrt{V_t}=lr*(\beta \cdot m_{t-1} + (1-\beta)\cdot g_t)$$
$$w_{t+1}=w_t-\eta_t=w_t-lr*(\beta \cdot m_{t-1} + (1-\beta)\cdot g_t)$$
# Before the training loop: initialize the first-order momentum
m_w, m_b = 0, 0
beta = 0.9
# Inside the training loop, after grads has been computed
m_w = beta * m_w + (1 - beta) * grads[0]
m_b = beta * m_b + (1 - beta) * grads[1]
w1.assign_sub(lr * m_w)
b1.assign_sub(lr * m_b)
7.3 Adagrad
Adagrad adds second-order momentum to SGD
$$m_t=g_t$$
$$V_t=\sum^t_{\tau=1}g^2_{\tau}$$
$$\eta_t=lr*m_t/\sqrt{V_t}=lr*g_t/\sqrt{\sum^t_{\tau=1}g^2_{\tau}}$$
$$w_{t+1}=w_t-\eta_t=w_t-lr*g_t/\sqrt{\sum^t_{\tau=1}g^2_{\tau}}$$
# Before the training loop: initialize the second-order momentum
v_w, v_b = 0, 0
# Inside the training loop, after grads has been computed
v_w += tf.square(grads[0])
v_b += tf.square(grads[1])
w1.assign_sub(lr * grads[0] / tf.sqrt(v_w))
b1.assign_sub(lr * grads[1] / tf.sqrt(v_b))
7.4 RMSProp
RMSProp adds second-order momentum to SGD, using an exponential moving average of the squared gradient instead of Adagrad's running sum
$$m_t=g_t$$
$$V_t=\beta \cdot V_{t-1}+(1-\beta)\cdot g_t^2$$
$$\eta_t = lr*m_t/\sqrt{V_t}=lr*g_t/\sqrt{\beta \cdot V_{t-1}+(1-\beta)\cdot g_t^2}$$
$$w_{t+1}=w_t-\eta_t=w_t-lr*g_t/\sqrt{\beta \cdot V_{t-1}+(1-\beta)\cdot g_t^2}$$
# Before the training loop: initialize the second-order momentum
v_w, v_b = 0, 0
beta = 0.9
# Inside the training loop, after grads has been computed
v_w = beta * v_w + (1 - beta) * tf.square(grads[0])
v_b = beta * v_b + (1 - beta) * tf.square(grads[1])
w1.assign_sub(lr * grads[0] / tf.sqrt(v_w))
b1.assign_sub(lr * grads[1] / tf.sqrt(v_b))
7.5 Adam
Adam combines the first-order momentum of SGDM with the second-order momentum of RMSProp
$$m_t=\beta_1 \cdot m_{t-1} + (1-\beta_1)\cdot g_t$$
Bias-corrected first-order momentum:
$$\hat{m_t}=\frac{m_t}{1-\beta_1^t}$$
$$V_t=\beta_2 \cdot V_{t-1}+(1-\beta_2)\cdot g_t^2$$
Bias-corrected second-order momentum:
$$\hat{V_t}=\frac{V_t}{1-\beta_2^t}$$
$$\eta_t = lr*\hat{m_t}/\sqrt{\hat{V_t}}=lr*\frac{m_t}{1-\beta_1^t}/\sqrt{\frac{V_t}{1-\beta_2^t}}$$
$$w_{t+1}=w_t-\eta_t=w_t-lr*\frac{m_t}{1-\beta_1^t}/\sqrt{\frac{V_t}{1-\beta_2^t}}$$
# Before the training loop: initialize both momenta and the step counter
m_w, m_b = 0, 0
v_w, v_b = 0, 0
beta1, beta2 = 0.9, 0.999
delta_w, delta_b = 0, 0
global_step = 0
# Inside the training loop, after grads has been computed
global_step += 1  # count from 1; with global_step == 0 the corrections below would divide by zero
m_w = beta1 * m_w + (1 - beta1) * grads[0]
m_b = beta1 * m_b + (1 - beta1) * grads[1]
v_w = beta2 * v_w + (1 - beta2) * tf.square(grads[0])
v_b = beta2 * v_b + (1 - beta2) * tf.square(grads[1])
m_w_correction = m_w / (1 - tf.pow(beta1, int(global_step)))
m_b_correction = m_b / (1 - tf.pow(beta1, int(global_step)))
v_w_correction = v_w / (1 - tf.pow(beta2, int(global_step)))
v_b_correction = v_b / (1 - tf.pow(beta2, int(global_step)))
w1.assign_sub(lr * m_w_correction / tf.sqrt(v_w_correction))
b1.assign_sub(lr * m_b_correction / tf.sqrt(v_b_correction))
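To compare the five update rules side by side, they can be run on the toy loss (w + 1)^2 from the learning rate section. This is a rough scalar sketch in plain Python, not the TensorFlow code above: the `run` function, the learning rate of 0.1, and the 200 steps are assumptions chosen for illustration, and RMSProp reuses `beta1` as its decay rate.

```python
import math


def grad(w):
    return 2 * (w + 1)        # gradient of the toy loss (w + 1)^2


def run(optimizer, steps=200, lr=0.1, beta1=0.9, beta2=0.999):
    w, m, v = 5.0, 0.0, 0.0
    for t in range(1, steps + 1):      # t starts at 1 so 1 - beta ** t is never 0
        g = grad(w)
        if optimizer == "sgd":
            m_t, v_t = g, 1.0
        elif optimizer == "sgdm":
            m = beta1 * m + (1 - beta1) * g
            m_t, v_t = m, 1.0
        elif optimizer == "adagrad":
            v += g * g                          # running sum of squared gradients
            m_t, v_t = g, v
        elif optimizer == "rmsprop":
            v = beta1 * v + (1 - beta1) * g * g  # moving average of squared gradients
            m_t, v_t = g, v
        elif optimizer == "adam":
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g * g
            m_t = m / (1 - beta1 ** t)          # bias-corrected first-order momentum
            v_t = v / (1 - beta2 ** t)          # bias-corrected second-order momentum
        else:
            raise ValueError(optimizer)
        w -= lr * m_t / math.sqrt(v_t)
    return w


for name in ("sgd", "sgdm", "adagrad", "rmsprop", "adam"):
    print(name, run(name))
```

With bias correction, Adam's first step has magnitude lr regardless of the gradient scale, since the corrected momenta reduce to g and g squared at t = 1. How close each optimizer gets to the minimum at w = -1 depends on the chosen learning rate and number of steps.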