Training Methods of Deep Learning Networks
2022-06-29 15:39:00 【Maomaozhen nice】
We know that when we use a deep learning network model, we usually just feed the input into the network and get a good prediction. But do you know the rules and the concrete computations at work inside the network model? Today we will talk about how deep learning networks are trained and unlock the secrets inside the black box.
All the weights and biases in deep learning are trained by minimizing a loss function. It is usually difficult to compute the global minimum of the loss function analytically, but the gradient of the loss function with respect to the weights can be computed analytically, so the loss function can be minimized by iterative numerical optimization. The simplest such optimization algorithm is gradient descent. In a convolutional neural network, the gradient of the loss function with respect to the weights can be computed very efficiently by the backpropagation algorithm.
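As an illustration of the gradient descent update w ← w − η·∇L(w), here is a minimal sketch on a toy least-squares loss; the data, learning rate, and iteration count are illustrative choices, not taken from the book.

```python
import numpy as np

# Toy example: minimize L(w) = ||X w - y||^2 / n by gradient descent.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)          # initial weights
lr = 0.01                # learning rate (step size)
for step in range(1000):
    grad = 2 * X.T @ (X @ w - y) / len(y)  # analytic gradient of the loss
    w -= lr * grad                         # gradient descent update
print(w)  # approaches true_w
```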
1. The backpropagation algorithm in convolutional neural networks
The backpropagation algorithm first computes an intermediate variable called the "error term" and then relates it to the gradient of the loss function with respect to the weights. For each node of the output layer, the error term is the difference between the true class label and the probability predicted by the model. The error terms are then propagated backward, layer by layer, all the way to the input layer.
Although a pooling layer contains no trainable weights, the error terms on its nodes still have to be propagated back toward the preceding layer.
Once the error terms of every layer have been computed, the gradients of the loss function with respect to the convolution kernels and biases can be calculated conveniently. A small sketch of these error terms follows.
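To make the error terms concrete, here is a minimal NumPy sketch of backpropagation through a tiny fully connected network with a softmax output; the layer sizes, variable names, and ReLU hidden layer are illustrative assumptions, not the book's configuration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

x = np.random.randn(4, 8)            # batch of 4 inputs, 8 features
y = np.eye(3)[[0, 2, 1, 0]]          # one-hot labels, 3 classes
W1, b1 = np.random.randn(8, 16) * 0.01, np.zeros(16)
W2, b2 = np.random.randn(16, 3) * 0.01, np.zeros(3)

h = np.maximum(0, x @ W1 + b1)       # hidden layer with ReLU
p = softmax(h @ W2 + b2)             # predicted class probabilities

delta2 = p - y                       # output-layer error term: prediction vs. true label
delta1 = (delta2 @ W2.T) * (h > 0)   # error term propagated backward through ReLU

grad_W2 = h.T @ delta2 / len(x)      # gradients of the loss w.r.t. the weights
grad_W1 = x.T @ delta1 / len(x)
```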
2. Mini-batch gradient descent with momentum
In batch gradient descent, every iteration uses all training samples to estimate the gradient of the loss function with respect to the weights. However, even the true gradient is only the direction in which the loss function decreases fastest locally; it does not point toward the global minimum, so estimating the gradient ever more accurately brings little benefit. Using fewer training samples to estimate the gradient in each iteration and increasing the number of weight updates instead helps to search more of the parameter space and to converge to a better minimum. Therefore, in practice, a subset of the training samples is selected to estimate the gradient in each iteration; this method is called mini-batch gradient descent. Within each mini-batch, the number of training samples of each class should be roughly equal.
Momentum is an effective way to improve training speed. When accumulated, gradients with opposite signs cancel, reducing the back-and-forth oscillation in that direction, while gradients with the same sign add up and accelerate learning in that direction. In the momentum method, the current gradient estimate is not used to update the weights directly; instead, it updates a velocity parameter, and the weights are then updated with the current velocity, as sketched below.
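The following is a minimal sketch of mini-batch gradient descent with a momentum term; `grad_fn`, the data layout, and the hyper-parameter values are assumptions made for illustration.

```python
import numpy as np

def sgd_momentum(w, grad_fn, data, lr=0.01, mu=0.9, batch_size=32, epochs=10):
    """Mini-batch SGD with a momentum (velocity) term."""
    v = np.zeros_like(w)                    # velocity parameter
    n = len(data)
    for _ in range(epochs):
        np.random.shuffle(data)             # reshuffle so batches vary each epoch
        for i in range(0, n, batch_size):
            batch = data[i:i + batch_size]  # small subset of the training set
            g = grad_fn(w, batch)           # gradient estimated on the mini-batch
            v = mu * v - lr * g             # update the velocity, not the weights directly
            w = w + v                       # then update the weights with the velocity
    return w
```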
3. Weight initialization
For most deep convolutional networks, the weights are initialized from a Gaussian distribution with mean 0 and standard deviation 0.01, and the biases are initialized to the constant 0.1. Initializing the weights randomly is very important: if all weights were initialized to the same value, every node would initially output the same value and receive the same gradient during backpropagation, so all the weights would be updated by the same amount in every iteration. In general, if the ReLU nonlinear activation function is used, the weights should be initialized from a Gaussian distribution with mean 0 and standard deviation √(2/n), where n is the number of inputs to each node. Both schemes are sketched below.
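Here is a short sketch of the two initialization schemes just described; the layer sizes are illustrative placeholders.

```python
import numpy as np

def gaussian_init(fan_in, fan_out, std=0.01, bias=0.1):
    # plain scheme: N(0, 0.01) weights, constant 0.1 biases
    return np.random.randn(fan_in, fan_out) * std, np.full(fan_out, bias)

def he_init(fan_in, fan_out):
    # scheme for ReLU layers: N(0, sqrt(2 / fan_in)) weights, where fan_in
    # is the number of inputs to each node
    return np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in), np.zeros(fan_out)

W, b = he_init(256, 128)
```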
4. Learning rate
It is usually advantageous to gradually reduce the learning rate during training. It is best to start with a relatively large learning rate so that the weights adjust quickly, but if a high learning rate is kept throughout, the weights will eventually oscillate back and forth and fail to converge to a good value. The starting value of the learning rate is generally on the order of 0.01 or 0.001; concretely, pick the value that makes the loss function decrease as quickly as possible. A heuristic for reducing the learning rate is to monitor performance on the validation set during training: if the validation accuracy stops improving for a period of time, reduce the learning rate to 1/10 or 1/2 of its previous value, as sketched below.
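A hand-rolled version of this heuristic might look as follows; the function name, patience window, and thresholds are all illustrative assumptions.

```python
def adjust_learning_rate(lr, val_acc_history, patience=5, factor=0.1, min_lr=1e-6):
    """Divide the learning rate by 10 if validation accuracy has not improved
    on its previous best for `patience` epochs."""
    if len(val_acc_history) > patience:
        recent_best = max(val_acc_history[-patience:])
        earlier_best = max(val_acc_history[:-patience])
        if recent_best <= earlier_best:
            lr = max(lr * factor, min_lr)
    return lr

lr = 0.01
val_acc_history = []
# inside the training loop (evaluate is an assumed helper):
# val_acc_history.append(evaluate(model, val_set))
# lr = adjust_learning_rate(lr, val_acc_history)
```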
5. Early stopping
When training a deep model, it is commonly observed that the accuracy on the validation set first increases gradually as training progresses, but after a while it starts to decline. Early stopping is a regularization method that keeps the model from passing from underfitting into overfitting. The usual practice is to save the weights every time the validation accuracy improves; at the end of training, the stored weights that achieved the highest validation accuracy are kept instead of the weights obtained in the last iteration. A sketch of this procedure is given below.
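The following sketch keeps the weights with the best validation accuracy and stops once that accuracy has not improved for a while; `train_one_epoch` and `evaluate` are assumed helper functions, and the hyper-parameters are illustrative.

```python
import copy

def train_with_early_stopping(model, train_set, val_set, max_epochs=100, patience=10):
    best_acc, best_model, epochs_without_improvement = 0.0, None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_set)        # assumed helper
        acc = evaluate(model, val_set)           # assumed helper: validation accuracy
        if acc > best_acc:
            best_acc = acc
            best_model = copy.deepcopy(model)    # keep the weights at this point
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                            # stop training early
    return best_model if best_model is not None else model
```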
Note: this article is excerpted from Intelligent Interpretation of Synthetic Aperture Radar Images, by Xu Feng et al.