Training Methods of Deep Learning Networks
2022-06-29 15:39:00 【Maomaozhen nice】
We know that when we use a deep learning network model, we usually just feed the input into the network and get a good prediction. But do you know the rules and the concrete computations at work inside the network model? Today we will talk about how deep learning networks are trained and unlock the secrets inside the black box.
All the weights and biases in deep learning are trained by minimizing a loss function. It is usually difficult to compute the global minimum of the loss function analytically, but the gradient of the loss function with respect to the weights can be computed analytically, so the loss function can be minimized by iterative numerical optimization. The simplest such optimization algorithm is gradient descent. In a convolutional neural network, the gradient of the loss function with respect to the weights can be computed very efficiently by the backpropagation algorithm.
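As an illustration of the gradient descent update w ← w − η·∇L(w), here is a minimal sketch on a toy least-squares loss; the data, learning rate, and iteration count are illustrative choices, not taken from the book.

```python
import numpy as np

# Toy example: minimize L(w) = ||X w - y||^2 / n by gradient descent.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)          # initial weights
lr = 0.01                # learning rate (step size)
for step in range(1000):
    grad = 2 * X.T @ (X @ w - y) / len(y)  # analytic gradient of the loss
    w -= lr * grad                         # gradient descent update
print(w)  # approaches true_w
```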
1. The backpropagation algorithm in convolutional neural networks
The backpropagation algorithm first computes an intermediate variable called the "error term" and then relates it to the gradient of the loss function with respect to the weights. For each node of the output layer, the error term is the difference between the true class label and the probability predicted by the model. The error terms are then propagated backward, layer by layer, all the way to the input layer.
Although a pooling layer contains no trainable weights, the error terms on its nodes still have to be propagated back toward the preceding layer.
Once the error terms of every layer have been computed, the gradients of the loss function with respect to the convolution kernels and biases can be calculated conveniently. A small sketch of these error terms follows.
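To make the error terms concrete, here is a minimal NumPy sketch of backpropagation through a tiny fully connected network with a softmax output; the layer sizes, variable names, and ReLU hidden layer are illustrative assumptions, not the book's configuration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

x = np.random.randn(4, 8)            # batch of 4 inputs, 8 features
y = np.eye(3)[[0, 2, 1, 0]]          # one-hot labels, 3 classes
W1, b1 = np.random.randn(8, 16) * 0.01, np.zeros(16)
W2, b2 = np.random.randn(16, 3) * 0.01, np.zeros(3)

h = np.maximum(0, x @ W1 + b1)       # hidden layer with ReLU
p = softmax(h @ W2 + b2)             # predicted class probabilities

delta2 = p - y                       # output-layer error term: prediction vs. true label
delta1 = (delta2 @ W2.T) * (h > 0)   # error term propagated backward through ReLU

grad_W2 = h.T @ delta2 / len(x)      # gradients of the loss w.r.t. the weights
grad_W1 = x.T @ delta1 / len(x)
```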
2. Mini-batch gradient descent with momentum
In batch gradient descent, every iteration uses all training samples to estimate the gradient of the loss function with respect to the weights. However, even the true gradient is only the direction in which the loss function decreases fastest locally; it does not point toward the global minimum, so estimating the gradient ever more accurately brings little benefit. Using fewer training samples to estimate the gradient in each iteration and increasing the number of weight updates instead helps to search more of the parameter space and to converge to a better minimum. Therefore, in practice, a subset of the training samples is selected to estimate the gradient in each iteration; this method is called mini-batch gradient descent. Within each mini-batch, the number of training samples of each class should be roughly equal.
Momentum is an effective way to improve training speed. When accumulated, gradients with opposite signs cancel, reducing the back-and-forth oscillation in that direction, while gradients with the same sign add up and accelerate learning in that direction. In the momentum method, the current gradient estimate is not used to update the weights directly; instead, it updates a velocity parameter, and the weights are then updated with the current velocity, as sketched below.
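The following is a minimal sketch of mini-batch gradient descent with a momentum term; `grad_fn`, the data layout, and the hyper-parameter values are assumptions made for illustration.

```python
import numpy as np

def sgd_momentum(w, grad_fn, data, lr=0.01, mu=0.9, batch_size=32, epochs=10):
    """Mini-batch SGD with a momentum (velocity) term."""
    v = np.zeros_like(w)                    # velocity parameter
    n = len(data)
    for _ in range(epochs):
        np.random.shuffle(data)             # reshuffle so batches vary each epoch
        for i in range(0, n, batch_size):
            batch = data[i:i + batch_size]  # small subset of the training set
            g = grad_fn(w, batch)           # gradient estimated on the mini-batch
            v = mu * v - lr * g             # update the velocity, not the weights directly
            w = w + v                       # then update the weights with the velocity
    return w
```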
3. Weight initialization
For most deep convolutional networks, the weights are initialized from a Gaussian distribution with mean 0 and standard deviation 0.01, and the biases are initialized to the constant 0.1. Initializing the weights randomly is very important: if all weights were initialized to the same value, every node would initially output the same value and receive the same gradient during backpropagation, so all the weights would be updated by the same amount in every iteration. In general, if the ReLU nonlinear activation function is used, the weights should be initialized from a Gaussian distribution with mean 0 and standard deviation √(2/n), where n is the number of inputs to each node. Both schemes are sketched below.
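Here is a short sketch of the two initialization schemes just described; the layer sizes are illustrative placeholders.

```python
import numpy as np

def gaussian_init(fan_in, fan_out, std=0.01, bias=0.1):
    # plain scheme: N(0, 0.01) weights, constant 0.1 biases
    return np.random.randn(fan_in, fan_out) * std, np.full(fan_out, bias)

def he_init(fan_in, fan_out):
    # scheme for ReLU layers: N(0, sqrt(2 / fan_in)) weights, where fan_in
    # is the number of inputs to each node
    return np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in), np.zeros(fan_out)

W, b = he_init(256, 128)
```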
4. Learning rate
It is usually advantageous to gradually reduce the learning rate during training. It is best to start with a relatively large learning rate so that the weights adjust quickly, but if a high learning rate is kept throughout, the weights will eventually oscillate back and forth and fail to converge to a good value. The starting value of the learning rate is generally on the order of 0.01 or 0.001; concretely, pick the value that makes the loss function decrease as quickly as possible. A heuristic for reducing the learning rate is to monitor performance on the validation set during training: if the validation accuracy stops improving for a period of time, reduce the learning rate to 1/10 or 1/2 of its previous value, as sketched below.
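A hand-rolled version of this heuristic might look as follows; the function name, patience window, and thresholds are all illustrative assumptions.

```python
def adjust_learning_rate(lr, val_acc_history, patience=5, factor=0.1, min_lr=1e-6):
    """Divide the learning rate by 10 if validation accuracy has not improved
    on its previous best for `patience` epochs."""
    if len(val_acc_history) > patience:
        recent_best = max(val_acc_history[-patience:])
        earlier_best = max(val_acc_history[:-patience])
        if recent_best <= earlier_best:
            lr = max(lr * factor, min_lr)
    return lr

lr = 0.01
val_acc_history = []
# inside the training loop (evaluate is an assumed helper):
# val_acc_history.append(evaluate(model, val_set))
# lr = adjust_learning_rate(lr, val_acc_history)
```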
5. Early stopping
When training a deep model, it is commonly observed that the accuracy on the validation set first increases gradually as training progresses, but after a while it starts to decline. Early stopping is a regularization method that keeps the model from passing from underfitting into overfitting. The usual practice is to save the weights every time the validation accuracy improves; at the end of training, the stored weights that achieved the highest validation accuracy are kept instead of the weights obtained in the last iteration. A sketch of this procedure is given below.
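The following sketch keeps the weights with the best validation accuracy and stops once that accuracy has not improved for a while; `train_one_epoch` and `evaluate` are assumed helper functions, and the hyper-parameters are illustrative.

```python
import copy

def train_with_early_stopping(model, train_set, val_set, max_epochs=100, patience=10):
    best_acc, best_model, epochs_without_improvement = 0.0, None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_set)        # assumed helper
        acc = evaluate(model, val_set)           # assumed helper: validation accuracy
        if acc > best_acc:
            best_acc = acc
            best_model = copy.deepcopy(model)    # keep the weights at this point
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                            # stop training early
    return best_model if best_model is not None else model
```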
Note: this article is excerpted from Intelligent Interpretation of Synthetic Aperture Radar Images, by Xu Feng et al.