Deep learning: solutions to overfitting in deep neural networks
2022-07-02 01:51:00 【ShadyPi】
L2 Regularization
Consistent with the regularization convention in machine learning, we add the Frobenius norm of every weight matrix to the cost function, i.e. the sum of the squares of all weight elements:
$$
J(W^{[1]},b^{[1]},\cdots,W^{[L]},b^{[L]})=\frac{1}{m}\sum_{i=1}^m\mathcal{L}(\hat{y}^{(i)},y^{(i)})+\frac{\lambda}{2m}\sum_{l=1}^L||W^{[l]}||^2_F
$$

$$
||W^{[l]}||^2_F=\sum_{i=1}^{n^{[l]}}\sum_{j=1}^{n^{[l-1]}}\left(w_{ij}^{[l]}\right)^2
$$
The gradient has the same form as in machine-learning regularization, with an extra $\frac{\lambda}{m}W$ term, so the backpropagation formulas become
$$
\begin{aligned}
dZ^{[l]}&=dA^{[l]}*g^{[l]\prime}(Z^{[l]})\\
dW^{[l]}&=\frac{1}{m}dZ^{[l]}A^{[l-1]T}+\frac{\lambda}{m}W^{[l]}\\
db^{[l]}&=\frac{1}{m}\mathrm{np.sum}(dZ^{[l]},\mathrm{axis}=1,\mathrm{keepdims}=\mathrm{True})\\
dA^{[l-1]}&=W^{[l]T}dZ^{[l]}
\end{aligned}
$$
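A minimal numpy sketch of these two formulas, the regularized cost and the per-layer gradient. The dictionary layout (`parameters["W1"]`, …) and the function names are illustrative assumptions, not part of the original post:

```python
import numpy as np

def l2_cost(cross_entropy_cost, parameters, lambd, m, L):
    """Add the penalty (lambda / 2m) * sum_l ||W[l]||_F^2 to the cross-entropy cost."""
    l2_term = 0.0
    for l in range(1, L + 1):
        W = parameters["W" + str(l)]
        l2_term += np.sum(np.square(W))          # ||W[l]||_F^2
    return cross_entropy_cost + (lambd / (2 * m)) * l2_term

def backward_with_l2(dZ, A_prev, W, lambd, m):
    """Gradients for one layer, including the extra (lambda / m) * W term in dW."""
    dW = (1.0 / m) * np.dot(dZ, A_prev.T) + (lambd / m) * W
    db = (1.0 / m) * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = np.dot(W.T, dZ)
    return dW, db, dA_prev
```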
The idea is roughly this: adding the norm to the cost function pushes the algorithm to shrink the weights (towards 0), which is equivalent to reducing the influence of the hidden units. With effectively fewer hidden units, the model moves from high variance towards high bias, and with a suitable $\lambda$ you get a reasonable model in between.
Why not also regularize the bias $b$? Because compared with $W$, $b$ has very few elements, so its norm is tiny relative to $W$'s. The "attention" of the cost function therefore stays on the elements of $W$, and whether or not $b$ participates in regularization makes little difference, so we don't waste time on it.
The drawback is that $\lambda$ is a hyperparameter to tune, which requires repeated training.
Random deactivation (dropout)
Random deactivation means that on every training pass, some units are randomly disabled and all connections to them are removed; forward and backward propagation are then run on this thinned network to train the model.
A common implementation is inverted dropout. For the activation vector $a^{[l]}$ of the $l$-th hidden layer, we generate a random vector of the same size with each element drawn from $[0,1]$. Elements below a set threshold are set to 1 and elements above it are set to 0; multiplying this mask element-wise with $a^{[l]}$ randomly zeroes out some units. The threshold is called keep-prob, i.e. the probability of keeping a unit.
After zeroing out units, the remaining ones must be divided by keep-prob. Zeroing shrinks the expected output to keep-prob times its original value, so to keep the expectation unchanged (which is very helpful at test time), we divide by keep-prob.
At test time we no longer drop units randomly; we just run the normal forward pass, because training already kept the expected output of the model unchanged.
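A minimal sketch of inverted dropout for one layer's activations; the function name and return values are illustrative, assuming `A` is a numpy activation matrix:

```python
import numpy as np

def inverted_dropout(A, keep_prob):
    """Randomly zero units of A and rescale so the expected value is unchanged."""
    D = np.random.rand(*A.shape) < keep_prob   # mask: 1 with probability keep_prob, else 0
    A = A * D                                  # drop the masked units
    A = A / keep_prob                          # scale up to preserve E[A]
    return A, D                                # D is reused to mask dA in backprop

# Training: A2, D2 = inverted_dropout(A2, keep_prob=0.8)
# Testing: skip dropout entirely and run the normal forward pass
```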
Dropout regularization works because each training pass effectively runs on a smaller neural network, and small networks are less prone to overfitting. Also, since any unit can be zeroed out, the model cannot depend on any single feature and instead tries to spread its weights across all the incoming information.
In addition, keep-prob can differ per layer. Hidden layers with many units are more likely to overfit, so we set their keep-prob smaller; layers unlikely to overfit can use a larger value, even 1 to keep everything. The price, of course, is more hyperparameters…
Another drawback is that with dropout the cost function is no longer well defined; if you force a cost-versus-iteration plot, the curve will probably not decrease monotonically.
Data augmentation (dataset expansion)
Simply put, use a larger training set and overfitting goes away. If you cannot collect more data, you can create new examples by applying transformations (flips, crops, small distortions, etc.) to the data you already have, as in the sketch below.
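A small sketch of such transformations, assuming each image is an H x W x C numpy array with H, W > 4; the particular transformations chosen here are just illustrative:

```python
import numpy as np

def augment(image):
    """Return a few simple variants of an image: a horizontal mirror and a slight re-crop."""
    variants = [image, np.fliplr(image)]                  # original + horizontal flip
    h, w = image.shape[:2]
    crop = image[2:h - 2, 2:w - 2]                        # trim 2 pixels from each side
    padded = np.pad(crop, ((2, 2), (2, 2), (0, 0)), mode="edge")  # pad back to original size
    variants.append(padded)
    return variants
```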
Early stopping
Simply put, stop training partway through the process in which the model gradually moves from high bias to high variance; if you stop at the right moment, you get a good model. A more scientific criterion is to plot the cost curve on the validation set and stop where the validation cost is lowest.
The advantage of this method is that there are no hyperparameters to tune: we just note where the validation cost reaches its minimum during the first run, then restart training and stop at that iteration count, which is simple and fast. The disadvantage is that it mixes two tasks, driving down the cost function and avoiding overfitting, which violates the idea of orthogonalization (one module does one thing) and can sometimes complicate matters.
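A minimal sketch of the stopping rule, assuming `train_step` runs one iteration of gradient descent and `eval_dev_cost` returns the current validation-set cost; both callables and the patience-based criterion are illustrative assumptions:

```python
def train_with_early_stopping(train_step, eval_dev_cost, max_iters, patience=10):
    """Stop once the validation cost has not improved for `patience` consecutive iterations."""
    best_cost, best_iter, waited = float("inf"), 0, 0
    for it in range(max_iters):
        train_step()                      # one iteration of gradient descent
        dev_cost = eval_dev_cost()        # cost on the validation set
        if dev_cost < best_cost:
            best_cost, best_iter, waited = dev_cost, it, 0
        else:
            waited += 1
            if waited >= patience:        # validation cost has stopped improving
                break
    return best_iter, best_cost
```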