Deep learning: implementation techniques for deep neural networks

2022-06-30 03:12:00 【ShadyPi】

Contents

Normalization
Weight initialization
Gradient checking

Normalization

This is the same idea as feature scaling, which we used many times in the earlier machine learning courses. Its main purpose is to rescale the feature values so that they fall roughly within a high-dimensional sphere centered at the origin. Compute the mean vector $\mu$ and the standard deviation vector $\sigma$ over the $m$ samples (the square is taken element-wise), namely
$\mu=\frac{1}{m}\sum_{i=1}^mx^{(i)}\\ \sigma^2=\frac{1}{m}\sum_{i=1}^m(x^{(i)}-\mu)^2$
Then set $x:=\frac{x-\mu}{\sigma}$, and the normalization is complete.
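
The formulas above can be sketched in NumPy as follows. This is only a minimal illustration: the function name `normalize_features`, the row-per-sample layout, and the small `eps` added to the denominator are assumptions made for this example and are not part of the original text.

```python
import numpy as np

def normalize_features(X, eps=1e-8):
    """Standardize X of shape (m, n): m samples as rows, n features as columns."""
    mu = X.mean(axis=0)     # mean vector, one entry per feature
    sigma = X.std(axis=0)   # standard deviation vector (divides by m, as in the formula)
    # eps is an assumption added here to avoid dividing by zero for constant features
    X_norm = (X - mu) / (sigma + eps)
    return X_norm, mu, sigma

# Synthetic data with two features on very different scales
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[5.0, -3.0], scale=[10.0, 0.1], size=(1000, 2))
X_train_norm, mu, sigma = normalize_features(X_train)
```

In practice, the same `mu` and `sigma` computed on the training set would normally be reused to normalize the test set, so that both sets go through the same transformation.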

Weight initialization

Deep neural networks sometimes suffer from exploding or vanishing gradients. In a deep network, propagating from one end to the other multiplies many weight matrices together. Even if each weight matrix is only slightly larger or smaller than the identity matrix, the product grows or shrinks exponentially with depth, so the values become very large or very small. The same holds for the gradients, which is the exploding/vanishing gradient problem.
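
A small numerical sketch may make the exponential effect concrete. It assumes a toy network with no activation function and no bias, in which every layer uses the same weight matrix `scale * I`; the function name `forward_norm` and the chosen depth are hypothetical and only serve this illustration.

```python
import numpy as np

def forward_norm(scale, depth=100, n=4):
    """Norm of the output after `depth` linear layers, each with W = scale * I."""
    W = scale * np.eye(n)
    a = np.ones(n)
    for _ in range(depth):
        a = W @ a   # one layer of a purely linear toy network
    return np.linalg.norm(a)

large = forward_norm(1.1)   # weights slightly larger than the identity
small = forward_norm(0.9)   # weights slightly smaller than the identity
print(large, small)
```

With 100 layers, the first value scales like $1.1^{100}$ and the second like $0.9^{100}$, so a 10% deviation from the identity already separates the two results by many orders of magnitude.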

Sensible initialization can substantially alleviate this problem. Recall that forward propagation computes
$A^{[l]}=\sigma(Z^{[l]}) =\sigma(W^{[l]}A^{[l-1]}+b^{[l]})$
As the formula shows, the more nodes feed into this layer (that is, the larger the number of nodes in the previous layer, $n^{[l-1]}$), the larger $Z^{[l]}$ tends to be in magnitude; with fewer inputs it tends to be smaller. So we initialize the weights from a normal distribution with mean 0 and variance $\frac{C}{n^{[l-1]}}$, which keeps the computed values at a moderate scale. The constant $C$ is usually 2 when ReLU is used as the activation function and 1 when the logistic (sigmoid) function or $\tanh$ is used.
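
The initialization rule above might be sketched as follows. The function name `init_weights`, the dictionary layout, and the choice to start the biases at zero are assumptions for this example; the text itself only specifies the distribution of the weights.

```python
import numpy as np

def init_weights(layer_dims, C=2.0, seed=0):
    """Draw W[l] from N(0, C / n[l-1]) for each layer l; layer_dims[0] is the input size."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        n_prev, n_curr = layer_dims[l - 1], layer_dims[l]
        # Scaling a standard normal by sqrt(C / n_prev) gives variance C / n_prev
        params[f"W{l}"] = rng.standard_normal((n_curr, n_prev)) * np.sqrt(C / n_prev)
        # Zero biases are an assumption of this sketch, not stated in the text
        params[f"b{l}"] = np.zeros((n_curr, 1))
    return params

# C=2 for ReLU layers; C=1 would be the choice for sigmoid or tanh
params = init_weights([1000, 500, 10], C=2.0)
print(params["W1"].var())   # should be close to 2 / 1000
```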

Gradient checking

This is the same technique as in the machine learning course, but with a new measure. Given the gradient vector $d\theta$ computed by backpropagation and the gradient vector $d\theta_\text{approx}$ approximated numerically from the definition of the derivative, we compute
$\frac{||d\theta_\text{approx}-d\theta||_2}{||d\theta_\text{approx}||_2+||d\theta||_2}$ where $||\vec{x}||_2$ is the Euclidean norm.

With $\varepsilon=10^{-7}$, a value below $10^{-7}$ indicates that our computation is correct. A value on the order of $10^{-5}$ calls for a careful check. A value around $10^{-3}$ means the implementation almost certainly has a bug.
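
The check can be sketched on a toy function as follows. This is an illustration under stated assumptions: the helper `grad_check` and the test function are hypothetical, and the numerical gradient uses a two-sided (centered) difference, which is one common way of approximating the derivative from its definition; the text does not say which variant is intended.

```python
import numpy as np

def grad_check(f, grad_f, theta, eps=1e-7):
    """Relative difference between the analytic gradient and a numerical one."""
    d_theta = grad_f(theta)
    d_theta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        # Two-sided difference along coordinate i (an assumption of this sketch)
        d_theta_approx[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    num = np.linalg.norm(d_theta_approx - d_theta)
    den = np.linalg.norm(d_theta_approx) + np.linalg.norm(d_theta)
    return num / den

# Toy objective f(theta) = sum(theta^3), whose true gradient is 3 * theta^2
f = lambda t: np.sum(t ** 3)
grad_f = lambda t: 3 * t ** 2
bad_grad_f = lambda t: 2 * t ** 2   # deliberately wrong gradient

theta = np.array([1.0, -2.0, 0.5])
good = grad_check(f, grad_f, theta)
bad = grad_check(f, bad_grad_f, theta)
print(good, bad)
```

The correct gradient should give a ratio well below $10^{-7}$, while the deliberately wrong one gives a ratio far above $10^{-3}$. For a real network, $\theta$ would be all weights and biases flattened into a single vector, and the loop makes the check expensive, so it is typically run only while debugging.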