Deep learning: derivation of shallow neural networks and deep neural networks
2022-07-06 08:21:00 【ShadyPi】
I wrote several blog posts about neural networks while studying machine learning. Recently I have been watching Andrew Ng's deep learning videos, where neural networks are presented somewhat differently from what I covered before, so I am taking notes here.
Basic structure and notation
The basic structure consists of an input layer, a hidden layer, and an output layer. Activations are denoted by $a$, with the layer index in square brackets and the sample index in parentheses, so we have the input layer $x$ (i.e. $a^{[0]}$), the hidden layer $a^{[1]}$, and the output layer $a^{[2]}$. The computation uses weights $w$ and biases $b$, and the function inside each unit is still the logistic function $\sigma(z)=\frac{1}{1+e^{-z}}$.
Define the data matrix $X$ ($n\times m$, where $n=n^{[0]}$ is the number of features and $m$ the number of samples), the weight matrices $W^{[l]}$ ($n^{[l]}\times n^{[l-1]}$), the bias vectors $b^{[l]}$ ($n^{[l]}\times 1$), and the activation matrices $A^{[l]}$ ($n^{[l]}\times m$), where $n^{[l]}$ denotes the number of units in layer $l$. Let
$$A^{[0]}=X=\left[\begin{matrix} |&|& &|\\ x^{(1)}&x^{(2)}&\cdots&x^{(m)}\\ |&|& &|\\ \end{matrix}\right],\quad W^{[l]}=\left[\begin{matrix} -&w_1^{[l]T}&-\\ -&w_2^{[l]T}&-\\ &\vdots&\\ -&w_{n^{[l]}}^{[l]T}&-\\ \end{matrix}\right],\quad b^{[l]}=\left[\begin{matrix} b^{[l]}_1\\ b^{[l]}_2\\ \vdots\\ b^{[l]}_{n^{[l]}}\\ \end{matrix}\right]$$
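As a quick sanity check on these shapes, here is a minimal NumPy sketch of parameter initialization; the `layer_dims` argument, the `0.01` scaling factor, and the function name are illustrative assumptions rather than anything fixed by these notes.

```python
import numpy as np

def initialize_parameters(layer_dims, seed=1):
    """layer_dims = [n[0], n[1], ..., n[L]]; returns {'W1': ..., 'b1': ..., ...}."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        # W[l] has shape (n[l], n[l-1]); small random values break symmetry
        params[f"W{l}"] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
        # b[l] has shape (n[l], 1); zeros are fine for biases
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params

# Example: 2 input features, a hidden layer with 4 units, 1 output unit
params = initialize_parameters([2, 4, 1])
print(params["W1"].shape, params["b1"].shape)  # (4, 2) (4, 1)
```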
For further background, see my earlier post Neural networks in machine learning.
Forward propagation
With the groundwork laid in Neural network vectorization derivation in machine learning, and since forward propagation is relatively simple, we go straight to the case of multiple samples and multiple hidden layers.
The intermediate vector $z^{[l]}$ is
$$z^{[l]}=\left[\begin{matrix} z^{[l]}_1\\ z^{[l]}_2\\ \vdots\\ z^{[l]}_{n^{[l]}}\\ \end{matrix}\right]= \left[\begin{matrix} w^{[l]T}_1a^{[l-1]}+b_1^{[l]}\\ w^{[l]T}_2a^{[l-1]}+b_2^{[l]}\\ \vdots\\ w^{[l]T}_{n^{[l]}}a^{[l-1]}+b_{n^{[l]}}^{[l]}\\ \end{matrix}\right]= W^{[l]}a^{[l-1]}+b^{[l]}$$
so the matrix $Z^{[l]}$ formed by stacking the vectors $z^{[l](i)}$ column by column is
$$Z^{[l]}=\left[\begin{matrix} |&|& &|\\ z^{[l](1)}&z^{[l](2)}&\cdots&z^{[l](m)}\\ |&|& &|\\ \end{matrix}\right]= W^{[l]}A^{[l-1]}+b^{[l]}$$
and the activation matrix of layer $l$, $A^{[l]}$, is
$$A^{[l]}=\left[\begin{matrix} |&|& &|\\ a^{[l](1)}&a^{[l](2)}&\cdots&a^{[l](m)}\\ |&|& &|\\ \end{matrix}\right]=\sigma(Z^{[l]}) =\sigma(W^{[l]}A^{[l-1]}+b^{[l]})$$
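The vectorized forward pass translates almost line for line into NumPy. Below is a minimal sketch under my own naming conventions (`sigmoid`, `forward_propagation`, and the `params` dictionary layout from the initialization snippet above); it applies the sigmoid at every layer, exactly as in the formulas.

```python
import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def forward_propagation(X, params, num_layers):
    """X: (n[0], m) data matrix; returns A[L] and a cache of Z[l], A[l] per layer."""
    A = X                      # A[0] = X
    cache = {"A0": X}
    for l in range(1, num_layers + 1):
        Z = params[f"W{l}"] @ A + params[f"b{l}"]   # Z[l] = W[l] A[l-1] + b[l]
        A = sigmoid(Z)                               # A[l] = sigma(Z[l])
        cache[f"Z{l}"] = Z
        cache[f"A{l}"] = A
    return A, cache

# Example: a 2-4-1 network on m = 5 samples
rng = np.random.default_rng(0)
params = {"W1": rng.standard_normal((4, 2)) * 0.01, "b1": np.zeros((4, 1)),
          "W2": rng.standard_normal((1, 4)) * 0.01, "b2": np.zeros((1, 1))}
X = rng.standard_normal((2, 5))
A2, cache = forward_propagation(X, params, num_layers=2)
print(A2.shape)  # (1, 5)
```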
Other activation functions
So far our neural networks have used the logistic function from logistic regression, but in fact there are often better choices of activation function.
tanh function
$$\tanh(z)=\frac{e^z-e^{-z}}{e^z+e^{-z}}$$
Its graph is an S-shaped curve through the origin, saturating at $-1$ and $1$.
The $\tanh$ function is almost strictly better than the logistic function, because its activations have a mean of roughly $0$, which makes learning in the next layer easier. The exception is the output layer: when we want an output between $0$ and $1$, we can still use the logistic function there.
The derivative of this function is
$$\tanh'(z)=1-(\tanh(z))^2$$
ReLU function
However, the logistic and $\tanh$ functions share a problem: when $|z|$ is large, the gradient of the function becomes very small, so algorithms like gradient descent converge very slowly. The ReLU function avoids this; its expression is
$$\text{ReLU}(z)=\max(0,z)$$
Its graph is $0$ for $z<0$ and the identity line for $z\ge 0$.
Thus the derivative is $1$ whenever $z>0$ and $0$ when $z<0$. Strictly speaking the derivative does not exist at $z=0$, but the probability that $z$ is exactly $0$ is vanishingly small, and we can simply define the derivative there to be $1$ or $0$; this is harmless in practice.
In general, for a binary classification problem we might use the $\tanh$ function in the hidden layers and a logistic function in the output layer; otherwise ReLU is usually the default choice.
Leaky ReLU
In practice, ReLU usually performs well. Because the derivative of its negative part is $0$, gradient descent makes no progress for units stuck in that region, although in a real network enough units have positive inputs that the parameters as a whole still learn at a reasonable speed. If this is a concern, you can give the negative part a small slope, for example $0.01$, in which case the activation function becomes
$$\text{Leaky ReLU}(z)=\max(0.01z,z)$$
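For concreteness, here is a small sketch of these activation functions and their derivatives in NumPy; the function names and the convention of taking the ReLU derivative to be $0$ at $z=0$ are my own choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0.0, z)

def relu_prime(z):
    # derivative defined as 0 at z == 0 (an arbitrary but harmless convention)
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    return np.maximum(alpha * z, z)

def leaky_relu_prime(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)
```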
Backpropagation
Shallow neural networks
Backpropagation is more involved, so let us first derive it for a shallow (two-layer) network.
First, the loss at the output is
$$\mathcal{L}(a^{[2]},y)=-y\log a^{[2]}-(1-y)\log(1-a^{[2]})$$
Differentiating with respect to $a^{[2]}$ gives
$$\frac{d\mathcal{L}}{da^{[2]}}=-\frac{y}{a^{[2]}}+\frac{1-y}{1-a^{[2]}}$$
so the derivative of the loss with respect to $z^{[2]}$ is
$$\frac{d\mathcal{L}}{dz^{[2]}}=\frac{d\mathcal{L}}{da^{[2]}}\frac{d a^{[2]}}{dz^{[2]}}=\left(-\frac{y}{a^{[2]}}+\frac{1-y}{1-a^{[2]}}\right)a^{[2]}(1-a^{[2]})=a^{[2]}-y$$
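The middle equality uses the sigmoid derivative $\frac{da^{[2]}}{dz^{[2]}}=\sigma'(z^{[2]})=a^{[2]}(1-a^{[2]})$; for completeness, here is the short derivation of that identity:

$$\sigma'(z)=\frac{d}{dz}\frac{1}{1+e^{-z}}=\frac{e^{-z}}{(1+e^{-z})^2}=\frac{1}{1+e^{-z}}\cdot\frac{e^{-z}}{1+e^{-z}}=\sigma(z)\bigl(1-\sigma(z)\bigr)$$

and with $a^{[2]}=\sigma(z^{[2]})$ this gives $a^{[2]}(1-a^{[2]})$ above.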
We then compute $\frac{d\mathcal{L}}{dW^{[2]}}$ and $\frac{d\mathcal{L}}{db^{[2]}}$ as
$$\frac{d\mathcal{L}}{dW^{[2]}}=\frac{d\mathcal{L}}{dz^{[2]}}a^{[1]T},\qquad \frac{d\mathcal{L}}{db^{[2]}}=\frac{d\mathcal{L}}{dz^{[2]}}$$
The derivation is now half done. Next, the derivative with respect to $a^{[1]}$ is
$$\frac{d\mathcal{L}}{da^{[1]}}=W^{[2]T}\frac{d\mathcal{L}}{dz^{[2]}}$$
Because $z^{[2]}$ is $n^{[2]}\times 1$ and $W^{[2]}$ is $n^{[2]}\times n^{[1]}$, the transpose is needed here. To obtain the derivative with respect to $z^{[1]}$, we then multiply elementwise by $\frac{da^{[1]}}{dz^{[1]}}$ (where $*$ denotes elementwise multiplication):
$$\frac{d\mathcal{L}}{dz^{[1]}}=W^{[2]T}\frac{d\mathcal{L}}{dz^{[2]}}*g^{[1]'}(z^{[1]})$$
Computing $\frac{d\mathcal{L}}{dW^{[1]}}$ and $\frac{d\mathcal{L}}{db^{[1]}}$ then proceeds almost exactly as for the second layer:
$$\frac{d\mathcal{L}}{dW^{[1]}}=\frac{d\mathcal{L}}{dz^{[1]}}a^{[0]T},\qquad \frac{d\mathcal{L}}{db^{[1]}}=\frac{d\mathcal{L}}{dz^{[1]}}$$
The derivation above is for a single sample. For backpropagation over multiple samples, we stack the per-sample column vectors by column and can reuse the results derived above: every $n^{[l]}\times 1$ vector becomes an $n^{[l]}\times m$ matrix, and the gradient for $b$ additionally needs to be summed across the horizontal direction. To simplify notation, $dZ^{[2]}$ denotes the derivative of the cost with respect to the matrix $Z^{[2]}$, and similarly for the other matrices:
$$\begin{aligned} &dZ^{[2]}=A^{[2]}-Y\\ &dW^{[2]}=\frac{1}{m}dZ^{[2]}A^{[1]T}\\ &db^{[2]}=\frac{1}{m}np.sum(dZ^{[2]},axis=1,keepdims=True)\\ &dZ^{[1]}=W^{[2]T}dZ^{[2]}*g^{[1]'}(Z^{[1]})\\ &dW^{[1]}=\frac{1}{m}dZ^{[1]}X^T\\ &db^{[1]}=\frac{1}{m}np.sum(dZ^{[1]},axis=1,keepdims=True)\\ \end{aligned}$$
The `keepdims=True` argument tells Python (NumPy) not to collapse our $(n,1)$ column vectors into rank-1 arrays of shape $(n,)$, which could lead to hard-to-find bugs.
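Here is a minimal NumPy sketch of these six formulas, assuming the sigmoid forward pass and cache layout from the earlier snippets (the function name and dictionary keys are my own):

```python
import numpy as np

def backward_propagation_shallow(X, Y, params, cache):
    """Gradients for a 2-layer sigmoid network; X: (n[0], m), Y: (1, m)."""
    m = X.shape[1]
    A1, A2 = cache["A1"], cache["A2"]

    dZ2 = A2 - Y                                         # dZ[2] = A[2] - Y
    dW2 = (dZ2 @ A1.T) / m                               # dW[2]
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m         # db[2]

    dZ1 = (params["W2"].T @ dZ2) * A1 * (1 - A1)         # sigmoid: g[1]'(Z[1]) = A1(1-A1)
    dW1 = (dZ1 @ X.T) / m                                # dW[1]
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m         # db[1]

    return {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}
```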
Deep neural network
The shallow network above has only two layers. For complex problems, increasing the number of layers (the depth) of the network is far more effective than forcing more nodes into a single hidden layer, so we need to turn the derivation above into a more general form, namely the following four formulas:
$$\begin{aligned} &dZ^{[l]}=dA^{[l]}*g^{[l]'}(Z^{[l]})\\ &dW^{[l]}=\frac{1}{m}dZ^{[l]}A^{[l-1]T}\\ &db^{[l]}=\frac{1}{m}np.sum(dZ^{[l]},axis=1,keepdims=True)\\ &dA^{[l-1]}=W^{[l]T}dZ^{[l]} \end{aligned}$$
The initial input $dA^{[L]}$ is determined by the loss and the activation function of the output unit; for $m$ samples its value is
$$dA^{[L]}=\left[\begin{matrix} \frac{d\mathcal{L}}{da^{[L](1)}}&\frac{d\mathcal{L}}{da^{[L](2)}}&\cdots&\frac{d\mathcal{L}}{da^{[L](m)}} \end{matrix}\right]$$
With the formulas above, each backward step takes $dA^{[l]}$ as input, outputs $dA^{[l-1]}$, and along the way computes the gradients of that layer's weights and biases.
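As an illustration, here is a minimal sketch of one such backward step, together with the initialization of $dA^{[L]}$ for the cross-entropy loss with a sigmoid output; the function names and the `activation_prime` argument are my own assumptions.

```python
import numpy as np

def linear_activation_backward(dA, Z, A_prev, W, activation_prime):
    """One backward step for layer l: consumes dA[l], returns (dA[l-1], dW[l], db[l])."""
    m = A_prev.shape[1]
    dZ = dA * activation_prime(Z)                  # dZ[l] = dA[l] * g[l]'(Z[l])
    dW = (dZ @ A_prev.T) / m                       # dW[l] = (1/m) dZ[l] A[l-1]^T
    db = np.sum(dZ, axis=1, keepdims=True) / m     # db[l]
    dA_prev = W.T @ dZ                             # dA[l-1] = W[l]^T dZ[l]
    return dA_prev, dW, db

def init_dAL(AL, Y):
    """dA[L] for the cross-entropy loss with a sigmoid output unit."""
    return -Y / AL + (1 - Y) / (1 - AL)
```

Starting from `dA = init_dAL(AL, Y)` and looping over $l = L, L-1, \ldots, 1$, feeding each returned `dA_prev` into the next call, reproduces the shallow-network result above.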