
ML backward propagation

2022-07-08 01:59:00 xcrj

Neural network

(Figure: the example network used below, with a 3-unit input layer, a 3-unit hidden layer, and a 2-unit output layer.)

Notation

  • $a_n^{(l)}$: the activation of neuron $n$ in layer $l$; the superscript $(l)$ is the layer index, the subscript $n$ is the neuron index within that layer
  • $w_{i,j}^{(l)}$: the weight from neuron $i$ of layer $l-1$ to neuron $j$ of layer $l$, i.e. $i$ indexes the previous layer's neurons and $j$ indexes the current layer's neurons
  • $w_{i,j}^{(l)}$: $i,j$ also give the element's position in the weight matrix, $i$ being the row and $j$ the column
  • $w^{(2)}=\begin{pmatrix} w_{1,1}^{(2)} & w_{1,2}^{(2)} & w_{1,3}^{(2)} \\ w_{2,1}^{(2)} & w_{2,2}^{(2)} & w_{2,3}^{(2)} \\ w_{3,1}^{(2)} & w_{3,2}^{(2)} & w_{3,3}^{(2)} \end{pmatrix}$
  • $w^{(3)}=\begin{pmatrix} w_{1,1}^{(3)} & w_{1,2}^{(3)} \\ w_{2,1}^{(3)} & w_{2,2}^{(3)} \\ w_{3,1}^{(3)} & w_{3,2}^{(3)} \end{pmatrix}$
  • $a_n^{(l)}=g(z_n^{(l)})$, where $g$ is the sigmoid function
  • $x_n=a_n^{(1)}$: the inputs are the activations of layer 1
  • $\theta_1^{(2)}=w_{0,1}^{(2)},\ \theta_2^{(2)}=w_{0,2}^{(2)},\ \theta_3^{(2)}=w_{0,3}^{(2)},\ \theta_1^{(3)}=w_{0,1}^{(3)},\ \theta_2^{(3)}=w_{0,2}^{(3)}$: the biases are the weights on the constant input $1$

Scalar form:

  • $a_1^{(2)}=g(z_1^{(2)})=g(\theta_1^{(2)}\cdot 1+w_{1,1}^{(2)}x_1+w_{2,1}^{(2)}x_2+w_{3,1}^{(2)}x_3)$
  • $a_2^{(2)}=g(z_2^{(2)})=g(\theta_2^{(2)}\cdot 1+w_{1,2}^{(2)}x_1+w_{2,2}^{(2)}x_2+w_{3,2}^{(2)}x_3)$
  • $a_3^{(2)}=g(z_3^{(2)})=g(\theta_3^{(2)}\cdot 1+w_{1,3}^{(2)}x_1+w_{2,3}^{(2)}x_2+w_{3,3}^{(2)}x_3)$
  • $a_1^{(3)}=g(z_1^{(3)})=g(\theta_1^{(3)}\cdot 1+w_{1,1}^{(3)}a_1^{(2)}+w_{2,1}^{(3)}a_2^{(2)}+w_{3,1}^{(3)}a_3^{(2)})$
  • $a_2^{(3)}=g(z_2^{(3)})=g(\theta_2^{(3)}\cdot 1+w_{1,2}^{(3)}a_1^{(2)}+w_{2,2}^{(3)}a_2^{(2)}+w_{3,2}^{(3)}a_3^{(2)})$
  • $a_1^{(3)}=\hat{y}_1$ and $a_2^{(3)}=\hat{y}_2$: the activations of the last layer are the network's predicted outputs

Matrix form:

  • $z^{(l)}=w^{(l)}a^{(l-1)}+\theta^{(l)}$: this layer's weights applied to the previous layer's output, plus the bias weights

Forward propagation

Definition

  • Input → processing → output: the output of each layer is taken as the input of the next layer

Given:

  • $x$ (the inputs, with $x_n=a_n^{(1)}$) and $y$ (the true outputs)
  • using forward propagation we can then compute $z^{(l)}$ and $a^{(l)}=g(z^{(l)})$ for every layer (a code sketch follows below)
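
As a concrete illustration, here is a minimal NumPy sketch of forward propagation for the 3-3-2 example network above. It is a sketch under assumed conventions, not reference code from the post: activations are kept as column vectors and each weight matrix is stored with shape (units in this layer, units in the previous layer), so that `z = W @ a + theta` matches the matrix form $z^{(l)}=w^{(l)}a^{(l-1)}+\theta^{(l)}$. The layer sizes and random initialization are purely illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative 3-3-2 network: layer sizes match the example above, weights are random.
rng = np.random.default_rng(0)
sizes = [3, 3, 2]                                   # layer 1 (input), layer 2, layer 3 (output)
W = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
theta = [rng.standard_normal((m, 1)) for m in sizes[1:]]

def forward(x):
    """Forward propagation: return the lists of z^{(l)} and a^{(l)} for all layers."""
    a = x.reshape(-1, 1)                            # a^{(1)} = x as a column vector
    zs, activations = [], [a]
    for Wl, th in zip(W, theta):
        z = Wl @ a + th                             # z^{(l)} = w^{(l)} a^{(l-1)} + theta^{(l)}
        a = sigmoid(z)                              # a^{(l)} = g(z^{(l)})
        zs.append(z)
        activations.append(a)
    return zs, activations

zs, activations = forward(np.array([0.1, 0.5, 0.9]))
print(activations[-1])                              # a^{(L)}: the network's predictions
```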

Back propagation

Introduction

  • Backward propagation (backpropagation) means propagating the loss backward through the network
  • An MLP is trained by combining loss backpropagation with an optimization method (gradient descent)
  • Backpropagation computes the gradient of the network's loss function with respect to the weights; stochastic gradient descent then uses this gradient to learn
  • Learning means determining $w$ (the weights): after defining a loss function, compute its gradient with respect to $w$, and let gradient descent use this gradient to learn (i.e. to update $w$)

Define the loss function:

  • Mean squared error (MSE): $C(w,\theta)=\frac{1}{2}\|a^{(L)}-y\|_2^2=\frac{1}{2}\sum\limits_{i=1}^n(a_i^{(L)}-y_i)^2$, where $C$ stands for cost, $L$ is the index of the last layer, $a^{(L)}$ is the output-layer vector, $y$ is the true output vector, and $\|x\|_2$ is the 2-norm, i.e. the Euclidean distance (see the snippet below)
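
For reference, the same cost written against the forward-pass sketch above; `a_L` and `y` are assumed to be column vectors of the output activations and the targets:

```python
def cost(a_L, y):
    """Mean squared error: C(w, theta) = 1/2 * ||a^{(L)} - y||_2^2."""
    return 0.5 * np.sum((a_L - y) ** 2)

# e.g. cost(activations[-1], np.array([[0.0], [1.0]]))
```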

Compute the gradient of the loss function with respect to $w$ (the weights):
Step 1: gradient of the loss with respect to $w$ in the output layer

  • $\begin{aligned} \frac{\partial C(w,\theta)}{\partial w^{(L)}} &=\frac{\partial C(w,\theta)}{\partial a^{(L)}}\frac{\partial a^{(L)}}{\partial z^{(L)}}\frac{\partial z^{(L)}}{\partial w^{(L)}} \\ &=(a^{(L)}-y)\odot g'(z^{(L)})\,a^{(L-1)} \end{aligned}$
  • $\begin{aligned} \frac{\partial C(w,\theta)}{\partial \theta^{(L)}} &=\frac{\partial C(w,\theta)}{\partial a^{(L)}}\frac{\partial a^{(L)}}{\partial z^{(L)}}\frac{\partial z^{(L)}}{\partial \theta^{(L)}} \\ &=(a^{(L)}-y)\odot g'(z^{(L)}) \end{aligned}$
  • $\odot$ is the Hadamard product: matrices are multiplied element-wise, position by position (a code sketch follows this list)
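
Continuing the NumPy sketch, the two output-layer formulas can be written as below. With column vectors, the Hadamard product $\odot$ is plain element-wise multiplication `*`, and the product $\delta^{(L)}a^{(L-1)}$ becomes the outer product `delta_L @ a_prev.T` so the result has the shape of the weight matrix; this layout detail is an assumption of the sketch, not something the post specifies.

```python
def sigmoid_prime(z):
    """g'(z) for the sigmoid: g(z) * (1 - g(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def output_layer_gradients(zs, activations, y):
    """delta^{(L)} and the output-layer gradients dC/dw^{(L)}, dC/dtheta^{(L)}."""
    delta_L = (activations[-1] - y) * sigmoid_prime(zs[-1])   # (a^{(L)} - y) ⊙ g'(z^{(L)})
    grad_W_L = delta_L @ activations[-2].T                    # delta^{(L)} (a^{(L-1)})^T
    grad_theta_L = delta_L                                    # delta^{(L)}
    return delta_L, grad_W_L, grad_theta_L
```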

Step 2: gradient of the loss with respect to $w$ in the second-to-last layer (layer $L-1$)

  • $\begin{aligned} \frac{\partial C(w,\theta)}{\partial w^{(L-1)}} &=\frac{\partial C(w,\theta)}{\partial a^{(L)}}\frac{\partial a^{(L)}}{\partial z^{(L)}}\frac{\partial z^{(L)}}{\partial a^{(L-1)}}\frac{\partial a^{(L-1)}}{\partial z^{(L-1)}}\frac{\partial z^{(L-1)}}{\partial w^{(L-1)}} \\ &=(a^{(L)}-y)\odot g'(z^{(L)})\odot w^{(L)}g'(z^{(L-1)})\,a^{(L-2)} \end{aligned}$
  • $\begin{aligned} \frac{\partial C(w,\theta)}{\partial \theta^{(L-1)}} &=\frac{\partial C(w,\theta)}{\partial a^{(L)}}\frac{\partial a^{(L)}}{\partial z^{(L)}}\frac{\partial z^{(L)}}{\partial a^{(L-1)}}\frac{\partial a^{(L-1)}}{\partial z^{(L-1)}}\frac{\partial z^{(L-1)}}{\partial \theta^{(L-1)}} \\ &=(a^{(L)}-y)\odot g'(z^{(L)})\odot w^{(L)}g'(z^{(L-1)}) \end{aligned}$
  • Compared with $\frac{\partial C(w,\theta)}{\partial \theta^{(L)}}$, the expansion of $\frac{\partial C(w,\theta)}{\partial w^{(L-1)}}$ contains 3 additional chain-rule factors; the leading factors are shared

Step 3: factor out the part common to Steps 1 and 2

  • Let $\delta^{(L)}=\frac{\partial C(w,\theta)}{\partial z^{(L)}}=(a^{(L)}-y)\odot g'(z^{(L)})$
  • Then $\delta^{(L-1)}=\frac{\partial C(w,\theta)}{\partial z^{(L-1)}}=(a^{(L)}-y)\odot g'(z^{(L)})\odot w^{(L)}g'(z^{(L-1)})$
  • That is, $\delta^{(L-1)}=\delta^{(L)}\odot w^{(L)}g'(z^{(L-1)})$
  • This gives the recurrence relation between $\delta^{(L-1)}$ and $\delta^{(L)}$ (sketched in code below)
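
In code the recurrence is one line. Note that with the column-vector layout assumed in the sketch, $w^{(l)}$ enters through its transpose so the shapes line up; this is a bookkeeping detail of that layout, not a change to the relation itself.

```python
def delta_previous(delta_l, W_l, z_prev):
    """delta^{(l-1)} from delta^{(l)}: (w^{(l)T} delta^{(l)}) ⊙ g'(z^{(l-1)})."""
    return (W_l.T @ delta_l) * sigmoid_prime(z_prev)
```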

Step 4: gradient of the loss with respect to $w$ for all layers
Gradient of the loss with respect to $w$ in layer $L$ (the output layer):

  • $\frac{\partial C(w,\theta)}{\partial w^{(L)}}=\delta^{(L)}a^{(L-1)}$
  • $\frac{\partial C(w,\theta)}{\partial \theta^{(L)}}=\delta^{(L)}$

Gradient of the loss with respect to $w$ in layer $L-1$ (the second-to-last layer):

  • $\frac{\partial C(w,\theta)}{\partial w^{(L-1)}}=\delta^{(L-1)}a^{(L-2)}=\delta^{(L)}\odot w^{(L)}g'(z^{(L-1)})\,a^{(L-2)}$
  • $\frac{\partial C(w,\theta)}{\partial \theta^{(L-1)}}=\delta^{(L-1)}=\delta^{(L)}\odot w^{(L)}g'(z^{(L-1)})$

Gradient of the loss with respect to $w$ in a general layer $l$:

  • $\frac{\partial C(w,\theta)}{\partial w^{(l)}}=\delta^{(l)}a^{(l-1)}$
  • $\frac{\partial C(w,\theta)}{\partial \theta^{(l)}}=\delta^{(l)}$
  • $\frac{\partial C(w,\theta)}{\partial w^{(l-1)}}=\delta^{(l-1)}a^{(l-2)}=\delta^{(l)}\odot w^{(l)}g'(z^{(l-1)})\,a^{(l-2)}$
  • $\frac{\partial C(w,\theta)}{\partial \theta^{(l-1)}}=\delta^{(l-1)}=\delta^{(l)}\odot w^{(l)}g'(z^{(l-1)})$
  • Once $\delta^{(l)}$ is known, $\frac{\partial C(w,\theta)}{\partial w^{(l)}}$ is known, i.e. the gradient of the loss with respect to $w$ in layer $l$ is known (see the sketch below)
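
Applying the recurrence from the last layer backward yields every $\delta^{(l)}$, and with them every gradient, which is the whole point of backpropagation. A sketch continuing the code above (it reuses `sigmoid_prime`; the index mapping of the Python lists to the post's layer numbering is an assumption of the sketch):

```python
def all_deltas(W, zs, activations, y):
    """delta^{(l)} for l = L, L-1, ..., 2, computed backward with the recurrence above."""
    deltas = [(activations[-1] - y) * sigmoid_prime(zs[-1])]        # delta^{(L)}
    for l in range(len(W) - 1, 0, -1):                              # walk back through the layers
        deltas.insert(0, (W[l].T @ deltas[0]) * sigmoid_prime(zs[l - 1]))
    return deltas                                                   # deltas[k] is delta^{(k+2)}

def all_gradients(W, zs, activations, y):
    """dC/dw^{(l)} = delta^{(l)} (a^{(l-1)})^T and dC/dtheta^{(l)} = delta^{(l)} per layer."""
    deltas = all_deltas(W, zs, activations, y)
    grads_W = [d @ a.T for d, a in zip(deltas, activations[:-1])]
    grads_theta = deltas
    return grads_W, grads_theta
```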

Step 5: summary
Given:

  • $x$ (the inputs) and $y$ (the true outputs)

Forward propagation yields:

  • $z^{(l)}$ and $a^{(l)}=g(z^{(l)})$ for every layer

Backward propagation yields:

  • $\delta^{(L)}$: in $\delta^{(L)}=\frac{\partial C(w,\theta)}{\partial z^{(L)}}=(a^{(L)}-y)\odot g'(z^{(L)})$, the quantities $a^{(L)}$, $y$, and $g(z^{(L)})=a^{(L)}$ are all known
  • $\delta^{(L-1)}$: in $\delta^{(L-1)}=\delta^{(L)}\odot w^{(L)}g'(z^{(L-1)})$, the quantities $w^{(L)}$ and $g(z^{(L-1)})$ are known, and $\delta^{(L)}$ was obtained in the previous step
  • $\delta^{(L-2)}$: in $\delta^{(L-2)}=\delta^{(L-1)}\odot w^{(L-1)}g'(z^{(L-2)})$, the quantities $w^{(L-1)}$ and $g(z^{(L-2)})$ are known, and $\delta^{(L-1)}$ was obtained in the previous step
  • $\delta^{(l-1)}$: in $\delta^{(l-1)}=\delta^{(l)}\odot w^{(l)}g'(z^{(l-1)})$, the quantities $w^{(l)}$ and $g(z^{(l-1)})$ are known, and $\delta^{(l)}$ was obtained in the previous step
  • $\frac{\partial C(w,\theta)}{\partial w^{(L)}}=\delta^{(L)}a^{(L-1)}$
  • $\frac{\partial C(w,\theta)}{\partial \theta^{(L)}}=\delta^{(L)}$
  • $\frac{\partial C(w,\theta)}{\partial w^{(L-1)}}=\delta^{(L-1)}a^{(L-2)}$
  • $\frac{\partial C(w,\theta)}{\partial \theta^{(L-1)}}=\delta^{(L-1)}$
  • $\frac{\partial C(w,\theta)}{\partial w^{(L-2)}}=\delta^{(L-2)}a^{(L-3)}$
  • $\frac{\partial C(w,\theta)}{\partial \theta^{(L-2)}}=\delta^{(L-2)}$
  • $\frac{\partial C(w,\theta)}{\partial w^{(l)}}=\delta^{(l)}a^{(l-1)}$
  • $\frac{\partial C(w,\theta)}{\partial \theta^{(l)}}=\delta^{(l)}$

Summary: the full algorithm (an end-to-end code sketch follows the steps)

  1. Goal: compute the weights $w$ and the biases $\theta$ (by analogy with $y=ax+b$: here $z=wx+\theta$ and $a=g(z)=g(wx+\theta)$)
  2. Initialize $w, \theta$
  3. Forward propagation: compute $z^{(l)}, a^{(l)}$
  4. Define the loss function, mean squared error (MSE): $C(w,\theta)=\frac{1}{2}\|a^{(L)}-y\|_2^2=\frac{1}{2}\sum\limits_{i=1}^n(a_i^{(L)}-y_i)^2$
  5. Compute $\delta^{(L)}$ for the output layer (layer $L$)
    • $\delta^{(L)}=\frac{\partial C(w,\theta)}{\partial z^{(L)}}=(a^{(L)}-y)\odot g'(z^{(L)})$
  6. Backward propagation: compute $\delta^{(l)}$ for the remaining layers, $l=L-1,\dots,2$
  7. Update the weights $w$ and the biases $\theta$ with the machine learning method, gradient descent
    • $w^{(l)}=w^{(l)}-\alpha \frac{\partial C(w,\theta)}{\partial w^{(l)}}$
    • $\theta^{(l)}=\theta^{(l)}-\alpha \frac{\partial C(w,\theta)}{\partial \theta^{(l)}}$
  8. If the change in $w, \theta$ is below a given threshold (meaning $w, \theta$ have essentially stopped changing), or the iteration limit is reached, stop iterating
  9. Output $w, \theta$
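
Putting steps 1 through 9 together, here is a minimal end-to-end sketch: gradient descent on a single training example, reusing `sigmoid` and `sigmoid_prime` from the earlier sketches. The learning rate, stopping threshold, and iteration cap are arbitrary illustrative values, not recommendations.

```python
def train(x, y, sizes=(3, 3, 2), alpha=0.5, tol=1e-6, max_iter=10000):
    """Backpropagation + gradient descent for a fully connected sigmoid network (sketch)."""
    rng = np.random.default_rng(0)
    W = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]  # step 2: initialize
    theta = [rng.standard_normal((m, 1)) for m in sizes[1:]]
    x, y = x.reshape(-1, 1), y.reshape(-1, 1)

    for _ in range(max_iter):
        # Step 3: forward propagation
        a, zs, activations = x, [], [x]
        for Wl, th in zip(W, theta):
            z = Wl @ a + th
            a = sigmoid(z)
            zs.append(z)
            activations.append(a)

        # Steps 5-6: delta^{(L)}, then each earlier delta; step 7: update layer by layer
        delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
        max_step = 0.0
        for l in range(len(W) - 1, -1, -1):
            grad_W = delta @ activations[l].T              # this layer's delta times a_prev^T
            grad_theta = delta
            if l > 0:                                      # propagate delta to the previous layer
                delta = (W[l].T @ delta) * sigmoid_prime(zs[l - 1])
            W[l] -= alpha * grad_W
            theta[l] -= alpha * grad_theta
            max_step = max(max_step, alpha * np.abs(grad_W).max(),
                           alpha * np.abs(grad_theta).max())

        # Step 8: stop when the parameters barely change
        if max_step < tol:
            break
    return W, theta                                        # step 9: output w, theta

W_trained, theta_trained = train(np.array([0.1, 0.5, 0.9]), np.array([0.0, 1.0]))
```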

Copyright notice: this article was written by [xcrj]; please include a link to the original when reposting.
https://yzsam.com/2022/02/202202130541456284.html