Deep learning: derivation of shallow neural networks and deep neural networks
2022-07-06 08:21:00 【ShadyPi】
I wrote several blog posts about neural networks back when I was studying machine learning. Recently I watched Andrew Ng's deep learning videos, where neural networks are presented somewhat differently, so I am taking notes here.
Basic structure and notation conventions

The basic structure consists of an input layer, a hidden layer, and an output layer. Activations are denoted by $a$, with the layer index in square brackets as a superscript and the sample index in parentheses, so we have the input layer $x$ (i.e. $a^{[0]}$), the hidden layer $a^{[1]}$, and the output layer $a^{[2]}$. The computation uses weights $w$ and biases $b$, and for now the activation function in each unit is still the logistic function $\sigma(z)=\frac{1}{1+e^{-z}}$.
Define the data matrix $X$ ($n\times m$), the weight matrices $W^{[l]}$ ($n^{[l]}\times n^{[l-1]}$), the bias vectors $b^{[l]}$ ($n^{[l]}\times 1$), and the activation matrices $A^{[l]}$ ($n^{[l]}\times m$), where $n^{[l]}$ denotes the number of units in layer $l$. Let
$$A^{[0]}=X=\left[\begin{matrix} |&|& &|\\ x^{(1)}&x^{(2)}&\cdots&x^{(m)}\\ |&|& &|\\ \end{matrix}\right],\quad W^{[l]}=\left[\begin{matrix} -&w_1^{[l]T}&-\\ -&w_2^{[l]T}&-\\ &\vdots&\\ -&w_{n^{[l]}}^{[l]T}&-\\ \end{matrix}\right],\quad b^{[l]}=\left[\begin{matrix} b^{[l]}_1\\ b^{[l]}_2\\ \vdots\\ b^{[l]}_{n^{[l]}}\\ \end{matrix}\right]$$
For some additional background, see my earlier post on neural networks in machine learning.
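As a concrete reference for these shapes, here is a minimal NumPy initialization sketch (the function name, the small $0.01$ scale on the random weights, and the example layer sizes are my own illustrative choices):

```python
import numpy as np

def init_params(layer_sizes, seed=0):
    """Initialize W^[l] of shape (n^[l], n^[l-1]) and b^[l] of shape (n^[l], 1) for l = 1..L."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_sizes)):
        # Small random weights break symmetry between units; zero biases are fine.
        params[f"W{l}"] = rng.standard_normal((layer_sizes[l], layer_sizes[l - 1])) * 0.01
        params[f"b{l}"] = np.zeros((layer_sizes[l], 1))
    return params

# Example: n^[0] = 3 input features, one hidden layer with 4 units, 1 output unit.
params = init_params([3, 4, 1])
print(params["W1"].shape, params["b1"].shape)  # (4, 3) (4, 1)
```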
Forward propagation
With the earlier post on vectorizing neural networks in machine learning as a foundation, and since forward propagation is relatively simple, we go straight to the case of multiple samples and multiple hidden layers.
The intermediate vector $z^{[l]}$ is
$$z^{[l]}=\left[\begin{matrix} z^{[l]}_1\\ z^{[l]}_2\\ \vdots\\ z^{[l]}_{n^{[l]}}\\ \end{matrix}\right]= \left[\begin{matrix} w^{[l]T}_1a^{[l-1]}+b_1^{[l]}\\ w^{[l]T}_2a^{[l-1]}+b_2^{[l]}\\ \vdots\\ w^{[l]T}_{n^{[l]}}a^{[l-1]}+b_{n^{[l]}}^{[l]}\\ \end{matrix}\right]= W^{[l]}a^{[l-1]}+b^{[l]}$$
so the matrix $Z^{[l]}$ formed by stacking the vectors $z^{[l](i)}$ column by column is
$$Z^{[l]}=\left[\begin{matrix} |&|& &|\\ z^{[l](1)}&z^{[l](2)}&\cdots&z^{[l](m)}\\ |&|& &|\\ \end{matrix}\right]= W^{[l]}A^{[l-1]}+b^{[l]}$$
and the activation matrix of the hidden layer, $A^{[l]}$, is
$$A^{[l]}=\left[\begin{matrix} |&|& &|\\ a^{[l](1)}&a^{[l](2)}&\cdots&a^{[l](m)}\\ |&|& &|\\ \end{matrix}\right]=\sigma(Z^{[l]}) =\sigma(W^{[l]}A^{[l-1]}+b^{[l]})$$
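To make the vectorized formulas concrete, here is a minimal NumPy sketch of the forward pass (using the sigmoid activation in every layer; the dictionary layout and the toy layer sizes are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params, L):
    """Forward pass: A^[0] = X, then Z^[l] = W^[l] A^[l-1] + b^[l], A^[l] = sigma(Z^[l])."""
    cache = {"A0": X}
    A = X
    for l in range(1, L + 1):
        Z = params[f"W{l}"] @ A + params[f"b{l}"]  # (n^[l], m); the (n^[l], 1) bias broadcasts over columns
        A = sigmoid(Z)
        cache[f"Z{l}"], cache[f"A{l}"] = Z, A
    return A, cache

# Toy example: n^[0] = 3, n^[1] = 4, n^[2] = 1, m = 5 samples.
rng = np.random.default_rng(1)
params = {"W1": rng.standard_normal((4, 3)) * 0.01, "b1": np.zeros((4, 1)),
          "W2": rng.standard_normal((1, 4)) * 0.01, "b2": np.zeros((1, 1))}
X = rng.standard_normal((3, 5))
A2, cache = forward(X, params, L=2)
print(A2.shape)  # (1, 5)
```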
Other activation functions
So far our neural networks have used the logistic function from logistic regression as the activation, but in fact there are many better choices for neural networks.
tanh function
$$\tanh(z)=\frac{e^z-e^{-z}}{e^z+e^{-z}}$$
Its graph is an S-shaped curve through the origin that saturates at $-1$ and $1$.
$\tanh$ is almost strictly better than the logistic function, because it keeps the mean of the activations around $0$, which makes learning in the next layer easier. The exception is the output layer, where we want an output between $0$ and $1$, so the logistic function is still used there.
The derivative of this function is
$$\tanh'(z)=1-(\tanh(z))^2$$
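A quick numerical check of this identity with a central finite difference (the grid of test points and the step size are arbitrary):

```python
import numpy as np

z = np.linspace(-3, 3, 7)
h = 1e-6
numeric = (np.tanh(z + h) - np.tanh(z - h)) / (2 * h)  # central finite difference
analytic = 1 - np.tanh(z) ** 2
print(np.max(np.abs(numeric - analytic)))  # agreement to roughly 1e-10
```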
ReLU function
However, both the logistic function and $\tanh$ share a problem: when $|z|$ is large, the gradient becomes very small, so gradient-descent-style algorithms converge very slowly. The ReLU function avoids this problem; its expression is
$$\text{ReLU}(z)=\max(0,z)$$
Its graph is flat at zero for $z<0$ and follows the line $z$ for $z\ge 0$.
This way, the derivative is $1$ whenever $z>0$ and $0$ whenever $z<0$. Mathematically the derivative does not exist at $z=0$, but the probability of $z$ being exactly $0$ is negligible, and we can simply define the derivative there to be $1$ or $0$; this is harmless in practice.
In general, for a binary classification problem we might use $\tanh$ in the hidden layers and a logistic function in the output layer; otherwise ReLU is the usual default.
Leaky ReLU
In practice ReLU usually performs well, but because the derivative of its negative part is $0$, gradient descent learns very slowly for units stuck in that region; still, in a network enough units receive positive inputs that the parameters as a whole keep learning at a reasonable speed. If this worries you, you can give the negative part a small slope, such as $0.01$, in which case the activation function becomes
$$\text{Leaky ReLU}(z)=\max(0.01z,z)$$
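For reference, the activation functions above and their derivatives can be written in NumPy roughly as follows (defining the ReLU derivative as $0$ at $z=0$ and using the $0.01$ slope for Leaky ReLU, per the conventions in the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)          # sigma'(z) = sigma(z)(1 - sigma(z))

def tanh_prime(z):
    return 1 - np.tanh(z) ** 2  # tanh'(z) = 1 - tanh(z)^2

def relu(z):
    return np.maximum(0, z)

def relu_prime(z):
    # Derivative taken as 0 at z == 0 (either 0 or 1 is fine in practice).
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    return np.maximum(alpha * z, z)

def leaky_relu_prime(z, alpha=0.01):
    return np.where(z > 0, 1.0, alpha)
```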
Back propagation
Shallow neural networks
Backpropagation is more involved, so let's first derive it for a shallow (two-layer) neural network.
First, the loss at the output is
$$\mathcal{L}(a^{[2]},y)=-y\log a^{[2]}-(1-y)\log(1-a^{[2]})$$
Differentiating with respect to $a^{[2]}$ gives
$$\frac{d\mathcal{L}}{da^{[2]}}=-\frac{y}{a^{[2]}}+\frac{1-y}{1-a^{[2]}}$$
so the derivative of the loss with respect to $z^{[2]}$ is
$$\frac{d\mathcal{L}}{dz^{[2]}}=\frac{d\mathcal{L}}{da^{[2]}}\frac{d a^{[2]}}{dz^{[2]}}=\left(-\frac{y}{a^{[2]}}+\frac{1-y}{1-a^{[2]}}\right)a^{[2]}(1-a^{[2]})=a^{[2]}-y$$
Next, $\frac{d\mathcal{L}}{dW^{[2]}}$ and $\frac{d\mathcal{L}}{db^{[2]}}$ are
$$\frac{d\mathcal{L}}{dW^{[2]}}=\frac{d\mathcal{L}}{dz^{[2]}}a^{[1]T},\qquad \frac{d\mathcal{L}}{db^{[2]}}=\frac{d\mathcal{L}}{dz^{[2]}}$$
The derivation is now half done. Next, the derivative with respect to $a^{[1]}$ is
$$\frac{d\mathcal{L}}{da^{[1]}}=W^{[2]T}\frac{d\mathcal{L}}{dz^{[2]}}$$
The transpose is needed here because $z^{[2]}$ is $n^{[2]}\times 1$ while $W^{[2]}$ is $n^{[2]}\times n^{[1]}$. Then, to get the derivative with respect to $z^{[1]}$, we simply multiply by $\frac{d a^{[1]}}{dz^{[1]}}$ (where $*$ denotes element-wise multiplication):
$$\frac{d\mathcal{L}}{dz^{[1]}}=W^{[2]T}\frac{d\mathcal{L}}{dz^{[2]}}*g^{[1]'}(z^{[1]})$$
Computing $\frac{d\mathcal{L}}{dW^{[1]}}$ and $\frac{d\mathcal{L}}{db^{[1]}}$ then proceeds almost exactly as for layer 2:
$$\frac{d\mathcal{L}}{dW^{[1]}}=\frac{d\mathcal{L}}{dz^{[1]}}a^{[0]T},\qquad \frac{d\mathcal{L}}{db^{[1]}}=\frac{d\mathcal{L}}{dz^{[1]}}$$
The derivation above is for a single sample. For backpropagation over multiple samples, stack the per-sample column vectors side by side; the results above carry over, with every $n^{[l]}\times 1$ vector becoming an $n^{[l]}\times m$ matrix, and the gradient for $b$ obtained by summing horizontally. (To simplify notation, $dZ^{[2]}$ denotes the derivative of the cost with respect to the matrix $Z^{[2]}$, and similarly for the other matrices.)
$$\begin{aligned} &dZ^{[2]}=A^{[2]}-Y\\ &dW^{[2]}=\frac{1}{m}dZ^{[2]}A^{[1]T}\\ &db^{[2]}=\frac{1}{m}np.sum(dZ^{[2]},axis=1,keepdims=True)\\ &dZ^{[1]}=W^{[2]T}dZ^{[2]}*g^{[1]'}(Z^{[1]})\\ &dW^{[1]}=\frac{1}{m}dZ^{[1]}X^T\\ &db^{[1]}=\frac{1}{m}np.sum(dZ^{[1]},axis=1,keepdims=True) \end{aligned}$$

The keepdims argument keeps the column vector as shape (n, 1) instead of letting Python collapse it into a rank-1 array of shape (n,), which can lead to hard-to-find bugs.
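Putting the forward pass and these six formulas together, one gradient-descent step for the two-layer network might look like the sketch below (a $\tanh$ hidden layer and a sigmoid output, as suggested earlier; the learning rate and the function name are arbitrary illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def two_layer_step(X, Y, W1, b1, W2, b2, lr=0.1):
    """One gradient-descent step on m samples; X is (n^[0], m), Y is (1, m)."""
    m = X.shape[1]
    # Forward pass: tanh hidden layer, sigmoid output.
    Z1 = W1 @ X + b1
    A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)
    # Backward pass: exactly the six formulas above.
    dZ2 = A2 - Y
    dW2 = (dZ2 @ A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)      # g^[1]'(Z1) = 1 - tanh(Z1)^2
    dW1 = (dZ1 @ X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    # Gradient-descent update.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
    return W1, b1, W2, b2
```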
Deep neural network
The shallow network above has only two layers. For complex problems, increasing the number of layers (the depth) of the network is far more effective than cramming more nodes into a single hidden layer, so we need to turn the derivation above into a more general form, namely the following four formulas:
$$\begin{aligned} &dZ^{[l]}=dA^{[l]}*g^{[l]'}(Z^{[l]})\\ &dW^{[l]}=\frac{1}{m}dZ^{[l]}A^{[l-1]T}\\ &db^{[l]}=\frac{1}{m}np.sum(dZ^{[l]},axis=1,keepdims=True)\\ &dA^{[l-1]}=W^{[l]T}dZ^{[l]} \end{aligned}$$
The initial input $dA^{[L]}$ is determined by the activation function of the output layer; for $m$ samples, its value is
$$dA^{[L]}=\left[\begin{matrix} \frac{d\mathcal{L}}{da^{[L](1)}}&\frac{d\mathcal{L}}{da^{[L](2)}}&\cdots&\frac{d\mathcal{L}}{da^{[L](m)}} \end{matrix}\right]$$
With the formulas above, each layer takes $dA^{[l]}$ as input, outputs $dA^{[l-1]}$, and along the way produces the gradients of that layer's weights and biases.
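A sketch of one such backward step for a general layer $l$, directly implementing the four formulas above (the function signature and the cache layout are my own assumptions):

```python
import numpy as np

def backward_layer(dA_l, Z_l, A_prev, W_l, g_prime):
    """Backward step for layer l: takes dA^[l], returns (dA^[l-1], dW^[l], db^[l])."""
    m = A_prev.shape[1]
    dZ_l = dA_l * g_prime(Z_l)                       # dZ^[l] = dA^[l] * g^[l]'(Z^[l])
    dW_l = (dZ_l @ A_prev.T) / m                     # dW^[l] = (1/m) dZ^[l] A^[l-1]T
    db_l = np.sum(dZ_l, axis=1, keepdims=True) / m   # db^[l] = (1/m) row-wise sum of dZ^[l]
    dA_prev = W_l.T @ dZ_l                           # dA^[l-1] = W^[l]T dZ^[l]
    return dA_prev, dW_l, db_l

# For a sigmoid output layer with the cross-entropy loss above, the initial input is
# dA^[L] = -Y / A^[L] + (1 - Y) / (1 - A^[L]); iterating backward_layer from l = L down
# to l = 1 then yields the gradients of every W^[l] and b^[l].
```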