当前位置:网站首页>Deep learning: derivation of shallow neural networks and deep neural networks
Deep learning: derivation of shallow neural networks and deep neural networks
2022-07-06 08:21:00 【ShadyPi】
List of articles
I have written several blogs about neural networks before learning machine learning , Recently, I watched the video of Wu Enda's in-depth learning , The neural network is different from before , So take a note .
Basic structure and symbol convention
The basic structure is the input layer 、 Hidden layer , Middle layer , Incentive letter a a a Express , The layer label of the unit is placed in square brackets , The sample number in parentheses , So there is an input layer x x x( a [ 0 ] a^{[0]} a[0])、 Hidden layer a [ 1 ] a^{[1]} a[1] And output layer a [ 2 ] a^{[2]} a[2]. Weight is needed in the operation w w w And offset b b b, The functions in the unit are still logical functions σ ( z ) = 1 1 + e − z \sigma(z)=\frac{1}{1+e^{-z}} σ(z)=1+e−z1.
Declare data matrix X ( n × m ) X(n\times m) X(n×m), Weight matrices W [ l ] ( s l × s l − 1 ) W^{[l]}(s_l\times s_{l-1}) W[l](sl×sl−1) And bias matrix b [ l ] ( s l × 1 ) b^{[l]}(s_l\times 1) b[l](sl×1), Excitation matrix A [ l ] ( s l × m ) A^{[l]}(s_l\times m) A[l](sl×m), n [ l ] n^{[l]} n[l] It means the first one l l l The number of cells in the layer , Make
A [ 0 ] = X = [ ∣ ∣ ∣ x ( 1 ) x ( 2 ) ⋯ x ( m ) ∣ ∣ ∣ ] , W [ l ] = [ − w 1 [ l ] T − − w 2 [ l ] T − ⋯ − w n [ l ] [ l ] T − ] , b [ l ] = [ b 1 [ l ] b 2 [ l ] ⋮ b n [ l ] [ l ] ] A^{[0]}=X=\left[\begin{matrix} |&|& &|\\ x^{(1)}&x^{(2)}&\cdots&x^{(m)}\\ |&|& &|\\ \end{matrix}\right], W^{[l]}=\left[\begin{matrix} -&w_1^{[l]T}&-\\ -&w_2^{[l]T}&-\\ &\cdots&\\ -&w_{n^{[l]}}^{[l]T}&-\\ \end{matrix}\right], b^{[l]}=\left[\begin{matrix} b^{[l]}_1\\ b^{[l]}_2\\ \vdots\\ b^{[l]}_{n^{[l]}}\\ \end{matrix}\right] A[0]=X=⎣⎡∣x(1)∣∣x(2)∣⋯∣x(m)∣⎦⎤,W[l]=⎣⎢⎢⎢⎡−−−w1[l]Tw2[l]T⋯wn[l][l]T−−−⎦⎥⎥⎥⎤,b[l]=⎣⎢⎢⎢⎢⎡b1[l]b2[l]⋮bn[l][l]⎦⎥⎥⎥⎥⎤
There are also some supplements that can be seen Neural networks in machine learning .
Spread forward
Yes Neural network vectorization derivation in machine learning Bottoming , Plus forward propagation is relatively simple , Let's go directly to multiple groups of data + Multiple hidden layers .
The middle vector z [ l ] z^{[l]} z[l] by
z [ l ] = [ z 1 [ l ] z 2 [ l ] ⋮ z n [ l ] [ l ] ] = [ w 1 [ l ] T a [ l − 1 ] + b 1 [ l ] w 2 [ l ] T a [ l − 1 ] + b 2 [ l ] ⋮ w n [ l ] [ l ] T a [ l − 1 ] + b n [ l ] [ l ] ] = W [ l ] a [ l − 1 ] + b [ l ] z^{[l]}=\left[\begin{matrix} z^{[l]}_1\\ z^{[l]}_2\\ \vdots\\ z^{[l]}_{n^{[l]}}\\ \end{matrix}\right]= \left[\begin{matrix} w^{[l]T}_1a^{[l-1]}+b_1^{[l]}\\ w^{[l]T}_2a^{[l-1]}+b_2^{[l]}\\ \vdots\\ w^{[l]T}_{n^{[l]}}a^{[l-1]}+b_{n^{[l]}}^{[l]}\\ \end{matrix}\right]= W^{[l]}a^{[l-1]}+b^{[l]} z[l]=⎣⎢⎢⎢⎢⎡z1[l]z2[l]⋮zn[l][l]⎦⎥⎥⎥⎥⎤=⎣⎢⎢⎢⎢⎡w1[l]Ta[l−1]+b1[l]w2[l]Ta[l−1]+b2[l]⋮wn[l][l]Ta[l−1]+bn[l][l]⎦⎥⎥⎥⎥⎤=W[l]a[l−1]+b[l]
So by the z [ l ] ( i ) z^{[l](i)} z[l](i) A matrix of Z [ l ] Z^{[l]} Z[l] by
Z [ l ] = [ ∣ ∣ ∣ z [ l ] ( 1 ) z [ l ] ( 2 ) ⋯ z [ l ] ( m ) ∣ ∣ ∣ ] = W [ l ] A [ l − 1 ] + b [ l ] Z^{[l]}=\left[\begin{matrix} |&|& &|\\ z^{[l](1)}&z^{[l](2)}&\cdots&z^{[l](m)}\\ |&|& &|\\ \end{matrix}\right]= W^{[l]}A^{[l-1]}+b^{[l]} Z[l]=⎣⎡∣z[l](1)∣∣z[l](2)∣⋯∣z[l](m)∣⎦⎤=W[l]A[l−1]+b[l]
And the excitation matrix of the hidden layer A [ l ] A^{[l]} A[l] Namely
A [ l ] = [ ∣ ∣ ∣ a [ l ] ( 1 ) a [ l ] ( 2 ) ⋯ a [ l ] ( m ) ∣ ∣ ∣ ] = σ ( Z [ l ] ) = σ ( W [ l ] A [ l − 1 ] + b [ l ] ) A^{[l]}=\left[\begin{matrix} |&|& &|\\ a^{[l](1)}&a^{[l](2)}&\cdots&a^{[l](m)}\\ |&|& &|\\ \end{matrix}\right]=\sigma(Z^{[l]}) =\sigma(W^{[l]}A^{[l-1]}+b^{[l]}) A[l]=⎣⎡∣a[l](1)∣∣a[l](2)∣⋯∣a[l](m)∣⎦⎤=σ(Z[l])=σ(W[l]A[l−1]+b[l])
Other activation functions
Before, our neural networks were all logical functions used by logistic regression , But in fact, there are many better choices in Neural Networks .
tanh function
tanh ( z ) = e z − e − z e z + e − z \tanh(z)=\frac{e^z-e^{-z}}{e^z+e^{-z}} tanh(z)=ez+e−zez−e−z
The image below :
tanh \tanh tanh Functions are almost strictly superior to logical functions , because tanh \tanh tanh Function so that the average value of the excitation is 0 about , This can make the calculation of the next level easier . Except at the output layer , What we expect is 0 ∼ 1 0\sim 1 0∼1 Between the output , At this time, we can use logical functions in the output layer .
The derivative of this function is
tanh ′ ( z ) = 1 − ( tanh ( z ) ) 2 \tanh'(z)=1-(\tanh(z))^2 tanh′(z)=1−(tanh(z))2
ReLU function
But logical functions and tanh \tanh tanh Every function has a problem , That is when the absolute value of coordinates is very large , The gradient of the function becomes very small , In this way, when we run an algorithm similar to gradient descent, the convergence speed will become very slow . and ReLU Function can solve this problem , The expression is
ReLU ( z ) = max ( 0 , z ) \text{ReLU}(z)=\max(0,z) ReLU(z)=max(0,z)
The image is :
such , as long as z > 0 z>0 z>0, The derivative is 1, and z < 0 z<0 z<0 The time derivative is 0. Although mathematically z = 0 z=0 z=0 There is no derivative at , however z z z The value is just 0 The probability is very small , And we can artificially define it as 1 or 0, This is harmless in practical application .
In general , If you want to do a binary classification problem , We might use it tanh \tanh tanh function , Add a logic function to the output layer , At other times, it is generally the default ReLU function .
Leaky ReLU
In practice ,ReLU Functions usually perform well , But because the derivative of its negative part is 0, So for this part, its gradient descent rate will be very slow , Although in a network , We will have many positive parts , Make the whole parameters still learn at a faster speed . If you are not at ease , You can also set a smaller slope for the negative part , such as 0.01 0.01 0.01, In this way, the activation function is expressed as
Leaky ReLU ( z ) = max ( 0.01 z , z ) \text{Leaky ReLU}(z)=\max(0.01z,z) Leaky ReLU(z)=max(0.01z,z)
Back propagation
Shallow neural networks
Backward propagation is more complex , Let's first push a shallow neural network :
First , For the last cost function , Its value is
L ( a [ 2 ] , y ) = − y log a [ 2 ] − ( 1 − y ) log ( 1 − a [ 2 ] ) \mathcal{L}(a^{[2]},y)=-y\log a^{[2]}-(1-y)\log(1-a^{[2]}) L(a[2],y)=−yloga[2]−(1−y)log(1−a[2])
Yes a [ 2 ] a^{[2]} a[2] Find differentiation , Available
d L d a [ 2 ] = − y a [ 2 ] + 1 − y 1 − a [ 2 ] \frac{d\mathcal{L}}{da^{[2]}}=-\frac{y}{a^{[2]}}+\frac{1-y}{1-a^{[2]}} da[2]dL=−a[2]y+1−a[2]1−y
be z [ 2 ] z^{[2]} z[2] The differential of the cost function is
d L d z [ 2 ] = d L d a [ 2 ] d a [ 2 ] d z [ 2 ] = ( − y a [ 2 ] + 1 − y 1 − a [ 2 ] ) a [ 2 ] ( 1 − a [ 2 ] ) = a [ 2 ] − y \frac{d\mathcal{L}}{dz^{[2]}}=\frac{d\mathcal{L}}{da^{[2]}}\frac{d a^{[2]}}{dz^{[2]}}=(-\frac{y}{a^{[2]}}+\frac{1-y}{1-a^{[2]}})a^{[2]}(1-a^{[2]})=a^{[2]}-y dz[2]dL=da[2]dLdz[2]da[2]=(−a[2]y+1−a[2]1−y)a[2](1−a[2])=a[2]−y
Calculated after d L d W [ 2 ] \frac{d\mathcal{L}}{dW^{[2]}} dW[2]dL and d L d b [ 2 ] \frac{d\mathcal{L}}{db^{[2]}} db[2]dL by
d L d W [ 2 ] = d L d z [ 2 ] a [ 1 ] T d L d b [ 2 ] = d L d z [ 2 ] \frac{d\mathcal{L}}{dW^{[2]}}=\frac{d\mathcal{L}}{dz^{[2]}}a^{[1]T}\\ \frac{d\mathcal{L}}{db^{[2]}}=\frac{d\mathcal{L}}{dz^{[2]}} dW[2]dL=dz[2]dLa[1]Tdb[2]dL=dz[2]dL
Now the derivation is half done , Let's calculate again a [ 1 ] a^{[1]} a[1] The derivative of is
d L d a [ 1 ] = W [ 2 ] T d L d z [ 2 ] \frac{d\mathcal{L}}{da^{[1]}}=W^{[2]T}\frac{d\mathcal{L}}{dz^{[2]}} da[1]dL=W[2]Tdz[2]dL
because z [ 2 ] z^{[2]} z[2] yes n [ 2 ] × 1 n^{[2]}\times 1 n[2]×1 Of , W [ 2 ] W^{[2]} W[2] yes n [ 2 ] × n [ 1 ] n^{[2]}\times n^{[1]} n[2]×n[1] Of , So it needs to be transposed here . after , Ask again for z [ 1 ] z^{[1]} z[1] The derivative of , Just multiply this by d a [ 1 ] d z [ 1 ] \frac{d a^{[1]}}{dz^{[1]}} dz[1]da[1]( ∗ * ∗ Indicates bitwise multiplication ):
d L d z [ 1 ] = W [ 2 ] T d L d z [ 2 ] ∗ g [ 1 ] ′ ( z [ 1 ] ) \frac{d\mathcal{L}}{dz^{[1]}}=W^{[2]T}\frac{d\mathcal{L}}{dz^{[2]}}*g^{[1]'}(z^{[1]}) dz[1]dL=W[2]Tdz[2]dL∗g[1]′(z[1])
And calculation d L d W [ 1 ] \frac{d\mathcal{L}}{dW^{[1]}} dW[1]dL and d L d b [ 1 ] \frac{d\mathcal{L}}{db^{[1]}} db[1]dL Process and Chapter 2 The layers are almost identical :
d L d W [ 1 ] = d L d z [ 1 ] a [ 0 ] T d L d b [ 1 ] = d L d z [ 1 ] \frac{d\mathcal{L}}{dW^{[1]}}=\frac{d\mathcal{L}}{dz^{[1]}}a^{[0]T}\\ \frac{d\mathcal{L}}{db^{[1]}}=\frac{d\mathcal{L}}{dz^{[1]}} dW[1]dL=dz[1]dLa[0]Tdb[1]dL=dz[1]dL
The above derivation is for a single sample , Backward propagation is required for multiple samples , Stack the sample column vectors by column , You can apply the results derived above , It's all n [ l ] × 1 n^{[l]}\times1 n[l]×1 The matrix becomes n [ l ] × m n^{[l]}\times m n[l]×m, then b b b The vector needs to be summed once in the horizontal direction ( To simplify the expression , We use it d Z [ 2 ] dZ^{[2]} dZ[2] According to matrix Z [ 2 ] Z^{[2]} Z[2] The result of deriving the cost function , Other matrices are the same ):
d Z [ 2 ] = ( A [ 2 ] − Y ) d W [ 2 ] = 1 m d Z [ 2 ] A [ 1 ] T d b [ 2 ] = 1 m n p . s u m ( d Z [ 2 ] , a x i s = 1 , k e e p d i m s = T r u e ) d Z [ 1 ] = W [ 2 ] T d Z [ 2 ] ∗ g [ 1 ] ′ ( Z [ 1 ] ) d W [ 1 ] = 1 m d Z [ 1 ] X T d b [ 1 ] = 1 m n p . s u m ( d Z [ 1 ] , a x i s = 1 , k e e p d i m s = T r u e ) \begin{aligned} &dZ^{[2]}=(A^{[2]}-Y)\\ &dW^{[2]}=\frac{1}{m}dZ^{[2]}A^{[1]T}\\ &db^{[2]}=\frac{1}{m}np.sum(dZ^{[2]},axis=1,keepdims=True)\\ &dZ^{[1]}=W^{[2]T}dZ^{[2]}*g^{[1]'}(Z^{[1]})\\ &dW^{[1]}=\frac{1}{m}dZ^{[1]}X^T\\ &db^{[1]}=\frac{1}{m}np.sum(dZ^{[1]},axis=1,keepdims=True)\\ \end{aligned} dZ[2]=(A[2]−Y)dW[2]=m1dZ[2]A[1]Tdb[2]=m1np.sum(dZ[2],axis=1,keepdims=True)dZ[1]=W[2]TdZ[2]∗g[1]′(Z[1])dW[1]=m1dZ[1]XTdb[1]=m1np.sum(dZ[1],axis=1,keepdims=True)keepdims
The function of is to let Python Don't put our column vector (n,1) Become rank 1 Matrix (n,), That could lead to hard to find bug.
Deep neural network
The shallow network above has only two layers , And for complex problems , Increase the number of layers of the network ( depth ) It is much more effective than forcing nodes to be added in a hidden layer , So we need to transform the above derivation into a more general form , That is, the following four formulas :
d Z [ l ] = d A [ l ] ∗ g [ l ] ′ ( Z [ l ] ) d W [ l ] = 1 m d Z [ l ] A [ l − 1 ] T d b [ l ] = 1 m n p . s u m ( d Z [ l ] , a x i s = 1 , k e e p d i m s = T r u e ) d A [ l − 1 ] = W [ l ] T d Z [ l ] \begin{aligned} &dZ^{[l]}=dA^{[l]}*g^{[l]'}(Z^{[l]})\\ &dW^{[l]}=\frac{1}{m}dZ^{[l]}A^{[l-1]T}\\ &db^{[l]}=\frac{1}{m}np.sum(dZ^{[l]},axis=1,keepdims=True)\\ &dA^{[l-1]}=W^{[l]T}dZ^{[l]} \end{aligned} dZ[l]=dA[l]∗g[l]′(Z[l])dW[l]=m1dZ[l]A[l−1]Tdb[l]=m1np.sum(dZ[l],axis=1,keepdims=True)dA[l−1]=W[l]TdZ[l]
Initial input d A [ L ] dA^{[L]} dA[L] Determined by the excitation function of the output node , about m m m Group samples , Its value is
d A [ L ] = [ d L d a [ L ] ( 1 ) d L d a [ L ] ( 2 ) ⋯ d L d a [ L ] ( m ) ] dA^{[L]}=\left[\begin{matrix} \frac{d\mathcal{L}}{da^{[L](1)}}&\frac{d\mathcal{L}}{da^{[L](2)}}&\cdots&\frac{d\mathcal{L}}{da^{[L](m)}} \end{matrix}\right] dA[L]=[da[L](1)dLda[L](2)dL⋯da[L](m)dL]
From the above formula , We can input d A [ l ] dA^{[l]} dA[l], Output d A [ l − 1 ] dA^{[l-1]} dA[l−1], At the same time, calculate the weight of each layer and the gradient of offset .
边栏推荐
- 2022 Inner Mongolia latest construction tower crane (construction special operation) simulation examination question bank and answers
- [untitled]
- "Designer universe" Guangdong responds to the opinions of the national development and Reform Commission. Primary school students incarnate as small community designers | national economic and Informa
- A Closer Look at How Fine-tuning Changes BERT
- The ECU of 21 Audi q5l 45tfsi brushes is upgraded to master special adjustment, and the horsepower is safely and stably increased to 305 horsepower
- Easy to use tcp-udp_ Debug tool download and use
- Fibonacci sequence
- Huawei cloud OBS file upload and download tool class
- Migrate data from CSV files to tidb
- Personalized online cloud database hybrid optimization system | SIGMOD 2022 selected papers interpretation
猜你喜欢
[research materials] 2022 China yuancosmos white paper - Download attached
22. Empty the table
"Designer universe": "benefit dimension" APEC public welfare + 2022 the latest slogan and the new platform will be launched soon | Asia Pacific Financial Media
Fibonacci sequence
Database basic commands
Ruffian Heng embedded bimonthly, issue 49
将 NFT 设置为 ENS 个人资料头像的分步指南
C语言 - 位段
"Friendship and righteousness" of the center for national economy and information technology: China's friendship wine - the "unparalleled loyalty and righteousness" of the solidarity group released th
Nft智能合约发行,盲盒,公开发售技术实战--拼图篇
随机推荐
ESP series pin description diagram summary
[cloud native] teach you how to build ferry open source work order system
灰度升级 TiDB Operator
备份与恢复 CR 介绍
Golang DNS 随便写写
Learn Arduino with examples
[untitled]
[research materials] 2021 live broadcast annual data report of e-commerce - Download attached
Upgrade tidb with tiup
Ruffian Heng embedded bimonthly, issue 49
Go learning notes (3) basic types and statements (2)
Analysis of Top1 accuracy and top5 accuracy examples
Leetcode question brushing record | 203_ Remove linked list elements
从 SQL 文件迁移数据到 TiDB
C语言自定义类型:结构体
从 TiDB 集群迁移数据至另一 TiDB 集群
07- [istio] istio destinationrule (purpose rule)
从 CSV 文件迁移数据到 TiDB
Epoll and IO multiplexing of redis
NFT smart contract release, blind box, public offering technology practice -- jigsaw puzzle