Natural Language Processing Series (I): RNN Basics
2022-07-02 12:01:00 【raelum】
Note: this is a summary-style article with fairly terse exposition; it is not aimed at beginners.
1. Why do we need RNNs?
An ordinary MLP cannot process sequential data (text, speech, etc.) because sequences have variable length, whereas the number of neurons in an MLP's input layer is fixed.
2. RNN Structure
Structure of an ordinary MLP (taking a single hidden layer as an example):
Structure of an ordinary RNN (also called a Vanilla RNN, the term used from here on), built on top of the single-hidden-layer MLP:
That is, the hidden layer at time $t$ receives input both from the hidden layer's output at time $t-1$ and from the sample input at time $t$. Expressed as formulas:
$$h^{(t)}=\tanh(Wh^{(t-1)}+Ux^{(t)}+b),\quad o^{(t)}=Vh^{(t)}+c,\quad \hat{y}^{(t)}=\text{softmax}(o^{(t)})$$
Training an RNN is really a matter of learning the parameters $U, V, W, b, c$.
After forward propagation we need to compute the loss. Let the loss at time step $t$ be $L^{(t)}=L^{(t)}(\hat{y}^{(t)},y^{(t)})$; the total loss is then $L=\sum_{t=1}^T L^{(t)}$.
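To make the forward pass concrete, here is a minimal sketch that follows the three equations and the summed loss literally. PyTorch, the dimensions, and the random data are illustrative assumptions on my part, not code from the original post:

```python
import torch
import torch.nn.functional as F

# Illustrative sizes (not from the original post).
input_size, hidden_size, num_classes, T = 4, 8, 3, 5

# Parameters U, V, W, b, c from the equations above.
U = 0.1 * torch.randn(hidden_size, input_size)
W = 0.1 * torch.randn(hidden_size, hidden_size)
b = torch.zeros(hidden_size)
V = 0.1 * torch.randn(num_classes, hidden_size)
c = torch.zeros(num_classes)

x = torch.randn(T, input_size)           # one input sequence of length T
y = torch.randint(0, num_classes, (T,))  # one label per time step

h = torch.zeros(hidden_size)             # h^(0)
total_loss = torch.tensor(0.0)
for t in range(T):
    h = torch.tanh(W @ h + U @ x[t] + b)   # h^(t)
    o = V @ h + c                          # o^(t)
    # cross_entropy = softmax + negative log-likelihood, so this is L^(t);
    # the total loss is L = sum_t L^(t).
    total_loss = total_loss + F.cross_entropy(o.unsqueeze(0), y[t].unsqueeze(0))

print(total_loss.item())
```

If the parameter tensors were created with `requires_grad=True`, calling `total_loss.backward()` would differentiate through this unrolled loop, which is precisely what BPTT in the next subsection does.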
2.1 BPTT
BPTT (Backpropagation Through Time) is the name for how gradients are computed when training an RNN: forward propagation runs along the direction of time, while backpropagation runs against it.
To ease the derivation that follows, we first refine the notation:
$$h^{(t)}=\tanh(W_{hh}h^{(t-1)}+W_{xh}x^{(t)}+b),\quad o^{(t)}=W_{ho}h^{(t)}+c,\quad \hat{y}^{(t)}=\text{softmax}(o^{(t)})$$
Form the horizontal concatenation $W=(W_{hh},W_{xh})$ and, for simplicity, omit the bias $b$. Then
$$h^{(t)}=\tanh\left(W \begin{pmatrix} h^{(t-1)} \\ x^{(t)} \end{pmatrix} \right)$$
Next we focus on how the parameter $W$ is learned.
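Before turning to the gradient, here is a quick numerical sanity check of the concatenation identity above. This is an illustrative sketch; PyTorch and the shapes are my own choices, not part of the original post:

```python
import torch

hidden_size, input_size = 8, 4
W_hh = torch.randn(hidden_size, hidden_size)
W_xh = torch.randn(hidden_size, input_size)
h_prev = torch.randn(hidden_size)   # h^(t-1)
x_t = torch.randn(input_size)       # x^(t)

W = torch.cat([W_hh, W_xh], dim=1)  # horizontal concatenation (W_hh, W_xh)
stacked = torch.cat([h_prev, x_t])  # vertical stack (h^(t-1); x^(t))

lhs = torch.tanh(W @ stacked)
rhs = torch.tanh(W_hh @ h_prev + W_xh @ x_t)
print(torch.allclose(lhs, rhs, atol=1e-6))  # True: the two forms are identical
```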
Note that
$$\frac{\partial h^{(t)}}{\partial h^{(t-1)}}=\tanh'(W_{hh}h^{(t-1)}+W_{xh}x^{(t)})\,W_{hh},\quad \frac{\partial L}{\partial W}=\sum_{t=1}^T\frac{\partial L^{(t)}}{\partial W}$$
thus
$$\begin{aligned} \frac{\partial L^{(T)}}{\partial W}&=\frac{\partial L^{(T)}}{\partial h^{(T)}}\cdot \frac{\partial h^{(T)}}{\partial h^{(T-1)}}\cdots \frac{\partial h^{(2)}}{\partial h^{(1)}}\cdot\frac{\partial h^{(1)}}{\partial W} \\ &=\frac{\partial L^{(T)}}{\partial h^{(T)}}\cdot \prod_{t=2}^T\frac{\partial h^{(t)}}{\partial h^{(t-1)}}\cdot\frac{\partial h^{(1)}}{\partial W}\\ &=\frac{\partial L^{(T)}}{\partial h^{(T)}}\cdot \left(\prod_{t=2}^T\tanh'(W_{hh}h^{(t-1)}+W_{xh}x^{(t)})\right)\cdot W_{hh}^{T-1} \cdot\frac{\partial h^{(1)}}{\partial W} \end{aligned}$$
Since $\tanh'(\cdot)$ is almost everywhere less than $1$, the gradient vanishes once $T$ is large enough.
If no nonlinear activation is used (for simplicity, take the activation to be the identity map $f(x)=x$), then
$$\frac{\partial L^{(T)}}{\partial W}=\frac{\partial L^{(T)}}{\partial h^{(T)}}\cdot W_{hh}^{T-1} \cdot\frac{\partial h^{(1)}}{\partial W}$$
- When the largest singular value of $W_{hh}$ is greater than $1$, the gradient explodes.
- When the largest singular value of $W_{hh}$ is less than $1$, the gradient vanishes (a small numerical sketch of both cases follows this list).
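Here is the promised numerical sketch of the two cases. To keep the powers of $W_{hh}$ governed exactly by its largest singular value, it assumes a symmetric $W_{hh}$ (so singular values coincide with eigenvalue magnitudes); the sizes and the helper name `power_norm` are made up for illustration:

```python
import torch

def power_norm(top_singular_value, T=50, hidden_size=8):
    """Spectral norm of W_hh^(T-1), the factor that multiplies the gradient.

    W_hh is built symmetric (Q diag(s) Q^T) so that its singular values are
    exactly s and its (T-1)-th power is Q diag(s^(T-1)) Q^T.
    """
    Q, _ = torch.linalg.qr(torch.randn(hidden_size, hidden_size))  # random orthogonal Q
    s = torch.linspace(0.1, top_singular_value, hidden_size)       # singular values
    W_hh = Q @ torch.diag(s) @ Q.T
    return torch.linalg.matrix_norm(torch.matrix_power(W_hh, T - 1), ord=2).item()

print(power_norm(0.9))  # about 0.9**49 ~ 0.006 -> gradient vanishes
print(power_norm(1.1))  # about 1.1**49 ~ 107   -> gradient explodes
```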
3. Classification of RNNs
RNNs can be classified by the structure of their inputs and outputs as follows:
- 1 vs N (vec2seq): Image Captioning;
- N vs 1 (seq2vec): Sentiment Analysis;
- N vs M (seq2seq): Machine Translation;
- N vs N (seq2seq): Sequence Labeling (e.g., POS Tagging).
Note that 1 vs 1 is just the traditional MLP.
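To make the N vs 1 and N vs N cases concrete, here is a small sketch using `torch.nn.RNN` (shapes are made-up assumptions): keeping the per-step outputs corresponds to seq2seq-style tasks, while keeping only the final hidden state corresponds to seq2vec:

```python
import torch
import torch.nn as nn

T, batch, input_size, hidden_size = 7, 2, 4, 8
rnn = nn.RNN(input_size, hidden_size)  # default: tanh activation, input shape (T, batch, input_size)
x = torch.randn(T, batch, input_size)

outputs, h_n = rnn(x)
print(outputs.shape)  # (T, batch, hidden_size): one state per step -> N vs N / N vs M (seq2seq)
print(h_n.shape)      # (1, batch, hidden_size): final state only   -> N vs 1 (seq2vec)
```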
Classifying by internal structure instead gives:
- RNN、Bi-RNN、…
- LSTM、Bi-LSTM、…
- GRU、Bi-GRU、…
4. Pros and Cons of the Vanilla RNN
Advantages:
- Can process sequences of variable length;
- Takes historical information into account;
- Weights are shared across time;
- Model size does not grow with the length of the input.
Disadvantages:
- Computation is inefficient (the time steps must be processed sequentially);
- Gradients can vanish or explode (as we will see later, gradient clipping can prevent exploding gradients, and other RNN architectures such as the LSTM help against vanishing gradients; a clipping sketch follows this list);
- Cannot handle long sequences well (i.e., it lacks long-term memory);
- Cannot make use of future inputs (addressed by the Bi-RNN).
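As referenced in the gradient item above, a typical way to apply gradient clipping in PyTorch looks like the following sketch; the model, shapes, and hyperparameters are placeholders for illustration, not the author's setup:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder model and data, just to show where clipping fits into a training step.
rnn = nn.RNN(input_size=4, hidden_size=8)
head = nn.Linear(8, 3)
params = list(rnn.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)

x = torch.randn(20, 1, 4)       # (T, batch, input_size)
y = torch.randint(0, 3, (20,))  # one label per time step

outputs, _ = rnn(x)                # (T, 1, hidden_size)
logits = head(outputs.squeeze(1))  # (T, num_classes)
loss = F.cross_entropy(logits, y)

optimizer.zero_grad()
loss.backward()
# Rescale all gradients so their global norm is at most 1.0, preventing an exploding step.
torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
optimizer.step()
```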
5. Bidirectional RNN
Often the output $y^{(t)}$ we want may depend on the entire input sequence, so we need a bidirectional RNN (BRNN). A BRNN combines an RNN that moves forward through time from the start of the sequence with an RNN that moves backward from the end of the sequence. The two RNNs are independent and do not share weights:
The corresponding computation becomes:
$$\begin{aligned} &h^{(t)}=\tanh(W_1h^{(t-1)}+U_1x^{(t)}+b_1) \\ &g^{(t)}=\tanh(W_2g^{(t+1)}+U_2x^{(t)}+b_2) \\ &o^{(t)}=V(h^{(t)};g^{(t)})+c \\ &\hat{y}^{(t)}=\text{softmax}(o^{(t)}) \end{aligned}$$

Here $h^{(t)}$ is the forward state and $g^{(t)}$ the backward state, which depends on $g^{(t+1)}$ because it is computed against the direction of time.
where $(h^{(t)};g^{(t)})$ denotes the vertical concatenation of the two column vectors $h^{(t)}$ and $g^{(t)}$.
In fact, if we partition $V$ by columns, the third equation above can also be written as:
$$o^{(t)}=V(h^{(t)};g^{(t)})+c= (V_1,V_2) \begin{pmatrix} h^{(t)} \\ g^{(t)} \end{pmatrix}+c=V_1h^{(t)}+V_2g^{(t)}+c$$
Training a BRNN is really a matter of learning the parameters $U_1,U_2,V,W_1,W_2,b_1,b_2,c$.
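As an illustrative sketch (made-up shapes, using `torch.nn.RNN` with `bidirectional=True` rather than the hand-written equations), the per-step output is exactly the concatenation $(h^{(t)};g^{(t)})$, so a single linear layer over it plays the role of $V=(V_1,V_2)$ plus the bias $c$:

```python
import torch
import torch.nn as nn

T, batch, input_size, hidden_size, num_classes = 6, 2, 4, 8, 3
brnn = nn.RNN(input_size, hidden_size, bidirectional=True)
x = torch.randn(T, batch, input_size)

outputs, h_n = brnn(x)
print(outputs.shape)  # (T, batch, 2 * hidden_size): each step is the concatenation (h^(t); g^(t))
print(h_n.shape)      # (2, batch, hidden_size): last state produced by each direction

V = nn.Linear(2 * hidden_size, num_classes)  # plays the role of V = (V_1, V_2) with bias c
o = V(outputs)                               # o^(t) for every step
y_hat = o.softmax(dim=-1)                    # \hat{y}^(t)
```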
6. Stacked RNN
A stacked RNN, also called a multi-layer or deep RNN, is an RNN with multiple hidden layers. Taking a unidirectional RNN with two hidden layers as an example, its structure is as follows:
The corresponding computation is:
$$\begin{aligned} &h^{(t)}=\tanh(W_{hh}h^{(t-1)}+W_{xh}x^{(t)}+b_h) \\ &z^{(t)}=\tanh(W_{zz}z^{(t-1)}+W_{hz}h^{(t)}+b_z) \\ &o^{(t)}=W_{zo}z^{(t)}+b_o \\ &\hat{y}^{(t)}=\text{softmax}(o^{(t)}) \end{aligned}$$
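A minimal sketch of the same structure using `torch.nn.RNN` with `num_layers=2` (sizes are made-up assumptions): the library stacks the two hidden layers internally, the returned per-step outputs are the top layer's states $z^{(t)}$, and a linear head plus softmax gives $\hat{y}^{(t)}$:

```python
import torch
import torch.nn as nn

T, batch, input_size, hidden_size, num_classes = 6, 2, 4, 8, 3
stacked = nn.RNN(input_size, hidden_size, num_layers=2)  # two hidden layers: h^(t) and z^(t)
head = nn.Linear(hidden_size, num_classes)               # o^(t) = W_zo z^(t) + b_o

x = torch.randn(T, batch, input_size)
outputs, h_n = stacked(x)
print(outputs.shape)  # (T, batch, hidden_size): per-step states of the top layer, i.e. z^(t)
print(h_n.shape)      # (2, batch, hidden_size): final h^(T) (layer 1) and z^(T) (layer 2)

logits = head(outputs)          # o^(t)
y_hat = logits.softmax(dim=-1)  # \hat{y}^(t)
```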