Natural Language Processing Series (I) -- RNN Fundamentals
2022-07-02 12:01:00 【raelum】
Note: this is a summary article; the exposition is brief and not aimed at beginners.
1. Why RNN?

An ordinary MLP cannot process sequence data (text, speech, etc.), because sequences have variable length while the number of neurons in an MLP's input layer is fixed.
2. RNN Structure

The structure of an ordinary MLP (taking a single hidden layer as an example):

The structure of an ordinary RNN (also called a vanilla RNN, the term used below), built on top of the single-hidden-layer MLP:

That is, at time step $t$ the hidden layer receives both the hidden-layer output from time $t-1$ and the sample input at time $t$. Expressed mathematically:
$$h^{(t)}=\tanh(Wh^{(t-1)}+Ux^{(t)}+b),\quad o^{(t)}=Vh^{(t)}+c,\quad \hat{y}^{(t)}=\text{softmax}(o^{(t)})$$
Training an RNN is essentially learning the parameters $U, V, W, b, c$.
After forward propagation we need to compute the loss. Let the loss at time step $t$ be $L^{(t)}=L^{(t)}(\hat{y}^{(t)},y^{(t)})$; the total loss is then $L=\sum_{t=1}^T L^{(t)}$.
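To make the forward pass above concrete, here is a minimal NumPy sketch of a vanilla RNN cell unrolled over a toy sequence. All sizes (`input_size`, `hidden_size`, `output_size`, `T`) and the random data are placeholder assumptions, not values from this article.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Placeholder dimensions for illustration only
input_size, hidden_size, output_size, T = 4, 8, 3, 5
rng = np.random.default_rng(0)

# Parameters U, W, V, b, c from the equations above
U = rng.normal(scale=0.1, size=(hidden_size, input_size))
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
V = rng.normal(scale=0.1, size=(output_size, hidden_size))
b = np.zeros(hidden_size)
c = np.zeros(output_size)

xs = rng.normal(size=(T, input_size))   # a toy input sequence x^{(1)}, ..., x^{(T)}
h = np.zeros(hidden_size)               # h^{(0)}

for t in range(T):
    h = np.tanh(W @ h + U @ xs[t] + b)  # h^{(t)} = tanh(W h^{(t-1)} + U x^{(t)} + b)
    o = V @ h + c                       # o^{(t)} = V h^{(t)} + c
    y_hat = softmax(o)                  # y_hat^{(t)} = softmax(o^{(t)})
    print(t, y_hat)
```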
2.1 BPTT
BPTT (BackPropagation Through Time) is the term for how an RNN is trained: forward propagation proceeds along the direction of time, while backpropagation runs against it.
For convenience in the derivation that follows, we first refine the notation:
$$h^{(t)}=\tanh(W_{hh}h^{(t-1)}+W_{xh}x^{(t)}+b),\quad o^{(t)}=W_{ho}h^{(t)}+c,\quad \hat{y}^{(t)}=\text{softmax}(o^{(t)})$$
Horizontally concatenate $W=(W_{hh},W_{xh})$ and, for simplicity, omit the bias $b$; then

$$h^{(t)}=\tanh\left(W \begin{pmatrix} h^{(t-1)} \\ x^{(t)} \end{pmatrix} \right)$$

Next we focus on how the parameter $W$ is learned.
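Before continuing with the derivation, a quick NumPy check (with placeholder shapes) confirms that the concatenation is purely notational: multiplying the stacked vector by $W=(W_{hh},W_{xh})$ reproduces the two separate products.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size = 8, 4          # placeholder sizes

W_hh = rng.normal(size=(hidden_size, hidden_size))
W_xh = rng.normal(size=(hidden_size, input_size))
h_prev = rng.normal(size=hidden_size)
x_t = rng.normal(size=input_size)

# Horizontal concatenation W = (W_hh, W_xh) and vertical stacking of (h; x)
W = np.concatenate([W_hh, W_xh], axis=1)
hx = np.concatenate([h_prev, x_t])

lhs = np.tanh(W @ hx)
rhs = np.tanh(W_hh @ h_prev + W_xh @ x_t)
print(np.allclose(lhs, rhs))            # True
```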
Note that
$$\frac{\partial h^{(t)}}{\partial h^{(t-1)}}=\tanh'(W_{hh}h^{(t-1)}+W_{xh}x^{(t)})\,W_{hh},\quad \frac{\partial L}{\partial W}=\sum_{t=1}^T\frac{\partial L^{(t)}}{\partial W}$$
Thus

$$\begin{aligned} \frac{\partial L^{(T)}}{\partial W}&=\frac{\partial L^{(T)}}{\partial h^{(T)}}\cdot \frac{\partial h^{(T)}}{\partial h^{(T-1)}}\cdots \frac{\partial h^{(2)}}{\partial h^{(1)}}\cdot\frac{\partial h^{(1)}}{\partial W} \\ &=\frac{\partial L^{(T)}}{\partial h^{(T)}}\cdot \prod_{t=2}^T\frac{\partial h^{(t)}}{\partial h^{(t-1)}}\cdot\frac{\partial h^{(1)}}{\partial W}\\ &=\frac{\partial L^{(T)}}{\partial h^{(T)}}\cdot \left(\prod_{t=2}^T\tanh'(W_{hh}h^{(t-1)}+W_{xh}x^{(t)})\right)\cdot W_{hh}^{T-1} \cdot\frac{\partial h^{(1)}}{\partial W} \end{aligned}$$
Since $\tanh'(\cdot)$ is almost everywhere less than $1$, the gradient vanishes once $T$ is sufficiently large.
If no nonlinear activation is used (for simplicity, take the activation to be the identity map $f(x)=x$), then
$$\frac{\partial L^{(T)}}{\partial W}=\frac{\partial L^{(T)}}{\partial h^{(T)}}\cdot W_{hh}^{T-1} \cdot\frac{\partial h^{(1)}}{\partial W}$$
- When the largest singular value of $W_{hh}$ is greater than $1$, the gradient explodes.
- When the largest singular value of $W_{hh}$ is less than $1$, the gradient vanishes.
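The role of $W_{hh}^{T-1}$ can also be seen numerically. The sketch below assumes the identity activation discussed above and a randomly generated $W_{hh}$ rescaled to a chosen largest singular value; the hidden size, step count, and seed are placeholders.

```python
import numpy as np

def gradient_norm_through_time(largest_singular_value, T=50, hidden_size=8, seed=0):
    """Propagate a gradient back through T steps with the identity activation."""
    rng = np.random.default_rng(seed)
    W_hh = rng.normal(size=(hidden_size, hidden_size))
    # Rescale so the largest singular value is exactly the requested one
    W_hh *= largest_singular_value / np.linalg.svd(W_hh, compute_uv=False)[0]
    grad = rng.normal(size=hidden_size)      # dL/dh^{(T)}
    for _ in range(T - 1):
        grad = W_hh.T @ grad                 # one step of backpropagation through time
    return np.linalg.norm(grad)

print(gradient_norm_through_time(0.9))  # shrinks toward zero: vanishing gradient
print(gradient_norm_through_time(1.1))  # grows rapidly: exploding gradient
```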
3. Classification of RNNs
Classified by input/output structure, RNNs fall into the following types:
- 1 vs N (vec2seq): Image Captioning;
- N vs 1 (seq2vec): Sentiment Analysis;
- N vs M (seq2seq): Machine Translation;
- N vs N (seq2seq): Sequence Labeling (e.g., POS Tagging).

Note that 1 vs 1 is just the traditional MLP.
Classified by internal structure, we get:
- RNN、Bi-RNN、…
- LSTM、Bi-LSTM、…
- GRU、Bi-GRU、…
4. Pros and Cons of the Vanilla RNN
Advantages:
- It can handle sequences of variable length;
- Historical information is taken into account in the computation;
- Weights are shared across time;
- The model size does not grow with the input length.
Disadvantages:
- Computation is inefficient (the recurrence is inherently sequential);
- Gradients can vanish / explode (as we will see later, gradient clipping can mitigate explosion, and other RNN architectures such as LSTM can mitigate vanishing; a clipping sketch follows this list);
- It cannot handle long sequences (i.e., it lacks long-term memory);
- It cannot exploit future inputs (Bi-RNN addresses this).
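As a note on the gradient-clipping remedy mentioned above, the sketch below caps the global gradient norm before the optimizer step. The model, data, and hyperparameters are placeholders chosen only for illustration; PyTorch's `clip_grad_norm_` is used here as one common way to do this, not as part of the original article.

```python
import torch
import torch.nn as nn

# Placeholder model and data purely for illustration
model = nn.RNN(input_size=4, hidden_size=8, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(2, 5, 4)        # (batch, seq_len, input_size)
target = torch.randn(2, 5, 8)   # dummy regression target on the hidden states

output, h_n = model(x)
loss = nn.functional.mse_loss(output, target)

optimizer.zero_grad()
loss.backward()
# Clip the global gradient norm to 1.0 to guard against exploding gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```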
5. Bidirectional RNN
In many cases, the output $y^{(t)}$ we want may depend on the entire input sequence, which calls for a bidirectional RNN (BRNN). A BRNN combines an RNN that moves forward in time from the start of the sequence with an RNN that moves backward from the end. The two RNNs are independent and do not share weights:

The corresponding computation becomes:
$$\begin{aligned} &h^{(t)}=\tanh(W_1h^{(t-1)}+U_1x^{(t)}+b_1) \\ &g^{(t)}=\tanh(W_2g^{(t+1)}+U_2x^{(t)}+b_2) \\ &o^{(t)}=V(h^{(t)};g^{(t)})+c \\ &\hat{y}^{(t)}=\text{softmax}(o^{(t)}) \end{aligned}$$
where $(h^{(t)};g^{(t)})$ denotes the vertical concatenation of the column vectors $h^{(t)}$ and $g^{(t)}$.
In fact, if $V$ is partitioned into column blocks, the third equation above can also be written as:
$$o^{(t)}=V(h^{(t)};g^{(t)})+c= (V_1,V_2) \begin{pmatrix} h^{(t)} \\ g^{(t)} \end{pmatrix}+c=V_1h^{(t)}+V_2g^{(t)}+c$$
Training a BRNN is essentially learning the parameters $U_1,U_2,V,W_1,W_2,b_1,b_2,c$.
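Here is a NumPy sketch of the BRNN forward computation defined above, with placeholder sizes and random data; note that the backward chain $g$ is filled in by iterating from the end of the sequence.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
input_size, hidden_size, output_size, T = 4, 8, 3, 5   # placeholders

# Forward-direction parameters (W1, U1, b1) and backward-direction parameters (W2, U2, b2)
W1 = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
U1 = rng.normal(scale=0.1, size=(hidden_size, input_size))
b1 = np.zeros(hidden_size)
W2 = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
U2 = rng.normal(scale=0.1, size=(hidden_size, input_size))
b2 = np.zeros(hidden_size)
V = rng.normal(scale=0.1, size=(output_size, 2 * hidden_size))
c = np.zeros(output_size)

xs = rng.normal(size=(T, input_size))

# Forward chain: h^{(t)} depends on h^{(t-1)}
h = np.zeros((T + 1, hidden_size))
for t in range(T):
    h[t + 1] = np.tanh(W1 @ h[t] + U1 @ xs[t] + b1)

# Backward chain: g^{(t)} depends on g^{(t+1)}
g = np.zeros((T + 1, hidden_size))
for t in reversed(range(T)):
    g[t] = np.tanh(W2 @ g[t + 1] + U2 @ xs[t] + b2)

# Output at each step uses the vertical concatenation (h^{(t)}; g^{(t)})
for t in range(T):
    o = V @ np.concatenate([h[t + 1], g[t]]) + c
    print(t, softmax(o))
```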
6. Stacked RNN
A stacked RNN, also called a multi-layer or deep RNN, has multiple hidden layers. Taking a unidirectional RNN with two hidden layers as an example, its structure is as follows:

The corresponding computation is:
$$\begin{aligned} &h^{(t)}=\tanh(W_{hh}h^{(t-1)}+W_{xh}x^{(t)}+b_h) \\ &z^{(t)}=\tanh(W_{zz}z^{(t-1)}+W_{hz}h^{(t)}+b_z) \\ &o^{(t)}=W_{zo}z^{(t)}+b_o \\ &\hat{y}^{(t)}=\text{softmax}(o^{(t)}) \end{aligned}$$
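In practice, deep-learning frameworks expose stacking (and the bidirectional variant from Section 5) directly. A minimal PyTorch sketch with placeholder sizes:

```python
import torch
import torch.nn as nn

# Two stacked hidden layers, as in the equations above (sizes are placeholders)
rnn = nn.RNN(input_size=4, hidden_size=8, num_layers=2, batch_first=True)
x = torch.randn(2, 5, 4)   # (batch, seq_len, input_size)
output, h_n = rnn(x)
print(output.shape)        # (2, 5, 8): top-layer hidden state z^{(t)} at every step
print(h_n.shape)           # (num_layers, batch, hidden_size) = (2, 2, 8)

# The bidirectional structure from Section 5 is a one-flag change
birnn = nn.RNN(input_size=4, hidden_size=8, bidirectional=True, batch_first=True)
out_bi, _ = birnn(x)
print(out_bi.shape)        # (2, 5, 16): concatenation of forward and backward states
```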