Natural Language Processing Series (I): RNN Basics
2022-07-02 12:01:00 【raelum】
Note: This is a summary article. The exposition is brief and not aimed at beginners.
1. Why Do We Need RNNs?
An ordinary MLP cannot process sequential data (text, speech, etc.), because sequences have variable length, whereas the number of neurons in an MLP's input layer is fixed.
2. RNN Structure
Structure of an ordinary MLP (taking a single hidden layer as an example):

Structure of an ordinary RNN (also called a vanilla RNN; we use this term below), built on top of the single-hidden-layer MLP:

That is, at time step $t$ the hidden layer receives both the hidden-layer output from time step $t-1$ and the input sample at time step $t$. In formulas:
$$h^{(t)}=\tanh(Wh^{(t-1)}+Ux^{(t)}+b),\quad o^{(t)}=Vh^{(t)}+c,\quad \hat{y}^{(t)}=\text{softmax}(o^{(t)})$$
Training an RNN amounts to learning the parameters $U,V,W,b,c$.
After the forward pass we compute the loss. Let the loss at time step $t$ be $L^{(t)}=L^{(t)}(\hat{y}^{(t)},y^{(t)})$; the total loss is then $L=\sum_{t=1}^T L^{(t)}$.
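A minimal NumPy sketch of this forward pass and per-step prediction; the layer sizes, random initialization, and dummy inputs are illustrative assumptions rather than values from the text.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Illustrative sizes (assumptions): input dim 4, hidden dim 8, output dim 3
D_x, D_h, D_y = 4, 8, 3
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(D_h, D_h))   # hidden-to-hidden weights
U = rng.normal(scale=0.1, size=(D_h, D_x))   # input-to-hidden weights
V = rng.normal(scale=0.1, size=(D_y, D_h))   # hidden-to-output weights
b, c = np.zeros(D_h), np.zeros(D_y)

def rnn_forward(xs):
    """xs: list of input vectors x^(1..T); returns hidden states and predictions."""
    h = np.zeros(D_h)                        # h^(0)
    hs, ys = [], []
    for x in xs:
        h = np.tanh(W @ h + U @ x + b)       # h^(t) = tanh(W h^(t-1) + U x^(t) + b)
        o = V @ h + c                        # o^(t) = V h^(t) + c
        ys.append(softmax(o))                # y_hat^(t) = softmax(o^(t))
        hs.append(h)
    return hs, ys

xs = [rng.normal(size=D_x) for _ in range(5)]   # a length-5 sequence
hs, ys = rnn_forward(xs)
print(len(ys), ys[-1].sum())                    # 5 1.0
```

The same weights $W,U,V,b,c$ are reused at every time step, which is why the same loop handles sequences of any length.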
2.1 BPTT
BPTT (Backpropagation Through Time) is the term used for RNN training: the forward pass proceeds along the direction of time, while backpropagation runs backward through time.
To make the derivation easier, we first refine the notation:
$$h^{(t)}=\tanh(W_{hh}h^{(t-1)}+W_{xh}x^{(t)}+b),\quad o^{(t)}=W_{ho}h^{(t)}+c,\quad \hat{y}^{(t)}=\text{softmax}(o^{(t)})$$
Form the horizontal concatenation $W=(W_{hh},W_{xh})$ and, for simplicity, omit the bias $b$. Then
$$h^{(t)}=\tanh\left(W \begin{pmatrix} h^{(t-1)} \\ x^{(t)} \end{pmatrix} \right)$$
Next we focus on how the parameter $W$ is learned.
Note that
$$\frac{\partial h^{(t)}}{\partial h^{(t-1)}}=\tanh'(W_{hh}h^{(t-1)}+W_{xh}x^{(t)})\,W_{hh},\quad \frac{\partial L}{\partial W}=\sum_{t=1}^T\frac{\partial L^{(t)}}{\partial W}$$
thus
$$\begin{aligned} \frac{\partial L^{(T)}}{\partial W}&=\frac{\partial L^{(T)}}{\partial h^{(T)}}\cdot \frac{\partial h^{(T)}}{\partial h^{(T-1)}}\cdots \frac{\partial h^{(2)}}{\partial h^{(1)}}\cdot\frac{\partial h^{(1)}}{\partial W} \\ &=\frac{\partial L^{(T)}}{\partial h^{(T)}}\cdot \prod_{t=2}^T\frac{\partial h^{(t)}}{\partial h^{(t-1)}}\cdot\frac{\partial h^{(1)}}{\partial W}\\ &=\frac{\partial L^{(T)}}{\partial h^{(T)}}\cdot \left(\prod_{t=2}^T\tanh'(W_{hh}h^{(t-1)}+W_{xh}x^{(t)})\right)\cdot W_{hh}^{T-1} \cdot\frac{\partial h^{(1)}}{\partial W} \end{aligned}$$
Since $\tanh'(\cdot)$ is almost always less than $1$, the gradient vanishes when $T$ is large enough.
If no nonlinear activation is used (for simplicity, take the activation to be the identity map $f(x)=x$), then
$$\frac{\partial L^{(T)}}{\partial W}=\frac{\partial L^{(T)}}{\partial h^{(T)}}\cdot W_{hh}^{T-1} \cdot\frac{\partial h^{(1)}}{\partial W}$$
- When the largest singular value of $W_{hh}$ is greater than $1$, the gradient explodes.
- When the largest singular value of $W_{hh}$ is less than $1$, the gradient vanishes (both regimes are checked numerically in the sketch right after this list).
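With the identity activation, the factor $W_{hh}^{T-1}$ dominates the gradient, so its scale is governed by the singular values of $W_{hh}$. The toy check below uses matrices made up purely for illustration (a symmetric $W_{hh}$ is chosen so that its largest singular value also controls repeated products); it rescales $W_{hh}$ to a chosen largest singular value and applies it $T-1$ times to a stand-in for $\partial L^{(T)}/\partial h^{(T)}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def backprop_norm(sigma_max, T=50, d=8):
    """Norm of a vector after T-1 multiplications by W_hh with a chosen top singular value."""
    A = rng.normal(size=(d, d))
    W_hh = (A + A.T) / 2                                         # symmetric: top singular value = spectral radius
    W_hh *= sigma_max / np.linalg.svd(W_hh, compute_uv=False)[0] # rescale largest singular value to sigma_max
    v = np.ones(d) / np.sqrt(d)                                  # stand-in for dL^(T)/dh^(T)
    for _ in range(T - 1):
        v = W_hh.T @ v                                           # one backprop step through the identity-activation RNN
    return np.linalg.norm(v)

print(backprop_norm(0.9))   # tiny  -> gradient vanishes
print(backprop_norm(1.1))   # large -> gradient explodes
```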
3. Classification of RNNs
By the structure of their inputs and outputs, RNNs can be classified as follows:
- 1 vs N (vec2seq): Image Captioning;
- N vs 1 (seq2vec): Sentiment Analysis;
- N vs M (seq2seq): Machine Translation;
- N vs N (seq2seq): Sequence Labeling (e.g., POS Tagging).

Note that 1 vs 1 is just a traditional MLP.
Classifying by internal structure instead gives the following families (a minimal PyTorch illustration follows the list):
- RNN, Bi-RNN, …
- LSTM, Bi-LSTM, …
- GRU, Bi-GRU, …
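These families map directly onto PyTorch's recurrent modules; a brief sketch (the layer sizes and the dummy input are arbitrary assumptions):

```python
import torch
import torch.nn as nn

x = torch.randn(5, 3, 10)   # (seq_len=5, batch=3, input_size=10)

rnn     = nn.RNN(input_size=10, hidden_size=16)                       # vanilla RNN
bi_rnn  = nn.RNN(input_size=10, hidden_size=16, bidirectional=True)   # Bi-RNN
lstm    = nn.LSTM(input_size=10, hidden_size=16)                      # LSTM
bi_lstm = nn.LSTM(input_size=10, hidden_size=16, bidirectional=True)  # Bi-LSTM
gru     = nn.GRU(input_size=10, hidden_size=16)                       # GRU
bi_gru  = nn.GRU(input_size=10, hidden_size=16, bidirectional=True)   # Bi-GRU

out, h_n = rnn(x)
print(out.shape)            # torch.Size([5, 3, 16])
out, h_n = bi_rnn(x)
print(out.shape)            # torch.Size([5, 3, 32]): forward and backward states concatenated
```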
4. Advantages and Disadvantages of the Vanilla RNN
Advantages:
- It can handle sequences of variable length;
- The computation takes historical information into account;
- Weights are shared across time steps;
- The model size does not grow with the input length.
Disadvantages:
- Computation is slow (each step depends on the previous one, so it cannot be parallelized across time);
- Gradients can vanish or explode (as we will see later, gradient clipping can mitigate explosion, and other RNN structures such as the LSTM mitigate vanishing; see the sketch after this list);
- It cannot handle long sequences (i.e., it lacks long-term memory);
- It cannot exploit future inputs (the Bi-RNN addresses this).
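As mentioned in the list above, gradient clipping is the standard remedy for exploding gradients. A minimal sketch using PyTorch's built-in utility; the model, loss, dummy data, and the threshold `max_norm=1.0` are placeholders chosen for illustration.

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=10, hidden_size=16)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

x = torch.randn(5, 3, 10)        # dummy input  (seq_len, batch, input_size)
target = torch.randn(5, 3, 16)   # dummy target for the per-step hidden outputs

output, _ = model(x)
loss = criterion(output, target)

optimizer.zero_grad()
loss.backward()
# Rescale all gradients in place if their global norm exceeds max_norm
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```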
5. Bidirectional RNN
Often the desired output $y^{(t)}$ may depend on the entire sequence, so we need a bidirectional RNN (BRNN). A BRNN combines an RNN that moves forward through time from the start of the sequence with an RNN that moves backward through time from the end of the sequence. The two RNNs are independent and do not share weights:

The corresponding computation becomes:
$$\begin{aligned} &h^{(t)}=\tanh(W_1h^{(t-1)}+U_1x^{(t)}+b_1) \\ &g^{(t)}=\tanh(W_2g^{(t+1)}+U_2x^{(t)}+b_2) \\ &o^{(t)}=V(h^{(t)};g^{(t)})+c \\ &\hat{y}^{(t)}=\text{softmax}(o^{(t)}) \end{aligned}$$
where $(h^{(t)};g^{(t)})$ denotes the vertical concatenation of the column vectors $h^{(t)}$ and $g^{(t)}$.
In fact, if we partition $V$ into column blocks, the third equation above can also be written as:
$$o^{(t)}=V(h^{(t)};g^{(t)})+c= (V_1,V_2) \begin{pmatrix} h^{(t)} \\ g^{(t)} \end{pmatrix}+c=V_1h^{(t)}+V_2g^{(t)}+c$$
Training a BRNN amounts to learning the parameters $U_1,U_2,V,W_1,W_2,b_1,b_2,c$.
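PyTorch's `bidirectional=True` flag runs this pair of independent RNNs and already concatenates $h^{(t)}$ and $g^{(t)}$ at every step, so the output layer $V$ can be a single linear map over the concatenation. A sketch with assumed sizes:

```python
import torch
import torch.nn as nn

seq_len, batch, input_size, hidden_size, num_classes = 7, 2, 10, 16, 5

brnn = nn.RNN(input_size, hidden_size, bidirectional=True)   # forward and backward RNNs, independent weights
V = nn.Linear(2 * hidden_size, num_classes)                  # plays the role of (V1, V2) on [h^(t); g^(t)]

x = torch.randn(seq_len, batch, input_size)
states, _ = brnn(x)                 # (seq_len, batch, 2*hidden_size): [h^(t); g^(t)] at each step
o = V(states)                       # o^(t) = V1 h^(t) + V2 g^(t) + c
y_hat = o.softmax(dim=-1)           # y_hat^(t) = softmax(o^(t))
print(y_hat.shape)                  # torch.Size([7, 2, 5])
```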
6. Stacked RNN
A stacked RNN, also called a multi-layer or deep RNN, has multiple hidden layers. Taking a unidirectional RNN with two hidden layers as an example, its structure is as follows:

The corresponding computation is:
$$\begin{aligned} &h^{(t)}=\tanh(W_{hh}h^{(t-1)}+W_{xh}x^{(t)}+b_h) \\ &z^{(t)}=\tanh(W_{zz}z^{(t-1)}+W_{hz}h^{(t)}+b_z) \\ &o^{(t)}=W_{zo}z^{(t)}+b_o \\ &\hat{y}^{(t)}=\text{softmax}(o^{(t)}) \end{aligned}$$
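A sketch of the same two-hidden-layer computation using PyTorch's `num_layers` argument (the sizes are assumptions); the second layer receives the first layer's hidden states $h^{(t)}$ as its inputs, matching the equations above.

```python
import torch
import torch.nn as nn

seq_len, batch, input_size, hidden_size, num_classes = 7, 2, 10, 16, 5

stacked = nn.RNN(input_size, hidden_size, num_layers=2)   # two stacked hidden layers: h^(t), then z^(t)
W_zo = nn.Linear(hidden_size, num_classes)                # o^(t) = W_zo z^(t) + b_o

x = torch.randn(seq_len, batch, input_size)
z, h_last = stacked(x)                 # z: top-layer states z^(t); h_last: final states of both layers
y_hat = W_zo(z).softmax(dim=-1)        # y_hat^(t) = softmax(o^(t))
print(y_hat.shape, h_last.shape)       # torch.Size([7, 2, 5]) torch.Size([2, 2, 16])
```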