Hands on deep learning (37) -- Recurrent neural networks
2022-07-04 09:40:00 【Stay a little star】
Recurrent neural networks
In the language model section, we introduced the $n$-gram model, in which the conditional probability of word $x_t$ at time step $t$ depends only on the $n-1$ previous words. If we want to incorporate the possible effect of words earlier than time step $t-(n-1)$ into $x_t$, we need to increase $n$. However, the number of model parameters would also increase exponentially with it, since we need to store $|\mathcal{V}|^n$ numbers for a vocabulary $\mathcal{V}$. Hence, rather than modeling $P(x_t \mid x_{t-1}, \ldots, x_{t-n+1})$, it is preferable to use a latent variable model:
$$P(x_t \mid x_{t-1}, \ldots, x_1) \approx P(x_t \mid h_{t-1}),$$
where $h_{t-1}$ is a hidden state (also known as a latent variable) that stores the sequence information up to time step $t-1$. In general, the hidden state at time step $t$ can be computed based on both the current input $x_t$ and the previous hidden state $h_{t-1}$:
$$h_t = f(x_t, h_{t-1}).$$
For a sufficiently powerful function $f$, the latent variable model is not an approximation. After all, $h_t$ could simply store all the data observed so far. However, that could make both computation and storage expensive.
For neural networks, it is worth noting that hidden layers and hidden states refer to two very different concepts:

- Hidden layers are layers that are hidden from view on the path from input to output.
- Hidden states are, technically speaking, inputs to whatever we do at a given time step, and they can only be computed by looking at data from previous time steps.
Recurrent neural networks (RNNs) are neural networks with hidden states. Before introducing the RNN model, we first revisit the multilayer perceptron.
1. Neural networks without hidden states
Let us take a look at a multilayer perceptron with a single hidden layer. Let the hidden layer's activation function be $\phi$. Given a minibatch of samples $\mathbf{X} \in \mathbb{R}^{n \times d}$ with batch size $n$ and $d$ inputs, the hidden layer output $\mathbf{H} \in \mathbb{R}^{n \times h}$ is calculated as:
$$\mathbf{H} = \phi(\mathbf{X} \mathbf{W}_{xh} + \mathbf{b}_h).$$
In the formula above, we have the weight parameter $\mathbf{W}_{xh} \in \mathbb{R}^{d \times h}$ and the bias parameter $\mathbf{b}_h \in \mathbb{R}^{1 \times h}$ for the hidden layer, which has $h$ hidden units; the broadcasting mechanism is applied during the summation. Next, the hidden variable $\mathbf{H}$ is used as input of the output layer. The output layer is given by:
$$\mathbf{O} = \mathbf{H} \mathbf{W}_{hq} + \mathbf{b}_q,$$
where $\mathbf{O} \in \mathbb{R}^{n \times q}$ is the output variable, $\mathbf{W}_{hq} \in \mathbb{R}^{h \times q}$ is the weight parameter, and $\mathbf{b}_q \in \mathbb{R}^{1 \times q}$ is the bias parameter of the output layer. If it is a classification problem, we can use $\text{softmax}(\mathbf{O})$ to compute the probability distribution over the output categories.
This is entirely analogous to the regression problem we solved earlier in the sequence model section, hence we omit the details. Suffice it to say that we can pick feature-label pairs at random and learn the parameters of our network via automatic differentiation and stochastic gradient descent.
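As a quick illustration, here is a minimal sketch of the forward computation described above, assuming PyTorch and small illustrative shapes (the dimensions and the tanh activation are arbitrary choices, not the book's code):

```python
import torch

n, d, h, q = 2, 5, 4, 3            # batch size, input dim, hidden units, output classes (illustrative)
X = torch.randn(n, d)              # a minibatch of inputs

W_xh, b_h = torch.randn(d, h), torch.zeros(1, h)   # hidden-layer parameters
W_hq, b_q = torch.randn(h, q), torch.zeros(1, q)   # output-layer parameters

H = torch.tanh(X @ W_xh + b_h)     # H = phi(X W_xh + b_h), with phi = tanh here
O = H @ W_hq + b_q                 # O = H W_hq + b_q
probs = torch.softmax(O, dim=1)    # softmax(O) for a classification problem
print(probs.shape)                 # torch.Size([2, 3])
```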
2. Recurrent neural networks with hidden states
Matters are entirely different when we have hidden states. Let us look at the structure in some more detail.
Assume that we have a minibatch of inputs $\mathbf{X}_t \in \mathbb{R}^{n \times d}$ at time step $t$. In other words, for a minibatch of $n$ sequence samples, each row of $\mathbf{X}_t$ corresponds to one sample at time step $t$ from the sequence. Next, denote by $\mathbf{H}_t \in \mathbb{R}^{n \times h}$ the hidden variable of time step $t$. Unlike the multilayer perceptron, here we save the hidden variable $\mathbf{H}_{t-1}$ from the previous time step and introduce a new weight parameter $\mathbf{W}_{hh} \in \mathbb{R}^{h \times h}$ to describe how to use the hidden variable of the previous time step in the current time step. Specifically, the hidden variable of the current time step is determined by the input of the current time step together with the hidden variable of the previous time step:
$$\mathbf{H}_t = \phi(\mathbf{X}_t \mathbf{W}_{xh} + \mathbf{H}_{t-1} \mathbf{W}_{hh} + \mathbf{b}_h).$$
Compared with the network without a hidden state, the formula above adds one more term $\mathbf{H}_{t-1} \mathbf{W}_{hh}$ and thus instantiates $h_t = f(x_t, h_{t-1})$. From the relationship between the hidden variables $\mathbf{H}_t$ and $\mathbf{H}_{t-1}$ of adjacent time steps, we know that these variables capture and retain the sequence's historical information up to the current time step, just like the state or memory of the neural network's current time step. Therefore, such a hidden variable is called a hidden state. Since the hidden state uses the same definition of the previous time step in the current time step, the computation is recurrent. Hence, neural networks with hidden states based on recurrent computation are named recurrent neural networks, and the layers that perform this hidden-state computation are called recurrent layers.
There are many different ways to construct an RNN; RNNs with a hidden state defined by the equation above are very common. For time step $t$, the output of the output layer is similar to the computation in the multilayer perceptron:
$$\mathbf{O}_t = \mathbf{H}_t \mathbf{W}_{hq} + \mathbf{b}_q.$$
The parameters of the RNN include the weights $\mathbf{W}_{xh} \in \mathbb{R}^{d \times h}$ and $\mathbf{W}_{hh} \in \mathbb{R}^{h \times h}$ and the bias $\mathbf{b}_h \in \mathbb{R}^{1 \times h}$ of the hidden layer, together with the weight $\mathbf{W}_{hq} \in \mathbb{R}^{h \times q}$ and the bias $\mathbf{b}_q \in \mathbb{R}^{1 \times q}$ of the output layer. It is worth mentioning that the RNN always uses these same model parameters, even at different time steps. Therefore, the parameterization cost of an RNN does not grow as the number of time steps increases.
The following figure illustrates the computational logic of an RNN at three adjacent time steps. At any time step $t$, the computation of the hidden state can be treated as:

- concatenate the input $\mathbf{X}_t$ at the current time step $t$ and the hidden state $\mathbf{H}_{t-1}$ of the previous time step $t-1$;
- feed the concatenation result into a fully-connected layer with activation function $\phi$. The output of this fully-connected layer is the hidden state $\mathbf{H}_t$ of the current time step $t$.

In this case, the model parameters are the concatenation of $\mathbf{W}_{xh}$ and $\mathbf{W}_{hh}$, together with the bias $\mathbf{b}_h$. The hidden state $\mathbf{H}_t$ of the current time step $t$ will participate in computing the hidden state $\mathbf{H}_{t+1}$ of the next time step $t+1$. Moreover, $\mathbf{H}_t$ will also be fed into the fully-connected output layer to compute the output $\mathbf{O}_t$ of the current time step $t$. The recurrence is sketched in code right below.
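To make the recurrence concrete, the following is a minimal sketch, assuming PyTorch, that unrolls the hidden-state update over a few time steps with randomly generated inputs; the shapes and the tanh activation are illustrative assumptions rather than the book's implementation:

```python
import torch

n, d, h, q, T = 2, 5, 4, 3, 6             # batch, input dim, hidden units, outputs, time steps (illustrative)
W_xh, W_hh, b_h = torch.randn(d, h) * 0.1, torch.randn(h, h) * 0.1, torch.zeros(1, h)
W_hq, b_q = torch.randn(h, q) * 0.1, torch.zeros(1, q)

H = torch.zeros(n, h)                     # initial hidden state H_0
outputs = []
for t in range(T):
    X_t = torch.randn(n, d)                          # stand-in for the minibatch input at step t
    H = torch.tanh(X_t @ W_xh + H @ W_hh + b_h)      # H_t = phi(X_t W_xh + H_{t-1} W_hh + b_h)
    outputs.append(H @ W_hq + b_q)                   # O_t = H_t W_hq + b_q

# The same W_xh, W_hh, b_h, W_hq, b_q are reused at every step,
# so the parameter count is independent of the number of time steps.
print(len(outputs), outputs[0].shape)                # 6 torch.Size([2, 3])
```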

The computation of $\mathbf{X}_t \mathbf{W}_{xh} + \mathbf{H}_{t-1} \mathbf{W}_{hh}$ for the hidden state is in fact quite simple: it is equivalent to a single matrix multiplication of two concatenations. Written out:
$$\mathbf{X}_t \mathbf{W}_{xh} + \mathbf{H}_{t-1} \mathbf{W}_{hh} = \begin{bmatrix} \mathbf{X}_t & \mathbf{H}_{t-1} \end{bmatrix} \begin{bmatrix} \mathbf{W}_{xh} \\ \mathbf{W}_{hh} \end{bmatrix}.$$
While this can be proven mathematically, in the following we just use a simple code snippet to demonstrate it. To begin with, we define matrices `X`, `W_xh`, `H`, and `W_hh`, whose shapes are (3, 1), (1, 4), (3, 4), and (4, 4), respectively. Multiplying `X` by `W_xh`, and `H` by `W_hh`, and then adding these two products, we obtain a matrix of shape (3, 4).
```python
import torch
from d2l import torch as d2l

X, W_xh = torch.normal(0, 1, (3, 1)), torch.normal(0, 1, (1, 4))
H, W_hh = torch.normal(0, 1, (3, 4)), torch.normal(0, 1, (4, 4))
torch.matmul(X, W_xh) + torch.matmul(H, W_hh)
```

```
tensor([[ 2.3381, -3.0454,  1.0498,  2.0654],
        [ 2.1281,  6.2845,  0.9586,  0.6588],
        [-2.1097,  1.7296, -1.0643, -1.9253]])
```
Now we concatenate the matrices `X` and `H` along columns (axis 1), and the matrices `W_xh` and `W_hh` along rows (axis 0). These two concatenations result in matrices of shape (3, 5) and (5, 4), respectively. Multiplying these two concatenated matrices, we obtain the same output matrix of shape (3, 4) as above.
```python
torch.matmul(torch.cat((X, H), 1), torch.cat((W_xh, W_hh), 0))
```

```
tensor([[ 2.3381, -3.0454,  1.0498,  2.0654],
        [ 2.1281,  6.2845,  0.9586,  0.6588],
        [-2.1097,  1.7296, -1.0643, -1.9253]])
```
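As a small extra check not in the original snippet, the two results can be compared for numerical equality (up to floating-point error) with `torch.allclose`:

```python
torch.allclose(torch.matmul(X, W_xh) + torch.matmul(H, W_hh),
               torch.matmul(torch.cat((X, H), 1), torch.cat((W_xh, W_hh), 0)))
```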
3. Character-level language models based on recurrent neural networks
Recall that for the language model our goal is to predict the next token based on the current and past tokens, hence we shift the original sequence by one token to obtain the labels. Bengio et al. (Bengio.Ducharme.Vincent.ea.2003) first proposed to use a neural network for language modeling.
In the following we demonstrate how an RNN can be used to build a language model. Let the minibatch size be 1 and the text sequence be "machine". To simplify training in subsequent sections, we tokenize text into characters rather than words and consider a character-level language model. The following figure demonstrates how to use an RNN for character-level language modeling, predicting the next character based on the current and previous characters.

During training, we run a softmax operation on the output of the output layer for each time step, and then use the cross-entropy loss to compute the error between the model output and the label. Due to the recurrent computation of the hidden state in the hidden layer, the output $\mathbf{O}_3$ at time step 3 in the figure is determined by the text sequence "m", "a", and "c". Since the next character of the sequence in the training data is "h", the loss of time step 3 will depend on the probability distribution over the next character generated from the feature sequence "m", "a", "c" and the label "h" of this time step.
In practice, each token is represented by a $d$-dimensional vector, and we use a batch size $n > 1$. Therefore, the input $\mathbf{X}_t$ at time step $t$ will be an $n \times d$ matrix.
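To make this setup concrete, the following sketch builds the one-token-shifted input/label pairs for the sequence "machine" and one-hot encodes the input of a single time step; the toy vocabulary and the use of `torch.nn.functional.one_hot` are illustrative assumptions, not the book's implementation:

```python
import torch
import torch.nn.functional as F

text = "machine"
vocab = sorted(set(text))                        # toy character vocabulary
char_to_idx = {c: i for i, c in enumerate(vocab)}

indices = torch.tensor([char_to_idx[c] for c in text])
inputs, labels = indices[:-1], indices[1:]       # shift by one character: predict the next token

d = len(vocab)                                   # here the token dimension equals the vocabulary size
t = 2                                            # the third time step: input "c", label "h"
X_t = F.one_hot(inputs[t:t+1], d).float()        # shape (n, d) with batch size n = 1
print(X_t.shape, text[t], "->", text[t + 1])     # torch.Size([1, 7]) c -> h
```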
4. Perplexity: measuring the quality of a language model
Last, let us discuss how to measure the quality of a language model, which will be used to evaluate our RNN-based models in the following sections. One way is to check how surprising the text is. A good language model is able to predict, with high accuracy, the tokens we will see next. Consider the following continuations of the phrase "It is raining", as proposed by different language models:
- “It is raining outside”
- “It is raining banana tree”
- “It is raining piouw;kcj pwepoiut”
In terms of quality, example 1 is clearly the best. The words are sensible and logically coherent. While the model might not accurately reflect which word follows semantically ("in San Francisco" and "in winter" would have been perfectly reasonable extensions), it is able to capture which kind of word follows. Example 2 is considerably worse, producing a nonsensical continuation. Nonetheless, at least the model has learned how to spell words and some degree of correlation between words. Last, example 3 indicates a poorly trained model that does not fit the data properly.
We might measure the quality of the model by computing the likelihood of the sequence. Unfortunately, this is a number that is hard to understand and to compare. After all, shorter sequences are much more likely to occur than longer ones, so evaluating the model on Tolstoy's magnum opus *War and Peace* will inevitably produce a much smaller likelihood than on Saint-Exupery's novella *The Little Prince*. What is missing is the equivalent of an average. Information theory comes in handy here. We defined entropy, surprisal, and cross-entropy when introducing softmax regression, and more information theory is discussed in the online appendix on information theory. If we want to compress text, we can ask about predicting the next token given the current set of tokens. A better language model should allow us to predict the next token more accurately, and thus should allow us to spend fewer bits when compressing the sequence. So we can measure it by the cross-entropy loss averaged over all the $n$ tokens of a sequence:
$$\frac{1}{n} \sum_{t=1}^n -\log P(x_t \mid x_{t-1}, \ldots, x_1),$$
where $P$ is given by the language model and $x_t$ is the actual token observed at time step $t$ from the sequence. This makes the performance on documents of different lengths comparable. For historical reasons, scientists in natural language processing prefer to use a quantity called perplexity. In a nutshell, it is the exponential of the expression above:
$$\exp\left(-\frac{1}{n} \sum_{t=1}^n \log P(x_t \mid x_{t-1}, \ldots, x_1)\right).$$
Perplexity can be best understood as the harmonic mean of the number of real choices we have when deciding which token to pick next. Let us look at a number of cases:
- In the best case scenario, the model always perfectly estimates the probability of the label token as 1. In this case, the perplexity of the model is 1 (a perplexity of 1 means the token prediction is perfect).
- In the worst case scenario, the model always predicts the probability of the label token as 0. In this situation, the perplexity is positive infinity.
- At the baseline, the model predicts a uniform distribution over all the available tokens of the vocabulary. In this case, the perplexity equals the number of unique tokens in the vocabulary. In fact, if we were to store the sequence without any compression, this would be the best we could do to encode it. Hence, this provides a nontrivial upper bound that any useful model must beat.
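As a concrete check of these cases, here is a small sketch, assuming PyTorch, that computes perplexity as the exponential of the averaged negative log-likelihood; the probability values are made up for illustration:

```python
import torch

def perplexity(label_probs):
    """label_probs[t] is the probability the model assigns to the actual token at step t."""
    return torch.exp(-torch.log(label_probs).mean())

vocab_size = 28
perfect = torch.ones(5)                           # always assigns probability 1 to the label token
uniform = torch.full((5,), 1.0 / vocab_size)      # predicts a uniform distribution over the vocabulary
partial = torch.tensor([0.30, 0.05, 0.50, 0.10, 0.20])

print(perplexity(perfect))    # -> 1.0: the best case
print(perplexity(uniform))    # -> about 28.0: the vocabulary size, the uninformative baseline
print(perplexity(partial))    # somewhere in between for a partially trained model
```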
In the next few sections, we will implement recurrent neural networks for character-level language models and use perplexity to evaluate these models.
Summary
- A neural network that uses recurrent computation for hidden states is called a recurrent neural network (RNN).
- The hidden state of an RNN can capture historical information of the sequence up to the current time step.
- The number of RNN model parameters does not grow as the number of time steps increases.
- We can create character-level language models using recurrent neural networks.
- We can use perplexity to evaluate the quality of language models.
Exercises
- If we use an RNN to predict the next character in a text sequence, what is the required dimension for any output?
The dimension of the input, i.e., the size of the vocabulary.
- Why can an RNN express the conditional probability of a token at some time step based on all the previous tokens in the text sequence?
Because each time step uses the hidden state of the previous time step, which in turn encodes the information of all earlier tokens.
- What happens to the gradient if you backpropagate through a long sequence?
The gradient may explode or vanish (when the number of time steps is too large).
- What are some of the problems associated with the language model described in this section?