Hands on deep learning (40) -- long short-term memory (LSTM)
2022-07-04 09:41:00 【Stay a little star】
1. Long short-term memory (LSTM)
Long short-term memory (LSTM) networks were among the earliest approaches designed to handle long-term information preservation and short-term input skipping in latent variable models. The LSTM shares many of the properties of the gated recurrent unit (GRU), but it is more complex than the GRU and predates it by about 20 years.
1.1 Gated memory cell
The LSTM introduces a memory cell (or cell for short). Some literature regards the memory cell as a special type of hidden state: it has the same shape as the hidden state and is engineered to record additional information. To control the memory cell we need a number of gates. One gate is needed to read out the entries of the cell; we refer to this as the output gate. A second gate decides when to read data into the cell; we refer to this as the input gate. Finally, we need a mechanism to reset the contents of the cell, governed by a forget gate. The motivation for this design is the same as for the GRU: to decide, via a dedicated mechanism, when to remember and when to ignore inputs in the hidden state. Let us see how this works in practice.
- Forget gate: shrinks the cell's values toward 0.
- Input gate: decides how much of the current input to take in (i.e., whether to ignore the input data).
- Output gate: decides whether to expose the cell through the hidden state.
1.2 Input gate, forget gate, and output gate
Just as in the GRU, the input of the current time step and the hidden state of the previous time step are fed into the LSTM's gates as data, as shown in the figure below. They are processed by three fully connected layers with sigmoid activation functions, which compute the values of the input gate, the forget gate, and the output gate. Consequently, the values of all three gates lie in the range $(0, 1)$.
(figure: computing the input gate, forget gate, and output gate in an LSTM)
Mathematically, suppose there are $h$ hidden units, the batch size is $n$, and the number of inputs is $d$. Then the input is $\mathbf{X}_t \in \mathbb{R}^{n \times d}$ and the hidden state of the previous time step is $\mathbf{H}_{t-1} \in \mathbb{R}^{n \times h}$. Correspondingly, the gates at time step $t$ are defined as follows: the input gate is $\mathbf{I}_t \in \mathbb{R}^{n \times h}$, the forget gate is $\mathbf{F}_t \in \mathbb{R}^{n \times h}$, and the output gate is $\mathbf{O}_t \in \mathbb{R}^{n \times h}$. They are computed as follows:
$$\begin{aligned} \mathbf{I}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xi} + \mathbf{H}_{t-1} \mathbf{W}_{hi} + \mathbf{b}_i),\\ \mathbf{F}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xf} + \mathbf{H}_{t-1} \mathbf{W}_{hf} + \mathbf{b}_f),\\ \mathbf{O}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xo} + \mathbf{H}_{t-1} \mathbf{W}_{ho} + \mathbf{b}_o), \end{aligned}$$
where $\mathbf{W}_{xi}, \mathbf{W}_{xf}, \mathbf{W}_{xo} \in \mathbb{R}^{d \times h}$ and $\mathbf{W}_{hi}, \mathbf{W}_{hf}, \mathbf{W}_{ho} \in \mathbb{R}^{h \times h}$ are weight parameters and $\mathbf{b}_i, \mathbf{b}_f, \mathbf{b}_o \in \mathbb{R}^{1 \times h}$ are bias parameters.
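As a quick illustration, here is a minimal sketch (not from the original text; the toy sizes $n=2$, $d=8$, $h=4$ and the random stand-in parameters are made up) that evaluates the three gate equations and confirms the $(0, 1)$ value range and the $n \times h$ shapes:

```python
import torch

n, d, h = 2, 8, 4  # batch size, number of inputs, hidden units (toy values)
X_t = torch.randn(n, d)
H_prev = torch.randn(n, h)

# Random stand-ins for the learned parameters W_x*, W_h*, b_*
W_xi, W_hi, b_i = torch.randn(d, h), torch.randn(h, h), torch.zeros(h)
W_xf, W_hf, b_f = torch.randn(d, h), torch.randn(h, h), torch.zeros(h)
W_xo, W_ho, b_o = torch.randn(d, h), torch.randn(h, h), torch.zeros(h)

I_t = torch.sigmoid(X_t @ W_xi + H_prev @ W_hi + b_i)  # input gate
F_t = torch.sigmoid(X_t @ W_xf + H_prev @ W_hf + b_f)  # forget gate
O_t = torch.sigmoid(X_t @ W_xo + H_prev @ W_ho + b_o)  # output gate
print(I_t.shape, F_t.shape, O_t.shape)  # each torch.Size([2, 4]); values in (0, 1)
```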
1.3 Candidate memory cell
Next, we design the memory cell. Since the action of the various gates has not yet been specified, we first introduce the candidate memory cell $\tilde{\mathbf{C}}_t \in \mathbb{R}^{n \times h}$. Its computation is similar to that of the three gates described above, but it uses a $\tanh$ function, whose value range is $(-1, 1)$, as the activation function. This leads to the following equation at time step $t$:
$$\tilde{\mathbf{C}}_t = \tanh(\mathbf{X}_t \mathbf{W}_{xc} + \mathbf{H}_{t-1} \mathbf{W}_{hc} + \mathbf{b}_c),$$
where $\mathbf{W}_{xc} \in \mathbb{R}^{d \times h}$ and $\mathbf{W}_{hc} \in \mathbb{R}^{h \times h}$ are weight parameters and $\mathbf{b}_c \in \mathbb{R}^{1 \times h}$ is a bias parameter.
The candidate memory cell is illustrated below.
(figure: computing the candidate memory cell)
1.4 Memory cell
In the GRU, a single mechanism governs both input and forgetting (or skipping). In the LSTM, two dedicated gates serve this purpose: the input gate $\mathbf{I}_t$ controls how much new data is taken in via $\tilde{\mathbf{C}}_t$, and the forget gate $\mathbf{F}_t$ controls how much of the old memory cell content $\mathbf{C}_{t-1} \in \mathbb{R}^{n \times h}$ is retained. Using the same elementwise-multiplication trick as before, we arrive at the following update equation:
$$\mathbf{C}_t = \mathbf{F}_t \odot \mathbf{C}_{t-1} + \mathbf{I}_t \odot \tilde{\mathbf{C}}_t.$$
If the forget gate is always $1$ and the input gate is always $0$, the past memory cell $\mathbf{C}_{t-1}$ is preserved over time and passed through to the current time step, as sketched below. This design is introduced to alleviate the vanishing-gradient problem and to better capture long-range dependencies in sequences.
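A minimal numeric sketch (toy tensors, not from the original text) of this update, including the special case $\mathbf{F}_t = 1$, $\mathbf{I}_t = 0$ that carries the old cell forward unchanged:

```python
import torch

C_prev = torch.tensor([[0.5, -0.8]])   # old memory cell
C_tilda = torch.tensor([[0.9, 0.1]])   # candidate memory cell

# General case: blend old memory and candidate memory elementwise
F_t = torch.tensor([[0.7, 0.2]])
I_t = torch.tensor([[0.4, 0.9]])
C_t = F_t * C_prev + I_t * C_tilda
print(C_t)  # tensor([[ 0.7100, -0.0700]])

# Forget gate 1, input gate 0: the old cell state passes through untouched
C_keep = torch.ones_like(C_prev) * C_prev + torch.zeros_like(C_prev) * C_tilda
print(torch.equal(C_keep, C_prev))  # True
```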
This gives us the flow diagram below.
(figure: computing the memory cell)
1.5 Hidden state
Finally, we need to define how to compute the hidden state $\mathbf{H}_t \in \mathbb{R}^{n \times h}$. This is where the output gate comes into play. In the LSTM, the hidden state is simply a gated version of the $\tanh$ of the memory cell, which ensures that the values of $\mathbf{H}_t$ always lie in the interval $(-1, 1)$:
$$\mathbf{H}_t = \mathbf{O}_t \odot \tanh(\mathbf{C}_t).$$
Whenever the output gate is close to $1$, we effectively pass all memory information through to the prediction part; whenever it is close to $0$, we retain all the information within the memory cell and perform no further processing. A small sketch follows.
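Continuing the toy example (a sketch with made-up values), note how the output gate scales a re-squashed cell:

```python
import torch

C_t = torch.tensor([[2.5, -1.3]])  # cell values may leave (-1, 1)
O_t = torch.tensor([[0.9, 0.1]])   # output gate activations
H_t = O_t * torch.tanh(C_t)
print(H_t)  # roughly tensor([[ 0.8880, -0.0862]]); always inside (-1, 1)
```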
The complete data flow is illustrated below.
(figure: computing the hidden state; the complete LSTM data flow)
2. Implementing an LSTM from scratch
```python
import torch
from torch import nn
from d2l import torch as d2l

# Load the time machine dataset iterator
batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)
```
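For orientation, a small sketch (not from the original text) of what the iterator yields, assuming the d2l pipeline returns `(X, Y)` index tensors of shape `(batch_size, num_steps)`:

```python
# Peek at one minibatch: X holds token indices, Y is X shifted one step ahead
X, Y = next(iter(train_iter))
print(X.shape, Y.shape)  # expected: torch.Size([32, 35]) torch.Size([32, 35])
print(len(vocab))        # character-level vocabulary size (28 for this corpus)
```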
2.1 Initialize model parameters
```python
# Initialize weights from a Gaussian distribution with standard deviation 0.01;
# biases are initialized to zero
def get_lstm_params(vocab_size, num_hiddens, device):
    num_inputs = num_outputs = vocab_size

    def normal(shape):
        return torch.randn(size=shape, device=device) * 0.01

    def three():
        return (normal((num_inputs, num_hiddens)),
                normal((num_hiddens, num_hiddens)),
                torch.zeros(num_hiddens, device=device))

    W_xi, W_hi, b_i = three()  # Input gate parameters
    W_xf, W_hf, b_f = three()  # Forget gate parameters
    W_xo, W_ho, b_o = three()  # Output gate parameters
    W_xc, W_hc, b_c = three()  # Candidate memory cell parameters
    # Output layer parameters
    W_hq = normal((num_hiddens, num_outputs))
    b_q = torch.zeros(num_outputs, device=device)
    # Attach gradients
    params = [
        W_xi, W_hi, b_i, W_xf, W_hf, b_f, W_xo, W_ho, b_o, W_xc, W_hc, b_c,
        W_hq, b_q]
    for param in params:
        param.requires_grad_(True)
    return params
```
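As a sanity check (a sketch, not part of the original), we can instantiate the parameters and count them; the totals below assume the 28-token character vocabulary and 256 hidden units:

```python
params = get_lstm_params(vocab_size=28, num_hiddens=256, device=d2l.try_gpu())
print(len(params))  # 14 tensors: (W_x, W_h, b) for each of the 4 gates/cells, plus W_hq and b_q
total = sum(p.numel() for p in params)
print(total)  # 4 * (28*256 + 256*256 + 256) + 256*28 + 28 = 299036
```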
2.2 Define the network model
In the initialization function, the hidden state of the LSTM needs to return an additional memory cell, with value $0$ and shape (batch size, number of hidden units).
```python
def init_lstm_state(batch_size, num_hiddens, device):
    return (torch.zeros((batch_size, num_hiddens), device=device),
            torch.zeros((batch_size, num_hiddens), device=device))

# The actual model is defined as before, providing three gates and an extra
# memory cell. Note that only the hidden state is passed to the output layer;
# the memory cell does not directly participate in the output computation.
def lstm(inputs, state, params):
    [W_xi, W_hi, b_i, W_xf, W_hf, b_f, W_xo, W_ho, b_o, W_xc, W_hc, b_c,
     W_hq, b_q] = params
    (H, C) = state
    outputs = []
    for X in inputs:
        I = torch.sigmoid((X @ W_xi) + (H @ W_hi) + b_i)
        F = torch.sigmoid((X @ W_xf) + (H @ W_hf) + b_f)
        O = torch.sigmoid((X @ W_xo) + (H @ W_ho) + b_o)
        C_tilda = torch.tanh((X @ W_xc) + (H @ W_hc) + b_c)
        C = F * C + I * C_tilda
        H = O * torch.tanh(C)
        Y = (H @ W_hq) + b_q
        outputs.append(Y)
    return torch.cat(outputs, dim=0), (H, C)
```
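Before training, a minimal smoke test of the forward pass (a sketch; the one-hot shaping below mirrors what d2l.RNNModelScratch does internally before calling our lstm function):

```python
vocab_size, num_hiddens, device = 28, 256, d2l.try_gpu()
params = get_lstm_params(vocab_size, num_hiddens, device)
state = init_lstm_state(batch_size=2, num_hiddens=num_hiddens, device=device)
# Fake 5 time steps for a batch of 2: one-hot inputs of shape (num_steps, batch, vocab)
idx = torch.randint(0, vocab_size, (5, 2))
X = torch.nn.functional.one_hot(idx, vocab_size).float().to(device)
Y, (H, C) = lstm(X, state, params)
print(Y.shape, H.shape, C.shape)
# torch.Size([10, 28]) torch.Size([2, 256]) torch.Size([2, 256])
```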
2.3 Training and prediction
```python
vocab_size, num_hiddens, device = len(vocab), 256, d2l.try_gpu()
num_epochs, lr = 500, 1
model = d2l.RNNModelScratch(len(vocab), num_hiddens, device, get_lstm_params,
                            init_lstm_state, lstm)
d2l.train_ch8(model, train_iter, vocab, lr, num_epochs, device)
```
```
perplexity 1.1, 49112.9 tokens/sec on cuda:0
time traveller for so it will be convenient to speak of himwas e
traveller abcerthen thing the time traveller held in his ha
```
(figure: training perplexity curve)
2.4 Concise implementation
```python
num_inputs = vocab_size
lstm_layer = nn.LSTM(num_inputs, num_hiddens)
model = d2l.RNNModel(lstm_layer, len(vocab))
model = model.to(device)
d2l.train_ch8(model, train_iter, vocab, lr, num_epochs, device)
```
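For reference, a small sketch (not from the original) of nn.LSTM's raw interface outside the d2l wrapper; PyTorch defaults to the (num_steps, batch_size, input_size) layout:

```python
lstm_demo = nn.LSTM(input_size=28, hidden_size=256)
X = torch.randn(35, 32, 28)   # (num_steps, batch_size, input_size)
H0 = torch.zeros(1, 32, 256)  # initial hidden state
C0 = torch.zeros(1, 32, 256)  # initial memory cell
output, (Hn, Cn) = lstm_demo(X, (H0, C0))
print(output.shape, Hn.shape, Cn.shape)
# torch.Size([35, 32, 256]) torch.Size([1, 32, 256]) torch.Size([1, 32, 256])
```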
```
perplexity 1.1, 281347.3 tokens/sec on cuda:0
time traveller for so it will be convenient to speak of himwas e
travelleryou can show black is white by argument said filby
```
(figure: training perplexity curve)
The LSTM is the prototypical latent variable autoregressive model with nontrivial state control. Many variants have been proposed over the years: multiple layers, residual connections, different types of regularization, and so on. However, because of the long-range dependencies within sequences, training LSTMs and other sequence models (such as gated recurrent units) is quite costly. Later on we will encounter alternative models, such as the Transformer, that can be used in some cases.
Summary
- LSTMs have three kinds of gates that control the flow of information: the input gate, the forget gate, and the output gate.
- The hidden-layer output of an LSTM includes both the hidden state and the memory cell. Only the hidden state is passed to the output layer; the memory cell is entirely internal.
- LSTMs can alleviate vanishing and exploding gradients.
Exercises
- How would you need to change the model to generate proper words rather than sequences of characters?
At input time we need to build the vocabulary over words rather than characters, although the one-hot encoding may then become very large. In data preprocessing, assign each word its own index and one-hot encode those indices; see the sketch below.
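A hedged sketch of that change, assuming the d2l helpers read_time_machine, tokenize, and Vocab behave as in the d2l package:

```python
# Tokenize by words instead of characters, then index each word
lines = d2l.read_time_machine()
tokens = d2l.tokenize(lines, token='word')  # the sections above used token='char'
word_vocab = d2l.Vocab(tokens, min_freq=2)  # drop rare words to keep the one-hot size manageable
print(len(word_vocab))                      # far larger than the 28-character vocabulary
```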
- Given a fixed hidden-layer dimension, compare the computational cost of gated recurrent units, LSTMs, and vanilla recurrent neural networks, paying particular attention to the cost of training and inference. A rough parameter count is sketched below.
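Not a full answer, but a rough parameter count under the definitions above (one (W_x, W_h, b) set for a vanilla RNN, three for a GRU, four for an LSTM); per-time-step compute scales roughly with these counts for both training and inference:

```python
def recurrent_params(d, h, num_sets):
    # Each set holds W_x (d*h), W_h (h*h), and a bias (h)
    return num_sets * (d * h + h * h + h)

d, h = 28, 256
print(recurrent_params(d, h, 1))  # vanilla RNN:  72960
print(recurrent_params(d, h, 3))  # GRU:         218880
print(recurrent_params(d, h, 4))  # LSTM:        291840
```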
- Since the candidate memory cell already uses a $\tanh$ function to keep its value range in $(-1, 1)$, why does the hidden state need to apply $\tanh$ again to keep the output range in $(-1, 1)$?
For reference, the full set of LSTM equations:

$$\begin{aligned} \mathbf{I}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xi} + \mathbf{H}_{t-1} \mathbf{W}_{hi} + \mathbf{b}_i),\\ \mathbf{F}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xf} + \mathbf{H}_{t-1} \mathbf{W}_{hf} + \mathbf{b}_f),\\ \mathbf{O}_t &= \sigma(\mathbf{X}_t \mathbf{W}_{xo} + \mathbf{H}_{t-1} \mathbf{W}_{ho} + \mathbf{b}_o),\\ \tilde{\mathbf{C}}_t &= \tanh(\mathbf{X}_t \mathbf{W}_{xc} + \mathbf{H}_{t-1} \mathbf{W}_{hc} + \mathbf{b}_c),\\ \mathbf{C}_t &= \mathbf{F}_t \odot \mathbf{C}_{t-1} + \mathbf{I}_t \odot \tilde{\mathbf{C}}_t,\\ \mathbf{H}_t &= \mathbf{O}_t \odot \tanh(\mathbf{C}_t). \end{aligned}$$

The cell update shows that $\mathbf{C}_t$ is a running sum of gated terms, so even though $\tilde{\mathbf{C}}_t \in (-1, 1)$, the accumulated $\mathbf{C}_t$ need not stay in $(-1, 1)$; the second $\tanh$ squashes the cell back into that range before the output gate is applied.
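A tiny numeric sketch (made-up values) of why the second $\tanh$ matters: with the forget and input gates near $1$, the cell state accumulates and leaves $(-1, 1)$:

```python
import torch

C = torch.zeros(1)
C_tilda = torch.tanh(torch.tensor([0.9]))  # candidate values stay inside (-1, 1)
for _ in range(5):
    C = 1.0 * C + 1.0 * C_tilda            # forget gate ~1, input gate ~1
print(C)              # ~3.58: the cell itself has left (-1, 1)
print(torch.tanh(C))  # ~0.998: the second tanh squashes it back
```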