Hands on deep learning (41) -- Deep recurrent neural network (deep RNN)
2022-07-04 09:41:00 【Stay a little star】
This section is straightforward: it simply adds the notion of stacked layers on top of the RNN. After all, the RNNs, GRUs, and LSTMs we discussed before were all built on a single-layer network.
1. Deep recurrent neural networks
So far, we have only discussed recurrent neural networks with a single unidirectional hidden layer, in which the specific functional form of how latent variables and observations interact is fairly arbitrary. This is not a big problem as long as we have enough flexibility to model different types of interactions. With a single layer, however, this can be quite challenging. In the case of linear models, we solved the problem by adding more layers. In a recurrent neural network this is trickier, because we first need to decide how and where the extra nonlinearity should be added.
In fact, we can stack multiple recurrent neural network layers on top of each other; the combination of several simple layers yields a flexible mechanism. In particular, data may be relevant at different levels of the stack. For example, we might want to keep high-level information about financial market conditions (bear market or bull market) available, while at the lower levels we only record short-term temporal dynamics.
The figure below depicts a deep recurrent neural network with $L$ hidden layers. Each hidden state is passed both to the next time step of the current layer and to the current time step of the next layer. (It is instructive to compare this with the earlier single-layer RNN figure: $H^{(1)}_1$ and $H^{(1)}_2$ correspond to what used to be the outputs; now they also serve as the inputs of the neurons in the next layer.)

(Figure: a deep recurrent neural network with $L$ hidden layers.)
2. Functional dependencies
We can formalize the functional dependencies within this deep architecture of $L$ hidden layers. Suppose we have a minibatch input $\mathbf{X}_t \in \mathbb{R}^{n \times d}$ at time step $t$ (number of examples: $n$, number of inputs per example: $d$). At the same time step, let the hidden state of the $l^\mathrm{th}$ hidden layer ($l=1,\ldots,L$) be $\mathbf{H}_t^{(l)} \in \mathbb{R}^{n \times h}$ (number of hidden units: $h$), and the output layer variable be $\mathbf{O}_t \in \mathbb{R}^{n \times q}$ (number of outputs: $q$). Setting $\mathbf{H}_t^{(0)} = \mathbf{X}_t$, the hidden state of the $l^\mathrm{th}$ hidden layer, which uses the activation function $\phi_l$, is given by:
$$\mathbf{H}_t^{(l)} = \phi_l(\mathbf{H}_t^{(l-1)} \mathbf{W}_{xh}^{(l)} + \mathbf{H}_{t-1}^{(l)} \mathbf{W}_{hh}^{(l)} + \mathbf{b}_h^{(l)}),$$
where the weights $\mathbf{W}_{xh}^{(l)} \in \mathbb{R}^{h \times h}$ and $\mathbf{W}_{hh}^{(l)} \in \mathbb{R}^{h \times h}$, together with the bias $\mathbf{b}_h^{(l)} \in \mathbb{R}^{1 \times h}$, are the model parameters of the $l^\mathrm{th}$ hidden layer. Finally, the output layer computation is based only on the final hidden state of the $L^\mathrm{th}$ hidden layer:
$$\mathbf{O}_t = \mathbf{H}_t^{(L)} \mathbf{W}_{hq} + \mathbf{b}_q,$$
where the weight $\mathbf{W}_{hq} \in \mathbb{R}^{h \times q}$ and the bias $\mathbf{b}_q \in \mathbb{R}^{1 \times q}$ are the model parameters of the output layer.
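To make these dependencies concrete, here is a minimal from-scratch sketch of one time step of this computation (my own illustration, not from the original post; the function name deep_rnn_step and the parameter packing are hypothetical):

import torch

def deep_rnn_step(X_t, H_prev, hidden_params, out_params, phi=torch.tanh):
    # X_t: (n, d) minibatch at time step t.
    # H_prev: list of L tensors, each (n, h), the hidden states from step t-1.
    # hidden_params: list of per-layer (W_xh, W_hh, b_h); out_params: (W_hq, b_q).
    H_t, inp = [], X_t  # H_t^(0) = X_t
    for (W_xh, W_hh, b_h), H_l in zip(hidden_params, H_prev):
        # H_t^(l) = phi(H_t^(l-1) W_xh^(l) + H_{t-1}^(l) W_hh^(l) + b_h^(l))
        h = phi(inp @ W_xh + H_l @ W_hh + b_h)
        H_t.append(h)
        inp = h  # this layer's state becomes the next layer's input
    W_hq, b_q = out_params
    O_t = H_t[-1] @ W_hq + b_q  # the output uses only the top layer's state
    return O_t, H_t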
As with multilayer perceptrons, the number of hidden layers $L$ and the number of hidden units $h$ are hyperparameters; that is, they can be tuned or specified by us. In addition, replacing the hidden-state computation above with that of a gated recurrent unit (GRU) or a long short-term memory network (LSTM) readily yields a deep gated recurrent neural network. (In the course, Mu Li said that choosing between LSTM and GRU is largely a matter of taste, with no essential difference. Of course, there is also the even stronger Transformer that can be applied here; after all, "Attention Is All You Need" was not written for fun.)
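With PyTorch's high-level API, for example, going from a deep LSTM to a deep GRU is a one-line change. A sketch with illustrative hyperparameter values:

from torch import nn

num_inputs, num_hiddens, num_layers = 28, 256, 2  # illustrative values
deep_lstm = nn.LSTM(num_inputs, num_hiddens, num_layers)  # deep LSTM
deep_gru = nn.GRU(num_inputs, num_hiddens, num_layers)    # deep GRU instead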
3. Concise implementation
import torch
from torch import nn
from d2l import torch as d2l
batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)
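As a quick sanity check (my addition, not in the original post): load_data_time_machine tokenizes the corpus at the character level, so the vocabulary is small.

print(len(vocab))  # 28: 26 letters plus space and <unk>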
Downloading ../data/timemachine.txt from http://d2l-data.s3-accelerate.amazonaws.com/timemachine.txt...
3.1 Network construction
# The hyperparameter choices mirror the single-layer LSTM: the numbers of
# inputs and outputs are still determined by the vocabulary, and we keep
# 256 hidden units; the only change is the extra hidden layer.
vocab_size, num_hiddens, num_layers = len(vocab), 256, 2
num_inputs = vocab_size
device = d2l.try_gpu()
lstm_layers = nn.LSTM(num_inputs, num_hiddens, num_layers)
model = d2l.RNNModel(lstm_layers, len(vocab))
model = model.to(device)
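Before training, it can be worth inspecting the shape of the recurrent state (my addition; this assumes d2l.RNNModel exposes the begin_state method from the book's PyTorch code). For an LSTM the state is an (H, C) pair, and the extra layer shows up in the leading dimension:

state = model.begin_state(device=device, batch_size=batch_size)
print(state[0].shape)  # expected: torch.Size([2, 32, 256]), i.e. (num_layers, batch_size, num_hiddens)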
3.2 Training and prediction
num_epochs, lr = 500, 2
d2l.train_ch8(model, train_iter, vocab, lr, num_epochs, device)
perplexity 1.0, 161938.1 tokens/sec on cuda:0
time traveller for so it will be convenient to speak of himwas e
traveller with a slight accession ofcheerfulness really thi
4. Summary
- In deep recurrent neural networks, hidden-state information is passed both to the next time step of the current layer and to the current time step of the next layer.
- Deep recurrent neural networks come in many flavors, such as stacked LSTM, GRU, or vanilla RNN networks. All of these models are covered by the high-level APIs of deep learning frameworks.
- In general, deep RNNs take a fair amount of work (learning-rate tuning, gradient clipping) to ensure proper convergence, and the model must be initialized carefully; see the clipping sketch after this list.
- The two-layer model above essentially converges after about 200 epochs, while the earlier single-layer model needed close to 300. The extra layer adds nonlinearity, which speeds up convergence but also raises the risk of overfitting; particularly deep RNN networks are therefore uncommon in practice.
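On the clipping point: d2l.train_ch8 already clips gradients internally. For reference, a minimal version of that standard step looks roughly like this (a sketch along the lines of the book's grad_clipping helper):

import torch

def grad_clipping(net, theta):
    # Rescale all gradients so that their joint L2 norm is at most theta.
    params = [p for p in net.parameters() if p.requires_grad]
    norm = torch.sqrt(sum(torch.sum(p.grad ** 2) for p in params))
    if norm > theta:
        for param in params:
            param.grad[:] *= theta / norm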