Hands on deep learning (41) -- Deep recurrent neural network (deep RNN)
2022-07-04 09:41:00 【Stay a little star】
This section is straightforward: it simply adds the notion of stacked layers on top of the RNN. After all, the RNNs, GRUs, and LSTMs we discussed before were all built on a single-layer network.
1. Deep recurrent neural networks
So far, we have only discussed recurrent neural networks with a single unidirectional hidden layer, in which the specific functional form of how latent variables and observations interact is fairly arbitrary. This is not a big problem as long as we have enough flexibility to model different types of interactions. With a single layer, however, this can be quite challenging. In the case of linear models, we solved the problem by adding more layers. In a recurrent neural network this is trickier, because we first need to decide how and where the extra nonlinearity should be added.
In fact, we can stack multiple recurrent neural network layers on top of each other; the combination of several simple layers yields a flexible mechanism. In particular, data may be relevant at different levels of the stack. For example, we might want to keep high-level information about financial market conditions (bear market or bull market) available, while at the lower levels we only record short-term temporal dynamics.
The figure below depicts a deep recurrent neural network with $L$ hidden layers. Each hidden state is passed both to the next time step of the current layer and to the current time step of the next layer. (It is instructive to compare this with the earlier single-layer RNN figure: $H^{(1)}_1$ and $H^{(1)}_2$ correspond to what used to be the outputs; now they also serve as the inputs of the neurons in the next layer.)

(Figure: a deep recurrent neural network with $L$ hidden layers.)
2. Functional dependencies
We can formalize the functional dependencies within this deep architecture of $L$ hidden layers. Suppose we have a minibatch input $\mathbf{X}_t \in \mathbb{R}^{n \times d}$ at time step $t$ (number of examples: $n$, number of inputs per example: $d$). At the same time step, let the hidden state of the $l^\mathrm{th}$ hidden layer ($l=1,\ldots,L$) be $\mathbf{H}_t^{(l)} \in \mathbb{R}^{n \times h}$ (number of hidden units: $h$), and the output layer variable be $\mathbf{O}_t \in \mathbb{R}^{n \times q}$ (number of outputs: $q$). Setting $\mathbf{H}_t^{(0)} = \mathbf{X}_t$, the hidden state of the $l^\mathrm{th}$ hidden layer, which uses the activation function $\phi_l$, is given by:
$$\mathbf{H}_t^{(l)} = \phi_l(\mathbf{H}_t^{(l-1)} \mathbf{W}_{xh}^{(l)} + \mathbf{H}_{t-1}^{(l)} \mathbf{W}_{hh}^{(l)} + \mathbf{b}_h^{(l)}),$$
where the weights $\mathbf{W}_{xh}^{(l)} \in \mathbb{R}^{h \times h}$ and $\mathbf{W}_{hh}^{(l)} \in \mathbb{R}^{h \times h}$, together with the bias $\mathbf{b}_h^{(l)} \in \mathbb{R}^{1 \times h}$, are the model parameters of the $l^\mathrm{th}$ hidden layer. Finally, the output layer computation is based only on the final hidden state of the $L^\mathrm{th}$ hidden layer:
$$\mathbf{O}_t = \mathbf{H}_t^{(L)} \mathbf{W}_{hq} + \mathbf{b}_q,$$
where the weight $\mathbf{W}_{hq} \in \mathbb{R}^{h \times q}$ and the bias $\mathbf{b}_q \in \mathbb{R}^{1 \times q}$ are the model parameters of the output layer.
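To make these dependencies concrete, here is a minimal from-scratch sketch of one time step of this computation (my own illustration, not from the original post; the function name deep_rnn_step and the parameter packing are hypothetical):

import torch

def deep_rnn_step(X_t, H_prev, hidden_params, out_params, phi=torch.tanh):
    # X_t: (n, d) minibatch at time step t.
    # H_prev: list of L tensors, each (n, h), the hidden states from step t-1.
    # hidden_params: list of per-layer (W_xh, W_hh, b_h); out_params: (W_hq, b_q).
    H_t, inp = [], X_t  # H_t^(0) = X_t
    for (W_xh, W_hh, b_h), H_l in zip(hidden_params, H_prev):
        # H_t^(l) = phi(H_t^(l-1) W_xh^(l) + H_{t-1}^(l) W_hh^(l) + b_h^(l))
        h = phi(inp @ W_xh + H_l @ W_hh + b_h)
        H_t.append(h)
        inp = h  # this layer's state becomes the next layer's input
    W_hq, b_q = out_params
    O_t = H_t[-1] @ W_hq + b_q  # the output uses only the top layer's state
    return O_t, H_t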
As with multilayer perceptrons, the number of hidden layers $L$ and the number of hidden units $h$ are hyperparameters; that is, they can be tuned or specified by us. In addition, replacing the hidden-state computation above with that of a gated recurrent unit (GRU) or a long short-term memory network (LSTM) readily yields a deep gated recurrent neural network. (In the course, Mu Li said that choosing between LSTM and GRU is largely a matter of taste, with no essential difference. Of course, there is also the even stronger Transformer that can be applied here; after all, "Attention Is All You Need" was not written for fun.)
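With PyTorch's high-level API, for example, going from a deep LSTM to a deep GRU is a one-line change. A sketch with illustrative hyperparameter values:

from torch import nn

num_inputs, num_hiddens, num_layers = 28, 256, 2  # illustrative values
deep_lstm = nn.LSTM(num_inputs, num_hiddens, num_layers)  # deep LSTM
deep_gru = nn.GRU(num_inputs, num_hiddens, num_layers)    # deep GRU instead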
3. Concise implementation
import torch
from torch import nn
from d2l import torch as d2l
batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)
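As a quick sanity check (my addition, not in the original post): load_data_time_machine tokenizes the corpus at the character level, so the vocabulary is small.

print(len(vocab))  # 28: 26 letters plus space and <unk>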
Downloading ../data/timemachine.txt from http://d2l-data.s3-accelerate.amazonaws.com/timemachine.txt...
3.1 Network construction
# The hyperparameter choices mirror the single-layer LSTM: the numbers of
# inputs and outputs are still determined by the vocabulary, and we keep
# 256 hidden units; the only change is the extra hidden layer.
vocab_size, num_hiddens, num_layers = len(vocab), 256, 2
num_inputs = vocab_size
device = d2l.try_gpu()
lstm_layers = nn.LSTM(num_inputs, num_hiddens, num_layers)
model = d2l.RNNModel(lstm_layers, len(vocab))
model = model.to(device)
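Before training, it can be worth inspecting the shape of the recurrent state (my addition; this assumes d2l.RNNModel exposes the begin_state method from the book's PyTorch code). For an LSTM the state is an (H, C) pair, and the extra layer shows up in the leading dimension:

state = model.begin_state(device=device, batch_size=batch_size)
print(state[0].shape)  # expected: torch.Size([2, 32, 256]), i.e. (num_layers, batch_size, num_hiddens)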
3.2 Training and prediction
num_epochs, lr = 500, 2
d2l.train_ch8(model, train_iter, vocab, lr, num_epochs, device)
perplexity 1.0, 161938.1 tokens/sec on cuda:0
time traveller for so it will be convenient to speak of himwas e
traveller with a slight accession ofcheerfulness really thi
4. Summary
- In deep recurrent neural networks, hidden-state information is passed both to the next time step of the current layer and to the current time step of the next layer.
- Deep recurrent neural networks come in many flavors, such as stacked LSTM, GRU, or vanilla RNN networks. All of these models are covered by the high-level APIs of deep learning frameworks.
- In general, deep RNNs take a fair amount of work (learning-rate tuning, gradient clipping) to ensure proper convergence, and the model must be initialized carefully; see the clipping sketch after this list.
- The two-layer model above essentially converges after about 200 epochs, while the earlier single-layer model needed close to 300. The extra layer adds nonlinearity, which speeds up convergence but also raises the risk of overfitting; particularly deep RNN networks are therefore uncommon in practice.
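On the clipping point: d2l.train_ch8 already clips gradients internally. For reference, a minimal version of that standard step looks roughly like this (a sketch along the lines of the book's grad_clipping helper):

import torch

def grad_clipping(net, theta):
    # Rescale all gradients so that their joint L2 norm is at most theta.
    params = [p for p in net.parameters() if p.requires_grad]
    norm = torch.sqrt(sum(torch.sum(p.grad ** 2) for p in params))
    if norm > theta:
        for param in params:
            param.grad[:] *= theta / norm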