LSTM neural network

Long short-term memory (LSTM) is a variant of the recurrent neural network (RNN) that can effectively mitigate the gradient exploding and vanishing problems of simple recurrent networks.
The three gates of LSTM
The LSTM network introduces a gating mechanism to control the paths along which information is passed. The three gates are the input gate $i_t$, the forget gate $f_t$, and the output gate $o_t$. Their functions are:
(1) The input gate $i_t$ controls how much information of the current candidate state $\tilde{c}_t$ needs to be saved.
(2) The forget gate $f_t$ controls how much information of the previous internal state $c_{t-1}$ needs to be forgotten.
(3) The output gate $o_t$ controls how much information of the current internal state $c_t$ needs to be output to the external state $h_t$.
When $f_t = 0$ and $i_t = 1$, the memory cell clears the history and writes in the candidate state vector $\tilde{c}_t$; even then, the memory cell $c_t$ remains related to the history of the previous time step, because $\tilde{c}_t$ is computed from $h_{t-1}$. When $f_t = 1$ and $i_t = 0$, the memory cell copies the content of the previous time step and writes in no new information.
The "gates" of the LSTM network are "soft" gates: their values lie in (0, 1), meaning that information is let through in a certain proportion. The three gates are computed as:
$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$
$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$
$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$
where $\sigma$ is the logistic function, whose output range is (0, 1), $x_t$ is the input at the current time step, and $h_{t-1}$ is the external state at the previous time step.
The LSTM computation process
The figure below shows the structure of the LSTM recurrent unit. [Figure: LSTM recurrent unit structure]
The computation proceeds in three steps:
1) First, use the external state $h_{t-1}$ of the previous time step and the input $x_t$ of the current time step to compute the three gates, as well as the candidate state $\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)$;
2) Combine the forget gate $f_t$ and the input gate $i_t$ to update the memory cell: $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$;
3) Combine the output gate $o_t$ to pass information from the internal state to the external state: $h_t = o_t \odot \tanh(c_t)$.
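Putting these three steps together, below is a minimal sketch of one LSTM time step written directly from the equations above. NumPy is used only for illustration; the function and parameter names (`lstm_step`, `W`, `U`, `b`) are my own, not from the original text:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b are dicts of weight matrices /
    bias vectors keyed by gate name: 'i', 'f', 'o', 'c'."""
    # Step 1: three gates and the candidate state, all from x_t and h_{t-1}
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # input gate
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # forget gate
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # output gate
    c_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])
    # Step 2: update the memory cell (internal state)
    c_t = f_t * c_prev + i_t * c_tilde
    # Step 3: pass the internal state to the external state
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```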
Interpretation of the LSTM parameters in PyTorch
torch.nn.LSTM takes 7 main parameters:
1: input_size – the number of features in the input.
2: hidden_size – the size of the hidden layer (i.e., the number of hidden units); the dimension of the output vector equals the number of hidden units.
3: num_layers – the number of stacked LSTM layers; the default is 1. If set to 2, the second LSTM receives the results computed by the first: the first layer takes the inputs [x0, x1, x2, ..., xt] and computes [h0, h1, h2, ..., ht]; the second layer then takes [h0, h1, h2, ..., ht] as its inputs, recomputes, and outputs the final [h0, h1, h2, ..., ht].
4: bias – whether the layers use bias (offset) terms; the default is True.
5: batch_first – whether the first dimension of the input and output is batch_size; the default is False, i.e., the layout is (seq_len, batch, feature).
6: dropout – the default is 0. If nonzero, adds a dropout layer on the outputs of each LSTM layer except the last. The value is a probability between 0 and 1; 0 means no dropout.
7: bidirectional – whether the LSTM is bidirectional; the default is False. If True, num_directions = 2; otherwise num_directions = 1.
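A minimal usage sketch follows; the sizes (10 input features, 20 hidden units, sequence length 5, batch size 3) are arbitrary illustration values:

```python
import torch
import torch.nn as nn

# Two stacked layers: the second LSTM consumes the first layer's h-sequence
lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=2)

# With batch_first=False (the default), the layout is (seq_len, batch, feature)
x = torch.randn(5, 3, 10)

output, (h_n, c_n) = lstm(x)
print(output.shape)            # torch.Size([5, 3, 20]): h_t at every time step
print(h_n.shape, c_n.shape)    # torch.Size([2, 3, 20]) each: final h and c per layer
```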
Why is it called long short-term memory? ("Long short-term memory" means a long "short-term memory".)
The hidden state $h$ of a recurrent neural network stores historical information and can be seen as a kind of memory. In a simple recurrent network, the hidden state is rewritten at every time step, so it can be regarded as a short-term memory. Long-term memory, in a neural network, can be regarded as the network parameters: they implicitly encode the experience learned from the training data, and their update cycle is much slower than that of short-term memory. In an LSTM network, the memory cell $c$ can capture a key piece of information at some time step and store it over a certain interval. The lifespan of information stored in the memory cell $c$ is longer than that of the short-term memory $h$, but much shorter than that of long-term memory; hence the name long short-term memory.
On vanishing gradients
In deep network parameter learning, parameters are usually initialized to small values. When training an LSTM network, however, values that are too small make the forget gate small as well, which means most of the information from the previous time step is lost; the network then has difficulty capturing long-range dependencies, and the gradients across adjacent time steps become very small, which leads to the vanishing gradient problem. The forget gate's parameters are therefore usually initialized to relatively large values: its bias vector $b_f$ is set to 1 or 2. A sketch of how this can be done follows below.
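Below is one way to apply this initialization to PyTorch's nn.LSTM; the helper name init_forget_gate_bias is my own. PyTorch concatenates the gate biases in the order (input, forget, cell, output), so the forget-gate entries are the second quarter of each bias tensor. Note also that PyTorch adds bias_ih and bias_hh, so filling both with 1.0 yields an effective forget-gate bias of 2.0:

```python
import torch
import torch.nn as nn

def init_forget_gate_bias(lstm: nn.LSTM, value: float = 1.0) -> None:
    """Fill the forget-gate slice of every bias tensor with `value`.

    nn.LSTM packs each bias as (4 * hidden_size,) in gate order
    (input, forget, cell, output), so the forget gate occupies
    the slice [hidden_size : 2 * hidden_size].
    """
    h = lstm.hidden_size
    with torch.no_grad():
        for name, param in lstm.named_parameters():
            if name.startswith("bias"):      # bias_ih_l* and bias_hh_l*
                param[h:2 * h].fill_(value)

lstm = nn.LSTM(input_size=10, hidden_size=20)
init_forget_gate_bias(lstm, 1.0)
```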