Deep Learning | RNN/LSTM for Natural Language Processing
2022-07-01 03:49:00 【RichardsZ_】
Recurrent Neural Networks (RNN)
Tips: this article assumes the reader already has basic deep learning knowledge, such as weighted activations, the chain rule for derivatives, and weight matrices.
Preface
RNNs are very well suited to features that have sequence properties, so they can mine the temporal and semantic information contained in those features. Leveraging this ability, deep learning models have made breakthroughs in NLP tasks such as speech recognition, language modeling, machine translation, and time-series analysis.
Sequence property: data ordered in time, in logic, or in some other sequence is said to have the sequence property. A few examples:
- A human sentence, i.e. natural language, is a combination of words that follows a certain logic or set of rules; this fits the sequence property.
- Speech: the sounds we make arrive frame by frame, and that is what we hear; this also has the sequence property.
- Stock prices: as time goes on, a series of ordered numbers is generated; these numbers also have the sequence property.
1. Recurrent Neural Network Structure

where:
x: the feature input vector; $x_{t-1}, x_t, x_{t+1}$ denote the feature input vectors at times t-1, t, and t+1.
U: the weight matrix from the input layer to the hidden layer; in a plain fully connected neural network, the state of the hidden layer would simply be $U \cdot x$.
W: the weight matrix that feeds the hidden layer's value from the previous time step back in as part of the current input.
$s_t = f(U \cdot x_t + W \cdot s_{t-1})$
Now it looks clearer: after this network receives the input $x_t$ at time t, the value of the hidden layer is $s_t$ and the output value is $o_t$.
The key point is that $s_t$ depends not only on $x_t$ but also on $s_{t-1}$.
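To make the recurrence concrete, here is a minimal sketch of one vanilla RNN step in NumPy. The tanh activation, the toy dimensions, and the hidden-to-output matrix V are illustrative assumptions, not taken from the article.

```python
import numpy as np

# Toy dimensions and random weights, purely for illustration.
input_dim, hidden_dim, output_dim = 3, 4, 2
rng = np.random.default_rng(0)
U = rng.normal(size=(hidden_dim, input_dim))   # input -> hidden weights
W = rng.normal(size=(hidden_dim, hidden_dim))  # previous hidden -> hidden weights
V = rng.normal(size=(output_dim, hidden_dim))  # hidden -> output weights (assumed)

def rnn_step(x_t, s_prev):
    """One step: s_t = f(U*x_t + W*s_{t-1}), o_t = V*s_t, with f = tanh."""
    s_t = np.tanh(U @ x_t + W @ s_prev)
    o_t = V @ s_t
    return s_t, o_t

# Run the same cell over a short toy sequence, carrying the hidden state forward.
s = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    s, o = rnn_step(x_t, s)
```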
Two variants

Elman Network
The output of the hidden layer is fed back as part of the hidden layer's input at the next time step; this is the plainest form of RNN.
Jordan Network
Difference: the output of the output layer (i.e. the o output) is fed back as part of the hidden layer's input at the next time step, so it also carries information from the hidden-to-output weight matrix.
As for which of the two RNNs is better, there is little difference and no firm conclusion; it depends on whether the task itself needs the hidden-to-output information, so it can be worth trying both. In practice, however, these two RNNs are rarely used by industry anymore; LSTM or attention mechanisms are used instead, but that is a later story!
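The structural difference boils down to what is fed back into the hidden layer. Below is a small sketch of the two recurrences; the toy dimensions and variable names are my own assumptions, not from the article.

```python
import numpy as np

input_dim, hidden_dim, output_dim = 3, 4, 2
rng = np.random.default_rng(0)
U  = rng.normal(size=(hidden_dim, input_dim))
W  = rng.normal(size=(hidden_dim, hidden_dim))   # Elman: previous hidden state feeds back
Wo = rng.normal(size=(hidden_dim, output_dim))   # Jordan: previous output feeds back
V  = rng.normal(size=(output_dim, hidden_dim))

def elman_step(x_t, s_prev):
    s_t = np.tanh(U @ x_t + W @ s_prev)    # recurrence over the hidden state s_{t-1}
    return s_t, V @ s_t

def jordan_step(x_t, o_prev):
    s_t = np.tanh(U @ x_t + Wo @ o_prev)   # recurrence over the previous output o_{t-1}
    return s_t, V @ s_t
```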
Shortcomings
For example, take the machine-translation scenario shown in the figure below. At the last time step, when the RNN's input is the word "French", the hidden-layer output is still related to the nearby words "fluent" and "speak", but is essentially unrelated to "France", which is clearly not what we expect.
Put another way, in the network structure shown in the figure below, the hidden layer at time t+1 is essentially unrelated to the states at $t_0$ and $t_1$. So for sequence data, an RNN loses the mutual information between positions that are far apart in the sequence, and this is the RNN's fatal flaw.
One sentence summary
RNN: a neural network structure that can handle data with the sequence property; the input of a hidden neuron comes from the input at the current time step plus the output of the hidden layer at the previous time step.
Shortcoming: information separated by a long distance in the sequence gets lost, i.e. the network only keeps short-term memory and loses long-term associative memory.
2. LSTM: Long Short-Term Memory Network
To solve the long-term dependency problem that the plain RNN cannot handle on its own, German scientists (Hochreiter and Schmidhuber) introduced the LSTM network in 1997. It is a special kind of RNN designed to handle long-term dependencies.
The core idea: gates
Forget gate
Output: a probability between 0 and 1
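For reference, in the standard LSTM formulation (stated here for completeness; the article's figure is not reproduced) the forget gate is a sigmoid over the previous hidden state and the current input, which is why its output lies between 0 and 1:

$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$, with $f_t \in (0, 1)$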
Input gate

So after the candidate cell state $\tilde{C}_t$ is produced, it is multiplied by a coefficient $i_t$, which controls how much of the new input information is retained as the cell state $C_t$ flows on to the next time step. What is really being controlled are just a few parameters: during training, $W$ and $b$ are adjusted on the data so that the result reaches the maximum likelihood over the training data.
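In the standard formulation (symbols follow the usual convention rather than the article's figure), the input gate and the candidate cell state are:

$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$, $\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$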
Update gate

So the core of the LSTM is to learn the weights $W$ and biases $b$ that control the forget gate and the cell-state update, which decide, at each time step, the proportion of old information that is forgotten and the proportion of new input information that is kept. When there is a long-range dependency in the sequence data, the intermediate cells may learn a relatively small $i_t$ and a relatively large $f_t$, so the long-range relationship is carried through.
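Putting the gates together, here is a minimal sketch of one LSTM cell step in NumPy. It follows the common textbook formulation (including an output gate, which the article does not discuss); the dimensions and random weights are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy dimensions and random weights, purely for illustration.
input_dim, hidden_dim = 3, 4
rng = np.random.default_rng(0)
Wf, Wi, Wc, Wo = (rng.normal(size=(hidden_dim, hidden_dim + input_dim)) for _ in range(4))
bf = bi = bc = bo = np.zeros(hidden_dim)

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(Wf @ z + bf)             # forget gate: how much of C_{t-1} to keep
    i_t = sigmoid(Wi @ z + bi)             # input gate: how much new information to add
    c_tilde = np.tanh(Wc @ z + bc)         # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde     # cell-state update
    o_t = sigmoid(Wo @ z + bo)             # output gate (textbook addition)
    h_t = o_t * np.tanh(c_t)               # new hidden state
    return h_t, c_t
```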
(Optional) LSTM/GRU optimization
To reduce the number of coefficients that must be computed, the coefficient of the update (input) gate is set directly to $1 - f_t$, i.e. it is tied to the forget gate as its complement, which makes the network structure more concise.
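Under that simplification (a coupled-gate variant in the spirit of the GRU; this is my reading of the text, not a formula given in the article), the cell-state update becomes:

$C_t = f_t \odot C_{t-1} + (1 - f_t) \odot \tilde{C}_t$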
Given a corpus, the weights $W$ (and the other parameters) are trained by maximizing the likelihood over the training set; after a softmax layer, the model predicts the word most likely to appear at the next time step.
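As a toy illustration of that last step (the vocabulary, the projection matrix W_out, and the hidden state below are made-up assumptions), the hidden state is projected to vocabulary logits and normalized with a softmax:

```python
import numpy as np

vocab = ["I", "speak", "fluent", "French", "<eos>"]   # toy vocabulary
hidden_dim = 4
rng = np.random.default_rng(0)
W_out = rng.normal(size=(len(vocab), hidden_dim))     # hidden state -> vocabulary logits

h_t = rng.normal(size=hidden_dim)                     # hidden state from the LSTM at time t
logits = W_out @ h_t
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                  # softmax over the vocabulary
print(vocab[int(np.argmax(probs))])                   # most likely next word
```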