Deep learning NLP: from RNN, LSTM, GRU, and seq2seq to attention, with classification and analysis
2022-06-24 04:55:00 【Goose】
1. Background
The previous blog discussed self-attention inside Transformers. In NLP, attention and seq2seq are widely used, so this article summarizes the development from RNN, LSTM, GRU, and seq2seq to attention, together with the types and applications of attention, to make the overall evolution and the attention mechanism easier to understand.
2. RNN
The basic RNN model is shown in the figure above. Each neuron receives two inputs: the hidden state of the previous neuron h (the memory) and the current input x (the current information). After receiving these inputs, the neuron computes a new hidden state h and an output y, and passes the hidden state on to the next neuron. Because of the hidden state h, an RNN has a certain memory capability.
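As a minimal sketch of this computation (weight names and shapes are illustrative assumptions, not taken from the article), one RNN step can be written as:

```python
import torch

x_t = torch.randn(1, 10)        # current input x (input_dim = 10)
h_prev = torch.zeros(1, 20)     # hidden state h from the previous neuron (memory)
W_xh = torch.randn(10, 20)      # input-to-hidden weights
W_hh = torch.randn(20, 20)      # hidden-to-hidden weights
W_hy = torch.randn(20, 5)       # hidden-to-output weights
b_h, b_y = torch.zeros(20), torch.zeros(5)

h_t = torch.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)  # new hidden state, passed to the next neuron
y_t = h_t @ W_hy + b_y                              # output of this step
```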
For different tasks, the RNN model structure is usually adjusted slightly. According to the number of inputs and outputs, there are three common structures:
N vs N, 1 vs N, and N vs 1.
2.1 RNN structure 1: N vs N
The figure above shows the N vs N structure of the RNN model, with N inputs x1, x2, ..., xN and N outputs y1, y2, ..., yN. In the N vs N structure, the input and output sequences have the same length, so it is usually suitable for tasks such as:
- Part-of-speech tagging
- Training language models, i.e., using the previous words to predict the next word
2.2 RNN structure 2: 1 vs N
In the 1 vs N structure, there is only one input x and N outputs y1, y2, ..., yN. The 1 vs N structure can be used in two ways:
- The first passes the input x only to the first RNN neuron
- The second passes the input x to every RNN neuron
The 1 vs N structure is suitable for tasks such as:
- Generating text from an image: the input x is an image, and the output is a description of that image
- Generating music of a given category
- Generating a novel of a given genre
2.3 RNN structure 3: N vs 1
In the N vs 1 structure, there are N inputs x1, x2, ..., xN and a single output y. The N vs 1 structure is suitable for tasks such as the following (a minimal classification sketch follows this list):
- Sequence classification: classifying a piece of speech or a piece of text, or the sentiment analysis of a sentence
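A minimal N vs 1 sketch, assuming a toy three-class sentiment classifier over random inputs (all sizes are illustrative):

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
classifier = nn.Linear(20, 3)        # e.g. 3 sentiment classes (assumed)

x = torch.randn(4, 15, 10)           # 4 sequences of length 15: N inputs each
_, h_n = rnn(x)                      # h_n holds the last hidden state of each sequence
logits = classifier(h_n.squeeze(0))  # a single output y per sequence
```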
2.4 LSTM
LSTM can better alleviate the vanishing-gradient problem of RNNs.
On top of the RNN, each LSTM neuron also receives a cell state c_{t-1}. The cell state c is similar to the RNN hidden state h: both preserve historical information, flowing from c_{t-2} to c_{t-1} to c_t. In LSTM, c plays a role very close to that of h in the RNN, storing historical state information, while h in LSTM mainly carries the output information of the previous step.
2.4.1 Forget gate
The red box in the figure above shows the forget gate of the LSTM, which decides which information in the cell state c_{t-1} should be discarded. Here σ is the sigmoid activation function. Passing h_{t-1} and x_t through the sigmoid gives f_t, each value of which lies in [0, 1]. The closer a value of f_t is to 1, the more the corresponding position of c_{t-1} should be remembered; the closer it is to 0, the more that position should be forgotten. Multiplying f_t and c_{t-1} element-wise yields c'_{t-1}, the cell state after useless information has been forgotten.
2.4.2 Input gate
The red box in the figure above shows the input gate of the LSTM, which decides what new information should be added to the cell state c'_{t-1}. Here σ is the sigmoid activation function. Passing h_{t-1} and x_t through tanh gives the candidate information C̃_t (the C with a tilde in the figure), but not all of it is useful, so h_{t-1} and x_t are also passed through a sigmoid to obtain i_t, which indicates which parts of the new information are useful. The element-wise product of the two vectors is added to c'_{t-1} to obtain the cell state c_t at time t.
2.4.3 Output gate
The red box in the figure above shows the output gate of the LSTM, which decides what information should be output as h_t. The cell state c_t is passed through tanh to obtain the information that can be output, while h_{t-1} and x_t are passed through a sigmoid to obtain a vector o_t whose values lie in [0, 1], indicating which parts of the output should be kept and which removed. The element-wise product of the two vectors is the final h_t.
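Putting the three gates together, one LSTM step can be sketched as below (the parameter dictionary and shapes are assumptions for illustration, not the exact formulation in the figures):

```python
import torch

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step written out gate by gate (illustrative sketch)."""
    z = torch.cat([h_prev, x_t], dim=-1)
    f_t = torch.sigmoid(z @ p["W_f"] + p["b_f"])    # forget gate: what to drop from c_{t-1}
    i_t = torch.sigmoid(z @ p["W_i"] + p["b_i"])    # input gate: which new information is useful
    c_tilde = torch.tanh(z @ p["W_c"] + p["b_c"])   # candidate new information
    c_t = f_t * c_prev + i_t * c_tilde              # forget, then add the useful new information
    o_t = torch.sigmoid(z @ p["W_o"] + p["b_o"])    # output gate: what to expose as h_t
    h_t = o_t * torch.tanh(c_t)
    return h_t, c_t

d_in, d_h = 10, 20
p = {k: torch.randn(d_in + d_h, d_h) for k in ["W_f", "W_i", "W_c", "W_o"]}
p.update({k: torch.zeros(d_h) for k in ["b_f", "b_i", "b_c", "b_o"]})
h, c = lstm_step(torch.randn(1, d_in), torch.zeros(1, d_h), torch.zeros(1, d_h), p)
```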
2.5 GRU
GRU is a variant of LSTM with a slightly simpler structure. LSTM has three gates (forget, input, and output), while GRU has only two (update and reset). In addition, GRU has no separate cell state c as LSTM does.
In the figure, z_t and r_t denote the update gate (red) and the reset gate (blue). The reset gate r_t controls how much of the previous state h_{t-1} is passed into the candidate state h̃_t: the smaller r_t is, the smaller its product with h_{t-1}, and the less information from h_{t-1} enters the candidate state. The update gate controls how much of the previous state h_{t-1} is kept in the new state h_t: the larger (1 - z_t) is, the more information is retained.
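A corresponding GRU step as a sketch, with assumed shapes and the update rule described above:

```python
import torch

def gru_step(x_t, h_prev, p):
    """One GRU step (illustrative sketch)."""
    z_in = torch.cat([h_prev, x_t], dim=-1)
    z_t = torch.sigmoid(z_in @ p["W_z"] + p["b_z"])     # update gate
    r_t = torch.sigmoid(z_in @ p["W_r"] + p["b_r"])     # reset gate
    h_tilde = torch.tanh(torch.cat([r_t * h_prev, x_t], dim=-1) @ p["W_h"] + p["b_h"])
    return (1 - z_t) * h_prev + z_t * h_tilde           # the larger (1 - z_t), the more of h_{t-1} is kept

d_in, d_h = 10, 20
p = {k: torch.randn(d_in + d_h, d_h) for k in ["W_z", "W_r", "W_h"]}
p.update({k: torch.zeros(d_h) for k in ["b_z", "b_r", "b_h"]})
h = gru_step(torch.randn(1, d_in), torch.zeros(1, d_h), p)
```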
3. seq2seq
3.1 seq2seq structure
An RNN places certain constraints on the numbers of inputs and outputs, but in practice the sequence lengths of many tasks are not fixed. For example, in machine translation the source and target sentences have different lengths, and in dialogue systems questions and answers differ in length.
Seq2Seq is an important RNN model, also known as the Encoder-Decoder model, and can be understood as an N×M model. It consists of two parts: the Encoder encodes the sequence information, compressing a sequence of arbitrary length into a vector c, while the Decoder decodes this context vector c and outputs it as a sequence. Seq2Seq has many model structures; here are some of the more common ones:
3.1.1 seq2seq structure 1
3.1.2 seq2seq structure 2
3.1.3 seq2seq structure 3
3.2 Encoder
The main difference among these three Seq2Seq models lies in the Decoder; their Encoders are identical. The figure below shows the Encoder: its RNN takes the input x and finally outputs a single context vector c that encodes all of the information, while the intermediate neurons have no output. The Decoder's main input is the context vector c, from which it decodes the required information.
As the formulas show, c can be taken directly as the hidden state h_N of the last neuron; it can also be obtained by applying some transformation q to h_N; or it can be computed from the hidden states of all neurons h_1, h_2, ..., h_N. Once the context vector c is obtained, it is passed to the Decoder.
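A minimal Encoder sketch, assuming a GRU encoder and the simplest choice c = h_N (sizes are illustrative):

```python
import torch
import torch.nn as nn

encoder = nn.GRU(input_size=10, hidden_size=32, batch_first=True)
src = torch.randn(1, 7, 10)       # one source sequence of length 7 (assumed shapes)
enc_outputs, h_n = encoder(src)   # enc_outputs: h_1 ... h_N; h_n: last hidden state
c = h_n                           # context vector c = h_N, handed to the Decoder
```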
3.3 Decoder
3.3.1 Decoder #1
The first Decoder structure is simple: the context vector c is treated as the initial hidden state of the RNN, and each neuron then receives only the hidden state h' of the previous neuron, without any other input x. The hidden state and output formulas of the first Decoder structure are:
3.3.2 Decoder #2
The second Decoder structure has its own initial hidden state h'_0. The context vector c is no longer used as the initial hidden state of the RNN but is instead fed as the input of every neuron. As the figure shows, every Decoder neuron receives the same input c. The hidden state and output formulas of this Decoder are:
3.3.3 Decoder #3
The third Decoder structure is similar to the second, but its input also includes the output y' of the previous neuron. That is, the input of each neuron consists of the hidden state h' of the previous neuron, the output y' of the previous neuron, and the current input c (the context vector produced by the Encoder). For the first neuron, the input y'_0 is usually the embedding vector of the sentence start marker. The hidden state and output formulas of the third Decoder are:
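A sketch of this third Decoder variant, where each step consumes the embedding of the previous output y' together with c (token ids, sizes, and the greedy loop are illustrative assumptions):

```python
import torch
import torch.nn as nn

emb_dim, hid, vocab = 16, 32, 100
embed = nn.Embedding(vocab, emb_dim)
cell = nn.GRUCell(emb_dim + 32, hid)    # input at each step: [embedding of y'_{t-1}; c]
out_proj = nn.Linear(hid, vocab)

c = torch.randn(1, 32)                  # context vector from the Encoder
h = torch.zeros(1, hid)                 # decoder hidden state h'
y_prev = torch.tensor([1])              # assumed id of the sentence start marker
for _ in range(5):                      # decode a few steps greedily
    step_in = torch.cat([embed(y_prev), c], dim=-1)
    h = cell(step_in, h)
    y_prev = out_proj(h).argmax(dim=-1) # feed the prediction back in as y'_{t-1}
```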
3.4 Beam search
Beam search is used at inference time, not during training. At each neuron we select the top k outputs with the highest current probabilities and pass them to the next neuron. The next neuron uses these k outputs to compute the probabilities of the L words in the vocabulary (L is the vocabulary size), takes the top k of the resulting k×L candidates, and repeats this step.
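A minimal beam-search step, assuming a `log_probs_fn(seq)` that returns log-probabilities over the vocabulary for a partial sequence (the function names and dummy scorer are illustrative):

```python
import torch

def beam_search_step(beams, log_probs_fn, k):
    """Expand each of the k beams, then keep only the k best of the k*L candidates."""
    candidates = []
    for seq, score in beams:                          # beams: list of (token list, log-prob score)
        log_probs = log_probs_fn(seq)                 # shape (vocab_size,)
        top = torch.topk(log_probs, k)
        for lp, idx in zip(top.values, top.indices):
            candidates.append((seq + [idx.item()], score + lp.item()))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:k]

def dummy_scorer(seq):
    return torch.log_softmax(torch.randn(50), dim=0)  # stand-in for a real decoder

beams = [([1], 0.0)]                                  # start from an assumed <sos> token id
for _ in range(3):
    beams = beam_search_step(beams, dummy_scorer, k=2)
```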
3.5 Teacher forcing
Teacher forcing is used in the training phase, mainly for the third Decoder model above, whose neuron inputs include the output y' of the previous neuron. If the output of the previous neuron is wrong, the output of the next neuron is also likely to be wrong, so the error keeps propagating forward.
Teacher forcing alleviates this problem to some extent: when training a Seq2Seq model, each Decoder neuron does not always use the output of the previous neuron as input, but with some probability uses the correct token from the target sequence instead.
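A sketch of how teacher forcing can be applied inside the training loop (the ratio and the helper name are illustrative assumptions):

```python
import random

teacher_forcing_ratio = 0.5          # assumed proportion of steps that use the gold token

def next_decoder_input(gold_token, predicted_token):
    """Choose what to feed the next Decoder step during training."""
    if random.random() < teacher_forcing_ratio:
        return gold_token            # use the correct token from the target sequence
    return predicted_token           # use the model's own previous output y'
```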
4. Attention mechanism
Attention works just as its name suggests: it focuses on different features when predicting each result. For example, when translating "I have a cat" into Chinese, the model attends to "I" while generating the translation of "I", and to "cat" while generating the translation of "cat".
4.1 Computing attention
Usually the input is divided into three parts: query (Q), key (K), and value (V):
- First compute the weights a from Q and K, then normalize them with softmax: a = softmax(f(Q, K))
- There are many choices for the score function f(Q, K); common ones are additive attention and multiplicative attention:
  - Additive attention: f(Q, K) = tanh(W1 Q + W2 K)
  - Multiplicative (dot-product) attention: f(Q, K) = Q K^T
  - Scaled dot-product attention: f(Q, K) = Q K^T / √d (sketched in the example after this list)
  - Bilinear dot-product attention: f(Q, K) = Q W K^T
- Then take the weighted sum of the values: out = Σ_i a_i v_i
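A minimal sketch of scaled dot-product attention, using single-head, unbatched tensors (all dimensions are illustrative):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """a = softmax(Q K^T / sqrt(d)); out = a V."""
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)   # f(Q, K)
    a = torch.softmax(scores, dim=-1)                 # normalized weights a
    return a @ V                                      # weighted sum of the values

Q = torch.randn(1, 64)    # one query
K = torch.randn(7, 64)    # 7 keys
V = torch.randn(7, 64)    # 7 values
out = scaled_dot_product_attention(Q, K, V)   # shape (1, 64)
```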
What this mechanism does is addressing: it mimics the way a CPU interacts with memory in order to read out stored content. As shown in the figure above, given a task-related query vector q, an attention distribution over the Keys is computed and applied to the Values to obtain the attention value. This is how the attention mechanism reduces the complexity of the neural network model: instead of feeding all N inputs into the network, only the information in X that is relevant to the task is selected and fed in.
4.2 Attention in seq2seq
Here h^i is the output of the Encoder at each step and z^j is the output of the Decoder at each step. The computation proceeds as follows:
- First encode the input to obtain [h^1, h^2, h^3, h^4]
- Start decoding: use the fixed start token, i.e. z^0, as Q and compute attention against each h^i (which serve as both K and V), obtaining the weighted vector c^0
- Use c^0 as the input of the decoding RNN (together with z^0 from the previous step) to obtain z^1 and predict that the first word is "machine"
- To continue predicting, use z^1 as Q to compute the next attention, and so on
Redrawing the diagram may make this easier to understand:
With attention, the Decoder's input is no longer a fixed context vector c; instead, the current c is computed from the information being translated at that moment.
Attention needs to keep the hidden state h of every Encoder neuron. The t-th Decoder neuron then uses the hidden state h'_{t-1} of the previous neuron to compute the relevance e_t between the current state and each Encoder neuron. e_t is an N-dimensional vector (the Encoder has N neurons); the larger its i-th dimension, the stronger the correlation between the current node and the i-th Encoder neuron.
There are many ways to compute e_t, i.e., many choices for the relevance score function a, as listed above.
After obtaining the relevance vector e_t, it is normalized with softmax. The normalized coefficients are then used to fuse the Encoder's hidden states into the context vector c_t of the current Decoder neuron:
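One decoding step with attention might look like the sketch below, using a plain dot product as the relevance function (the shapes and the choice of score function are assumptions):

```python
import torch

enc_h = torch.randn(7, 32)        # Encoder hidden states h_1 ... h_N (N = 7)
dec_h = torch.randn(32)           # previous Decoder hidden state h'_{t-1}

e_t = enc_h @ dec_h               # relevance of the current step to each Encoder neuron, shape (7,)
a_t = torch.softmax(e_t, dim=0)   # normalize with softmax
c_t = a_t @ enc_h                 # fuse the Encoder hidden states into the context vector c_t
```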
4.3 Attention classification
- Soft/Hard attention
  - Soft attention: the traditional attention, which can be embedded in the model and trained with back-propagation
  - Hard attention: does not compute over all outputs, but samples the encoder outputs according to their probabilities; during back-propagation, Monte Carlo sampling is used to estimate the gradient
- Global/Local attention
  - Global attention: the traditional attention, computed over all encoder outputs
  - Local attention: between soft and hard; it predicts a position and selects a window around it for the computation
- Self-attention
  - Traditional attention computes the dependency between Q and K, whereas self-attention computes the dependencies within Q and within K themselves. See the previous blog for details.
4.4 Differences and relations between self-attention and attention
The concrete computation of attention and self-attention is the same; only the objects of the computation differ.
- Attention is attention from the source to the target.
  - For example, in English-Chinese machine translation, the Source is the English sentence and the Target is the corresponding translated Chinese sentence. The attention mechanism operates between a Query element of the Target and all elements of the Source. In short, computing the attention weights requires the Target to participate: in the Encoder-Decoder model, it needs the hidden states of both the Encoder and the Decoder.
- Self-attention is attention from the source to the source itself.
  - For example, when Transformer computes the weights, the text vectors are converted into the corresponding K, Q, and V, and only matrix operations on the Source side are needed, without using any information from the Target.
4.4.1 Self-attention
Self-attention produces a matrix that tells you how relevant entity1 is to entity2, entity3, ..., how relevant entity2 is to entity1, entity3, ..., and so on.
It is not the attention between Target and Source, but the attention among the elements inside the Source (or inside the Target). It can also be understood as the attention mechanism in the special case Target = Source, where Q, K, and V all come from the same input.
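A sketch of this Target = Source special case, where Q, K, and V are all derived from the same sequence X (the projection matrices and sizes are illustrative; in Transformer they are learned linear projections):

```python
import torch

X = torch.randn(5, 64)                       # 5 tokens of the same (source) sequence
W_q, W_k, W_v = (torch.randn(64, 64) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v          # all three come from X itself

attn = torch.softmax(Q @ K.T / 64 ** 0.5, dim=-1)  # 5x5 matrix: relevance of each token to every other token
out = attn @ V
```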
4.4.2 Attention
The attention mechanism operates between a Query element of the Target and all elements of the Source.
For example, given entity1, entity2, entity3, ..., attention outputs weights such as [0.1, 0.2, 0.5, ...], telling you that entity3 is more important.
Ref
- https://zhuanlan.zhihu.com/p/43493999
- https://www.jianshu.com/p/80436483b13b
- https://www.jianshu.com/p/247a72812aff
- https://www.zhihu.com/question/68482809/answer/597944559
- http://www.sniper97.cn/index.php/note/deep-learning/base/3606/