[RNN] From RNN (Simple | LSTM) to sequence generation, and on to the seq2seq framework (encoder-decoder, a.k.a. seq2seq)
2022-07-25 09:57:00 【Tobi_ Obito】
Preface
Recently I have been working on a multi-label classification project, and it turns out that multi-label classification can be recast as a sequence generation problem (as in the figure below, taken from the paper "Ensemble Application of Convolutional and Recurrent Neural Networks for Multi-label Text Categorization" [1]). The idea in the paper is laid out thoroughly and the figure is clear, but the concrete RNN implementation still has to be worked out, so I went through the whole chain of thought, from the simplest RNN all the way to seq2seq, and this post records it.

For clarity, the discussion below is organized around the figure above.
RNN
First, simplify the figure above into the basic RNN structure:

It is clear at a glance: the bottom row is the input sequence, the top row is the output sequence (sometimes we only use the output of the last step, for example in the encoder of the encoder-decoder framework discussed later), and the horizontal arrows between h0~h3 are the hidden state being passed from one step to the next.
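To make the state passing concrete, here is a minimal NumPy sketch of an unrolled RNN. The weight names (W_xh, W_hh, W_hy) and the tanh nonlinearity are assumptions for illustration, not the exact form used in the paper.

```python
import numpy as np

# Minimal unrolled RNN sketch: each step consumes one input x_t and the previous
# hidden state, and produces an output y_t plus the next hidden state.
in_dim, hid_dim, out_dim, steps = 3, 4, 2, 4
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(in_dim, hid_dim))
W_hh = rng.normal(size=(hid_dim, hid_dim))
W_hy = rng.normal(size=(hid_dim, out_dim))

xs = rng.normal(size=(steps, in_dim))   # the input sequence
h = np.zeros(hid_dim)                   # initial hidden state
ys = []
for x in xs:
    h = np.tanh(x @ W_xh + h @ W_hh)    # state passed along the horizontal arrows
    ys.append(h @ W_hy)                 # per-step output
```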
Simple-RNN
Simple-RNN is, as the name says, simple: the output at time t is used as the state at time t and passed on to time t+1 as its state input. In a word, the state passed between steps is just the previous step's output.
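A toy sketch of that description, with assumed weight names (not a specific library implementation): the value carried to the next step is literally the previous output.

```python
import numpy as np

# Simple-RNN sketch: the quantity passed to the next step IS the previous step's output.
in_dim, out_dim, steps = 3, 4, 4
rng = np.random.default_rng(1)
W_x = rng.normal(size=(in_dim, out_dim))
W_s = rng.normal(size=(out_dim, out_dim))

xs = rng.normal(size=(steps, in_dim))
s = np.zeros(out_dim)                 # carried state = previous output
outputs = []
for x in xs:
    y = np.tanh(x @ W_x + s @ W_s)
    outputs.append(y)
    s = y                             # the output itself becomes the next state
```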

LSTM
Simple-RNN is simple, but it has one problem: vanishing gradients, which show up when the sequence is long. The detailed analysis is skipped here; almost every introduction to LSTM covers it, and LSTM was proposed precisely to solve this problem, using a gate mechanism (which looks rather complicated). The gate mechanism can be understood simply like this: split the state passed between two adjacent steps into two parts (not necessarily contiguous dimensions). One part keeps the state handed over from the previous step (i.e. part of the old state's values), and the other part is the state newly computed at the current step (as the original RNN would compute it). A gate controls which dimensions of the state vector s are kept and which are updated; a figure that conveys the idea (without being exact) is shown below. So which parts of the state are kept and which are updated? The gates decide: intuitively, where a gate is closed, the newly computed value cannot pass through, and the old value is retained. How does a gate open and close? Each dimension is multiplied by a number between 0 and 1. A gate is therefore a vector with the same dimension as the state s, and its values come from learned network parameters. After all that, in one sentence: LSTM solves the vanishing gradient problem of Simple-RNN, and that's the gist.

Sequence generation
So far we have looked at the RNN from the perspective of network structure. Now let's look at it from the perspective of the task and examine one way the RNN is applied: sequence generation.
Sequence generation means: given one sequence, generate another, as in machine translation or text generation. To lead up to the core of this article, consider the following text generation task: given a single word, generate a sequence. The idea is simple: feed the word to the RNN as the input at t=0, predict an output, then feed that output back in as the input at the next step, and keep going until a specified length is reached or a designated end token is predicted, as shown below (looking more and more like the figure from the paper at the start, isn't it?).

The idea is clear, but there is a catch: it is not easy to implement! Think about training. The training data X is usually prepared in advance and fed straight into the network, whereas here the later inputs have to come from earlier outputs. That requires lower-level control, which the off-the-shelf packaged RNN layers (such as LSTM/GRU in Keras) cannot provide directly. Still, it can be done in Keras or TensorFlow: in Keras you cannot get it simply by defining an LSTM layer, but you can define an RNN(rnn_cell, ...) layer and customize the rnn_cell accordingly. (The so-called cell is the computation unit of one time step, i.e. the h0/h1/h2/h3 boxes; note that they are essentially the same unit! Many diagrams only unroll it over time, which is easier to read.) It can also be done with tf.nn.raw_rnn in TensorFlow; see my other blog post for details.
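As a rough illustration of feeding each step's output back in as the next input, here is a minimal greedy-generation sketch that manually unrolls a Keras LSTMCell. The vocabulary size, dimensions, start token and greedy argmax decoding are assumptions for illustration, not the setup from the paper.

```python
import tensorflow as tf

vocab_size, emb_dim, hidden = 1000, 64, 128
embed = tf.keras.layers.Embedding(vocab_size, emb_dim)
cell = tf.keras.layers.LSTMCell(hidden)
out_proj = tf.keras.layers.Dense(vocab_size)

def generate(start_token, max_len=10):
    # Unroll the cell step by step so each prediction becomes the next input.
    x = embed(tf.constant([start_token]))                  # (1, emb_dim)
    state = [tf.zeros([1, hidden]), tf.zeros([1, hidden])]
    tokens = []
    for _ in range(max_len):
        h, state = cell(x, state)                          # one time step
        next_id = tf.argmax(out_proj(h), axis=-1)          # greedy pick of the output word
        tokens.append(int(next_id[0]))
        x = embed(next_id)                                 # previous output -> next input
    return tokens

print(generate(start_token=1))
```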
Conditional generation framework
Now let's tweak the input of the sequence generation setup: instead of feeding in the word alone, we concatenate the word vector with a meaningful vector c, so that the input carries more information. This vector c is called the conditional context vector. At this point it may not be obvious why such a vector should be concatenated; the figure below explains it (one step closer to the original figure from the paper, and, oh dear, we're basically there).

In the figure, the Text feature vector can be regarded as the conditional context; it is fed into the cell at every time step. It also shows the purpose nicely: we can extract a feature from a long piece of text and use that feature to help the later network make its predictions. Help how? For that, let's return to the multi-label classification problem mentioned at the beginning.
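A minimal sketch of the splicing itself, with hypothetical dimensions: at every step the previous word's embedding is concatenated with the same fixed context vector c before entering the cell.

```python
import tensorflow as tf

emb_dim, ctx_dim, hidden = 64, 32, 128
cell = tf.keras.layers.LSTMCell(hidden)

word_emb = tf.random.normal([1, emb_dim])    # embedding of the previous output word
c = tf.random.normal([1, ctx_dim])           # conditional context vector (e.g. a text feature)
state = [tf.zeros([1, hidden]), tf.zeros([1, hidden])]

step_input = tf.concat([word_emb, c], axis=-1)   # splice the word vector with c
h, state = cell(step_input, state)               # the cell sees both at every step
```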
Solving multi-label classification with sequence generation
First consider single-label classification. That is easy: predict the class from the feature X and we're done. Can it be drawn as the figure below?

Things may be getting a little clearer at this point. What if we classify several times? As in the figure below.

Why classify repeatedly? Repetition creates an opportunity: for categories with similar features, the result of the a-th classification can differ from the result of the b-th. Of course, this form alone cannot deliver the desired effect, because the same input will basically produce the same output (basically, since h0/h1/h2/h3 have different parameter weights, the outputs for similar categories may differ, but that difference is uncontrollable, so it does not meet expectations). So how can we introduce a nudge, so that h0 predicts similar category A while h1 predicts similar category B? Obviously we can add another input to provide that nudge; see the figure below (left).


What is the difference between that figure and the one in the paper? There is no connection between the nudging input and the predicted outputs. What does adding that connection bring? Compare the left and right figures: making the connection means the predicted labels become related to one another, and that relation can act like an affinity between similar labels. A concrete example: suppose there are two similar labels, campus safety and campus violence (campus violence is usually tied to campus safety). When only h0 makes a single prediction, y1 wants to predict campus safety and also wants to predict campus violence; the two suppress each other in probability, so whichever is predicted ends up with low confidence. With the scheme in the right figure, because y1 comes before y2, the training data first has y1 predicted as campus safety; when it is y2's turn, the information that y1 has already been predicted as campus safety is passed along, so y2 is no longer torn and tends to predict campus violence.
To put it plainly, the Text feature vector sets the basic tone (screening out the labels with low confidence), while the information the RNN propagates between time steps learns the relationships among labels (it is used to tip the balance for labels that could go either way).
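Putting the pieces of this section together, here is a hedged sketch of how such a label-sequence decoder might look: the text feature vector conditions every step, and the embedding of the previously predicted label is fed back as the next input. All names and dimensions are hypothetical; this illustrates the idea, not the paper's exact model.

```python
import tensorflow as tf

num_labels, lab_emb_dim, feat_dim, hidden = 20, 32, 128, 128
label_embed = tf.keras.layers.Embedding(num_labels + 1, lab_emb_dim)  # extra id for <start>
cell = tf.keras.layers.LSTMCell(hidden)
classify = tf.keras.layers.Dense(num_labels)

def predict_labels(text_feature, max_labels=4):
    # text_feature: (1, feat_dim) vector extracted from the document (e.g. by a CNN)
    state = [tf.zeros([1, hidden]), tf.zeros([1, hidden])]
    prev = tf.constant([num_labels])            # <start> id
    labels = []
    for _ in range(max_labels):
        step_in = tf.concat([label_embed(prev), text_feature], axis=-1)
        h, state = cell(step_in, state)
        prev = tf.argmax(classify(h), axis=-1)  # label predicted at this step...
        labels.append(int(prev[0]))             # ...is fed back at the next step
    return labels

print(predict_labels(tf.random.normal([1, feat_dim])))
```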
At this point the multi-label problem seems to be solved and we could stop here, but the figure still looks different from the original figure in the paper. Look again.

This comparison reveals something interesting, and it all comes down to the remaining difference. The difference from the figure given in the paper is the state transfer between cells (the horizontal arrows between cells). The structure above already reaches the goal, so would adding this extra state transfer between steps improve the result further? I'm not sure, but my guess is it might: in an RNN, the output at a given step is obtained from the previous step's state and the current input through some transformation, roughly y_t = O(h_t) with h_t = f(h_{t-1}, x_t). This means the state carries information from before the output transformation, and that may be useful (in case O(·) loses information that matters for the relationships between labels).
After speculating about whether this redundancy helps, we should also consider whether one of the two transfers can be dropped, keeping just one. Looking closely at the figure with only one, something else striking appears: that is exactly Simple-RNN! Compare the figures below:


The left and right figures each keep only one kind of information transfer between time steps: one passes y, the other passes s. So what if y = s? Then it is exactly Simple-RNN! Why think of it this way? Because if only one transfer is kept, we can use Simple-RNN directly and avoid the earlier implementation problem, the dynamic feeding of the previous step's output as the next step's input. Since in Simple-RNN the state s passed to the next step is the previous step's output, the bent arrow in the left figure is equivalent to the horizontal arrow in the right figure.
seq2seq
The previous sections described the conditional generation framework and, through the concrete example of multi-label classification, deepened the understanding of RNN-based sequence generation. From there it is only a small step to seq2seq. seq2seq is short for sequence-to-sequence conditional generation, also known as the encoder-decoder framework. It simply specifies where the conditional context vector c in the conditional generation framework comes from: it is produced by an encoder. What is an encoder? It can be another RNN: it receives a sequence and outputs a vector to be used as the conditional context vector c; since it encodes the sequence into a vector, it is called the encoder. As for the decoder, it is the main body of the conditional generation framework. As shown below.

In fact, broadly speaking, the encoder does not have to be an RNN. For example, the Text feature vector in the paper is extracted by a CNN, so there the encoder is a CNN. Likewise, the decoder does not have to be an RNN. In other words, the encoder-decoder framework is broader than, and covers, seq2seq.
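Before closing, here is a minimal encoder-decoder sketch in which both sides happen to be RNNs: the encoder compresses the source sequence into a vector c, which then initializes (and thereby conditions) the decoder. The layer names, dimensions, dummy token ids and greedy decoding are assumptions for illustration only.

```python
import tensorflow as tf

vocab_in, vocab_out, emb_dim, hidden = 1000, 500, 64, 128

enc_embed = tf.keras.layers.Embedding(vocab_in, emb_dim)
encoder = tf.keras.layers.LSTM(hidden)                  # returns only the last hidden state
dec_embed = tf.keras.layers.Embedding(vocab_out, emb_dim)
dec_cell = tf.keras.layers.LSTMCell(hidden)
out_proj = tf.keras.layers.Dense(vocab_out)

src = tf.constant([[5, 8, 13, 2]])                      # dummy source token ids
c = encoder(enc_embed(src))                             # conditional context vector c, shape (1, hidden)

state = [c, tf.zeros([1, hidden])]                      # decoder state initialized from c
token = tf.constant([1])                                # assumed <start> token id
outputs = []
for _ in range(6):
    h, state = dec_cell(dec_embed(token), state)
    token = tf.argmax(out_proj(h), axis=-1)             # previous output -> next input
    outputs.append(int(token[0]))
print(outputs)
```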
Conclusion
This article only briefly walks through the main line from RNN to seq2seq, with a closer analysis of sequence generation. Some topics were left out, for example the teacher-forcing training scheme for sequence generation. It is also a way to make the model learn the relations between consecutive words, but it differs from the training scheme discussed in this article: its training target Y is simply X shifted one step later in time. That is more direct than the scheme here, but it brings the problem of non-gold sequences at prediction time (often called exposure bias). The details are not covered here; interested readers can look up the relevant material.
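As a tiny illustration of the "Y is X shifted one step later in time" idea behind teacher forcing (hypothetical token ids, not taken from the paper):

```python
# Teacher forcing: the decoder is fed the gold tokens as inputs,
# and the target at each step is the same sequence shifted one step later.
sequence = [1, 7, 4, 9, 2]        # hypothetical gold token ids, 2 = <end>
decoder_input = sequence[:-1]     # [1, 7, 4, 9]
decoder_target = sequence[1:]     # [7, 4, 9, 2]
```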
References
[1] G. Chen, D. Ye, Z. Xing, J. Chen and E. Cambria, "Ensemble application of convolutional and recurrent neural networks for multi-label text categorization," 2017 International Joint Conference on Neural Networks (IJCNN), 2017, pp. 2377-2383, doi: 10.1109/IJCNN.2017.7966144.