[RNN] Dissecting RNNs: from RNN (Simple|LSTM) to sequence generation, and on to the seq2seq framework (encoder-decoder, also known as seq2seq)
2022-07-25 09:57:00 【Tobi_ Obito】
Preface
Recently I have been working on a multi-label classification project, and I found that the difficulties of multi-label classification can be addressed by transforming it into a sequence generation problem (as in the figure below, taken from the paper "Ensemble Application of Convolutional and Recurrent Neural Networks for Multi-label Text Categorization" [1]). The idea in the paper is laid out very thoroughly and the figure is clear, but the concrete RNN implementation still needs to be worked out. So I went over the whole line of thought, from the simplest RNN all the way to seq2seq, and I record it here.

For clarity, the discussion below will unfold around the figure above.
RNN
First, simplify the figure above into the basic RNN structure:

At a glance: x0 ~ x3 is the input sequence, y0 ~ y3 is the output sequence (sometimes we only use the output of the last step, for example in the encoder of the encoder-decoder framework discussed later), and the horizontal arrows between h0 ~ h3 represent the transfer of the state s.
Simple-RNN
Simple-RNN is, well, simple: the output y_t at time t is taken to be the state s_t of time t and is passed on as the state input of the next time step t+1. In one sentence, the state passed between time steps is just the output of the previous step.
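To make this concrete, here is a minimal NumPy sketch of one Simple-RNN step as described above; the dimensions and random weights are toy values chosen only for illustration.

```python
import numpy as np

D_in, D = 4, 3                          # input size, state/output size
W_x = np.random.randn(D, D_in) * 0.1    # input-to-state weights (toy values)
W_s = np.random.randn(D, D) * 0.1       # state-to-state weights (toy values)
b = np.zeros(D)

def simple_rnn_step(x_t, s_prev):
    """One time step: the new output y_t is also the state passed onward."""
    y_t = np.tanh(W_x @ x_t + W_s @ s_prev + b)
    return y_t, y_t                     # in Simple-RNN, the state s_t == the output y_t

# Unroll over a toy input sequence x0..x3, threading the state through time.
xs = [np.random.randn(D_in) for _ in range(4)]
s = np.zeros(D)
for x_t in xs:
    y_t, s = simple_rnn_step(x_t, s)
    print(y_t)
```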

LSTM
Simple though Simple-RNN is, it has a problem: vanishing gradients, which occur when the sequence is long. I won't go through the detailed analysis here; basically every introduction to LSTM covers it, and LSTM was proposed precisely to solve this problem by adopting a gate mechanism (which looks complicated). The gate mechanism can be understood simply as follows: the state s passed between two adjacent time steps t and t+1 is split into two parts (the dimensions need not be contiguous). One part keeps the state handed over from the previous step (i.e. part of the values of s from step t-1), and the other part is the state newly computed at the current step t (just as in the original RNN). The gate controls which dimensions of the state vector s are kept and which are updated; the figure below gives an example that conveys the idea, though it is not exact. So which parts of the state are kept and which are updated? The gate decides. Intuitively, where a gate is closed the newly computed values cannot get through, so the old values are retained. How does the gate choose? By multiplying each dimension by a decimal between 0 and 1 to realize opening and closing. In other words, a gate is a vector with the same dimension as the state s, and the values of the gate vector are learned (as network parameters). To sum it all up in one sentence: LSTM solves the vanishing gradient problem of Simple-RNN, and that is all we need here.
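Purely to illustrate the gating idea just described (this is not a full LSTM with input/forget/output gates; it is closer to a single update gate, and the weights are toy values), here is a small sketch: a gate vector of values in (0, 1) decides, per dimension, how much of the old state is kept and how much of the newly computed candidate gets through.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

D_in, D = 4, 3
W_g = np.random.randn(D, D_in + D) * 0.1   # gate weights (learned as parameters in practice)
W_c = np.random.randn(D, D_in + D) * 0.1   # candidate-state weights

def gated_step(x_t, s_prev):
    z = np.concatenate([x_t, s_prev])
    g = sigmoid(W_g @ z)                    # gate: one value in (0, 1) per state dimension
    s_cand = np.tanh(W_c @ z)               # candidate state computed at the current step
    s_t = g * s_prev + (1.0 - g) * s_cand   # keep some dimensions of the old state, update the rest
    return s_t

s = np.zeros(D)
for x_t in [np.random.randn(D_in) for _ in range(4)]:
    s = gated_step(x_t, s)
print(s)
```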

Sequence generation
So far we have looked at RNNs from the point of view of network structure; now let us look at them from the point of view of the task, that is, one of the main application modes of RNNs: sequence generation.
The sequence generation problem is: given a sequence, generate another sequence, for example machine translation or text generation. To lead into the core of this post, consider the following text generation task: given a single word, generate a sequence. The idea is simple: feed the word as the RNN input at time t=0 and predict the output, then take that output as the input of the next time step and predict again, repeating until a specified length is reached or a designated end token is predicted. See the figure below (doesn't it look more and more like the figure from the paper at the beginning?).

The idea is clear, but there is a catch: it is not that easy to implement! Think about training. The training data X is usually prepared in advance and fed straight into the network; here, however, the later inputs have to be obtained from earlier outputs. This requires control at a lower level than what some of the pre-packaged RNN layers (such as LSTM/GRU in Keras) can offer. That said, it can still be done in Keras or TensorFlow: although in Keras you cannot achieve it simply by defining an LSTM layer, you can define an RNN(rnn_cell, ...) layer and customize the rnn_cell accordingly. (The so-called cell is the computation unit of a single time step, i.e. the square boxes h0/h1/h2/h3 in the figures; note that they are essentially the same unit! Many diagrams simply unroll it over time, which makes things easier to understand.) It can also be done with tf.nn.raw_rnn in TensorFlow; see my other blog post for the details.
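Here is a minimal sketch of the loop in question, assuming TensorFlow 2.x in eager mode (it drives an LSTMCell directly with a Python loop, a simpler route than the RNN(rnn_cell) or tf.nn.raw_rnn approaches mentioned above but expressing the same idea; vocab_size, start_token, max_len and the greedy argmax decoding are all illustrative choices). The prediction of step t is fed back as the input of step t+1:

```python
import tensorflow as tf

vocab_size, emb_dim, units, max_len = 100, 16, 32, 5   # toy sizes (assumptions)
embed = tf.keras.layers.Embedding(vocab_size, emb_dim)
cell = tf.keras.layers.LSTMCell(units)
to_logits = tf.keras.layers.Dense(vocab_size)

def generate(start_token, batch_size=1):
    token = tf.fill([batch_size], start_token)
    states = [tf.zeros([batch_size, units]), tf.zeros([batch_size, units])]
    generated = []
    for _ in range(max_len):
        x_t = embed(token)                   # embed the previously predicted token
        h, states = cell(x_t, states)        # one RNN step
        logits = to_logits(h)
        token = tf.argmax(logits, axis=-1)   # greedy prediction becomes the next input
        generated.append(int(token[0]))
    return generated

print(generate(start_token=1))
```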
Conditional generation framework
Now let us do something to the input of sequence generation: instead of feeding in a plain word, we concatenate the word vector with a meaningful vector c, so that the input carries more information. This vector c is called the conditional context vector. It may not yet be clear why such a vector should be concatenated, so let us explain with the figure below (one step closer to the original figure from the paper; we are almost there).

In the figure, c can be regarded as the conditional context; it is fed into the cell at every time step. The "Text feature vector" in the paper explains nicely what this is for: we can extract a feature from a long piece of text and use this feature to help the network with its subsequent predictions. How does it help? Let us now return to the multi-label classification problem mentioned at the beginning.
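Continuing the sketch above (same assumptions: TensorFlow 2.x, toy sizes, greedy decoding), the only change needed for the conditional framework is to concatenate the condition vector c onto the input of every step:

```python
import tensorflow as tf

vocab_size, emb_dim, units, cond_dim, max_len = 100, 16, 32, 8, 5   # toy sizes
embed = tf.keras.layers.Embedding(vocab_size, emb_dim)
cell = tf.keras.layers.LSTMCell(units)
to_logits = tf.keras.layers.Dense(vocab_size)

def generate_conditional(start_token, c):
    """c: a [1, cond_dim] conditional context vector, e.g. a text feature."""
    token = tf.fill([1], start_token)
    states = [tf.zeros([1, units]), tf.zeros([1, units])]
    out = []
    for _ in range(max_len):
        x_t = tf.concat([embed(token), c], axis=-1)   # splice the word vector with c
        h, states = cell(x_t, states)
        token = tf.argmax(to_logits(h), axis=-1)
        out.append(int(token[0]))
    return out

c = tf.random.normal([1, cond_dim])                   # stand-in for an extracted text feature
print(generate_conditional(start_token=1, c=c))
```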
Solving multi-label classification with sequence generation
First consider single-label classification. It is very simple: predict the class from the features X and we are done. Can it be viewed as the figure below?

Things may start to become clear at this point: what if we classify several more times? See the figure below.

Why would we repeat the classification several times? Repetition creates an opportunity: for categories with similar features, the result of the a-th classification can differ from the result of the b-th classification. Of course, this form alone cannot achieve the desired effect, because the same input will basically produce the same output (basically, because h0, h1, h2, h3 have different parameter weights, the outputs for similar categories may differ, but that difference is uncontrollable, so it cannot meet our expectation). So how do we introduce some variation, so that h0 predicts similar category A while h1 predicts similar category B? Obviously, adding another input to perturb things will do; see the figure below (left).


What is the difference between this figure and the figure in the paper? There is no connection between the perturbation input and the predicted output. What does adding that connection bring? Compare the left and right figures: adding it means the predicted labels become related to each other, and this relation can act as a kind of "near-synonymy" between similar labels. A practical example: suppose there are two similar labels, campus safety and campus violence (campus violence is usually tied to campus safety). When only h0 is used to make a single prediction, y1 wants to predict campus safety and also wants to predict campus violence; the two suppress each other in probability, so whichever is predicted, its confidence is not high. With the scheme in the right figure, because y1 comes before y2, the training data lets y1 be predicted as campus safety first; when it is y2's turn, the information that y1 has already been predicted as campus safety is passed along, so y2 is no longer torn and instead tends to predict campus violence.
To put it bluntly, the Text feature vector sets the basic tone (filtering out the labels with low confidence), while the information passed along the RNN time steps learns the relations between labels (which is what tips the balance for labels that would otherwise be left hanging).
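Just to make the data side concrete (the label names, the fixed ordering convention, and the end token below are illustrative assumptions on my part, not something prescribed by the paper): treating multi-label targets as a label sequence means each training example's label set is ordered and terminated, so the RNN can predict labels one by one, each conditioned on the previous ones.

```python
LABELS = ["campus safety", "campus violence", "sports", "<eos>"]   # hypothetical label set
LABEL_TO_ID = {name: i for i, name in enumerate(LABELS)}

def labels_to_sequence(label_set, order=LABELS):
    """Order the target labels by a fixed convention and append an end token."""
    seq = [name for name in order if name in label_set]
    return [LABEL_TO_ID[name] for name in seq] + [LABEL_TO_ID["<eos>"]]

# A document tagged with two related labels becomes a short target sequence.
print(labels_to_sequence({"campus violence", "campus safety"}))    # -> [0, 1, 3]
```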
At this point the multi-label problem seems solved and we could stop here, but the figure still looks different from the original figure in the paper. Let us look again.

This comparison reveals something surprising, and it all comes down to the point of difference.
The difference from the figure given in the paper is the state transfer between cells (the horizontal arrows between cells). The structure above has already reached the goal, so would adding the state transfer between time steps on top of it improve the results? I am not sure, but my guess is that it might: the output y_t of an RNN at a given time step is determined by the state of the previous step s_(t-1) and the input of the current step x_t through some transformation, i.e. y_t = O(s_(t-1), x_t). This indicates that the state contains the information as it is before the transformation, which may be useful (in case O() causes the loss of information between labels that is actually valid).
Having speculated on whether this redundancy is effective, we should also consider whether one of the two can be removed, keeping only one. Looking closely at the figure with only one of them kept, I suddenly noticed another wonderful thing: that is exactly Simple-RNN! Compare the figures below:


The left and right figures each keep only one kind of information transfer between time steps: one passes y, the other passes s. So what if y = s? Isn't that exactly Simple-RNN? Why think of it this way? Because if only one is kept, then directly using Simple-RNN avoids the implementation problem mentioned earlier, namely the dynamic issue of taking the output of the previous step as the input of the next step. In Simple-RNN, the state s passed to the next step is precisely the output of the previous step, which means the dashed arrow in the left figure is equivalent to the horizontal arrow in the right figure.
seq2seq
The conditional generation framework was described above, and the concrete multi-label classification example deepened our understanding of sequence generation with RNNs. Now we can take one more step to seq2seq. seq2seq is short for sequence-to-sequence conditional generation, and is also known as the encoder-decoder framework. It simply specifies where the conditional context vector c in the conditional generation framework comes from: it is produced by an encoder. What is an encoder? You can say the encoder is an RNN that receives a sequence and outputs a vector to serve as the conditional context vector c; it encodes the sequence into a vector, hence the name encoder. As for the decoder, it is the main body of the conditional generation framework. See the figure below.

In fact, broadly speaking, the encoder does not have to be an RNN. For example, the Text feature vector in the paper is extracted by a CNN, so there the encoder is a CNN. Likewise, the decoder does not have to be an RNN. In other words, the encoder-decoder framework is the broader notion and covers seq2seq.
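As a minimal sketch of the RNN encoder-decoder case just described (assuming TensorFlow 2.x, toy sizes and greedy decoding): here the encoder's final state initializes the decoder, which is one common way of conditioning; the alternative of concatenating c to every decoder input, as in the conditional framework above, would work too.

```python
import tensorflow as tf

vocab_size, emb_dim, units, max_len = 100, 16, 32, 5   # toy sizes (assumptions)
enc_embed = tf.keras.layers.Embedding(vocab_size, emb_dim)
encoder = tf.keras.layers.LSTM(units, return_state=True)
dec_embed = tf.keras.layers.Embedding(vocab_size, emb_dim)
dec_cell = tf.keras.layers.LSTMCell(units)
to_logits = tf.keras.layers.Dense(vocab_size)

def seq2seq_generate(src_tokens, start_token=1):
    # Encoder: read the whole source sequence and keep only its final state.
    _, h, c = encoder(enc_embed(src_tokens))            # h, c: [1, units]
    states = [h, c]                                     # the "condition" handed to the decoder
    token, out = tf.fill([1], start_token), []
    for _ in range(max_len):
        y, states = dec_cell(dec_embed(token), states)  # decoder step
        token = tf.argmax(to_logits(y), axis=-1)        # feed the prediction back as input
        out.append(int(token[0]))
    return out

src = tf.constant([[5, 7, 2, 9]])                       # a toy source sequence
print(seq2seq_generate(src))
```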
Conclusion
This post only briefly walks through the main line of thinking from RNNs to seq2seq, with a deeper look at the sequence generation problem. There are still topics left untouched, for example the teacher-forcing training mode for sequence generation. It is also a way of training the model to learn the relation between consecutive tokens, but it differs from the training mode described in this post: its training target Y is simply X shifted by one position in time, which is more direct than the training mode here, but it also suffers from the non-gold sequence problem (at inference time the model sees its own, possibly wrong, previous outputs rather than gold tokens). I will not go into the details here; interested readers can look up the relevant material.
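For a concrete (and purely illustrative) picture of that data layout, here is a tiny sketch of how teacher-forcing pairs might be built: the decoder input is the gold sequence with a start token prepended, and the target Y is the same sequence shifted by one position with an end token appended. The token ids and markers are my assumptions.

```python
START, EOS = 0, 1   # hypothetical start/end token ids

def make_teacher_forcing_pair(gold_tokens):
    decoder_input = [START] + gold_tokens   # what the decoder is fed during training
    target = gold_tokens + [EOS]            # what it must predict: the same sequence shifted by one
    return decoder_input, target

x, y = make_teacher_forcing_pair([5, 7, 2])
print(x)   # [0, 5, 7, 2]
print(y)   # [5, 7, 2, 1]
```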
References
[1] G. Chen, D. Ye, Z. Xing, J. Chen and E. Cambria, "Ensemble application of convolutional and recurrent neural networks for multi-label text categorization," 2017 International Joint Conference on Neural Networks (IJCNN), 2017, pp. 2377-2383, doi: 10.1109/IJCNN.2017.7966144.