Building an encoder-decoder model with Keras LSTM
Author: Nechu BM | Compiled by: VK | Source: Towards Data Science
Prerequisites: it helps to have some knowledge of recurrent neural networks (RNNs) and of encoder-decoder architectures before reading this article.
This article is a practical tutorial on how to develop an encoder-decoder model, more precisely a sequence-to-sequence (Seq2Seq) model, with Python and Keras. In the previous tutorial we developed a many-to-many translation model, as shown in the figure below:
This structure has an important limitation: the sequence length. As we can see in the image, the input sequence and the output sequence must have the same length. What do we do if we need different lengths?
For example, we may want a model that accepts sequences of different lengths: one that takes a sequence of words and outputs a number, or an image-captioning model where the input is an image and the output is a sequence of words.
If we want to develop a model whose input and output lengths differ, we need an encoder-decoder model. In this tutorial we will learn how to develop such a model and apply it to a translation task. The model is represented as follows.
We divide the model into two parts. First, we have the encoder, which takes the Spanish sentence as input and produces a hidden vector. The encoder uses an embedding layer to convert the words into vectors, and then a recurrent neural network (RNN) computes the hidden state; here we will use a long short-term memory (LSTM) layer.
The output of the encoder is then used as the input to the decoder. For the decoder, we will again use an LSTM layer, together with a dense (fully connected) layer that predicts the English words.
Implementation
The sample data comes from manythings.org. It consists of sentence pairs in two languages; in our case, we will use the Spanish-English pairs.
To build the model we first need to preprocess the data and obtain the maximum lengths of the Spanish and English sentences.
1- Preprocessing
Prerequisite: understanding the Keras classes "Tokenizer" and "pad_sequences". If you want a detailed review, we covered this topic in the previous tutorial.
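As a quick refresher, here is a minimal standalone sketch of how these two classes behave (the toy sentences are made-up examples, not part of the dataset):
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Fit a tokenizer on a toy corpus and turn each sentence into a list of word indices
toy_tokenizer = Tokenizer()
toy_tokenizer.fit_on_texts(["the sun is out", "the rain is back"])
sequences = toy_tokenizer.texts_to_sequences(["the sun is out"])
# -> something like [[1, 3, 2, 4]]; the exact indices depend on word frequency

# Pad every sequence with zeros at the end so they all share the same length
padded = pad_sequences(sequences, maxlen=6, padding="post")
# -> [[1 3 2 4 0 0]]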
First, we import the libraries and then read the downloaded data.
import string
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.layers import LSTM, Input, TimeDistributed, Dense, Activation, RepeatVector, Embedding
from keras.optimizers import Adam
from keras.losses import sparse_categorical_crossentropy
# The path of the translation file
path_to_data = 'data/spa.txt'
# Reading documents
translation_file = open(path_to_data,"r", encoding='utf-8')
raw_data = translation_file.read()
translation_file.close()
# Parsing data
raw_data = raw_data.split('\n')
pairs = [sentence.split('\t') for sentence in raw_data]
pairs = pairs[1000:20000]
Once we have read the data, we keep only a subset of the sentence pairs in order to train faster; to develop a higher-performance model we would need to use the complete dataset. We then clean the data by converting it to lowercase and removing the punctuation.
def clean_sentence(sentence):
    # Put the sentence in lowercase
    lower_case_sent = sentence.lower()
    # Remove punctuation (including the Spanish ¡ and ¿)
    string_punctuation = string.punctuation + "¡" + '¿'
    clean_sentence = lower_case_sent.translate(str.maketrans('', '', string_punctuation))
    return clean_sentence
Next, we tokenize the sentences and analyze the data.
def tokenize(sentences):
    # Create the tokenizer
    text_tokenizer = Tokenizer()
    # Fit it on the text
    text_tokenizer.fit_on_texts(sentences)
    return text_tokenizer.texts_to_sequences(sentences), text_tokenizer
Having created these functions, we can preprocess the data:
# Clean up sentences
english_sentences = [clean_sentence(pair[0]) for pair in pairs]
spanish_sentences = [clean_sentence(pair[1]) for pair in pairs]
# Tokenize the sentences
spa_text_tokenized, spa_text_tokenizer = tokenize(spanish_sentences)
eng_text_tokenized, eng_text_tokenizer = tokenize(english_sentences)
print('Maximum length spanish sentence: {}'.format(len(max(spa_text_tokenized,key=len))))
print('Maximum length english sentence: {}'.format(len(max(eng_text_tokenized,key=len))))
# Vocabulary sizes
spanish_vocab = len(spa_text_tokenizer.word_index) + 1
english_vocab = len(eng_text_tokenizer.word_index) + 1
print("Spanish vocabulary is of {} unique words".format(spanish_vocab))
print("English vocabulary is of {} unique words".format(english_vocab))
The code above prints the following results:
According to the previous code, the maximum length of the Spanish sentences is 12 words and the maximum length of the English sentences is 6 words. Here we can see the advantage of using an encoder-decoder model: previously, when working with equal-length sentences, we would have had to pad the English sentences to length 12 as well; now only half of that is needed. More importantly, this also reduces the number of LSTM time steps, which lowers the computational requirements and complexity.
We apply padding so that all sentences within each language share that language's maximum length.
max_spanish_len = int(len(max(spa_text_tokenized,key=len)))
max_english_len = int(len(max(eng_text_tokenized,key=len)))
spa_pad_sentence = pad_sequences(spa_text_tokenized, max_spanish_len, padding = "post")
eng_pad_sentence = pad_sequences(eng_text_tokenized, max_english_len, padding = "post")
# Reshape to add a trailing dimension of 1
spa_pad_sentence = spa_pad_sentence.reshape(*spa_pad_sentence.shape, 1)
eng_pad_sentence = eng_pad_sentence.reshape(*eng_pad_sentence.shape, 1)
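As a quick sanity check, the padded arrays should now have three dimensions (the numbers below are illustrative; the first dimension depends on how many sentence pairs you kept). The trailing dimension of 1 is there so the targets line up with the sparse_categorical_crossentropy loss used later.
print(spa_pad_sentence.shape)  # e.g. (19000, 12, 1): samples, Spanish time steps, 1
print(eng_pad_sentence.shape)  # e.g. (19000, 6, 1): samples, English time steps, 1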
Now that the data is prepared, let's build the model.
2- Model development
In the next section we will create the model and explain, in the Python code, each layer we add.
2.1- Encoder
The first layer we define is the embedding layer. To do so, we first have to add an input layer; the only parameter to consider here is "shape", the maximum length of the Spanish sentences, which in our case is 12.
Then we connect it to the embedding layer. The parameters to consider here are "input_dim" (the size of the Spanish vocabulary) and "output_dim" (the shape of the embedding vectors). This layer converts each Spanish word into a vector of the output dimension.
The idea behind this is to extract the meaning of each word as a spatial representation in which every dimension is a feature that defines the word. For example, "sol" is converted into a vector of shape 128. The higher the output dimension, the more semantic meaning can be extracted from each word, but the more computation and processing time it requires, so we need to find a balance between speed and performance.
input_sequence = Input(shape=(max_spanish_len,))
embedding = Embedding(input_dim=spanish_vocab, output_dim=128,)(input_sequence)
Next, we add an LSTM layer with 64 units. Even though the LSTM outputs a hidden vector at every time step, we will focus only on the last one, so the parameter return_sequences is 'False'. We will see later how the LSTM layer in the decoder works with return_sequences=True.
input_sequence = Input(shape=(max_spanish_len,))
embedding = Embedding(input_dim=spanish_vocab, output_dim=128,)(input_sequence)
encoder = LSTM(64, return_sequences=False)(embedding)
When return_sequences is 'False', the output is the last hidden state.
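To make the difference concrete, here is a small throwaway sketch (the dimensions mirror our encoder's embedding output, but the model itself is only for illustration):
from keras.layers import Input, LSTM
from keras.models import Model

x = Input(shape=(12, 128))                           # 12 time steps, 128 features each
last_state = LSTM(64, return_sequences=False)(x)     # only the last hidden state
all_states = LSTM(64, return_sequences=True)(x)      # one hidden state per time step

print(Model(x, last_state).output_shape)  # (None, 64)
print(Model(x, all_states).output_shape)  # (None, 12, 64)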
2.2- Decoder
The output of the encoder layer is the hidden state of the last time step. We then need to feed this vector into the decoder. Let's look at the decoder part in more detail and understand how it works.
As we can see in the image, the hidden vector is repeated n times, so every time step of the LSTM receives the same vector. To achieve this we use the RepeatVector layer; as its name implies, its role is to repeat the vector it receives. The only parameter we need to define is n, the number of repetitions. This number equals the number of time steps in the decoder part, in other words the maximum English sentence length, 6.
input_sequence = Input(shape=(max_spanish_len,))
embedding = Embedding(input_dim=spanish_vocab, output_dim=128,)(input_sequence)
encoder = LSTM(64, return_sequences=False)(embedding)
r_vec = RepeatVector(max_english_len)(encoder)
Once the input is ready, we continue with the decoder. It is also built with an LSTM layer; the difference is the parameter return_sequences, which in this case is 'True'. What is this parameter for? In the encoder we expected only one vector, from the last time step, and ignored all the others; here we expect an output vector at every time step so that the dense layer can make a prediction for each of them.
input_sequence = Input(shape=(max_spanish_len,))
embedding = Embedding(input_dim=spanish_vocab, output_dim=128,)(input_sequence)
encoder = LSTM(64, return_sequences=False)(embedding)
r_vec = RepeatVector(max_english_len)(encoder)
decoder = LSTM(64, return_sequences=True, dropout=0.2)(r_vec)
We have one final step: predicting the translated words. For this we need a dense (fully connected) layer. The parameter we must define is the number of units; this number of units is the shape of the output vector, and it needs to match the size of the English vocabulary. Why? The values of this vector are all close to zero, except for one unit, which is close to 1. We then map the index of the unit that outputs 1 to a dictionary in which each unit is mapped to a word.
For example, suppose the output is a vector in which all units are zero except unit 472, which is 1; we map that index into the dictionary containing the English words and obtain the value 'sun'.
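A minimal sketch of that lookup step (the index 472 and the word it maps to are hypothetical, taken from the example above):
import numpy as np

# Hypothetical output of the dense layer for a single time step
output_vector = np.zeros(english_vocab)
output_vector[472] = 1.0                  # one unit fires, all others stay near zero

predicted_index = np.argmax(output_vector)            # -> 472
index_to_words = {idx: word for word, idx in eng_text_tokenizer.word_index.items()}
print(index_to_words[predicted_index])                # whatever English word got index 472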
We have just seen how to apply a dense layer to predict one word, but how do we predict the whole sentence? Because we use return_sequences=True, the LSTM layer outputs a vector at every time step, so we need to apply the dense layer explained above to every time step so that it predicts one word at a time.
For this, Keras provides a specific layer called TimeDistributed, which applies the same dense layer to every time step.
input_sequence = Input(shape=(max_spanish_len,))
embedding = Embedding(input_dim=spanish_vocab, output_dim=128,)(input_sequence)
encoder = LSTM(64, return_sequences=False)(embedding)
r_vec = RepeatVector(max_english_len)(encoder)
decoder = LSTM(64, return_sequences=True, dropout=0.2)(r_vec)
logits = TimeDistributed(Dense(english_vocab))(decoder)
Finally, we create the model and add the loss function.
enc_dec_model = Model(input_sequence, Activation('softmax')(logits))
enc_dec_model.compile(loss=sparse_categorical_crossentropy,
optimizer=Adam(1e-3),
metrics=['accuracy'])
enc_dec_model.summary()
Once the model is defined, we can train it.
model_results = enc_dec_model.fit(spa_pad_sentence, eng_pad_sentence, batch_size=30, epochs=100)
Once the model is trained, we can run our first translation. Below you will also find the function "logits_to_sentence", which maps the output of the dense layer to the English vocabulary.
def logits_to_sentence(logits, tokenizer):
    index_to_words = {idx: word for word, idx in tokenizer.word_index.items()}
    index_to_words[0] = '<empty>'
    return ' '.join([index_to_words[prediction] for prediction in np.argmax(logits, 1)])
index = 14
print("The english sentence is: {}".format(english_sentences[index]))
print("The spanish sentence is: {}".format(spanish_sentences[index]))
print('The predicted sentence is :')
print(logits_to_sentence(enc_dec_model.predict(spa_pad_sentence[index:index+1])[0], eng_text_tokenizer))
Conclusion
The encoder-decoder architecture allows different input and output sequence lengths. First, we use an embedding layer to create a spatial representation of the words and feed it into an LSTM layer; because we only care about the output of the last time step, we use return_sequences=False.
This output vector needs to be repeated as many times as there are time steps in the decoder part, for which we use the RepeatVector layer. The decoder uses an LSTM with return_sequences=True, so the output of every time step is passed to the dense layer.
Although this model is a good improvement over the one from the previous tutorial, we can still improve its accuracy. We can add more layers to the encoder and the decoder. We can also use pretrained word embeddings such as word2vec or GloVe. Finally, we can use the attention mechanism, a major improvement in the field of natural language processing; we will introduce that concept in the next tutorial.
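As an illustration of the first suggestion, here is a minimal sketch of a stacked variant with a second LSTM layer in both the encoder and the decoder (the extra layers and their sizes are our own illustrative choice, not part of the model above):
input_sequence = Input(shape=(max_spanish_len,))
embedding = Embedding(input_dim=spanish_vocab, output_dim=128)(input_sequence)
# Stacked encoder: the first LSTM returns the full sequence so the second can consume it
encoder = LSTM(64, return_sequences=True)(embedding)
encoder = LSTM(64, return_sequences=False)(encoder)
r_vec = RepeatVector(max_english_len)(encoder)
# Stacked decoder: both layers return sequences so TimeDistributed still gets one vector per step
decoder = LSTM(64, return_sequences=True, dropout=0.2)(r_vec)
decoder = LSTM(64, return_sequences=True, dropout=0.2)(decoder)
logits = TimeDistributed(Dense(english_vocab))(decoder)
stacked_model = Model(input_sequence, Activation('softmax')(logits))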
Appendix: encoder-decoder without RepeatVector
In this tutorial we learned how to build an encoder-decoder using the RepeatVector layer. There is a second option, in which we use the model's output at one time step as the input to the next time step instead of repeating the hidden vector, as shown in the figure.
The code to implement this model can be found in the Keras documentation; it requires a deeper understanding of the Keras library and its development is considerably more complex: https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html
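For reference, here is a heavily simplified sketch of that second approach, following the pattern in the Keras blog post linked above but adapted to word-level inputs (the decoder input would be the English sentence shifted by one position, i.e. teacher forcing, and a separate inference loop is needed at prediction time; neither detail is covered in this tutorial):
# Encoder: keep the final hidden and cell states instead of repeating the output
encoder_inputs = Input(shape=(max_spanish_len,))
enc_emb = Embedding(input_dim=spanish_vocab, output_dim=128)(encoder_inputs)
_, state_h, state_c = LSTM(64, return_state=True)(enc_emb)

# Decoder: during training it receives the shifted target sentence
# and starts from the encoder's final states
decoder_inputs = Input(shape=(max_english_len,))
dec_emb = Embedding(input_dim=english_vocab, output_dim=128)(decoder_inputs)
decoder_outputs = LSTM(64, return_sequences=True)(dec_emb, initial_state=[state_h, state_c])
logits = TimeDistributed(Dense(english_vocab))(decoder_outputs)
teacher_forcing_model = Model([encoder_inputs, decoder_inputs], Activation('softmax')(logits))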
Link to the original article: https://towardsdatascience.com/how-to-build-an-encoder-decoder-translation-model-using-lstm-with-python-and-keras-a31e9d864b9b