当前位置:网站首页>Construction of encoder decoder model with keras LSTM
Construction of encoder decoder model with keras LSTM
2020-11-06 01:28:00 【Artificial intelligence meets pioneer】
author |Nechu BM compile |VK source |Towards Data Science
Basic knowledge of : It's better to have some knowledge about cyclic neural networks before you know this article (RNN) And knowledge of codecs .
This article is about how to use Python and Keras A practical tutorial for developing a codec model , More precisely, it's a sequence to sequence (Seq2Seq). In the last tutorial , We developed a many to many translation model , As shown in the figure below :
This structure has an important limitation , The length of the sequence . As we can see in the image , The length of input sequence and output sequence must be the same . What if we need different lengths ?
for example , We want to implement a model that accepts different sequence lengths , It takes a sequence of words and outputs a number , Or image subtitle model , Where the input is an image , The output is a sequence of words .
If we want to develop a model with different input and output lengths , We need to develop a codec model . Through this tutorial , We'll learn how to develop models , And apply it to translation practice . The representation of the model is as follows .
We divide the model into two parts , First , We have an encoder , Input the Spanish sentence and generate a hidden vector . The encoder uses an embedded layer to convert words into a vector and then uses a cyclic neural network (RNN) To calculate the hidden state , Here we're going to use long-term and long-term memory (LSTM) layer .
Then the output of the encoder will be used as the input to the decoder . For the decoder , We will use it again LSTM layer , And the full connectivity layer for predicting English words .
Realization
The sample data comes from manythings.org. It is made up of sentence pairs of language . In our case , We're going to use Spanish - English to .
To build a model, we need to preprocess the data first , Get the maximum length of Spanish and English sentences .
1- Preprocessing
precondition : understand Keras Class in “tokenizer” and “pad_sequences”. If you want to go back in detail , We discussed this topic in the last tutorial .
First , We're going to import it into the library , And then read the downloaded data .
import string
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Model
from keras.layers import LSTM, Input, TimeDistributed, Dense, Activation, RepeatVector, Embedding
from keras.optimizers import Adam
from keras.losses import sparse_categorical_crossentropy
# The path of the translation file
path_to_data = 'data/spa.txt'
# Reading documents
translation_file = open(path_to_data,"r", encoding='utf-8')
raw_data = translation_file.read()
translation_file.close()
# Parsing data
raw_data = raw_data.split('\n')
pairs = [sentence.split('\t') for sentence in raw_data]
pairs = pairs[1000:20000]
Once we read the data , We'll keep the first example , In order to train faster . If we want to develop higher performance models , We need to use a complete dataset . Then we have to clean up the data by removing the capital letters and punctuation marks .
def clean_sentence(sentence):
# Put this sentence in lowercase
lower_case_sent = sentence.lower()
# Remove punctuation
string_punctuation = string.punctuation + "¡" + '¿'
clean_sentence = lower_case_sent.translate(str.maketrans('', '', string_punctuation))
return clean_sentence
Next , We identify sentences and analyze data .
def tokenize(sentences):
# establish tokenizer
text_tokenizer = Tokenizer()
# Apply to text
text_tokenizer.fit_on_texts(sentences)
return text_tokenizer.texts_to_sequences(sentences), text_tokenizer
After creating the function , We can preprocess :
# Clean up sentences
english_sentences = [clean_sentence(pair[0]) for pair in pairs]
spanish_sentences = [clean_sentence(pair[1]) for pair in pairs]
# Identifying words
spa_text_tokenized, spa_text_tokenizer = tokenize(spanish_sentences)
eng_text_tokenized, eng_text_tokenizer = tokenize(english_sentences)
print('Maximum length spanish sentence: {}'.format(len(max(spa_text_tokenized,key=len))))
print('Maximum length english sentence: {}'.format(len(max(eng_text_tokenized,key=len))))
# Check the length
spanish_vocab = len(spa_text_tokenizer.word_index) + 1
english_vocab = len(eng_text_tokenizer.word_index) + 1
print("Spanish vocabulary is of {} unique words".format(spanish_vocab))
print("English vocabulary is of {} unique words".format(english_vocab))
The code above prints the following results
According to the previous code , The maximum length of a Spanish sentence is 12 Word , The maximum length of an English sentence is 6 Word . Here we can see the advantages of using the codec model . In the past, we had limitations in dealing with equal length sentences , So we need to fill in English sentences with 12, Now it's only half of it . therefore , what's more , It also reduces LSTM Time steps , Reduced computing requirements and complexity .
We use padding to make the maximum length of sentences in each language equal .
max_spanish_len = int(len(max(spa_text_tokenized,key=len)))
max_english_len = int(len(max(eng_text_tokenized,key=len)))
spa_pad_sentence = pad_sequences(spa_text_tokenized, max_spanish_len, padding = "post")
eng_pad_sentence = pad_sequences(eng_text_tokenized, max_english_len, padding = "post")
# restore
spa_pad_sentence = spa_pad_sentence.reshape(*spa_pad_sentence.shape, 1)
eng_pad_sentence = eng_pad_sentence.reshape(*eng_pad_sentence.shape, 1)
Now we have the data , Let's build a model .
2. Model development
In the next section , We're going to create models , And in python Each layer added is explained in the code .
2.1- Encoder
The first layer we define is the image embedding layer . So , We must first add an input layer , The only parameter to consider here is “shape”, This is the maximum length of a Spanish sentence , In our case, it's 12.
Then we connect it to the embedded layer , The parameters to be considered here are “input_dim”( The length of the Spanish vocabulary ) and “output_dim”( The shape of the embedded vector ). This layer converts Spanish words into vectors that output dimension shapes .
The concept behind this is to extract the meaning of words in the form of spatial representation , Each of these dimensions is a feature that defines a word . for example ,“sol” Convert to shape to 128 Vector . The higher the output dimension , The more semantic meaning is extracted from each word , But the more computing and processing time it takes . We also need to find a balance between speed and performance .
input_sequence = Input(shape=(max_spanish_len,))
embedding = Embedding(input_dim=spanish_vocab, output_dim=128,)(input_sequence)
Next , We're going to add a size of 64 Of LSTM layer . Even if LSTM Each time step of the output of a hidden vector , We'll focus on the last one , So the parameter return_sequences yes 'False'. We will see LSTM How the layer in the decoder return_sequences=True Working in the case of .
input_sequence = Input(shape=(max_spanish_len,))
embedding = Embedding(input_dim=spanish_vocab, output_dim=128,)(input_sequence)
encoder = LSTM(64, return_sequences=False)(embedding)
When the return sequence is 'False' when , The output is the last hidden state .
2.2- decoder
The output of the encoder layer will be the hidden state of the last time step . And then we need to put this vector into the decoder . Let's take a closer look at the decoder part , And understand how it works .
As we can see in the image , The hidden vector is repeated n Time , therefore LSTM Each time step of the receives the same vector . To make every time step have the same vector , We need to use layers RepeatVector, Because its name means its role is to repeat the vector it receives , The only parameter we need to define is n, Repeat the number . This number is equal to the number of time steps in the decoder part , In other words, the maximum length of an English sentence 6.
input_sequence = Input(shape=(max_spanish_len,))
embedding = Embedding(input_dim=spanish_vocab, output_dim=128,)(input_sequence)
encoder = LSTM(64, return_sequences=False)(embedding)
r_vec = RepeatVector(max_english_len)(encoder)
Once we're ready to enter , We will continue to decode . It also works LSTM Layer built , The difference is in the parameters return_sequences, In this case 'True'. What is this parameter used for ? In the encoder section , We only expect one vector in the last time step , And ignore all the other vectors , Here we expect each time step to have an output vector , In this way, the full connection layer can predict .
input_sequence = Input(shape=(max_spanish_len,))
embedding = Embedding(input_dim=spanish_vocab, output_dim=128,)(input_sequence)
encoder = LSTM(64, return_sequences=False)(embedding)
r_vec = RepeatVector(max_english_len)(encoder)
decoder = LSTM(64, return_sequences=True, dropout=0.2)(r_vec)
We have the last step , Predict the translated words . So , We need to use the full connectivity layer . The parameter we need to define is the number of elements , This unit number is the shape of the output vector , It needs to be the same length as the English word . Why? ? The values of this vector are all close to zero , Except that one of the units is close to 1. Then we need to output 1 The index of the unit is mapped to the dictionary , In the dictionary, we map each unit to a word .
for example , If the input is a word ‘sun’, And the output is a vector , All of them are zero , And then the unit 472 yes 1, So we map the index to a dictionary that contains English words , And get the value ‘sun’.
We just saw how to apply the full connectivity layer to predict a word , But how do we predict the whole sentence ? Because we use return_sequence=True, therefore LSTM The layer outputs a vector at each time step , So we need to apply the full connectivity layer explained earlier in each time step , Let them predict one word at a time .
So ,Keras We developed a program called TimeDistributed Specific layers of , It applies the same full connectivity layer to each time step .
input_sequence = Input(shape=(max_spanish_len,))
embedding = Embedding(input_dim=spanish_vocab, output_dim=128,)(input_sequence)
encoder = LSTM(64, return_sequences=False)(embedding)
r_vec = RepeatVector(max_english_len)(encoder)
decoder = LSTM(64, return_sequences=True, dropout=0.2)(r_vec)
logits = TimeDistributed(Dense(english_vocab))(decoder)
Last , We create the model and add a loss function .
enc_dec_model = Model(input_sequence, Activation('softmax')(logits))
enc_dec_model.compile(loss=sparse_categorical_crossentropy,
optimizer=Adam(1e-3),
metrics=['accuracy'])
enc_dec_model.summary()
Once we define the model , We can train it .
model_results = enc_dec_model.fit(spa_pad_sentence, eng_pad_sentence, batch_size=30, epochs=100)
When the model is trained , We can translate for the first time . You can also find functions “logits_to_sentence”, It maps the output of the full connectivity layer to English vocabulary .
def logits_to_sentence(logits, tokenizer):
index_to_words = {idx: word for word, idx in tokenizer.word_index.items()}
index_to_words[0] = '<empty>'
return ' '.join([index_to_words[prediction] for prediction in np.argmax(logits, 1)])
index = 14
print("The english sentence is: {}".format(english_sentences[index]))
print("The spanish sentence is: {}".format(spanish_sentences[index]))
print('The predicted sentence is :')
print(logits_to_sentence(enc_dec_model.predict(spa_pad_sentence[index:index+1])[0], eng_text_tokenizer))
Conclusion
The codec architecture allows different input and output sequence lengths . First , We use an embedded layer to create a spatial representation of words , And input it into LSTM layer , Because we only focus on the output of the last time step , We use return_sequences=False.
This output vector needs to be repeated the same number of times as the decoder part , We use RepeatVector layer . The decoder will use LSTM, Parameters return_sequences=True, So the output of each time step is passed to the full connectivity layer .
Although this model is a good improvement from the previous tutorial , We can still improve accuracy . We can add a layer of encoder and decoder to one layer . We can also use the pre trained embedding layer , such as word2vec or Glove. Last , We can use the attention mechanism , This is a major improvement in the field of natural language processing . We'll introduce this concept in the next tutorial .
appendix : Codec without repetition vector
In this tutorial , We learned how to use RepeatVector Layer building encoder - decoder . There's a second option , We use the output of the model as input to the next time step , Instead of repeating hidden vectors , As shown in the figure .
The code to implement this model can be found in Keras Found in document , It needs to be right Keras The library has a deeper understanding , And development is much more complicated :https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html
Link to the original text :https://towardsdatascience.com/how-to-build-an-encoder-decoder-translation-model-using-lstm-with-python-and-keras-a31e9d864b9b
Welcome to join us AI Blog station : http://panchuang.net/
sklearn Machine learning Chinese official documents : http://sklearn123.com/
Welcome to pay attention to pan Chuang blog resource summary station : http://docs.panchuang.net/
版权声明
本文为[Artificial intelligence meets pioneer]所创,转载请带上原文链接,感谢
边栏推荐
- 加速「全民直播」洪流,如何攻克延时、卡顿、高并发难题?
- Natural language processing - wrong word recognition (based on Python) kenlm, pycorrector
- 一篇文章带你了解CSS 分页实例
- The difference between Es5 class and ES6 class
- 至联云分享:IPFS/Filecoin值不值得投资?
- [event center azure event hub] interpretation of error information found in event hub logs
- What to do if you are squeezed by old programmers? I don't want to quit
- IPFS/Filecoin合法性:保护个人隐私不被泄露
- 深度揭祕垃圾回收底層,這次讓你徹底弄懂她
- EOS创始人BM: UE,UBI,URI有什么区别?
猜你喜欢
随机推荐
Advanced Vue component pattern (3)
What problems can clean architecture solve? - jbogard
H5 makes its own video player (JS Part 2)
Grouping operation aligned with specified datum
How to encapsulate distributed locks more elegantly
5.5 controlleradvice notes - SSM in depth analysis and project practice
阿里云Q2营收破纪录背后,云的打开方式正在重塑
Vite + TS quickly build vue3 project and introduce related features
High availability cluster deployment of jumpserver: (6) deployment of SSH agent module Koko and implementation of system service management
Working principle of gradient descent algorithm in machine learning
Word segmentation, naming subject recognition, part of speech and grammatical analysis in natural language processing
Summary of common algorithms of linked list
Nodejs crawler captures ancient books and records, a total of 16000 pages, experience summary and project sharing
Relationship between business policies, business rules, business processes and business master data - modern analysis
一篇文章带你了解CSS对齐方式
I've been rejected by the product manager. Why don't you know
Using consult to realize service discovery: instance ID customization
带你学习ES5中新增的方法
6.1.2 handlermapping mapping processor (2) (in-depth analysis of SSM and project practice)
PHP应用对接Justswap专用开发包【JustSwap.PHP】