
Hands on deep learning (43) -- machine translation and its data construction

2022-07-04 09:41:00 Stay a little star

   Starting with this post, the blog turns to translation; everything from here on is NLP-related content. What are the connections and differences between translation and the sequence prediction (RNN, LSTM, GRU, etc.) and fill-in-the-blank (Bi-RNN) tasks we discussed earlier?

I. Machine translation

   Machine translation (machine translation) refers to automatically translating a sequence from one language to another. In fact, this research field dates back to the 1940s, shortly after the invention of the digital computer, in particular the wartime use of computers to crack language codes. For decades, before the rise of end-to-end learning with neural networks, statistical methods dominated the field (Brown et al., 1988; Brown et al., 1990). Because statistical machine translation (statistical machine translation) involves statistical analysis of components such as the translation model and the language model, the approaches based on neural networks are usually called neural machine translation (neural machine translation), to distinguish the two kinds of translation models.

   We focus mainly on neural machine translation methods, which emphasize end-to-end learning. Unlike the language-modeling problem, where the corpus is in a single language, a machine translation dataset consists of pairs of text sequences in a source language and a target language. Therefore, we need a rather different approach to preprocess machine translation datasets, instead of reusing the preprocessor of the language model. Below, we show how to load the preprocessed data into minibatches for training.

II. Machine translation dataset

import os
import torch 
from d2l import torch as d2l

1. Downloading and preprocessing datasets

   First, download an "English-French" dataset made up of bilingual sentence pairs from the Tatoeba project. Each line in the dataset is a tab-delimited pair of text sequences: an English text sequence and its French translation. Note that each text sequence can be a single sentence or a paragraph containing multiple sentences. In this machine translation problem of translating English into French, English is the source language (source language) and French is the target language (target language).

d2l.DATA_HUB['fra-eng'] = (d2l.DATA_URL + 'fra-eng.zip',
                           '94646ad1522d915e7b0f9296181140edcf86a4f5')

def read_data_nmt():
    """Load the English-French dataset"""
    data_dir = d2l.download_extract('fra-eng')
    # Read with an explicit UTF-8 encoding so the accented French text loads correctly
    with open(os.path.join(data_dir, 'fra.txt'), 'r', encoding='utf-8') as f:
        return f.read()

raw_text = read_data_nmt()
print(raw_text[:80])
Go.	Va !
Hi.	Salut !
Run!	Cours !
Run!	Courez !
Who?	Qui ?
Wow!	Ça alors !
Fire!
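
   Each line of the raw text is a tab-delimited English-French pair. As a quick sanity check (a minimal sketch, assuming raw_text has been loaded as above), we can count how many lines actually split into exactly two fields:
# Split the raw text into lines and count the well-formed tab-delimited pairs
pairs = [line.split('\t') for line in raw_text.split('\n')]
print(len(pairs), sum(len(p) == 2 for p in pairs))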

1.1 Text preprocessing

  1. Replace non-breaking spaces with ordinary spaces
  2. Convert uppercase letters to lowercase
  3. Insert a space between words and punctuation marks
def preprocess_nmt(text):
    """Preprocess the English-French dataset"""
    # Replace non-breaking spaces with ordinary spaces
    # (\u202f is a narrow no-break space and \xa0 is a no-break space)
    # and convert uppercase letters to lowercase
    text = text.replace('\u202f', ' ').replace('\xa0', ' ').lower()
    # Insert a space between words and punctuation marks
    out = ''
    for i, char in enumerate(text):
        if i > 0 and char in (',', '!', '.', '?') and text[i-1] != ' ':
            out += ' '
        out += char
    # The original code from the book (by Mu Li) is below; I find the loop above easier to read
    # def no_space(char, prev_char):
    #     return char in set(',.!?') and prev_char != ' '
    # out = [
    #     ' ' + char if i > 0 and no_space(char, text[i - 1]) else char
    #     for i, char in enumerate(text)]
    return ''.join(out)

text = preprocess_nmt(raw_text)
print(text[:80])
go .	va !
hi .	salut !
run !	cours !
run !	courez !
who ?	qui ?
wow !	ça alors !
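
   To see all three preprocessing rules at work in isolation, we can run a tiny hand-made string (made up here purely for illustration) through the function:
# The non-breaking space is replaced, the text is lowercased,
# and a space is inserted before the punctuation marks
print(preprocess_nmt('Hello,\xa0world!'))  # -> 'hello , world !'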

1.2 Tokenization

My personal understanding: a sentence or paragraph is split into a sequence of word tokens, much like cutting a ruler along its scale marks.

def tokenize_nmt(text, num_examples=None):
    """Tokenize the dataset"""
    source, target = [], []
    for i, line in enumerate(text.split('\n')):
        if num_examples and i > num_examples:
            break
        parts = line.split('\t')
        if len(parts) == 2:
            source.append(parts[0].split(' '))
            target.append(parts[1].split(' '))
    return source, target
source, target = tokenize_nmt(text)
source[:6], target[:6]
([['go', '.'],
  ['hi', '.'],
  ['run', '!'],
  ['run', '!'],
  ['who', '?'],
  ['wow', '!']],
 [['va', '!'],
  ['salut', '!'],
  ['cours', '!'],
  ['courez', '!'],
  ['qui', '?'],
  ['ça', 'alors', '!']])
# Plot a histogram of the number of tokens in each text sequence
# The sequences are short: usually fewer than 20 tokens
d2l.set_figsize()
_, _, patches = d2l.plt.hist([[len(l) for l in source], 
                              [len(l) for l in target]],
                             label=['source', 'target'])
for patch in patches[1].patches:
    patch.set_hatch('/')
d2l.plt.legend(loc='upper right');
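
   The histogram suggests that most sequences are indeed short. A quick count backs this up (a minimal sketch reusing the source and target lists built above):
# Fraction of sequences with at most 20 tokens
for name, seqs in [('source', source), ('target', target)]:
    frac = sum(len(l) <= 20 for l in seqs) / len(seqs)
    print(f'{name}: {frac:.2%} of sequences have at most 20 tokens')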

1.3 Vocabulary

   Since the machine translation dataset consists of language pairs, we can build two vocabularies, one for the source language and one for the target language. With word-level tokenization, the vocabulary is significantly larger than with character-level tokenization. To alleviate this, we treat tokens that appear fewer than 2 times as the same unknown token ("<unk>"). Besides that, we specify additional special tokens, such as the padding token ("<pad>") used to pad sequences to the same length within a minibatch, and the beginning-of-sequence ("<bos>") and end-of-sequence ("<eos>") tokens. These special tokens are commonly used in natural language processing tasks.

src_vocab = d2l.Vocab(source, min_freq=2,
                      reserved_tokens=['<pad>', '<bos>', '<eos>'])
len(src_vocab),list(src_vocab.token_to_idx.items())[:10]
(10012,
 [('<unk>', 0),
  ('<pad>', 1),
  ('<bos>', 2),
  ('<eos>', 3),
  ('.', 4),
  ('i', 5),
  ('you', 6),
  ('to', 7),
  ('the', 8),
  ('?', 9)])
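
   The target-language vocabulary can be built in exactly the same way (a short sketch; the load_data_nmt function below does this for us):
tgt_vocab = d2l.Vocab(target, min_freq=2,
                      reserved_tokens=['<pad>', '<bos>', '<eos>'])
len(tgt_vocab)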

2. Loading the dataset

   In a language model, each sequence sample has a fixed length, whether it is part of a sentence or a fragment spanning multiple sentences; this fixed length is specified by the num_steps (number of time steps, i.e., tokens) parameter. In machine translation, each sample is a pair of source and target text sequences, and each of these sequences may have a different length.

   For computational efficiency, we can still process only one minibatch of text sequences at a time via truncation (truncation) and padding (padding). Suppose every sequence in the same minibatch should have the same length num_steps. If a text sequence has fewer than num_steps tokens, we keep appending the special "<pad>" token to its end until it reaches that length; conversely, if it is longer, we truncate it, keeping only the first num_steps tokens and discarding the rest. This way, every text sequence has the same length and can be loaded in minibatches of the same shape.
(Actually I had a small question here: if we first appended the "<pad>" tokens and only applied truncation afterwards, then for sequences longer than the limit the appended pads would simply be cut off again, so adding them would be wasted work. Of course, this does not affect training at all; it is just a thought I had when reading this logic.)

def truncate_pad(line, num_steps, padding_token):
    """Truncate or pad a text sequence"""
    if len(line) > num_steps:
        return line[:num_steps]  # Truncate: keep only the first num_steps tokens
    return line + [padding_token] * (num_steps - len(line))  # Pad up to num_steps tokens

# Suppose num_steps is 10 and the padding token is <pad>; apply this to the first sentence
truncate_pad(src_vocab[source[0]], 10, src_vocab['<pad>'])
[47, 4, 1, 1, 1, 1, 1, 1, 1, 1]
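
   For the opposite case, a sequence longer than num_steps is simply cut off. Here is a quick check with a made-up list of token indices:
# A hypothetical list of 12 token indices is truncated to the first 10
truncate_pad(list(range(12)), 10, src_vocab['<pad>'])  # -> [0, 1, 2, ..., 9]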

   Now let's define a function that converts text sequences into minibatches for training. We append the special "<eos>" token to the end of every sequence to indicate the end of the sequence: when a model predicts a sequence token by token, generating "<eos>" signals that the output is complete. In addition, we record the length of each text sequence with padding tokens excluded; some models introduced later will need this length information.

def build_array_nmt(lines, vocab, num_steps):
    """Convert machine translation text sequences into minibatch tensors"""
    lines = [vocab[l] for l in lines]
    lines = [l + [vocab['<eos>']] for l in lines]  # Append the end-of-sequence token
    array = torch.tensor([
        truncate_pad(l, num_steps, vocab['<pad>']) for l in lines])
    # Valid length: the number of non-padding tokens (including <eos>)
    valid_len = (array != vocab['<pad>']).type(torch.int32).sum(1)
    return array, valid_len

# Note that the index of <eos> is 3
array, valid_len = build_array_nmt(source, src_vocab, 10)
array[1],valid_len[1]
(tensor([113,   4,   3,   1,   1,   1,   1,   1,   1,   1]), tensor(3))
# Put the data loading and preprocessing steps together
def load_data_nmt(batch_size, num_steps, num_examples=600):
    """Return the data iterator and the vocabularies of the translation dataset"""
    text = preprocess_nmt(read_data_nmt())  # Preprocess the raw text
    source, target = tokenize_nmt(text, num_examples)  # Tokenize
    # Build the source and target vocabularies
    src_vocab = d2l.Vocab(source, min_freq=2,
                          reserved_tokens=['<pad>', '<bos>', '<eos>'])
    tgt_vocab = d2l.Vocab(target, min_freq=2,
                          reserved_tokens=['<pad>', '<bos>', '<eos>'])
    src_array, src_valid_len = build_array_nmt(source, src_vocab, num_steps)
    tgt_array, tgt_valid_len = build_array_nmt(target, tgt_vocab, num_steps)
    data_arrays = (src_array, src_valid_len, tgt_array, tgt_valid_len)
    data_iter = d2l.load_array(data_arrays, batch_size)
    return data_iter, src_vocab, tgt_vocab

train_iter, src_vocab, tgt_vocab = load_data_nmt(batch_size=2, num_steps=8)
for X, X_valid_len, Y, Y_valid_len in train_iter:
    print('X:', X.type(torch.int32))
    print('valid lengths for X:', X_valid_len)
    print('Y:', Y.type(torch.int32))
    print('valid lengths for Y:', Y_valid_len)
    break
X: tensor([[16, 51,  4,  3,  1,  1,  1,  1],
        [ 6,  0,  4,  3,  1,  1,  1,  1]], dtype=torch.int32)
valid lengths for X: tensor([4, 4])
Y: tensor([[ 35,  37,  11,   5,   3,   1,   1,   1],
        [ 21,  51, 134,   4,   3,   1,   1,   1]], dtype=torch.int32)
valid lengths for Y: tensor([5, 5])
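
   To double-check what a batch contains, we can map the indices of the first source example back to tokens (a small sketch; to_tokens is the index-to-token helper of the d2l Vocab class):
# X is still bound to the batch printed above; padding and <eos> become visible
print(src_vocab.to_tokens(X[0].tolist()))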

Summary

  • Machine translation refers to the automatic translation of a text sequence from one language to another.
  • With word-level tokenization, the vocabulary is significantly larger than with character-level tokenization. To alleviate this, we can treat low-frequency tokens as the same unknown token.
  • By truncating and padding text sequences, we ensure that all sequences have the same length, so that they can be loaded in minibatches of the same shape.

Exercises

  1. Try different values of the num_examples argument in load_data_nmt. How does this affect the sizes of the source-language and target-language vocabularies?

This is easy to see: the larger num_examples is, the more words pass the low-frequency threshold, so the corresponding vocabularies grow. Larger vocabularies also allow more combinations, which has a noticeable impact on training and prediction cost. I tested this by setting the value to 800 and 1200.
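
A quick way to verify this is to rebuild the vocabularies with the values mentioned above and compare their sizes (a minimal sketch using the load_data_nmt function defined earlier):
for n in (600, 800, 1200):
    _, src_vocab_n, tgt_vocab_n = load_data_nmt(batch_size=2, num_steps=8,
                                                num_examples=n)
    print(f'num_examples={n}: |src_vocab|={len(src_vocab_n)}, '
          f'|tgt_vocab|={len(tgt_vocab_n)}')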

  2. Text in some languages (for example, Chinese and Japanese) has no word-boundary indicators (such as spaces). Is word-level tokenization still a good idea in this case? Why or why not?

Reference blog: "NLP + lexical series (1): a summary of Chinese word segmentation techniques, and an introduction and comparison of several word segmentation engines".
I have not done NLP translation work myself, but my naive idea is: Chinese could be tokenized character by character. As for how to split, each character is an individual unit, roughly the counterpart of a letter; it might even be possible to work at the byte level, since each Chinese character takes only a few bytes in common encodings.
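
As a rough illustration of this idea, character-level tokenization of Chinese needs no segmenter at all, since a Python string already iterates character by character (a toy sketch with a made-up sentence):
# Character-level "tokenization": each Chinese character becomes one token
sentence = '机器翻译很有趣'
print(list(sentence))  # -> ['机', '器', '翻', '译', '很', '有', '趣']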
