Code implementation of NNLM
2022-07-02 13:46:00 【InfoQ】
Let's review the model structure first

- y is the output.
- x is the input; it is mapped through the embedding matrix C, but the original formula still writes the input as x.
- d is the hidden-layer bias.
- H is the weight matrix from the input layer to the hidden layer.
- U is the weight matrix from the hidden layer to the output layer.
- W is the weight matrix connecting x directly to the output layer.
- b is the output-layer bias.
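Putting these symbols together, the forward pass that the code below implements is the classic NNLM formula, with x standing for the concatenated word-embedding vectors of the input words:

$$
y = b + Wx + U\tanh(d + Hx)
$$

In the code the batch dimension comes first, so the same products show up as X @ H, X @ W and tanh @ U.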
Let's explain how this network comes about.




Code
The model code
class NNLM(nn.Module):
    def __init__(self):
        super(NNLM, self).__init__()
        self.C = nn.Embedding(n_class, m)                                       # word embedding table C
        self.H = nn.Parameter(torch.randn(len_sen * m, n_hidden, requires_grad=True))  # input -> hidden weights
        self.d = nn.Parameter(torch.randn(n_hidden))                            # hidden-layer bias
        self.U = nn.Parameter(torch.randn(n_hidden, n_class, requires_grad=True))      # hidden -> output weights
        self.W = nn.Parameter(torch.zeros(len_sen * m, n_class, requires_grad=True))   # input -> output weights, initialized to zero
        self.b = nn.Parameter(torch.randn(n_class))                             # output-layer bias

    def forward(self, X):                             # X : [batch_size, len_sen]
        X = self.C(X)                                 # X : [batch_size, len_sen, m]
        X = X.view(-1, len_sen * m)                   # [batch_size, len_sen * m]
        tanh = torch.tanh(self.d + X @ self.H)        # [batch_size, n_hidden]
        output = self.b + X @ self.W + tanh @ self.U  # [batch_size, n_class]
        return output
__init__(self) defines the parameters listed above. self.C is the embedding operation. The remaining attributes are the weights and biases of the network. As mentioned earlier, W is initialized to a zero matrix, so it uses torch.zeros, while the others are randomly initialized with torch.randn.
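As a quick sanity check, the registered parameters and their shapes can be listed like this (a minimal sketch, assuming len_sen, m, n_hidden and n_class have already been set as in the full script further down, i.e. 6, 3, 14 and 11):

model = NNLM()
for name, p in model.named_parameters():
    print(name, tuple(p.shape), p.requires_grad)
# The six parameters are H (18, 14), d (14,), U (14, 11), W (18, 11), b (11,) and C.weight (11, 3).
# nn.Parameter makes all of them trainable, so requires_grad is True even for the zero-initialized W.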
forward(self, X) defines the forward propagation. X = self.C(X) first runs X through the embedding and assigns the result back to X. This corresponds to what was said earlier: although the input goes through an embedding step, the original formula still writes the input as x. Tensor.view changes the shape of a tensor (see torch.Tensor.view — PyTorch 1.11.0 documentation). After the reshape, the word-embedding vectors of the words in each sentence are joined together into one long vector.
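A tiny standalone illustration of what the embedding plus view do to the shapes (the dimensions match the full script below):

import torch
import torch.nn as nn

batch_size, len_sen, m, n_class = 5, 6, 3, 11
C = nn.Embedding(n_class, m)                          # the embedding table
X = torch.randint(0, n_class, (batch_size, len_sen))  # a batch of word indices
E = C(X)                                              # [5, 6, 3]: one m-dimensional vector per word
flat = E.view(-1, len_sen * m)                        # [5, 18]: the 6 embeddings of each sentence concatenated
print(X.shape, E.shape, flat.shape)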
self.d + X @ self.H is the computation of the hidden layer from the input layer. tanh = torch.tanh(self.d + X @ self.H) then passes that result through the tanh activation function; the activated result is simply assigned to a variable named tanh. output = self.b + X @ self.W + tanh @ self.U is the output-layer computation. Note that the output consists of two parts: one comes from the hidden layer and one comes directly from the input layer; only by adding the two (plus the output bias) do we obtain the final output.
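To make the two paths explicit, forward can be written equivalently like this (a drop-in rewrite of the method above, with no change in behaviour):

def forward(self, X):                             # X : [batch_size, len_sen]
    X = self.C(X).view(-1, len_sen * m)           # concatenated word embeddings
    hidden = torch.tanh(self.d + X @ self.H)      # hidden layer, [batch_size, n_hidden]
    direct_path = X @ self.W                      # input layer -> output layer directly
    hidden_path = hidden @ self.U                 # hidden layer -> output layer
    return self.b + direct_path + hidden_path     # sum of both paths plus the output bias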
Now let's trace the shapes through the network in terms of batch_size, len_sen, m, n_hidden, and n_class:
- At the start, the input is a batch of sentences, so X has shape [batch_size, len_sen]; each element of this matrix is a word (an index into the vocabulary).
- After the embedding step, each word is represented by a feature vector, so X becomes [batch_size, len_sen, m]. What used to be a single element is now an m-dimensional vector, which adds a dimension and makes X a three-dimensional tensor.
- Then Tensor.view changes the shape: X.view(-1, len_sen * m) turns it back into a two-dimensional matrix whose second dimension is len_sen * m and whose first dimension is inferred automatically (-1 means "infer this dimension"). In effect, the embedding vectors of the different words in a sentence are concatenated.
tanh: at this point we have reached the hidden layer, so the length of the vector becomes the hidden-layer size n_hidden, which you have to choose yourself. The hidden-layer size influences how well the network performs; of course, the data set here is tiny, so its exact value hardly matters. Rules of thumb for choosing the hidden size are usually stated in terms of:

- the input layer size,
- the number of output classes,
- the number of training samples,
- and a constant.

For the common formulas, see: How to determine the number and size of hidden layers in neural network _LolitaAnn Technology blog _51CTO Blog.

Here we take h = √(len_sen · m · n_class), i.e. the geometric mean of the input length (len_sen * m in our code) and the number of classes (the vocabulary size n_class). That works out to h = 14; a quick check follows below.

- At this point tanh has shape [batch_size, n_hidden].
- The output layer has shape [batch_size, n_class]: for each sample the output layer produces a vector whose length equals the vocabulary size, and the position of the largest entry marks the predicted word in the vocabulary.
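A quick check of these numbers (a standalone sketch using the values from the full script below: 6 input words, 3-dimensional embeddings, an 11-word vocabulary):

len_sen, m, n_class = 6, 3, 11                  # values used in the full script below
n_hidden = int((len_sen * m * n_class) ** 0.5)  # int(sqrt(18 * 11)) = int(14.07...)
print(n_hidden)                                 # 14

# Shapes flowing through forward() for the 5 training sentences:
#   input        [5, 6]
#   after C      [5, 6, 3]
#   after view   [5, 18]
#   tanh         [5, 14]
#   output       [5, 11]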
Data preprocessing part of the code
sentences = ["The cat is walking in the bedroom",
             "A dog was running in a room",
             "The cat is running in a room",
             "A dog is walking in a bedroom",
             "The dog was walking in the room"]

word_list = " ".join(sentences).lower().split()
word_list = list(set(word_list))
word_dict = {w: i for i, w in enumerate(word_list)}
number_dict = {i: w for i, w in enumerate(word_list)}
- word_list = " ".join(sentences).lower().split() joins all the sentences in the data set with spaces, converts the result to lowercase, and splits it on spaces into individual words. The resulting word list still contains many duplicates.
- word_list = list(set(word_list)) turns that list into a set to remove the duplicates and then back into a list.
- The last two lines use enumerate to build the word-to-index (word_dict) and index-to-word (number_dict) dictionaries for the vocabulary.
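For this data set the vocabulary works out to 11 distinct words, so n_class will be 11. The indices themselves depend on the iteration order of the set, so they can differ from run to run; a small sketch:

sentences = ["The cat is walking in the bedroom",
             "A dog was running in a room",
             "The cat is running in a room",
             "A dog is walking in a bedroom",
             "The dog was walking in the room"]

word_list = list(set(" ".join(sentences).lower().split()))
word_dict = {w: i for i, w in enumerate(word_list)}

print(len(word_dict))  # 11: the, cat, is, walking, in, bedroom, a, dog, was, running, room
print(word_dict)       # e.g. {'room': 0, 'dog': 1, ...} -- exact indices vary between runs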
def dataset():
    input = []
    target = []
    for sen in sentences:
        word = sen.lower().split()             # space tokenizer
        i = [word_dict[n] for n in word[:-1]]  # words 1..n-1 as input
        t = word_dict[word[-1]]                # word n as target; this is usually called a 'causal' language model
        input.append(i)
        target.append(t)
    return input, target
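Assuming sentences and word_dict are defined as above, the first sentence yields the indices of its first six words as input and the index of its last word as target:

inp, tgt = dataset()
print(inp[0])  # indices of ['the', 'cat', 'is', 'walking', 'in', 'the']
print(tgt[0])  # index of 'bedroom', the word the model has to predict
# torch.LongTensor(inp) has shape [5, 6]; torch.LongTensor(tgt) has shape [5]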

Complete code
import torch
import torch.nn as nn
import torch.optim as optim


def dataset():
    input = []
    target = []
    for sen in sentences:
        word = sen.lower().split()             # space tokenizer
        i = [word_dict[n] for n in word[:-1]]  # words 1..n-1 as input
        t = word_dict[word[-1]]                # word n as target; this is usually called a 'causal' language model
        input.append(i)
        target.append(t)
    return input, target


# Model
class NNLM(nn.Module):
    def __init__(self):
        super(NNLM, self).__init__()
        self.C = nn.Embedding(n_class, m)
        self.H = nn.Parameter(torch.randn(len_sen * m, n_hidden, requires_grad=True))
        self.d = nn.Parameter(torch.randn(n_hidden))
        self.U = nn.Parameter(torch.randn(n_hidden, n_class, requires_grad=True))
        self.W = nn.Parameter(torch.zeros(len_sen * m, n_class, requires_grad=True))
        self.b = nn.Parameter(torch.randn(n_class))

    def forward(self, X):                             # X : [batch_size, len_sen]
        X = self.C(X)                                 # X : [batch_size, len_sen, m]
        X = X.view(-1, len_sen * m)                   # [batch_size, len_sen * m]
        tanh = torch.tanh(self.d + X @ self.H)        # [batch_size, n_hidden]
        output = self.b + X @ self.W + tanh @ self.U  # [batch_size, n_class]
        return output


if __name__ == '__main__':
    sentences = ["The cat is walking in the bedroom",
                 "A dog was running in a room",
                 "The cat is running in a room",
                 "A dog is walking in a bedroom",
                 "The dog was walking in the room"]

    word_list = " ".join(sentences).lower().split()
    word_list = list(set(word_list))
    word_dict = {w: i for i, w in enumerate(word_list)}
    number_dict = {i: w for i, w in enumerate(word_list)}

    n_class = len(word_dict)                        # vocabulary size
    len_sen = 6                                     # number of input words, n-1 in the paper
    m = 3                                           # embedding size, m in the paper
    n_hidden = int((len_sen * m * n_class) ** 0.5)  # hidden size, h in the paper

    model = NNLM()
    loss = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.003)

    input, target = dataset()
    input = torch.LongTensor(input)
    target = torch.LongTensor(target)

    # Look at the predictions before training.
    predict = model(input).data.max(1, keepdim=True)[1]
    print([sen.split()[:6] for sen in sentences], '->', [number_dict[n.item()] for n in predict.squeeze()])

    # Training
    for epoch in range(5000):
        optimizer.zero_grad()
        output = model(input)
        # output : [batch_size, n_class], target : [batch_size]
        Loss = loss(output, target)
        if (epoch + 1) % 1000 == 0:
            print('Epoch:', '%04d' % (epoch + 1), 'cost =', '{:.6f}'.format(Loss))
        Loss.backward()
        optimizer.step()

    # Predict & test
    predict = model(input).data.max(1, keepdim=True)[1]
    print([sen.split()[:6] for sen in sentences], '->', [number_dict[n.item()] for n in predict.squeeze()])
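Once the script has run, the trained model can also be queried with a new six-word prefix, as long as every word is in the vocabulary. A minimal sketch (the prefix here is a hypothetical example):

prefix = "the dog is running in the".split()            # 6 = len_sen words, all in the vocabulary
x = torch.LongTensor([[word_dict[w] for w in prefix]])  # shape [1, len_sen]
with torch.no_grad():
    pred = model(x).argmax(dim=1)                       # index of the most probable next word
print(" ".join(prefix), "->", number_dict[pred.item()])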