Deep Understanding of nn.Embedding in PyTorch
2022-07-02 11:56:00 【raelum】
1. Prerequisites
1.1 Corpus
TL;DR: the language data that an NLP task relies on is called a corpus.
In more detail: a corpus (plural: corpora) is a collection of authentic text or audio organized into a dataset. "Authentic" here means text or audio produced by native speakers of the language. A corpus can be drawn from newspapers, novels, recipes, radio and TV programs, movies, tweets, and so on. In natural language processing, a corpus contains the text and speech data that can be used to train AI systems.
1.2 Tokens
For simplicity, suppose our corpus contains only three English sentences, all of which have already been preprocessed (lowercased, punctuation removed):
corpus = ["he is an old worker", "english is a useful tool", "the cinema is far away"]
We usually need to tokenize each sentence into a sequence of tokens; a simple split will do here:
def tokenize(corpus):
    return [sentence.split() for sentence in corpus]
tokens = tokenize(corpus)
print(tokens)
# [['he', 'is', 'an', 'old', 'worker'], ['english', 'is', 'a', 'useful', 'tool'], ['the', 'cinema', 'is', 'far', 'away']]
Here we tokenize at the word level; tokenization can also be done at the character level, as in the sketch below.
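For illustration, a minimal character-level tokenizer could look like this (just a sketch of the idea; the rest of this post sticks with word-level tokens):
def char_tokenize(corpus):
    # Split each sentence into individual characters (spaces included)
    # instead of splitting on whitespace.
    return [list(sentence) for sentence in corpus]
print(char_tokenize(corpus)[0][:10])
# ['h', 'e', ' ', 'i', 's', ' ', 'a', 'n', ' ', 'o']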
1.3 Vocabulary
The vocabulary contains every token that appears in the corpus, with no repetition, and it is very easy to build:
vocab = set(sum(tokens, []))
print(vocab)
# {'is', 'useful', 'an', 'old', 'far', 'the', 'away', 'a', 'he', 'tool', 'cinema', 'english', 'worker'}
The vocabulary itself is usually not what matters most in an NLP task; what we need is to assign a unique index to each token and build a token-to-index mapping, word2idx. Here we build word2idx in order of token frequency (most frequent first).
from collections import Counter
word2idx = {
    word: idx
    for idx, (word, freq) in enumerate(
        sorted(Counter(sum(tokens, [])).items(), key=lambda x: x[1], reverse=True))
}
print(word2idx)
# {'is': 0, 'he': 1, 'an': 2, 'old': 3, 'worker': 4, 'english': 5, 'a': 6, 'useful': 7, 'tool': 8, 'the': 9, 'cinema': 10, 'far': 11, 'away': 12}
Conversely, we can also build idx2word:
idx2word = {idx: word for word, idx in word2idx.items()}
print(idx2word)
# {0: 'is', 1: 'he', 2: 'an', 3: 'old', 4: 'worker', 5: 'english', 6: 'a', 7: 'useful', 8: 'tool', 9: 'the', 10: 'cinema', 11: 'far', 12: 'away'}
The tokens from Section 1.2 can likewise be converted into their index representation:
encoded_tokens = [[word2idx[token] for token in line] for line in tokens]
print(encoded_tokens)
# [[1, 0, 2, 3, 4], [5, 0, 6, 7, 8], [9, 10, 0, 11, 12]]
We will come back to this index representation when we discuss nn.Embedding.
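As a quick sanity check, the idx2word mapping built above lets us decode the indices back into tokens (a small sketch):
decoded_tokens = [[idx2word[idx] for idx in line] for line in encoded_tokens]
print(decoded_tokens[0])
# ['he', 'is', 'an', 'old', 'worker']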
2. nn.Embedding Basics
2.1 Why embedding?
An RNN cannot process words directly, so we need some way to turn each word into a numerical vector that can serve as the RNN's input. This mapping from words to vectors in a vector space is called word embedding, and the resulting vectors are called word vectors.
2.2 Basic parameters
Let's first look at the basic parameters of nn.Embedding; once its basic usage is clear, we will go through the full parameter list.
The basic signature is:
nn.Embedding(num_embeddings, embedding_dim)
Here num_embeddings is the size of the vocabulary, i.e. len(vocab), and embedding_dim is the dimension of the word vectors.
We reuse the example from Chapter 1 and embed each token into a 3-dimensional vector space. Note that the vocabulary built there actually contains 13 tokens, so num_embeddings should in general be len(vocab) = 13; the demo below creates a 12-row table, which still covers every index of the first sentence (indices 0 to 4), the only one fed in later:
torch.manual_seed(0) # For the sake of reproducibility
emb = nn.Embedding(12, 3)
The embedding layer has only one learnable parameter, weight, which is initialized from the standard normal distribution when the layer is created:
print(emb.weight)
# Parameter containing:
# tensor([[-1.1258, -1.1524, -0.2506],
# [-0.4339, 0.8487, 0.6920],
# [-0.3160, -2.1152, 0.3223],
# [-1.2633, 0.3500, 0.3081],
# [ 0.1198, 1.2377, 1.1168],
# [-0.2473, -1.3527, -1.6959],
# [ 0.5667, 0.7935, 0.4397],
# [ 0.1124, 0.6408, 0.4412],
# [-0.2159, -0.7425, 0.5627],
# [ 0.2596, 0.5229, 2.3022],
# [-1.4689, -1.5867, 1.2032],
# [ 0.0845, -1.2001, -0.0048]], requires_grad=True)
Here weight can be regarded as the weight matrix of the embedding layer.
Now let's look at the input of nn.Embedding. Intuitively, given a tokenized sentence, feeding its tokens into the embedding layer should return the corresponding word vectors. In fact, nn.Embedding does not take the tokenized sentence itself as input, but its index form, i.e. the encoded_tokens from Chapter 1.
nn.Embedding accepts an integer tensor of any shape as input, but since the values are interpreted as indices, every entry must not exceed num_embeddings - 1, otherwise an error is raised. nn.Embedding then acts like a lookup table: it uses these indices to look up the corresponding rows of weight and returns them as word vectors.
print(emb.weight)
# tensor([[-1.1258, -1.1524, -0.2506],
# [-0.4339, 0.8487, 0.6920],
# [-0.3160, -2.1152, 0.3223],
# [-1.2633, 0.3500, 0.3081],
# [ 0.1198, 1.2377, 1.1168],
# [-0.2473, -1.3527, -1.6959],
# [ 0.5667, 0.7935, 0.4397],
# [ 0.1124, 0.6408, 0.4412],
# [-0.2159, -0.7425, 0.5627],
# [ 0.2596, 0.5229, 2.3022],
# [-1.4689, -1.5867, 1.2032],
# [ 0.0845, -1.2001, -0.0048]], requires_grad=True)
sentence = torch.tensor(encoded_tokens[0]) # the corpus has three sentences; only the first is used here
print(sentence)
# tensor([1, 0, 2, 3, 4])
print(emb(sentence))
# tensor([[-0.4339, 0.8487, 0.6920],
# [-1.1258, -1.1524, -0.2506],
# [-0.3160, -2.1152, 0.3223],
# [-1.2633, 0.3500, 0.3081],
# [ 0.1198, 1.2377, 1.1168]], grad_fn=<EmbeddingBackward0>)
print(emb.weight[sentence] == emb(sentence))
# tensor([[True, True, True],
# [True, True, True],
# [True, True, True],
# [True, True, True],
# [True, True, True]])
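Since nn.Embedding accepts indices of any shape, we can also pass a whole batch at once; the output simply gains a trailing dimension of size embedding_dim. The sketch below batches the first two sentences (both fit within the 12-row demo table above) and then feeds the word vectors into an RNN, the typical next step from Section 2.1; the hidden size of 8 is an arbitrary choice for illustration:
batch = torch.tensor(encoded_tokens[:2])   # shape: (2, 5)
embedded = emb(batch)
print(embedded.shape)
# torch.Size([2, 5, 3])
rnn = nn.RNN(input_size=3, hidden_size=8, batch_first=True)  # expects (batch, seq_len, input_size)
output, h_n = rnn(embedded)
print(output.shape, h_n.shape)
# torch.Size([2, 5, 8]) torch.Size([1, 2, 8])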
2.3 The difference between nn.Embedding and nn.Linear
Careful readers may have noticed that nn.Embedding looks very similar to nn.Linear. What exactly is the difference between them?
Recall nn.Linear with bias disabled: let the input be a row vector $x$ and let $A$ be the matrix stored in nn.Linear.weight (of shape hidden_size × input_size). The computation is
$y = xA^{\text{T}}$
where $x$ and $y$ are both row vectors.
If $x$ is a one-hot vector whose $i$-th position is $1$, then $y$ is exactly the $i$-th row of $A^{\text{T}}$.
Now take a word $w$ whose index in word2idx is $i$. In nn.Embedding, we use this index $i$ to fetch the $i$-th row of emb.weight directly. In nn.Linear, we would instead encode the index $i$ as a one-hot vector and multiply it by the weight matrix, which again picks out the $i$-th row of $A^{\text{T}}$.
Consider the following example:
torch.manual_seed(0)
vocab_size = 4 # The size of the vocabulary is 4
embedding_dim = 3 # The dimension of the word vector is 3
weight = torch.randn(4, 3) # Randomly initialize the weight matrix
# Give the linear layer and the embedding layer the same weights
linear_layer = nn.Linear(4, 3, bias=False)
linear_layer.weight.data = weight.T # note the transpose: nn.Linear stores weight as (out_features, in_features)
emb_layer = nn.Embedding(4, 3)
emb_layer.weight.data = weight
idx = torch.tensor(2) # suppose the word's index in word2idx is 2
word = torch.tensor([0, 0, 1, 0]).to(torch.float) # the one-hot representation of that word
print(emb_layer(idx))
# tensor([ 0.4033, 0.8380, -0.7193], grad_fn=<EmbeddingBackward0>)
print(linear_layer(word))
# tensor([ 0.4033, 0.8380, -0.7193], grad_fn=<SqueezeBackward3>)
From this we can conclude:
nn.Linear takes a (dense) vector as input, while nn.Embedding takes a discrete index as input; nn.Embedding is essentially a bias-free nn.Linear whose input is a one-hot vector.
Moreover, nn.Linear performs a matrix multiplication during its forward pass, whereas nn.Embedding simply looks up a row of its weight by index, so in this scenario nn.Embedding is clearly more efficient.
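We can verify this equivalence for every index at once with torch.nn.functional.one_hot (a quick sketch continuing the example above):
import torch.nn.functional as F
all_indices = torch.arange(vocab_size)   # tensor([0, 1, 2, 3])
one_hot = F.one_hot(all_indices, num_classes=vocab_size).to(torch.float)
print(torch.allclose(emb_layer(all_indices), linear_layer(one_hot)))
# True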
2.4 How nn.Embedding gets updated
After browsing some posts on the official PyTorch forum and on Stack Overflow, I found that many people are confused about how the weight of nn.Embedding is updated.
The weight of nn.Embedding is itself the word embedding.
In fact, nn.Embedding.weight is updated with neither Skip-gram nor CBOW. Recall the simplest multilayer perceptron, in which nn.Linear.weight is updated automatically by backpropagation. Once we view nn.Embedding as a special kind of nn.Linear, its update mechanism is easy to understand: the weight is simply updated according to its gradient.
After training, the word embedding we obtain is the one best suited to the current task, rather than a general-purpose embedding such as word2vec or GloVe.
Of course, we can also initialize with a pretrained word embedding (e.g. the word2vec vectors mentioned above) before training, but then we should consider retraining or fine-tuning it for the current task.
If we use a pretrained word embedding and do not want it to be updated during training, we can freeze its gradient:
emb.weight.requires_grad = False
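To make this concrete, here is a minimal sketch showing that emb.weight is updated by plain gradient descent and stops changing once requires_grad is set to False (the loss used here is a dummy one, made up purely to produce gradients):
torch.manual_seed(0)
emb = nn.Embedding(12, 3)
optimizer = torch.optim.SGD(emb.parameters(), lr=0.1)
before = emb.weight.data.clone()
loss = emb(torch.tensor([1, 0, 2, 3, 4])).pow(2).mean()   # dummy loss, only to produce gradients
loss.backward()
optimizer.step()
print(torch.equal(before, emb.weight.data))
# False  (the looked-up rows have been updated by gradient descent)
emb.weight.requires_grad = False   # freeze: no gradient flows into emb.weight from now on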
3. nn.Embedding Advanced
In this chapter, we will go through all the parameters of nn.Embedding and describe how to use pretrained word embeddings.
3.1 All parameters
The complete parameter list is given in the official documentation of torch.nn.Embedding; the remaining parameters are explained one by one below.
padding_idx
As mentioned, nn.Embedding accepts tensors of any shape as input, but most of the time the input has shape batch_size × sequence_length, which requires every sequence in the batch to have the same length.
Recall the example in Section 1.2: the three sentences in that corpus happen to have the same length (the same number of tokens), but they were deliberately chosen that way. In real tasks it is hard to guarantee that all sentences in a batch are of equal length, so the shorter ones have to be padded. Since the input to nn.Embedding consists of indices, the padding must also be an index; which index should we use?
Suppose the corpus is now the following (with tokens, word2idx and encoded_tokens rebuilt using the code from Chapter 1):
corpus = ["he is an old worker", "time tries truth", "better late than never"]
print(word2idx)
# {'he': 0, 'is': 1, 'an': 2, 'old': 3, 'worker': 4, 'time': 5, 'tries': 6, 'truth': 7, 'better': 8, 'late': 9, 'than': 10, 'never': 11}
print(encoded_tokens)
# [[0, 1, 2, 3, 4], [5, 6, 7], [8, 9, 10, 11]]
We can add a new token <pad> (the padding token) to word2idx and give it a new index:
word2idx['<pad>'] = 12
Then pad encoded_tokens to the maximum sentence length:
max_length = max([len(seq) for seq in encoded_tokens])
for i in range(len(encoded_tokens)):
    encoded_tokens[i] += [word2idx['<pad>']] * (max_length - len(encoded_tokens[i]))
print(encoded_tokens)
# [[0, 1, 2, 3, 4], [5, 6, 7, 12, 12], [8, 9, 10, 11, 12]]
Create the embedding layer and specify padding_idx:
emb = nn.Embedding(len(word2idx), 3, padding_idx=12) # Suppose the word vector dimension is 3
print(emb.weight)
# tensor([[ 1.5017, -1.1737, 0.1742],
# [-0.9511, -0.4172, 1.5996],
# [ 0.6306, 1.4186, 1.3872],
# [-0.1833, 1.4485, -0.3515],
# [ 0.2474, -0.8514, -0.2448],
# [ 0.4386, 1.3905, 0.0328],
# [-0.1215, 0.5504, 0.1499],
# [ 0.5954, -1.0845, 1.9494],
# [ 0.0668, 1.1366, -0.3414],
# [-0.0260, -0.1091, 0.4937],
# [ 0.4947, 1.1701, -0.5660],
# [ 1.1717, -0.3970, -1.4958],
# [ 0.0000, 0.0000, 0.0000]], requires_grad=True)
As you can see, the word vector of the padding token is the zero vector, and it will not be updated during training (it stays a zero vector throughout).
padding_idx defaults to None, i.e. no padding.
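A small sketch (continuing the layer just created) confirms that the padding row never receives any gradient:
out = emb(torch.tensor(encoded_tokens))   # shape: (3, 5, 3)
out.sum().backward()
print(emb.weight.grad[word2idx['<pad>']])
# tensor([0., 0., 0.])   # the padding row gets zero gradient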
max_norm
If the norm of a word vector exceeds max_norm, it is renormalized so that its norm equals max_norm:
$w := \text{max\_norm} \cdot \dfrac{w}{\Vert w \Vert}$
max_norm defaults to None, i.e. no renormalization.
norm_type
When max_norm is specified, norm_type determines which p-norm is used in the computation. It defaults to the 2-norm.
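A minimal sketch of the renormalization behaviour (the vocabulary size of 10 and max_norm of 1.0 are arbitrary choices here):
emb = nn.Embedding(10, 3, max_norm=1.0)   # uses the 2-norm by default (norm_type=2.0)
vectors = emb(torch.arange(10))           # the renormalization happens during the forward pass
print((vectors.norm(dim=1) <= 1.0 + 1e-6).all())
# tensor(True)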
scale_grad_by_freq
If this parameter is set to True, then when the word vector $w$ is updated, its gradient is scaled by the inverse of the word's frequency within the mini-batch:
$\dfrac{\partial \text{Loss}}{\partial w} := \dfrac{1}{\text{frequency}(w)} \cdot \dfrac{\partial \text{Loss}}{\partial w}$
It defaults to False.
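A small sketch of the effect (index 0 appears twice in the toy batch below, so its gradient should be halved when the flag is on; the sizes are arbitrary):
for flag in (False, True):
    emb = nn.Embedding(5, 3, scale_grad_by_freq=flag)
    emb(torch.tensor([0, 0, 1])).sum().backward()
    print(flag, emb.weight.grad[0])
# False tensor([2., 2., 2.])
# True tensor([1., 1., 1.])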
sparse
If set to True, the gradient with respect to Embedding.weight becomes a sparse tensor, and only a few optimizers support it: optim.SGD, optim.SparseAdam and optim.Adagrad. It defaults to False.
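A quick sketch (arbitrary sizes again) of what the sparse flag changes:
emb = nn.Embedding(10, 3, sparse=True)
emb(torch.tensor([1, 2, 2])).sum().backward()
print(emb.weight.grad.is_sparse)
# True
print(emb.weight.grad.to_dense()[2])
# tensor([2., 2., 2.])   # only the looked-up rows carry non-zero gradient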
3.2 Using pretrained word embeddings
In some cases we need to use a pretrained word embedding; this can be done with the from_pretrained method, as follows:
torch.manual_seed(0)
pretrained_embeddings = torch.randn(4, 3)
print(pretrained_embeddings)
# tensor([[ 1.5410, -0.2934, -2.1788],
# [ 0.5684, -1.0845, -1.3986],
# [ 0.4033, 0.8380, -0.7193],
# [-0.4033, -0.5966, 0.1820]])
emb = nn.Embedding.from_pretrained(pretrained_embeddings)
print(emb.weight)
# tensor([[ 1.5410, -0.2934, -2.1788],
# [ 0.5684, -1.0845, -1.3986],
# [ 0.4033, 0.8380, -0.7193],
# [-0.4033, -0.5966, 0.1820]])
If you want to prevent the pretrained embedding from being updated during subsequent training, set the freeze parameter to True (which is in fact its default value):
emb = nn.Embedding.from_pretrained(pretrained_embeddings, freeze=True)
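In practice the pretrained matrix usually comes from a file of vectors (word2vec/GloVe style: one word followed by its numbers per line) that has to be aligned with our own word2idx. The sketch below shows one possible way to do this; the file name vectors.txt is hypothetical, and out-of-vocabulary words simply keep a random vector:
def load_pretrained(path, word2idx, embedding_dim):
    # Start from random vectors; rows of words found in the file get overwritten.
    weight = torch.randn(len(word2idx), embedding_dim)
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split()
            word, values = parts[0], parts[1:]
            if word in word2idx and len(values) == embedding_dim:
                weight[word2idx[word]] = torch.tensor([float(v) for v in values])
    return weight
# pretrained = load_pretrained('vectors.txt', word2idx, 3)   # 'vectors.txt' is a hypothetical file
# emb = nn.Embedding.from_pretrained(pretrained, freeze=True)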
4. Closing Remarks
My understanding of nn.Embedding may still be incomplete; if you spot any mistakes, please point them out in the comments.
If this article helped you, consider following, liking, bookmarking or leaving a comment; your support is my biggest motivation to keep writing.