Deep Understanding of nn.Embedding in PyTorch
2022-07-02 11:56:00 【raelum】
1. Prerequisites
1.1 Corpus
TL;DR: the language data that an NLP task relies on is called a corpus.
In more detail: a corpus (plural: corpora) is a collection of authentic text or audio organized into a dataset. "Authentic" here means text or audio produced by native speakers of the language. A corpus can be drawn from newspapers, novels, recipes, radio and TV programs, movies, tweets, and so on. In natural language processing, a corpus contains the text and speech data that can be used to train AI systems.
1.2 Tokens
For simplicity, suppose our corpus contains only three English sentences, all of which have already been preprocessed (lowercased, punctuation removed):
corpus = ["he is an old worker", "english is a useful tool", "the cinema is far away"]
We usually need to tokenize each sentence into a sequence of tokens; a simple split will do here:
def tokenize(corpus):
    return [sentence.split() for sentence in corpus]
tokens = tokenize(corpus)
print(tokens)
# [['he', 'is', 'an', 'old', 'worker'], ['english', 'is', 'a', 'useful', 'tool'], ['the', 'cinema', 'is', 'far', 'away']]
Here we tokenize at the word level; tokenization can also be done at the character level, as in the sketch below.
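For illustration, a minimal character-level tokenizer could look like this (just a sketch of the idea; the rest of this post sticks with word-level tokens):
def char_tokenize(corpus):
    # Split each sentence into individual characters (spaces included)
    # instead of splitting on whitespace.
    return [list(sentence) for sentence in corpus]
print(char_tokenize(corpus)[0][:10])
# ['h', 'e', ' ', 'i', 's', ' ', 'a', 'n', ' ', 'o']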
1.3 Vocabulary
The vocabulary contains every token that appears in the corpus, with no repetition, and it is very easy to build:
vocab = set(sum(tokens, []))
print(vocab)
# {'is', 'useful', 'an', 'old', 'far', 'the', 'away', 'a', 'he', 'tool', 'cinema', 'english', 'worker'}
The vocabulary itself is usually not what matters most in an NLP task; what we need is to assign a unique index to each token and build a token-to-index mapping, word2idx. Here we build word2idx in order of token frequency (most frequent first).
from collections import Counter
word2idx = {
    word: idx
    for idx, (word, freq) in enumerate(
        sorted(Counter(sum(tokens, [])).items(), key=lambda x: x[1], reverse=True))
}
print(word2idx)
# {'is': 0, 'he': 1, 'an': 2, 'old': 3, 'worker': 4, 'english': 5, 'a': 6, 'useful': 7, 'tool': 8, 'the': 9, 'cinema': 10, 'far': 11, 'away': 12}
Conversely, we can also build idx2word:
idx2word = {idx: word for word, idx in word2idx.items()}
print(idx2word)
# {0: 'is', 1: 'he', 2: 'an', 3: 'old', 4: 'worker', 5: 'english', 6: 'a', 7: 'useful', 8: 'tool', 9: 'the', 10: 'cinema', 11: 'far', 12: 'away'}
The tokens from Section 1.2 can likewise be converted into their index representation:
encoded_tokens = [[word2idx[token] for token in line] for line in tokens]
print(encoded_tokens)
# [[1, 0, 2, 3, 4], [5, 0, 6, 7, 8], [9, 10, 0, 11, 12]]
We will come back to this index representation when we discuss nn.Embedding.
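As a quick sanity check, the idx2word mapping built above lets us decode the indices back into tokens (a small sketch):
decoded_tokens = [[idx2word[idx] for idx in line] for line in encoded_tokens]
print(decoded_tokens[0])
# ['he', 'is', 'an', 'old', 'worker']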
2. nn.Embedding Basics
2.1 Why embedding?
An RNN cannot process words directly, so we need some way to turn each word into a numerical vector that can serve as the RNN's input. This mapping from words to vectors in a vector space is called word embedding, and the resulting vectors are called word vectors.
2.2 Basic parameters
Let's first look at the basic parameters of nn.Embedding; once its basic usage is clear, we will go through the full parameter list.
The basic signature is:
nn.Embedding(num_embeddings, embedding_dim)
Here num_embeddings is the size of the vocabulary, i.e. len(vocab), and embedding_dim is the dimension of the word vectors.
We reuse the example from Chapter 1 and embed each token into a 3-dimensional vector space. Note that the vocabulary built there actually contains 13 tokens, so num_embeddings should in general be len(vocab) = 13; the demo below creates a 12-row table, which still covers every index of the first sentence (indices 0 to 4), the only one fed in later:
torch.manual_seed(0) # For the sake of reproducibility
emb = nn.Embedding(12, 3)
The embedding layer has only one learnable parameter, weight, which is initialized from the standard normal distribution when the layer is created:
print(emb.weight)
# Parameter containing:
# tensor([[-1.1258, -1.1524, -0.2506],
# [-0.4339, 0.8487, 0.6920],
# [-0.3160, -2.1152, 0.3223],
# [-1.2633, 0.3500, 0.3081],
# [ 0.1198, 1.2377, 1.1168],
# [-0.2473, -1.3527, -1.6959],
# [ 0.5667, 0.7935, 0.4397],
# [ 0.1124, 0.6408, 0.4412],
# [-0.2159, -0.7425, 0.5627],
# [ 0.2596, 0.5229, 2.3022],
# [-1.4689, -1.5867, 1.2032],
# [ 0.0845, -1.2001, -0.0048]], requires_grad=True)
Here weight can be regarded as the weight matrix of the embedding layer.
Now let's look at the input of nn.Embedding. Intuitively, given a tokenized sentence, feeding its tokens into the embedding layer should return the corresponding word vectors. In fact, nn.Embedding does not take the tokenized sentence itself as input, but its index form, i.e. the encoded_tokens from Chapter 1.
nn.Embedding accepts an integer tensor of any shape as input, but since the values are interpreted as indices, every entry must not exceed num_embeddings - 1, otherwise an error is raised. nn.Embedding then acts like a lookup table: it uses these indices to look up the corresponding rows of weight and returns them as word vectors.
print(emb.weight)
# tensor([[-1.1258, -1.1524, -0.2506],
# [-0.4339, 0.8487, 0.6920],
# [-0.3160, -2.1152, 0.3223],
# [-1.2633, 0.3500, 0.3081],
# [ 0.1198, 1.2377, 1.1168],
# [-0.2473, -1.3527, -1.6959],
# [ 0.5667, 0.7935, 0.4397],
# [ 0.1124, 0.6408, 0.4412],
# [-0.2159, -0.7425, 0.5627],
# [ 0.2596, 0.5229, 2.3022],
# [-1.4689, -1.5867, 1.2032],
# [ 0.0845, -1.2001, -0.0048]], requires_grad=True)
sentence = torch.tensor(encoded_tokens[0]) # the corpus has three sentences; only the first is used here
print(sentence)
# tensor([1, 0, 2, 3, 4])
print(emb(sentence))
# tensor([[-0.4339, 0.8487, 0.6920],
# [-1.1258, -1.1524, -0.2506],
# [-0.3160, -2.1152, 0.3223],
# [-1.2633, 0.3500, 0.3081],
# [ 0.1198, 1.2377, 1.1168]], grad_fn=<EmbeddingBackward0>)
print(emb.weight[sentence] == emb(sentence))
# tensor([[True, True, True],
# [True, True, True],
# [True, True, True],
# [True, True, True],
# [True, True, True]])
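Since nn.Embedding accepts indices of any shape, we can also pass a whole batch at once; the output simply gains a trailing dimension of size embedding_dim. The sketch below batches the first two sentences (both fit within the 12-row demo table above) and then feeds the word vectors into an RNN, the typical next step from Section 2.1; the hidden size of 8 is an arbitrary choice for illustration:
batch = torch.tensor(encoded_tokens[:2])   # shape: (2, 5)
embedded = emb(batch)
print(embedded.shape)
# torch.Size([2, 5, 3])
rnn = nn.RNN(input_size=3, hidden_size=8, batch_first=True)  # expects (batch, seq_len, input_size)
output, h_n = rnn(embedded)
print(output.shape, h_n.shape)
# torch.Size([2, 5, 8]) torch.Size([1, 2, 8])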
2.3 The difference between nn.Embedding and nn.Linear
Careful readers may have noticed that nn.Embedding looks very similar to nn.Linear. What exactly is the difference between them?
Recall nn.Linear with bias disabled: let the input be a row vector $x$ and let $A$ be the matrix stored in nn.Linear.weight (of shape hidden_size × input_size). The computation is
$y = xA^{\text{T}}$
where $x$ and $y$ are both row vectors.
If $x$ is a one-hot vector whose $i$-th position is $1$, then $y$ is exactly the $i$-th row of $A^{\text{T}}$.
Now take a word $w$ whose index in word2idx is $i$. In nn.Embedding, we use this index $i$ to fetch the $i$-th row of emb.weight directly. In nn.Linear, we would instead encode the index $i$ as a one-hot vector and multiply it by the weight matrix, which again picks out the $i$-th row of $A^{\text{T}}$.
Consider the following example:
torch.manual_seed(0)
vocab_size = 4 # The size of the vocabulary is 4
embedding_dim = 3 # The dimension of the word vector is 3
weight = torch.randn(4, 3) # Randomly initialize the weight matrix
# Give the linear layer and the embedding layer the same weights
linear_layer = nn.Linear(4, 3, bias=False)
linear_layer.weight.data = weight.T # note the transpose: nn.Linear stores weight as (out_features, in_features)
emb_layer = nn.Embedding(4, 3)
emb_layer.weight.data = weight
idx = torch.tensor(2) # suppose the word's index in word2idx is 2
word = torch.tensor([0, 0, 1, 0]).to(torch.float) # the one-hot representation of that word
print(emb_layer(idx))
# tensor([ 0.4033, 0.8380, -0.7193], grad_fn=<EmbeddingBackward0>)
print(linear_layer(word))
# tensor([ 0.4033, 0.8380, -0.7193], grad_fn=<SqueezeBackward3>)
From this we can conclude:
nn.Linear takes a (dense) vector as input, while nn.Embedding takes a discrete index as input; nn.Embedding is essentially a bias-free nn.Linear whose input is a one-hot vector.
Moreover, nn.Linear performs a matrix multiplication during its forward pass, whereas nn.Embedding simply looks up a row of its weight by index, so in this scenario nn.Embedding is clearly more efficient.
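We can verify this equivalence for every index at once with torch.nn.functional.one_hot (a quick sketch continuing the example above):
import torch.nn.functional as F
all_indices = torch.arange(vocab_size)   # tensor([0, 1, 2, 3])
one_hot = F.one_hot(all_indices, num_classes=vocab_size).to(torch.float)
print(torch.allclose(emb_layer(all_indices), linear_layer(one_hot)))
# True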
2.4 How nn.Embedding gets updated
After browsing some posts on the official PyTorch forum and on Stack Overflow, I found that many people are confused about how the weight of nn.Embedding is updated.
The weight of nn.Embedding is itself the word embedding.
In fact, nn.Embedding.weight is updated with neither Skip-gram nor CBOW. Recall the simplest multilayer perceptron, in which nn.Linear.weight is updated automatically by backpropagation. Once we view nn.Embedding as a special kind of nn.Linear, its update mechanism is easy to understand: the weight is simply updated according to its gradient.
After training, the word embedding we obtain is the one best suited to the current task, rather than a general-purpose embedding such as word2vec or GloVe.
Of course, we can also initialize with a pretrained word embedding (e.g. the word2vec vectors mentioned above) before training, but then we should consider retraining or fine-tuning it for the current task.
If we use a pretrained word embedding and do not want it to be updated during training, we can freeze its gradient:
emb.weight.requires_grad = False
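To make this concrete, here is a minimal sketch showing that emb.weight is updated by plain gradient descent and stops changing once requires_grad is set to False (the loss used here is a dummy one, made up purely to produce gradients):
torch.manual_seed(0)
emb = nn.Embedding(12, 3)
optimizer = torch.optim.SGD(emb.parameters(), lr=0.1)
before = emb.weight.data.clone()
loss = emb(torch.tensor([1, 0, 2, 3, 4])).pow(2).mean()   # dummy loss, only to produce gradients
loss.backward()
optimizer.step()
print(torch.equal(before, emb.weight.data))
# False  (the looked-up rows have been updated by gradient descent)
emb.weight.requires_grad = False   # freeze: no gradient flows into emb.weight from now on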
3. nn.Embedding Advanced
In this chapter, we will go through all the parameters of nn.Embedding and describe how to use pretrained word embeddings.
3.1 All parameters
The complete parameter list is given in the official documentation of torch.nn.Embedding; the remaining parameters are explained one by one below.
padding_idx
As mentioned, nn.Embedding accepts tensors of any shape as input, but most of the time the input has shape batch_size × sequence_length, which requires every sequence in the batch to have the same length.
Recall the example in Section 1.2: the three sentences in that corpus happen to have the same length (the same number of tokens), but they were deliberately chosen that way. In real tasks it is hard to guarantee that all sentences in a batch are of equal length, so the shorter ones have to be padded. Since the input to nn.Embedding consists of indices, the padding must also be an index; which index should we use?
Suppose the corpus is now the following (with tokens, word2idx and encoded_tokens rebuilt using the code from Chapter 1):
corpus = ["he is an old worker", "time tries truth", "better late than never"]
print(word2idx)
# {'he': 0, 'is': 1, 'an': 2, 'old': 3, 'worker': 4, 'time': 5, 'tries': 6, 'truth': 7, 'better': 8, 'late': 9, 'than': 10, 'never': 11}
print(encoded_tokens)
# [[0, 1, 2, 3, 4], [5, 6, 7], [8, 9, 10, 11]]
We can add a new token <pad> (the padding token) to word2idx and give it a new index:
word2idx['<pad>'] = 12
Then pad encoded_tokens to the maximum sentence length:
max_length = max([len(seq) for seq in encoded_tokens])
for i in range(len(encoded_tokens)):
    encoded_tokens[i] += [word2idx['<pad>']] * (max_length - len(encoded_tokens[i]))
print(encoded_tokens)
# [[0, 1, 2, 3, 4], [5, 6, 7, 12, 12], [8, 9, 10, 11, 12]]
Create the embedding layer and specify padding_idx:
emb = nn.Embedding(len(word2idx), 3, padding_idx=12) # Suppose the word vector dimension is 3
print(emb.weight)
# tensor([[ 1.5017, -1.1737, 0.1742],
# [-0.9511, -0.4172, 1.5996],
# [ 0.6306, 1.4186, 1.3872],
# [-0.1833, 1.4485, -0.3515],
# [ 0.2474, -0.8514, -0.2448],
# [ 0.4386, 1.3905, 0.0328],
# [-0.1215, 0.5504, 0.1499],
# [ 0.5954, -1.0845, 1.9494],
# [ 0.0668, 1.1366, -0.3414],
# [-0.0260, -0.1091, 0.4937],
# [ 0.4947, 1.1701, -0.5660],
# [ 1.1717, -0.3970, -1.4958],
# [ 0.0000, 0.0000, 0.0000]], requires_grad=True)
As you can see, the word vector of the padding token is the zero vector, and it will not be updated during training (it stays a zero vector throughout).
padding_idx defaults to None, i.e. no padding.
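A small sketch (continuing the layer just created) confirms that the padding row never receives any gradient:
out = emb(torch.tensor(encoded_tokens))   # shape: (3, 5, 3)
out.sum().backward()
print(emb.weight.grad[word2idx['<pad>']])
# tensor([0., 0., 0.])   # the padding row gets zero gradient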
max_norm
If the norm of a word vector exceeds max_norm, it is renormalized so that its norm equals max_norm:
$w := \text{max\_norm} \cdot \dfrac{w}{\Vert w \Vert}$
max_norm defaults to None, i.e. no renormalization.
norm_type
When max_norm is specified, norm_type determines which p-norm is used in the computation. It defaults to the 2-norm.
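A minimal sketch of the renormalization behaviour (the vocabulary size of 10 and max_norm of 1.0 are arbitrary choices here):
emb = nn.Embedding(10, 3, max_norm=1.0)   # uses the 2-norm by default (norm_type=2.0)
vectors = emb(torch.arange(10))           # the renormalization happens during the forward pass
print((vectors.norm(dim=1) <= 1.0 + 1e-6).all())
# tensor(True)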
scale_grad_by_freq
If this parameter is set to True, then when the word vector $w$ is updated, its gradient is scaled by the inverse of the word's frequency within the mini-batch:
$\dfrac{\partial \text{Loss}}{\partial w} := \dfrac{1}{\text{frequency}(w)} \cdot \dfrac{\partial \text{Loss}}{\partial w}$
It defaults to False.
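A small sketch of the effect (index 0 appears twice in the toy batch below, so its gradient should be halved when the flag is on; the sizes are arbitrary):
for flag in (False, True):
    emb = nn.Embedding(5, 3, scale_grad_by_freq=flag)
    emb(torch.tensor([0, 0, 1])).sum().backward()
    print(flag, emb.weight.grad[0])
# False tensor([2., 2., 2.])
# True tensor([1., 1., 1.])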
sparse
If set to True, the gradient with respect to Embedding.weight becomes a sparse tensor, and only a few optimizers support it: optim.SGD, optim.SparseAdam and optim.Adagrad. It defaults to False.
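A quick sketch (arbitrary sizes again) of what the sparse flag changes:
emb = nn.Embedding(10, 3, sparse=True)
emb(torch.tensor([1, 2, 2])).sum().backward()
print(emb.weight.grad.is_sparse)
# True
print(emb.weight.grad.to_dense()[2])
# tensor([2., 2., 2.])   # only the looked-up rows carry non-zero gradient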
3.2 Using pretrained word embeddings
In some cases we need to use a pretrained word embedding; this can be done with the from_pretrained method, as follows:
torch.manual_seed(0)
pretrained_embeddings = torch.randn(4, 3)
print(pretrained_embeddings)
# tensor([[ 1.5410, -0.2934, -2.1788],
# [ 0.5684, -1.0845, -1.3986],
# [ 0.4033, 0.8380, -0.7193],
# [-0.4033, -0.5966, 0.1820]])
emb = nn.Embedding.from_pretrained(pretrained_embeddings)
print(emb.weight)
# tensor([[ 1.5410, -0.2934, -2.1788],
# [ 0.5684, -1.0845, -1.3986],
# [ 0.4033, 0.8380, -0.7193],
# [-0.4033, -0.5966, 0.1820]])
If you want to prevent the pretrained embedding from being updated during subsequent training, set the freeze parameter to True (which is in fact its default value):
emb = nn.Embedding.from_pretrained(pretrained_embeddings, freeze=True)
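In practice the pretrained matrix usually comes from a file of vectors (word2vec/GloVe style: one word followed by its numbers per line) that has to be aligned with our own word2idx. The sketch below shows one possible way to do this; the file name vectors.txt is hypothetical, and out-of-vocabulary words simply keep a random vector:
def load_pretrained(path, word2idx, embedding_dim):
    # Start from random vectors; rows of words found in the file get overwritten.
    weight = torch.randn(len(word2idx), embedding_dim)
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split()
            word, values = parts[0], parts[1:]
            if word in word2idx and len(values) == embedding_dim:
                weight[word2idx[word]] = torch.tensor([float(v) for v in values])
    return weight
# pretrained = load_pretrained('vectors.txt', word2idx, 3)   # 'vectors.txt' is a hypothetical file
# emb = nn.Embedding.from_pretrained(pretrained, freeze=True)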
4. Closing Remarks
My understanding of nn.Embedding may still be incomplete; if you spot any mistakes, please point them out in the comments.
If this article helped you, consider following, liking, bookmarking or leaving a comment; your support is my biggest motivation to keep writing.