
One hot and embedding

2022-06-22 08:23:00 Alex_ 81D

In the NLP field, word embedding has become a well-known technique, and in practice it has an extremely wide range of applications: voice assistants, machine translation, sentiment analysis, and more. Because of its generality, word embedding appears in almost every NLP application. This post starts from the conventional one-hot encoding, explains its advantages and disadvantages, and then moves on to the word embedding technique and its benefits.

Humans can easily understand a word, a phrase, or a string of letters such as "LOVE", but machines cannot. To make a machine understand a word, you have to turn it into a string of numbers (a vector). One-Hot Encoding and Word Embedding, discussed below, are two ways of turning words into vectors.

I. One-Hot Encoding (vocabulary -> sparse vector)

1. What is one-hot encoding, and why use it?

Generally speaking, machine learning tutorials recommend or require that you prepare your data in a specific way before fitting a model; a good example is one-hot encoding categorical data.

So what is categorical data? A categorical variable takes label values rather than numeric values, and those values usually come from a fixed, finite set. Categorical variables are also often referred to as nominal variables.
Here are some examples:

  • The pet variable takes the values: dog, cat.
  • The color variable takes the values: red, green, blue.
  • The place variable takes the values: first, second, third.

Each value in the examples above represents a different category. Some categories have a natural relationship with one another, for example a natural ordering. In the examples above, the values of the place variable have such an ordering; variables like this are called ordinal variables.

2. What is the problem with categorical data?

Some algorithms can be applied directly to categorical data. For example, a decision tree can be applied to categorical data without any conversion (depending on the implementation). However, many machine learning algorithms cannot operate on label data directly; they require all input and output variables to be numeric. Generally speaking, this restriction comes from the efficient implementation of these algorithms rather than from the algorithms themselves.
This means we need to convert categorical data into numerical form. If the output variable is categorical, you may also have to convert the model's predictions back into category form so that they can be displayed or used in some applications.
How can categorical data be converted to numeric data?

There are two methods: 1. integer encoding, and 2. one-hot encoding.

a. Integer encoding: first, assign an integer value to each category value.
For example, use 1 for red, 2 for green, and 3 for blue. This is called label encoding or integer encoding, and it can easily be reversed back to the category values. Integers have a natural ordering, and machine learning algorithms may understand and exploit that ordering. The ordinal place variable from the earlier example is a good case: for it, label encoding alone is enough. A minimal sketch follows below.
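A minimal sketch of integer (label) encoding in plain Python; the particular mapping from place values to integers is just an assumed convention.

places = ["first", "second", "third", "first", "third"]

# Assign an integer to each category value (an arbitrary but consistent mapping)
place_to_int = {"first": 1, "second": 2, "third": 3}
encoded = [place_to_int[p] for p in places]
print(encoded)  # [1, 2, 3, 1, 3]

# The encoding is easily reversed back to the category values
int_to_place = {v: k for k, v in place_to_int.items()}
print([int_to_place[i] for i in encoded])  # ['first', 'second', 'third', 'first', 'third']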

b. One-hot encoding

For categorical variables without an order relationship, integer encoding is not enough. In fact, integer encoding makes the model assume a natural ordering between categories, which can lead to poor or unexpected results (for example, predictions falling between two categories). In this case, one-hot encoding should be applied on top of the integer representation: it drops the single integer and instead creates one binary variable per integer value.

To put it more generally:

One-hot encoding discrete features makes the distances between feature values more reasonable. For example, take a discrete feature representing job type with three possible values. Without one-hot encoding they would be represented as x_1 = (1), x_2 = (2), x_3 = (3), and the distances between jobs would be d(x_1, x_2) = 1, d(x_2, x_3) = 1, d(x_1, x_3) = 2. Does that mean x_1 and x_3 are somehow more different as jobs? Clearly these distances between features are unreasonable. With one-hot encoding we instead get x_1 = (1, 0, 0), x_2 = (0, 1, 0), x_3 = (0, 0, 1), and the distance between any two jobs is sqrt(2). Every pair of jobs is equally far apart, which is more reasonable. (A small numeric check follows below.)
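A small numpy check of the distances above, assuming nothing beyond the three job-type vectors from the example:

import numpy as np

# Integer encoding: distances depend on the arbitrary integer assignment
x1, x2, x3 = np.array([1.0]), np.array([2.0]), np.array([3.0])
print(np.linalg.norm(x1 - x2), np.linalg.norm(x2 - x3), np.linalg.norm(x1 - x3))  # 1.0 1.0 2.0

# One-hot encoding: every pair of categories is equally far apart
h1, h2, h3 = np.eye(3)
print(np.linalg.norm(h1 - h2), np.linalg.norm(h2 - h3), np.linalg.norm(h1 - h3))  # all sqrt(2) ≈ 1.414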

In the color example there are 3 categories, so 3 binary variables are needed for the encoding. The position corresponding to the color is marked "1" and the other positions are marked "0", as in the table below; a tiny code sketch of the same mapping follows the table.

red, green, blue
1,   0,     0
0,   1,     0
0,   0,     1
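A tiny sketch of the same mapping in numpy, with the three colors hard-coded in the order used by the table above:

import numpy as np

categories = ["red", "green", "blue"]
index = {c: i for i, c in enumerate(categories)}

def one_hot(value):
    # Build a vector of zeros and set the position of the given color to 1
    vec = np.zeros(len(categories))
    vec[index[value]] = 1.0
    return vec

print(one_hot("red"))    # [1. 0. 0.]
print(one_hot("green"))  # [0. 1. 0.]
print(one_hot("blue"))   # [0. 0. 1.]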

II. Word Embedding: letting the computer learn to love (the relationships between words)

When deep learning is applied to natural language processing, one-hot encoded vectors are almost always converted into word vectors. Why?

One reason is that deep learning does not handle sparse inputs well.

The second and more important reason is that one-hot encoding cannot express any relationship between different words (or Chinese terms): for any two distinct words, the similarity of their one-hot vectors is always 0, i.e. cos(V_i, V_j) = 0. So the question becomes: how do we express the internal relationships between words? This is where embedding comes in.
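To make this concrete, here is a small sketch: the cosine similarity between one-hot vectors of different words is always 0, while dense embedding vectors (the numbers below are made up purely for illustration) can express a degree of similarity.

import numpy as np

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# One-hot vectors of two different words: orthogonal, similarity is always 0
v_cat = np.array([1.0, 0.0, 0.0])
v_dog = np.array([0.0, 1.0, 0.0])
print(cos(v_cat, v_dog))  # 0.0

# Made-up dense embeddings: "cat" and "dog" end up close, "car" ends up far away
e_cat = np.array([0.8, 0.1, 0.3])
e_dog = np.array([0.7, 0.2, 0.4])
e_car = np.array([-0.5, 0.9, -0.2])
print(cos(e_cat, e_dog), cos(e_cat, e_car))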

 

To understand the advantages of embedding, compare it against one-hot encoding. One-hot encoding is one of the most common representations of discrete data: first count the total number N of discrete or categorical values to be represented, then represent each category with a vector composed of N-1 zeros and a single 1. This has two obvious drawbacks:

  • For categorical variables with very many distinct values, the resulting vectors are too high-dimensional and too sparse.
  • The vectors are completely independent of one another and express no relationship between different categories.

Given these two problems, the ideal way to represent categorical variables would use fewer dimensions per category while still capturing the relationships between different category values, and that is exactly why embedding emerged.

Embedding is a way of converting discrete variables into continuous vector representations. In neural networks, embeddings are very useful because they not only reduce the dimensionality of discrete variables but also represent them meaningfully.
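As a sketch, this is roughly what an embedding layer looks like in Keras (tf.keras.layers.Embedding is the standard layer; the vocabulary size and dimension here are arbitrary):

import tensorflow as tf

vocab_size, embed_dim = 1000, 8
embedding = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)

# A batch with one "sentence" of 4 word indices
word_ids = tf.constant([[3, 17, 256, 999]])
vectors = embedding(word_ids)
print(vectors.shape)  # (1, 4, 8): every index is mapped to a learnable 8-dim vector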

To sum up, embeddings have 3 main uses:

  • Finding nearest neighbours in the embedding space, which works well for making recommendations based on a user's interests (see the sketch after this list).
  • Serving as input to a supervised learning task.
  • Visualizing the relationships between different discrete variables.
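As a sketch of the first use, nearest neighbours in an embedding space can be found with a simple cosine-similarity search; the item names and vectors below are entirely made up.

import numpy as np

items = ["book_a", "book_b", "book_c", "book_d"]
embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.9, 0.4],
    [0.1, 0.8, 0.5],
])

def nearest(query_idx, k=2):
    # Cosine similarity between the query embedding and every other embedding
    q = embeddings[query_idx]
    sims = embeddings @ q / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)
    return [items[i] for i in order if i != query_idx][:k]

print(nearest(0))  # the items most similar to "book_a": ['book_b', 'book_d']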

For example, with a neural network embedding of the books on Wikipedia, each of the 37,000 books can be represented by a vector containing only 50 numbers. Moreover, because the embedding is learnable, similar books move closer to each other in the embedding space as training progresses.
The biggest problem with one-hot encoding is that its transformation does not capture any internal relationships. With a network trained on a supervised task, we can improve our embedding representation by optimizing the network's parameters and weights to reduce the loss; the smaller the loss, the closer the representations of related categories become in the final vectors.
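A hedged sketch of this idea in Keras, assuming a toy classification task with random data: the first layer is an Embedding, and minimizing the supervised loss is what shapes its weights.

import numpy as np
import tensorflow as tf

vocab_size, embed_dim, num_classes = 100, 8, 3
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Random, purely illustrative data: 200 "sentences" of 5 word indices with fake labels
x = np.random.randint(0, vocab_size, size=(200, 5))
y = np.random.randint(0, num_classes, size=(200,))
model.fit(x, y, epochs=2, verbose=0)

# The trained embedding matrix, one row per category/word
learned = model.layers[0].get_weights()[0]
print(learned.shape)  # (100, 8)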

By visualizing the embedding, we can see the similarity between words.

The figure below (not reproduced here) shows the embedding result I trained with TensorFlow on people's names.

III. Summary

Finally, a summary. There are two main vector representations of natural language, one-hot encoding and word embedding, with the following trade-offs:

  • One-hot encoding: simple to construct and needs no training, but the vectors are high-dimensional and sparse, and they express no relationship between different words (the cosine similarity of any two words is 0).
  • Word embedding: each word becomes a low-dimensional dense vector; the representation is learnable, so related words end up close together in the embedding space, but it has to be trained.

IV. Applications

The basics of embedding were covered above, but what I want to stress is that its value is not limited to word embedding or entity embedding: the idea of representing categorical data with low-dimensional, self-learned vectors is the more valuable part. With it, neural networks and deep learning can be applied to a much wider range of fields. Embedding can represent many things; the key is to figure out, for the problem and application at hand, what we need the embedding to capture.

Speaking of embedding, word2vec has to be mentioned: training word2vec is essentially the process of learning an embedding, and the same idea extends to item2vec. A few quick notes on word2vec:

1. It can be viewed as a multi-class classification task

2. It uses a shallow neural network

3. Its training data is constructed in CBOW or skip-gram mode (a small sketch of skip-gram pair construction follows)
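A rough sketch of how skip-gram style (center, context) training pairs can be built from a tiny made-up sentence; CBOW would instead predict the center word from its surrounding context.

sentence = ["i", "love", "natural", "language", "processing"]
window = 2

pairs = []
for i, center in enumerate(sentence):
    # Every word within the window around the center becomes a context word
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

print(pairs[:5])
# [('i', 'love'), ('i', 'natural'), ('love', 'i'), ('love', 'natural'), ('love', 'language')]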

The end!

Below is a small piece of TensorFlow code simulating an embedding lookup, kept here as a record.

import tensorflow as tf
import numpy as np

# Running graph-style code on TensorFlow 2.x: disable eager execution so that
# tf.compat.v1.Session() can evaluate the tensors below
tf.compat.v1.disable_eager_execution()

# A mock embedding table: a vocabulary of 30 "words", each mapped to a 4-dim vector
embedding_dict = tf.convert_to_tensor(np.random.random(size=(30, 4)))
# Indices of the input words
input_train = tf.convert_to_tensor([1, 2, 4, 5])

# Look up the embedding weights for the input word indices
embed = tf.nn.embedding_lookup(embedding_dict, input_train)

# On newer TensorFlow versions the session is opened through the compat.v1 API
sess = tf.compat.v1.Session()
sess.run(tf.compat.v1.global_variables_initializer())

print(sess.run(embedding_dict))
print("====================================")
print(sess.run(embed))

sess.close()
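For reference, a fresh TensorFlow 2.x program can do the same lookup eagerly, with no session at all (assuming eager execution has not been disabled):

import numpy as np
import tensorflow as tf

embedding_dict = tf.convert_to_tensor(np.random.random(size=(30, 4)))
embed = tf.nn.embedding_lookup(embedding_dict, [1, 2, 4, 5])
print(embed.numpy())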