
One hot and embedding

2022-06-22 08:23:00 Alex_ 81D

In the NLP field, word embedding has become a well-known technique, and in practice it has an extremely wide range of applications: voice assistants, machine translation, sentiment analysis, and more. Because of its generality, word embedding appears in almost every NLP application. This post starts from the conventional one-hot encoding, explains its advantages and disadvantages, and then moves on to the word embedding technique and its benefits.

Humans can easily understand a word, a phrase, or a string of letters such as "LOVE", but machines cannot. To make a machine understand a word, you have to turn it into a string of numbers (a vector). One-Hot Encoding and Word Embedding, discussed below, are two ways of turning words into vectors.

I. One-Hot Encoding (vocabulary -> sparse vector)

1. What is one-hot encoding, and why use it?

Generally speaking, machine learning tutorials recommend or require that you prepare your data in a specific way before fitting a model; a good example is one-hot encoding categorical data.

So what is categorical data? A categorical variable takes label values rather than numeric values, and those values usually come from a fixed, finite set. Categorical variables are also often referred to as nominal variables.
Here are some examples:

  • The pet variable takes the values: dog, cat.
  • The color variable takes the values: red, green, blue.
  • The place variable takes the values: first, second, third.

Each value in the examples above represents a different category. Some categories have a natural relationship with one another, for example a natural ordering. In the examples above, the values of the place variable have such an ordering; variables like this are called ordinal variables.

2. What is the problem with categorical data?

Some algorithms can be applied directly to categorical data. For example, a decision tree can be applied to categorical data without any conversion (depending on the implementation). However, many machine learning algorithms cannot operate on label data directly; they require all input and output variables to be numeric. Generally speaking, this restriction comes from the efficient implementation of these algorithms rather than from the algorithms themselves.
This means we need to convert categorical data into numerical form. If the output variable is categorical, you may also have to convert the model's predictions back into category form so that they can be displayed or used in some applications.
How can categorical data be converted to numeric data?

There are two methods: 1. integer encoding, and 2. one-hot encoding.

a. Integer encoding: first, assign an integer value to each category value.
For example, use 1 for red, 2 for green, and 3 for blue. This is called label encoding or integer encoding, and it can easily be reversed back to the category values. Integers have a natural ordering, and machine learning algorithms may understand and exploit that ordering. The ordinal place variable from the earlier example is a good case: for it, label encoding alone is enough. A minimal sketch follows below.
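A minimal sketch of integer (label) encoding in plain Python; the particular mapping from place values to integers is just an assumed convention.

places = ["first", "second", "third", "first", "third"]

# Assign an integer to each category value (an arbitrary but consistent mapping)
place_to_int = {"first": 1, "second": 2, "third": 3}
encoded = [place_to_int[p] for p in places]
print(encoded)  # [1, 2, 3, 1, 3]

# The encoding is easily reversed back to the category values
int_to_place = {v: k for k, v in place_to_int.items()}
print([int_to_place[i] for i in encoded])  # ['first', 'second', 'third', 'first', 'third']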

b. One-hot encoding

For categorical variables without an order relationship, integer encoding is not enough. In fact, integer encoding makes the model assume a natural ordering between categories, which can lead to poor or unexpected results (for example, predictions falling between two categories). In this case, one-hot encoding should be applied on top of the integer representation: it drops the single integer and instead creates one binary variable per integer value.

To put it more generally:

One-hot encoding discrete features makes the distances between feature values more reasonable. For example, take a discrete feature representing job type with three possible values. Without one-hot encoding they would be represented as x_1 = (1), x_2 = (2), x_3 = (3), and the distances between jobs would be d(x_1, x_2) = 1, d(x_2, x_3) = 1, d(x_1, x_3) = 2. Does that mean x_1 and x_3 are somehow more different as jobs? Clearly these distances between features are unreasonable. With one-hot encoding we instead get x_1 = (1, 0, 0), x_2 = (0, 1, 0), x_3 = (0, 0, 1), and the distance between any two jobs is sqrt(2). Every pair of jobs is equally far apart, which is more reasonable. (A small numeric check follows below.)
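A small numpy check of the distances above, assuming nothing beyond the three job-type vectors from the example:

import numpy as np

# Integer encoding: distances depend on the arbitrary integer assignment
x1, x2, x3 = np.array([1.0]), np.array([2.0]), np.array([3.0])
print(np.linalg.norm(x1 - x2), np.linalg.norm(x2 - x3), np.linalg.norm(x1 - x3))  # 1.0 1.0 2.0

# One-hot encoding: every pair of categories is equally far apart
h1, h2, h3 = np.eye(3)
print(np.linalg.norm(h1 - h2), np.linalg.norm(h2 - h3), np.linalg.norm(h1 - h3))  # all sqrt(2) ≈ 1.414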

In the color example there are 3 categories, so 3 binary variables are needed for the encoding. The position corresponding to the color is marked "1" and the other positions are marked "0", as in the table below; a tiny code sketch of the same mapping follows the table.

red, green, blue
1,   0,     0
0,   1,     0
0,   0,     1
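A tiny sketch of the same mapping in numpy, with the three colors hard-coded in the order used by the table above:

import numpy as np

categories = ["red", "green", "blue"]
index = {c: i for i, c in enumerate(categories)}

def one_hot(value):
    # Build a vector of zeros and set the position of the given color to 1
    vec = np.zeros(len(categories))
    vec[index[value]] = 1.0
    return vec

print(one_hot("red"))    # [1. 0. 0.]
print(one_hot("green"))  # [0. 1. 0.]
print(one_hot("blue"))   # [0. 0. 1.]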

II. Word Embedding: letting the computer learn to love (the relationships between words)

When deep learning is applied to natural language processing, one-hot encoded vectors are almost always converted into word vectors. Why?

One reason is that deep learning does not handle sparse inputs well.

The second and more important reason is that one-hot encoding cannot express any relationship between different words (or Chinese terms): for any two distinct words, the similarity of their one-hot vectors is always 0, i.e. cos(V_i, V_j) = 0. So the question becomes: how do we express the internal relationships between words? This is where embedding comes in.
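To make this concrete, here is a small sketch: the cosine similarity between one-hot vectors of different words is always 0, while dense embedding vectors (the numbers below are made up purely for illustration) can express a degree of similarity.

import numpy as np

def cos(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# One-hot vectors of two different words: orthogonal, similarity is always 0
v_cat = np.array([1.0, 0.0, 0.0])
v_dog = np.array([0.0, 1.0, 0.0])
print(cos(v_cat, v_dog))  # 0.0

# Made-up dense embeddings: "cat" and "dog" end up close, "car" ends up far away
e_cat = np.array([0.8, 0.1, 0.3])
e_dog = np.array([0.7, 0.2, 0.4])
e_car = np.array([-0.5, 0.9, -0.2])
print(cos(e_cat, e_dog), cos(e_cat, e_car))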

 

To understand the advantages of embedding, compare it against one-hot encoding. One-hot encoding is one of the most common representations of discrete data: first count the total number N of discrete or categorical values to be represented, then represent each category with a vector composed of N-1 zeros and a single 1. This has two obvious drawbacks:

  • For categorical variables with very many distinct values, the resulting vectors are too high-dimensional and too sparse.
  • The vectors are completely independent of one another and express no relationship between different categories.

Given these two problems, the ideal way to represent categorical variables would use fewer dimensions per category while still capturing the relationships between different category values, and that is exactly why embedding emerged.

Embedding is a way of converting discrete variables into continuous vector representations. In neural networks, embeddings are very useful because they not only reduce the dimensionality of discrete variables but also represent them meaningfully.
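As a sketch, this is roughly what an embedding layer looks like in Keras (tf.keras.layers.Embedding is the standard layer; the vocabulary size and dimension here are arbitrary):

import tensorflow as tf

vocab_size, embed_dim = 1000, 8
embedding = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)

# A batch with one "sentence" of 4 word indices
word_ids = tf.constant([[3, 17, 256, 999]])
vectors = embedding(word_ids)
print(vectors.shape)  # (1, 4, 8): every index is mapped to a learnable 8-dim vector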

To sum up, embeddings have 3 main uses:

  • Finding nearest neighbours in the embedding space, which works well for making recommendations based on a user's interests (see the sketch after this list).
  • Serving as input to a supervised learning task.
  • Visualizing the relationships between different discrete variables.
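As a sketch of the first use, nearest neighbours in an embedding space can be found with a simple cosine-similarity search; the item names and vectors below are entirely made up.

import numpy as np

items = ["book_a", "book_b", "book_c", "book_d"]
embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.9, 0.4],
    [0.1, 0.8, 0.5],
])

def nearest(query_idx, k=2):
    # Cosine similarity between the query embedding and every other embedding
    q = embeddings[query_idx]
    sims = embeddings @ q / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)
    return [items[i] for i in order if i != query_idx][:k]

print(nearest(0))  # the items most similar to "book_a": ['book_b', 'book_d']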

For example, with a neural network embedding of the books on Wikipedia, each of the 37,000 books can be represented by a vector containing only 50 numbers. Moreover, because the embedding is learnable, similar books move closer to each other in the embedding space as training progresses.
The biggest problem with one-hot encoding is that its transformation does not capture any internal relationships. With a network trained on a supervised task, we can improve our embedding representation by optimizing the network's parameters and weights to reduce the loss; the smaller the loss, the closer the representations of related categories become in the final vectors.
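A hedged sketch of this idea in Keras, assuming a toy classification task with random data: the first layer is an Embedding, and minimizing the supervised loss is what shapes its weights.

import numpy as np
import tensorflow as tf

vocab_size, embed_dim, num_classes = 100, 8, 3
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embed_dim),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Random, purely illustrative data: 200 "sentences" of 5 word indices with fake labels
x = np.random.randint(0, vocab_size, size=(200, 5))
y = np.random.randint(0, num_classes, size=(200,))
model.fit(x, y, epochs=2, verbose=0)

# The trained embedding matrix, one row per category/word
learned = model.layers[0].get_weights()[0]
print(learned.shape)  # (100, 8)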

By visualizing the embedding, we can see the similarity between words.

The figure below (not reproduced here) shows the embedding result I trained with TensorFlow on people's names.

III. Summary

Finally, a summary. There are two main vector representations of natural language, one-hot encoding and word embedding, with the following trade-offs:

  • One-hot encoding: simple to construct and needs no training, but the vectors are high-dimensional and sparse, and they express no relationship between different words (the cosine similarity of any two words is 0).
  • Word embedding: each word becomes a low-dimensional dense vector; the representation is learnable, so related words end up close together in the embedding space, but it has to be trained.

IV. Applications

The basics of embedding were covered above, but what I want to stress is that its value is not limited to word embedding or entity embedding: the idea of representing categorical data with low-dimensional, self-learned vectors is the more valuable part. With it, neural networks and deep learning can be applied to a much wider range of fields. Embedding can represent many things; the key is to figure out, for the problem and application at hand, what we need the embedding to capture.

Speaking of embedding, word2vec has to be mentioned: training word2vec is essentially the process of learning an embedding, and the same idea extends to item2vec. A few quick notes on word2vec:

1. It can be viewed as a multi-class classification task

2. It uses a shallow neural network

3. Its training data is constructed in CBOW or skip-gram mode (a small sketch of skip-gram pair construction follows)
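A rough sketch of how skip-gram style (center, context) training pairs can be built from a tiny made-up sentence; CBOW would instead predict the center word from its surrounding context.

sentence = ["i", "love", "natural", "language", "processing"]
window = 2

pairs = []
for i, center in enumerate(sentence):
    # Every word within the window around the center becomes a context word
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

print(pairs[:5])
# [('i', 'love'), ('i', 'natural'), ('love', 'i'), ('love', 'natural'), ('love', 'language')]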

The end!

Below is a small piece of TensorFlow code simulating an embedding lookup, kept here as a record.

import tensorflow as tf
import numpy as np

# Running graph-style code on TensorFlow 2.x: disable eager execution so that
# tf.compat.v1.Session() can evaluate the tensors below
tf.compat.v1.disable_eager_execution()

# A mock embedding table: a vocabulary of 30 "words", each mapped to a 4-dim vector
embedding_dict = tf.convert_to_tensor(np.random.random(size=(30, 4)))
# Indices of the input words
input_train = tf.convert_to_tensor([1, 2, 4, 5])

# Look up the embedding weights for the input word indices
embed = tf.nn.embedding_lookup(embedding_dict, input_train)

# On newer TensorFlow versions the session is opened through the compat.v1 API
sess = tf.compat.v1.Session()
sess.run(tf.compat.v1.global_variables_initializer())

print(sess.run(embedding_dict))
print("====================================")
print(sess.run(embed))

sess.close()
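For reference, a fresh TensorFlow 2.x program can do the same lookup eagerly, with no session at all (assuming eager execution has not been disabled):

import numpy as np
import tensorflow as tf

embedding_dict = tf.convert_to_tensor(np.random.random(size=(30, 4)))
embed = tf.nn.embedding_lookup(embedding_dict, [1, 2, 4, 5])
print(embed.numpy())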