One-Hot and Embedding
2022-06-22 08:23:00 【Alex_ 81D】
In the NLP field, word embedding has become a household-name technique, and in practice it has an extremely wide range of applications: voice assistants, machine translation, sentiment analysis, and so on. Because of how general word embedding is, it covers almost every NLP application. This article starts from conventional one-hot encoding, explains its strengths and weaknesses, and then extends to the word embedding technique and its advantages.
A human can easily understand a word, a phrase, or a string of letters such as 「LOVE」, but a machine cannot. To make a machine understand words, they must first be turned into strings of numbers (vectors). One-Hot Encoding and Word Embedding, described below, are two ways of turning words into vectors.
I. One-Hot Encoding (word -> sparse vector)
1. First, what is one-hot, and why one-hot?
Generally speaking, machine learning tutorials recommend (or require) that you prepare your data in a specific way before fitting a model; a classic example is one-hot encoding categorical data.
So what is categorical data? Categorical data is a variable that has label values rather than numeric values. Its values usually belong to a fixed, finite set. Categorical variables are also often called nominal variables.
Here are some examples:
- A pet variable with the values: dog, cat.
- A color variable with the values: red, green, blue.
- A place variable with the values: first, second, and third.
Each value in the examples above represents a different category. Some categories have a natural relationship with one another, such as a natural ordering. In the example above, the values of the place variable have exactly this kind of ordering; such a variable is called an ordinal variable.
2. What is the problem with categorical data?
Some algorithms can be applied to categorical data directly. For example, a decision tree can be applied to categorical data without any data conversion (depending on the implementation). Many machine learning algorithms, however, cannot operate on label data directly: they require all input and output variables to be numeric. Generally speaking, this restriction comes from the efficient implementation of these algorithms rather than from the algorithms themselves.
It does mean, though, that we need to convert categorical data into numeric form. And if the output variable is categorical, we may also have to convert the model's predictions back into categories in order to display them or use them in some application.
There are two methods for converting categorical data to numeric data: 1. integer encoding, and 2. one-hot encoding.
a. Integer encoding: as a first step, assign an integer value to each category value.
For example, use 1 for red, 2 for green, and 3 for blue. This is called label encoding or integer encoding, and it is easy to restore back to the category values. Integers have a natural ordering, and a machine learning algorithm may understand and exploit that ordering. For the ordinal place variable in the earlier example, this is exactly right, so label encoding alone is enough there.
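As a minimal sketch of integer encoding in plain Python (the variable names are illustrative; the 1/2/3 mapping follows the example above):

# Integer (label) encoding: map each category value to an integer.
colors = ["red", "green", "blue", "green", "red"]
color_to_id = {"red": 1, "green": 2, "blue": 3}      # the mapping from the example
encoded = [color_to_id[c] for c in colors]
print(encoded)                                       # [1, 2, 3, 2, 1]

# The mapping is easy to invert back to the category values.
id_to_color = {i: c for c, i in color_to_id.items()}
print([id_to_color[i] for i in encoded])             # ['red', 'green', 'blue', 'green', 'red']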
b. One-hot encoding
For categorical variables that have no order relationship, the integer encoding above is not enough. In fact, integer encoding lets the model assume a natural ordering between categories, which can lead to poor or unexpected results (predictions falling in between two categories). In this situation, one-hot encoding should be applied on top of the integer representation. One-hot encoding discards the integer values and creates one binary variable for each integer value.
To put it a bit more generally:
One-hot encoding discrete features genuinely makes the distances between feature values more reasonable. For example, take a discrete feature that represents job type and has three values. Without one-hot encoding, the representations are x_1 = (1), x_2 = (2), x_3 = (3), and the pairwise distances are d(x_1, x_2) = 1, d(x_2, x_3) = 1, d(x_1, x_3) = 2. Does that mean jobs x_1 and x_3 differ from each other more than the other pairs do? Obviously, the distances computed between the features are unreasonable. If we use one-hot encoding instead, we get x_1 = (1, 0, 0), x_2 = (0, 1, 0), x_3 = (0, 0, 1), and the distance between any two jobs is sqrt(2). Every pair of jobs is now the same distance apart, which seems much more reasonable.
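A quick numerical check of the distances above (a NumPy sketch; the arrays simply mirror the example):

import numpy as np

# Integer encoding: the distances imply a spurious ordering between jobs.
x1, x2, x3 = np.array([1.0]), np.array([2.0]), np.array([3.0])
print(np.linalg.norm(x1 - x2), np.linalg.norm(x1 - x3))   # 1.0 2.0

# One-hot encoding: every pair of jobs is equally far apart.
h1, h2, h3 = np.eye(3)
print(np.linalg.norm(h1 - h2), np.linalg.norm(h1 - h3))   # 1.414... 1.414...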
In the color example there are 3 categories, so 3 binary variables are needed for the encoding. The position corresponding to the color is marked with a "1", and the other positions are marked with a "0":
red, green, blue
1, 0, 0
0, 1, 0
0, 0, 1
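This table can also be produced mechanically. A minimal sketch using tf.one_hot (TensorFlow is assumed here because the article's later code uses it; the red=0, green=1, blue=2 index order is an assumption):

import tensorflow as tf

# One row per category: red=0, green=1, blue=2 (assumed order).
indices = [0, 1, 2]
one_hot = tf.one_hot(indices, depth=3)
print(one_hot.numpy())
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]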
II. Word Embedding: teaching the computer to love (the relationships between words)
When deep learning is applied to natural language processing, word vectors are almost always used: the one-hot encoded vectors are converted into word vectors. As for why:
One reason is that deep learning does not handle sparse input well.
The second, and more important, reason is that one-hot encoding has no way to express the relationships between different words (or Chinese words). For any two distinct words, the similarity of their one-hot vectors is always 0, i.e. cos(v_i, v_j) = 0. So here is the question: how do we express the intrinsic relationships between words? This is where embedding comes in.
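A quick check that one-hot vectors carry no similarity information, versus dense vectors that can (a NumPy sketch; the dense vectors are invented purely for illustration):

import numpy as np

def cos(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Any two distinct one-hot vectors are orthogonal: similarity is always 0.
print(cos(np.array([1, 0, 0]), np.array([0, 1, 0])))         # 0.0

# Dense embeddings (made-up values) can express graded similarity.
cat, dog = np.array([0.8, 0.1, 0.3]), np.array([0.7, 0.2, 0.35])
print(cos(cat, dog))                                         # close to 1: related words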
To understand the advantages of embedding, we can compare it against one-hot encoding. One-hot encoding is the most common representation of discrete data: first count the total number N of discrete (categorical) values to be represented, then represent each category with a vector made of N-1 zeros and a single 1. This has two obvious drawbacks:
- For categorical variables with very many values, the transformed vectors are too high-dimensional and too sparse.
- The vectors are completely independent of one another and show no relationships between the different categories.
Considering these two problems, the ideal representation of a categorical variable would use fewer dimensions per category while still showing the relationships between the different values of the variable. This is exactly why embedding appeared.
An embedding is a way of converting discrete variables into continuous vector representations. Embeddings are very useful in neural networks, because they not only reduce the dimensionality of the space the discrete variables live in, but also represent each variable meaningfully.
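A minimal sketch of this idea with tf.keras (the vocabulary size and output dimension are illustrative): an Embedding layer maps each of 10,000 discrete ids to a dense, trainable 8-dimensional vector.

import tensorflow as tf

# 10,000 possible ids -> 8-dimensional dense vectors (sizes are illustrative).
embedding = tf.keras.layers.Embedding(input_dim=10000, output_dim=8)
ids = tf.constant([[3, 41, 577]])   # a batch holding one sequence of 3 ids
vectors = embedding(ids)            # dense output of shape (1, 3, 8)
print(vectors.shape)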
To sum up, embeddings have the following 3 main purposes:
- Finding nearest neighbors in the embedding space, which works well for making recommendations based on a user's interests (see the sketch after this list).
- Serving as input to a supervised learning task.
- Visualizing the relationships between different discrete variables.
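A nearest-neighbor lookup in embedding space might look like the following (a NumPy sketch in which a random matrix stands in for a trained embedding):

import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 16))                    # 100 items, 16 dims (illustrative)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)   # normalize rows to unit length

query = emb[7]                                      # the item to recommend around
sims = emb @ query                                  # cosine similarity to every item
nearest = np.argsort(-sims)[1:6]                    # top-5 neighbors, skipping itself
print(nearest)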
Concretely, this means that to represent the books on Wikipedia with a neural network embedding, each of the roughly 37,000 book articles on Wikipedia needs only a single vector of 50 numbers. Moreover, because embeddings are learnable, as training goes on, books that are more similar to each other move closer together in the embedding space.
The biggest problem with one-hot encoding is that its transformation does not rely on any internal relationships. Through a network trained on a supervised learning task, however, we can optimize the network's parameters and weights to reduce the loss and thereby improve our embedding representation: the smaller the loss, the closer the representations of related categories end up in the final vector space.
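A minimal sketch of learning embeddings through a supervised task (all data here is random and purely illustrative):

import numpy as np
import tensorflow as tf

# Toy supervised task: classify sequences of ids (random stand-in data).
x = np.random.randint(0, 1000, size=(256, 10))   # 256 sequences of 10 ids
y = np.random.randint(0, 2, size=(256,))         # binary labels

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=1000, output_dim=16),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(x, y, epochs=1, verbose=0)

# The trained embedding matrix: gradients from the loss shaped these vectors.
learned = model.layers[0].get_weights()[0]        # shape (1000, 16)
print(learned.shape)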
By visualizing embeddings, we can see the similarity between words.
[Figure: the embedding of person names that I trained with TensorFlow; the image is not preserved here.]

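One common way to produce such a plot is to project the embeddings down to 2D, for example with PCA (a sketch assuming matplotlib and scikit-learn are available; the embedding matrix is a random stand-in):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

emb = np.random.normal(size=(50, 16))      # stand-in for a trained embedding matrix
coords = PCA(n_components=2).fit_transform(emb)

plt.scatter(coords[:, 0], coords[:, 1])
for i, (x, y) in enumerate(coords[:5]):    # label a few points for readability
    plt.annotate(f"word_{i}", (x, y))
plt.title("Embedding visualization (PCA to 2D)")
plt.show()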
III. Summary
Finally, a summary. There are two main vector representations of natural language: one-hot encoding and word embedding. Their advantages and disadvantages (originally shown as an image, restated here from the points above) are:
- One-hot encoding: simple to construct, but the vectors are high-dimensional and sparse, and they express no relationship between words (any two distinct words have similarity 0).
- Word embedding: low-dimensional, dense, and learnable, so related words end up with similar vectors; the cost is that it must be trained on data through some task.
IV. Applications
The basics of embedding were introduced above, but what I really want to say is that its value is not limited to word embedding or entity embedding: the idea that categorical data can be represented by low-dimensional, self-learned vectors is what is most valuable. With this idea, we can apply neural networks and deep learning to much broader domains. Embedding can represent many more things; the key is to figure out, for the problem and application we need to solve, what we want an embedding of.
Speaking of embedding, word2vec has to be mentioned. Training word2vec is itself an embedding process, and the idea extends naturally to item2vec. A few points about word2vec (see the data-construction sketch after this list):
1. It can be viewed as a multi-class classification task.
2. It uses a shallow neural network.
3. Its training data is constructed in one of two patterns: CBOW or skip-gram.
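A sketch of skip-gram data construction in plain Python (the corpus and window size are illustrative):

# Skip-gram: from each center word, predict the words in its context window.
corpus = ["the", "quick", "brown", "fox", "jumps"]
window = 2

pairs = []
for i, center in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            pairs.append((center, corpus[j]))   # (center word, context word)

print(pairs[:4])
# CBOW reverses the direction: predict the center word from its context words.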

The end!
Below I add a small piece of TensorFlow code that simulates an embedding lookup, kept here as a record.
import tensorflow as tf
import numpy as np

# My TensorFlow version is newer (2.x), so the session is opened differently;
# eager execution must be disabled before any tensors are created, or the
# compat.v1 Session below cannot run them.
tf.compat.v1.disable_eager_execution()

# A stand-in corpus: a random embedding matrix of 30 words, 4 dimensions each.
embedding_dict = tf.convert_to_tensor(np.random.random(size=(30, 4)))
# Indices of the input words.
input_train = tf.convert_to_tensor([1, 2, 4, 5])
# Retrieve the embedding weights corresponding to those words.
embed = tf.nn.embedding_lookup(embedding_dict, input_train)

sess = tf.compat.v1.Session()
sess.run(tf.compat.v1.global_variables_initializer())
print(sess.run(embedding_dict))
print("====================================")
print(sess.run(embed))
sess.close()
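For what it's worth, in TensorFlow 2.x the same lookup also runs directly in eager mode, with no session at all (a sketch under that assumption):

import tensorflow as tf
import numpy as np

embedding_dict = tf.convert_to_tensor(np.random.random(size=(30, 4)))
embed = tf.nn.embedding_lookup(embedding_dict, tf.constant([1, 2, 4, 5]))
print(embed.numpy())   # the 4 looked-up embedding rows, printed immediately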