
Deep learning: NLP word embedding

2022-06-10 08:07:00 ShadyPi

Introduction to word embedding

In the RNN section we saw one way to represent a word as a vector: the one-hot representation. Each word is a column vector as long as the dictionary, with a 1 at the position of the word's index in the dictionary and 0 everywhere else. The drawback of this scheme is that the vectors of all words are mutually orthogonal, so the network has no notion of words being similar or related. If instead we describe words with higher-level features, such as being an adjective, a color, an animal, or more abstract attributes, the representation captures more of what a word means; the network can then form notions of synonyms and antonyms and behave more intelligently.
With word embeddings we can apply transfer learning to NLP problems: train the embeddings on a massive corpus to obtain the embedding vectors, then reuse those vectors for NLP tasks that have far less data.

At the same time, because embedding vectors describe a word's attributes more faithfully, they support analogical reasoning. For example, the difference between the embedding vectors of man and woman turns out to be very close to the difference between the vectors of king and queen. The same kind of comparison works for other relationships, such as country and capital, or the base and comparative forms of an adjective.
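As a hedged sketch of how such an analogy query can be answered (the embedding matrix $E$ and the word-to-index mapping are assumed to come from training; all names here are illustrative), the answer is the word whose embedding is closest, by cosine similarity, to $e_{king} - e_{man} + e_{woman}$:

```python
import numpy as np

def most_similar(query_vec, E, exclude_ids):
    """Return the index of the word whose embedding (a column of E) has the
    highest cosine similarity to query_vec, skipping the query words."""
    norms = np.linalg.norm(E, axis=0) * np.linalg.norm(query_vec)
    sims = (E.T @ query_vec) / (norms + 1e-8)
    sims[list(exclude_ids)] = -np.inf        # never return the query words themselves
    return int(np.argmax(sims))

# Usage (assuming E and a word-to-index dict `word2id` obtained from training):
# q = E[:, word2id["king"]] - E[:, word2id["man"]] + E[:, word2id["woman"]]
# best = most_similar(q, E, exclude_ids={word2id[w] for w in ("king", "man", "woman")})
```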
When training word embeddings, what we learn is an embedding matrix $E$ that collects the embedding vectors of the whole dictionary: each row corresponds to one feature of the embedding, and each column corresponds to a word's index in the dictionary. Multiplying the embedding matrix by a word's one-hot vector $o$ yields that word's embedding vector $e$.
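As a minimal sketch (the vocabulary size, embedding dimension, and variable names below are illustrative assumptions, not taken from the article), the lookup is just a matrix-vector product, which in practice reduces to selecting the corresponding column of $E$:

```python
import numpy as np

vocab_size, embed_dim = 10000, 300           # assumed sizes
E = np.random.randn(embed_dim, vocab_size)   # embedding matrix: one column per dictionary word

word_index = 1234                            # hypothetical dictionary index of some word
o = np.zeros(vocab_size)
o[word_index] = 1.0                          # one-hot vector o

e = E @ o                                    # embedding vector e = E o
assert np.allclose(e, E[:, word_index])      # same as selecting column `word_index` directly
```

This is also why frameworks implement the embedding layer as a table lookup rather than an actual matrix multiplication.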

Learning algorithms

Simple algorithm

The most intuitive approach is to learn the matrix $E$ as a parameter of a neural network. For each context word we multiply its one-hot vector by $E$ to get its embedding vector $e$; these embeddings are fed into a fully connected layer followed by a softmax layer that predicts the probability of the next word. More generally, to keep the input size fixed, we can agree to consider only the $k$ words before the target position (a window of words after it can be included as well).
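Below is a minimal PyTorch sketch of such a fixed-window language model (the layer sizes, window length, and names are assumptions for illustration, not taken from the article). The embedding matrix $E$ is simply the weight of the `nn.Embedding` layer and is trained together with the rest of the network:

```python
import torch
import torch.nn as nn

class WindowLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, window=4, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # its weight plays the role of E
        self.fc = nn.Linear(window * embed_dim, hidden)    # fully connected layer
        self.out = nn.Linear(hidden, vocab_size)           # scores for the softmax layer

    def forward(self, context_ids):                        # context_ids: (batch, window)
        e = self.embed(context_ids)                        # (batch, window, embed_dim)
        h = torch.relu(self.fc(e.flatten(start_dim=1)))    # concatenate the k embeddings
        return self.out(h)                                 # logits; softmax is applied in the loss

model = WindowLM(vocab_size=10000)
logits = model(torch.randint(0, 10000, (32, 4)))           # a random batch, just to show shapes
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 10000, (32,)))
```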

Word2vec skip-gram model

In the naive model we select a context window of a fixed range to learn the embedding matrix $E$. In the skip-gram model the choice is more random: we randomly pick a word as the context $c$, then randomly pick a target $t$ to predict from the words near $c$. The prediction task itself looks hopeless, since so little information is given, but it is surprisingly effective for learning the embedding matrix.

Concretely, we start from the one-hot vector $o_c$, multiply it by $E$ to get $e_c$, and feed that into a softmax layer (note that the parameters $\theta_t$ of this layer are tied to the prediction target $t$) to obtain the prediction $\hat{y}$ for the target word. The loss is then computed and the parameters are corrected by backpropagation.
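As a sketch under the same notation (the dimensions and word indices below are illustrative), the softmax layer computes $p(t \mid c) = \frac{\exp(\theta_t^\top e_c)}{\sum_j \exp(\theta_j^\top e_c)}$, and training minimizes the cross-entropy of this prediction, updating both $\theta$ and $E$:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 10000, 300
E = nn.Embedding(vocab_size, embed_dim)                # embedding matrix E (one row per word here)
theta = nn.Linear(embed_dim, vocab_size, bias=False)   # one theta_t vector per possible target t

c = torch.tensor([42])                   # hypothetical context word index
t = torch.tensor([7])                    # hypothetical target word index
e_c = E(c)                               # e_c = E o_c
logits = theta(e_c)                      # theta_j . e_c for every word j in the dictionary
loss = nn.CrossEntropyLoss()(logits, t)  # softmax + cross-entropy over the whole dictionary
loss.backward()                          # gradients flow into both theta and E
```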

This model takes in only one word at a time and predicts only one word, yet the loss from the softmax function optimizes the embedding matrix $E$ very well. The main bottleneck of skip-gram is that the denominator of the softmax requires a sum over the entire dictionary, which is very expensive. One remedy is the hierarchical softmax, which uses a binary tree of binary classifiers so that the cost becomes logarithmic; a heuristic tree (placing high-frequency words on shallow leaves) is even more efficient.

Another point to note: if context words are sampled completely at random from the text, words like a, the, and of appear far more often than other content words, so some heuristic may be needed to balance the proportions of different words during training.
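One widely used heuristic is the subsampling rule from the original word2vec work (it is not spelled out in this article, so treat the threshold below as an assumption): each occurrence of a word is kept with probability $\sqrt{t / f(w)}$, where $f(w)$ is the word's relative corpus frequency and $t$ is a small threshold.

```python
import random

def keep_word(freq, threshold=1e-5):
    """Subsampling heuristic: keep an occurrence of a word with probability
    sqrt(threshold / freq), where freq is the word's relative corpus frequency."""
    return random.random() < (threshold / freq) ** 0.5

# A word covering 5% of the corpus is kept only ~1.4% of the time,
# while a word at 0.001% of the corpus is essentially always kept.
```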

Negative sampling method

Pick a pair of nearby words from the passage as a positive sample, then draw $k$ random words from the dictionary to form negative samples (even if a randomly drawn word happens to be a plausible context word, it is still labeled negative). For small datasets, $k$ is typically between 5 and 20; for large datasets, 2 to 5 negative samples are enough.
When training on these samples we again start from $o_c$ and multiply by $E$ to get $e_c$, but instead of feeding it into a softmax layer we train only the $k+1$ binary classifiers for the words selected into the sample; for example, the classifier for juice only outputs the probability that the predicted word is juice. This reduces the work from traversing the whole dictionary on every step to training just $k+1$ binary classifiers each time.

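A minimal sketch of one negative-sampling training step (word indices, dimensions, and names are assumptions): for a single positive pair and $k$ sampled negative words, only $k+1$ sigmoid classifiers are evaluated rather than a softmax over the whole dictionary.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, k = 10000, 300, 5
E = nn.Embedding(vocab_size, embed_dim)        # context embeddings (the matrix E)
theta = nn.Embedding(vocab_size, embed_dim)    # one theta vector per candidate target word

c = torch.tensor([42])                          # context word, e.g. "orange"
pos = torch.tensor([7])                         # true target word, e.g. "juice"
neg = torch.randint(0, vocab_size, (k,))        # k randomly drawn negative words

e_c = E(c)                                      # (1, embed_dim)
candidates = torch.cat([pos, neg])              # the k + 1 candidate targets
logits = (theta(candidates) * e_c).sum(dim=1)   # theta_t . e_c for each candidate
labels = torch.cat([torch.ones(1), torch.zeros(k)])
loss = nn.BCEWithLogitsLoss()(logits, labels)   # k + 1 binary classifiers, no full softmax
loss.backward()
```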
One more point to note is how the negative samples should be chosen: many researchers sample words in proportion to their corpus frequency raised to the power $\frac{3}{4}$.
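A sketch of that sampling distribution with made-up word counts: each word is drawn as a negative sample with probability proportional to its frequency raised to the power $\frac{3}{4}$, which lies between uniform sampling and sampling by raw frequency.

```python
import numpy as np

counts = np.array([5000, 1200, 300, 40, 7], dtype=float)  # hypothetical word frequencies
p = counts ** 0.75
p /= p.sum()                                               # P(w_i) = f(w_i)^(3/4) / sum_j f(w_j)^(3/4)

negatives = np.random.choice(len(counts), size=5, p=p)     # draw k negative-sample word indices
```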

Applications

Sentiment classification

The task is to judge from a piece of text whether the author's sentiment is positive or negative, for example converting each comment on social media into a numeric score:
One difficulty of sentiment classification is that the training set may be small, but with word embeddings the task can still be completed in the small-data regime.

A simple implementation is to look up the embedding vectors of all the words in the text, average them, and feed the average into a softmax classifier that outputs a score.
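A minimal sketch of that averaging baseline (the vocabulary size, embedding dimension, and number of rating classes are assumptions):

```python
import torch
import torch.nn as nn

class AvgEmbeddingClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, num_classes=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # pretrained embeddings would be loaded here
        self.out = nn.Linear(embed_dim, num_classes)      # softmax classifier over the rating classes

    def forward(self, word_ids):                          # word_ids: (batch, sentence_length)
        avg = self.embed(word_ids).mean(dim=1)            # average the embedding vectors of all words
        return self.out(avg)                              # logits; softmax is applied in the loss
```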
But such an algorithm cannot take word order or context into account, so indirectly worded or sarcastic comments cannot be recognized correctly; for instance, a review that repeats the word good many times can still be negative overall. We can therefore add an RNN and use a many-to-one model that considers the whole sentence.
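A sketch of the many-to-one variant, here using an LSTM as one possible RNN cell (sizes and names are again assumptions); only the hidden state after the last word is passed to the classifier:

```python
import torch
import torch.nn as nn

class RNNSentimentClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden=128, num_classes=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, num_classes)

    def forward(self, word_ids):            # word_ids: (batch, sentence_length)
        e = self.embed(word_ids)            # sequence of embedding vectors
        _, (h_n, _) = self.rnn(e)           # h_n: hidden state after the last word
        return self.out(h_n[-1])            # many-to-one: classify from the final state
```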

Original site

Copyright notice
This article was written by [ShadyPi]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/03/202203021104192572.html