
Deep learning: NLP word embedding

2022-06-10 08:07:00 ShadyPi

Introduction to word embedding

In the RNN section we saw one way to represent a word as a vector: the one-hot representation. Each word is a column vector as long as the dictionary, with a 1 at the position of the word's index in the dictionary and 0 everywhere else. The drawback of this scheme is that the vectors of all words are mutually orthogonal, so the network has no notion of words being similar or related. If instead we describe words with higher-level features, such as being an adjective, a color, an animal, or more abstract attributes, the representation captures more of what a word means; the network can then form notions of synonyms and antonyms and behave more intelligently.
With word embeddings we can apply transfer learning to NLP problems: train the embeddings on a massive corpus to obtain the embedding vectors, then reuse those vectors for NLP tasks that have far less data.

At the same time, because embedding vectors describe a word's attributes more faithfully, they support analogical reasoning. For example, the difference between the embedding vectors of man and woman turns out to be very close to the difference between the vectors of king and queen. The same kind of comparison works for other relationships, such as country and capital, or the base and comparative forms of an adjective.
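As a hedged sketch of how such an analogy query can be answered (the embedding matrix $E$ and the word-to-index mapping are assumed to come from training; all names here are illustrative), the answer is the word whose embedding is closest, by cosine similarity, to $e_{king} - e_{man} + e_{woman}$:

```python
import numpy as np

def most_similar(query_vec, E, exclude_ids):
    """Return the index of the word whose embedding (a column of E) has the
    highest cosine similarity to query_vec, skipping the query words."""
    norms = np.linalg.norm(E, axis=0) * np.linalg.norm(query_vec)
    sims = (E.T @ query_vec) / (norms + 1e-8)
    sims[list(exclude_ids)] = -np.inf        # never return the query words themselves
    return int(np.argmax(sims))

# Usage (assuming E and a word-to-index dict `word2id` obtained from training):
# q = E[:, word2id["king"]] - E[:, word2id["man"]] + E[:, word2id["woman"]]
# best = most_similar(q, E, exclude_ids={word2id[w] for w in ("king", "man", "woman")})
```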
When training word embeddings, what we learn is an embedding matrix $E$ that collects the embedding vectors of the whole dictionary: each row corresponds to one feature of the embedding, and each column corresponds to a word's index in the dictionary. Multiplying the embedding matrix by a word's one-hot vector $o$ yields that word's embedding vector $e$.
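As a minimal sketch (the vocabulary size, embedding dimension, and variable names below are illustrative assumptions, not taken from the article), the lookup is just a matrix-vector product, which in practice reduces to selecting the corresponding column of $E$:

```python
import numpy as np

vocab_size, embed_dim = 10000, 300           # assumed sizes
E = np.random.randn(embed_dim, vocab_size)   # embedding matrix: one column per dictionary word

word_index = 1234                            # hypothetical dictionary index of some word
o = np.zeros(vocab_size)
o[word_index] = 1.0                          # one-hot vector o

e = E @ o                                    # embedding vector e = E o
assert np.allclose(e, E[:, word_index])      # same as selecting column `word_index` directly
```

This is also why frameworks implement the embedding layer as a table lookup rather than an actual matrix multiplication.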

Learning algorithms

Simple algorithm

The most intuitive approach is to learn the matrix $E$ as a parameter of a neural network. For each context word we multiply its one-hot vector by $E$ to get its embedding vector $e$; these embeddings are fed into a fully connected layer followed by a softmax layer that predicts the probability of the next word. More generally, to keep the input size fixed, we can agree to consider only the $k$ words before the target position (a window of words after it can be included as well).
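Below is a minimal PyTorch sketch of such a fixed-window language model (the layer sizes, window length, and names are assumptions for illustration, not taken from the article). The embedding matrix $E$ is simply the weight of the `nn.Embedding` layer and is trained together with the rest of the network:

```python
import torch
import torch.nn as nn

class WindowLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, window=4, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # its weight plays the role of E
        self.fc = nn.Linear(window * embed_dim, hidden)    # fully connected layer
        self.out = nn.Linear(hidden, vocab_size)           # scores for the softmax layer

    def forward(self, context_ids):                        # context_ids: (batch, window)
        e = self.embed(context_ids)                        # (batch, window, embed_dim)
        h = torch.relu(self.fc(e.flatten(start_dim=1)))    # concatenate the k embeddings
        return self.out(h)                                 # logits; softmax is applied in the loss

model = WindowLM(vocab_size=10000)
logits = model(torch.randint(0, 10000, (32, 4)))           # a random batch, just to show shapes
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 10000, (32,)))
```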

Word2vec skip-gram model

In the naive model we select a context window of a fixed range to learn the embedding matrix $E$. In the skip-gram model the choice is more random: we randomly pick a word as the context $c$, then randomly pick a target $t$ to predict from the words near $c$. The prediction task itself looks hopeless, since so little information is given, but it is surprisingly effective for learning the embedding matrix.

Concretely, we start from the one-hot vector $o_c$, multiply it by $E$ to get $e_c$, and feed that into a softmax layer (note that the parameters $\theta_t$ of this layer are tied to the prediction target $t$) to obtain the prediction $\hat{y}$ for the target word. The loss is then computed and the parameters are corrected by backpropagation.
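As a sketch under the same notation (the dimensions and word indices below are illustrative), the softmax layer computes $p(t \mid c) = \frac{\exp(\theta_t^\top e_c)}{\sum_j \exp(\theta_j^\top e_c)}$, and training minimizes the cross-entropy of this prediction, updating both $\theta$ and $E$:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 10000, 300
E = nn.Embedding(vocab_size, embed_dim)                # embedding matrix E (one row per word here)
theta = nn.Linear(embed_dim, vocab_size, bias=False)   # one theta_t vector per possible target t

c = torch.tensor([42])                   # hypothetical context word index
t = torch.tensor([7])                    # hypothetical target word index
e_c = E(c)                               # e_c = E o_c
logits = theta(e_c)                      # theta_j . e_c for every word j in the dictionary
loss = nn.CrossEntropyLoss()(logits, t)  # softmax + cross-entropy over the whole dictionary
loss.backward()                          # gradients flow into both theta and E
```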

This model takes in only one word at a time and predicts only one word, yet the loss from the softmax function optimizes the embedding matrix $E$ very well. The main bottleneck of skip-gram is that the denominator of the softmax requires a sum over the entire dictionary, which is very expensive. One remedy is the hierarchical softmax, which uses a binary tree of binary classifiers so that the cost becomes logarithmic; a heuristic tree (placing high-frequency words on shallow leaves) is even more efficient.

Another point to note: if context words are sampled completely at random from the text, words like a, the, and of appear far more often than other content words, so some heuristic may be needed to balance the proportions of different words during training.
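One widely used heuristic is the subsampling rule from the original word2vec work (it is not spelled out in this article, so treat the threshold below as an assumption): each occurrence of a word is kept with probability $\sqrt{t / f(w)}$, where $f(w)$ is the word's relative corpus frequency and $t$ is a small threshold.

```python
import random

def keep_word(freq, threshold=1e-5):
    """Subsampling heuristic: keep an occurrence of a word with probability
    sqrt(threshold / freq), where freq is the word's relative corpus frequency."""
    return random.random() < (threshold / freq) ** 0.5

# A word covering 5% of the corpus is kept only ~1.4% of the time,
# while a word at 0.001% of the corpus is essentially always kept.
```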

Negative sampling method

Pick a pair of nearby words from the passage as a positive sample, then draw $k$ random words from the dictionary to form negative samples (even if a randomly drawn word happens to be a plausible context word, it is still labeled negative). For small datasets, $k$ is typically between 5 and 20; for large datasets, 2 to 5 negative samples are enough.
When training on these samples we again start from $o_c$ and multiply by $E$ to get $e_c$, but instead of feeding it into a softmax layer we train only the $k+1$ binary classifiers for the words selected into the sample; for example, the classifier for juice only outputs the probability that the predicted word is juice. This reduces the work from traversing the whole dictionary on every step to training just $k+1$ binary classifiers each time.

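A minimal sketch of one negative-sampling training step (word indices, dimensions, and names are assumptions): for a single positive pair and $k$ sampled negative words, only $k+1$ sigmoid classifiers are evaluated rather than a softmax over the whole dictionary.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, k = 10000, 300, 5
E = nn.Embedding(vocab_size, embed_dim)        # context embeddings (the matrix E)
theta = nn.Embedding(vocab_size, embed_dim)    # one theta vector per candidate target word

c = torch.tensor([42])                          # context word, e.g. "orange"
pos = torch.tensor([7])                         # true target word, e.g. "juice"
neg = torch.randint(0, vocab_size, (k,))        # k randomly drawn negative words

e_c = E(c)                                      # (1, embed_dim)
candidates = torch.cat([pos, neg])              # the k + 1 candidate targets
logits = (theta(candidates) * e_c).sum(dim=1)   # theta_t . e_c for each candidate
labels = torch.cat([torch.ones(1), torch.zeros(k)])
loss = nn.BCEWithLogitsLoss()(logits, labels)   # k + 1 binary classifiers, no full softmax
loss.backward()
```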
One more point to note is how the negative samples should be chosen: many researchers sample words in proportion to their corpus frequency raised to the power $\frac{3}{4}$.
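A sketch of that sampling distribution with made-up word counts: each word is drawn as a negative sample with probability proportional to its frequency raised to the power $\frac{3}{4}$, which lies between uniform sampling and sampling by raw frequency.

```python
import numpy as np

counts = np.array([5000, 1200, 300, 40, 7], dtype=float)  # hypothetical word frequencies
p = counts ** 0.75
p /= p.sum()                                               # P(w_i) = f(w_i)^(3/4) / sum_j f(w_j)^(3/4)

negatives = np.random.choice(len(counts), size=5, p=p)     # draw k negative-sample word indices
```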

Applications

Sentiment classification

The task is to judge from a piece of text whether the author's sentiment is positive or negative, for example converting each comment on social media into a numeric score:
One difficulty of sentiment classification is that the training set may be small, but with word embeddings the task can still be completed in the small-data regime.

A simple implementation is to look up the embedding vectors of all the words in the text, average them, and feed the average into a softmax classifier that outputs a score.
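A minimal sketch of that averaging baseline (the vocabulary size, embedding dimension, and number of rating classes are assumptions):

```python
import torch
import torch.nn as nn

class AvgEmbeddingClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, num_classes=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # pretrained embeddings would be loaded here
        self.out = nn.Linear(embed_dim, num_classes)      # softmax classifier over the rating classes

    def forward(self, word_ids):                          # word_ids: (batch, sentence_length)
        avg = self.embed(word_ids).mean(dim=1)            # average the embedding vectors of all words
        return self.out(avg)                              # logits; softmax is applied in the loss
```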
But such an algorithm cannot take word order or context into account, so indirectly worded or sarcastic comments cannot be recognized correctly; for instance, a review that repeats the word good many times can still be negative overall. We can therefore add an RNN and use a many-to-one model that considers the whole sentence.
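A sketch of the many-to-one variant, here using an LSTM as one possible RNN cell (sizes and names are again assumptions); only the hidden state after the last word is passed to the classifier:

```python
import torch
import torch.nn as nn

class RNNSentimentClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden=128, num_classes=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, num_classes)

    def forward(self, word_ids):            # word_ids: (batch, sentence_length)
        e = self.embed(word_ids)            # sequence of embedding vectors
        _, (h_n, _) = self.rnn(e)           # h_n: hidden state after the last word
        return self.out(h_n[-1])            # many-to-one: classify from the final state
```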

Original site

Copyright notice
This article was written by [ShadyPi]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/03/202203021104192572.html