Deep learning: NLP word embedding
2022-06-10 08:07:00 【ShadyPi】
Introduction to word embeddings
In the RNN lessons we learned one way to represent words as vectors: the one-hot representation. Each word is a column vector as long as the dictionary, with a 1 at the position of the word's index and 0 everywhere else. The drawback of this scheme is that the vectors of any two different words are orthogonal, so the network has no concept of synonyms or related words. If we could instead describe words with higher-level features, such as adjective, color, animal, or more abstract attributes, the representation would capture more of each word's essence; the network could then form concepts such as synonym and antonym and behave more intelligently.
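A minimal numpy sketch of this drawback, using a made-up toy dictionary:

```python
import numpy as np

vocab = ["apple", "king", "man", "orange", "queen", "woman"]  # toy dictionary

def one_hot(word, vocab):
    # Column vector as long as the dictionary: 1 at the word's index, 0 elsewhere.
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

# Any two distinct one-hot vectors are orthogonal (dot product 0), so this
# representation carries no notion of similarity between words.
print(one_hot("king", vocab) @ one_hot("queen", vocab))  # 0.0
print(one_hot("king", vocab) @ one_hot("king", vocab))   # 1.0
```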
With word embeddings we can also transfer NLP learning more easily: train the embeddings on massive data to obtain the embedding vectors, then reuse those vectors on NLP tasks that have far less data.
At the same time, because embedding vectors describe the attributes of a word more essentially, they support analogical reasoning. For example, the difference between the embedding vectors of man and woman turns out to be very close to the difference between the vectors of king and queen. The same applies to other relationships, such as country and capital, or the base and comparative forms of an adjective.
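A small sketch of this analogy arithmetic with hand-made 3-dimensional vectors (the numbers are invented purely for illustration; real embeddings are learned and much higher-dimensional):

```python
import numpy as np

emb = {
    "man":   np.array([ 1.0, 0.0, 0.2]),
    "woman": np.array([-1.0, 0.0, 0.2]),
    "king":  np.array([ 1.0, 0.9, 0.1]),
    "queen": np.array([-1.0, 0.9, 0.1]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# "man is to woman as king is to ?": look for the word whose embedding is
# closest to e_king - e_man + e_woman under cosine similarity.
target = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w != "king"), key=lambda w: cosine(emb[w], target))
print(best)  # queen
```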
When training word embeddings, what we learn is an embedding matrix $E$ that holds the embedding vectors of the entire dictionary: its rows are the features of the embedding, and its columns are indexed by the words of the dictionary. Multiplying the embedding matrix by a word's one-hot vector $o$ yields that word's embedding vector $e$.
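The product $e = E o$ is just a column lookup, which is how embedding layers are implemented in practice; a quick numpy check (the sizes are arbitrary):

```python
import numpy as np

vocab_size, embed_dim = 6, 3
rng = np.random.default_rng(0)
E = rng.normal(size=(embed_dim, vocab_size))  # rows: features, columns: word indices

o = np.zeros(vocab_size)
o[4] = 1.0                       # one-hot vector of the word at index 4

e = E @ o                        # the word's embedding vector
assert np.allclose(e, E[:, 4])   # identical to simply reading off column 4
```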
Learning algorithms
Simple algorithm
The most intuitive idea is to learn the matrix $E$ as a parameter of a neural network. For each word we multiply its one-hot vector by $E$ to get an embedding vector $e$; we feed the embedding vectors of the context words into a fully connected layer followed by a softmax layer that predicts the probability of the next word. More generally, to fix the input size, we can agree to consider only the $k$ words before the target (and possibly also the $k$ words after it).
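A rough PyTorch sketch of this naive model (the class name, layer sizes, and choice of PyTorch are illustrative assumptions, not from the original):

```python
import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    # Predict the next word from the k preceding words.
    def __init__(self, vocab_size, embed_dim, k, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # the matrix E, learned
        self.fc = nn.Linear(k * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)      # scores over the dictionary

    def forward(self, context):            # context: (batch, k) word indices
        e = self.embed(context)            # (batch, k, embed_dim)
        h = torch.relu(self.fc(e.flatten(1)))
        return self.out(h)                 # logits; CrossEntropyLoss adds the softmax

model = FixedWindowLM(vocab_size=10000, embed_dim=50, k=4, hidden_dim=128)
logits = model(torch.randint(0, 10000, (2, 4)))  # a batch of 2 four-word contexts
```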
Word2vec skip-gram model
In the naive model we pick a context window of a fixed range to learn the embedding matrix $E$; in the skip-gram model the choice is more random. We randomly pick a word as the context $c$, and then randomly pick a target word $t$ near $c$. The prediction task itself looks hopeless, since so little information is given, yet it is surprisingly effective at learning the embedding matrix.
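A possible sketch of this sampling step (the window size and sentence are invented for illustration):

```python
import random

def sample_pairs(tokens, window=5, n_pairs=10):
    # Draw (context c, target t) pairs with t within `window` positions of c.
    pairs = []
    while len(pairs) < n_pairs:
        i = random.randrange(len(tokens))                 # pick the context word c
        lo, hi = max(0, i - window), min(len(tokens) - 1, i + window)
        j = random.randrange(lo, hi + 1)                  # pick a nearby target t
        if j != i:
            pairs.append((tokens[i], tokens[j]))
    return pairs

text = "i drink a glass of orange juice every morning".split()
print(sample_pairs(text, window=2, n_pairs=5))
```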

Concretely, we start from the one-hot vector $o_c$, multiply by $E$ to get $e_c$, and feed it into a softmax layer (note that the parameters $\theta_t$ of this layer are tied to the prediction target $t$) to obtain a prediction $\hat{y}$ over the target word; we then compute the loss and correct the parameters by backpropagation.
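In the standard formulation (consistent with the symbols above, with $|V|$ the dictionary size), the softmax layer and its cross-entropy loss are:

$$p(t \mid c) = \hat{y}_t = \frac{e^{\theta_t^{\top} e_c}}{\sum_{j=1}^{|V|} e^{\theta_j^{\top} e_c}}, \qquad \mathcal{L}(\hat{y}, y) = -\sum_{i=1}^{|V|} y_i \log \hat{y}_i$$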
The model takes in only one word at a time and predicts only one word, yet the softmax loss optimizes the embedding matrix $E$ very well. The main bottleneck of skip-gram is that the denominator of the softmax sums over the entire dictionary, which is very expensive. One remedy is the hierarchical softmax, which uses a binary tree to replace the big softmax with a cascade of binary classifiers, bringing the cost down to logarithmic time; a heuristic tree layout that places high-frequency words on shallow leaves is even more efficient.
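A toy sketch of the hierarchical idea (the tree, its parameters, and the path encoding are all invented for illustration; a real implementation builds a Huffman-style tree over the vocabulary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tree_prob(e_c, path, node_params):
    # P(word | context) as a product of binary decisions along the word's
    # root-to-leaf path; `path` is a list of (node_id, go_left) pairs.
    p = 1.0
    for node, go_left in path:
        s = sigmoid(node_params[node] @ e_c)   # this node's P(branch left)
        p *= s if go_left else (1.0 - s)
    return p

rng = np.random.default_rng(0)
node_params = {n: rng.normal(size=3) for n in range(3)}   # 3 internal nodes
e_c = rng.normal(size=3)
# Only about log2(|V|) classifiers are evaluated per word instead of all |V|.
print(tree_prob(e_c, path=[(0, True), (1, False)], node_params=node_params))
```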
Another thing to note: if we sample completely at random from the text, words like a, the and of appear far more often than content words, so heuristics may be needed to rebalance the proportions of different words in training.
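One widely used heuristic is the subsampling rule from the original word2vec paper: keep each occurrence of word $w$ with probability $\sqrt{t/f(w)}$ (capped at 1), where $f(w)$ is the word's relative frequency. A sketch (the toy corpus is invented and the threshold $t$ is enlarged here; the paper uses $t \approx 10^{-5}$ on large corpora):

```python
import math
import random
from collections import Counter

tokens = "the king and the queen drank the orange juice in the morning".split()
freq = {w: c / len(tokens) for w, c in Counter(tokens).items()}

def keep_prob(word, t=0.05):
    # Probability of keeping one occurrence of `word`: sqrt(t / f(w)), capped at 1.
    return min(1.0, math.sqrt(t / freq[word]))

subsampled = [w for w in tokens if random.random() < keep_prob(w)]
print(subsampled)  # 'the' is dropped far more often than the rare content words
```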
Negative sampling method
We pick a pair of nearby words from the passage as a positive sample, then pick $k$ words at random from the dictionary to form negative samples (even if a randomly drawn word would be a reasonable fit for the context, it is still labeled negative). For small datasets $k$ is typically between 5 and 20; with large datasets 2 to 5 negative samples suffice.
When training on these samples, we again start from $o_c$ and multiply by $E$ to get $e_c$, but instead of feeding it into a softmax layer we train only the binary classifiers of the $k+1$ sampled words; the classifier for juice, say, outputs only the probability that the predicted word is juice. This reduces each step from traversing the whole dictionary to training just $k+1$ binary classifiers.
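A minimal numpy sketch of the resulting loss for one positive pair plus $k$ negatives (the vocabulary size, dimensions, and indices are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(e_c, theta, t, negatives):
    # k+1 independent binary classifiers instead of a |V|-way softmax:
    # label 1 for the true target t, label 0 for each sampled negative.
    loss = -np.log(sigmoid(theta[t] @ e_c))
    for j in negatives:
        loss -= np.log(sigmoid(-theta[j] @ e_c))
    return loss

rng = np.random.default_rng(0)
theta = rng.normal(size=(100, 8)) * 0.1    # one output vector per word (toy sizes)
e_c = rng.normal(size=8)
print(neg_sampling_loss(e_c, theta, t=7, negatives=[3, 42, 55, 81, 90]))  # k = 5
```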

One more point to note is how the negative samples should be selected. Many practitioners sample words in proportion to their observed frequency raised to the power $\frac{3}{4}$, which interpolates between the raw unigram distribution and a uniform one.
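Computing that sampling distribution is a one-liner; a sketch with invented counts:

```python
import numpy as np

counts = {"the": 50000, "orange": 300, "juice": 120, "king": 80}  # toy counts
words = list(counts)
f = np.array([counts[w] for w in words], dtype=float)
f /= f.sum()                       # raw unigram frequencies

p = f ** 0.75                      # raise to the 3/4 power...
p /= p.sum()                       # ...and renormalize

for w, fi, pi in zip(words, f, p):
    print(f"{w}: freq {fi:.4f} -> sample prob {pi:.4f}")
# high-frequency words are damped and rare words boosted relative to raw frequency
```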
Applications
Sentiment classification
Sentiment classification uses a piece of text to assess whether the author's feeling is positive or negative, for example converting each review on social media into a concrete score.
One difficulty of sentiment classification is that the training set may be small; with word embeddings, however, the task can still be handled well on little data.
A simple implementation is to look up the embedding vectors of all the words in the text, average them, and feed the average into a softmax classifier that outputs the score.
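A sketch of this averaging scheme (the embedding table here is random, standing in for pretrained vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
words = "completely lacking in good taste service and ambience".split()
emb = {w: rng.normal(size=4) for w in words}   # stand-in for pretrained embeddings

def average_features(review):
    # Average the embeddings of all known words; the fixed-size result
    # can be fed to a single softmax classifier to produce the score.
    vecs = [emb[w] for w in review.split() if w in emb]
    return np.mean(vecs, axis=0)

x = average_features("completely lacking in good taste")
print(x.shape)   # (4,): the same size no matter how long the review is
```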
But such an algorithm ignores word order and context, so it fails on euphemistic or negated comments; a review like "completely lacking in good taste, good service, and good ambience" contains good three times yet is strongly negative. We can therefore bring RNNs back in and use a many-to-one model that reads the whole sentence.
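A possible PyTorch sketch of the many-to-one classifier (layer sizes and the 5-class output are assumptions for illustration):

```python
import torch
import torch.nn as nn

class ManyToOneRNN(nn.Module):
    # Read the whole review through an RNN, classify from the final hidden state.
    def __init__(self, vocab_size=10000, embed_dim=50, hidden_dim=64, n_classes=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # could load pretrained E
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_classes)

    def forward(self, tokens):          # tokens: (batch, seq_len) word indices
        e = self.embed(tokens)
        _, h = self.rnn(e)              # h: final hidden state, (1, batch, hidden)
        return self.out(h[-1])          # one prediction for the whole sentence

model = ManyToOneRNN()
scores = model(torch.randint(0, 10000, (2, 12)))  # 2 reviews of 12 tokens each
```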