The Attention Model (AM) in NLP
2022-07-29 06:11:00 【Quinn-ntmy】
1. The Encoder-Decoder framework
In the vast majority of the literature, AM models are attached to the Encoder-Decoder framework. However, the AM model itself does not depend on the Encoder-Decoder framework.
The Encoder-Decoder framework can be thought of as a general processing model for turning one sentence (or document) into another sentence (or document).
- Encoder: encodes the input sentence X, converting it through a nonlinear transformation into an intermediate semantic representation C: C = F(x1, x2, …, xm).
- Decoder: generates the word yi at step i from the intermediate semantic representation C of sentence X and the previously generated history y1, y2, …, yi-1: yi = g(C, y1, y2, …, yi-1).
Each yi is produced in turn, so the whole system looks like it generates the target sentence Y from the input sentence X.
When generating the words of the target sentence, no matter which word is being generated (y1, y2, or y3), the semantic encoding C of sentence X that is used is always the same one. In other words, every word in sentence X has the same influence on whichever target word yi is being generated. This amounts to a "distracted" model with no focus of attention.
【However, if the Encoder is an RNN, then in theory the later a word is fed in, the larger its influence, so the influence is not actually equal. This is why Google, when proposing the seq2seq model, found that feeding the input sentence in reverse order gives better translation results.】
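As an illustration of the fixed-C setup described above, here is a minimal sketch assuming PyTorch with GRU-based encoder and decoder modules; all class and variable names are illustrative and not from the original post.

```python
# Minimal sketch of a fixed-C Encoder-Decoder (illustrative; PyTorch assumed).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src):                   # src: (batch, src_len) token ids
        emb = self.embed(src)                 # (batch, src_len, emb_dim)
        outputs, h_n = self.rnn(emb)          # h_n: (1, batch, hid_dim)
        return h_n                            # final hidden state plays the role of C

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, prev_token, hidden):    # prev_token: (batch, 1); hidden starts as C
        emb = self.embed(prev_token)
        output, hidden = self.rnn(emb, hidden)
        logits = self.out(output.squeeze(1))  # scores for the next word y_i
        return logits, hidden
```

Every decoding step here sees the same C (passed in as the initial hidden state), which is exactly the "no focus of attention" behaviour described above.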
2. The Attention Model (AM)
Core idea: when translating the current target word, the attention model assigns each source (English) word a probability that represents how much attention is allocated to it.【It can be understood as each English word having a different degree of influence on, i.e. a different correlation with, the target word being translated.】
So when generating each target word yi, instead of always using the same intermediate semantic representation C, the model uses a Ci that changes with the word currently being generated.
Key point: the fixed intermediate semantic representation C is replaced by Ci, which incorporates the attention model and keeps changing according to the current output word.
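Written as formulas (a standard reconstruction of the change described above; the formula image from the original post is not reproduced, so take this as an assumption about its content):

```latex
% Without attention: every target word is generated from the same fixed C
y_i = g(C,\; y_1, y_2, \dots, y_{i-1})
% With attention: each target word y_i gets its own context C_i
y_i = g(C_i,\; y_1, y_2, \dots, y_{i-1})
```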
Example: the source sentence "Tom chase Jerry."
- C(Tom) = g(0.6 * f2("Tom"), 0.2 * f2("chase"), 0.2 * f2("Jerry"))
- C(chase) = g(0.2 * f2("Tom"), 0.7 * f2("chase"), 0.1 * f2("Jerry"))
- C(Jerry) = g(0.3 * f2("Tom"), 0.2 * f2("chase"), 0.5 * f2("Jerry"))
Each C(·) is the context used when generating the corresponding target word.
Here, f2 is the Encoder's transformation function for an input word. For example, if the Encoder is an RNN, the result of f2 is usually the hidden-state value of the node after the input xi at that time step has been read.【Role of the hidden layer: it abstracts features of the input data so that they become easier to separate linearly.】
g is the function with which the Encoder composes the intermediate representations of the individual words into the intermediate semantic representation of the whole sentence. In common practice, g is a weighted sum of the component elements; the formula often seen in papers is the weighted sum given below.
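The original post shows this formula as an image; a reconstruction of the standard weighted-sum form, consistent with the example that follows, is:

```latex
C_i = \sum_{j=1}^{T_x} a_{ij} \, h_j
```

where T_x is the length of the input sentence, h_j = f2(x_j) is the Encoder's representation of the j-th input word, and a_{ij} is the attention weight assigned to input word j when generating target word i.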
Suppose Ci is the context for the target word "Tom" (i.e. i is the step at which "Tom" is generated). Then Tx is 3, the length of the input sentence, h1 = f2("Tom"), h2 = f2("chase"), h3 = f2("Jerry"), and the corresponding attention-model weights are 0.6, 0.2, 0.2.
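A small numpy sketch of this weighted sum, with made-up hidden vectors (the numbers are only for illustration, not real model outputs):

```python
# Illustrative weighted sum C_i = sum_j a_ij * h_j for the target word "Tom".
import numpy as np

# h_j = f2(x_j): encoder hidden state for each input word (made-up values)
h = np.array([
    [0.1, 0.3, 0.5],   # h1 = f2("Tom")
    [0.7, 0.2, 0.4],   # h2 = f2("chase")
    [0.6, 0.9, 0.1],   # h3 = f2("Jerry")
])

# attention weights a_ij when generating the target word "Tom"
a = np.array([0.6, 0.2, 0.2])

C_tom = a @ h          # one context vector for this target word
print(C_tom)           # [0.32 0.4  0.4 ]
```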
3. How the attention probability distribution over words is obtained
How are the values above, (Tom, 0.6), (chase, 0.2), (Jerry, 0.2), actually obtained?
Suppose that in the framework above the Encoder uses an RNN model and the Decoder also uses an RNN model.
Refined model (the figure from the original post is not reproduced here).
The attention distribution probability is computed as follows:
- For an RNN Decoder, when generating the word yi at time i, we already know Hi, the output value of the hidden-layer node at time i, before yi is produced.
- We can then compare Hi, the hidden-layer state at time i, one by one with the RNN hidden-layer state hj corresponding to each word of the input sentence, i.e. compute a function F(hj, Hi) to obtain the alignment possibility between the target word yi and each input word. (Different papers choose this F function differently.)
- Finally, the outputs of F are normalized with Softmax to obtain an attention probability distribution with values between 0 and 1.
Most AM models follow the calculation framework above; they differ only in the definition of F. A small sketch of this computation is given below.
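A minimal sketch of the score-then-softmax computation just described, assuming numpy and a simple dot product as the F function (one of many possible choices, not necessarily the one used in any particular paper):

```python
# Compute attention weights a_ij = softmax_j( F(h_j, H_i) ) with F = dot product.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_weights(H_i, h_enc):
    """H_i: decoder hidden state at step i, shape (d,)
    h_enc: encoder hidden states h_j for every input word, shape (Tx, d)
    Returns the attention probability distribution over the input words."""
    scores = h_enc @ H_i          # F(h_j, H_i): here a simple dot product
    return softmax(scores)        # normalize to a 0~1 distribution summing to 1

# toy example with Tx = 3 input words and hidden size d = 4
h_enc = np.random.randn(3, 4)     # h1, h2, h3 from the encoder RNN
H_i = np.random.randn(4)          # decoder hidden state when generating y_i
a_i = attention_weights(H_i, h_enc)
print(a_i, a_i.sum())             # three probabilities that sum to 1.0
```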
The AM model is usually regarded as a word alignment model.
The attention probability distribution over the input-sentence words for each generated target word can be understood as the alignment probability between the input words and that target word.