Analysis of ESIM short text matching model
2022-06-24 01:48:00 【Goose】
ESIM combines BiLSTM with an attention mechanism, and it performs very well on text matching.
Text matching asks whether two sentences stand in a certain relation. For example, given a question and a candidate answer, we need to decide whether the answer matches the question, so the task can be treated as a binary classification problem (output: yes or no). Work in this area is mainly based on the SNLI and MultiNLI corpora, where each example contains two sentences, a premise and a hypothesis, plus a label describing the relation between them. This article explains how ESIM approaches the problem.
1. Introduction
ESIM was designed for natural language inference: given a premise p, infer a hypothesis h. The training objective is to judge whether p and h are related, that is, whether h can be inferred from p. The model can therefore also be used for text matching, where the objective instead becomes whether the two sequences are paraphrases of each other.
2. Model structure
In the ESIM paper, the authors propose two structures, shown in the figure below: on the left, the sequential natural-language-inference model ESIM; on the right, HIM, which incorporates syntactic parse-tree information. This article focuses on the ESIM structure; if you are interested in HIM, please read the original paper.
ESIM consists of four parts: Input Encoding, Local Inference Modeling, Inference Composition, and Prediction.
2.1 Input Encoding
The input to this layer is typically pre-trained word vectors, or word indices passed through an embedding layer. The inputs then go through a bidirectional LSTM, whose main role is to encode them, which can also be understood as feature extraction. We keep the hidden state at every position, written as \bar{a}_i and \bar{b}_j, where i and j index positions (time steps), and a and b denote the premise p and hypothesis h mentioned above.
\begin{array}{l} \bar{a}_{i}=\operatorname{BiLSTM}(a, i) \\ \bar{b}_{j}=\operatorname{BiLSTM}(b, j) \end{array}
2.2 Local Inference Modeling
The next step is to analyze the relation between the two sentences. How? The first thing to notice is that we now have representation vectors for each word that encode both the current context and the interactions between words. If two words are more closely related, the distance and angle between their vectors are smaller: for example, (1,0) and (0,1) have a dot product of 0 (no connection), whereas (0.5,0.5) and (0.5,0.5) have a larger dot product (a stronger connection). With this in mind, let's see how ESIM performs the analysis.
First, compute the dot product between every pair of word vectors across the two sentences:

e_{ij}=\bar{a}_{i}^{\top} \bar{b}_{j}

As noted above, the more related two word vectors are, the larger their product. A softmax then turns these scores into weights:

\begin{array}{l} \tilde{a}_{i}=\sum_{j=1}^{\ell_{b}} \frac{\exp \left(e_{i j}\right)}{\sum_{k=1}^{\ell_{b}} \exp \left(e_{i k}\right)} \bar{b}_{j} \\ \tilde{b}_{j}=\sum_{i=1}^{\ell_{a}} \frac{\exp \left(e_{i j}\right)}{\sum_{k=1}^{\ell_{a}} \exp \left(e_{k j}\right)} \bar{a}_{i} \end{array}
Informally, the formulas above can be understood like this: take, say, the word "good" in the premise. We first score its relation to every word in the other sentence; the scores e_{ij} are normalized into weights, and the weighted sum of the other sentence's word vectors becomes a new, aligned representation of "good". Repeating this comparison word by word yields a new sequence.
The operation above is an attention mechanism; in \tilde{a}_{i} and \tilde{b}_{j}, the normalized fraction is the attention weight. Note that \tilde{a}_{i} is computed as a weighted sum over \bar{b}_{j}, not over \bar{a}_{j}; the same holds symmetrically for \tilde{b}_{j}.
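The alignment step above can be sketched in plain NumPy. This is a minimal illustration with toy dimensions; the arrays `a_bar` and `b_bar` stand in for the BiLSTM hidden states, and all names are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# toy encoded states: premise a (3 words), hypothesis b (4 words), hidden size 5
rng = np.random.default_rng(0)
a_bar = rng.normal(size=(3, 5))
b_bar = rng.normal(size=(4, 5))

# similarity matrix e_ij = <a_bar_i, b_bar_j>, shape (3, 4)
e = a_bar @ b_bar.T

# a_tilde_i: weighted sum of b_bar, weights normalized over j
a_tilde = softmax(e, axis=1) @ b_bar     # shape (3, 5)
# b_tilde_j: weighted sum of a_bar, weights normalized over i
b_tilde = softmax(e, axis=0).T @ a_bar   # shape (4, 5)
```

Note the asymmetry: the same score matrix `e` is normalized along different axes, so each sentence is re-expressed in terms of the other.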
The next step is to analyze the differences, so as to judge whether the connection between the two sentences is strong enough. ESIM computes the difference and the elementwise product between the aligned and original sequences, then concatenates all of this information into one sequence:

\begin{array}{l} m_{a}=\left[\bar{a} ; \tilde{a} ; \bar{a}-\tilde{a} ; \bar{a} \odot \tilde{a}\right] \\ m_{b}=\left[\bar{b} ; \tilde{b} ; \bar{b}-\tilde{b} ; \bar{b} \odot \tilde{b}\right] \end{array}
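The enhancement step is a simple concatenation. A NumPy sketch, continuing with toy stand-in arrays for the encoded and aligned sequences:

```python
import numpy as np

rng = np.random.default_rng(1)
a_bar = rng.normal(size=(3, 5))    # encoded premise, 3 words, hidden size 5
a_tilde = rng.normal(size=(3, 5))  # stand-in for the attention-aligned output

# concatenate original, aligned, difference, and elementwise product
m_a = np.concatenate([a_bar, a_tilde, a_bar - a_tilde, a_bar * a_tilde],
                     axis=-1)      # shape (3, 4*5) = (3, 20)
```

The per-word dimension grows by a factor of four, which is why the next stage starts with a projection layer F to bring it back down.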
2.3 Inference Composition
All the information was gathered into one sequence above because ESIM finally needs to integrate it and perform a global analysis. This is again done by running a BiLSTM over the two sequences:

\begin{array}{l} v_{a, t}=\operatorname{BiLSTM}\left(F\left(m_{a}\right), t\right) \\ v_{b, t}=\operatorname{BiLSTM}\left(F\left(m_{b}\right), t\right) \end{array}
Note that F is a single-layer feed-forward network (with ReLU as the activation function), used mainly to reduce the number of parameters and avoid overfitting. In addition, t above denotes the BiLSTM output at time step t.
Because different sentences yield output sequences v of different lengths, the BiLSTM outputs are pooled into a fixed-length vector to make the final analysis easier. Note that a plain summation is sensitive to sequence length, which would reduce the robustness of the model, so ESIM applies both average pooling and max pooling to each sequence and concatenates the results into one vector:

\begin{array}{l} v_{a, \text {ave}}=\sum_{i=1}^{\ell_{a}} \frac{v_{a, i}}{\ell_{a}}, \quad v_{a, \max }=\max _{i=1}^{\ell_{a}} v_{a, i} \\ v_{b, \text {ave}}=\sum_{j=1}^{\ell_{b}} \frac{v_{b, j}}{\ell_{b}}, \quad v_{b, \max }=\max _{j=1}^{\ell_{b}} v_{b, j} \\ v=\left[v_{a, \text {ave}} ; v_{a, \max } ; v_{b, \text {ave}} ; v_{b, \max }\right] \end{array}
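The pooling step can be sketched as follows. The two composition outputs deliberately have different lengths here, to show that the pooled vector is fixed-size regardless:

```python
import numpy as np

rng = np.random.default_rng(2)
# composition-layer outputs for two variable-length sequences, hidden size 8
v_a = rng.normal(size=(3, 8))   # premise: 3 time steps
v_b = rng.normal(size=(5, 8))   # hypothesis: 5 time steps

# average and max pooling over the time axis give fixed-size summaries,
# which are concatenated into one vector of length 4 * 8 = 32
v = np.concatenate([v_a.mean(axis=0), v_a.max(axis=0),
                    v_b.mean(axis=0), v_b.max(axis=0)])
```

Whatever the sentence lengths, `v` always has length four times the hidden size, which is what lets a fixed-size classifier sit on top.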
2.4 Prediction
Finally, the last step: the vector v is fed into a multi-layer perceptron classifier, with a softmax function at the output layer.
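A minimal sketch of the final classifier, with randomly initialized weights and hypothetical layer sizes (in practice the weights are learned and the hidden size is a tuned hyperparameter; the tanh hidden activation follows the ESIM paper's MLP):

```python
import numpy as np

def softmax(x):
    # numerically stable softmax for a 1-D vector
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(3)
v = rng.normal(size=32)                 # pooled fixed-length vector
W1, b1 = rng.normal(size=(32, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 2)), np.zeros(2)  # 2 classes: match / no match

h = np.tanh(v @ W1 + b1)                # hidden layer
probs = softmax(h @ W2 + b2)            # class probabilities, sum to 1
```

Training would minimize the cross-entropy between `probs` and the gold label, as described in the summary below.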
Summary
ESIM first feeds word embeddings of the two sentences (or pre-trained word vectors directly) into a BiLSTM. The BiLSTM outputs go through an attention computation: each word vector in p is represented by a weighted sum of all word vectors in h, and likewise each word vector in h by a weighted sum of all word vectors in p. The differences and elementwise products are then computed, and the two enhanced matrices are fed into a second BiLSTM. Its outputs are average- and max-pooled, the results concatenated, and the pooled vector is finally sent to a multi-layer perceptron classifier with a softmax output.
For text matching, the objective is to judge whether the two sentences are semantically matched (1 for a match, 0 otherwise), so the cross-entropy loss function is used.