ABSA1: Attentional Encoder Network for Targeted Sentiment Classification
2022-07-29 06:12:00 【Quinn-ntmy】
Paper: Attentional Encoder Network for Targeted Sentiment Classification (click to download the PDF)
Source code: an ABSA model collection (PyTorch edition)
I. Introduction
Previously, most models built for the ABSA problem followed the RNN + Attention paradigm.
Problems with existing approaches:
- RNN-family models (e.g. LSTM, the all-purpose workhorse of NLP tasks) are very expressive but hard to parallelize, and backpropagation through time demands a great deal of memory and computation. In practice almost every RNN training algorithm uses truncated BPTT, which hurts the model's ability to capture dependencies over longer spans. LSTM alleviates the vanishing-gradient problem to some extent, but it usually needs a lot of training data.
- Most previous work ignores the label unreliability issue: the neutral label is a fuzzy expression of sentiment, so training samples with neutral sentiment labels are not trustworthy.
II. Solution
- Propose an attention-based model that uses attention to draw out the introspective semantics within the target and context words and the interactive semantics between them.
- For the label unreliability issue, add an effective label smoothing regularization (LSR) term to the loss function, which encourages the model to learn from fuzzy labels.
For background on label smoothing regularization (LSR), see:
https://zhuanlan.zhihu.com/p/64970719
III. Model Structure: AEN
AEN consists of an embedding layer, an attentional encoder layer, a target-specific attention layer, and an output layer.
1. Embedding Layer
There are two embedding options:
(1) GloVe embedding;
(2) BERT embedding: the given context and target need to be converted to "[CLS] + context + [SEP]" and "[CLS] + target + [SEP]" respectively.
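The snippet below is a minimal sketch of preparing these BERT inputs, assuming the HuggingFace transformers tokenizer; the example sentence and target are illustrative.

```python
# Sketch: building "[CLS] + context + [SEP]" and "[CLS] + target + [SEP]"
# inputs for AEN-BERT. Assumes the HuggingFace `transformers` package.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

context = "the food was great but the service was slow"  # illustrative example
target = "service"

# encode() prepends [CLS] and appends [SEP] automatically.
context_ids = tokenizer.encode(context)  # [CLS] context [SEP]
target_ids = tokenizer.encode(target)    # [CLS] target [SEP]
```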
2. Attentional Encoder Layer
The attentional encoder layer is a parallelizable, interactive alternative to LSTM that computes the hidden states of the input embeddings. It consists of two submodules: Multi-Head Attention (MHA) and the Point-wise Convolution Transformation (PCT). In effect, feature extraction is MHA followed by PCT.
(1) MHA (Multi-Head Attention)
- Given the context embedding e^c, Intra-MHA, i.e. self-attention, models the introspective semantics of the context words: c^intra = MHA(e^c, e^c).
- Given the context embedding e^c and the target embedding e^t, Inter-MHA, i.e. conventional attention, models the context-aware target words: t^inter = MHA(e^c, e^t), where the first argument supplies the keys and the second the queries (see the sketch after this list).
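Below is a minimal sketch of the two attention variants using PyTorch's built-in nn.MultiheadAttention; sharing one module instance for both calls is a simplification (the paper learns separate parameters), and all shapes are illustrative.

```python
# Sketch: Intra-MHA (self-attention over the context) and Inter-MHA
# (target attends to the context). Illustrative, not the paper's exact code.
import torch
import torch.nn as nn

embed_dim, n_heads = 300, 6
mha = nn.MultiheadAttention(embed_dim, n_heads, batch_first=True)

e_c = torch.randn(8, 20, embed_dim)  # context embeddings [batch, ctx_len, dim]
e_t = torch.randn(8, 4, embed_dim)   # target embeddings  [batch, tgt_len, dim]

# Intra-MHA: query, key, and value all come from the context.
c_intra, _ = mha(e_c, e_c, e_c)      # introspective context representation

# Inter-MHA: query = target, key/value = context.
t_inter, _ = mha(e_t, e_c, e_c)      # context-aware target representation
```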
(2) PCT (Point-wise Convolution Transformation)
PCT transforms the contextual information collected by MHA. "Point-wise" means the convolution kernel size is 1. Applying PCT to the two attention outputs above, PCT(h) = ELU(h ∗ W1 + b1) ∗ W2 + b2 (where ∗ denotes the point-wise convolution), gives the output hidden states of the attentional encoder layer: h^c = PCT(c^intra) and h^t = PCT(t^inter).
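A minimal PCT sketch with two kernel-size-1 Conv1d layers and an ELU in between; the dimensions are illustrative.

```python
# Sketch: Point-wise Convolution Transformation (kernel size 1).
import torch
import torch.nn as nn

class PointwiseConvTransform(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.conv1 = nn.Conv1d(dim, dim, kernel_size=1)
        self.conv2 = nn.Conv1d(dim, dim, kernel_size=1)
        self.act = nn.ELU()

    def forward(self, h):          # h: [batch, seq_len, dim]
        x = h.transpose(1, 2)      # Conv1d expects [batch, dim, seq_len]
        x = self.conv2(self.act(self.conv1(x)))
        return x.transpose(1, 2)   # back to [batch, seq_len, dim]

pct = PointwiseConvTransform(300)
h_c = pct(torch.randn(8, 20, 300))  # e.g. h^c = PCT(c_intra)
```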
3. Target-specific Attention Layer
After obtaining the introspective context representation h^c and the context-aware target representation h^t, another MHA is applied to obtain the target-specific context representation: h^tsc = MHA(h^c, h^t).
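A short sketch of this step, again with nn.MultiheadAttention (query = h^t, key/value = h^c); the module instance and shapes are illustrative.

```python
# Sketch: target-specific attention, h^tsc = MHA(h^c, h^t).
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(300, 6, batch_first=True)
h_c = torch.randn(8, 20, 300)  # introspective context representation
h_t = torch.randn(8, 4, 300)   # context-aware target representation

h_tsc, _ = mha(h_t, h_c, h_c)  # target-specific context representation
```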
4. Output Layer
The representations produced by the previous steps are average-pooled, concatenated into the final representation, and projected by a fully connected layer into the space of the C target classes (see the sketch below).
【The role of pooling: it shrinks the feature map, which reduces computation and memory, i.e. feature dimensionality reduction. Average pooling preserves the overall characteristics of the data well and highlights background information; max pooling better preserves texture features.】
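A minimal sketch of the output layer; the dimensions, three-way concatenation, and class count are illustrative assumptions.

```python
# Sketch: average pooling -> concatenation -> linear projection -> softmax.
import torch
import torch.nn as nn

dim, num_classes = 300, 3
fc = nn.Linear(3 * dim, num_classes)

h_c = torch.randn(8, 20, dim)
h_t = torch.randn(8, 4, dim)
h_tsc = torch.randn(8, 4, dim)

# Average-pool each representation over its sequence length, then concatenate.
o = torch.cat([h.mean(dim=1) for h in (h_c, h_t, h_tsc)], dim=-1)  # [8, 3*dim]
probs = torch.softmax(fc(o), dim=-1)  # class probabilities
```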
5. Loss Function
To deal with unreliable labels, LSR is introduced into the loss.
A quick review of LSR (Label Smoothing Regularization):
LSR adds noise to the output label y, which constrains the model and reduces overfitting; it is used in classification problems.
In a classification problem, p(y|x) is the predicted probability distribution and q(y|x) is the ground-truth distribution over the classes, usually in one-hot form: the true class is marked 1 and all others 0. One-hot targets have two problems:
- they easily cause overfitting;
- the model becomes over-confident in its predictions, which can make them deviate severely from the facts.
LSR solves both problems by introducing a prior distribution u(y), usually the uniform distribution 1/k, where k is the number of classes, and a smoothing factor ϵ ∈ [0, 1]. The smoothed target distribution is
q'(y|x) = (1 − ϵ) q(y|x) + ϵ u(y).
This formula effectively adds noise to the label y, preventing the model from concentrating its predicted probability on the high-probability class and assigning some probability mass to the low-probability classes.
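A minimal sketch of a label-smoothed cross-entropy loss under a uniform prior u(y) = 1/k; the function name and parameters are illustrative.

```python
# Sketch: cross-entropy against smoothed targets q' = (1-eps)*one_hot + eps/k.
import torch
import torch.nn.functional as F

def lsr_cross_entropy(logits, labels, eps=0.1):
    k = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    one_hot = F.one_hot(labels, k).float()
    q_smooth = (1.0 - eps) * one_hot + eps / k
    return -(q_smooth * log_probs).sum(dim=-1).mean()

logits = torch.randn(8, 3)            # e.g. 3 sentiment classes
labels = torch.randint(0, 3, (8,))
loss = lsr_cross_entropy(logits, labels)
```

In recent PyTorch versions the same effect is available directly via F.cross_entropy(logits, labels, label_smoothing=eps).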