Still stuck at BERT for text classification? This dual contrastive learning framework is remarkably strong
2022-07-06 05:27:00 【To great】
Paper overview: dual contrastive learning, or how to apply contrastive learning to supervised text classification
Paper title: Dual Contrastive Learning: Text Classification via Label-Aware Data Augmentation
Paper link: https://arxiv.org/abs/2201.08702
Code link: https://github.com/hiyouga/dual-contrastive-learning
Paper author: Qianben Chen
Zhihu notes: https://zhuanlan.zhihu.com/p/466685216
Paper abstract
Contrastive learning has achieved remarkable success in representation learning via self-supervision in unsupervised settings. However, effectively adapting contrastive learning to supervised learning tasks remains a practical challenge. In this work, the authors propose a dual contrastive learning (DualCL) framework that simultaneously learns the features of input samples and the parameters of a classifier in the same space. Specifically, DualCL treats the classifier parameters as augmented samples associated with different labels, and then performs contrastive learning between the input samples and these augmented samples. Experiments on five benchmark text classification datasets and their low-resource counterparts show that DualCL significantly improves classification accuracy, and confirm that DualCL learns discriminative sample representations.
A brief introduction to DualCL
Representation learning lies at the core of deep learning. In the unsupervised setting, contrastive learning has recently proved to be an effective way of obtaining general-purpose representations for downstream tasks. In short, unsupervised contrastive learning uses a loss function that forces the representations of different "views" of the same sample to be similar while keeping the representations of different samples apart. The effectiveness of contrastive learning has recently been attributed to its simultaneously achieving "alignment" and "uniformity" of the representations.
Contrastive learning has also been adapted to supervised representation learning: previous studies employ a similar contrastive loss, whose rationale is that representations of samples from the same class should be similar, while those of samples from different classes should be clearly separated. However, despite its success, the rationale behind this approach is much weaker than in the unsupervised case. For example, uniformity of the representations no longer holds, nor is it necessary; in general the spatial distribution of features is no longer uniform. The authors therefore argue that standard supervised contrastive learning is not a natural fit for supervised representation learning. Another issue is that this kind of contrastive learning does not directly yield a classifier, so a separate classification algorithm still has to be built to solve the classification task.
This brings us to the motivation of DualCL: the authors aim to develop a more natural way of performing contrastive learning under supervised tasks. Their key insight is that supervised representation learning should involve learning two kinds of parameters: one is the feature $z$ of the input $x$ in an appropriate space, which serves the classification task; the other is the parameter of a classifier acting on that space, $\theta$, which they call the "one-example" classifier of $x$. In this view, each sample $x$ is associated with two quantities: a $d$-dimensional feature $z \in \mathbb{R}^d$ and a classifier parameter $\theta \in \mathbb{R}^{d \times K}$, where $K$ is the total number of classes. Supervised representation learning can then be regarded as generating the pair $(z, \theta)$ from the input sample $x$.
To make the classifier $\theta$ effective for the feature $z$, it suffices to keep $\theta^{T} z$ aligned with the label of $x$, which can be enforced with a cross-entropy loss on the softmax-normalized probabilities. Beyond that, contrastive learning can be used to further constrain the $(z, \theta)$ representations. Specifically, write $\theta^*$ for the column of $\theta$ corresponding to the true label of $x$; two contrastive losses can then be designed. The first loss contrasts $(z, \theta^*)$ with pairs $(z', \theta^*)$, where $z'$ denotes the feature of a sample from a class different from that of $x$. The second loss contrasts $(z, \theta^*)$ with pairs $(z, \theta'^*)$, where $\theta'^*$ denotes the classifier parameter corresponding to a different class. The authors call this learning framework dual contrastive learning (DualCL).
On top of contrastive learning, and as the paper title suggests, DualCL can also be viewed as a unique data augmentation method. Specifically, for each sample $x$, every column of its $\theta$ can be regarded as a "label-aware input representation", i.e., an augmented view of $x$ in feature space that is injected with label information. Figure 1 of the paper illustrates the power of this approach: as the two panels on the left show, standard contrastive learning cannot exploit label information; in contrast, the two panels on the right show that DualCL effectively uses the label information to cluster input samples within their classes.
In the experiments, the authors verify the effectiveness of DualCL on five benchmark text classification datasets. By fine-tuning pretrained language models (BERT and RoBERTa) with the dual contrastive loss, DualCL achieves the best performance compared with existing supervised contrastive learning baselines. The authors also find that DualCL improves classification accuracy, especially in low-resource scenarios. In addition, they provide interpretability analyses of DualCL by visualizing the learned representations and attention maps.
The contributions of this paper can be summarized as follows:
- 1) Dual contrastive learning (DualCL) is proposed, which naturally combines a contrastive loss with supervised tasks;
- 2) Label-aware data augmentation is introduced to obtain multiple views of the input samples for training DualCL;
- 3) The effectiveness of the DualCL framework is empirically verified on five benchmark text classification datasets.
The principle of DualCL
The purpose of "dual" supervised contrastive learning is twofold: first, to learn discriminative representations of the inputs of the classification task in an appropriate space; second, to build a classifier for the supervised task by learning the classifier's parameters in the classifier space. Let us now look at the core of DualCL.
Label-aware data augmentation
To obtain representations of different views of a training sample, the author uses the idea of data augmentation to obtain both the feature $z_i$ and the classifier $\theta_i$. Specifically, the column of the classifier $\theta_i$ corresponding to each class is treated as a distinct representation of $x_i$, denoted $\theta_i^k$ and called the label-aware input representation: the information of label $k$ is injected into $x_i$ as an additional augmented view.
In practice, the label set $\{1, ..., K\}$ is inserted into the input sequence $x_i$, yielding a new input sequence $r_i$ of length $L + K$. This sequence is fed through a pretrained language model (BERT or RoBERTa) as the encoder $f$ to obtain a feature for every token: the [CLS] feature serves as the sample feature $z_i$, and the features of the inserted label tokens serve as the label-aware input representations $\theta_i^k$. The label names themselves, such as "positive" and "negative", are used as the inserted tokens that form the sequence $r_i$. For labels consisting of multiple words, average pooling over their token features is used to obtain the label-aware input representation. This is very similar to an earlier article; interested readers may refer to: BERT can also be used this way: fusing label vectors into BERT.
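To make this concrete, here is a minimal sketch of label-aware encoding. It assumes the HuggingFace transformers API and single-token label words ("positive", "negative"); it illustrates the idea rather than reproducing the paper's exact implementation.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

label_words = ["positive", "negative"]   # K label names, assumed to be single tokens here
sentence = "predictably heart warming"

# Prepend the label words so the encoder sees: [CLS] positive negative <sentence> [SEP]
inputs = tokenizer(" ".join(label_words) + " " + sentence, return_tensors="pt")
hidden = encoder(**inputs).last_hidden_state          # (1, K + L + 2, d)

z_i = hidden[:, 0]                                    # [CLS] feature -> sample representation z_i
theta_i = hidden[:, 1:1 + len(label_words)]           # label-token features -> theta_i, shape (1, K, d)

scores = torch.einsum("bkd,bd->bk", theta_i, z_i)     # theta_i^T z_i, one score per class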
Dual contrastive loss
Given the feature representation $z_i$ of an input sample $x_i$ and its classifier $\theta_i$, DualCL aligns the softmax-normalized probabilities of $\theta_i^{T} z_i$ with the label of $x_i$. Writing $\theta_i^*$ for the column of $\theta_i$ corresponding to the true label of $x_i$, DualCL expects the dot product $\theta_i^{*T} z_i$ to be maximized. To learn better $z_i$ and $\theta_i$, DualCL defines a dual contrastive loss that exploits the relationships between different training samples: if $x_j$ has the same label as $x_i$, it tries to maximize $\theta_i^{*T} z_j$, and if $x_j$ has a different label from $x_i$, it minimizes $\theta_i^{*T} z_j$.
Given the anchor $z_i$ from an input sample $x_i$, take $\{\theta_j^*\}_{j \in P_i}$ as the positive set and $\{\theta_j^*\}_{j \in A_i \setminus P_i}$ as the negative set. The contrastive loss with respect to $z$ then takes the familiar supervised-contrastive form over these sets, with temperature $\tau$:
$\mathcal{L}_{z} = -\frac{1}{N}\sum_{i=1}^{N} \frac{1}{|P_i|} \sum_{p \in P_i} \log \frac{\exp(\theta_p^{*T} z_i / \tau)}{\sum_{a \in A_i} \exp(\theta_a^{*T} z_i / \tau)}$
Similarly, given the anchor $\theta_i^*$ from an input sample $x_i$, take $\{z_j\}_{j \in P_i}$ as the positive set and $\{z_j\}_{j \in A_i \setminus P_i}$ as the negative set. The contrastive loss with respect to $\theta$ has the same form:
$\mathcal{L}_{\theta} = -\frac{1}{N}\sum_{i=1}^{N} \frac{1}{|P_i|} \sum_{p \in P_i} \log \frac{\exp(\theta_i^{*T} z_p / \tau)}{\sum_{a \in A_i} \exp(\theta_i^{*T} z_a / \tau)}$
The dual contrastive loss combines the two contrastive loss terms above:
$\mathcal{L}_{Dual} = \mathcal{L}_{z} + \mathcal{L}_{\theta}$
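For intuition, here is a simplified sketch of the two terms in one function. It is not the author's implementation (the official code, excerpted in the "Core code" section below, additionally uses per-pair temperatures and masks out the diagonal), and the positive set here includes the sample itself.

import torch
import torch.nn.functional as F

def dual_contrastive_loss(z, theta_star, labels, tau=0.1):
    """Simplified L_z + L_theta.
    z:          (N, d) sample features
    theta_star: (N, d) classifier column of each sample's true label
    labels:     (N,)   class indices
    """
    z = F.normalize(z, dim=-1)
    theta_star = F.normalize(theta_star, dim=-1)
    pos = torch.eq(labels.view(-1, 1), labels.view(1, -1)).float()   # (N, N), 1 where labels match

    def info_nce(anchor, target):
        logits = anchor @ target.T / tau                             # (N, N) scaled dot products
        log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
        # negative mean log-probability over each anchor's positives
        return -(pos * log_prob).sum(1).div(pos.sum(1)).mean()

    return info_nce(z, theta_star) + info_nce(theta_star, z)         # L_z + L_theta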
Contrastive training and supervised prediction
To make full use of the supervision signal, DualCL also expects $\theta_i$ to be a good classifier for $z_i$. The author therefore uses a modified cross-entropy loss that maximizes $\theta_i^{*T} z_i$ for each input sample $x_i$:
$\mathcal{L}_{CE} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(\theta_i^{*T} z_i)}{\sum_{k=1}^{K} \exp((\theta_i^{k})^{T} z_i)}$
Finally, the encoder $f$ is trained by minimizing these two objectives together; they jointly improve the quality of the feature representations and of the classifier representations. The overall loss is:
$\mathcal{L}_{overall} = \mathcal{L}_{CE} + \lambda \mathcal{L}_{Dual}$
where $\lambda$ is a hyperparameter that controls the weight of the dual contrastive loss term.
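Continuing the simplified sketch above (hypothetical helper names, not the official code), the overall objective can be assembled as follows.

def overall_loss(z, theta, labels, lam=0.1, tau=0.1):
    """z: (N, d) features; theta: (N, K, d) label-aware representations; labels: (N,)."""
    theta_star = theta[torch.arange(len(labels)), labels]            # true-label column, (N, d)
    logits = torch.einsum("nkd,nd->nk", theta, z)                    # theta_i^T z_i, (N, K)
    ce = F.cross_entropy(logits, labels)                             # cross entropy on theta_i^T z_i
    dual = dual_contrastive_loss(z, theta_star, labels, tau=tau)     # L_z + L_theta from the sketch above
    return ce + lam * dual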
At prediction time, the trained encoder $f$ generates the feature representation $z_i$ and the classifier $\theta_i$ for an input sentence $x_i$. Here $\theta_i$ can be viewed as a "one-example" classifier for $x_i$: the model's prediction is the argmax of $\theta_i^{T} z_i$:
$\widehat{y}_i = \operatorname{argmax}_k(\theta_i^{k} \cdot z_{i})$
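A minimal inference sketch in the same hypothetical setting:

def predict(z, theta):
    """z: (N, d), theta: (N, K, d) -> predicted class index per sample."""
    scores = torch.einsum("nkd,nd->nk", theta, z)   # theta_i^k . z_i for every class k
    return scores.argmax(dim=-1)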
The framework figure of the paper illustrates dual contrastive learning, where $e_{CLS}$ is the feature representation and $e_{POS}$ and $e_{NEG}$ are the classifier representations. In this particular example, the anchor belongs to the "positive" class, one positive sample shares its class label, and one negative sample has a different class label. The dual contrastive loss simultaneously pulls the feature representation toward the classifier representations of positive samples and pushes it away from the classifier representations of negative samples.
Duality between representations
The contrastive loss uses the dot product as the similarity measure between representations, which gives rise to a dual relationship between the feature representation $z$ and the classifier representation $\theta$ in DualCL. A similar relationship holds between the input features and the parameters of a linear classifier. We can therefore regard $\theta$ as the parameters of a linear classifier, so that the pretrained encoder $f$ produces a linear classifier for each input sample. In this sense, DualCL naturally learns how to generate a linear classifier for every input sample to perform the classification task.
Experimental setup
Datasets
The paper uses five datasets: SST-2, SUBJ, TREC, PC and CR. Their statistics are summarized as follows:
Experimental results
As the results show, with both the BERT and RoBERTa encoders DualCL achieves the best classification performance in almost all settings, the only exception being the TREC dataset with RoBERTa. Compared with CE+CL on the full training data, DualCL improves BERT and RoBERTa by 0.46% and 0.39% on average, respectively. Moreover, with only 10% of the training data, DualCL outperforms CE+CL by a clear margin: 0.74% on BERT and 0.51% on RoBERTa. Meanwhile, neither CE nor CE+SCL can surpass DualCL, because the CE method ignores the relationships between samples and the CE+SCL method cannot directly learn a classifier for the classification task.
The paper also finds that the dual contrastive loss term helps the model achieve better performance on all five datasets, showing that exploiting the relationships between samples helps the model learn better representations in contrastive learning.
Case analysis
To verify whether DualCL captures informative features, the author also computes the attention scores between the [CLS] feature and each word in the sentence. The RoBERTa encoder is first fine-tuned on the full training set, and then the $\ell_2$ distances between features are computed and visualized in Figure 4. The results show that the captured features differ across sentiments. The upper example comes from the SST-2 dataset: the model pays more attention to "predictably heart warming", which expresses "positive" sentiment. The lower example comes from the CR dataset: the model pays more attention to "small", which expresses "negative" sentiment. In contrast, the CE method does not focus on these discriminative features. This demonstrates that DualCL successfully attends to the informative keywords in a sentence.
A summary of the paper
- In this work, a dual contrastive learning method, DualCL, is proposed from the perspective of text classification to address supervised learning tasks.
- In DualCL, the author uses PLMs to learn two representations at the same time: one is a discriminative feature of the input example, and the other is a classifier for that example. Label-aware data augmentation is introduced to generate different views of the input samples, covering both the features and the classifiers, and a dual contrastive loss is then designed to make the classifier effective for the input features.
- The dual contrastive loss exploits the supervision signals between training samples to learn better representations, and extensive experiments verify the effectiveness of dual contrastive learning.
Core code
For the implementation of Dual-Contrastive-Learning, see the open-source code:
https://github.com/hiyouga/Dual-Contrastive-Learning/blob/main/main_polarity.py
# Excerpt from main_polarity.py in the repository; the methods below require:
import torch
import torch.nn.functional as F

def _contrast_loss(self, cls_feature, label_feature, labels):
    normed_cls_feature = F.normalize(cls_feature, dim=-1)
    normed_label_feature = F.normalize(label_feature, dim=-1)
    list_con_loss = []
    BS, LABEL_CLASS, HS = normed_label_feature.shape
    # gather the classifier column that corresponds to each sample's true label
    normed_positive_label_feature = torch.gather(normed_label_feature, dim=1,
                                                 index=labels.reshape(-1, 1, 1).expand(-1, 1, HS)).squeeze(1)  # (bs, 768)
    if "1" in self.opt.contrast_mode:
        # anchor: true-label classifier representations, targets: [CLS] features
        loss1 = self._calculate_contrast_loss(normed_positive_label_feature, normed_cls_feature, labels)
        list_con_loss.append(loss1)
    if "2" in self.opt.contrast_mode:
        # anchor: [CLS] features, targets: true-label classifier representations
        loss2 = self._calculate_contrast_loss(normed_cls_feature, normed_positive_label_feature, labels)
        list_con_loss.append(loss2)
    if "3" in self.opt.contrast_mode:
        loss3 = self._calculate_contrast_loss(normed_positive_label_feature, normed_positive_label_feature, labels)
        list_con_loss.append(loss3)
    if "4" in self.opt.contrast_mode:
        loss4 = self._calculate_contrast_loss(normed_cls_feature, normed_cls_feature, labels)
        list_con_loss.append(loss4)
    return list_con_loss

def _calculate_contrast_loss(self, anchor, target, labels, mu=1.0):
    BS = len(labels)
    with torch.no_grad():
        labels = labels.reshape(-1, 1)
        mask = torch.eq(labels, labels.T)  # (bs, bs), True where two samples share a label
        # compute per-pair temperature using the mask
        temperature_matrix = torch.where(mask == True, mu * torch.ones_like(mask),
                                         1 / self.opt.temperature * torch.ones_like(mask)).to(self.opt.device)
        # # mask out self-contrast cases, i.e. do not contrast a sample with itself
        # logits_mask = torch.scatter(
        #     torch.ones_like(mask),
        #     1,
        #     torch.arange(BS).view(-1, 1).to(self.opt.device),
        #     0
        # )
        # mask = mask * logits_mask
    # compute logits
    anchor_dot_target = torch.multiply(torch.matmul(anchor, target.T), temperature_matrix)  # (bs, bs)
    # for numerical stability
    logits_max, _ = torch.max(anchor_dot_target, dim=1, keepdim=True)
    logits = anchor_dot_target - logits_max.detach()  # (bs, bs)
    # compute log_prob
    exp_logits = torch.exp(logits)  # (bs, bs)
    exp_logits = exp_logits - torch.diag_embed(torch.diag(exp_logits))  # zero the diagonal so a sample is not contrasted with itself
    log_prob = logits - torch.log(exp_logits.sum(dim=1, keepdim=True) + 1e-12)  # (bs, bs)
    # guard against mask.sum(1) being zero
    mask_sum = mask.sum(dim=1)
    mask_sum = torch.where(mask_sum == 0, torch.ones_like(mask_sum), mask_sum)
    # compute mean of log-likelihood over positives
    mean_log_prob_pos = (mask * log_prob).sum(dim=1) / mask_sum.detach()
    loss = - mean_log_prob_pos.mean()
    return loss
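For reference, here is a hypothetical way to exercise the two methods above outside the full training script, assuming their definitions are in scope. The small wrapper class, the SimpleNamespace opt and the random tensors are all illustrative stand-ins; the real repository drives these methods from its own trainer and argument parser (contrast_mode, temperature, device).

import types
import torch

class DualLossModule:
    # bind the two functions above as methods so that self.opt and
    # self._calculate_contrast_loss resolve the way the original code expects
    _contrast_loss = _contrast_loss
    _calculate_contrast_loss = _calculate_contrast_loss

    def __init__(self, opt):
        self.opt = opt

opt = types.SimpleNamespace(contrast_mode="12", temperature=0.1, device="cpu")
module = DualLossModule(opt)

cls_feature = torch.randn(8, 768)       # [CLS] features z_i for a batch of 8
label_feature = torch.randn(8, 2, 768)  # label-aware representations theta_i (K = 2 classes)
labels = torch.randint(0, 2, (8,))      # gold labels

losses = module._contrast_loss(cls_feature, label_feature, labels)
print(sum(losses))                      # L_z + L_theta when contrast_mode is "12"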
References
ICML 2020: Understanding contrastive representation learning from the perspective of alignment and uniformity
https://blog.csdn.net/c2a2o2/article/details/117898108