NAACL 2021 | Contrastive Learning Sweeps the Text Clustering Task
2022-07-05 02:02:00 【Necther】
Introduction
After all, it is West Lake in June, whose scenery is unlike any other season: lotus leaves stretch to the sky in boundless green, and lotus flowers against the sun glow an uncommon red. Hello everyone, I'm the little guy who sells hot dry noodles. Today I'm sharing a paper Amazon published at NAACL 2021: Supporting Clustering with Contrastive Learning. Teaming up with today's box-office star 「contrastive learning」, the paper proposes a simple and effective 「unsupervised text clustering」 method: SCCL. The model sweeps seven short text clustering tasks, topping six of them ~
「Paper」
https://arxiv.org/abs/2103.12953
「Code」
https://github.com/amazon-research/sccl
Brief introduction
Let's first walk through what Supporting Clustering with Contrastive Learning does, in Q&A form.
Q1: What problem does the paper aim to solve?
A1: The paper tackles the 「unsupervised clustering」 task: distinguishing different 「semantic clusters」 in a representation space under a specific similarity measure. As this suggests, two things are involved: how to represent the input, and how to measure similarity between representations. The representation spaces produced by existing approaches suffer from severe 「category overlap」, which from the start puts a ceiling on whatever clustering algorithm runs on top of them.
Q2: How does the paper solve this problem?
A2: The paper draws on 「contrastive learning」 and proposes a model called SCCL (Supporting Clustering with Contrastive Learning). The model combines bottom-up instance-wise contrastive learning with top-down clustering, yielding better clustering results.
Q3: How well does the paper's solution work?
A3: The paper evaluates SCCL on short text clustering tasks. Experimental results show that SCCL significantly outperforms the previous SOTA methods on the vast majority of benchmark datasets, beating them by 3%-11% in accuracy and by 4%-15% in normalized mutual information.
SCCL
This section first gives a brief introduction to contrastive learning, then walks through the SCCL model in detail.
Contrastive learning
Self-supervised learning first caught fire in the CV field, where it can be divided into two types: 「generative」 and 「discriminative」 self-supervised learning. VAE and GAN are typical representatives of generative self-supervision. These methods require the model to reconstruct an image or part of an image, a relatively difficult task: pixel-level reconstruction forces the intermediate image encoding to carry many low-level details. Contrastive learning is typical 「discriminative」 self-supervision, and compared with generative methods its task is less difficult. Even so, several contrastive learning models have already surpassed their supervised counterparts, a truly exciting result. No wonder the two deep learning giants Bengio and LeCun declared at ICLR 2020 that Self-Supervised Learning (SSL) is the future of AI.
Contrastive learning has grown ever more popular in recent years, with heavyweights such as Hinton, Yann LeCun, and Kaiming He all repeatedly active in this research direction. From the MoCo series, the SimCLR series, BYOL, and SwAV in CV to the more recent SimCSE in NLP, the various methods learn from one another, each with its own innovations; it is truly a hundred schools contending (and competing). Contrastive learning is a kind of self-supervised learning: such methods do not rely on annotated data but instead learn from unlabeled data. The core idea is to construct similar and dissimilar instance pairs and then learn a representation model under which similar instances end up close together in the representation space while dissimilar instances end up far apart.
The SCCL framework
The overall SCCL pipeline is shown in Figure 2.
SCCL consists of three parts: a neural network feature extractor, a clustering head, and an Instance-CL head. The feature extractor maps the input into the vector representation space; SCCL uses the distilbert-base-nli-stsb-mean-tokens pretrained model released by Sentence Transformers (download: https://huggingface.co/sentence-transformers/distilbert-base-nli-stsb-mean-tokens/tree/main). The Instance-CL head and the clustering head are optimized with the 「contrastive loss and clustering loss」 respectively. The Instance-CL head is a 「single-hidden-layer MLP」 with ReLU activation, input dimension 768 and output dimension 128. The clustering head is a 「linear mapping layer」 of dimension 768×K, where K is the number of clusters. The overall network structure is therefore very concise ~
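To make the structure concrete, here is a minimal PyTorch sketch of the two heads, assuming the 768-d sentence embeddings from the encoder are computed upstream. The class and attribute names are my own, and reading the clustering head's 768×K weights directly as K centroid vectors is an illustrative choice, not the authors' exact code:

```python
import torch
import torch.nn as nn

class SCCLHeads(nn.Module):
    """Sketch of SCCL's two heads on top of a 768-d sentence encoder."""

    def __init__(self, num_clusters: int, hidden_dim: int = 768, contrast_dim: int = 128):
        super().__init__()
        # Instance-CL head: MLP with one ReLU hidden layer, 768 -> 128
        self.instance_cl_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, contrast_dim),
        )
        # Clustering head: 768 x K parameters, read here as K centroids mu_k
        self.centroids = nn.Parameter(torch.randn(num_clusters, hidden_dim) * 0.01)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, 768) vectors from the Sentence Transformer encoder
        return self.instance_cl_head(embeddings)  # (batch, 128) contrastive features
```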
The figure below shows t-SNE visualizations on the SearchSnippets dataset, where Clustering and Instance-CL denote using only one of SCCL's two heads. As it shows, SCCL with both heads combined yields visibly better clusters: with contrastive learning in the mix, SCCL manages to pull apart the overlapping categories.
Instance-CL head
Instance-CL (Instance-wise Contrastive Learning) is by now the most dazzling new star in self-supervision. Instance-CL first applies data augmentation to the samples to obtain an auxiliary dataset, then optimizes on it. During optimization, the contrastive loss pulls augmented samples originating from the same instance close together in the representation space and pushes augmented samples from different instances far apart. In other words, Instance-CL scatters instances of different origins while, to some extent, implicitly gathering similar instances. This property makes it possible to break up overlapping categories, so that the final clustering step can both separate different clusters better and make each cluster more compact, i.e. with smaller within-cluster distances.
With Instance-CL, different instances are well separated in the learned representation space and local invariance is preserved around each instance. However, Instance-CL pushes apart all instances that come from different original samples, ignoring those that originate from different instances yet are semantically similar. Its implicit grouping is therefore not very stable and depends heavily on the amount of data, which limits its generalization ability.
Each instance in a batch of size M is augmented twice, so the augmented dataset contains 2M instances. The two augmentations derived from the same source instance are treated as a positive pair, and the remaining 2M-2 instances are treated as negative samples.
For a positive pair $(\tilde{z}_{i_1}, \tilde{z}_{i_2})$, the loss that tries to separate $\tilde{z}_{i_1}$ from all negative samples is:

$$\ell_{i_1} = -\log \frac{\exp\left(\mathrm{sim}(\tilde{z}_{i_1}, \tilde{z}_{i_2})/\tau\right)}{\sum_{j=1}^{2M} \mathbb{1}_{j \neq i_1} \exp\left(\mathrm{sim}(\tilde{z}_{i_1}, \tilde{z}_{j})/\tau\right)}$$

where $\tilde{z}$ denotes the Instance-CL head output of an augmented sample, $\tau$ is a temperature, and $\mathbb{1}$ is an indicator function. The sim function is the dot product, i.e. $\mathrm{sim}(\tilde{z}_i, \tilde{z}_j) = \tilde{z}_i^{\top} \tilde{z}_j$. The 「Instance-CL loss」 over the entire augmented dataset is therefore:

$$\mathcal{L}_{\text{Instance-CL}} = \frac{1}{2M} \sum_{i=1}^{2M} \ell_i$$
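A minimal sketch of this loss, written by me rather than taken from the authors' repository: rows i and i+M of the input are assumed to be the two augmentations of the same source instance, and the temperature value is illustrative.

```python
import torch
import torch.nn.functional as F

def instance_cl_loss(z: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """NT-Xent-style loss over 2M augmented views.

    z: (2M, d) Instance-CL head outputs; row i and row i + M come from
    the same original instance (M positive pairs in total).
    """
    two_m = z.size(0)
    m = two_m // 2
    sim = z @ z.t() / temperature                 # pairwise dot-product similarities
    sim.fill_diagonal_(float("-inf"))             # 1[j != i]: exclude self-similarity
    # index of the positive view for each anchor: i <-> i + M
    pos_idx = torch.arange(two_m, device=z.device).roll(m)
    # cross-entropy of the positive against the remaining 2M - 1 candidates
    return F.cross_entropy(sim, pos_idx)
```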
In addition, the article explores how three data augmentation methods affect SCCL, and finds that the contextual augmenter works best. PS: more details on this later.
Clustering head
Meanwhile, SCCL performs 「unsupervised clustering」 while encoding category-level semantic information. Unlike Instance-CL, the clustering task focuses on high-level semantic concepts and tries to gather instances from the same semantic category. Suppose there are K categories (clusters), each represented by its centroid $\mu_k$, $k = 1, \dots, K$, and let $e_j$ denote the representation of original instance $x_j$ in the representation space. SCCL uses the Student's t-distribution to compute the probability of assigning instance $x_j$ to cluster $k$:

$$q_{jk} = \frac{\left(1 + \lVert e_j - \mu_k \rVert_2^2 / \alpha\right)^{-\frac{\alpha+1}{2}}}{\sum_{k'=1}^{K} \left(1 + \lVert e_j - \mu_{k'} \rVert_2^2 / \alpha\right)^{-\frac{\alpha+1}{2}}}$$

where $\alpha$ is the degrees of freedom of the t-distribution, set to 1 here.
SCCL uses a linear mapping layer, the cluster head in Figure 2, to approximate the centroid of each category, and iteratively refines it with an auxiliary target distribution, defined as:

$$p_{jk} = \frac{q_{jk}^2 / f_k}{\sum_{k'=1}^{K} q_{jk'}^2 / f_{k'}}, \qquad f_k = \sum_{j} q_{jk}$$

where $f_k$ can be regarded as the soft cluster frequency, approximated within a mini-batch. The target distribution first sharpens the soft assignment probabilities $q_{jk}$ by squaring them, then normalizes by the associated cluster frequencies. This encourages learning from high-confidence assignments while reducing the bias introduced by imbalanced categories.
The cluster assignment distribution $q_j$ is pushed toward the target distribution $p_j$ via the KL divergence:

$$\ell_j^{C} = \mathrm{KL}\left(p_j \,\|\, q_j\right) = \sum_{k=1}^{K} p_{jk} \log \frac{p_{jk}}{q_{jk}}$$

The clustering objective is therefore defined as:

$$\mathcal{L}_{\text{Cluster}} = \frac{1}{M} \sum_{j=1}^{M} \ell_j^{C}$$
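The formulas above translate almost line for line into code. Below is a minimal sketch of my own, under the stated α = 1 default and the mini-batch approximation of the cluster frequencies:

```python
import torch

def cluster_loss(e: torch.Tensor, mu: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """DEC-style clustering loss: KL(p || q) with a sharpened target p.

    e:  (M, d) representations of the *original* (non-augmented) batch
    mu: (K, d) cluster centroids
    """
    # q_jk: Student's t soft assignment of instance j to cluster k
    dist_sq = torch.cdist(e, mu) ** 2                    # (M, K) squared distances
    q = (1.0 + dist_sq / alpha) ** (-(alpha + 1.0) / 2.0)
    q = q / q.sum(dim=1, keepdim=True)

    # p_jk: target distribution, sharpened by squaring and normalized
    # by the soft cluster frequencies f_k (approximated on this mini-batch)
    f = q.sum(dim=0)                                     # (K,) soft frequencies
    p = (q ** 2) / f
    p = p / p.sum(dim=1, keepdim=True)
    p = p.detach()                                       # target held fixed within a step

    # KL(p || q), averaged over the batch
    return (p * (p.log() - q.log())).sum(dim=1).mean()
```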
The overall objective is therefore the sum of the two losses:

$$\mathcal{L} = \mathcal{L}_{\text{Instance-CL}} + \eta \, \mathcal{L}_{\text{Cluster}}$$

where $\eta$ balances the contrastive and clustering terms.
Note that the clustering loss is optimized only on the original dataset and never touches the augmented data, whereas the Instance-CL loss is computed on the augmented dataset.
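Reusing the sketches above, one joint training step could look as follows. The encoder call, the η default, and the surrounding training-loop names are illustrative assumptions, not the authors' code:

```python
def sccl_step(encoder, heads, optimizer, orig_texts, aug1_texts, aug2_texts, eta=1.0):
    """One joint SCCL step; eta is an illustrative loss weight, not the paper's value."""
    emb_orig = encoder(orig_texts)               # (M, 768): clustering loss uses ONLY these
    emb_aug = encoder(aug1_texts + aug2_texts)   # (2M, 768): contrastive loss uses ONLY these

    z = heads.instance_cl_head(emb_aug)          # (2M, 128) contrastive features
    loss = instance_cl_loss(z) + eta * cluster_loss(emb_orig, heads.centroids)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```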
Experimental results
The article validates SCCL's effectiveness on eight short text datasets, using ACC (Accuracy) and NMI (Normalized Mutual Information) as evaluation metrics. The results are shown in Table 1:
As the results above show, SCCL beats the previous SOTA methods on almost all datasets. SCCL's setback on the Biomedical dataset comes down entirely to the low relevance between the task data and the pretraining corpus: the SOTA model there is pretrained on a large biomedical corpus ~
Ablation study: should Instance-CL and Clustering be optimized separately or jointly?
SCCL has two loss functions: the clustering loss and the Instance-CL loss. Should these two losses be optimized one after the other in a pipeline, or jointly? The article also compares using only one of them. The experimental results are shown in Figure 3.
The figure shows that:
1) Using Instance-CL or Clustering alone is worse than using both.
2) Joint optimization (SCCL) beats the pipelined variant (SCCL-Seq, which first optimizes Instance-CL and then Clustering).
Which augmentation method works best?
The article compares three data augmentation methods:
1) WordNet Augmenter (https://github.com/QData/TextAttack)
2) Contextual Augmenter (https://github.com/makcedward/nlpaug)
3) Paraphrase via back translation (https://github.com/pytorch/fairseq/tree/master/examples/paraphraser)
The experimental results of the three methods are shown in Table 3.
Overall, the Contextual Augmenter (Ctxt) performs best across all datasets. As a side note, Ctxt uses a pretrained Transformer (the paper chooses BERT-base and RoBERTa) to find the top-n most suitable words for insertion or substitution. One can also see that some datasets vary substantially across augmentation methods, such as SearchSnippets, while others are less sensitive, such as AgNews, Biomedical, and GoogleNewsTS.
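For reference, this kind of contextual augmentation can be reproduced with the nlpaug library linked above. A minimal sketch, with the model choice and replacement ratio as illustrative settings rather than the paper's exact configuration:

```python
import nlpaug.augmenter.word as naw

# Contextual word substitution: a pretrained masked LM proposes
# context-appropriate replacement words for a fraction of the tokens.
aug = naw.ContextualWordEmbsAug(
    model_path="bert-base-uncased",  # the paper reports BERT-base and RoBERTa
    action="substitute",             # "insert" is the other supported action
    aug_p=0.2,                       # fraction of words to replace (illustrative)
)

text = "how do I sort a python dictionary by value"
print(aug.augment(text))  # a context-aware paraphrase of the input
```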
The article further tests mixing different augmentation methods; the results are shown in Figure 5.
Blue denotes augmenting with the Contextual Augmenter only; orange denotes applying the Contextual Augmenter and the CharSwap Augmenter 「in sequence」. The experimental results show that:
- On the GoogleNews-TS dataset, mixing the two augmentations does bring an improvement, and performance does not drop as the proportion of replaced words in the augmented data grows;
- The StackOverflow dataset behaves quite differently: as the replacement ratio increases, using the two augmentation methods leads to a significant drop in performance.
To dig into the reasons, the researchers also compared the cosine similarity between the original and augmented texts under different replacement ratios and different numbers of augmentation methods (one or two). The results show that when the two augmentation methods are mixed (orange), the similarity between the augmented and original text gradually decreases. In other words, after both augmentations are applied, the augmented StackOverflow data drifts far away from the original text in the representation space. This explains why mixing two augmentation methods does not necessarily improve model performance.
Summary
Building on Instance-CL, this paper proposes SCCL, a model for the unsupervised clustering task. By jointly optimizing the Instance-CL loss and the clustering loss, SCCL enlarges the distances between different categories in the text semantic space while shrinking the within-class distances. SCCL is evaluated thoroughly on eight short text clustering datasets. Experimental results show that it achieves SOTA results on most of them, improving Accuracy by 3%-11% and NMI by 4%-15%.