
[Natural Language Processing] [Vector Representation] AugSBERT: A Data Augmentation Method to Improve Bi-Encoders for Pairwise Sentence Scoring Tasks

2022-07-25 22:45:00 BQW_

《Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks》

Paper link: https://arxiv.org/pdf/2010.08240.pdf

I. Introduction

Sentence-pair scoring tasks are widely used in NLP, for example in information retrieval, question answering, duplicate question detection, and clustering. For many sentence-pair scoring tasks, the state-of-the-art approach is BERT: both sentences are passed to the network together, and attention is applied across all input tokens. A model that consumes both sentences simultaneously in this way is called a cross-encoder.

One drawback of cross-encoders is that the computational cost is too high for many tasks. For example, clustering 10,000 sentences with a cross-encoder requires scoring on the order of n^2 sentence pairs, which takes about 65 hours with BERT. End-to-end information retrieval is also impractical with a cross-encoder, because it does not produce an independent representation for each input that could be indexed. In contrast, a bi-encoder such as Sentence-BERT (SBERT) encodes each sentence independently and maps it into a dense vector space, which allows efficient indexing and comparison. Clustering 10,000 sentences then drops from 65 hours to about 5 seconds. Many real-world applications therefore depend on the quality of bi-encoders.

[Figure: fine-tuned cross-encoder (BERT) vs. fine-tuned bi-encoder (SBERT) on the English STS benchmark]

The drawback of the bi-encoder is that it performs worse than the cross-encoder. The figure above compares a fine-tuned cross-encoder (BERT) with a fine-tuned bi-encoder (SBERT) on the popular English STS benchmark dataset.

The performance gap is largest when only little training data is available. The BERT cross-encoder can compare both inputs simultaneously, while the SBERT bi-encoder has to solve the harder task of mapping each input independently into a meaningful vector space, which requires a sufficient number of training examples for fine-tuning.

In this paper, the authors present a data augmentation method called Augmented SBERT (AugSBERT), which uses a BERT cross-encoder to improve the performance of an SBERT bi-encoder. Concretely, the cross-encoder is used to label new input sentence pairs, which are added to the bi-encoder's training set. Fine-tuning the SBERT bi-encoder on this larger, augmented training set yields a significant improvement. As the authors show, selecting the right input pairs for the cross-encoder to soft-label is crucial for the improvement. The method can easily be applied to many pairwise classification and regression tasks.

First, the authors evaluate AugSBERT on four tasks: argument similarity, semantic textual similarity, duplicate question detection, and news paraphrase identification. Compared with the previous state-of-the-art SBERT bi-encoder, it improves performance by 1 to 6 percentage points. Second, the authors demonstrate the advantage of AugSBERT for domain adaptation. Because a bi-encoder cannot map sentences from a new domain into a sensible vector space, the performance of an SBERT bi-encoder drops more on the target domain than that of a BERT cross-encoder. In this scenario, AugSBERT achieves improvements of up to 37 percentage points.

II. Augmented SBERT

Given a pre-trained cross-encoder that already performs well, sentence pairs are sampled with a specific sampling strategy and labeled with this cross-encoder. These weakly labeled examples are called the silver dataset, and they are merged into the original (gold) training data. The bi-encoder is then trained on the extended training set; the resulting model is called Augmented SBERT (AugSBERT). The overall process is as follows:
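The core of this pipeline (weak labeling plus merging into the gold data) can be sketched in a few lines. This is a minimal illustration, not the paper's code: `cross_encoder_score` is a hypothetical stand-in for the scoring function of a fine-tuned cross-encoder (e.g. `CrossEncoder.predict` in the sentence-transformers library).

```python
def build_silver_dataset(sentence_pairs, cross_encoder_score):
    """Weakly label sampled sentence pairs with a fine-tuned cross-encoder.

    `cross_encoder_score(a, b)` is a hypothetical hook standing in for a real
    cross-encoder's scoring function.
    """
    return [(a, b, cross_encoder_score(a, b)) for a, b in sentence_pairs]


def augmented_training_set(gold, silver):
    # AugSBERT fine-tunes the bi-encoder on gold + silver combined.
    return list(gold) + list(silver)
```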

1. Sentence-pair sampling strategies

The new pairs to be labeled by the cross-encoder can come from new data, or they can be synthesized by recombining the individual sentences already in the training set. In the experiments, the authors reuse the sentences from the original training set. This is possible because by no means all combinations are labeled: for n sentences there are n×(n−1)/2 possible pairs, so the labeled pairs cover only a tiny fraction. Weakly labeling all possible combinations would be computationally very expensive and would not necessarily improve performance. Instead, choosing the right sampling strategy is decisive for the final performance.
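To see why exhaustive labeling is infeasible, the pair count can be computed directly. A small helper (illustrative only):

```python
def candidate_pairs(n: int) -> int:
    # Number of unordered sentence pairs that can be formed from n sentences.
    return n * (n - 1) // 2

# The 10,000-sentence clustering example already yields ~50 million candidate
# pairs, which is why weakly labeling every combination is impractical.
```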

  • Random Sampling (RS)

    Randomly sample sentence pairs and weakly label them with the cross-encoder. Two randomly chosen sentences are usually dissimilar, so positive pairs are very rare. This skewed label distribution biases the silver dataset heavily toward negative pairs.
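A minimal sketch of this strategy, assuming a plain list of sentences (the function name is illustrative, not from the paper):

```python
import random

def random_sample_pairs(sentences, n_pairs, seed=42):
    # Draw pairs of two distinct sentences uniformly at random; with typical
    # corpora almost all such pairs are dissimilar (negative) examples.
    rng = random.Random(seed)
    return [tuple(rng.sample(sentences, 2)) for _ in range(n_pairs)]
```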

  • Kernel Density Estimation (KDE)

    The goal is to make the label distribution of the silver dataset similar to that of the original (gold) training set. To this end, the authors weakly label a large number of randomly sampled pairs and then keep only a subset of them. For classification tasks, all positive pairs are kept, and negative pairs are subsampled from the remaining random negatives so that the positive/negative ratio matches the original training set. For regression tasks, kernel density estimation (KDE) is used to estimate the continuous density functions F_gold(s) and F_silver(s) of the score s. A pair with score s is then kept with probability Q(s), chosen to minimize the KL divergence between the two distributions:
    $$Q(s)=\begin{cases}1 & \text{if } F_{gold}(s)\geq F_{silver}(s)\\ \frac{F_{gold}(s)}{F_{silver}(s)} & \text{if } F_{gold}(s)< F_{silver}(s)\end{cases}$$
    Note that the KDE sampling strategy is computationally inefficient, since it requires labeling many random pairs that may subsequently be discarded.
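The retention rule Q(s) can be sketched directly. The hand-rolled Gaussian KDE below is only an illustrative stand-in for a real implementation (e.g. `scipy.stats.gaussian_kde`), and the function names are hypothetical:

```python
import math
import random

def gaussian_kde(samples, bandwidth=0.1):
    """Minimal 1-D Gaussian kernel density estimate (illustrative only)."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))
    def density(s):
        return norm * sum(math.exp(-0.5 * ((s - x) / bandwidth) ** 2) for x in samples)
    return density

def keep_probability(s, f_gold, f_silver):
    # Q(s) from the formula above: keep silver pairs so that the silver
    # score distribution approaches the gold distribution.
    if f_gold(s) >= f_silver(s):
        return 1.0
    return f_gold(s) / f_silver(s)

def kde_subsample(silver, gold_scores, seed=0):
    """Keep each weakly labeled (a, b, score) triple with probability Q(score)."""
    rng = random.Random(seed)
    f_gold = gaussian_kde(gold_scores)
    f_silver = gaussian_kde([score for _, _, score in silver])
    return [t for t in silver if rng.random() <= keep_probability(t[2], f_gold, f_silver)]
```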

  • BM25 Sampling (BM25)

    In information retrieval, the Okapi BM25 algorithm is based on lexical overlap and is used in the scoring functions of many search engines. The authors use ElasticSearch to build an index for fast retrieval of query-relevant results. In the experiments, every sentence is indexed, and for each query sentence the top-k most similar sentences are retrieved; these pairs are then labeled with the cross-encoder. Indexing and retrieving similar sentences is efficient, and all weakly labeled pairs are used in the silver dataset.
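The paper uses ElasticSearch for this step; a minimal in-memory BM25 sketch conveys the same idea on a small corpus (the `k1` and `b` defaults follow common Okapi BM25 conventions, and the function is an assumption for illustration, not the authors' code):

```python
import math
from collections import Counter

def bm25_top_k(sentences, k=3, k1=1.5, b=0.75):
    """For every sentence, retrieve the k most lexically similar other
    sentences under a minimal Okapi BM25 scoring scheme."""
    docs = [s.lower().split() for s in sentences]
    avgdl = sum(len(d) for d in docs) / len(docs)
    df = Counter(t for d in docs for t in set(d))
    n = len(docs)

    def idf(t):
        return math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))

    def score(query_tokens, doc):
        tf = Counter(doc)
        return sum(
            idf(t) * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
            for t in query_tokens if t in tf
        )

    pairs = []
    for i, q in enumerate(docs):
        ranked = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: score(q, docs[j]),
            reverse=True,
        )
        pairs.extend((sentences[i], sentences[j]) for j in ranked[:k])
    return pairs
```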

  • Semantic Search Sampling (SS)

    One drawback of BM25 is that it only finds sentences with lexical overlap. Paraphrases with no or few overlapping words are not returned and therefore never become part of the silver dataset. Instead, the authors train a bi-encoder on the gold training set and use it to sample further similar sentence pairs: for each sentence, the top-k most similar sentences by cosine similarity are retrieved. For large datasets, a library such as Faiss can be used to retrieve the k most similar sentences efficiently.
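A brute-force sketch of this strategy, assuming an `embed` callable that stands in for a fine-tuned bi-encoder (e.g. `SentenceTransformer.encode`); at scale, the inner loop would be replaced by a Faiss index:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def semantic_search_pairs(sentences, embed, k=3):
    """For each sentence, pair it with its k nearest neighbors by cosine
    similarity in the bi-encoder's embedding space."""
    vecs = [embed(s) for s in sentences]
    pairs = []
    for i, v in enumerate(vecs):
        ranked = sorted((j for j in range(len(vecs)) if j != i),
                        key=lambda j: cosine(v, vecs[j]), reverse=True)
        pairs.extend((sentences[i], sentences[j]) for j in ranked[:k])
    return pairs
```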

  • BM25 + Semantic Search Sampling (BM25-S.S.)

    Apply BM25 and the semantic search strategy (SS) simultaneously. Combining the two strategies helps to capture pairs that are similar both lexically and semantically, but it skews the label distribution toward negative pairs.

2. Seed optimization

Dodge et al. showed that Transformer-based models such as BERT depend heavily on the random seed, because they converge to different minima that generalize differently to unseen data. In the experiments, the authors therefore apply seed optimization: they train models with 5 random seeds and select the one that performs best on the validation set. To speed this up, early stopping is applied after 20% of the training steps, and only the best model is trained to completion.
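The procedure can be sketched as a small selection loop. `train_fn` and `dev_score_fn` are hypothetical hooks around a real training loop, not an actual API:

```python
def seed_optimization(train_fn, dev_score_fn, seeds=(0, 1, 2, 3, 4), early_frac=0.2):
    """Train one model per random seed for the first `early_frac` of the
    training steps, score each candidate on the dev set, then train only
    the best seed to completion."""
    scored = [(dev_score_fn(train_fn(seed=s, step_fraction=early_frac)), s)
              for s in seeds]
    _, best_seed = max(scored)  # pick the seed with the best dev score
    return train_fn(seed=best_seed, step_fraction=1.0)
```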

III. Domain Adaptation with AugSBERT

So far, AugSBERT has been discussed in an in-domain setting, i.e. when the training and test sets come from the same domain. However, SBERT is expected to perform worse on out-of-domain data, because it cannot map unseen sentences into a meaningful vector space. Unfortunately, labeled data for a new domain is usually unavailable.

Therefore, the authors evaluate the proposed data augmentation strategy for domain adaptation: first, a cross-encoder is trained on the labeled source-domain data. After fine-tuning, this cross-encoder is used to label sentence pairs from the target domain. Once labeled, these target-domain pairs are used to train the bi-encoder.
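The three-step recipe above can be sketched with hypothetical training hooks standing in for real cross-/bi-encoder fine-tuning:

```python
def domain_adaptation_augsbert(source_gold, target_pairs,
                               train_cross_encoder, train_bi_encoder):
    """AugSBERT for domain adaptation (illustrative sketch)."""
    cross = train_cross_encoder(source_gold)          # 1. fit on labeled source data
    target_silver = [(a, b, cross(a, b))              # 2. weakly label target pairs
                     for a, b in target_pairs]
    return train_bi_encoder(target_silver)            # 3. train bi-encoder on target silver data
```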

IV. Experiments

[Figure: experimental results]
