[Natural Language Processing] [Vector Representation] AugSBERT: A Data Augmentation Method for Improving Bi-Encoders on Pairwise Sentence Scoring Tasks
2022-07-25 22:45:00 【BQW_】
Paper: https://arxiv.org/pdf/2010.08240.pdf
I. Introduction
Pairwise sentence scoring tasks are widely used in NLP, for example in information retrieval, question answering, duplicate question detection, and clustering. For many sentence-pair scoring tasks, the state-of-the-art approach is BERT: both sentences are passed to the network, and attention is applied across all input tokens. This way of feeding both sentences to the network simultaneously is called a cross-encoder.
One drawback of cross-encoders is their computational cost, which is prohibitive for many tasks. For example, clustering 10,000 sentences with a cross-encoder requires scoring a quadratic number of sentence pairs, which takes about 65 hours with BERT. Cross-encoders are also impractical for end-to-end information retrieval, because they do not produce an independent representation of each input that could be indexed. In contrast, a bi-encoder such as Sentence-BERT (SBERT) encodes each sentence independently and maps it into a dense vector space, which allows efficient indexing and comparison. With a bi-encoder, the clustering time for 10,000 sentences drops from 65 hours to about 5 seconds. Many real-world applications therefore depend on the quality of bi-encoders.
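The cost gap comes from how many model forward passes each architecture needs. A minimal illustration (toy helper functions, not part of the paper):

```python
def cross_encoder_calls(n: int) -> int:
    """A cross-encoder needs one full forward pass per sentence pair."""
    return n * (n - 1) // 2

def bi_encoder_calls(n: int) -> int:
    """A bi-encoder encodes each sentence once; comparing the cached
    embeddings (e.g. by cosine similarity) is cheap."""
    return n

n = 10_000
pairs = cross_encoder_calls(n)    # 49,995,000 forward passes
encodes = bi_encoder_calls(n)     # 10,000 forward passes
```

Nearly 50 million BERT forward passes versus 10,000 is what turns 65 hours into seconds.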

The disadvantage of bi-encoders is that they perform worse than cross-encoders. The figure above compares a fine-tuned cross-encoder (BERT) with a fine-tuned bi-encoder (SBERT) on the popular English STS benchmark.
The performance gap is largest when only little training data is available. The BERT cross-encoder can compare both inputs simultaneously, whereas the SBERT bi-encoder must solve the more challenging task of mapping each input independently into a meaningful vector space, which requires a sufficient number of training examples for fine-tuning.
In this paper, the authors present a data augmentation method called Augmented SBERT (AugSBERT), which uses a BERT cross-encoder to improve the performance of an SBERT bi-encoder. Concretely, the cross-encoder is used to label new input pairs, which are added to the bi-encoder's training set. Fine-tuning the SBERT bi-encoder on this larger, augmented training set yields a significant performance improvement. As the authors show, selecting the right input pairs for soft labeling with the cross-encoder is crucial. The method can easily be applied to many pairwise classification and regression tasks.
First, the authors evaluate the proposed AugSBERT method on four tasks: argument similarity, semantic textual similarity, duplicate question detection, and news paraphrase identification. Compared with the current state-of-the-art SBERT bi-encoder, it improves performance by 1 to 6 percentage points. Second, the authors demonstrate the advantage of AugSBERT for domain adaptation. Because a bi-encoder cannot map sentences from a new domain into a meaningful vector space, the performance of an SBERT bi-encoder drops more on a target domain than that of a BERT cross-encoder. In this scenario, AugSBERT achieves improvements of up to 37 percentage points.
II. Augmented SBERT
Given a pre-trained cross-encoder that performs well on the task, sentence pairs are sampled with a specific sampling strategy and labeled by the cross-encoder. These weakly labeled examples are called the silver dataset and are merged with the original (gold) training set. The bi-encoder is then trained on this extended training set. The resulting model is called Augmented SBERT (AugSBERT). The full process is shown in the figure above:
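The pipeline above can be sketched in a few lines. Note that `cross_encoder_score` below is a hypothetical stand-in (simple token overlap) for the paper's fine-tuned BERT cross-encoder, used only to make the data flow concrete:

```python
def cross_encoder_score(a: str, b: str) -> float:
    """Stand-in for a fine-tuned cross-encoder; here: Jaccard token overlap."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def augment(gold: list, candidate_pairs: list) -> list:
    """Weakly label sampled pairs (the silver dataset) and merge with gold.
    The bi-encoder is then fine-tuned on the merged set."""
    silver = [(a, b, cross_encoder_score(a, b)) for a, b in candidate_pairs]
    return gold + silver

gold = [("a cat sat", "a cat sits", 0.9)]                 # human-labeled
train_set = augment(gold, [("dogs bark", "dogs bark loudly")])  # + silver
```

In the real method, the candidate pairs come from the sampling strategies described next, and the merged set is used to fine-tune SBERT.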
1. Pair sampling strategies
The new pairs labeled by the cross-encoder can be new data, or they can be synthesized by recombining the individual sentences of the training set. In these experiments, the sentences of the original training set are reused. This is possible whenever not all combinations have already been labeled, which is almost always the case: n sentences yield n×(n−1)/2 possible pairs. Weakly labeling all possible pairs would be computationally expensive and may not improve performance; instead, choosing the right sampling strategy is crucial.
Random Sampling (RS)

Sentence pairs are sampled at random and weakly labeled with the cross-encoder. Two randomly sampled sentences are usually dissimilar, so positive pairs are rare. This skewed label distribution biases the silver dataset heavily toward negative pairs.

Kernel Density Estimation (KDE)
The goal is to make the label distribution of the silver dataset similar to that of the original gold training set. To this end, the authors weakly label a large number of randomly sampled pairs and keep only a subset. For classification tasks, all positive pairs are kept, and negative pairs are then sampled from the remaining random negatives so that the positive/negative ratio matches the gold training set. For regression tasks, kernel density estimation (KDE) is used to estimate the continuous density functions $F_{gold}(s)$ and $F_{silver}(s)$ of the score $s$. A sampled pair with score $s$ is then retained with probability $Q(s)$, chosen to minimize the KL divergence between the two distributions:
$$Q(s)= \begin{cases} 1 & \text{if } F_{gold}(s)\geq F_{silver}(s) \\[4pt] \dfrac{F_{gold}(s)}{F_{silver}(s)} & \text{if } F_{gold}(s)< F_{silver}(s) \end{cases}$$
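This retention rule amounts to acceptance sampling. A minimal sketch, where `f_gold` and `f_silver` stand for the fitted density estimates (in practice they would come from an actual KDE, e.g. `scipy.stats.gaussian_kde`):

```python
import random

def keep_probability(f_gold: float, f_silver: float) -> float:
    """Q(s): keep the pair with certainty if the gold density at its
    score is at least the silver density; otherwise keep it with
    probability F_gold(s) / F_silver(s)."""
    if f_gold >= f_silver:
        return 1.0
    return f_gold / f_silver

def subsample(scored_pairs, f_gold, f_silver, rng=random.random):
    """Acceptance-sample weakly labeled pairs so the silver score
    distribution moves toward the gold distribution.
    f_gold / f_silver are callables returning density estimates."""
    return [(pair, s) for pair, s in scored_pairs
            if rng() < keep_probability(f_gold(s), f_silver(s))]
```

Over-represented score regions of the silver data are thinned out until the two distributions roughly match.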
Note that the KDE sampling strategy is computationally inefficient, since many randomly sampled pairs must be labeled only to be discarded afterwards.

BM25 Sampling (BM25)
In information retrieval, the Okapi BM25 algorithm scores documents based on lexical overlap and underlies the scoring functions of many search engines. The authors use ElasticSearch to build an index for fast retrieval of results relevant to a query. In their experiments, every sentence is indexed, and for each query sentence the top-k most similar sentences are retrieved. These pairs are then labeled with the cross-encoder. Indexing and retrieving similar sentences is efficient, and all weakly labeled pairs are used in the silver dataset.

Semantic Search Sampling (SS)
One disadvantage of BM25 is that it only finds sentences with lexical overlap. Synonymous sentences with few or no overlapping words are not returned and therefore never become part of the silver dataset. Instead, the authors train a bi-encoder on the gold training set and use it to sample further similar sentence pairs: cosine similarity is used to retrieve the top-k most similar sentences for each query. For large datasets, a library such as Faiss can retrieve the k most similar sentences quickly.

BM25 + Semantic Search Sampling (BM25-S.S.)
This strategy applies both BM25 and the semantic search strategy (SS). Aggregating the two helps capture sentences that are similar both lexically and semantically, but it also skews the label distribution toward negative pairs.
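The retrieval step shared by the SS strategy can be sketched with plain cosine similarity over precomputed sentence embeddings. This is a toy illustration (the embeddings would come from the bi-encoder; at scale, Faiss replaces the brute-force loop):

```python
import math
from typing import List, Tuple

def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def top_k(query_vec: List[float],
          corpus_vecs: List[List[float]], k: int) -> List[Tuple[int, float]]:
    """Indices and cosine scores of the k most similar corpus sentences."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored[:k]
```

Each (query, retrieved-sentence) pair would then be weakly labeled by the cross-encoder and added to the silver dataset.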
2. Seed optimization
Dodge et al. showed that Transformer-based models such as BERT depend strongly on the random seed, since different seeds converge to different minima that generalize differently to unseen data. In these experiments, the authors apply seed optimization: they train models with 5 random seeds and select the one that performs best on the validation set. To speed this up, early stopping is applied at 20% of the training steps, and only the best model is trained to completion.
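A toy sketch of this procedure, where `dev_score` is a made-up stand-in for "validation score of a model trained with this seed for this many steps" (the real loop would fine-tune SBERT and evaluate on the dev set):

```python
import random

def dev_score(seed: int, steps: int) -> float:
    """Hypothetical dev-set score: seed-dependent optimum, reached
    gradually as training progresses (toy model of convergence)."""
    base = random.Random(seed).random()
    return base * min(1.0, steps / 100)

def seed_optimization(seeds, total_steps=100, early_frac=0.2):
    """Train every seed for the first 20% of the steps, then continue
    only the most promising run to the end."""
    early = int(total_steps * early_frac)
    partial = {s: dev_score(s, early) for s in seeds}
    best = max(partial, key=partial.get)
    return best, dev_score(best, total_steps)
```

This keeps the cost close to a single full training run plus a few short partial runs, instead of five full runs.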
III. Domain adaptation with AugSBERT
So far, AugSBERT has been discussed in the in-domain setting, i.e. when training and test sets come from the same domain. On out-of-domain data, however, SBERT is expected to perform considerably worse, because it cannot map sentences it has never seen into a meaningful vector space. Unfortunately, annotated data for a new domain is usually unavailable.
The authors therefore evaluate the proposed data augmentation strategy for domain adaptation: first, a cross-encoder is fine-tuned on the labeled source-domain data. After fine-tuning, this cross-encoder is used to label sentence pairs from the target domain. Once labeling is complete, the bi-encoder is trained on these labeled target-domain sentence pairs.
IV. Experiments
