[Natural Language Processing] [Vector Representation] AugSBERT: A Data Augmentation Method for Improving Bi-Encoders on Pairwise Sentence Scoring Tasks
2022-07-25 22:45:00 【BQW_】
Paper: https://arxiv.org/pdf/2010.08240.pdf
I. Introduction
Pairwise sentence scoring tasks are widely used in NLP, for example in information retrieval, question answering, duplicate question detection, and clustering. For many sentence-pair scoring tasks, the state-of-the-art approach is BERT: both sentences are passed to the network, and attention is applied across all input tokens. This way of feeding both sentences to the network simultaneously is called a cross-encoder.
One drawback of cross-encoders is their computational cost, which is prohibitive for many tasks. For example, clustering 10,000 sentences with a cross-encoder requires scoring a quadratic number of sentence pairs, which takes about 65 hours with BERT. Cross-encoders are also impractical for end-to-end information retrieval, because they do not produce an independent representation of each input that could be indexed. In contrast, a bi-encoder such as Sentence-BERT (SBERT) encodes each sentence independently and maps it into a dense vector space, which allows efficient indexing and comparison. With a bi-encoder, the clustering time for 10,000 sentences drops from 65 hours to about 5 seconds. Many real-world applications therefore depend on the quality of bi-encoders.
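The cost gap comes from how many model forward passes each architecture needs. A minimal illustration (toy helper functions, not part of the paper):

```python
def cross_encoder_calls(n: int) -> int:
    """A cross-encoder needs one full forward pass per sentence pair."""
    return n * (n - 1) // 2

def bi_encoder_calls(n: int) -> int:
    """A bi-encoder encodes each sentence once; comparing the cached
    embeddings (e.g. by cosine similarity) is cheap."""
    return n

n = 10_000
pairs = cross_encoder_calls(n)    # 49,995,000 forward passes
encodes = bi_encoder_calls(n)     # 10,000 forward passes
```

Nearly 50 million BERT forward passes versus 10,000 is what turns 65 hours into seconds.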

The disadvantage of bi-encoders is that they perform worse than cross-encoders. The figure above compares a fine-tuned cross-encoder (BERT) with a fine-tuned bi-encoder (SBERT) on the popular English STS benchmark.
The performance gap is largest when only little training data is available. The BERT cross-encoder can compare both inputs simultaneously, whereas the SBERT bi-encoder must solve the more challenging task of mapping each input independently into a meaningful vector space, which requires a sufficient number of training examples for fine-tuning.
In this paper, the authors present a data augmentation method called Augmented SBERT (AugSBERT), which uses a BERT cross-encoder to improve the performance of an SBERT bi-encoder. Concretely, the cross-encoder is used to label new input pairs, which are added to the bi-encoder's training set. Fine-tuning the SBERT bi-encoder on this larger, augmented training set yields a significant performance improvement. As the authors show, selecting the right input pairs for soft labeling with the cross-encoder is crucial. The method can easily be applied to many pairwise classification and regression tasks.
First, the authors evaluate the proposed AugSBERT method on four tasks: argument similarity, semantic textual similarity, duplicate question detection, and news paraphrase identification. Compared with the current state-of-the-art SBERT bi-encoder, it improves performance by 1 to 6 percentage points. Second, the authors demonstrate the advantage of AugSBERT for domain adaptation. Because a bi-encoder cannot map sentences from a new domain into a meaningful vector space, the performance of an SBERT bi-encoder drops more on a target domain than that of a BERT cross-encoder. In this scenario, AugSBERT achieves improvements of up to 37 percentage points.
II. Augmented SBERT
Given a pre-trained cross-encoder that performs well on the task, sentence pairs are sampled with a specific sampling strategy and labeled by the cross-encoder. These weakly labeled examples are called the silver dataset and are merged with the original (gold) training set. The bi-encoder is then trained on this extended training set. The resulting model is called Augmented SBERT (AugSBERT). The full process is shown in the figure above:
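The pipeline above can be sketched in a few lines. Note that `cross_encoder_score` below is a hypothetical stand-in (simple token overlap) for the paper's fine-tuned BERT cross-encoder, used only to make the data flow concrete:

```python
def cross_encoder_score(a: str, b: str) -> float:
    """Stand-in for a fine-tuned cross-encoder; here: Jaccard token overlap."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def augment(gold: list, candidate_pairs: list) -> list:
    """Weakly label sampled pairs (the silver dataset) and merge with gold.
    The bi-encoder is then fine-tuned on the merged set."""
    silver = [(a, b, cross_encoder_score(a, b)) for a, b in candidate_pairs]
    return gold + silver

gold = [("a cat sat", "a cat sits", 0.9)]                 # human-labeled
train_set = augment(gold, [("dogs bark", "dogs bark loudly")])  # + silver
```

In the real method, the candidate pairs come from the sampling strategies described next, and the merged set is used to fine-tune SBERT.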
1. Pair sampling strategies
The new pairs labeled by the cross-encoder can be new data, or they can be synthesized by recombining the individual sentences of the training set. In these experiments, the sentences of the original training set are reused. This is possible whenever not all combinations have already been labeled, which is almost always the case: n sentences yield n×(n−1)/2 possible pairs. Weakly labeling all possible pairs would be computationally expensive and may not improve performance; instead, choosing the right sampling strategy is crucial.
Random Sampling (RS)

Sentence pairs are sampled at random and weakly labeled with the cross-encoder. Two randomly sampled sentences are usually dissimilar, so positive pairs are rare. This skewed label distribution biases the silver dataset heavily toward negative pairs.

Kernel Density Estimation (KDE)
The goal is to make the label distribution of the silver dataset similar to that of the original gold training set. To this end, the authors weakly label a large number of randomly sampled pairs and keep only a subset. For classification tasks, all positive pairs are kept, and negative pairs are then sampled from the remaining random negatives so that the positive/negative ratio matches the gold training set. For regression tasks, kernel density estimation (KDE) is used to estimate the continuous density functions $F_{gold}(s)$ and $F_{silver}(s)$ of the score $s$. A sampled pair with score $s$ is then retained with probability $Q(s)$, chosen to minimize the KL divergence between the two distributions:
$$Q(s)= \begin{cases} 1 & \text{if } F_{gold}(s)\geq F_{silver}(s) \\[4pt] \dfrac{F_{gold}(s)}{F_{silver}(s)} & \text{if } F_{gold}(s)< F_{silver}(s) \end{cases}$$
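This retention rule amounts to acceptance sampling. A minimal sketch, where `f_gold` and `f_silver` stand for the fitted density estimates (in practice they would come from an actual KDE, e.g. `scipy.stats.gaussian_kde`):

```python
import random

def keep_probability(f_gold: float, f_silver: float) -> float:
    """Q(s): keep the pair with certainty if the gold density at its
    score is at least the silver density; otherwise keep it with
    probability F_gold(s) / F_silver(s)."""
    if f_gold >= f_silver:
        return 1.0
    return f_gold / f_silver

def subsample(scored_pairs, f_gold, f_silver, rng=random.random):
    """Acceptance-sample weakly labeled pairs so the silver score
    distribution moves toward the gold distribution.
    f_gold / f_silver are callables returning density estimates."""
    return [(pair, s) for pair, s in scored_pairs
            if rng() < keep_probability(f_gold(s), f_silver(s))]
```

Over-represented score regions of the silver data are thinned out until the two distributions roughly match.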
Note that the KDE sampling strategy is computationally inefficient, since many randomly sampled pairs must be labeled only to be discarded afterwards.

BM25 Sampling (BM25)
In information retrieval, the Okapi BM25 algorithm scores documents based on lexical overlap and underlies the scoring functions of many search engines. The authors use ElasticSearch to build an index for fast retrieval of results relevant to a query. In their experiments, every sentence is indexed, and for each query sentence the top-k most similar sentences are retrieved. These pairs are then labeled with the cross-encoder. Indexing and retrieving similar sentences is efficient, and all weakly labeled pairs are used in the silver dataset.

Semantic Search Sampling (SS)
One disadvantage of BM25 is that it only finds sentences with lexical overlap. Synonymous sentences with few or no overlapping words are not returned and therefore never become part of the silver dataset. Instead, the authors train a bi-encoder on the gold training set and use it to sample further similar sentence pairs: cosine similarity is used to retrieve the top-k most similar sentences for each query. For large datasets, a library such as Faiss can retrieve the k most similar sentences quickly.

BM25 + Semantic Search Sampling (BM25-S.S.)
This strategy applies both BM25 and the semantic search strategy (SS). Aggregating the two helps capture sentences that are similar both lexically and semantically, but it also skews the label distribution toward negative pairs.
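The retrieval step shared by the SS strategy can be sketched with plain cosine similarity over precomputed sentence embeddings. This is a toy illustration (the embeddings would come from the bi-encoder; at scale, Faiss replaces the brute-force loop):

```python
import math
from typing import List, Tuple

def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def top_k(query_vec: List[float],
          corpus_vecs: List[List[float]], k: int) -> List[Tuple[int, float]]:
    """Indices and cosine scores of the k most similar corpus sentences."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored[:k]
```

Each (query, retrieved-sentence) pair would then be weakly labeled by the cross-encoder and added to the silver dataset.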
2. Seed optimization
Dodge et al. showed that Transformer-based models such as BERT depend strongly on the random seed, since different seeds converge to different minima that generalize differently to unseen data. In these experiments, the authors apply seed optimization: they train models with 5 random seeds and select the one that performs best on the validation set. To speed this up, early stopping is applied at 20% of the training steps, and only the best model is trained to completion.
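A toy sketch of this procedure, where `dev_score` is a made-up stand-in for "validation score of a model trained with this seed for this many steps" (the real loop would fine-tune SBERT and evaluate on the dev set):

```python
import random

def dev_score(seed: int, steps: int) -> float:
    """Hypothetical dev-set score: seed-dependent optimum, reached
    gradually as training progresses (toy model of convergence)."""
    base = random.Random(seed).random()
    return base * min(1.0, steps / 100)

def seed_optimization(seeds, total_steps=100, early_frac=0.2):
    """Train every seed for the first 20% of the steps, then continue
    only the most promising run to the end."""
    early = int(total_steps * early_frac)
    partial = {s: dev_score(s, early) for s in seeds}
    best = max(partial, key=partial.get)
    return best, dev_score(best, total_steps)
```

This keeps the cost close to a single full training run plus a few short partial runs, instead of five full runs.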
III. Domain adaptation with AugSBERT
So far, AugSBERT has been discussed in the in-domain setting, i.e. when training and test sets come from the same domain. On out-of-domain data, however, SBERT is expected to perform considerably worse, because it cannot map sentences it has never seen into a meaningful vector space. Unfortunately, annotated data for a new domain is usually unavailable.
The authors therefore evaluate the proposed data augmentation strategy for domain adaptation: first, a cross-encoder is fine-tuned on the labeled source-domain data. After fine-tuning, this cross-encoder is used to label sentence pairs from the target domain. Once labeling is complete, the bi-encoder is trained on these labeled target-domain sentence pairs.
IV. Experiments
