Text matching: [NAACL 2021] AugSBERT
2022-06-30 14:48:00 [User 1621453]
Background and challenges
Paper: https://arxiv.org/abs/2010.08240
Currently, state-of-the-art NLP architectures often build on a BERT model pre-trained on large text corpora such as Wikipedia and the Toronto Books Corpus. By fine-tuning deeply pre-trained BERT, many alternative architectures have been derived, for example DeBERTa, RetriBERT, RoBERTa, and others, which substantially improve the benchmarks on various language-understanding tasks. Among common NLP tasks, pairwise sentence scoring has wide applications in information retrieval, question answering, duplicate question detection, and clustering. Typically, two approaches are used: bi-encoders and cross-encoders.
- Cross-encoders: perform full (cross) self-attention over a given input and label candidate, and often achieve higher accuracy than their bi-encoder counterparts. However, they must recompute the encoding for every input and label; as a result, they produce no independent representation of the input, cannot support end-to-end retrieval, and are very slow at test time. For example, clustering 10,000 sentences has quadratic complexity and takes about 65 hours.
- Bi-encoders: perform self-attention over the input and the candidate label separately, map each into a dense vector space, and combine them only at the end to obtain the final score. Bi-encoders can therefore index the encoded candidates and compare these cached representations against each input, which greatly speeds up prediction. For the same 10,000-sentence clustering task, the time drops from about 65 hours to roughly 5 seconds. A strong bi-encoder BERT model, called Sentence-BERT (SBERT), was proposed by the Ubiquitous Knowledge Processing Lab (UKP-TUDA).
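The cost difference between the two bullets above can be sketched in a few lines. The "encoder" here is a deliberately toy character-frequency featurizer, not a real model; the point is only that a bi-encoder encodes each sentence once and scores pairs with a cheap vector comparison, whereas a cross-encoder would need one full forward pass per pair.

```python
import math

def toy_encode(sentence):
    """Stand-in for a bi-encoder: maps a sentence to a small dense vector
    (character-frequency features, purely illustrative)."""
    vec = [0.0] * 26
    for ch in sentence.lower():
        if ch.isalpha():
            vec[ord(ch) - ord('a')] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(u, v):
    # Vectors are already L2-normalized, so the dot product is the cosine.
    return sum(a * b for a, b in zip(u, v))

sentences = ["a new phone", "a brand new phone", "weather today"]

# Bi-encoder style: encode each sentence ONCE (linear cost); any pair can
# then be scored by a cheap vector comparison against the cached index.
cache = {s: toy_encode(s) for s in sentences}
score = cosine(cache["a new phone"], cache["a brand new phone"])

# Cross-encoder style would instead need one full model forward pass per
# pair: quadratic cost for all-pairs clustering, and no reusable
# per-sentence representation to index.
```

With a real model the encoding step is the expensive part, which is exactly why caching it (bi-encoder) turns 65 hours into seconds.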
On the other hand, no method is perfect in every respect, and bi-encoders are no exception. Compared with cross-encoders, bi-encoders usually perform worse and require large amounts of training data. The reason is that a cross-encoder can attend over both inputs simultaneously, whereas a bi-encoder must map each input independently into a meaningful vector space, which requires a sufficient number of training examples for fine-tuning.
To address this, poly-encoders were introduced. Poly-encoders use two separate transformers (similar to cross-encoders) but apply attention only between the top-level representations of the two inputs, yielding better performance than bi-encoders and much greater speed than cross-encoders. However, poly-encoders still have shortcomings: because of their asymmetric score function, they cannot be applied to tasks with symmetric similarity, and their representations cannot be efficiently indexed, which causes problems for retrieval over large corpora.
In this article, I would like to introduce a new method, data augmentation, that effectively combines cross-encoders and bi-encoders. The strategy, called Augmented SBERT (AugSBERT), uses a BERT cross-encoder to label a larger set of input pairs and thereby augment the training data of the SBERT bi-encoder. The SBERT bi-encoder is then fine-tuned on this larger augmented training set, which significantly improves performance. The idea is very similar to "Self-Supervised Learning by Relational Reasoning" in computer vision; simply put, it can be seen as a form of self-supervised learning for natural language processing. The details are introduced in the next section.
Technical highlights
The Augmented SBERT method covers three main scenarios for pairwise sentence regression or classification tasks.
Scenario 1: Fully annotated dataset (all sentence pairs are labeled)
In this case, a direct data-augmentation strategy is applied to prepare and extend the labeled dataset. There are three common levels of augmentation: character, word, and sentence.
However, the word level is the most appropriate for sentence-pair tasks. Based on the performance of trained bi-encoders, a few methods are recommended: insertion or replacement via contextual word embeddings (BERT, DistilBERT, RoBERTa, or XLNet), or replacement with synonyms (WordNet, PPDB). After creating the augmented text data, combine it with the original data and feed it to the bi-encoder.
However, in rare or exceptional labeled datasets, such simple word-replacement or augmentation strategies do not help data augmentation in sentence-pair tasks, and can even perform worse than a model without augmentation.
In short, the direct data-augmentation strategy involves three steps:
- Step 1: Prepare the fully labeled semantic textual similarity dataset (gold data)
- Step 2: Replace synonyms within the sentence pairs (silver data)
- Step 3: Train the bi-encoder (SBERT) on the extended (gold + silver) training dataset
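The three steps above can be sketched as follows. The synonym table here is a hypothetical miniature stand-in (the article suggests WordNet or PPDB), and the "training" step is represented only by assembling the gold + silver set.

```python
import random

# Hypothetical miniature synonym table; in practice this would come
# from WordNet or PPDB as suggested above.
SYNONYMS = {"quick": ["fast", "rapid"], "car": ["automobile"]}

def augment(sentence, rng):
    """Word-level synonym replacement for one sentence (silver data)."""
    out = []
    for w in sentence.split():
        out.append(rng.choice(SYNONYMS[w]) if w in SYNONYMS else w)
    return " ".join(out)

# Step 1: fully labeled gold data, (sentence_a, sentence_b, similarity).
gold = [("a quick car", "a fast automobile", 0.9)]

# Step 2: synonym replacement produces silver pairs with the SAME label.
rng = random.Random(0)
silver = [(augment(s1, rng), augment(s2, rng), label)
          for s1, s2, label in gold]

# Step 3: the bi-encoder (SBERT) would then be trained on gold + silver.
train_set = gold + silver
```

Note that the silver label is simply inherited from the gold pair, which is why this only works when the replacements preserve meaning.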
Scenario 2: Limited or small annotated dataset (few labeled sentence pairs)
In this case, since the labeled dataset (gold data) is limited, a pre-trained cross-encoder is used to weakly label unlabeled data from the same domain. However, randomly selecting two sentences usually yields a dissimilar (negative) pair; positive pairs are extremely rare. This skews the label distribution of the silver dataset heavily toward negative pairs. Therefore, two suitable sampling methods are recommended:
BM25 sampling (BM25): an algorithm based on lexical overlap, commonly used as a scoring function by many search engines. For each query sentence, the top k most similar sentences are retrieved from a uniquely indexed collection.
Semantic search sampling (SS): a bi-encoder (SBERT) trained on the gold data is used to retrieve the top k most similar sentences in the collection. For large collections, an approximate nearest-neighbor library such as Faiss can be used to retrieve the top k most similar sentences quickly. This addresses BM25's weakness on synonymous sentences with little or no lexical overlap.
Afterwards, the sampled sentence pairs are weakly labeled by the pre-trained cross-encoder and merged with the gold dataset. A bi-encoder is then trained on this extended training dataset. This model is called Augmented SBERT (AugSBERT). AugSBERT can improve existing bi-encoders and narrow the performance gap with cross-encoders.
In summary, AugSBERT for limited datasets involves three steps:
- Step 1: Fine-tune a cross-encoder (BERT) on the small gold dataset
- Step 2.1: Create new pairs by recombination, reduced via BM25 or semantic search
- Step 2.2: Weakly label the new pairs with the cross-encoder (silver dataset)
- Step 3: Train the bi-encoder (SBERT) on the extended (gold + silver) training dataset
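The steps above can be sketched as a runnable flow. The real pipeline uses a fine-tuned BERT cross-encoder; `cross_encoder_score` below is a hypothetical token-overlap stand-in so the data flow can be shown end to end.

```python
def cross_encoder_score(a, b):
    # Stand-in scorer: Jaccard token overlap. A real cross-encoder would
    # run joint self-attention over the concatenated pair instead.
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

# Small gold dataset (Step 1 would fine-tune the cross-encoder on this).
gold = [("how old are you", "what is your age", 1.0)]

# Step 2.1: candidate pairs from recombination, normally reduced by
# BM25 or semantic-search sampling (here just listed directly).
unlabeled = [("a new phone is out", "a new phone launch"),
             ("a new phone is out", "rain tomorrow")]

# Step 2.2: weakly label the sampled pairs -> silver dataset.
silver = [(a, b, cross_encoder_score(a, b)) for a, b in unlabeled]

# Step 3: the bi-encoder (SBERT) is then trained on gold + silver.
train_set = gold + silver
```

The key design point is that the (slow, accurate) cross-encoder is only run offline to produce labels, while the (fast) bi-encoder trained on its output is what serves at inference time.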
Scenario 3: No annotated dataset (only unlabeled sentence pairs)
This scenario arises when we want SBERT to perform well on data from a different (unannotated) domain. Essentially, SBERT cannot map sentences containing unseen terminology into a reasonable vector space. Therefore, a data-augmentation strategy for domain adaptation is proposed:
- Step 1: Train a cross-encoder (BERT) from scratch on the source dataset
- Step 2: Use this cross-encoder (BERT) to label the target dataset, i.e., the unlabeled sentence pairs
- Step 3: Finally, train a bi-encoder (SBERT) on the labeled target dataset

In general, AugSBERT benefits greatly when the source domain is fairly generic and the target domain is fairly specific. Conversely, when moving from a specific source domain to a generic target domain, performance improves only slightly.
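The domain-adaptation recipe can be sketched the same way. The trainer and scorer below are hypothetical stubs standing in for the BERT cross-encoder and SBERT; only the shape of the data flow matches the three steps, not the actual models.

```python
def train_cross_encoder(source_pairs):
    """Step 1 stand-in: 'fit' a cross-encoder on the labeled SOURCE domain.
    The stub ignores the data and returns a simple overlap scorer."""
    def score(a, b):
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / max(len(ta | tb), 1)
    return score

# Generic source domain with labels; specific target domain without.
source = [("good movie", "great film", 1.0)]
target_unlabeled = [("install the driver", "driver installation guide"),
                    ("install the driver", "baking a cake")]

scorer = train_cross_encoder(source)

# Step 2: label the unlabeled TARGET-domain pairs with the cross-encoder.
target_labeled = [(a, b, scorer(a, b)) for a, b in target_unlabeled]

# Step 3: the bi-encoder (SBERT) is then trained on `target_labeled`,
# adapting it to the target domain's vocabulary.
```

This mirrors the observation above: the cross-encoder trained on a generic source transfers its pairwise judgments onto specific target-domain vocabulary that the bi-encoder has never seen labeled.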