当前位置:网站首页>The strongest installation of the twin tower model, Google is playing "antique" again?

The strongest installation of the twin tower model, Google is playing "antique" again?

2022-07-07 21:55:00 Zhiyuan community

The twin tower model has proved to be a very effective modeling method in search and question and answer tasks , The theory and business are quite mature . The two towers share different degrees according to parameters , It usually falls into two categories :Simese dual encoder and Asymmetric dual encoder, The former parameter structure is completely symmetrical , The latter is not completely symmetrical ( Hereinafter referred to as" SDE and ADE).

This paper is after the long silence of the twin towers , Google pushed it to the center of the universe again , And open the strongest export of the twin towers , Explore the differences and connections between the two in detail , More empirical conclusions of the double tower structure are also given through experiments . It is suitable for old drivers to recall classics and Xiaobai again and make a deep and systematic introduction ~

Thesis title :
Exploring Dual Encoder Architectures for Question Answering

Thesis link :
https://arxiv.org/abs/2204.07120

 

background

First of all, what is popular science SDE and ADE? The dual encoder network structure will text1 and text2 Respectively encoded into vector representation , Then calculate the sum of the two cosine Equidistance function measures its similarity .SDE Is a twin network that fully shares parameters , That is, although it is a double tower , But actually query/user and doc/item Share a set of parameters ;ADE Only some parameters are shared or not shared at all , It is an independent two parameter network . What they have in common is that they will not interact deeply , contrast BERT Is a typical interactive network . A typical application of double tower structure is recall or Rough row , Scenarios that require strict computing speed .

The modeling idea of twin towers is relatively simple and easy to understand . This article is short and concise , The highlight is Provide a more general conclusion under the twin tower application scenario , Explain several questions clearly :

  • ADE and SDE stay QA Which one works better on the task ?
  • ADE What are the reasons for poor performance ? What's the solution ?

The author draws a reliable conclusion through reasonable and detailed experiments , Xiaobai can also quickly get To how in ( towards ) real ( guide ) Examination ( t ) Do a section ( Remit ) study ( newspaper ).

experiment

The author in QA The retrieval task is carried out 5 An experiment , Calculation query And candidates answer(doc or passage) The similarity of , The evaluation task is MS MARCO and MultiReQA. Model encoder Is based on transformer,cosine As a distance measurement function , The goal is to explore the influence of the sharing degree of parameters on the modeling effect . 5 A group of experimental networks are the standards of Figure 1 SDE and ADE, as well as 3 Variant structure :• ADE with shared token embedder (ADE-STE) • ADE with frozen token embedder (ADE-FTE) • ADE with shared projection layer (ADE-SPL) The experimental results are as follows :

The experimental conclusion :

  • ADE Performance on multiple tasks is significantly inferior to SDE. The reasonable explanation given by the author is due to ADE The essence is two networks with different parameters , So the query and doc Map to two completely different vector spaces . This point later gives more powerful evidence .
  • ADE-SPL Our performance is comparable to SDE. after 3 The first experiment is the structure proposed by the author to explore the degree of parameter sharing , At the same time, it also gives which part of the network is limited ADE The key to the effect . Just share or fix the bottom token embedder The effect improvement brought by parameters is not obvious , But when the last top-level parameters share a full connection layer , Can get and SDE The effect of proximity . Why? ? The author's guess is because of the last MLP The parameters are constrained to the same vector space again .

To further illustrate the problem , The author conducted another experiment , take NaturalQuestions Test set query and answer Calculate in advance , And then through t-SNE Map and cluster into a two-dimensional space , Be surprised to find ,dual encoder The performance of depends on whether the last two are in a comparable vector space .

原网站

版权声明
本文为[Zhiyuan community]所创,转载请带上原文链接,感谢
https://yzsam.com/2022/188/202207071910479967.html