Distilling Knowledge from the Interaction Model! USTC & Meituan Propose VIRT, Combining the Efficiency of Two-Tower Models with the Performance of Interaction Models
2022-07-06 18:11:00 【I love computer vision】
This post shares the paper "VIRT: Improving Representation-based Models for Text Matching through Virtual Interaction". By distilling knowledge from an interaction-based model, USTC & Meituan propose VIRT, which combines the efficiency of the two-tower model with the performance of the interaction model, striking a balance between effectiveness and efficiency in text matching!
Details are as follows:
Paper link: https://arxiv.org/abs/2112.04195
01
Abstract
With the rapid development of pre-trained Transformers, representation-based models built on Siamese Transformer encoders have become the mainstream technique for efficient text matching. However, compared with interaction-based models, their performance degrades severely because interaction between the text pair is missing. Existing techniques attempt to address this by adding extra interaction on top of the Siamese-encoded representations, but the interaction during the encoding process itself is still ignored.
To address this problem, the authors propose a Virtual InteRacTion mechanism (VIRT) that transfers interaction knowledge from an interaction-based model into the Siamese encoders through attention-map distillation. VIRT is a training-time-only component: it fully preserves the efficiency of the Siamese structure and adds no extra computation at inference time. To make full use of the learned interaction knowledge, the authors further design a VIRT-adapted interaction strategy.
Experimental results on multiple text-matching datasets show that the proposed method outperforms existing representation-based models. Moreover, VIRT can be easily integrated into existing representation-based methods for further improvements.
02
Motivation
Text matching aims to model the semantic relationship between a pair of texts and is a fundamental problem in natural language understanding. For example, in community question answering (CQA) systems, a key component is finding questions in the database that are similar to a user's question via question matching. Similarly, dialogue agents need to make logical inferences by predicting the entailment relationship between user statements and predefined hypotheses.
Recently, deep pre-trained Transformers have made remarkable progress on text-matching tasks. There are two typical paradigms for fine-tuning Transformer encoders: interaction-based models (cross-encoders) and representation-based models (dual encoders), as shown in figures (a) and (b) above. Interaction-based models (e.g., BERT) concatenate the text pair into a single sequence and perform full interaction between the two texts. Although full interaction provides rich matching signals from the bottom to the top of the model, it also incurs high computational cost and inference latency, making such models hard to deploy in real-world scenarios.
For example, in an e-commerce search system with millions of item-query pairs, scoring all pairs with an interaction-based model would take tens of days. In contrast, representation-based models encode the two texts independently with two Siamese encoders, without any interaction. They therefore support offline computation of embeddings, which greatly reduces online latency and makes such models very practical. Unfortunately, independent encoding without any interaction loses matching signals, leading to severe performance degradation.
To balance efficiency and effectiveness, some works try to equip the Siamese structure with interaction modules, and various interaction strategies have been proposed, such as attention layers and Transformer layers. For efficiency, however, these interaction modules are added after the Siamese encoders so as to preserve the Siamese property, and the interaction during the encoding process of the Siamese encoders is still ignored. Rich interaction signals are therefore lost, and existing representation-based models still lag far behind interaction-based models in effectiveness.
In this work, the authors try to break the dilemma between interaction-based and representation-based models. The key idea is to integrate interaction into the Siamese encoding process of a representation-based model without breaking its Siamese structure. To this end, the authors propose Virtual InteRacTion (VIRT), a new mechanism that transfers the knowledge of text-pair interaction into the Siamese encoders of a representation-based model.
Concretely, the Siamese encoders learn the interaction information between the two texts by mimicking full interaction, guided by knowledge transferred from an interaction-based model. The knowledge transfer is implemented as an attention-map distillation task during training, which can be removed at inference time to preserve the Siamese property; hence the name "virtual interaction". In addition, to further exploit the interaction knowledge learned after Siamese encoding, the authors design a VIRT-adapted interaction strategy. A representation-based model called the VIRT encoder implements both VIRT and the VIRT-adapted interaction strategy, as shown in figure (c) above.
The contributions of this paper can be summarized as follows:
The authors propose VIRT, a new virtual interaction mechanism that distills attention maps from an interaction-based model into the Siamese encoders of a representation-based model, with no extra inference cost.
Extensive experiments show that the proposed VIRT encoder outperforms SOTA representation-based models while maintaining inference efficiency.
VIRT can be easily integrated into other representation-based text-matching models to further improve their performance.
03
Method
In this section, the authors first describe interaction-based models and representation-based models. They then introduce the Virtual InteRacTion mechanism (VIRT), which distills interaction knowledge from an interaction-based model into the Siamese encoders. In addition, the VIRT-adapted interaction strategy makes full use of the learned interaction knowledge. The architecture of VIRT is shown in the figure above.
3.1 Interaction-based Models
Given two text sequences X = {x_1, ..., x_m} and Y = {y_1, ..., y_n} as input, interaction-based models concatenate X and Y into a single sequence S = [X; Y] and encode it with an L-layer Transformer: H^l = Transformer_l(H^{l-1}), l = 1, ..., L. Each Transformer layer consists of two residual sublayers: a multi-head attention operation (MHA) and a feed-forward network (FFN):

A^l = LN(H^{l-1} + MHA(H^{l-1})),  H^l = LN(A^l + FFN(A^l)),
MHA(H^{l-1}) = M^l V^l,  M^l = softmax(Q^l (K^l)^T / √d),

where d is the dimension of the hidden states (the batch-size and attention-head dimensions are omitted for clarity of description). H^{l-1} is the intermediate representation of layer l-1, which encodes the interaction information between X and Y. Q^l, K^l, and V^l are the attention parameters of layer l, obtained by linearly projecting H^{l-1}. LN denotes the LayerNorm operation. As can be seen, interaction-based models encode the interaction information into the representations of X and Y. In particular, the representations jointly produce the attention map M^l, which weights the different interaction signals, and the representations are selected and aggregated according to M^l.
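As an illustration, here is a minimal NumPy sketch (not the authors' code; the shapes, random inputs, and single-head projection matrices are toy assumptions) of how the attention map over the concatenated sequence [X; Y] mixes tokens from both texts:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
m, n, d = 4, 3, 8                        # toy lengths of X and Y, hidden size
H = rng.normal(size=(m + n, d))          # concatenated representations [X; Y]
W_Q = rng.normal(size=(d, d))            # toy projection matrices
W_K = rng.normal(size=(d, d))

# M = softmax(Q K^T / sqrt(d)): a single attention map over the whole pair,
# so every token of X can attend to every token of Y and vice versa.
Q, K = H @ W_Q, H @ W_K
M = softmax(Q @ K.T / np.sqrt(d))
print(M.shape)                           # (7, 7)
```

Every row of M is a distribution over all m + n tokens, which is exactly the full interaction the representation-based model below lacks.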
3.2 Representation-based Models
Compared with interaction-based models, representation-based models first encode X and Y separately with two independent Siamese Transformer encoders (each assumed to have L Transformer layers): H_X^l = Transformer_l(H_X^{l-1}), H_Y^l = Transformer_l(H_Y^{l-1}). Additional interaction is then performed on the Siamese-encoded H_X^L and H_Y^L. The Transformer structure is the same as in the interaction-based model, except that the attention map M_X^l (or M_Y^l) is computed from X (or Y) alone:

M_X^l = softmax(Q_X^l (K_X^l)^T / √d),  M_Y^l = softmax(Q_Y^l (K_Y^l)^T / √d).

Compared with the interaction-based model, there is no interaction between X and Y during encoding. The fine-grained interaction information between X and Y is therefore lost in a representation-based model, which leads to performance degradation.
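For contrast, a minimal sketch of the Siamese attention maps (again with toy shapes and random inputs, not the authors' code), which never connect a token of X with a token of Y:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
m, n, d = 4, 3, 8
X = rng.normal(size=(m, d))              # representations of X
Y = rng.normal(size=(n, d))              # representations of Y
W_Q = rng.normal(size=(d, d))            # shared (Siamese) projections
W_K = rng.normal(size=(d, d))

# Each attention map is computed within one text only: M_X is m x m and
# M_Y is n x n, so no cross-text score like Q_X K_Y^T is ever produced.
M_X = softmax((X @ W_Q) @ (X @ W_K).T / np.sqrt(d))
M_Y = softmax((Y @ W_Q) @ (Y @ W_K).T / np.sqrt(d))
print(M_X.shape, M_Y.shape)              # (4, 4) (3, 3)
```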
3.3 Virtual Interaction
As mentioned above, the main drawback of representation-based models is the lack of interaction when the two input sequences are encoded separately. Intuitively, interaction-based models perform interaction through the MHA mechanism, whose attention map is computed jointly from the representations of X and Y. In contrast, representation-based models compute M_X (or M_Y) from X (or Y) alone. The following first explains in detail the difference in the MHA operation between the two models, and then introduces the VIRT mechanism, which improves representation-based models without adding extra inference cost.
First, decompose the MHA operation of the interaction-based model, as highlighted in blue in figure (b) above. Specifically, the input representation of layer l in the interaction-based model, namely H^{l-1}, can be decomposed into an X part and a Y part: H^{l-1} = [H_X^{l-1}; H_Y^{l-1}], where H_X^{l-1} ∈ R^{m×d} and H_Y^{l-1} ∈ R^{n×d}. Based on this, the attention parameters can also be rewritten as the combination of an X part (denoted Q_X and K_X) and a Y part (denoted Q_Y and K_Y). The attention scores before the softmax(·) operation (denoted S) can then be decomposed into the following block matrix with four parts:

S = Q K^T / √d = [ Q_X K_X^T , Q_X K_Y^T ; Q_Y K_X^T , Q_Y K_Y^T ] / √d.

The diagonal blocks Q_X K_X^T and Q_Y K_Y^T correspond to MHA performed within X or Y alone, i.e., the MHA operation of the representation-based model. The off-diagonal blocks Q_X K_Y^T and Q_Y K_X^T capture the interaction between X and Y in the interaction-based model and are responsible for interaction-rich representations. However, they are missing from the representation-based model, which causes the performance gap between the two models.
From the above analysis, the interaction missing from representation-based models can be abstracted as the MHA operation between X and Y. To recover this missing interaction, the representation-based model is made to simulate the interaction of the interaction-based model as follows:

M̃_{X→Y}^l = softmax(Q_X^l (K_Y^l)^T / √d),  M̃_{Y→X}^l = softmax(Q_Y^l (K_X^l)^T / √d),

where M̃_{X→Y} is the attention map generated by X attending to Y, and M̃_{Y→X} is the attention map generated by Y attending to X. These two additional attention maps represent the interaction signals missing from the representation-based model.
Since the goal is to push the representation-based model toward the performance of the interaction-based model, the authors propose to align these missing attention maps with their existing counterparts in the interaction-based model. The attention maps of the interaction-based model guide the representations (i.e., H_X and H_Y) toward interaction-rich directions, just as interaction during encoding would. In this way, knowledge of the interaction process is distilled and transferred to the dual encoder without any extra computational overhead, which is why the mechanism is called "virtual interaction".
To implement VIRT, the authors adopt knowledge distillation: a trained interaction-based model serves as the teacher, and the representation-based model to be trained serves as the student. The teacher's counterparts of M̃_{X→Y} and M̃_{Y→X}, i.e., the interaction between X and Y in the interaction-based model, can be sliced directly from the attention scores S before softmax(·): the first m rows and last n columns correspond to X attending to Y, and the last n rows and first m columns correspond to Y attending to X. Applying the softmax operation to these two slices yields the guiding attention maps from the interaction-based model:

M_{X→Y}^l = softmax(S^l[:m, m:]),  M_{Y→X}^l = softmax(S^l[m:, :m]).
This interaction knowledge is then transferred as supervision to guide VIRT. Concretely, the objective is to minimize the L2 distance between the student's simulated maps and the teacher's maps across all layers:

L_VIRT = Σ_{l=1}^{L} ( ||M̃_{X→Y}^l − M_{X→Y}^l||_2 + ||M̃_{Y→X}^l − M_{Y→X}^l||_2 ).
This objective is used only during training and is removed at inference time. This preserves the Siamese property of the representation-based model and incurs no extra inference cost.
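The distillation step above can be sketched for a single layer and a single head (a toy NumPy sketch under assumptions: random inputs, one head, and no layer summation or normalization constants):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def virt_loss(S_teacher, Q_X, K_X, Q_Y, K_Y, d):
    """L2 distance between the teacher's sliced cross-attention maps and
    the student's simulated ones, for one layer and one head."""
    m = Q_X.shape[0]
    # Teacher: slice the pre-softmax scores of the cross-encoder, then
    # softmax each slice to get the guiding attention maps.
    M_xy = softmax(S_teacher[:m, m:])            # X rows, Y columns
    M_yx = softmax(S_teacher[m:, :m])            # Y rows, X columns
    # Student: simulate the missing cross-attention with its own Q and K
    # (computed only for this loss; dropped entirely at inference time).
    M_xy_s = softmax(Q_X @ K_Y.T / np.sqrt(d))
    M_yx_s = softmax(Q_Y @ K_X.T / np.sqrt(d))
    return (np.linalg.norm(M_xy - M_xy_s)
            + np.linalg.norm(M_yx - M_yx_s))

rng = np.random.default_rng(0)
m, n, d = 4, 3, 8
S = rng.normal(size=(m + n, m + n))              # toy teacher scores
Q_X, K_X = rng.normal(size=(m, d)), rng.normal(size=(m, d))
Q_Y, K_Y = rng.normal(size=(n, d)), rng.normal(size=(n, d))
loss = virt_loss(S, Q_X, K_X, Q_Y, K_Y, d)
print(loss >= 0.0)                               # True
```

Note that the cross products Q_X K_Y^T exist only inside the loss; the two encoders themselves never exchange information, which is what keeps inference purely Siamese.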
3.4 VIRT-Adapted Interaction
Through VIRT, interaction knowledge penetrates every encoding layer of the representation-based model. However, after Siamese encoding, the final-layer representations, i.e., H_X^L and H_Y^L, still cannot see each other, so explicit interaction is missing. To make full use of the learned interaction knowledge, the authors further design a VIRT-adapted interaction strategy, which fuses H_X^L and H_Y^L under the guidance of the attention maps learned via VIRT.
Specifically, the VIRT-adapted interaction between H_X^L and H_Y^L proceeds as follows. The attention map is generated as

A = softmax(H_X^L (H_Y^L)^T / √d),

and the two texts are aggregated as x = Pool(A H_Y^L) and y = Pool(A^T H_X^L), where Pool(·) denotes the average pooling operation. Finally, a simple fusion predicts the matching label ŷ:

ŷ = MLP([x; y]),

where [·;·] is the concat operation and MLP is a multi-layer perceptron. The overall training objective minimizes the combination of the task-specific supervised loss (i.e., the cross-entropy loss) L_CE and L_VIRT:

L = L_CE + α · L_VIRT,

where α is a hyperparameter that weights the impact of virtual interaction.
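A toy sketch of the VIRT-adapted fusion and the combined objective (the shapes, inputs, fusion features, α value, and loss values are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def virt_adapted_fusion(H_X, H_Y, d):
    # Attention map between the two final Siamese representations,
    # average pooling, then concat (the result would feed an MLP).
    A = softmax(H_X @ H_Y.T / np.sqrt(d))        # m x n
    x = (A @ H_Y).mean(axis=0)                   # X attends to Y, pooled
    y = (A.T @ H_X).mean(axis=0)                 # Y attends to X, pooled
    return np.concatenate([x, y])                # 2d-dim fusion vector

rng = np.random.default_rng(0)
m, n, d = 4, 3, 8
fused = virt_adapted_fusion(rng.normal(size=(m, d)),
                            rng.normal(size=(n, d)), d)
print(fused.shape)                               # (16,)

# Overall objective: L = L_CE + alpha * L_VIRT
alpha = 0.5                                      # assumed hyperparameter value
L_CE, L_VIRT = 1.0, 0.5                          # placeholder loss values
L_total = L_CE + alpha * L_VIRT
print(L_total)                                   # 1.25
```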
It is worth noting that VIRT is a general strategy that can be used to enhance any representation-based matching model.
04
Experiments
The performance of different methods is shown in the table above. BERT-base, as a strong interaction-based model, demonstrates its effectiveness. Compared with BERT, the performance of Siamese BERT degrades noticeably. The proposed VIRT encoder achieves the best performance, outperforming all representation-based baselines and even remaining competitive with the interaction-based BERT model. This demonstrates that VIRT can approximate the deep interaction-modeling ability of interaction-based models.
The table above shows the contributions of VIRT and the VIRT-adapted interaction. The performance drop when removing either VIRT or the VIRT-adapted interaction demonstrates the effectiveness of both components. For MNLI and RTE, the degradation caused by removing the VIRT-adapted interaction is more severe.
To verify the generality of the proposed VIRT, the authors further plug it into the representation-based models above. The results are shown in the table above. They indicate that VIRT can be easily integrated into other representation-based text-matching models to further improve their performance.
05
Summary
Representation-based models are widely used in text-matching tasks because of their efficiency, but they perform worse than interaction-based models due to the lack of interaction. Previous work typically introduces extra interaction layers, while the interaction inside the Siamese encoders is still missing.
In this paper, the authors propose the Virtual InteRacTion (VIRT) mechanism, which approximates interaction-modeling ability by distilling attention maps from an interaction-based model into the Siamese encoders of a representation-based model, without extra inference cost.
The proposed VIRT encoder, equipped with both VIRT and the VIRT-adapted interaction strategy, achieves state-of-the-art performance among representation-based models on multiple text-matching tasks. Moreover, VIRT further improves existing representation-based models.
Reference material
[1]https://arxiv.org/abs/2112.04195