DeBERTa (Decoding-enhanced BERT with Disentangled Attention)
2022-06-30 09:37:00 【A grain of sand in the vast sea of people】
Catalog
1. Brief introduction of the paper
2. Contributions
2.1. Disentangled attention mechanism
2.2. Enhanced mask decoder
2.3. Adversarial training method (SiFT)
3. Experimental results
4. Other
5. Code
1. Brief introduction of the paper
DeBERTa (Decoding-enhanced BERT with disentangled attention) improves on the BERT and RoBERTa models with two new techniques, and it outperforms XLNet, BERT, and RoBERTa.
It was also the first model to surpass the human baseline on the SuperGLUE leaderboard.
2. Contributions
2.1. Disentangled attention mechanism
Why
For example, when the words “deep” and “learning” occur next to each other, the dependency between them is much stronger than when they occur in different sentences or far apart.
To capture this, relative position information is introduced.
How
Each word is represented by two vectors that encode its content and position respectively, and the attention weight between two words is computed from disentangled matrices over their contents and relative positions.
$H_i$ denotes the content embedding of token $i$, and $P_{i|j}$ denotes the relative position embedding of token $i$ with respect to token $j$.
The attention weight of a word pair can then be computed as the sum of four attention scores, using disentangled matrices over content and position: content-to-content, content-to-position, position-to-content, and position-to-position.
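In the notation above, this decomposition from the paper reads:

$$A_{i,j} = \{H_i, P_{i|j}\} \times \{H_j, P_{j|i}\}^\top = H_i H_j^\top + H_i P_{j|i}^\top + P_{i|j} H_j^\top + P_{i|j} P_{j|i}^\top$$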
BERT considers only the content-to-content term. In DeBERTa, the position-to-position term provides little additional information, so the implementation removes it from the equation above.
For reference, standard self-attention is computed as follows:
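Here $H$ is the input hidden states, $W_q$, $W_k$, $W_v$ are projection matrices, and $d$ is the hidden dimension (as in the paper):

$$Q = HW_q,\quad K = HW_k,\quad V = HW_v,\quad A = \frac{QK^\top}{\sqrt{d}},\quad H_o = \operatorname{softmax}(A)\,V$$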
The relative distance $\delta(i,j)$ between tokens $i$ and $j$ is computed as follows:
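$$\delta(i,j) = \begin{cases} 0 & \text{for } i-j \le -k \\ 2k-1 & \text{for } i-j \ge k \\ i-j+k & \text{otherwise,} \end{cases}$$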
where $k$ denotes the maximum relative distance, so $\delta(i,j) \in [0, 2k)$.
The disentangled attention scores are then computed as follows, where $Q^c$, $K^c$, $V^c$ are the query, key, and value vectors projected from the content embeddings, and $Q^r$, $K^r$ are projected from the relative position embeddings:
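$$\tilde{A}_{i,j} = \underbrace{Q_i^c {K_j^c}^{\top}}_{\text{content-to-content}} + \underbrace{Q_i^c {K_{\delta(i,j)}^r}^{\top}}_{\text{content-to-position}} + \underbrace{K_j^c {Q_{\delta(j,i)}^r}^{\top}}_{\text{position-to-content}}$$

$$H_o = \operatorname{softmax}\!\left(\frac{\tilde{A}}{\sqrt{3d}}\right) V^c$$

The $\frac{1}{\sqrt{3d}}$ scaling reflects that three score terms are summed, and the paper notes it is important for stable large-scale training.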
The overall algorithm flow is shown in the figure below:
2.2. Enhanced mask decoder (EMD)
Why
Example: in the sentence “a new store opened beside the new mall”, suppose the words “store” and “mall” are masked for prediction. Although the local contexts of the two words are similar, they play different syntactic roles in the sentence.
Here the subject of the sentence is “store”, not “mall”. To resolve such cases, DeBERTa incorporates the absolute positions of words just before the softmax layer, where the model decodes the masked words. This differs from BERT, which adds the absolute position embeddings at the input layer.
How
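The original post shows the implementation in a figure here. Below is a deliberately simplified, hypothetical PyTorch sketch of the core idea only; the class and parameter names are illustrative and not from the official code, which feeds absolute positions into additional decoding layers rather than a single linear head:

import torch
import torch.nn as nn

class EMDHeadSketch(nn.Module):
    """Illustrative only: inject absolute positions before the MLM softmax."""
    def __init__(self, hidden_size, vocab_size, max_position):
        super().__init__()
        self.abs_pos_embed = nn.Embedding(max_position, hidden_size)
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, last_hidden_states, position_ids):
        # BERT adds absolute positions at the input; here they are added
        # only at decoding time, so lower layers see relative positions only
        h = last_hidden_states + self.abs_pos_embed(position_ids)
        return self.decoder(h)  # logits for the masked-token softmax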
2.3. Adversarial training method (SiFT)
An adversarial training method is used during fine-tuning to improve the model's generalization ability.
Adversarial training is a regularization method that improves a model's generalization by making it robust to adversarial examples, which are created by slightly perturbing the input. The model is regularized so that, given a task-specific example, it produces the same output distribution on the example as on an adversarial perturbation of that example.
For NLP tasks, the perturbation is applied to word embeddings rather than to the original word sequence. However, the norms of the embedding vectors vary across words and models, and the variance grows in large models with billions of parameters, which makes adversarial training unstable.
Inspired by layer normalization, the authors propose the SiFT (Scale-invariant Fine-Tuning) algorithm, which improves training stability by applying the perturbation to normalized word embeddings. Specifically, when fine-tuning DeBERTa on downstream NLP tasks in their experiments, SiFT first normalizes the word embedding vectors into stochastic vectors and then applies the perturbation to the normalized vectors. They find that normalization substantially improves the performance of the fine-tuned models, and the improvement is more prominent for larger DeBERTa models.
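As a rough illustration of the core idea (not the authors' implementation; the function name and the choice of LayerNorm-style statistics here are assumptions), the perturbation is applied after normalizing the embeddings:

import torch

def sift_perturb(word_embeddings, eps=1e-2):
    """Sketch: normalize embeddings (LayerNorm-style), then perturb.

    In a full adversarial-training loop, delta would be optimized by
    gradient ascent on the task loss; here it is only initialized
    randomly to show where the perturbation is applied.
    """
    mean = word_embeddings.mean(dim=-1, keepdim=True)
    std = word_embeddings.std(dim=-1, keepdim=True)
    normalized = (word_embeddings - mean) / (std + 1e-6)

    delta = torch.empty_like(normalized).uniform_(-eps, eps)
    delta.requires_grad_(True)  # enables the adversarial ascent step
    return normalized + delta, delta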
3. Experimental results
Pre-training dataset size:
As can be seen, BERT uses the least pre-training data, DeBERTa the next least, and XLNet the most.
Results of the Large models:
Results of the Base models:
4. Other
The picture below is from the Microsoft website. It clearly shows the Enhanced Mask Decoder and the Disentangled Attention (relative position embedding).
5. Code
The build_relative_position method in da_util.py returns a relative position matrix.
If query_size = 4 and key_size = 4, the returned shape is (1, 4, 4).
The matrix is roughly as follows (entry (i, j) holds i - j):

 0  -1  -2  -3
 1   0  -1  -2
 2   1   0  -1
 3   2   1   0
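A minimal PyTorch sketch consistent with this behavior (the official implementation may differ in details):

import torch

def build_relative_position(query_size, key_size, device="cpu"):
    """Return a (1, query_size, key_size) matrix with entry [0, i, j] = i - j."""
    q_ids = torch.arange(query_size, dtype=torch.long, device=device)
    k_ids = torch.arange(key_size, dtype=torch.long, device=device)
    rel_pos = q_ids[:, None] - k_ids[None, :]  # broadcasted pairwise distances
    return rel_pos.unsqueeze(0)                # add the leading batch dim

print(build_relative_position(4, 4)[0])
# tensor([[ 0, -1, -2, -3],
#         [ 1,  0, -1, -2],
#         [ 2,  1,  0, -1],
#         [ 3,  2,  1,  0]])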
Disentangled attention code implementation:
The relative position indices are clamped into the valid bucket range:
# shift each distance by att_span (= k) so indices fall in [0, 2k-1]
c2p_pos = torch.clamp(relative_pos + att_span, 0, att_span*2 - 1)
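To show how the clamped index is used, here is a hedged, self-contained sketch of the content-to-position term; the tensor names and sizes are illustrative, and the official code organizes this differently:

import torch

B, H, Lq, Lk, d, k = 2, 4, 8, 8, 16, 4  # illustrative sizes; k = max relative distance
att_span = k

query_layer = torch.randn(B, H, Lq, d)   # content queries Q^c
pos_key_layer = torch.randn(2 * k, d)    # relative position keys K^r (shared)
relative_pos = (torch.arange(Lq)[:, None] - torch.arange(Lk)[None, :]).unsqueeze(0)

# score every query against all 2k relative-position keys: (B, H, Lq, 2k)
c2p_att = torch.matmul(query_layer, pos_key_layer.transpose(-1, -2))

# shift each distance by att_span so indices fall in [0, 2k-1]
c2p_pos = torch.clamp(relative_pos + att_span, 0, att_span * 2 - 1)

# for each (i, j) pair, pick the score of its own relative distance
index = c2p_pos.unsqueeze(1).expand(B, H, Lq, Lk)
c2p_score = torch.gather(c2p_att, dim=-1, index=index)  # (B, H, Lq, Lk)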