[Paper Introduction] R-Drop: Regularized Dropout for Neural Networks
2022-07-02 07:22:00 【lwgkzl】
Summary
The starting point of this paper is that conventional dropout makes the model behave inconsistently between training and testing.
Based on this observation, the paper proposes R-Drop to address the problem.
Experiments show that R-Drop is effective on multiple datasets (all with small improvements).
Thoughts on Dropout
First, why does conventional dropout introduce an inconsistency between training and testing? During training, dropout randomly masks some of the model's units and fits the data with the remaining sub-network (to prevent overfitting). Because the mask is resampled for every batch, different batches are effectively processed by different sub-networks, so the whole training process can be viewed as ensemble learning over many different sub-networks. At test time, however, no units are masked, so a single complete model makes the predictions on the test set; hence the inconsistency.
In short: during training we learn sub-models, but at test time we predict with the full model.
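As a quick illustration of this gap (a minimal sketch, not code from the paper), PyTorch's nn.Dropout is stochastic in training mode and a no-op in evaluation mode:

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()
print(drop(x))   # random mask: roughly half the entries become 0, survivors are scaled by 1/(1-p) = 2.0
print(drop(x))   # a different random mask, i.e. effectively a different sub-network

drop.eval()
print(drop(x))   # all ones: at test time nothing is masked and the full network is used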
The authors take a somewhat unexpected angle. The intuitive fix would be to find some way to shrink the gap between the sub-models and the full model. The idea in this paper is less direct: if all sub-models produce the same output for the same input, then the output of the full model should also be close to the outputs of the sub-models. The optimization objective is therefore that, for the same input, passed through the same architecture but with different dropout masks, the outputs should be consistent.
Introduction to R-Drop
The main idea was outlined in the previous section, and the figure in the paper illustrates it well. As shown on the right of the figure, the same input x is passed through the same Transformer encoder twice; because the two passes apply different dropout masks, they produce two output distributions, P1(y|x) and P2(y|x). R-Drop requires these two outputs to be as consistent as possible, so the KL divergence between them is used as one of the model's loss terms.
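Written out (following the formulation in the paper, with α as the weight of the consistency term), the per-example training objective combines the cross-entropy of both forward passes with a symmetric KL term:

$$
\mathcal{L}_i = -\log P_1(y_i \mid x_i) - \log P_2(y_i \mid x_i)
+ \frac{\alpha}{2}\Big[ D_{\mathrm{KL}}\big(P_1(\cdot \mid x_i)\,\Vert\,P_2(\cdot \mid x_i)\big) + D_{\mathrm{KL}}\big(P_2(\cdot \mid x_i)\,\Vert\,P_1(\cdot \mid x_i)\big) \Big]
$$

The code in the Code section below computes these same two pieces: the averaged cross-entropy becomes ce_loss and the bracketed symmetric KL becomes kl_loss (up to how the constant factors are folded into α).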
Experiments and Conclusions
The standard experiments show that R-Drop gives a small improvement (roughly 1%–2%) on all 18 datasets.
The ablation studies are more interesting; they test several ideas.
Idea 1:
Instead of computing the KL divergence only between the outputs of two dropout passes, the consistency regularization can be applied across more than two forward passes at the same time.
Conclusion: the authors ran three dropout passes simultaneously; the result is slightly better than with two passes, but the gain is not meaningful.
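A rough sketch of what this could look like (my own assumed implementation, not the authors' released code): run k forward passes with dropout enabled and average the symmetric KL over all pairs of outputs, reusing the compute_kl_loss helper shown in the Code section below.

import itertools

def r_drop_multi_kl(model, x, k=3):
    # k forward passes over the same input -> k different dropout masks
    outputs = [model(x) for _ in range(k)]
    pairs = list(itertools.combinations(range(k), 2))
    # average the pairwise symmetric KL (compute_kl_loss is defined in the Code section)
    kl = sum(compute_kl_loss(outputs[i], outputs[j]) for i, j in pairs) / len(pairs)
    return outputs, kl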
Idea 2: the two forward passes do not have to use the same dropout probability, so one can try masking with different probabilities. The paper reports a grid of results over combinations of the two probabilities. Conclusion: as long as both dropout probabilities lie between 0.3 and 0.5, the results are very similar.
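A hedged sketch of how the two passes could be given different dropout rates (an illustrative assumption, not the paper's code): overwrite the p attribute of every nn.Dropout module before each forward pass, then compare the two outputs as usual.

import torch.nn as nn

def set_dropout_p(model, p):
    # set the dropout probability of every nn.Dropout submodule
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.p = p

set_dropout_p(model, 0.3)
logits = model(x)        # first pass with dropout rate 0.3
set_dropout_p(model, 0.5)
logits2 = model(x)       # second pass with dropout rate 0.5
kl_loss = compute_kl_loss(logits, logits2)   # helper from the Code section below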
Code
import torch
import torch.nn.functional as F

# define your task model, which outputs the classifier logits
model = TaskModel()

def compute_kl_loss(p, q, pad_mask=None):
    # symmetric KL between the two output distributions (p and q are logits)
    p_loss = F.kl_div(F.log_softmax(p, dim=-1), F.softmax(q, dim=-1), reduction='none')
    q_loss = F.kl_div(F.log_softmax(q, dim=-1), F.softmax(p, dim=-1), reduction='none')

    # pad_mask is for seq-level tasks
    if pad_mask is not None:
        p_loss.masked_fill_(pad_mask, 0.)
        q_loss.masked_fill_(pad_mask, 0.)

    # you can choose "sum" or "mean" depending on your task
    p_loss = p_loss.sum()
    q_loss = q_loss.sum()

    loss = (p_loss + q_loss) / 2
    return loss

# keep dropout enabled (model in train mode) and forward the same batch twice
logits = model(x)
logits2 = model(x)

# cross-entropy loss for the classifier, averaged over the two passes
ce_loss = 0.5 * (F.cross_entropy(logits, label) + F.cross_entropy(logits2, label))

kl_loss = compute_kl_loss(logits, logits2)

# carefully choose the hyper-parameter alpha (the weight of the KL term)
loss = ce_loss + alpha * kl_loss
In practice, instead of forwarding x through the model twice, you can duplicate x, stack the two copies into one batch, and feed that to the model in a single pass, i.e.:
double_x = torch.stack([x, x], 0).view(-1, x.size(-1))   # shape: (2 * batch_size, feature_dim)
tot_logits = model(double_x).view(2, x.size(0), -1)      # split back into the two passes
logits = tot_logits[0]
logits2 = tot_logits[1]
# .....