[Paper Introduction] R-Drop: Regularized Dropout for Neural Networks
2022-07-02 07:22:00 【lwgkzl】
Summary
The starting point of this paper is that conventional dropout makes the model behave inconsistently between training and testing.
Based on this observation, the paper proposes R-Drop to address the problem.
Experiments show that R-Drop is effective on multiple datasets (all with small improvements).
Thoughts on Dropout
First, why does conventional dropout introduce an inconsistency between training and testing? During training, dropout randomly masks some of the model's units and fits the data with the remaining sub-network (to prevent overfitting). Because the mask is resampled for every batch, different batches are effectively processed by different sub-networks, so the whole training process can be viewed as ensemble learning over many different sub-networks. At test time, however, no units are masked, so a single complete model makes the predictions on the test set; hence the inconsistency.
In short: during training we learn sub-models, but at test time we predict with the full model.
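As a quick illustration of this gap (a minimal sketch, not code from the paper), PyTorch's nn.Dropout is stochastic in training mode and a no-op in evaluation mode:

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()
print(drop(x))   # random mask: roughly half the entries become 0, survivors are scaled by 1/(1-p) = 2.0
print(drop(x))   # a different random mask, i.e. effectively a different sub-network

drop.eval()
print(drop(x))   # all ones: at test time nothing is masked and the full network is used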
The authors take a somewhat unexpected angle. The intuitive fix would be to find some way to shrink the gap between the sub-models and the full model. The idea in this paper is less direct: if all sub-models produce the same output for the same input, then the output of the full model should also be close to the outputs of the sub-models. The optimization objective is therefore that, for the same input, passed through the same architecture but with different dropout masks, the outputs should be consistent.
Introduction to R-Drop
The main idea was outlined in the previous section, and the figure in the paper illustrates it well. As shown on the right of the figure, the same input x is passed through the same Transformer encoder twice; because the two passes apply different dropout masks, they produce two output distributions, P1(y|x) and P2(y|x). R-Drop requires these two outputs to be as consistent as possible, so the KL divergence between them is used as one of the model's loss terms.
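Written out (following the formulation in the paper, with α as the weight of the consistency term), the per-example training objective combines the cross-entropy of both forward passes with a symmetric KL term:

$$
\mathcal{L}_i = -\log P_1(y_i \mid x_i) - \log P_2(y_i \mid x_i)
+ \frac{\alpha}{2}\Big[ D_{\mathrm{KL}}\big(P_1(\cdot \mid x_i)\,\Vert\,P_2(\cdot \mid x_i)\big) + D_{\mathrm{KL}}\big(P_2(\cdot \mid x_i)\,\Vert\,P_1(\cdot \mid x_i)\big) \Big]
$$

The code in the Code section below computes these same two pieces: the averaged cross-entropy becomes ce_loss and the bracketed symmetric KL becomes kl_loss (up to how the constant factors are folded into α).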
Experiments and Conclusions
The standard experiments show that R-Drop gives a small improvement (roughly 1%–2%) on all 18 datasets.
The ablation studies are more interesting; they test several ideas.
Idea 1:
Instead of computing the KL divergence only between the outputs of two dropout passes, the consistency regularization can be applied across more than two forward passes at the same time.
Conclusion: the authors ran three dropout passes simultaneously; the result is slightly better than with two passes, but the gain is not meaningful.
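A rough sketch of what this could look like (my own assumed implementation, not the authors' released code): run k forward passes with dropout enabled and average the symmetric KL over all pairs of outputs, reusing the compute_kl_loss helper shown in the Code section below.

import itertools

def r_drop_multi_kl(model, x, k=3):
    # k forward passes over the same input -> k different dropout masks
    outputs = [model(x) for _ in range(k)]
    pairs = list(itertools.combinations(range(k), 2))
    # average the pairwise symmetric KL (compute_kl_loss is defined in the Code section)
    kl = sum(compute_kl_loss(outputs[i], outputs[j]) for i, j in pairs) / len(pairs)
    return outputs, kl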
Idea 2: the two forward passes do not have to use the same dropout probability, so one can try masking with different probabilities. The paper reports a grid of results over combinations of the two probabilities. Conclusion: as long as both dropout probabilities lie between 0.3 and 0.5, the results are very similar.
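A hedged sketch of how the two passes could be given different dropout rates (an illustrative assumption, not the paper's code): overwrite the p attribute of every nn.Dropout module before each forward pass, then compare the two outputs as usual.

import torch.nn as nn

def set_dropout_p(model, p):
    # set the dropout probability of every nn.Dropout submodule
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.p = p

set_dropout_p(model, 0.3)
logits = model(x)        # first pass with dropout rate 0.3
set_dropout_p(model, 0.5)
logits2 = model(x)       # second pass with dropout rate 0.5
kl_loss = compute_kl_loss(logits, logits2)   # helper from the Code section below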
Code
import torch
import torch.nn.functional as F

# define your task model, which outputs the classifier logits
model = TaskModel()

def compute_kl_loss(p, q, pad_mask=None):
    # symmetric KL between the two output distributions (p and q are logits)
    p_loss = F.kl_div(F.log_softmax(p, dim=-1), F.softmax(q, dim=-1), reduction='none')
    q_loss = F.kl_div(F.log_softmax(q, dim=-1), F.softmax(p, dim=-1), reduction='none')

    # pad_mask is for seq-level tasks
    if pad_mask is not None:
        p_loss.masked_fill_(pad_mask, 0.)
        q_loss.masked_fill_(pad_mask, 0.)

    # you can choose "sum" or "mean" depending on your task
    p_loss = p_loss.sum()
    q_loss = q_loss.sum()

    loss = (p_loss + q_loss) / 2
    return loss

# keep dropout enabled (model in train mode) and forward the same batch twice
logits = model(x)
logits2 = model(x)

# cross-entropy loss for the classifier, averaged over the two passes
ce_loss = 0.5 * (F.cross_entropy(logits, label) + F.cross_entropy(logits2, label))

kl_loss = compute_kl_loss(logits, logits2)

# carefully choose the hyper-parameter alpha (the weight of the KL term)
loss = ce_loss + alpha * kl_loss
In practice, instead of forwarding x through the model twice, you can duplicate x, stack the two copies into one batch, and feed that to the model in a single pass, i.e.:
double_x = torch.stack([x, x], 0).view(-1, x.size(-1))   # shape: (2 * batch_size, feature_dim)
tot_logits = model(double_x).view(2, x.size(0), -1)      # split back into the two passes
logits = tot_logits[0]
logits2 = tot_logits[1]
# .....