A bunch of tricks for chasing SOTA in deep learning
2022-07-29 05:09:00 【AI Hao】
This article is reprinted from the WeChat official account: Packet algorithm notes
Stable and useful tricks
0. Model fusion
Everyone who enters competitions knows this one; it is such a well-known trick that it hardly needs an article. When models were small, stacking was still an option; directly fusing the predicted probabilities also works well.
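For reference, a minimal sketch of direct probability fusion (simple averaging), assuming a list of trained models that all output logits over the same classes; the names models and batch_input are placeholders:

import torch

def fuse_probs(models, batch_input):
    # average the softmax distributions predicted by several models
    probs = [torch.softmax(m(batch_input), dim=-1) for m in models]
    return torch.stack(probs, dim=0).mean(dim=0)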
1. Adversarial training
Adversarial training adds a perturbation at the input level and runs an extra backward pass on the perturbed sample. Take FGM as an example: in NLP, the perturbation is applied to the embedding layer. Here is a plug-and-play code snippet, quoted from Zhihu user Nicolas; it is well written and easy to follow together with the principle.
# Initialization
fgm = FGM(model)
for batch_input, batch_label in data:
    # Normal training step
    loss = model(batch_input, batch_label)
    loss.backward()  # backward pass to obtain the normal gradients
    # Adversarial training step
    fgm.attack()  # add the adversarial perturbation to the embedding
    loss_adv = model(batch_input, batch_label)
    loss_adv.backward()  # backward pass; adversarial gradients accumulate on top of the normal ones
    fgm.restore()  # restore the embedding parameters
    # Gradient descent: update the parameters
    optimizer.step()
    model.zero_grad()
The specific implementation of FGM:
import torch

class FGM():
    def __init__(self, model):
        self.model = model
        self.backup = {}

    def attack(self, epsilon=1., emb_name='emb.'):
        # Change emb_name to the parameter name of your model's embedding layer
        for name, param in self.model.named_parameters():
            if param.requires_grad and emb_name in name:
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0 and not torch.isnan(norm):
                    r_at = epsilon * param.grad / norm
                    param.data.add_(r_at)

    def restore(self, emb_name='emb.'):
        # Change emb_name to the parameter name of your model's embedding layer
        for name, param in self.model.named_parameters():
            if param.requires_grad and emb_name in name:
                assert name in self.backup
                param.data = self.backup[name]
        self.backup = {}
2.EMA/SWA
Exponential moving average: keep a history of the parameters and, after a certain stage of training, use the historical parameters to smooth the current ones. I first saw this in earhian's ancestral code; he likes to combine it with a decaying learning rate, and it really works every time.
# Initialization
ema = EMA(model, 0.999)
ema.register()

# During training: after updating the parameters, also update the shadow weights
def train():
    optimizer.step()
    ema.update()

# Before eval, apply the shadow weights; after eval, restore the original model parameters
def evaluate():
    ema.apply_shadow()
    # evaluate
    ema.restore()
The specific EMA implementation, plug and play:
class EMA():
    def __init__(self, model, decay):
        self.model = model
        self.decay = decay
        self.shadow = {}
        self.backup = {}

    def register(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                self.shadow[name] = param.data.clone()

    def update(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                assert name in self.shadow
                new_average = (1.0 - self.decay) * param.data + self.decay * self.shadow[name]
                self.shadow[name] = new_average.clone()

    def apply_shadow(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                assert name in self.shadow
                self.backup[name] = param.data
                param.data = self.shadow[name]

    def restore(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                assert name in self.backup
                param.data = self.backup[name]
        self.backup = {}
The problem with these two methods is that they slow training down, and the gains mostly matter when you are already near the top of the leaderboard, but they are plug and play.
3. R-Drop and other contrastive-learning-style methods
Slightly useful, it never makes things worse, and it is very easy to implement.
import torch.nn as nn
import torch.nn.functional as F

# Context inside the training loop
ce = nn.CrossEntropyLoss(reduction='none')
kld = nn.KLDivLoss(reduction='none')
logits1 = model(input)
logits2 = model(input)  # second forward pass; dropout samples a different sub-network
# The core contrastive part of the training loop !!!!
kl_weight = 0.5  # weight of the contrastive (KL) loss
ce_loss = (ce(logits1, target) + ce(logits2, target)) / 2
kl_1 = kld(F.log_softmax(logits1, dim=-1), F.softmax(logits2, dim=-1)).sum(-1)
kl_2 = kld(F.log_softmax(logits2, dim=-1), F.softmax(logits1, dim=-1)).sum(-1)
loss = ce_loss + kl_weight * (kl_1 + kl_2) / 2
As everyone knows, dropout is active during training, so running the same input through the model several times gives randomly different results.
If the model is robust, the same sample should give similar outputs even with dropout switched on at inference time. That is the whole principle. Described with a picture:

Drop out whatever you like; this AI stays as steady as an old dog.
The KLD loss measures the distance between two distributions, so on top of the original loss it adds a term that describes, across the two forward passes, the model's ability to resist the perturbation caused by dropout.
4.TTA
This one fits in a sentence: at test time, apply reliable data augmentations (the simpler the better), then average the predictions over the augmented copies.
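As a concrete illustration, a minimal TTA sketch for image classification using only a horizontal flip; model and images are placeholders, and other tasks would use their own augmentations:

import torch

@torch.no_grad()
def tta_predict(model, images):
    probs = torch.softmax(model(images), dim=-1)
    flipped = torch.flip(images, dims=[-1])            # horizontal flip as the augmentation
    probs_flip = torch.softmax(model(flipped), dim=-1)
    return (probs + probs_flip) / 2                    # average over the augmented views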
5. Pseudo label
Neither the code nor the principle is hard; the cost is again slower training, since there is simply more data. To spell it out: use the trained model to run inference on the test data, or on unlabeled data, turn the predictions into pseudo labels, and then take them back into training. Be careful not to leak.
If that sounds hand-wavy, here are the steps in pseudo code.
model1.fit(train_set, label, val=validation_set)          # step1: train on the labeled data
pseudo_label = model1.predict(test_set)                    # step2: predict pseudo labels for the unlabeled/test data
new_label = concat(pseudo_label, label)                    # step3: merge the labels
new_train_set = concat(test_set, train_set)                # step3: merge the data
model2.fit(new_train_set, new_label, val=validation_set)   # step4: retrain on the merged set
final_predict = model2.predict(test_set)                   # step5: final prediction
The classic diagram of this pipeline that circulates online shows the same steps.
6. Letting the neural network fill in missing values
This is a trick for tabular data fed to neural networks, and it was quickly integrated into TabNet: mask the positions of the missing values and learn a parameter for them, which is added back onto the feature input. See the implementation in the TabNet paper.
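A hedged sketch of the idea (not TabNet's actual code): replace each missing entry with a learnable per-feature fill value before the features enter the network. The module name LearnableImputer and the initialization to 1 are assumptions made here for illustration:

import torch
import torch.nn as nn

class LearnableImputer(nn.Module):
    def __init__(self, num_features):
        super().__init__()
        # one learnable fill value per feature
        self.fill = nn.Parameter(torch.ones(num_features))

    def forward(self, x):
        nan_mask = torch.isnan(x)                            # positions of missing values
        return torch.where(nan_mask, self.fill.expand_as(x), x)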
Scenario-limited tricks
Useful, but only in limited scenarios, or unstable
1. PET and other prompt-based schemes
Useful in some specific scenarios, such as zero-shot or few-shot supervised training. When there is enough data it helps in model fusion, but for a single model the gain is not guaranteed, so don't force it.
2. Focal loss
Sometimes it works, but most of the time it is not very useful; it depends on the metric. It pays off on tasks and metrics that care especially about long-tail and rare categories.
3. mixup / CutMix and other data augmentations
Data-dependent: for most data and tasks it is of little use. It helps on tasks that are sensitive to local features, such as audio classification.
4. Face-recognition-style softmax variants
Useful when the amount of data is small; of little use with huge, industry-scale data.
5. Further in-domain pre-training
Take your own dataset and run the MLM task again on top of BERT-base. The cost, again, is slower training. Thanks to Hugging Face's highly usable code, the implementation is very simple. It suits scenarios that differ considerably from the pre-training corpus, for example traditional Chinese medicine or ai4code; it is not useful on ordinary news text classification datasets.
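A minimal sketch of continued MLM pre-training with Hugging Face Transformers; the corpus path, checkpoint name and hyper-parameters below are placeholders for your own setup:

from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForMaskedLM.from_pretrained("bert-base-chinese")

# Load a plain-text domain corpus and tokenize it
raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})
tokenized = raw.map(lambda x: tokenizer(x["text"], truncation=True, max_length=128),
                    batched=True, remove_columns=["text"])

# The collator applies random masking for the MLM objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="bert-domain-mlm", num_train_epochs=3,
                         per_device_train_batch_size=32)
Trainer(model=model, args=args, data_collator=collator,
        train_dataset=tokenized["train"]).train()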
6. Turning classification into retrieval
This is the standard solution for few-shot classification, similar to the baselines in the face-recognition field. There are many loss improvements built around inter-class separability and intra-class compactness, such as aa-softmax, ArcFace, AM-Softmax, etc.
It works well in both text classification and image classification.
Tricks for breaking through performance limits
1. Mixed precision training
AMP is plug and play and takes effect immediately.
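A minimal torch.cuda.amp sketch, assuming the usual model, optimizer and data objects already exist:

import torch

scaler = torch.cuda.amp.GradScaler()
for batch_input, batch_label in data:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(batch_input, batch_label)   # forward pass runs in mixed precision
    scaler.scale(loss).backward()                # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)                       # unscale gradients, then update the parameters
    scaler.update()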
2. Gradient accumulation
Before the optimizer updates the parameters, run several forward and backward passes with the same model parameters, accumulating (summing) the gradients from each backward pass. Note that this affects the BN statistics. It can be used to break through the batch-size ceiling.
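A minimal gradient-accumulation sketch; accum_steps is an assumed hyper-parameter, and model, optimizer, data are the usual placeholders:

accum_steps = 4  # effective batch size = accum_steps * per-step batch size
optimizer.zero_grad()
for step, (batch_input, batch_label) in enumerate(data):
    loss = model(batch_input, batch_label) / accum_steps  # scale so the accumulated gradient matches a big batch
    loss.backward()                                       # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()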
3. Queue or memory bank
It lets the effective batch size go through the roof; you can refer to the implementation MoCo uses for contrastive learning.
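A minimal MoCo-style feature queue sketch (not MoCo's actual code); feat_dim and queue_size are assumed values. Negatives for the contrastive loss are then drawn from self.queue instead of the current batch, which decouples the number of negatives from the batch size:

import torch
import torch.nn.functional as F

class FeatureQueue:
    def __init__(self, feat_dim=128, queue_size=65536):
        self.queue = F.normalize(torch.randn(queue_size, feat_dim), dim=1)
        self.queue_size = queue_size
        self.ptr = 0

    @torch.no_grad()
    def enqueue_dequeue(self, keys):
        # keys: (batch, feat_dim) detached features from the momentum encoder
        batch = keys.shape[0]
        idx = torch.arange(self.ptr, self.ptr + batch) % self.queue_size
        self.queue[idx] = keys                      # newest features overwrite the oldest ones
        self.ptr = (self.ptr + batch) % self.queue_size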
4. Skip unnecessary synchronization
In multi-GPU DDP training, when gradient accumulation is used, no_sync can be used to skip unnecessary gradient synchronization and speed things up.
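A minimal sketch combining no_sync with gradient accumulation; ddp_model is assumed to be a torch.nn.parallel.DistributedDataParallel instance, and accum_steps is an assumed hyper-parameter:

accum_steps = 4
optimizer.zero_grad()
for step, (batch_input, batch_label) in enumerate(data):
    if (step + 1) % accum_steps != 0:
        with ddp_model.no_sync():                 # skip the gradient all-reduce on this step
            loss = ddp_model(batch_input, batch_label) / accum_steps
            loss.backward()
    else:
        loss = ddp_model(batch_input, batch_label) / accum_steps
        loss.backward()                           # gradients are synchronized across GPUs here
        optimizer.step()
        optimizer.zero_grad()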