A bunch of tricks for chasing SOTA in deep learning
2022-07-29 05:09:00 【AI Hao】
This article is reprinted from the WeChat official account: Packet algorithm notes
Stable and useful tricks
0. Model fusion
Everyone who enters competitions knows this one; it is such a well-known trick that it hardly needs an article. When models were small, stacking was still an option; directly fusing the predicted probabilities also works well.
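For reference, a minimal sketch of direct probability fusion (simple averaging), assuming a list of trained models that all output logits over the same classes; the names models and batch_input are placeholders:

import torch

def fuse_probs(models, batch_input):
    # average the softmax distributions predicted by several models
    probs = [torch.softmax(m(batch_input), dim=-1) for m in models]
    return torch.stack(probs, dim=0).mean(dim=0)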
1. Adversarial training
Adversarial training adds a perturbation at the input level and runs an extra backward pass on the perturbed sample. Take FGM as an example: in NLP, the perturbation is applied to the embedding layer. Here is a plug-and-play code snippet, quoted from Zhihu user Nicolas; it is well written and easy to follow together with the principle.
# Initialization
fgm = FGM(model)
for batch_input, batch_label in data:
    # Normal training step
    loss = model(batch_input, batch_label)
    loss.backward()  # backward pass to obtain the normal gradients
    # Adversarial training step
    fgm.attack()  # add the adversarial perturbation to the embedding
    loss_adv = model(batch_input, batch_label)
    loss_adv.backward()  # backward pass; adversarial gradients accumulate on top of the normal ones
    fgm.restore()  # restore the embedding parameters
    # Gradient descent: update the parameters
    optimizer.step()
    model.zero_grad()
The specific implementation of FGM:
import torch

class FGM():
    def __init__(self, model):
        self.model = model
        self.backup = {}

    def attack(self, epsilon=1., emb_name='emb.'):
        # Change emb_name to the parameter name of your model's embedding layer
        for name, param in self.model.named_parameters():
            if param.requires_grad and emb_name in name:
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0 and not torch.isnan(norm):
                    r_at = epsilon * param.grad / norm
                    param.data.add_(r_at)

    def restore(self, emb_name='emb.'):
        # Change emb_name to the parameter name of your model's embedding layer
        for name, param in self.model.named_parameters():
            if param.requires_grad and emb_name in name:
                assert name in self.backup
                param.data = self.backup[name]
        self.backup = {}
2.EMA/SWA
Exponential moving average: keep a history of the parameters and, after a certain stage of training, use the historical parameters to smooth the current ones. I first saw this in earhian's ancestral code; he likes to combine it with a decaying learning rate, and it really works every time.
# Initialization
ema = EMA(model, 0.999)
ema.register()

# During training: after updating the parameters, also update the shadow weights
def train():
    optimizer.step()
    ema.update()

# Before eval, apply the shadow weights; after eval, restore the original model parameters
def evaluate():
    ema.apply_shadow()
    # evaluate
    ema.restore()
The specific EMA implementation, plug and play:
class EMA():
    def __init__(self, model, decay):
        self.model = model
        self.decay = decay
        self.shadow = {}
        self.backup = {}

    def register(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                self.shadow[name] = param.data.clone()

    def update(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                assert name in self.shadow
                new_average = (1.0 - self.decay) * param.data + self.decay * self.shadow[name]
                self.shadow[name] = new_average.clone()

    def apply_shadow(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                assert name in self.shadow
                self.backup[name] = param.data
                param.data = self.shadow[name]

    def restore(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad:
                assert name in self.backup
                param.data = self.backup[name]
        self.backup = {}
The problem with these two methods is that they slow training down, and the gains mostly matter when you are already near the top of the leaderboard, but they are plug and play.
3. R-Drop and other contrastive-learning-style methods
Slightly useful, it never makes things worse, and it is very easy to implement.
import torch.nn as nn
import torch.nn.functional as F

# Context inside the training loop
ce = nn.CrossEntropyLoss(reduction='none')
kld = nn.KLDivLoss(reduction='none')
logits1 = model(input)
logits2 = model(input)  # second forward pass; dropout samples a different sub-network
# The core contrastive part of the training loop !!!!
kl_weight = 0.5  # weight of the contrastive (KL) loss
ce_loss = (ce(logits1, target) + ce(logits2, target)) / 2
kl_1 = kld(F.log_softmax(logits1, dim=-1), F.softmax(logits2, dim=-1)).sum(-1)
kl_2 = kld(F.log_softmax(logits2, dim=-1), F.softmax(logits1, dim=-1)).sum(-1)
loss = ce_loss + kl_weight * (kl_1 + kl_2) / 2
As everyone knows, dropout is active during training, so running the same input through the model several times gives randomly different results.
If the model is robust, the same sample should give similar outputs even with dropout switched on at inference time. That is the whole principle. Described with a picture:

Drop out whatever you like; this AI stays as steady as an old dog.
The KLD loss measures the distance between two distributions, so on top of the original loss it adds a term that describes, across the two forward passes, the model's ability to resist the perturbation caused by dropout.
4.TTA
This one fits in a sentence: at test time, apply reliable data augmentations (the simpler the better), then average the predictions over the augmented copies.
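As a concrete illustration, a minimal TTA sketch for image classification using only a horizontal flip; model and images are placeholders, and other tasks would use their own augmentations:

import torch

@torch.no_grad()
def tta_predict(model, images):
    probs = torch.softmax(model(images), dim=-1)
    flipped = torch.flip(images, dims=[-1])            # horizontal flip as the augmentation
    probs_flip = torch.softmax(model(flipped), dim=-1)
    return (probs + probs_flip) / 2                    # average over the augmented views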
5. Pseudo label
Neither the code nor the principle is hard; the cost is again slower training, since there is simply more data. To spell it out: use the trained model to run inference on the test data, or on unlabeled data, turn the predictions into pseudo labels, and then take them back into training. Be careful not to leak.
If that sounds hand-wavy, here are the steps in pseudo code.
model1.fit(train_set, label, val=validation_set)          # step1: train on the labeled data
pseudo_label = model1.predict(test_set)                    # step2: predict pseudo labels for the unlabeled/test data
new_label = concat(pseudo_label, label)                    # step3: merge the labels
new_train_set = concat(test_set, train_set)                # step3: merge the data
model2.fit(new_train_set, new_label, val=validation_set)   # step4: retrain on the merged set
final_predict = model2.predict(test_set)                   # step5: final prediction
The classic diagram of this pipeline that circulates online shows the same steps.
6. Letting the neural network fill in missing values
This is a trick for tabular data fed to neural networks, and it was quickly integrated into TabNet: mask the positions of the missing values and learn a parameter for them, which is added back onto the feature input. See the implementation in the TabNet paper.
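A hedged sketch of the idea (not TabNet's actual code): replace each missing entry with a learnable per-feature fill value before the features enter the network. The module name LearnableImputer and the initialization to 1 are assumptions made here for illustration:

import torch
import torch.nn as nn

class LearnableImputer(nn.Module):
    def __init__(self, num_features):
        super().__init__()
        # one learnable fill value per feature
        self.fill = nn.Parameter(torch.ones(num_features))

    def forward(self, x):
        nan_mask = torch.isnan(x)                            # positions of missing values
        return torch.where(nan_mask, self.fill.expand_as(x), x)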
Scenario-limited tricks
Useful, but only in limited scenarios, or unstable
1. PET and other prompt-based schemes
Useful in some specific scenarios, such as zero-shot or few-shot supervised training. When there is enough data it helps in model fusion, but for a single model the gain is not guaranteed, so don't force it.
2. Focal loss
Sometimes it works, but most of the time it is not very useful; it depends on the metric. It pays off on tasks and metrics that care especially about long-tail and rare categories.
3. mixup / CutMix and other data augmentations
Data-dependent: for most data and tasks it is of little use. It helps on tasks that are sensitive to local features, such as audio classification.
4. Face-recognition-style softmax variants
Useful when the amount of data is small; of little use with huge, industry-scale data.
5. Further in-domain pre-training
Take your own dataset and run the MLM task again on top of BERT-base. The cost, again, is slower training. Thanks to Hugging Face's highly usable code, the implementation is very simple. It suits scenarios that differ considerably from the pre-training corpus, for example traditional Chinese medicine or ai4code; it is not useful on ordinary news text classification datasets.
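A minimal sketch of continued MLM pre-training with Hugging Face Transformers; the corpus path, checkpoint name and hyper-parameters below are placeholders for your own setup:

from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForMaskedLM.from_pretrained("bert-base-chinese")

# Load a plain-text domain corpus and tokenize it
raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})
tokenized = raw.map(lambda x: tokenizer(x["text"], truncation=True, max_length=128),
                    batched=True, remove_columns=["text"])

# The collator applies random masking for the MLM objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="bert-domain-mlm", num_train_epochs=3,
                         per_device_train_batch_size=32)
Trainer(model=model, args=args, data_collator=collator,
        train_dataset=tokenized["train"]).train()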
6. Turning classification into retrieval
This is the standard solution for few-shot classification, similar to the baselines in the face-recognition field. There are many loss improvements built around inter-class separability and intra-class compactness, such as aa-softmax, ArcFace, AM-Softmax, etc.
It works well in both text classification and image classification.
Tricks for breaking through performance limits
1. Mixed precision training
AMP is plug and play and takes effect immediately.
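A minimal torch.cuda.amp sketch, assuming the usual model, optimizer and data objects already exist:

import torch

scaler = torch.cuda.amp.GradScaler()
for batch_input, batch_label in data:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(batch_input, batch_label)   # forward pass runs in mixed precision
    scaler.scale(loss).backward()                # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)                       # unscale gradients, then update the parameters
    scaler.update()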
2. Gradient accumulation
Before the optimizer updates the parameters, run several forward and backward passes with the same model parameters, accumulating (summing) the gradients from each backward pass. Note that this affects the BN statistics. It can be used to break through the batch-size ceiling.
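A minimal gradient-accumulation sketch; accum_steps is an assumed hyper-parameter, and model, optimizer, data are the usual placeholders:

accum_steps = 4  # effective batch size = accum_steps * per-step batch size
optimizer.zero_grad()
for step, (batch_input, batch_label) in enumerate(data):
    loss = model(batch_input, batch_label) / accum_steps  # scale so the accumulated gradient matches a big batch
    loss.backward()                                       # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()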
3. Queue or memory bank
It lets the effective batch size go through the roof; you can refer to the implementation MoCo uses for contrastive learning.
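A minimal MoCo-style feature queue sketch (not MoCo's actual code); feat_dim and queue_size are assumed values. Negatives for the contrastive loss are then drawn from self.queue instead of the current batch, which decouples the number of negatives from the batch size:

import torch
import torch.nn.functional as F

class FeatureQueue:
    def __init__(self, feat_dim=128, queue_size=65536):
        self.queue = F.normalize(torch.randn(queue_size, feat_dim), dim=1)
        self.queue_size = queue_size
        self.ptr = 0

    @torch.no_grad()
    def enqueue_dequeue(self, keys):
        # keys: (batch, feat_dim) detached features from the momentum encoder
        batch = keys.shape[0]
        idx = torch.arange(self.ptr, self.ptr + batch) % self.queue_size
        self.queue[idx] = keys                      # newest features overwrite the oldest ones
        self.ptr = (self.ptr + batch) % self.queue_size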
4. Skip unnecessary synchronization
In multi-GPU DDP training, when gradient accumulation is used, no_sync can be used to skip unnecessary gradient synchronization and speed things up.
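A minimal sketch combining no_sync with gradient accumulation; ddp_model is assumed to be a torch.nn.parallel.DistributedDataParallel instance, and accum_steps is an assumed hyper-parameter:

accum_steps = 4
optimizer.zero_grad()
for step, (batch_input, batch_label) in enumerate(data):
    if (step + 1) % accum_steps != 0:
        with ddp_model.no_sync():                 # skip the gradient all-reduce on this step
            loss = ddp_model(batch_input, batch_label) / accum_steps
            loss.backward()
    else:
        loss = ddp_model(batch_input, batch_label) / accum_steps
        loss.backward()                           # gradients are synchronized across GPUs here
        optimizer.step()
        optimizer.zero_grad()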