[Model Distillation] TinyBERT: Distilling BERT for Natural Language Understanding
2022-07-02 07:22:00 【lwgkzl】
Executive summary
TinyBERT mainly explores how to compress BERT through model distillation.
It contains two main innovations:
- Distilling the Transformer parameters: specifically the embeddings, the attention weights, the hidden states output by the fully connected layers, and finally the logits.
- For a pre-trained language model, distillation is split into two stages: pre-training (general) distillation and task-specific distillation. In the first stage the student learns the teacher's pre-trained parameters, giving the compressed model a good initialization; in the second stage the student learns the logits of the fine-tuned teacher model.
Model
The approach is mainly divided into three parts:
1. Transformer-layer distillation
Two things are distilled here: the attention weights of each layer and the hidden states output by each layer.
The formulas:
L_attn = (1/h) · Σ_{i=1..h} MSE(A_i^S, A_i^T)
L_hidn = MSE(H^S · W_h, H^T)
Mean squared error is used as the loss function. For the hidden states a projection matrix W_h is introduced, because the hidden dimensions of the student and the teacher differ (the student's hidden dimension is smaller).
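A minimal PyTorch sketch of these two losses; the tensor shapes, variable names, and the standalone W_h linear layer are illustrative assumptions rather than the paper's actual code:

import torch
import torch.nn as nn

mse = nn.MSELoss()

# Illustrative shapes: batch=8, heads=12, seq_len=128; student hidden=312, teacher hidden=768.
student_att = torch.randn(8, 12, 128, 128)   # student attention matrices of one layer
teacher_att = torch.randn(8, 12, 128, 128)   # teacher attention matrices of the mapped layer
student_hid = torch.randn(8, 128, 312)       # student hidden states of that layer
teacher_hid = torch.randn(8, 128, 768)       # teacher hidden states of the mapped layer

# Attention distillation: plain MSE between the attention matrices.
attn_loss = mse(student_att, teacher_att)

# Hidden-state distillation: project the student's smaller hidden size up to the
# teacher's dimension with a learnable W_h before comparing.
W_h = nn.Linear(312, 768, bias=False)
hidn_loss = mse(W_h(student_hid), teacher_hid)

layer_loss = attn_loss + hidn_loss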
2. Embedding-layer distillation
L_embd = MSE(E^S · W_e, E^T), where E denotes the output of the embedding layer and W_e, like W_h above, projects the student's embeddings to the teacher's dimension.
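A similarly hedged sketch for the embedding loss, mirroring the hidden-state projection above (W_e and the dimensions are again illustrative):

import torch
import torch.nn as nn

mse = nn.MSELoss()
student_emb = torch.randn(8, 128, 312)   # student embedding-layer output E^S
teacher_emb = torch.randn(8, 128, 768)   # teacher embedding-layer output E^T

# Project the student embeddings up to the teacher's dimension before the MSE.
W_e = nn.Linear(312, 768, bias=False)
emb_loss = mse(W_e(student_emb), teacher_emb)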
3. Prediction-layer (logits) distillation
L_pred = CE(z^T / t, z^S / t), where z^T and z^S are the logits predicted by the teacher and the student on the task-specific task, and t is the temperature.
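A minimal sketch of this temperature-scaled soft cross entropy; it plays the same role as the soft_cross_entropy helper called in the code further below, though the actual implementation in the TinyBERT repository may differ in details:

import torch
import torch.nn.functional as F

def soft_cross_entropy(predicts, targets):
    # Cross entropy between the teacher's soft targets and the student's predicted distribution.
    student_log_probs = F.log_softmax(predicts, dim=-1)
    teacher_probs = F.softmax(targets, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

# Usage: both sets of logits are divided by the temperature t before taking the loss.
t = 1.0
student_logits = torch.randn(8, 2)
teacher_logits = torch.randn(8, 2)
pred_loss = soft_cross_entropy(student_logits / t, teacher_logits / t)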
One more detail is data augmentation: when the student model is fine-tuned on the task-specific task, TinyBERT augments the original dataset. (ps: this is actually a bit odd, because the later experiments show that once data augmentation is removed, the model does not improve much over the previous SOTA, while the main selling point of the paper is model distillation...)
Experiments and conclusions
The importance of distillation at each level
The ablation shows that, in terms of importance, Attn > Pred logits > Hidn > Emb. Moreover, Attn, Hidn and Emb are useful in both stages of distillation.

The importance of data augmentation
- GD (General Distillation): the first-stage distillation.
- TD (Task-specific Distillation): the second-stage distillation.
- DA (Data Augmentation): data augmentation.
The conclusion of this table is that data augmentation matters.

Which layers of the teacher model should the student model learn from?
Suppose the student model has 4 layers and the teacher model has 12 layers.
Top means the student learns from the teacher's last 4 layers (9, 10, 11, 12); bottom means it learns from the teacher's first 4 layers (1, 2, 3, 4); uniform means it learns from evenly spaced layers (3, 6, 9, 12).
The results show that learning from evenly spaced layers works best.
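As a hedged sketch, the three mapping strategies can be written as the 1-based teacher-layer indices that each student layer learns from (map_layers is an illustrative helper, not code from the paper):

def map_layers(student_layers, teacher_layers, strategy):
    # Returns, for each student layer, the 1-based index of the teacher layer it learns from.
    if strategy == "bottom":
        return list(range(1, student_layers + 1))                                     # [1, 2, 3, 4]
    if strategy == "top":
        return list(range(teacher_layers - student_layers + 1, teacher_layers + 1))   # [9, 10, 11, 12]
    if strategy == "uniform":
        step = teacher_layers // student_layers
        return [step * (i + 1) for i in range(student_layers)]                         # [3, 6, 9, 12]
    raise ValueError("unknown strategy: %s" % strategy)

print(map_layers(4, 12, "uniform"))  # [3, 6, 9, 12]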
Code
# This part of the code lives inside the trainer, right before loss.backward().
# att_loss, rep_loss, tr_att_loss, tr_rep_loss and tr_cls_loss are initialized to 0.
# earlier in the training loop, and loss_mse = MSELoss() is defined beforehand.

# Get the student model's logits, attention weights and hidden states.
student_logits, student_atts, student_reps = student_model(input_ids, segment_ids, input_mask,
                                                           is_student=True)
# Get the teacher model's logits, attention weights and hidden states (no gradients needed).
with torch.no_grad():
    teacher_logits, teacher_atts, teacher_reps = teacher_model(input_ids, segment_ids, input_mask)

# Distillation is done in two steps. The first step learns the attention weights and hidden
# states; the second step learns the prediction logits. The general idea is to compute a loss
# between the student's and the teacher's outputs: MSE for attention weights and hidden states,
# (soft) cross entropy for the logits.
if not args.pred_distill:
    # Step 1: intermediate-layer (attention + hidden state) distillation.
    teacher_layer_num = len(teacher_atts)
    student_layer_num = len(student_atts)
    assert teacher_layer_num % student_layer_num == 0
    layers_per_block = int(teacher_layer_num / student_layer_num)
    # Uniform mapping: student layer i learns from teacher layer (i + 1) * layers_per_block.
    new_teacher_atts = [teacher_atts[i * layers_per_block + layers_per_block - 1]
                        for i in range(student_layer_num)]

    for student_att, teacher_att in zip(student_atts, new_teacher_atts):
        # Zero out the large negative values written by the attention mask before taking the MSE.
        student_att = torch.where(student_att <= -1e2, torch.zeros_like(student_att).to(device),
                                  student_att)
        teacher_att = torch.where(teacher_att <= -1e2, torch.zeros_like(teacher_att).to(device),
                                  teacher_att)
        tmp_loss = loss_mse(student_att, teacher_att)
        att_loss += tmp_loss

    # Hidden states include the embedding output, hence student_layer_num + 1 entries.
    new_teacher_reps = [teacher_reps[i * layers_per_block] for i in range(student_layer_num + 1)]
    new_student_reps = student_reps
    for student_rep, teacher_rep in zip(new_student_reps, new_teacher_reps):
        tmp_loss = loss_mse(student_rep, teacher_rep)
        rep_loss += tmp_loss

    loss = rep_loss + att_loss
    tr_att_loss += att_loss.item()
    tr_rep_loss += rep_loss.item()
else:
    # Step 2: prediction-layer (logits) distillation.
    if output_mode == "classification":
        # Temperature-scaled soft cross entropy between student and teacher logits.
        cls_loss = soft_cross_entropy(student_logits / args.temperature,
                                      teacher_logits / args.temperature)
    elif output_mode == "regression":
        loss_mse = MSELoss()
        cls_loss = loss_mse(student_logits.view(-1), label_ids.view(-1))

    loss = cls_loss
    tr_cls_loss += cls_loss.item()