[Model Distillation] TinyBERT: Distilling BERT for Natural Language Understanding
2022-07-02 07:22:00 【lwgkzl】
Executive summary
TinyBERT mainly explores how to use model distillation to compress the BERT model.
It makes two main contributions:
- It distills the Transformer's parameters, paying attention to the embedding output, the attention weights (attention_weight), the hidden states after the fully connected layers (hidden), and finally the logits.
- For a pre-trained language model, distillation is split into pretrain-model distillation and task-specific distillation. The first stage learns the pre-trained model's parameters so as to give the compressed model a good initialization; the second stage lets the compressed model learn the logits of the fine-tuned pre-trained model. (A minimal sketch of the two-stage pipeline follows the list.)
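Below is a minimal, hypothetical sketch of the two-stage pipeline. `distill_step` and `logit_distill_step` are stand-in helpers (not from the paper or its code) for the intermediate-layer losses and the prediction-layer loss described later:

```python
# Illustration only: the two stages of TinyBERT distillation.

# Stage 1: general distillation on a large unlabeled corpus. The student
# mimics the pre-trained teacher's embeddings, attention weights, and hidden
# states, which gives the compressed model a good initialization.
for batch in general_corpus:                    # assumed data iterator
    loss = distill_step(student, pretrained_teacher, batch)
    loss.backward(); optimizer.step(); optimizer.zero_grad()

# Stage 2: task-specific distillation on (augmented) task data. The student
# additionally mimics the fine-tuned teacher's prediction logits.
for batch in augmented_task_data:               # assumed data iterator
    loss = logit_distill_step(student, finetuned_teacher, batch)
    loss.backward(); optimizer.step(); optimizer.zero_grad()
```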
Model
The distillation is divided into three parts:
1. Transformer-layer distillation

Two things are distilled in each layer: the attention weights and the output hidden states.
The losses:

$$\mathcal{L}_{\mathrm{attn}} = \frac{1}{h}\sum_{i=1}^{h}\mathrm{MSE}\big(A_i^{S}, A_i^{T}\big), \qquad \mathcal{L}_{\mathrm{hidn}} = \mathrm{MSE}\big(H^{S} W_h,\; H^{T}\big)$$

where $h$ is the number of attention heads, $A_i$ is the $i$-th head's attention matrix, and $H$ is the layer's hidden-state output.
Mean squared error is used as the loss function, and for the hidden states a learnable projection $W_h$ is introduced, because the vector dimensions of the student and teacher models differ (the student's hidden dimension is smaller). A sketch of this projected loss follows.
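A minimal sketch of the projected hidden-state loss, using the TinyBERT_4 dimensions (student hidden size 312, teacher BERT-base hidden size 768); the batch and sequence shapes are hypothetical, and $W_h$ is a plain linear layer trained together with the student:

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()
W_h = nn.Linear(312, 768, bias=False)        # learnable projection, trained with the student

student_hidden = torch.randn(8, 128, 312)    # (batch, seq_len, d_student)
teacher_hidden = torch.randn(8, 128, 768)    # (batch, seq_len, d_teacher)

# Project the student's hidden states into the teacher's space, then take MSE.
hidden_loss = mse(W_h(student_hidden), teacher_hidden)
```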
2. Embedding-layer distillation

$$\mathcal{L}_{\mathrm{embd}} = \mathrm{MSE}\big(E^{S} W_e,\; E^{T}\big)$$

where $E$ denotes the embedding-layer output and $W_e$ is another learnable projection. A short sketch follows.
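Continuing the previous sketch, the embedding loss only differs in which tensors it compares (same assumed dimensions):

```python
W_e = nn.Linear(312, 768, bias=False)        # second learnable projection, for embeddings
student_emb = torch.randn(8, 128, 312)       # embedding-layer outputs (hypothetical shapes)
teacher_emb = torch.randn(8, 128, 768)
emb_loss = mse(W_e(student_emb), teacher_emb)
```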
3. Prediction-logits distillation

$$\mathcal{L}_{\mathrm{pred}} = \mathrm{CE}\big(z^{T}/t,\; z^{S}/t\big)$$

where $z^{T}$ and $z^{S}$ are the logits the teacher and the student predict on the task-specific task, and $t$ is the temperature. A sketch of the soft cross entropy follows.
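One common definition of the soft cross entropy between temperature-scaled logits (the code section at the end calls a helper with this name; shapes and the temperature value here are illustrative):

```python
import torch
import torch.nn.functional as F

def soft_cross_entropy(predicts, targets):
    # Cross entropy between the teacher's softened distribution (targets)
    # and the student's log-probabilities (predicts), averaged over the batch.
    student_log_prob = F.log_softmax(predicts, dim=-1)
    teacher_prob = F.softmax(targets, dim=-1)
    return (-teacher_prob * student_log_prob).sum(dim=-1).mean()

t = 1.0                                          # temperature (illustrative)
student_logits = torch.randn(8, 2)               # (batch, num_labels), hypothetical
teacher_logits = torch.randn(8, 2)
pred_loss = soft_cross_entropy(student_logits / t, teacher_logits / t)
```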
One more detail is data augmentation: when the student model is fine-tuned on the task-specific task, TinyBERT augments the original training set; a simplified sketch follows. (Aside: this is actually rather odd, because the later experiments show that once data augmentation is removed, the model does not improve much over the previous SOTA, while the paper's main selling point is the model distillation itself.)
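To illustrate the augmentation idea (the paper replaces words with masked-LM predictions, falling back to GloVe nearest neighbors for multi-subword words), here is a rough, simplified sketch; the model name, replacement probability, and candidate count are assumptions, not the paper's exact settings:

```python
import random
from transformers import pipeline  # assumes HuggingFace transformers is installed

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def augment(sentence, p_replace=0.4, top_k=5):
    """Randomly replace words with masked-LM predictions (simplified TinyBERT-style DA)."""
    words = sentence.split()
    out = []
    for i, word in enumerate(words):
        if random.random() < p_replace:
            # Mask the current word and let BERT propose replacements.
            masked = " ".join(words[:i] + [fill_mask.tokenizer.mask_token] + words[i + 1:])
            candidates = fill_mask(masked, top_k=top_k)
            out.append(random.choice(candidates)["token_str"])
        else:
            out.append(word)
    return " ".join(out)

print(augment("the movie was surprisingly good"))
```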
Experiments and conclusions
The importance of each distillation objective

(results table omitted)

Ranked by importance: Attn > Pred logits > Hidn > Emb. Moreover, Attn, Hidn, and Emb are all useful in both stages of distillation.

The importance of data augmentation

(results table omitted)
- GD (General Distillation): the first-stage distillation.
- TD (Task-specific Distillation): the second-stage distillation.
- DA (Data Augmentation): the data augmentation step.

The conclusion of this table is that data augmentation matters.

Which layers of the teacher model does the student model need to learn?

(results table omitted)
Suppose the student model has 4 layers and the teacher model has 12. "top" means the student learns from the teacher's last 4 layers (9, 10, 11, 12), "bottom" means the first 4 layers (1, 2, 3, 4), and "uniform" means equally spaced layers (3, 6, 9, 12). As the results show, learning evenly across the teacher's depth works best; a sketch of the three mapping strategies follows.
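A minimal sketch of the three mapping strategies (the function name and signature are my own, for illustration):

```python
def layer_map(student_layers, teacher_layers, strategy="uniform"):
    """Return the 1-indexed teacher layers each student layer learns from."""
    if strategy == "top":
        return list(range(teacher_layers - student_layers + 1, teacher_layers + 1))
    if strategy == "bottom":
        return list(range(1, student_layers + 1))
    # uniform: equally spaced, g(m) = m * (teacher_layers // student_layers)
    step = teacher_layers // student_layers
    return [m * step for m in range(1, student_layers + 1)]

print(layer_map(4, 12, "top"))      # [9, 10, 11, 12]
print(layer_map(4, 12, "bottom"))   # [1, 2, 3, 4]
print(layer_map(4, 12, "uniform"))  # [3, 6, 9, 12]
```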
Code
```python
# This code lives inside the Trainer, before loss.backward().
# Get the student model's logits, attention weights, and hidden states.
student_logits, student_atts, student_reps = student_model(input_ids, segment_ids, input_mask,
                                                           is_student=True)
# Get the teacher model's logits, attention weights, and hidden states (no gradients needed).
with torch.no_grad():
    teacher_logits, teacher_atts, teacher_reps = teacher_model(input_ids, segment_ids, input_mask)

# Training is split into two steps: one step learns attention_weight and hidden,
# the other learns predict_logits. The general idea is to compute a loss between
# the student's and the teacher's outputs: MSE for attention_weight and hidden,
# cross entropy for the logits.
if not args.pred_distill:
    teacher_layer_num = len(teacher_atts)
    student_layer_num = len(student_atts)
    assert teacher_layer_num % student_layer_num == 0
    layers_per_block = int(teacher_layer_num / student_layer_num)
    # Uniform mapping: each student layer learns from the last teacher layer of its block.
    new_teacher_atts = [teacher_atts[i * layers_per_block + layers_per_block - 1]
                        for i in range(student_layer_num)]
    for student_att, teacher_att in zip(student_atts, new_teacher_atts):
        # Replace the large negative values that mask padded positions with zeros.
        student_att = torch.where(student_att <= -1e2, torch.zeros_like(student_att).to(device),
                                  student_att)
        teacher_att = torch.where(teacher_att <= -1e2, torch.zeros_like(teacher_att).to(device),
                                  teacher_att)
        tmp_loss = loss_mse(student_att, teacher_att)
        att_loss += tmp_loss
    # The hidden-state lists include the embedding output, hence student_layer_num + 1 entries.
    new_teacher_reps = [teacher_reps[i * layers_per_block] for i in range(student_layer_num + 1)]
    new_student_reps = student_reps
    for student_rep, teacher_rep in zip(new_student_reps, new_teacher_reps):
        tmp_loss = loss_mse(student_rep, teacher_rep)
        rep_loss += tmp_loss
    loss = rep_loss + att_loss
    tr_att_loss += att_loss.item()
    tr_rep_loss += rep_loss.item()
else:
    if output_mode == "classification":
        # Soft cross entropy between temperature-scaled student and teacher logits.
        cls_loss = soft_cross_entropy(student_logits / args.temperature,
                                      teacher_logits / args.temperature)
    elif output_mode == "regression":
        loss_mse = MSELoss()
        cls_loss = loss_mse(student_logits.view(-1), label_ids.view(-1))
    loss = cls_loss
    tr_cls_loss += cls_loss.item()
```
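For the block above to run, a few names must be initialized earlier in the Trainer. A minimal sketch of the assumed setup (soft_cross_entropy is the helper sketched in the prediction-logits section above):

```python
import torch
from torch.nn import MSELoss

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
loss_mse = MSELoss()
# Per-step loss terms and epoch-level running totals.
att_loss, rep_loss = 0.0, 0.0
tr_att_loss, tr_rep_loss, tr_cls_loss = 0.0, 0.0, 0.0
```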