[Paper Reading] Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT
2022-07-29 10:08:00 【zoetu】
1. Paper information
Authors: Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W. Mahoney, Kurt Keutzer
Affiliation: University of California, Berkeley
Venue: AAAI 2020
Publication date: 2019.9.25
2. Research background
- BERT-based models require a lot of memory and have high inference latency, which makes them hard to deploy, so quantization is needed.
- Different encoder layers attend to different feature structures and have different sensitivity to quantization, so mixed-precision quantization is needed.
This paper performs ultra-low-precision quantization of the BERT model, aiming to minimize the performance drop while maintaining hardware efficiency.
3. Method
A fine-tuned BERT_BASE model consists of three parts:
embedding layer 91 MB, encoder 325 MB, output layer 0.01 MB.
Because the output layer is so small, it is not quantized. The paper therefore quantizes the embedding and encoder parameters, treating the two parts in different ways.
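For context, uniform quantization maps each parameter to one of $2^k$ values inside a quantization range. Below is a minimal PyTorch sketch of symmetric uniform quantization (my own illustration, not the paper's implementation):

```python
import torch

def uniform_quantize(w: torch.Tensor, k: int) -> torch.Tensor:
    """Symmetric uniform quantization of a tensor to k bits (illustrative sketch)."""
    qmax = 2 ** (k - 1) - 1                 # e.g. 7 for signed 4-bit values
    scale = w.abs().max() / qmax            # a single quantization range for the whole tensor
    w_int = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return w_int * scale                    # "fake-quantized" weights

# Example: quantize a random 768x768 weight matrix to 4 bits
w = torch.randn(768, 768)
print((w - uniform_quantize(w, k=4)).abs().mean())   # average quantization error
```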
3.1 Hessian-based mixed-precision quantization
Previous work: HAWQ: Hessian AWare Quantization of Neural Networks with Mixed-Precision
Mixed precision is applied only to the weight parameters; in this paper the activation bit-width is uniformly set to 8 bits.
Determining the bit-width:
First, compute the maximum eigenvalue of the Hessian matrix of each layer's parameters, as follows:

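A rough PyTorch sketch of this step (assuming the standard matrix-free power iteration used in HAWQ-style methods; `loss` and `params` are placeholders for one layer's loss and parameters):

```python
import torch

def top_hessian_eigenvalue(loss: torch.Tensor, params, n_iters: int = 20) -> float:
    """Estimate the largest Hessian eigenvalue of `loss` w.r.t. `params` by power
    iteration; Hv is computed via autograd as d(g^T v)/d(params), so the Hessian
    is never materialized. Illustrative sketch only."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]                    # random starting vector
    eigenvalue = torch.tensor(0.0)
    for _ in range(n_iters):
        norm = torch.sqrt(sum((vi * vi).sum() for vi in v))      # normalize v
        v = [vi / norm for vi in v]
        g_v = sum((g * vi).sum() for g, vi in zip(grads, v))     # scalar g^T v
        hv = torch.autograd.grad(g_v, params, retain_graph=True) # Hessian-vector product
        eigenvalue = sum((hvi * vi).sum() for hvi, vi in zip(hv, v))  # Rayleigh quotient
        v = [hvi.detach() for hvi in hv]
    return float(eigenvalue)
```

In the paper this eigenvalue is computed per encoder layer, and its statistics are then collected over part of the training data, as described next.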
Second, feed in some training data and compute the average of the maximum eigenvalue over the different inputs. In addition, for NLP tasks the variance needs to be added to enrich the statistics.
The larger the eigenvalue, the more sensitive the layer is to quantization and the more bits it should be allocated; the smaller the eigenvalue, the less sensitive the layer is, the smoother the loss landscape is under the corresponding perturbation, and the fewer bits it needs.
For NLP tasks, the variance of the maximum eigenvalue across different inputs is very large, so relying on the mean alone is unreliable; the variance is therefore added to improve the statistics, computed on 10% of the training data.
- Finally, compute the quantization sensitivity metric for each layer with the formula below, and allocate bit-widths according to these statistics.

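Reconstructed here from the Q-BERT paper (please check against the original), the sensitivity metric for layer $i$ combines the mean and standard deviation of the top eigenvalues $\lambda_i$ collected over the sampled training data:

$$\Omega_i \triangleq \left|\mathrm{mean}(\lambda_i)\right| + \mathrm{std}(\lambda_i)$$

Layers with larger $\Omega_i$ are treated as more sensitive and are assigned higher bit-widths.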
3.2 Group-wise quantization
Suppose the input sequence has n words, and each word has a d-dimensional embedding vector (d = 768 in BERT_BASE). In a Transformer encoder layer, every self-attention head has 4 dense matrices (the key, query, value and output projections), and each head computes a weighted sum of the value vectors via the standard attention formula $\mathrm{softmax}\!\left(QK^{\top}/\sqrt{d_k}\right)V$.
Directly applying the same quantization range to these four matrices across different heads causes a large drop in accuracy!
Group quantization mechanism:
Treat the dense matrices W belonging to one head of multi-head self-attention (MHSA) as a group; for example, 12 heads give 12 groups. Within each group, the weights of consecutive output neurons form a sub-group, and each sub-group has its own quantization range.
This paper takes every $d/(2N_h)$ consecutive output neurons as a sub-group, where $d$ is the embedding dimension and $N_h$ is the number of heads.
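A minimal sketch of this group-wise scheme (illustrative PyTorch code, not the authors' implementation): split the output rows of a dense matrix into groups of consecutive neurons and give each group its own quantization range.

```python
import torch

def groupwise_quantize(w: torch.Tensor, k: int, n_groups: int) -> torch.Tensor:
    """Quantize a (d_out, d_in) weight matrix to k bits, with a separate
    quantization range for each group of consecutive output rows."""
    d_out, d_in = w.shape
    assert d_out % n_groups == 0, "rows must split evenly into groups"
    qmax = 2 ** (k - 1) - 1
    w_grouped = w.reshape(n_groups, d_out // n_groups, d_in)
    scales = w_grouped.abs().amax(dim=(1, 2), keepdim=True) / qmax   # one range per group
    w_q = torch.clamp(torch.round(w_grouped / scales), -qmax - 1, qmax) * scales
    return w_q.reshape(d_out, d_in)

# Example: a 768x768 BERT_BASE-sized matrix, 4-bit weights, 128 groups
w_q = groupwise_quantize(torch.randn(768, 768), k=4, n_groups=128)
```

The example uses 128 groups, the setting the paper settles on in Experiment 2 below.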
4. Experiments
Experiment 1: compression ratio and accuracy loss
Models compared:
- Baseline: the unquantized BERT model
- Q-BERT (group-wise quantization only)
- Q-BERT_MP (mixed precision plus group-wise quantization)
- DirectQ: BERT quantized directly (without the methods of this paper)
All models except the baseline use 8-bit activations; the directly quantized model (DirectQ) uses neither mixed precision nor group-wise quantization.
Notes on the table:
- w-bits: weight quantization bit-width
- e-bits: embedding quantization bit-width
- size-w/o-e: model size excluding the embedding layer, in MB
Summary: Q-BERT achieves up to 13x weight compression, shrinking the model to about 1/8 of its original size with an accuracy loss within 2.3%.
- Q-BERT performs much better than DirectQ, and the gap becomes even more obvious at ultra-low bit-widths.
- For example, with 4-bit weight quantization on SQuAD, direct quantization (DirectQ) degrades performance by 11.5% compared with BERT_BASE, whereas Q-BERT with the same setting degrades it by only 0.5%.

- Moreover, at the 3-bit weight setting the gap between Q-BERT and DirectQ widens further, to 9.68%-27.83% across the various tasks.
Experiment 2: effect of group-wise quantization
Effect of group-wise quantization in Q-BERT on three tasks. For all tasks the weight bits are set to 4, the embedding bits to 8, and the activation bits to 8.
The number of groups is increased to observe the effect on model accuracy:
Results:
- Without group-wise quantization, accuracy drops by about 7% to 11.5%.
- Increasing the number of groups improves performance significantly; the gain almost saturates around 128 groups.
- For example, with 12 groups the performance drop on every task is within 2%.
- Increasing the number of groups from 12 to 128 improves accuracy by at least a further 0.3%.
- However, increasing the number of groups from 128 to 768 improves performance by only about 0.1%, which indicates that the gain is almost saturated around 128 groups.
Takeaway: do not use too many groups.
Doing so increases the number of lookup tables (LUTs) required for each matrix multiplication, which can hurt hardware performance, and according to the results the accuracy gains diminish. This is why the other experiments in the paper use 128 groups.
Experiment 3: quantization sensitivity of different modules

As shown in the figure above:
- The embedding layer is more sensitive to quantization than the weights. For example, quantizing the embedding layer to 4 bits with layer-wise quantization degrades performance by up to 10% on SST-2, MNLI and CoNLL-03, and by more than 20% on SQuAD. By contrast, the encoder layers hold about 79% of the total parameters (roughly 4x the size of the embedding parameters), yet quantizing them to 4 bits (Table 1) causes a much smaller performance loss.
- Position embeddings are more sensitive to quantization than word embeddings. For example, quantizing the position embeddings to 4 bits causes about 2% more performance degradation than quantizing the word embeddings, even though the position embeddings account for less than 5% of the embedding parameters. This shows how important positional information is in natural language understanding tasks. Since the position embeddings make up only a small fraction of the model size, the embedding layer can be quantized with mixed precision to compress the model further at a tolerable accuracy cost, as shown in the figure below:
Note: 4/8 means 4-bit word embeddings and 8-bit position embeddings.
With a performance drop of only about 0.5%, the embedding table can be shrunk to 11.6 MB.
- The self-attention layers are more robust to quantization than the fully connected layers. For example, 1/2-bit mixed-precision (1/2MP) self-attention degrades performance by about 5%, while 1/2MP fully connected layers degrade it by about 11%.
Experiment 4: qualitative analysis
Attention information is used to qualitatively analyze the difference between Q-BERT and DirectQ:
For the same input, compute the Kullback-Leibler (KL) divergence between the attention distributions of the quantized BERT and the full-precision BERT.
KL divergence measures the distance between two distributions: the smaller it is, the closer the attention outputs of the two models are.

As shown in the figure above, the distance between Q-BERT and the baseline is much smaller than the distance between DirectQ and the baseline.
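A minimal sketch of this comparison (illustrative PyTorch code; `attn_fp` and `attn_q` stand in for the attention probability maps of the full-precision and quantized models on the same input):

```python
import torch
import torch.nn.functional as F

def attention_kl(attn_fp: torch.Tensor, attn_q: torch.Tensor) -> torch.Tensor:
    """Average KL(attn_fp || attn_q) between attention maps of shape
    (heads, seq_len, seq_len); each row is a softmax distribution."""
    eps = 1e-12                                              # avoid log(0)
    kl = (attn_fp * (torch.log(attn_fp + eps) - torch.log(attn_q + eps))).sum(dim=-1)
    return kl.mean()

# Example with random attention maps (softmax over the last dimension)
attn_fp = F.softmax(torch.randn(12, 128, 128), dim=-1)
attn_q = F.softmax(torch.randn(12, 128, 128), dim=-1)
print(attention_kl(attn_fp, attn_q))    # smaller values mean attention closer to the baseline
```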
5. Summary
- The researchers perform an extensive layer-by-layer analysis of second-order (Hessian) information and then apply mixed-precision quantization to BERT. They find that, compared with neural networks in computer vision, the Hessian behavior of BERT is very different. The study therefore proposes a sensitivity metric based on the mean and variance of the top eigenvalue to achieve better mixed-precision quantization.
- The researchers propose a new quantization scheme, group-wise quantization, which alleviates the accuracy drop without significantly increasing hardware complexity. Specifically, the group-wise scheme splits each matrix into groups, and each group has its own quantization range and lookup table.
- The researchers investigate the bottlenecks of BERT quantization, i.e., how different factors affect the trade-off between NLP performance and model compression, including the quantization scheme and modules such as the embedding, self-attention and fully connected layers.
This corresponds to the three parts of the experiments: the quantization sensitivity of the embedding layer versus the weights, and of the self-attention layers versus the fully connected layers.
6. References
- Previous work by the same group: Dong Z., Yao Z., Gholami A., et al. HAWQ: Hessian AWare Quantization of Neural Networks with Mixed-Precision [M/OL]. arXiv, 2019 [2022-07-26].
More of the HAWQ team's work is worth reading, and it appears to be open source!
Reference material:
- https://zhuanlan.zhihu.com/p/440401538
- https://developer.aliyun.com/article/819840