[Paper Reading] Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT
2022-07-29 10:08:00 【zoetu】
1. Paper information
Authors: Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W. Mahoney, Kurt Keutzer
Affiliation: University of California, Berkeley
Venue: AAAI 2020
Published: September 25, 2019
2. Research background
- BERT-based models require a large amount of memory and have high inference latency, which makes them hard to deploy, so quantization is needed;
- Different encoder layers attend to different feature structures and differ in their sensitivity to quantization, so mixed-precision quantization is needed.
This paper applies ultra-low-precision quantization to the BERT model, aiming to minimize the performance drop while maintaining hardware efficiency.
3. Method
A fine-tuned BERT_BASE model consists of three parts:
embedding layer (91 MB), encoder (325 MB), and output layer (0.01 MB).
Because the output layer is so small, it is not quantized. The paper therefore studies how to quantize the embedding and encoder parameters in different ways.
3.1 Hessian-based mixed-precision quantization
Previous work: HAWQ: Hessian AWare Quantization of Neural Networks with Mixed-Precision
Mixed precision is applied mainly to the weight parameters; in this paper the activation bit width is uniformly set to 8 bits.
Determining the bit-width:
First, compute the largest eigenvalue of the Hessian matrix of each layer's parameters; the procedure is as follows:

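Following HAWQ, the top Hessian eigenvalue of a layer can be estimated matrix-free with power iteration over Hessian-vector products, which autograd provides. Below is a minimal PyTorch sketch of the idea; the function name and the way `loss` and `params` are obtained are illustrative assumptions, not the authors' code.

```python
import torch

def top_hessian_eigenvalue(loss, params, n_iter=20):
    """Estimate the largest Hessian eigenvalue of `loss` w.r.t. `params`
    by power iteration, using Hessian-vector products from autograd."""
    # First-order gradients; create_graph=True so we can differentiate again.
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # Random unit-norm starting vector (one block per parameter tensor).
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((x * x).sum() for x in v))
    v = [x / norm for x in v]

    eigenvalue = 0.0
    for _ in range(n_iter):
        # Hessian-vector product: differentiate (grad . v) w.r.t. the parameters.
        gv = sum((g * x).sum() for g, x in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        # Rayleigh quotient v^T H v gives the current eigenvalue estimate.
        eigenvalue = sum((h * x).sum() for h, x in zip(hv, v)).item()
        norm = torch.sqrt(sum((h * h).sum() for h in hv))
        v = [h / (norm + 1e-12) for h in hv]
    return eigenvalue
```

In Q-BERT this would be run for each encoder layer's parameters on several training batches, collecting one eigenvalue estimate per batch.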
Second, feed in several batches of training data and record the average of the top eigenvalue across the different inputs. In addition, for NLP tasks the variance is added to enrich the statistics.
The larger the eigenvalue, the more sensitive the layer is to quantization and the more bits it should be allocated; the smaller the eigenvalue, the less sensitive the layer is to quantization, which corresponds to a flatter loss landscape under perturbation, so fewer bits suffice.
For NLP tasks the top eigenvalue varies greatly across different inputs, so relying on the mean alone is not enough; the variance is added to further improve the statistics, which are computed on 10% of the training data.
- Finally, compute the quantization sensitivity metric with the formula below and allocate bit-widths according to these statistics.

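As described above, the metric combines the mean of the per-batch top eigenvalues with their spread. A small NumPy sketch of that ranking step, assuming the metric takes the form |mean| + std and using an illustrative even split of the available bit-widths (the paper's actual allocation may differ):

```python
import numpy as np

def sensitivity(top_eigenvalues_per_batch):
    """Combine per-batch top Hessian eigenvalues into one sensitivity score.
    Assumed form: |mean| + std, following the mean/variance discussion above."""
    lam = np.asarray(top_eigenvalues_per_batch)
    return abs(lam.mean()) + lam.std()

def assign_bit_widths(layer_eigenvalues, budgets=(2, 3, 4)):
    """Rank layers by sensitivity and give more bits to more sensitive layers.
    `layer_eigenvalues`: dict layer_name -> list of per-batch top eigenvalues.
    `budgets`: available bit-widths, low to high (illustrative even split)."""
    scores = {name: sensitivity(vals) for name, vals in layer_eigenvalues.items()}
    ranked = sorted(scores, key=scores.get)           # least sensitive first
    n, k = len(ranked), len(budgets)
    bits = {}
    for i, name in enumerate(ranked):
        bits[name] = budgets[min(i * k // n, k - 1)]  # low bits for low sensitivity
    return bits
```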
3.2 Group-wise quantization
Suppose the input sequence has n words, each embedded as a d-dimensional vector (d = 768 in BERT_BASE). In a Transformer encoder, each self-attention head has 4 dense matrices (key, query, value, and output projections) and computes its attention-weighted sum with the standard dot-product attention formula.
Quantizing these four matrices across different heads with one shared quantization range causes a large drop in accuracy.
Group-wise quantization mechanism:
Each head's matrix W within a dense matrix of multi-head self-attention (MHSA) is treated as a group; for example, 12 heads give 12 groups. Within each group, consecutive output neurons' weights are bucketed into sub-groups, and each sub-group has its own quantization range.
The paper takes every d/(2*N_h) consecutive output neurons as a group, where d is the embedding dimension and N_h is the number of heads.
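To make the grouping concrete, here is a minimal sketch of symmetric uniform quantization with a separate range for each group of d/(2*N_h) consecutive output neurons (taken here as rows of the weight matrix). The function and the choice of rows as the grouping axis are illustrative assumptions, not the paper's implementation.

```python
import torch

def groupwise_quantize(W, n_bits=4, n_heads=12):
    """Symmetric uniform quantization of a (d_out x d_in) weight matrix,
    using a separate quantization range per group of d/(2*n_heads) rows."""
    d = W.shape[0]
    group_size = d // (2 * n_heads)            # e.g. 768 // 24 = 32 rows per group
    q_max = 2 ** (n_bits - 1) - 1
    W_q = torch.empty_like(W)
    for start in range(0, d, group_size):
        group = W[start:start + group_size]
        scale = (group.abs().max() / q_max).clamp_min(1e-12)  # per-group range
        W_q[start:start + group_size] = torch.round(group / scale) * scale
    return W_q

# One of the four dense matrices of multi-head self-attention in BERT_BASE
W = torch.randn(768, 768)
W_q = groupwise_quantize(W, n_bits=4, n_heads=12)
print((W - W_q).abs().max())   # per-group error is at most half a quantization step
```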
4. Experiments
Experiment 1: Compression effect and accuracy loss
Models compared:
- Baseline: unquantized BERT model
- Q-BERT (group-wise quantization only)
- Q-BERT_MP (mixed precision plus group-wise quantization)
- DirectQ: directly quantized BERT model (without the proposed methods)
All models except the baseline use 8-bit activations; the directly quantized model (DirectQ) uses neither mixed precision nor group-wise quantization.
Notes:
- w-bits: weight quantization bit width
- e-bits: embedding quantization bit width
- size-w/o-e: model size excluding the embedding layer, in MB
Summary: Q-BERT achieves a 13x weight compression ratio, the model is only 1/8 of its original size, and the accuracy loss is within 2.3%.
- Q-BERT performs much better than DirectQ, and the gap becomes even more pronounced at ultra-low bit widths.
- For example, with weights quantized to 4 bits on SQuAD, direct quantization (DirectQ) degrades performance by 11.5% compared with BERT_BASE, while Q-BERT at the same setting degrades by only 0.5%.

- Moreover, with weights quantized to 3 bits the gap between Q-BERT and DirectQ widens further, ranging from 9.68% to 27.83% across tasks.
Experiment 2: Effect of group-wise quantization
The effect of group-wise quantization in Q-BERT on three tasks. For all tasks, weights are quantized to 4 bits, embeddings to 8 bits, and activations to 8 bits.
Increasing the number of groups and observing the impact on model accuracy:
Results:
- Without group-wise quantization, accuracy drops by about 7% to 11.5%.
- Increasing the number of groups improves performance significantly, and the gain nearly saturates at around 128 groups.
- For example, with 12 groups the performance degradation on all tasks is less than 2%.
- Increasing the number of groups from 12 to 128 further improves accuracy by at least 0.3%.
- However, increasing the number of groups from 128 to 768 improves performance by only about 0.1%, which indicates that the gain nearly saturates at around 128 groups.
Takeaway: it is best not to use too many groups.
More groups increase the number of lookup tables (LUTs) required for each matrix multiplication, which can hurt hardware performance, and the results show diminishing returns in accuracy. This is why the other experiments in the paper use 128-group quantization.
Experiment 3: Quantization sensitivity of different modules

As shown in the figure above:
- The embedding layer is more sensitive to quantization than the encoder weights. For example, 4-bit layer-wise quantization of the embedding layer degrades performance on SST-2, MNLI, and CoNLL-03 by up to 10%, and on SQuAD by more than 20%. In contrast, the encoder layers account for about 79% of the total parameters (4x the size of the embedding parameters), yet Table 1 shows much smaller performance loss when they are quantized to 4 bits.
- Position embeddings are more sensitive to quantization than word embeddings. For example, quantizing the position embeddings to 4 bits causes about 2% more performance degradation than quantizing the word embeddings, even though position embeddings account for less than 5% of the embedding parameters; this highlights the importance of positional information in natural language understanding tasks. Since position embeddings make up only a small fraction of the model size, the embedding layer can be quantized with mixed precision to further compress the model with a tolerable accuracy drop, as shown in the figure below:
Note: 4/8 means 4-bit word embeddings and 8-bit position embeddings.
With a performance drop of about 0.5%, the embedding table size can be reduced to 11.6 MB. (A small sketch of this 4/8 setting follows this list.)
- The self-attention layers are more robust to quantization than the fully connected layers. For example, 1/2MP quantization of self-attention degrades performance by about 5%, whereas 1/2MP quantization of the fully connected layers degrades it by about 11%.
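The 4/8 embedding setting referenced above can be illustrated with a plain symmetric uniform quantizer applied at different bit widths to the two embedding tables. The quantizer and the random tables below are illustrative; only the table shapes (30522x768 word embeddings and 512x768 position embeddings in BERT_BASE) come from the model.

```python
import torch

def uniform_quantize(T, n_bits):
    """Plain symmetric uniform quantization of a whole tensor with one range."""
    q_max = 2 ** (n_bits - 1) - 1
    scale = (T.abs().max() / q_max).clamp_min(1e-12)
    return torch.round(T / scale) * scale

# BERT_BASE-sized embedding tables (vocab 30522, 512 positions, hidden size 768)
word_emb = torch.randn(30522, 768)
pos_emb = torch.randn(512, 768)

word_emb_q = uniform_quantize(word_emb, n_bits=4)   # "4": 4-bit word embeddings
pos_emb_q = uniform_quantize(pos_emb, n_bits=8)     # "8": 8-bit position embeddings
```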
Experiment 4: Qualitative analysis
Attention information is used to qualitatively analyze the difference between Q-BERT and DirectQ:
The Kullback-Leibler (KL) divergence is computed between the attention distributions of the quantized BERT and the full-precision BERT for the same input.
KL divergence measures the distance between two distributions: the smaller the KL divergence, the closer the attention outputs of the two models.

As shown in the figure above, the distance between Q-BERT and the baseline is far smaller than the distance between DirectQ and the baseline.
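A minimal sketch of how such a comparison can be computed, given attention probabilities from the full-precision and the quantized model for the same input; the (batch, heads, seq_len, seq_len) layout and the random example are assumptions.

```python
import torch
import torch.nn.functional as F

def attention_kl(attn_full, attn_quant, eps=1e-12):
    """Average KL(attn_full || attn_quant) over batch, heads and query positions.
    Both tensors: (batch, heads, seq_len, seq_len), each row summing to 1."""
    p = attn_full.clamp_min(eps)
    q = attn_quant.clamp_min(eps)
    kl = (p * (p.log() - q.log())).sum(dim=-1)   # KL per attention row
    return kl.mean().item()

# Example with random attention maps (softmax over the last dimension)
logits_fp = torch.randn(2, 12, 128, 128)
attn_fp = F.softmax(logits_fp, dim=-1)
attn_q = F.softmax(logits_fp + 0.1 * torch.randn_like(logits_fp), dim=-1)
print(attention_kl(attn_fp, attn_q))  # smaller value = closer to full precision
```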
5. Summary
- The researchers perform extensive layer-by-layer analysis of second-order (Hessian) information and then apply mixed-precision quantization to BERT. They find that BERT's Hessian behaves very differently from that of neural networks in computer vision, and therefore propose a sensitivity metric based on the mean and variance of the top eigenvalues to achieve better mixed-precision quantization.
- The researchers propose a new quantization scheme, group-wise quantization, which mitigates the accuracy drop without causing a significant increase in hardware complexity. Specifically, the group-wise quantization mechanism divides each matrix into groups, and each group has its own quantization range and lookup table.
- The researchers investigate the bottlenecks of BERT quantization, that is, how different factors affect the trade-off between NLP performance and model compression. These factors include the quantization scheme and modules such as the embedding, self-attention, and fully connected layers.
This is covered by the three parts of the experiments: the quantization sensitivity of the embedding layer versus the weights, and of the self-attention layers versus the fully connected layers.
6. References
- Previous work: Dong Z., Yao Z., Gholami A., et al. HAWQ: Hessian AWare Quantization of Neural Networks with Mixed-Precision. arXiv, 2019 [accessed 2022-07-26].
The HAWQ team's other work is also worth reading, and it appears to be open source.
Reference material:
- https://zhuanlan.zhihu.com/p/440401538
- https://developer.aliyun.com/article/819840