[Paper Reading] Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT
2022-07-29 10:08:00 【zoetu】
1. Paper information
Authors: Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W. Mahoney, Kurt Keutzer
Affiliation: University of California, Berkeley
Venue: AAAI 2020
Published: 2019-09-25
2. Research background
- BERT-based models require a large amount of memory and have high inference latency, which makes them hard to deploy, so quantization is needed.
- Different encoder layers attend to different feature structures and differ in their sensitivity to quantization, so mixed-precision quantization is needed.
This paper applies ultra-low-precision quantization to the BERT model, aiming to minimize the performance drop while preserving hardware efficiency.
3. Method
The fine-tuned BERT_BASE model consists of three parts:
embedding layer (91 MB), encoder (325 MB), output layer (0.01 MB).
Because the output layer is tiny, it is not quantized in this work. The paper therefore studies how to quantize the embedding and encoder parameters, treating the two differently.
3.1 Hessian-based mixed-precision quantization
Previous work: HAWQ: Hessian AWare Quantization of Neural Networks with Mixed-Precision
Mixed precision is applied only to the weight parameters; the activation bit-width is uniformly set to 8 bits in this paper.
Determine the bit-width:
First, compute the top eigenvalue of the Hessian of each layer's parameters; the procedure is as follows:
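The usual way to do this (as in HAWQ) is power iteration with Hessian-vector products, which never forms the full Hessian. Below is a minimal PyTorch sketch of that idea; it is not the authors' code, and the function and variable names are illustrative.

```python
import torch

def top_hessian_eigenvalue(loss, params, n_iter=20):
    """Estimate the largest Hessian eigenvalue of `loss` w.r.t. `params`
    using power iteration over Hessian-vector products (Pearlmutter's trick)."""
    # First-order gradients with the graph kept, so we can differentiate again.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Random unit vector with the same shape as the parameters.
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((x * x).sum() for x in v))
    v = [x / norm for x in v]
    eigenvalue = None
    for _ in range(n_iter):
        # Hessian-vector product: Hv = d(g·v)/dθ
        gv = sum((g * x).sum() for g, x in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        # Rayleigh quotient vᵀHv (v is already unit-norm).
        eigenvalue = sum((h * x).sum() for h, x in zip(hv, v)).item()
        norm = torch.sqrt(sum((h * h).sum() for h in hv))
        v = [h / norm for h in hv]
    return eigenvalue
```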

Second, feed in batches of training data and average the top eigenvalue obtained across the different batches. In addition, for NLP tasks the variance must also be added to enrich the statistics.
The larger the eigenvalue, the more sensitive the layer is to quantization and the higher the bit-width it should be allocated; the smaller the eigenvalue, the less sensitive the layer is to quantization, the flatter its loss landscape is under perturbation, and the fewer bits it needs.
For NLP tasks, the top eigenvalue varies greatly across different input batches, so relying on the mean alone is unreliable; the variance is therefore added to enrich the statistics. About 10% of the training data is used to compute them.
- Finally, compute the quantization sensitivity metric (formula below) and allocate bit-widths according to these statistics.
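Based on the two statistics described above, the per-layer sensitivity combines the mean and the standard deviation of the top eigenvalue over the sampled batches, roughly of the form $\Omega_i = |\mathrm{mean}(\lambda_i)| + \mathrm{std}(\lambda_i)$, where $\lambda_i$ is the top Hessian eigenvalue of layer $i$ measured on each batch; layers with larger $\Omega_i$ receive higher bit-widths.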

3.2 Group-wise quantization
Suppose the input sequence has n words and each word has a d-dimensional embedding vector (d = 768 in BERT_BASE). In a Transformer encoder layer, each self-attention head has 4 dense matrices, and each head computes a weighted sum via the dot-product attention formula:
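This is the standard scaled dot-product attention per head, $\mathrm{Att}(x, x) = \mathrm{Softmax}\!\big(xW_q (xW_k)^{\top} / \sqrt{d/N_h}\big)\, xW_v$, where $W_q$, $W_k$, $W_v$ and the output projection $W_o$ are the four dense matrices of each head (the paper's exact notation may differ slightly).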
Directly quantizing these four matrices of the different heads with the same quantization range causes a large drop in accuracy!
Group-wise quantization mechanism:
Treat each head's matrix W among the multi-head self-attention (MHSA) dense matrices as one group; for example, 12 heads give 12 groups. Within each group, the weights of a run of consecutive output neurons form a sub-group, and each sub-group has its own quantization range.
In this paper, every $d/(2N_h)$ consecutive output neurons form one sub-group, where $d$ is the embedding dimension and $N_h$ is the number of heads.
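A minimal sketch of the idea, assuming symmetric uniform quantization and grouping by consecutive output rows; the function name, group layout, and sizes are illustrative rather than the authors' implementation:

```python
import torch

def groupwise_quantize(W, num_groups=128, num_bits=4):
    """Quantize weight matrix W with one symmetric uniform range per group
    of consecutive output rows, then de-quantize for inspection."""
    rows_per_group = W.shape[0] // num_groups
    qmax = 2 ** (num_bits - 1) - 1
    W_q = torch.empty_like(W)
    for g in range(num_groups):
        block = W[g * rows_per_group:(g + 1) * rows_per_group]
        scale = block.abs().max().clamp(min=1e-8) / qmax   # per-group range
        W_q[g * rows_per_group:(g + 1) * rows_per_group] = (
            torch.clamp(torch.round(block / scale), -qmax, qmax) * scale
        )
    return W_q

# Example: a 768x768 dense matrix split into 128 groups of 6 rows each.
W = torch.randn(768, 768)
W4 = groupwise_quantize(W, num_groups=128, num_bits=4)
```

Each group keeps its own scale (and, in hardware, its own lookup table), which is what allows ultra-low bit-widths to track the very different value ranges of the per-head matrices.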
4. Experiments
Experiment 1: Compression ratio and accuracy loss
Models compared:
- Baseline: the unquantized BERT model
- Q-BERT (group-wise quantization only)
- Q-BERT_MP (mixed precision plus group-wise quantization)
- DirectQ: BERT quantized directly, without the proposed techniques
All models except the baseline use 8-bit activations; the directly quantized model (DirectQ) uses neither mixed precision nor group-wise quantization.
Notes:
- w-bits: weight quantization bit-width
- e-bits: embedding quantization bit-width
- size-w/o-e: model size excluding the embedding layer, in MB
Summary: Q-BERT achieves up to a 13× weight compression ratio, shrinking the model to about 1/8 of its original size, with an accuracy loss within 2.3%.
- Q-BERT performs much better than DirectQ, and the gap becomes more pronounced at ultra-low bit-widths.
- For example, with 4-bit weight quantization on SQuAD, direct quantization (DirectQ) degrades performance by 11.5% relative to BERT_BASE, whereas Q-BERT degrades it by only 0.5%.

- Moreover, with 3-bit weight quantization the gap between Q-BERT and DirectQ widens further, ranging from 9.68% to 27.83% across the tasks.
Experiment 2: Effect of group-wise quantization
Effect of group-wise quantization in Q-BERT on the three tasks. For all tasks, the weight bit-width is set to 4, the embedding bit-width to 8, and the activation bit-width to 8.
The number of groups is increased to observe its impact on model accuracy:
Results:
- Without group-wise quantization, accuracy drops by about 7% to 11.5%.
- Increasing the number of groups improves performance significantly.
- For example, with 12 groups, the performance drop on every task is within 2%.
- Increasing the number of groups from 12 to 128 improves accuracy by at least a further 0.3%.
- However, increasing the number of groups from 128 to 768 improves performance by only about 0.1%, indicating that the gain is nearly saturated at around 128 groups.
Takeaway: do not use too many groups.
More groups increase the number of lookup tables (LUTs) required for each matrix multiplication, which can hurt hardware performance, and the results show diminishing returns in accuracy. This is why the other experiments in this paper use 128-group quantization.
Experiment 3: Quantization sensitivity of different modules

As shown in the figure above:
- The embedding layer is more sensitive to quantization than the encoder weights. For example, 4-bit layer-wise quantization of the embedding layer causes performance drops of up to 10% on SST-2, MNLI, and CoNLL-03, and of more than 20% on SQuAD. In contrast, although the encoder holds about 79% of the total parameters (roughly 4× the size of the embeddings, which are the minority), quantizing it to 4 bits as in Table 1 loses far less performance.
- Position embeddings are more sensitive to quantization than word embeddings. For example, quantizing the position embeddings to 4 bits causes about 2% more performance degradation than quantizing the word embeddings, even though position embeddings account for less than 5% of the embedding parameters; this underlines the importance of positional information in natural language understanding. Since position embeddings make up only a small fraction of the model size, the embedding layer can be quantized with mixed precision to compress the model further at a tolerable accuracy cost, as shown below:
Note: 4/8 means 4-bit word embeddings and 8-bit position embeddings.
With a performance drop of about 0.5%, the embedding table can be shrunk to 11.6 MB.
- The self-attention layer is more robust to quantization than the fully connected network. For example, 1/2-bit mixed-precision (1/2MP) quantization of self-attention degrades performance by about 5%, while 1/2MP quantization of the fully connected layers degrades it by about 11%.
Experiment 4: Qualitative analysis
Attention information is used to qualitatively analyze the difference between Q-BERT and DirectQ:
For the same input, the Kullback-Leibler (KL) divergence is computed between the attention distributions of the quantized BERT and the full-precision BERT.
KL divergence measures the distance between two distributions: the smaller the KL divergence, the closer the attention outputs of the two models.
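A rough sketch of this measurement, assuming the per-head attention probabilities have already been extracted from both models (names and shapes are illustrative, not the paper's evaluation code):

```python
import torch

def attention_kl(attn_fp, attn_q, eps=1e-12):
    """KL(attn_fp || attn_q), averaged over heads and query positions.
    Both inputs: [num_heads, seq_len, seq_len] attention probabilities."""
    kl = attn_fp * (torch.log(attn_fp + eps) - torch.log(attn_q + eps))
    return kl.sum(dim=-1).mean().item()
```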

As shown in the figure above, the distance between Q-BERT and the baseline is far smaller than the distance between DirectQ and the baseline.
5. Summary
- The researchers perform extensive layer-by-layer analysis of second-order (Hessian) information and then apply mixed-precision quantization to BERT. They find that, compared with neural networks in computer vision, the Hessian behavior of BERT is very different. The study therefore proposes a sensitivity metric based on the mean and variance of the top eigenvalue to achieve better mixed-precision quantization.
- The researchers propose a new quantization mechanism, group-wise quantization, which mitigates the accuracy drop without significantly increasing hardware complexity. Specifically, the group-wise quantization mechanism splits each matrix into groups, each with its own quantization range and lookup table.
- The researchers investigate the bottlenecks of BERT quantization, i.e., how different factors affect the trade-off between NLP performance and model compression. These factors include the quantization mechanism and modules such as the embeddings, self-attention, and fully connected layers.
This corresponds to the three sets of experiments above: the quantization sensitivity of the embedding layer versus the weights, and of the self-attention layer versus the fully connected layers.
6. References
- Previous work by the same group: Dong Z, Yao Z, Gholami A, et al. HAWQ: Hessian AWare Quantization of Neural Networks with Mixed-Precision. arXiv, 2019 [accessed 2022-07-26].
The HAWQ line of work is worth a closer look; it appears to be open source!
Further reading:
- https://zhuanlan.zhihu.com/p/440401538
- https://developer.aliyun.com/article/819840