[Paper Reading] Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT
2022-07-29 10:08:00 【zoetu】
1. Paper information
Authors: Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W. Mahoney, Kurt Keutzer
Affiliation: University of California, Berkeley
Venue: AAAI 2020
Published: 2019-09-25
2. Research background
- BERT-based models require a large amount of memory and have high inference latency, which makes them hard to deploy, so quantization is needed.
- Different encoder layers attend to different feature structures and differ in their sensitivity to quantization, so mixed-precision quantization is needed.
This paper applies ultra-low-precision quantization to the BERT model, aiming to minimize the performance drop while preserving hardware efficiency.
3. Method
The fine-tuned BERT_BASE model consists of three parts:
embedding layer (91 MB), encoder (325 MB), output layer (0.01 MB).
Because the output layer is tiny, it is not quantized in this work. The paper therefore studies how to quantize the embedding and encoder parameters, treating the two differently.
3.1 Hessian-based mixed-precision quantization
Previous work: HAWQ: Hessian AWare Quantization of Neural Networks with Mixed-Precision
Mixed precision is applied only to the weight parameters; the activation bit-width is uniformly set to 8 bits in this paper.
Determine the bit-width:
First, compute the top eigenvalue of the Hessian of each layer's parameters; the procedure is as follows:
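The usual way to do this (as in HAWQ) is power iteration with Hessian-vector products, which never forms the full Hessian. Below is a minimal PyTorch sketch of that idea; it is not the authors' code, and the function and variable names are illustrative.

```python
import torch

def top_hessian_eigenvalue(loss, params, n_iter=20):
    """Estimate the largest Hessian eigenvalue of `loss` w.r.t. `params`
    using power iteration over Hessian-vector products (Pearlmutter's trick)."""
    # First-order gradients with the graph kept, so we can differentiate again.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Random unit vector with the same shape as the parameters.
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((x * x).sum() for x in v))
    v = [x / norm for x in v]
    eigenvalue = None
    for _ in range(n_iter):
        # Hessian-vector product: Hv = d(g·v)/dθ
        gv = sum((g * x).sum() for g, x in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        # Rayleigh quotient vᵀHv (v is already unit-norm).
        eigenvalue = sum((h * x).sum() for h, x in zip(hv, v)).item()
        norm = torch.sqrt(sum((h * h).sum() for h in hv))
        v = [h / norm for h in hv]
    return eigenvalue
```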

Second, feed in batches of training data and average the top eigenvalue obtained across the different batches. In addition, for NLP tasks the variance must also be added to enrich the statistics.
The larger the eigenvalue, the more sensitive the layer is to quantization and the higher the bit-width it should be allocated; the smaller the eigenvalue, the less sensitive the layer is to quantization, the flatter its loss landscape is under perturbation, and the fewer bits it needs.
For NLP tasks, the top eigenvalue varies greatly across different input batches, so relying on the mean alone is unreliable; the variance is therefore added to enrich the statistics. About 10% of the training data is used to compute them.
- Finally, compute the quantization sensitivity metric (formula below) and allocate bit-widths according to these statistics.
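Based on the two statistics described above, the per-layer sensitivity combines the mean and the standard deviation of the top eigenvalue over the sampled batches, roughly of the form $\Omega_i = |\mathrm{mean}(\lambda_i)| + \mathrm{std}(\lambda_i)$, where $\lambda_i$ is the top Hessian eigenvalue of layer $i$ measured on each batch; layers with larger $\Omega_i$ receive higher bit-widths.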

3.2 Group-wise quantization
Suppose the input sequence has n words and each word has a d-dimensional embedding vector (d = 768 in BERT_BASE). In a Transformer encoder layer, each self-attention head has 4 dense matrices, and each head computes a weighted sum via the dot-product attention formula:
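This is the standard scaled dot-product attention per head, $\mathrm{Att}(x, x) = \mathrm{Softmax}\!\big(xW_q (xW_k)^{\top} / \sqrt{d/N_h}\big)\, xW_v$, where $W_q$, $W_k$, $W_v$ and the output projection $W_o$ are the four dense matrices of each head (the paper's exact notation may differ slightly).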
Directly quantizing these four matrices of the different heads with the same quantization range causes a large drop in accuracy!
Group-wise quantization mechanism:
Treat each head's matrix W among the multi-head self-attention (MHSA) dense matrices as one group; for example, 12 heads give 12 groups. Within each group, the weights of a run of consecutive output neurons form a sub-group, and each sub-group has its own quantization range.
In this paper, every $d/(2N_h)$ consecutive output neurons form one sub-group, where $d$ is the embedding dimension and $N_h$ is the number of heads.
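A minimal sketch of the idea, assuming symmetric uniform quantization and grouping by consecutive output rows; the function name, group layout, and sizes are illustrative rather than the authors' implementation:

```python
import torch

def groupwise_quantize(W, num_groups=128, num_bits=4):
    """Quantize weight matrix W with one symmetric uniform range per group
    of consecutive output rows, then de-quantize for inspection."""
    rows_per_group = W.shape[0] // num_groups
    qmax = 2 ** (num_bits - 1) - 1
    W_q = torch.empty_like(W)
    for g in range(num_groups):
        block = W[g * rows_per_group:(g + 1) * rows_per_group]
        scale = block.abs().max().clamp(min=1e-8) / qmax   # per-group range
        W_q[g * rows_per_group:(g + 1) * rows_per_group] = (
            torch.clamp(torch.round(block / scale), -qmax, qmax) * scale
        )
    return W_q

# Example: a 768x768 dense matrix split into 128 groups of 6 rows each.
W = torch.randn(768, 768)
W4 = groupwise_quantize(W, num_groups=128, num_bits=4)
```

Each group keeps its own scale (and, in hardware, its own lookup table), which is what allows ultra-low bit-widths to track the very different value ranges of the per-head matrices.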
4. Experiments
Experiment 1: Compression ratio and accuracy loss
Models compared:
- Baseline: the unquantized BERT model
- Q-BERT (group-wise quantization only)
- Q-BERT_MP (mixed precision plus group-wise quantization)
- DirectQ: BERT quantized directly, without the proposed techniques
All models except the baseline use 8-bit activations; the directly quantized model (DirectQ) uses neither mixed precision nor group-wise quantization.
Notes:
- w-bits: weight quantization bit-width
- e-bits: embedding quantization bit-width
- size-w/o-e: model size excluding the embedding layer, in MB
Summary: Q-BERT achieves up to a 13× weight compression ratio, shrinking the model to about 1/8 of its original size, with an accuracy loss within 2.3%.
- Q-BERT performs much better than DirectQ, and the gap becomes more pronounced at ultra-low bit-widths.
- For example, with 4-bit weight quantization on SQuAD, direct quantization (DirectQ) degrades performance by 11.5% relative to BERT_BASE, whereas Q-BERT degrades it by only 0.5%.

- Moreover, with 3-bit weight quantization the gap between Q-BERT and DirectQ widens further, ranging from 9.68% to 27.83% across the tasks.
Experiment 2: Effect of group-wise quantization
Effect of group-wise quantization in Q-BERT on the three tasks. For all tasks, the weight bit-width is set to 4, the embedding bit-width to 8, and the activation bit-width to 8.
The number of groups is increased to observe its impact on model accuracy:
Results:
- Without group-wise quantization, accuracy drops by about 7% to 11.5%.
- Increasing the number of groups improves performance significantly.
- For example, with 12 groups, the performance drop on every task is within 2%.
- Increasing the number of groups from 12 to 128 improves accuracy by at least a further 0.3%.
- However, increasing the number of groups from 128 to 768 improves performance by only about 0.1%, indicating that the gain is nearly saturated at around 128 groups.
Takeaway: do not use too many groups.
More groups increase the number of lookup tables (LUTs) required for each matrix multiplication, which can hurt hardware performance, and the results show diminishing returns in accuracy. This is why the other experiments in this paper use 128-group quantization.
Experiment 3: Quantization sensitivity of different modules

As shown in the figure above:
- The embedding layer is more sensitive to quantization than the encoder weights. For example, 4-bit layer-wise quantization of the embedding layer causes performance drops of up to 10% on SST-2, MNLI, and CoNLL-03, and of more than 20% on SQuAD. In contrast, although the encoder holds about 79% of the total parameters (roughly 4× the size of the embeddings, which are the minority), quantizing it to 4 bits as in Table 1 loses far less performance.
- Position embeddings are more sensitive to quantization than word embeddings. For example, quantizing the position embeddings to 4 bits causes about 2% more performance degradation than quantizing the word embeddings, even though position embeddings account for less than 5% of the embedding parameters; this underlines the importance of positional information in natural language understanding. Since position embeddings make up only a small fraction of the model size, the embedding layer can be quantized with mixed precision to compress the model further at a tolerable accuracy cost, as shown below:
Note: 4/8 means 4-bit word embeddings and 8-bit position embeddings.
With a performance drop of about 0.5%, the embedding table can be shrunk to 11.6 MB.
- The self-attention layer is more robust to quantization than the fully connected network. For example, 1/2-bit mixed-precision (1/2MP) quantization of self-attention degrades performance by about 5%, while 1/2MP quantization of the fully connected layers degrades it by about 11%.
Experiment 4: Qualitative analysis
Attention information is used to qualitatively analyze the difference between Q-BERT and DirectQ:
For the same input, the Kullback-Leibler (KL) divergence is computed between the attention distributions of the quantized BERT and the full-precision BERT.
KL divergence measures the distance between two distributions: the smaller the KL divergence, the closer the attention outputs of the two models.
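A rough sketch of this measurement, assuming the per-head attention probabilities have already been extracted from both models (names and shapes are illustrative, not the paper's evaluation code):

```python
import torch

def attention_kl(attn_fp, attn_q, eps=1e-12):
    """KL(attn_fp || attn_q), averaged over heads and query positions.
    Both inputs: [num_heads, seq_len, seq_len] attention probabilities."""
    kl = attn_fp * (torch.log(attn_fp + eps) - torch.log(attn_q + eps))
    return kl.sum(dim=-1).mean().item()
```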

As shown in the figure above, the distance between Q-BERT and the baseline is far smaller than the distance between DirectQ and the baseline.
5. Summary
- The researchers perform extensive layer-by-layer analysis of second-order (Hessian) information and then apply mixed-precision quantization to BERT. They find that, compared with neural networks in computer vision, the Hessian behavior of BERT is very different. The study therefore proposes a sensitivity metric based on the mean and variance of the top eigenvalue to achieve better mixed-precision quantization.
- The researchers propose a new quantization mechanism, group-wise quantization, which mitigates the accuracy drop without significantly increasing hardware complexity. Specifically, the group-wise quantization mechanism splits each matrix into groups, each with its own quantization range and lookup table.
- The researchers investigate the bottlenecks of BERT quantization, i.e., how different factors affect the trade-off between NLP performance and model compression. These factors include the quantization mechanism and modules such as the embeddings, self-attention, and fully connected layers.
This corresponds to the three sets of experiments above: the quantization sensitivity of the embedding layer versus the weights, and of the self-attention layer versus the fully connected layers.
6. References
- Previous work by the same group: Dong Z, Yao Z, Gholami A, et al. HAWQ: Hessian AWare Quantization of Neural Networks with Mixed-Precision. arXiv, 2019 [accessed 2022-07-26].
The HAWQ line of work is worth a closer look; it appears to be open source!
Further reading:
- https://zhuanlan.zhihu.com/p/440401538
- https://developer.aliyun.com/article/819840