[Paper Reading] Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT
2022-07-29 10:08:00 【zoetu】
1. Paper information
Authors: Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W. Mahoney, Kurt Keutzer
Affiliation: University of California, Berkeley
Venue: AAAI 2020
Published: 2019.9.25
2. Research background
- BERT-based models need a lot of memory and have high inference latency, which makes them hard to deploy, so quantization is needed;
- Different encoder layers attend to different feature structures and differ in their sensitivity to quantization, so mixed-precision quantization is needed.
This paper performs ultra-low-precision quantization of the BERT model, aiming to minimize performance degradation while maintaining hardware efficiency.
3. Method
The fine-tuned BERT_BASE model consists of three parts:
embedding layer (91 MB), encoder (325 MB), output layer (0.01 MB)
Because the output layer is so small, it is not quantized. The paper therefore studies quantizing the embedding and encoder parameters, treating the two in different ways.
3.1 Hessian-based mixed-precision quantization
Previous work: HAWQ: Hessian AWare Quantization of Neural Networks with Mixed-Precision
Mixed precision is applied mainly to the weight parameters; in this paper the activation bit-width is uniformly set to 8 bits.
Determining the bit-width:
- First, compute the largest eigenvalue of the Hessian of the loss with respect to each layer's parameters. Since the full Hessian is never formed explicitly, this is done with matrix-free Hessian-vector products (power iteration).
- Second, feed in training data and average the largest eigenvalue over different inputs. In addition, for NLP tasks the variance is added to enrich the statistics.
- The larger the eigenvalue, the more sensitive the layer is to quantization and the more bits it should be allocated; the smaller the eigenvalue, the less sensitive the layer is, i.e. the flatter its loss landscape under perturbation, and fewer bits suffice.
- For NLP tasks the largest eigenvalue varies greatly across inputs, so relying on the mean alone is not reliable; the variance is added to further improve the statistics, computed on about 10% of the training data.
- Finally, compute a quantization sensitivity metric for each layer from these statistics (mean plus standard deviation of the top eigenvalue) and allocate bit-widths accordingly, as sketched in the code below.
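A minimal PyTorch sketch of the two pieces above, run on a toy layer rather than BERT: a matrix-free power iteration that estimates the top Hessian eigenvalue of one parameter group, and a sensitivity score combining the mean and standard deviation of that eigenvalue over several data batches. The exact form of the metric and the function names here are assumptions reconstructed from the description above, not the paper's code.

```python
import torch

def top_hessian_eigenvalue(loss, params, n_iter=20):
    """Estimate the largest Hessian eigenvalue of `loss` w.r.t. `params`
    by power iteration on Hessian-vector products; the Hessian itself
    is never materialized."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    v_norm = torch.sqrt(sum((x ** 2).sum() for x in v))
    v = [x / v_norm for x in v]
    eig = 0.0
    for _ in range(n_iter):
        g_v = sum((g * x).sum() for g, x in zip(grads, v))        # g^T v
        hv = torch.autograd.grad(g_v, params, retain_graph=True)  # H v
        eig = sum((h * x).sum() for h, x in zip(hv, v)).item()    # Rayleigh quotient v^T H v
        h_norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / (h_norm + 1e-12) for h in hv]
    return eig

def layer_sensitivity(top_eigs):
    """Sensitivity = |mean| + std of the top eigenvalue over data batches
    (assumed form, reconstructed from the mean-plus-variance description)."""
    lams = torch.tensor(top_eigs)
    return lams.mean().abs().item() + lams.std().item()

# Toy usage: one "layer" and a handful of batches standing in for ~10% of the data.
layer = torch.nn.Linear(16, 16)
eigs = []
for _ in range(8):
    x, y = torch.randn(32, 16), torch.randn(32, 16)
    loss = torch.nn.functional.mse_loss(layer(x), y)
    eigs.append(top_hessian_eigenvalue(loss, list(layer.parameters())))
print("layer sensitivity:", layer_sensitivity(eigs))
# Rank layers by this score; more sensitive layers receive higher bit-widths.
```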
3.2 Group-wise quantization
Suppose the input sequence has n words and each word has a d-dimensional embedding vector (d = 768 in BERT_BASE). In the Transformer encoder, each self-attention head has 4 dense matrices (the key, query, value, and output projections), and each head computes a weighted sum via scaled dot-product attention, softmax(Q K^T / sqrt(d_h)) V, where d_h = d/N_h is the per-head dimension.
Quantizing these four matrices of all the different heads directly with a single quantization range hurts accuracy badly.
Group-wise quantization mechanism:
Each head's matrix W within a dense matrix of multi-head self-attention (MHSA) is treated as one group; for example, 12 heads give 12 groups. Within each group, the weights of consecutive output neurons are further partitioned into sub-groups, and each sub-group has its own quantization range.
This paper groups every d/(2*N_h) consecutive output neurons into one sub-group, where d is the embedding dimension and N_h is the number of heads (a code sketch of per-group quantization follows below).
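A minimal sketch of symmetric uniform quantization with a separate range per group of consecutive output neurons, illustrating the mechanism rather than Q-BERT's exact kernel; the 128-group setting matches Experiment 2 below, and the function name is illustrative.

```python
import torch

def groupwise_quantize(w, n_bits=4, n_groups=128):
    """Symmetric uniform quantization of a 2-D weight matrix, with one
    quantization range (scale) per group of consecutive output rows."""
    out_dim, in_dim = w.shape
    assert out_dim % n_groups == 0
    g = w.reshape(n_groups, (out_dim // n_groups) * in_dim)    # one row per group
    max_int = 2 ** (n_bits - 1) - 1
    scale = g.abs().max(dim=1, keepdim=True).values / max_int  # per-group range
    q = torch.clamp(torch.round(g / scale), -max_int - 1, max_int)
    return (q * scale).reshape(out_dim, in_dim)                # de-quantized view

# Toy usage on one BERT_BASE-sized projection matrix (768 x 768):
w = torch.randn(768, 768)
for groups in (1, 12, 128):
    err = (w - groupwise_quantize(w, n_bits=4, n_groups=groups)).abs().mean().item()
    print(f"groups={groups:4d}  mean abs quantization error={err:.4f}")
# More groups -> tighter per-group ranges -> smaller quantization error,
# mirroring the trend reported in Experiment 2.
```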
4. Experiments
Experiment 1: Compression effect and accuracy loss
Models compared:
- Baseline: unquantized BERT model
- Q-BERT (group-wise quantization only)
- Q-BERT_MP (mixed precision plus group-wise quantization)
- DirectQ: directly quantized BERT model (does not use the proposed methods)
All models except the baseline use 8-bit activations; the directly quantized model (DirectQ) uses neither mixed precision nor group-wise quantization.
Notes:
- w-bits: weight quantization bit-width
- e-bits: embedding quantization bit-width
- size-w/o-e: model size excluding the embedding layer, in MB
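A quick back-of-envelope (not from the paper) showing how bit-width maps to the size columns; the parameter counts are rough figures derived from the FP32 sizes quoted in Section 3 (about 325 MB of encoder weights and 91 MB of embeddings at 4 bytes per value).

```python
def size_mb(n_params: float, bits: int) -> float:
    """Storage in MB for n_params values stored at the given bit-width."""
    return n_params * bits / 8 / 1e6

enc_params = 81e6   # rough encoder parameter count, inferred from ~325 MB at FP32
emb_params = 23e6   # rough embedding parameter count, inferred from ~91 MB at FP32

print("FP32 encoder (size-w/o-e):", size_mb(enc_params, 32))  # ~325 MB
print("4-bit encoder            :", size_mb(enc_params, 4))   # ~40 MB
print("8-bit embedding table    :", size_mb(emb_params, 8))   # ~23 MB
```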
Summary: Q-BERT achieves up to 13x weight compression, shrinking the model to about 1/8 of its original size, with an accuracy loss within 2.3%.
- Q-BERT performs much better than DirectQ, and the gap becomes even more pronounced at ultra-low bit-widths.
- For example, with 4-bit weight quantization on SQuAD, direct quantization (DirectQ) degrades performance by 11.5% relative to BERT_BASE, whereas Q-BERT loses only 0.5% under the same setting.
- With 3-bit weights the gap widens further, ranging from 9.68% to 27.83% across tasks.
Experiment 2: Effect of group-wise quantization
The effect of group-wise quantization in Q-BERT on three tasks. For all tasks the weights are quantized to 4 bits, the embeddings to 8 bits, and the activations to 8 bits.
Increasing the number of groups and observing the impact on model accuracy:
Results:
- Without group-wise quantization, accuracy drops by about 7% to 11.5%.
- Increasing the number of groups improves performance significantly, and the gain roughly saturates around 128 groups.
- For example, with 12 groups the performance degradation is below 2% for all tasks.
- Going from 12 to 128 groups improves accuracy by at least a further 0.3%.
- However, going from 128 to 768 groups improves performance by only about 0.1%, confirming that the gain saturates around 128 groups.
Takeaway: do not use too many groups.
More groups increase the number of lookup tables (LUTs) needed for each matrix multiplication, which can hurt hardware performance while the accuracy returns diminish. This is why the other experiments in the paper use 128-group quantization.
Experiment 3: Quantization sensitivity of different modules
As shown in the figure above:
- The embedding layer is more sensitive to quantization than the encoder weights. For example, quantizing the embedding layer to 4 bits layer-wise causes up to 10% performance degradation on SST-2, MNLI, and CoNLL-03, and more than 20% on SQuAD. In contrast, the encoder layers hold about 79% of all parameters (4x the size of the embedding parameters), yet Table 1 shows a smaller performance loss when they are quantized to 4 bits.
- Position embeddings are more sensitive to quantization than word embeddings. For example, quantizing the position embeddings to 4 bits causes about 2% more performance degradation than quantizing the word embeddings, even though position embeddings account for less than 5% of the embedding table; this highlights the importance of positional information in natural language understanding. Since position embeddings take up only a small fraction of the model size, the embedding layer can be quantized with mixed precision to further compress the model at a tolerable accuracy cost, as shown in the figure below:
Note: 4/8 denotes 4-bit word embeddings with 8-bit position embeddings.
With a performance drop of about 0.5%, the embedding table can be reduced to 11.6 MB.
- The self-attention layer is more robust to quantization than the fully connected (feed-forward) layers. For example, 1/2-bit mixed precision (1/2MP) on self-attention degrades performance by about 5%, whereas 1/2MP on the fully connected layers degrades it by about 11%.
Experiment 4: Qualitative analysis
Attention information is used to qualitatively analyze the difference between Q-BERT and DirectQ:
The Kullback-Leibler (KL) divergence is computed between the attention distributions of the quantized BERT and the full-precision BERT for the same input.
KL divergence measures the distance between two probability distributions; the smaller the KL divergence, the closer the attention outputs of the two models are.
As shown in the figure above, the distance between Q-BERT and the baseline is far smaller than the distance between DirectQ and the baseline.
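A minimal sketch of this measurement for a single attention map, assuming the attention probabilities have already been extracted from the two models (e.g., via `output_attentions=True` in common BERT implementations); the tensors below are random stand-ins, not real model outputs.

```python
import torch
import torch.nn.functional as F

def attention_kl(p_full, p_quant, eps=1e-8):
    """Mean KL(P_full || P_quant) between two attention maps of shape
    (seq_len, seq_len), where each row is a probability distribution
    over the attended positions."""
    p = p_full.clamp_min(eps)
    q = p_quant.clamp_min(eps)
    return (p * (p.log() - q.log())).sum(dim=-1).mean()

# Toy usage: random attention maps standing in for full-precision vs. quantized BERT.
seq_len = 16
p_full  = F.softmax(torch.randn(seq_len, seq_len), dim=-1)
p_quant = F.softmax(torch.randn(seq_len, seq_len), dim=-1)
print("KL divergence:", attention_kl(p_full, p_quant).item())
# A smaller value means the quantized model's attention stays closer to the baseline.
```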
5. Summary
- The researchers perform extensive layer-by-layer analysis of second-order (Hessian) information and then apply mixed-precision quantization to BERT. They find that BERT's Hessian behaves quite differently from that of neural networks in computer vision, and therefore propose a sensitivity metric based on the mean and variance of the top eigenvalue to achieve better mixed-precision quantization.
- They propose a new quantization scheme, group-wise quantization, which mitigates the accuracy drop without a significant increase in hardware complexity. Specifically, each matrix is divided into groups, and each group has its own quantization range and lookup table.
- They investigate the bottlenecks of BERT quantization, i.e. how different factors affect the trade-off between NLP performance and model compression, including the quantization scheme and modules such as the embedding, self-attention, and fully connected layers.
This is covered by the third experiment: the quantization sensitivity of the embedding layer versus the weights, and of the self-attention layer versus the fully connected layers.
6. References
- Previous work by the same group: Dong Z., Yao Z., Gholami A., et al. HAWQ: Hessian AWare Quantization of Neural Networks with Mixed-Precision. arXiv, 2019 [accessed 2022-07-26].
The HAWQ team's other work is also worth a look; it appears to be open source.
Reference material:
- https://zhuanlan.zhihu.com/p/440401538
- https://developer.aliyun.com/article/819840