[Paper Reading] I-BERT: Integer-only BERT Quantization
2022-07-29 10:08:00 【zoetu】
1. Paper information
Authors: Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer
Affiliation: University of California, Berkeley
Venue: ICML 2021
Published: June 8, 2021
2. Research background
- Pre-trained NLP models are too large to be deployed efficiently and run in real time;
- Previous Transformer quantization schemes still leave many floating-point operations (e.g. GELU, Softmax, LayerNorm), so they cannot be accelerated efficiently by integer-only compute units such as Turing Tensor Cores or traditional integer-only ARM processors.
Comparison of different quantization schemes applied to the self-attention layer of the Transformer architecture:
- (Left) Simulated quantization (fake quantization): all operations are performed in floating point. Parameters are quantized and stored as integers, but they are dequantized back to floating point for inference.
- (Middle) Simulated quantization (fake quantization): only part of the computation uses integer arithmetic. For example, in this figure Softmax is executed in floating point, so the input of Softmax must be dequantized, and its output must be quantized back to integers before the subsequent integer MatMul.
- (Right) The integer-only quantization proposed in this paper: the entire inference pipeline contains no floating-point operations and no dequantization.
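To make the distinction concrete, here is a minimal NumPy sketch (my own illustration, not code from the paper) contrasting simulated quantization, where the MatMul still runs in floating point after dequantization, with integer-only MatMul, where the computation stays in INT8/INT32 and only a single output scale is tracked. The function names and the symmetric per-tensor scheme are assumptions for illustration.

```python
import numpy as np

def quantize(x, num_bits=8):
    """Symmetric per-tensor quantization: x ~= scale * q with an int8 q."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def fake_quant_matmul(x, w):
    """Left/middle schemes: values are rounded to the int8 grid, but the
    MatMul itself still runs in floating point after dequantization."""
    qx, sx = quantize(x)
    qw, sw = quantize(w)
    return (qx * sx) @ (qw * sw)

def int_only_matmul(x, w):
    """Right scheme: the MatMul runs on int8 inputs with int32 accumulation;
    only the output scale (one real number) is tracked on the side."""
    qx, sx = quantize(x)
    qw, sw = quantize(w)
    acc = qx.astype(np.int32) @ qw.astype(np.int32)   # integer-only compute
    return acc, sx * sw                               # int32 result and its scale

x = np.random.randn(4, 8).astype(np.float32)
w = np.random.randn(8, 3).astype(np.float32)
acc, s = int_only_matmul(x, w)
print(np.abs(fake_quant_matmul(x, w) - x @ w).max())  # small quantization error
print(np.abs(acc * s - x @ w).max())                  # same error, integer path
```

The integer path produces essentially the same numbers as the fake-quantized path; the difference is that all of the heavy arithmetic can now run on integer-only hardware.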
3. Method
Basic idea: approximate the nonlinear functions with low-order polynomials that can be evaluated entirely with integer arithmetic (see the sketch after the list below).
Challenge: find a good low-order polynomial that closely approximates the nonlinear functions used in the Transformer.
- A higher-order polynomial gives a smaller approximation error, but costs much more computation;
- Low-precision integer multiplication overflows easily, so a wider bit width is needed to hold the accumulated values.
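The sketch below illustrates the core trick for keeping such a polynomial integer-only (my reconstruction of the idea, with illustrative names and toy constants, not the reference implementation): for a quantized input x ≈ S·q, the coefficients of a second-order polynomial a(x + b)² + c can be folded into precomputed integer offsets and a new output scale, so the runtime path consists only of integer additions and multiplications.

```python
import numpy as np

def int_poly(q, S, a, b, c):
    """Integer-only evaluation of a*(x + b)**2 + c for x = S * q.
    The float constants are folded into precomputed integer offsets and a
    new output scale, so the runtime path is pure integer arithmetic."""
    q_b = int(np.floor(b / S))              # precomputed offline
    q_c = int(np.floor(c / (a * S * S)))    # precomputed offline
    S_out = a * S * S                       # output scaling factor
    q_out = (q.astype(np.int64) + q_b) ** 2 + q_c   # integer-only at runtime
    return q_out, S_out

# toy check against the floating-point polynomial (toy constants, not the paper's)
q = np.array([-3, -1, 0, 2, 5], dtype=np.int32)
S = 0.1
a, b, c = -0.3, -1.8, 1.0
q_out, S_out = int_poly(q, S, a, b, c)
print(S_out * q_out)              # integer path, then rescale
print(a * (S * q + b) ** 2 + c)   # floating-point reference
```

Because the result is a squared int8-range value plus an offset, it fits comfortably in INT32 accumulators, which addresses the overflow concern above.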
3.1 Integer-only GELU
The original GELU function:
Approaches that were tried:
a. Approximate with ReLU: GELU and ReLU are very similar for large positive/negative inputs, but differ significantly near 0;
The following figure is for reference: https://blog.csdn.net/weixin_43791477/article/details/124871734
b. Directly evaluating the erf integral term is impractical and computationally expensive;
c. Following prior work, approximate erf with the sigmoid function (but sigmoid itself requires floating-point computation, so this is not feasible); sigmoid then has to be further approximated with h-sigmoid, which yields h-GELU, but the approximation error is still too large.

The method in this paper: polynomial approximation
- The optimization problem is as follows:

- A second-order polynomial is used. However, directly optimizing this objective gives a poor approximation, because the input range of erf is too wide. In practice erf saturates to ±1 for larger values, so the optimization can be restricted to a small range; and since erf is an odd function, only the positive part needs to be considered. This yields a new second-order approximation L(x).

- This gives i-GELU:

Comparison: approximations of the GELU function

Result: as observed in the figure above, i-GELU (blue curve) is very close to the original GELU function (red curve), especially around 0; h-GELU (yellow curve) still has a relatively large approximation error; ReLU (green curve) is not even in the running, since it differs from GELU near zero by construction.
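The comparison in the figure can be reproduced numerically with a small floating-point sketch (a reading-notes reconstruction, not the authors' code; in I-BERT the i-GELU polynomial is evaluated with the integer-only kernel above, a = -0.2888 and b = -1.769 are the coefficients reported in the paper, and the h-GELU form x·ReLU6(1.702x + 3)/6 is the usual h-sigmoid-based approximation):

```python
import numpy as np
from math import erf, sqrt

def gelu(x):
    return x * 0.5 * (1.0 + np.array([erf(v / sqrt(2.0)) for v in x]))

def relu(x):
    return np.maximum(x, 0.0)

def h_gelu(x):
    # sigmoid-based GELU with sigmoid replaced by h-sigmoid = ReLU6(x + 3) / 6
    return x * np.clip(1.702 * x + 3.0, 0.0, 6.0) / 6.0

def i_gelu(x, a=-0.2888, b=-1.769):
    # L(z) ~= erf(z): odd-symmetric second-order polynomial, saturating for |z| > -b
    z = x / sqrt(2.0)
    L = np.sign(z) * (a * (np.minimum(np.abs(z), -b) + b) ** 2 + 1.0)
    return x * 0.5 * (1.0 + L)

x = np.linspace(-4.0, 4.0, 801)
for name, f in [("ReLU", relu), ("h-GELU", h_gelu), ("i-GELU", i_gelu)]:
    print(name, np.max(np.abs(f(x) - gelu(x))))   # i-GELU should be the closest
```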
Error analysis

Analysis results: the average error of i-GELU is 8.2 × 10⁻³ and its maximum error is 1.8 × 10⁻². This is roughly a 3× accuracy improvement over h-GELU, whose average and maximum errors are 3.1 × 10⁻² and 6.8 × 10⁻². In addition, i-GELU is even slightly better than the sigmoid-based approximation of erf, while using no floating-point arithmetic at all.
This smaller approximation error is why i-GELU achieves better accuracy than h-GELU in the experiments below.
3.2 Integer-only Softmax
Softmax normalizes the input vector and maps it to a probability distribution:
Previous Transformer quantization work handles this layer in floating point, which is unfavorable for deployment on accelerators that only support integer arithmetic;
Difficulty: the input of the exponential function is unbounded and changes dynamically.
Approaches that were tried:
a. Lookup table: consumes a lot of memory and is not elegant, so lookup tables are avoided;
b. Naive polynomial approximation: approximating exp requires a high-order polynomial, and the larger the input magnitude, the larger the approximation error.
The method in this paper: polynomial approximation on a restricted range
First subtract the maximum value from the exponents so that they are all non-positive, similar to the standard trick used (e.g. in PyTorch) to prevent softmax overflow;
Then decompose each non-positive value into the following form, where z is a non-negative integer and p is a real number in (-ln2, 0]:

- Then exp(x) can be rewritten in the following form, where exp(p) has a fixed range and z is an integer:

Since exp(p) is restricted to this range, it can easily be approximated by a second-order polynomial with small error. The resulting integer-only exp takes the form:
![[ picture ]](/img/44/098ddf62a3f92d4630295331e3e21b.png)
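A floating-point sketch of this decomposition is shown below for readability (my illustration; the coefficients 0.3585, 1.353, 0.344 are the second-order fit of exp(p) on (-ln2, 0] reported in the paper, and in the actual integer-only kernel the polynomial is evaluated with the integer scheme above while the 2^(-z) factor becomes a right bit-shift):

```python
import numpy as np

LN2 = 0.6931471805599453
A, B, C = 0.3585, 1.353, 0.344   # second-order fit of exp(p) on (-ln2, 0]

def i_exp_float(x):
    """Readable float version of i-exp. In the integer-only kernel the
    polynomial is evaluated with the integer scheme above and the factor
    2**(-z) becomes a right bit-shift of the integer result."""
    x = x - np.max(x)              # shift so all inputs are non-positive
    z = np.floor(-x / LN2)         # non-negative integer part
    p = x + z * LN2                # remainder in (-ln2, 0]
    exp_p = A * (p + B) ** 2 + C   # bounded-range polynomial approximation
    return exp_p * 2.0 ** (-z)

def i_softmax_float(x):
    e = i_exp_float(x)
    return e / e.sum()

x = np.array([1.5, -0.3, 2.0, 0.0])
print(i_softmax_float(x))
print(np.exp(x - x.max()) / np.exp(x - x.max()).sum())   # reference softmax
```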
Comparison: i-exp vs. the exponential function

The figure above shows the result of i-exp, which is almost indistinguishable from the exponential function. The maximum difference between the two functions is 1.9 × 10⁻³. Considering that 8-bit quantization of the unit interval already introduces a quantization error of 1/256 = 3.9 × 10⁻³, the approximation error of i-exp is comparatively negligible and can be subsumed into the quantization error.
3.3 Integer-only LayerNorm
LayerNorm is used extensively in Transformers and involves nonlinear operations such as division and square root. It normalizes the input activations across the channel dimension. The normalization process is described as:
where μ and σ are the mean and standard deviation of the input along the channel dimension. A subtle challenge is that the input statistics (i.e. μ and σ) change rapidly in NLP tasks, so they must be computed dynamically at run time. Computing μ is straightforward, but computing σ requires the square-root function.
In short, the difficulty of quantizing LayerNorm lies in computing the standard deviation dynamically.
The method in this paper: Newton's iteration
- Each iteration requires only one integer division, one integer addition, and one bit shift to refine the approximation.
Newton's iteration itself is relatively simple: it takes the first two terms of the Taylor expansion and iteratively finds the root of the equation.

- The mean and variance are computed directly in the integer domain, and then the LayerNorm operation is carried out (see the integer square-root sketch below).
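Below is a minimal Python sketch of an integer square root via Newton's iteration in this spirit (illustrative code, not the released implementation): each step uses only an integer division, an integer addition, and a one-bit right shift, and the loop terminates once the estimate stops decreasing.

```python
def int_sqrt(n: int) -> int:
    """Integer square root by Newton's iteration: each step needs one integer
    division, one integer addition, and a one-bit right shift."""
    if n == 0:
        return 0
    x = 1 << ((n.bit_length() + 1) // 2)   # initial guess, >= sqrt(n)
    while True:
        y = (x + n // x) >> 1              # Newton update on f(x) = x^2 - n
        if y >= x:                         # stopped decreasing: x = floor(sqrt(n))
            return x
        x = y

# e.g. the integer standard deviation from an integer variance estimate
print(int_sqrt(1000))   # 31, since 31*31 = 961 <= 1000 < 32*32
```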
4. Experiments
I-BERT is implemented on top of the RoBERTa model, replacing all floating-point operations in the original model with the integer-only operations proposed in this paper. All MatMul operations run on INT8 inputs and accumulate in INT32 precision, and the embedding layer is kept in INT8 precision.
Experimental environment
- TensorRT 7.2.1
- Google Cloud Platform virtual machine with a single Tesla T4 GPU
- CUDA 11.1
- cuDNN 8.0
4.1 Accuracy experiments
Dataset: GLUE
Baseline: FP32 model

Experimental results: I-BERT consistently achieves accuracy comparable to or slightly higher than the baseline.
- RoBERTa-Base: I-BERT achieves higher accuracy in all cases (up to 1.4 points higher on RTE), except for the MNLI-m, QQP, and STS-B tasks (at most 0.3 points lower).
- RoBERTa-Large: the results are similar; I-BERT matches or exceeds the baseline accuracy on all downstream tasks.
- On average, I-BERT outperforms the baseline by 0.3 / 0.5 points for RoBERTa-Base / RoBERTa-Large.
4.2 Latency experiments

Experimental results: compared with the FP32 model, I-BERT deployed on dedicated hardware with efficient integer support achieves significant speedups.
- I-BERT's INT8 inference is on average 3.08× and 3.56× faster than pure FP32 inference for BERT-Base and BERT-Large respectively, with up to a 4.00× speedup.
In addition, further acceleration is possible by integrating NVIDIA plugins that accelerate key operations in the Transformer architecture, which can be expected to be about 2× faster than the APIs provided by TensorRT.
4.3 Ablation experiments
Model accuracy when the GELU computation is performed with GELU, h-GELU, and i-GELU respectively.
Experimental results:
- i-GELU gives a small accuracy improvement over GELU, and is more accurate than h-GELU on all of the above tasks.
- Replacing GELU with the h-GELU approximation reduces accuracy on all downstream tasks except MRPC (by up to 2.2%).
Why is i-GELU better than even the exact GELU? The paper offers no clear explanation.
5. References
- Yao Z., Dong Z., Zheng Z., et al. HAWQ-V3: Dyadic Neural Network Quantization. arXiv, 2020 [accessed 2022-07-26].
The I-BERT paper recommends reading (Yao et al., 2020) to learn more about integer-only quantization with quantization-aware fine-tuning.