[Paper Reading] I-BERT: Integer-only BERT Quantization
2022-07-29 10:08:00 【zoetu】
1. Paper information
Authors: Sehoon Kim, Amir Gholami, Zhewei Yao, Michael W. Mahoney, Kurt Keutzer
Affiliation: University of California, Berkeley
Venue: ICML 2021
Published: June 8, 2021
2. Research background
- Pre-trained NLP models are too large to be deployed efficiently and run in real time;
- Previous Transformer quantization schemes still leave many floating-point operations (e.g. GELU, Softmax, LayerNorm), so they cannot be accelerated efficiently by integer-only compute units such as Turing Tensor Cores or traditional integer-only ARM processors.
Comparison of different quantization schemes applied to the self-attention layer of the Transformer architecture:
- (Left) Simulated quantization (fake quantization): all operations are performed in floating point. Parameters are quantized and stored as integers, but they are dequantized back to floating point for inference.
- (Middle) Simulated quantization (fake quantization): only part of the computation uses integer arithmetic. For example, in this figure Softmax is executed in floating point, so the input of Softmax must be dequantized, and its output must be quantized back to integers before the subsequent integer MatMul.
- (Right) The integer-only quantization proposed in this paper: the entire inference pipeline contains no floating-point operations and no dequantization.
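To make the distinction concrete, here is a minimal NumPy sketch (my own illustration, not code from the paper) contrasting simulated quantization, where the MatMul still runs in floating point after dequantization, with integer-only MatMul, where the computation stays in INT8/INT32 and only a single output scale is tracked. The function names and the symmetric per-tensor scheme are assumptions for illustration.

```python
import numpy as np

def quantize(x, num_bits=8):
    """Symmetric per-tensor quantization: x ~= scale * q with an int8 q."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def fake_quant_matmul(x, w):
    """Left/middle schemes: values are rounded to the int8 grid, but the
    MatMul itself still runs in floating point after dequantization."""
    qx, sx = quantize(x)
    qw, sw = quantize(w)
    return (qx * sx) @ (qw * sw)

def int_only_matmul(x, w):
    """Right scheme: the MatMul runs on int8 inputs with int32 accumulation;
    only the output scale (one real number) is tracked on the side."""
    qx, sx = quantize(x)
    qw, sw = quantize(w)
    acc = qx.astype(np.int32) @ qw.astype(np.int32)   # integer-only compute
    return acc, sx * sw                               # int32 result and its scale

x = np.random.randn(4, 8).astype(np.float32)
w = np.random.randn(8, 3).astype(np.float32)
acc, s = int_only_matmul(x, w)
print(np.abs(fake_quant_matmul(x, w) - x @ w).max())  # small quantization error
print(np.abs(acc * s - x @ w).max())                  # same error, integer path
```

The integer path produces essentially the same numbers as the fake-quantized path; the difference is that all of the heavy arithmetic can now run on integer-only hardware.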
3. Method
Basic idea: approximate the nonlinear functions with low-order polynomials that can be evaluated entirely with integer arithmetic (see the sketch after the list below).
Challenge: find a good low-order polynomial that closely approximates the nonlinear functions used in the Transformer.
- A higher-order polynomial gives a smaller approximation error, but costs much more computation;
- Low-precision integer multiplication overflows easily, so a wider bit width is needed to hold the accumulated values.
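The sketch below illustrates the core trick for keeping such a polynomial integer-only (my reconstruction of the idea, with illustrative names and toy constants, not the reference implementation): for a quantized input x ≈ S·q, the coefficients of a second-order polynomial a(x + b)² + c can be folded into precomputed integer offsets and a new output scale, so the runtime path consists only of integer additions and multiplications.

```python
import numpy as np

def int_poly(q, S, a, b, c):
    """Integer-only evaluation of a*(x + b)**2 + c for x = S * q.
    The float constants are folded into precomputed integer offsets and a
    new output scale, so the runtime path is pure integer arithmetic."""
    q_b = int(np.floor(b / S))              # precomputed offline
    q_c = int(np.floor(c / (a * S * S)))    # precomputed offline
    S_out = a * S * S                       # output scaling factor
    q_out = (q.astype(np.int64) + q_b) ** 2 + q_c   # integer-only at runtime
    return q_out, S_out

# toy check against the floating-point polynomial (toy constants, not the paper's)
q = np.array([-3, -1, 0, 2, 5], dtype=np.int32)
S = 0.1
a, b, c = -0.3, -1.8, 1.0
q_out, S_out = int_poly(q, S, a, b, c)
print(S_out * q_out)              # integer path, then rescale
print(a * (S * q + b) ** 2 + c)   # floating-point reference
```

Because the result is a squared int8-range value plus an offset, it fits comfortably in INT32 accumulators, which addresses the overflow concern above.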
3.1 Integer-only GELU
The original GELU function:
Approaches that were tried:
a. Approximate with ReLU: GELU and ReLU are very similar for large positive/negative inputs, but differ significantly near 0;
The following figure is for reference: https://blog.csdn.net/weixin_43791477/article/details/124871734
b. Directly evaluating the erf integral term is impractical and computationally expensive;
c. Following prior work, approximate erf with the sigmoid function (but sigmoid itself requires floating-point computation, so this is not feasible); sigmoid then has to be further approximated with h-sigmoid, which yields h-GELU, but the approximation error is still too large.

The method in this paper: polynomial approximation
- The optimization problem is as follows:

- A second-order polynomial is used. However, directly optimizing this objective gives a poor approximation, because the input range of erf is too wide. In practice erf saturates to ±1 for larger values, so the optimization can be restricted to a small range; and since erf is an odd function, only the positive part needs to be considered. This yields a new second-order approximation L(x).

- This gives i-GELU:

Comparison: approximations of the GELU function

Result: as observed in the figure above, i-GELU (blue curve) is very close to the original GELU function (red curve), especially around 0; h-GELU (yellow curve) still has a relatively large approximation error; ReLU (green curve) is not even in the running, since it differs from GELU near zero by construction.
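The comparison in the figure can be reproduced numerically with a small floating-point sketch (a reading-notes reconstruction, not the authors' code; in I-BERT the i-GELU polynomial is evaluated with the integer-only kernel above, a = -0.2888 and b = -1.769 are the coefficients reported in the paper, and the h-GELU form x·ReLU6(1.702x + 3)/6 is the usual h-sigmoid-based approximation):

```python
import numpy as np
from math import erf, sqrt

def gelu(x):
    return x * 0.5 * (1.0 + np.array([erf(v / sqrt(2.0)) for v in x]))

def relu(x):
    return np.maximum(x, 0.0)

def h_gelu(x):
    # sigmoid-based GELU with sigmoid replaced by h-sigmoid = ReLU6(x + 3) / 6
    return x * np.clip(1.702 * x + 3.0, 0.0, 6.0) / 6.0

def i_gelu(x, a=-0.2888, b=-1.769):
    # L(z) ~= erf(z): odd-symmetric second-order polynomial, saturating for |z| > -b
    z = x / sqrt(2.0)
    L = np.sign(z) * (a * (np.minimum(np.abs(z), -b) + b) ** 2 + 1.0)
    return x * 0.5 * (1.0 + L)

x = np.linspace(-4.0, 4.0, 801)
for name, f in [("ReLU", relu), ("h-GELU", h_gelu), ("i-GELU", i_gelu)]:
    print(name, np.max(np.abs(f(x) - gelu(x))))   # i-GELU should be the closest
```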
Error analysis

Analysis results: the average error of i-GELU is 8.2 × 10⁻³ and its maximum error is 1.8 × 10⁻². This is roughly a 3× accuracy improvement over h-GELU, whose average and maximum errors are 3.1 × 10⁻² and 6.8 × 10⁻². In addition, i-GELU is even slightly better than the sigmoid-based approximation of erf, while using no floating-point arithmetic at all.
This smaller approximation error is why i-GELU achieves better accuracy than h-GELU in the experiments below.
3.2 Integer-only Softmax
Softmax normalizes the input vector and maps it to a probability distribution:
Previous Transformer quantization work handles this layer in floating point, which is unfavorable for deployment on accelerators that only support integer arithmetic;
Difficulty: the input of the exponential function is unbounded and changes dynamically.
Approaches that were tried:
a. Lookup table: consumes a lot of memory and is not elegant, so lookup tables are avoided;
b. Naive polynomial approximation: approximating exp requires a high-order polynomial, and the larger the input magnitude, the larger the approximation error.
The method in this paper: polynomial approximation on a restricted range
First subtract the maximum value from the exponents so that they are all non-positive, similar to the standard trick used (e.g. in PyTorch) to prevent softmax overflow;
Then decompose each non-positive value into the following form, where z is a non-negative integer and p is a real number in (-ln2, 0]:

- Then exp(x) can be rewritten in the following form, where exp(p) has a fixed range and z is an integer:

Since exp(p) is restricted to this range, it can easily be approximated by a second-order polynomial with small error. The resulting integer-only exp takes the form:
![[ picture ]](/img/44/098ddf62a3f92d4630295331e3e21b.png)
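A floating-point sketch of this decomposition is shown below for readability (my illustration; the coefficients 0.3585, 1.353, 0.344 are the second-order fit of exp(p) on (-ln2, 0] reported in the paper, and in the actual integer-only kernel the polynomial is evaluated with the integer scheme above while the 2^(-z) factor becomes a right bit-shift):

```python
import numpy as np

LN2 = 0.6931471805599453
A, B, C = 0.3585, 1.353, 0.344   # second-order fit of exp(p) on (-ln2, 0]

def i_exp_float(x):
    """Readable float version of i-exp. In the integer-only kernel the
    polynomial is evaluated with the integer scheme above and the factor
    2**(-z) becomes a right bit-shift of the integer result."""
    x = x - np.max(x)              # shift so all inputs are non-positive
    z = np.floor(-x / LN2)         # non-negative integer part
    p = x + z * LN2                # remainder in (-ln2, 0]
    exp_p = A * (p + B) ** 2 + C   # bounded-range polynomial approximation
    return exp_p * 2.0 ** (-z)

def i_softmax_float(x):
    e = i_exp_float(x)
    return e / e.sum()

x = np.array([1.5, -0.3, 2.0, 0.0])
print(i_softmax_float(x))
print(np.exp(x - x.max()) / np.exp(x - x.max()).sum())   # reference softmax
```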
Comparison: i-exp vs. the exponential function

The figure above shows the result of i-exp, which is almost indistinguishable from the exponential function. The maximum difference between the two functions is 1.9 × 10⁻³. Considering that 8-bit quantization of the unit interval already introduces a quantization error of 1/256 = 3.9 × 10⁻³, the approximation error of i-exp is comparatively negligible and can be subsumed into the quantization error.
3.3 Integer-only LayerNorm
LayerNorm is used extensively in Transformers and involves nonlinear operations such as division and square root. It normalizes the input activations across the channel dimension. The normalization process is described as:
where μ and σ are the mean and standard deviation of the input along the channel dimension. A subtle challenge is that the input statistics (i.e. μ and σ) change rapidly in NLP tasks, so they must be computed dynamically at run time. Computing μ is straightforward, but computing σ requires the square-root function.
In short, the difficulty of quantizing LayerNorm lies in computing the standard deviation dynamically.
The method in this paper: Newton's iteration
- Each iteration requires only one integer division, one integer addition, and one bit shift to refine the approximation.
Newton's iteration itself is relatively simple: it takes the first two terms of the Taylor expansion and iteratively finds the root of the equation.

- The mean and variance are computed directly in the integer domain, and then the LayerNorm operation is carried out (see the integer square-root sketch below).
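Below is a minimal Python sketch of an integer square root via Newton's iteration in this spirit (illustrative code, not the released implementation): each step uses only an integer division, an integer addition, and a one-bit right shift, and the loop terminates once the estimate stops decreasing.

```python
def int_sqrt(n: int) -> int:
    """Integer square root by Newton's iteration: each step needs one integer
    division, one integer addition, and a one-bit right shift."""
    if n == 0:
        return 0
    x = 1 << ((n.bit_length() + 1) // 2)   # initial guess, >= sqrt(n)
    while True:
        y = (x + n // x) >> 1              # Newton update on f(x) = x^2 - n
        if y >= x:                         # stopped decreasing: x = floor(sqrt(n))
            return x
        x = y

# e.g. the integer standard deviation from an integer variance estimate
print(int_sqrt(1000))   # 31, since 31*31 = 961 <= 1000 < 32*32
```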
4. Experiments
I-BERT is implemented on top of the RoBERTa model, replacing all floating-point operations in the original model with the integer-only operations proposed in this paper. All MatMul operations run on INT8 inputs and accumulate in INT32 precision, and the embedding layer is kept in INT8 precision.
Experimental environment
- TensorRT 7.2.1
- Google Cloud Platform virtual machine with a single Tesla T4 GPU
- CUDA 11.1
- cuDNN 8.0
4.1 Accuracy experiments
Dataset: GLUE
Baseline: FP32 model

Experimental results: I-BERT consistently achieves accuracy comparable to or slightly higher than the baseline.
- RoBERTa-Base: I-BERT achieves higher accuracy in all cases (up to 1.4 points higher on RTE), except for the MNLI-m, QQP, and STS-B tasks (at most 0.3 points lower).
- RoBERTa-Large: the results are similar; I-BERT matches or exceeds the baseline accuracy on all downstream tasks.
- On average, I-BERT outperforms the baseline by 0.3 / 0.5 points for RoBERTa-Base / RoBERTa-Large.
4.2 Latency experiments

Experimental results: compared with the FP32 model, I-BERT deployed on dedicated hardware with efficient integer support achieves significant speedups.
- I-BERT's INT8 inference is on average 3.08× and 3.56× faster than pure FP32 inference for BERT-Base and BERT-Large respectively, with up to a 4.00× speedup.
In addition, further acceleration is possible by integrating NVIDIA plugins that accelerate key operations in the Transformer architecture, which can be expected to be about 2× faster than the APIs provided by TensorRT.
4.3 Ablation experiments
Model accuracy when the GELU computation is performed with GELU, h-GELU, and i-GELU respectively.
Experimental results:
- i-GELU gives a small accuracy improvement over GELU, and is more accurate than h-GELU on all of the above tasks.
- Replacing GELU with the h-GELU approximation reduces accuracy on all downstream tasks except MRPC (by up to 2.2%).
Why is i-GELU better than even the exact GELU? The paper offers no clear explanation.
5. References
- Yao Z., Dong Z., Zheng Z., et al. HAWQ-V3: Dyadic Neural Network Quantization. arXiv, 2020 [accessed 2022-07-26].
The I-BERT paper recommends reading (Yao et al., 2020) to learn more about integer-only quantization with quantization-aware fine-tuning.