TensorRT's INT8 Quantization Principle
2022-07-26 18:56:00 【@BangBang】
Quantization goals
- Convert the 32-bit floating-point weights used in neural network computation into 8-bit integers (INT8), ideally with no significant loss in accuracy.
- Why use INT8? It brings higher throughput and a smaller memory footprint.
- But there are also challenges: INT8 has lower precision and a smaller dynamic range.
- How can accuracy be preserved after quantization? Solution: quantize the model weights and activations to INT8 while minimizing the information loss.
- The approach TensorRT adopts requires no additional fine-tuning or retraining.
INT8 Inference
Challenges
- INT8 has lower precision and a smaller dynamic range than FP32

- As the table shows, the dynamic ranges of 32-bit floating point, 16-bit floating point, and INT8 differ greatly: FP16 covers -65504 ~ +65504, FP32 covers roughly -3.4×10^38 ~ +3.4×10^38, while INT8 covers only -128 ~ 127.
- Therefore a simple type cast from 32-bit floating point to 8-bit integer cannot be used; it would cause a large loss of accuracy (a quick check follows below).
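A quick check of the ranges quoted above, and of why a plain cast fails; this is a minimal sketch assuming NumPy, with hand-picked example values.

```python
import numpy as np

# Dynamic ranges of the three types mentioned in the table.
print(np.finfo(np.float32).min, np.finfo(np.float32).max)   # about -3.4e38 ... +3.4e38
print(np.finfo(np.float16).min, np.finfo(np.float16).max)   # -65504 ... +65504
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)         # -128 ... 127

# A plain cast throws away almost all information: values below 1 collapse to 0,
# and values outside [-128, 127] do not fit in int8 at all.
x = np.array([0.12, -0.52, 0.91, 240.0], dtype=np.float32)
print(x.astype(np.int8))
```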
Linear quantization
The relationship between the INT8 representation and the original tensor is: Tensor Values = FP32 scale factor * int8 array + FP32 bias
Research has shown that the FP32 bias has little effect on accuracy, so it can be dropped.
Symmetric linear quantization
Expressed as: Tensor Values = FP32 scale factor * int8 array
A single FP32 scale factor is enough for the whole int8 array (per-tensor scaling); a short sketch follows below.
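A minimal sketch of symmetric linear quantization with one FP32 scale factor per tensor, and the corresponding dequantization (NumPy assumed; the max-abs scale choice here is a simple illustration).

```python
import numpy as np

def symmetric_quantize(tensor, scale):
    """int8 array = round(tensor / scale), clipped to the symmetric int8 range."""
    return np.clip(np.round(tensor / scale), -127, 127).astype(np.int8)

def symmetric_dequantize(int8_array, scale):
    """Tensor Values ≈ FP32 scale factor * int8 array (no bias term)."""
    return int8_array.astype(np.float32) * scale

weights = np.random.normal(0.0, 0.1, size=(64, 64)).astype(np.float32)
scale = np.abs(weights).max() / 127.0          # one FP32 scale factor for the tensor
w_int8 = symmetric_quantize(weights, scale)
print(np.abs(weights - symmetric_dequantize(w_int8, scale)).max())  # small round-off error
```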
Quantization methods
There are two quantization approaches: unsaturated quantization and saturated quantization.
- Unsaturated quantization

Map the full range of the floating-point values onto -127 ~ 127: the negative maximum maps to -127 and the positive maximum maps to 127. Because a few large outliers stretch the scale, this causes a significant drop in accuracy.
- Saturated quantization

Set a threshold T and map the range -T ~ T onto -127 ~ 127; values smaller than -T are clipped to -127 and values larger than T are clipped to 127. This is saturated quantization.
If a suitable threshold T can be determined, accuracy is preserved well; the key point is how to choose that threshold (a small comparison sketch follows below).
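The contrast between the two mappings can be seen in a minimal sketch (NumPy assumed; the activation values and the threshold T = 2.0 are hypothetical): one outlier forces the unsaturated scale to waste most of the int8 levels, while a saturating threshold keeps the resolution of the bulk of the distribution.

```python
import numpy as np

def quantize(x, scale):
    """Symmetric linear quantization: scale, round, clip to [-127, 127]."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

# Activations concentrated around zero, plus one large outlier.
acts = np.concatenate([np.random.normal(0.0, 0.5, 10000), [30.0]]).astype(np.float32)

# Unsaturated: the scale is dictated by the absolute maximum (the outlier).
scale_unsat = np.abs(acts).max() / 127.0

# Saturated: clip at a threshold T chosen inside the bulk of the distribution.
T = 2.0
scale_sat = T / 127.0

print(len(np.unique(quantize(acts, scale_unsat))))   # only a handful of int8 levels used
print(len(np.unique(quantize(acts, scale_sat))))     # far more levels used for the bulk
```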
How to optimize threshold selection
The INT8 representation requires a trade-off between dynamic range and precision.
The figures above show different networks: the horizontal axis is the activation value and the vertical axis is the normalized frequency of occurrence. The first plot shows that in vgg19 conv3_4 large activation values occur rarely; the other two plots show the activation distributions of resnet152 and googlenet:inception_3a.
We want to minimize the information loss: converting 32-bit floating-point numbers to 8-bit integers is simply a re-encoding of the information.
Relative entropy
- What we want: ideally the INT8 model expresses the same information as the original FP32 model; when this cannot be achieved, we want to minimize the information loss.
- The information loss is measured with KL divergence. KL divergence measures the difference between two probability distributions, and therefore the information lost by the new encoding (a small sketch of the computation follows below).
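A minimal sketch of the KL-divergence computation between a reference histogram and a quantized histogram (NumPy assumed); it is reused by the threshold-search sketch further down.

```python
import numpy as np

def kl_divergence(ref_distr, quant_distr, eps=1e-12):
    """KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)), with both histograms normalized."""
    p = ref_distr / ref_distr.sum()
    q = quant_distr / quant_distr.sum()
    mask = p > 0                        # bins with P(i) == 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps))))
```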
Solution: Calibration
Run FP32 inference on a calibration dataset, where the calibration dataset is a sample of images drawn from the training set.
For each layer:
- Collect the distribution of activation values (histograms)
- Use different saturation thresholds to produce different quantized distributions
- Select the threshold whose quantized distribution is closest to the distribution of the activation values, i.e. the one that minimizes KL_divergence(ref_distr, quant_distr). Iterating over candidate thresholds yields the most suitable saturation threshold (see the sketch below).
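Below is a simplified sketch of this threshold search, loosely following the entropy-calibration pseudocode from NVIDIA's INT8 material; it is an illustration (NumPy assumed), not TensorRT's exact implementation. It reuses the kl_divergence helper sketched above.

```python
import numpy as np

def find_threshold(activations, num_bins=2048, num_quant_levels=128):
    """Search the clipping threshold that minimizes KL(reference || quantized)."""
    abs_acts = np.abs(activations)
    hist, edges = np.histogram(abs_acts, bins=num_bins, range=(0.0, abs_acts.max()))
    hist = hist.astype(np.float64)

    best_kl, best_i = np.inf, num_quant_levels
    for i in range(num_quant_levels, num_bins + 1):
        # Reference distribution P: the first i bins, with all outlier mass
        # folded into the last kept bin (this is the clipping).
        p = hist[:i].copy()
        p[-1] += hist[i:].sum()

        # Quantize: merge the i fine bins into num_quant_levels coarse levels ...
        chunks = np.array_split(hist[:i], num_quant_levels)

        # ... then expand Q back to i bins, spreading each level's mass
        # uniformly over the non-empty fine bins it covers.
        q = np.zeros(i)
        start = 0
        for c in chunks:
            nonzero = c > 0
            if nonzero.any():
                q[start:start + len(c)][nonzero] = c.sum() / nonzero.sum()
            start += len(c)

        kl = kl_divergence(p, q)            # helper from the sketch above
        if kl < best_kl:
            best_kl, best_i = kl, i

    return edges[best_i]                    # the chosen saturation threshold T
```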

Calibration algorithm
Calibration: iteratively search for the threshold, driven by experiments.
- Provide a sample dataset (preferably a subset of the validation set), called the "calibration dataset", which is used for calibration.
- Run FP32 inference on the calibration dataset. Collect histograms of the weights and activations, generate a set of candidate 8-bit representations, and choose the one with the smallest KL divergence.
- The KL divergence is computed between the reference distribution (the FP32 distribution) and the quantized distribution (i.e. the 8-bit quantized activations).
TensorRT provides the Int8EntropyCalibrator interface; the caller implements it to supply the calibration data, and sample code is provided for caching the calibration results (a sketch follows below).
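A minimal sketch of implementing this interface with the TensorRT Python bindings (the Python class is trt.IInt8EntropyCalibrator2). pycuda is assumed for device memory, and the batches argument is a hypothetical list of preprocessed NumPy arrays, one calibration batch each.

```python
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)

class MyEntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, batches, cache_file="calibration.cache"):
        super().__init__()
        self.batches = batches                       # list of NumPy arrays
        self.index = 0
        self.cache_file = cache_file
        self.device_input = cuda.mem_alloc(batches[0].nbytes)

    def get_batch_size(self):
        return self.batches[0].shape[0]

    def get_batch(self, names):
        if self.index >= len(self.batches):
            return None                              # no more data: calibration done
        cuda.memcpy_htod(self.device_input,
                         np.ascontiguousarray(self.batches[self.index]))
        self.index += 1
        return [int(self.device_input)]              # device pointer per input tensor

    def read_calibration_cache(self):
        # Reuse a previous calibration run if a cache file exists.
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)
```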
How to use KL divergence to choose the appropriate threshold
NVIDIA chose KL divergence, which is relative entropy. Relative entropy describes the difference between two distributions, here the distributions before and after quantization. The smaller the difference the better, so the problem becomes minimizing the relative entropy.
KL divergence precisely measures the gap between the optimal and a suboptimal encoding.
FP32 is the original, optimal encoding; INT8 is a suboptimal encoding; KL divergence describes the difference between the two.
TensorRT quantization process (workflow)
Prerequisites
- A trained FP32 model
- A calibration dataset
Work done by TensorRT
- Run the FP32 model on the calibration dataset
- Collect statistics (histograms) of the weights and activation values under different thresholds
- Run the calibration algorithm to obtain the optimal scale factors
- Quantize the FP32 weights to INT8
- Finally generate a calibration table and an INT8 inference engine (a build sketch follows below)
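A minimal sketch of driving this workflow through the TensorRT Python API, assuming an ONNX model as input; "model.onnx", calibration_batches, and MyEntropyCalibrator (from the earlier sketch) are placeholders.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:
    parser.parse(f.read())                           # load the trained FP32 model

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)                # enable INT8 kernels
config.int8_calibrator = MyEntropyCalibrator(calibration_batches)

# Building the engine runs FP32 inference on the calibration data, searches the
# saturation thresholds, writes the calibration table, and emits the INT8 engine.
serialized_engine = builder.build_serialized_network(network, config)
with open("model_int8.engine", "wb") as f:
    f.write(serialized_engine)
```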

The left figure shows quantization without saturation; the right figure shows quantization with saturation. The white line marks the saturation threshold: values below the threshold are mapped as before, while values above the threshold are all quantized (clipped) to a single value.