TensorRT's INT8 quantization principle
2022-07-26 18:56:00 【@BangBang】
Quantization goals
- Convert the weights of a neural network from 32-bit floating point to 8-bit integers (INT8), ideally with no significant drop in accuracy.
- Why INT8? It brings higher throughput and a smaller memory footprint.
- There are challenges, though: INT8 has lower precision and a smaller dynamic range.
- How do we preserve accuracy after quantization? Solution: minimize the information lost when quantizing the model's weights and activations to INT8.
- The method TensorRT adopts requires no additional fine-tuning or retraining.
INT8 inference
Challenge
- Compared with FP32, INT8 has lower precision and a smaller dynamic range.

- As the table shows, the dynamic ranges of 32-bit floating point, 16-bit floating point and INT8 differ greatly: FP16 covers -65504 ~ +65504, FP32 covers about -3.4 x 10^38 ~ +3.4 x 10^38, while INT8 covers only -128 ~ 127.
- So we cannot use a simple type conversion from 32-bit floating point to 8-bit integer; doing so would cause a large loss of accuracy.
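The ranges quoted above can be reproduced from the bit layouts of the formats. The following is a minimal sketch, assuming the standard IEEE 754 layouts (10 mantissa bits for FP16, 23 for FP32); the helper `naive_cast_to_int8` and the sample activations are hypothetical and only illustrate what a plain type conversion does to small values.

```python
# Largest finite values, computed from the bit layouts (assumed IEEE 754):
# FP16 has 10 mantissa bits and a maximum exponent of 15,
# FP32 has 23 mantissa bits and a maximum exponent of 127.
fp16_max = (2 - 2**-10) * 2**15
fp32_max = (2 - 2**-23) * 2**127
int8_min, int8_max = -(2**7), 2**7 - 1


def naive_cast_to_int8(x: float) -> int:
    """Plain type conversion: truncate toward zero, then clamp to the INT8 range."""
    return max(int8_min, min(int8_max, int(x)))


# Hypothetical activation values: most are smaller than 1, one is large.
activations = [0.02, -0.37, 0.91, 3.6, 250.0]
casted = [naive_cast_to_int8(a) for a in activations]

print(fp16_max, fp32_max, (int8_min, int8_max))
print(casted)  # every value below 1 collapses to 0
```

All sub-unit values map to the same integer, which is why a scale factor is needed before rounding.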
Linear quantization
The relationship between the INT8 representation and the original tensor is: Tensor values = FP32 scale factor * INT8 array + FP32 bias
Experiments showed that the FP32 bias has little effect on accuracy, so it can be removed.
Symmetric linear quantization
Expressed as: Tensor values = FP32 scale factor * INT8 array
A single FP32 scale factor is enough for the whole INT8 array.
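The symmetric formula can be illustrated with a short sketch. The function names `symmetric_quantize` and `dequantize` and the sample weights are assumptions made for this example, not TensorRT API; the scale factor here is simply the largest absolute value divided by 127.

```python
def symmetric_quantize(values, num_levels=127):
    """Quantize floats to INT8 codes with one shared scale factor and no bias."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / num_levels if max_abs > 0 else 1.0
    codes = [int(round(v / scale)) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Tensor values = FP32 scale factor * INT8 array."""
    return [c * scale for c in codes]


weights = [-1.27, -0.5, 0.0, 0.25, 1.0]  # hypothetical FP32 weights
codes, scale = symmetric_quantize(weights)
restored = dequantize(codes, scale)

print(codes, scale)
print(restored)
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`).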
Quantization methods
There are two ways to quantize: non-saturating quantization and saturating quantization.
- Non-saturating quantization

Map the full range of the floating-point values onto -127 ~ 127: the largest negative value maps to -127 and the largest positive value maps to 127. This can lead to a significant drop in accuracy.
- Saturating quantization

Choose a threshold T and map the range -T to T onto -127 ~ 127. Values smaller than -T map to -127 and values greater than T map to 127. This is saturating quantization.
If a good threshold T can be determined, accuracy improves considerably. The key question is how to select a suitable threshold.
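The difference between the two methods shows up as soon as the data contains an outlier. This is a minimal sketch under assumed data: the function `quantize` and the activation values are hypothetical, and the saturating threshold of 0.5 is picked by hand rather than by calibration.

```python
def quantize(values, threshold):
    """Map [-threshold, threshold] linearly onto [-127, 127]; clamp values outside it."""
    scale = threshold / 127
    codes = []
    for v in values:
        clipped = max(-threshold, min(threshold, v))
        codes.append(int(round(clipped / scale)))
    return codes, scale


# Hypothetical activations: five small values and one large outlier.
activations = [0.1, 0.2, 0.3, 0.4, 0.5, 50.0]

# Non-saturating: the threshold is the largest absolute value.
no_sat_codes, _ = quantize(activations, threshold=max(abs(a) for a in activations))
# Saturating: a hand-picked threshold T = 0.5; the outlier is clamped to 127.
sat_codes, _ = quantize(activations, threshold=0.5)

print(no_sat_codes)
print(sat_codes)
```

Without saturation the five small values share only two integer codes; with saturation they keep five distinct codes, at the cost of clamping the outlier.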
How to optimize threshold selection
An INT8 representation requires a trade-off between dynamic range and precision.
The figure shows activation distributions for different networks. The horizontal axis is the activation value and the vertical axis is the normalized number of occurrences. In the first plot (vgg19: conv3_4), large activation values occur rarely; the other two plots show the activation distributions of resnet152 and googlenet: inception_3a.
The goal is to minimize information loss: converting 32-bit floating-point numbers to 8-bit integers can be viewed as simply re-encoding the information.
Relative entropy
- We want the INT8 model to express the same information as the original FP32 model. If that is not possible, we want to minimize the loss of information.
- The loss of information is measured by KL divergence. KL divergence measures the difference between two probability distributions, and therefore the information lost by the new encoding.
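For two discrete distributions P and Q, the KL divergence is the sum over all bins of p_i * log(p_i / q_i). The sketch below is a plain implementation of that textbook definition; the function name `kl_divergence` and the three example distributions are assumptions for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i); inputs are normalized first."""
    p_sum, q_sum = sum(p), sum(q)
    result = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue  # bins with p_i = 0 contribute nothing
        if q_i == 0:
            return math.inf  # Q cannot represent a bin that P uses
        result += (p_i / p_sum) * math.log((p_i / p_sum) / (q_i / q_sum))
    return result


reference = [0.1, 0.2, 0.4, 0.2, 0.1]   # hypothetical "FP32" distribution
close = [0.1, 0.25, 0.3, 0.25, 0.1]     # similar shape
far = [0.6, 0.1, 0.1, 0.1, 0.1]         # very different shape

print(kl_divergence(reference, reference))
print(kl_divergence(reference, close))
print(kl_divergence(reference, far))
```

Identical distributions give zero, and the value grows as the second distribution drifts away from the reference.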
Solution: Calibration
The FP32 model runs inference on a calibration dataset; the calibration dataset consists of some images drawn from the training set.
For each layer:
- Collect the distribution of activation values (histograms).
- Generate different quantized distributions using different saturation thresholds.
- Select the threshold whose quantized distribution minimizes the KL divergence from the distribution of activation values.
The quantity being minimized is KL_divergence(ref_distr, quant_distr); iterating over the candidate thresholds yields the most suitable saturation threshold.
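The per-layer search can be sketched in pure Python. This is a simplified illustration, not TensorRT's implementation: the names `find_saturation_threshold` and `kl_divergence`, the bin count of 1024, the `eps` floor used in place of proper smoothing, and the synthetic activations are all assumptions.

```python
import math
import random


def kl_divergence(p, q, eps=1e-6):
    """KL(P || Q) on normalized histograms; empty Q bins are floored at eps (assumption)."""
    p_sum, q_sum = sum(p), sum(q)
    if q_sum == 0:
        return math.inf
    kl = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0:
            p_n = p_i / p_sum
            q_n = max(q_i / q_sum, eps)
            kl += p_n * math.log(p_n / q_n)
    return kl


def find_saturation_threshold(abs_values, num_bins=1024, num_levels=128):
    """Try every candidate threshold and keep the one with the smallest KL divergence."""
    max_val = max(abs_values)
    bin_width = max_val / num_bins
    hist = [0] * num_bins
    for v in abs_values:
        hist[min(int(v / bin_width), num_bins - 1)] += 1

    best_kl, best_i = math.inf, num_bins
    for i in range(num_levels, num_bins + 1):
        # Reference distribution: the first i bins, with clipped values folded into the last bin.
        ref = hist[:i]
        ref[-1] += sum(hist[i:])
        # Candidate distribution: merge the i bins into num_levels levels, then
        # spread each level's count evenly over its non-empty source bins.
        quant = [0.0] * i
        for level in range(num_levels):
            start, stop = level * i // num_levels, (level + 1) * i // num_levels
            nonzero = [j for j in range(start, stop) if hist[j] > 0]
            if nonzero:
                share = sum(hist[start:stop]) / len(nonzero)
                for j in nonzero:
                    quant[j] = share
        kl = kl_divergence(ref, quant)
        if kl < best_kl:
            best_kl, best_i = kl, i
    return best_i * bin_width  # upper edge of the last bin that is kept


# Synthetic activations: a bell-shaped bulk plus a handful of large outliers.
rng = random.Random(0)
activations = [abs(rng.gauss(0.0, 1.0)) for _ in range(20000)] + [100.0] * 5
threshold = find_saturation_threshold(activations)
print(threshold, max(activations))
```

With `num_bins=1024` and `num_levels=128`, the smallest threshold this sketch can return is one eighth of the maximum value; a finer histogram widens the search range at the cost of more iterations.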

Calibration algorithm
Calibration: an experiment-based iterative search for the threshold.
- Provide a sample dataset (preferably a subset of the validation set), called the "calibration dataset", which is used for calibration.
- Run FP32 inference on the calibration dataset. Collect histograms of the weights and activations, generate a set of candidate 8-bit representations, and choose the one with the smallest KL divergence.
- The KL divergence is computed between the reference distribution (the FP32 activations) and the quantized distribution (the 8-bit quantized activations).
TRT provides Int8EntropyCalibrator. This interface must be implemented by the client to supply the calibration data, along with some boilerplate code for caching the results.
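The shape of such a calibrator can be sketched without TensorRT installed. The class below is a pure-Python stand-in, not the real interface: in the TensorRT Python API the client subclasses `tensorrt.IInt8EntropyCalibrator2` and `get_batch` returns device memory pointers, whereas here the batches are ordinary Python lists. The class name `EntropyCalibratorSketch`, the sample batches and the cache file name are hypothetical.

```python
import os
import tempfile


class EntropyCalibratorSketch:
    """Stand-in that mirrors the four methods a TensorRT entropy calibrator implements."""

    def __init__(self, batches, cache_path):
        self._batches = iter(batches)
        self._batch_size = len(batches[0])
        self._cache_path = cache_path

    def get_batch_size(self):
        return self._batch_size

    def get_batch(self, names):
        # A real calibrator copies the batch to the GPU and returns one device
        # pointer per input name; returning None signals the end of the data.
        return next(self._batches, None)

    def read_calibration_cache(self):
        # Reusing a cached calibration table lets the builder skip calibration.
        if os.path.exists(self._cache_path):
            with open(self._cache_path, "rb") as f:
                return f.read()
        return None

    def write_calibration_cache(self, cache):
        with open(self._cache_path, "wb") as f:
            f.write(cache)


# Hypothetical calibration data: two batches with one sample each.
batches = [[[0.1, 0.2]], [[0.3, 0.4]]]
cache_file = os.path.join(tempfile.mkdtemp(), "calibration.cache")
calibrator = EntropyCalibratorSketch(batches, cache_file)
print(calibrator.get_batch_size(), calibrator.read_calibration_cache())
```

The first build finds no cache and consumes the batches; later builds can read the cache written by `write_calibration_cache` instead of running inference again.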
How to use KL divergence to choose a suitable threshold
NVIDIA chose KL divergence, which is in fact relative entropy. Relative entropy describes the difference between two distributions, here the distributions before and after quantization. The smaller the difference the better, so the problem becomes finding the minimum of the relative entropy.
KL divergence gives a precise measure of the difference between an optimal encoding and a suboptimal one.
FP32 is the original, optimal encoding and INT8 is the suboptimal encoding; KL divergence describes the difference between the two.
TensorRT quantization workflow
Prerequisites
- A model trained in FP32
- A calibration dataset
What TensorRT does
- Run inference with the FP32 model on the calibration dataset
- Collect statistics (histograms) of the weights and activations under different thresholds
- Run the calibration algorithm to obtain the optimal scale factors
- Quantize the FP32 weights to INT8
- Finally, generate a calibration table and an executable INT8 inference engine

The left figure does not use saturation; the right figure does. The white line marks the saturation threshold: values below the threshold are unchanged, and values above the threshold are all quantized to a single value.