PyTorch Quantization in Practice (1)
2022-06-30 21:43:00 【Breeze_】
Translated from https://pytorch.org/blog/quantization-in-practice/
Quantization is a cheap and easy way to make your deep neural network run faster and with lower memory requirements. PyTorch offers several different approaches to quantize models. In this blog post, we'll (quickly) lay a foundation for quantization in deep learning, and then look at how each technique works in practice. Finally, we'll end with recommendations from the literature for using quantization in your workflows.
Principles of quantization
If someone asks you what time it is, you don't answer "10:14:34.430705"; you say "a quarter past ten".
The essence of quantization is information compression; in deep networks it refers to reducing the numerical precision of the weights and/or activations.
Over-parameterized deep neural networks (DNNs) have more degrees of freedom, which makes them good candidates for information compression [1]. When you quantize a model, two things generally happen: the model gets smaller and runs more efficiently. Hardware vendors explicitly allow for faster processing of 8-bit data (compared to 32-bit data), resulting in higher throughput. A smaller model has a lower memory footprint and power consumption [2], which is crucial for deployment at the edge.
Mapping function
The mapping function maps values from floating-point space to integer space. A commonly used mapping function is the linear transformation $Q(r)=\operatorname{round}(r/S+Z)$, where $r$ is the input and $S$ and $Z$ are the quantization parameters.
To convert back to floating-point space, the corresponding inverse function is $\tilde{r}=(Q(r)-Z)\cdot S$. In general $\tilde{r}\neq r$, and the difference between them constitutes the quantization error.
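To make the round trip concrete, here is a minimal hand-rolled sketch; the values of S and Z below are illustrative, not calibrated:

import torch

r = torch.tensor([-1.21, 0.0, 0.43, 2.5])         # floating-point input
S, Z = 0.02, 110                                   # assumed scale and zero-point, for illustration only
q = torch.clamp(torch.round(r / S + Z), 0, 255)    # quantize into the unsigned 8-bit range [0, 255]
r_tilde = (q - Z) * S                              # dequantize back to floating point
print(q, r_tilde, (r - r_tilde).abs())             # the last term is the quantization error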
Quantization parameters
The mapping function is parameterized by the scaling factor $S$ and the zero-point $Z$. $S$ is the ratio of the input range to the output range: $S=\frac{\beta-\alpha}{\beta_{q}-\alpha_{q}}$, where $[\alpha, \beta]$ is the clipping range of the input, i.e. the boundaries of permissible inputs, and $[\alpha_q, \beta_q]$ is the range in quantized space that it is mapped to. For 8-bit quantization, the output range satisfies $\beta_{q}-\alpha_{q} \leq 2^{8}-1$.
$Z$ acts as a bias to ensure that a 0 in the input space maps perfectly to a 0 in the quantized space: $Z=-\left(\frac{\alpha}{S}-\alpha_{q}\right)$.
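For example (the clipping range below is an assumed one, purely for illustration), computing the qparams for an unsigned 8-bit output:

alpha, beta = -3.0, 5.0                    # assumed input clipping range
alpha_q, beta_q = 0, 255                   # quantized output range for unsigned 8-bit integers
S = (beta - alpha) / (beta_q - alpha_q)    # scaling factor
Z = -(alpha / S - alpha_q)                 # zero-point, so that 0.0 in input space maps to an integer
print(S, round(Z))                         # roughly 0.0314 and 96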
Calibration
The process of choosing the input clipping range is known as calibration. The simplest technique (and the default in PyTorch) is to record the running minimum and maximum values and assign them to $\alpha$ and $\beta$. TensorRT also uses entropy minimization (KL divergence), mean-square-error minimization, or percentiles of the input range.
In PyTorch, Observer modules (docs, code) collect statistics on the input values and calculate the quantization parameters $S$ and $Z$. Different calibration schemes produce different quantized outputs, and it is best to verify empirically which scheme works best for your application and architecture (more on that later).
import torch
from torch.quantization.observer import MinMaxObserver, MovingAverageMinMaxObserver, HistogramObserver
C, L = 3, 4
normal = torch.distributions.normal.Normal(0,1)
inputs = [normal.sample((C, L)), normal.sample((C, L))]
print(inputs)
# >>>>>
# [tensor([[-0.0590, 1.1674, 0.7119, -1.1270],
# [-1.3974, 0.5077, -0.5601, 0.0683],
# [-0.0929, 0.9473, 0.7159, -0.4574]]),
# tensor([[-0.0236, -0.7599, 1.0290, 0.8914],
# [-1.1727, -1.2556, -0.2271, 0.9568],
# [-0.2500, 1.4579, 1.4707, 0.4043]])]
observers = [MinMaxObserver(), MovingAverageMinMaxObserver(), HistogramObserver()]
for obs in observers:
  for x in inputs: obs(x)
  print(obs.__class__.__name__, obs.calculate_qparams())
# >>>>>
# MinMaxObserver (tensor([0.0112]), tensor([124], dtype=torch.int32))
# MovingAverageMinMaxObserver (tensor([0.0101]), tensor([139], dtype=torch.int32))
# HistogramObserver (tensor([0.0100]), tensor([106], dtype=torch.int32))
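As a quick follow-up sketch (not in the original post), the qparams produced by an observer can be passed directly to torch.quantize_per_tensor; this reuses obs (the last observer in the loop above) and inputs:

scale, zero_point = obs.calculate_qparams()   # qparams from the last observer in the loop above
xq = torch.quantize_per_tensor(inputs[0], float(scale), int(zero_point), torch.quint8)
print(xq.int_repr())    # the underlying uint8 values
print(xq.dequantize())  # approximate reconstruction of the original floats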
Affine and symmetric quantization schemes
Affine or asymmetric quantization schemes assign the input range to the observed minimum and maximum values. Affine schemes generally offer tighter clipping ranges and are useful for quantizing non-negative activations (the input range does not need to contain negative values if the input tensors are never negative). The range is calculated as $\alpha=\min(r),\ \beta=\max(r)$. Affine quantization leads to more computationally expensive inference when used for weight tensors [3].
Symmetric quantization schemes center the input range around 0, which eliminates the need to compute a zero-point offset. The range is calculated as $-\alpha=\beta=\max(|\min(r)|, |\max(r)|)$. For skewed signals (like non-negative activations) this can result in poor quantization resolution, because the clipping range includes values that never show up in the input.
In short: affine (asymmetric) quantization gives tighter ranges for non-negative activations but is computationally more expensive when applied to weight tensors, while symmetric quantization is cheaper but can waste resolution on non-negative activations.
import numpy as np
import matplotlib.pyplot as plt

act = torch.distributions.pareto.Pareto(1, 10).sample((1,1024)).flatten()
weights = torch.distributions.normal.Normal(0, 0.12).sample((3, 64, 7, 7)).flatten()

def get_symmetric_range(x):
  beta = torch.max(x.max(), x.min().abs())
  return -beta.item(), beta.item()

def get_affine_range(x):
  return x.min().item(), x.max().item()

def plot(plt, data, scheme):
  boundaries = get_affine_range(data) if scheme == 'affine' else get_symmetric_range(data)
  a, _, _ = plt.hist(data, density=True, bins=100)
  ymin, ymax = np.quantile(a[a>0], [0.25, 0.95])
  plt.vlines(x=boundaries, ls='--', colors='purple', ymin=ymin, ymax=ymax)
fig, axs = plt.subplots(2,2)
plot(axs[0, 0], act, 'affine')
axs[0, 0].set_title("Activation, Affine-Quantized")
plot(axs[0, 1], act, 'symmetric')
axs[0, 1].set_title("Activation, Symmetric-Quantized")
plot(axs[1, 0], weights, 'affine')
axs[1, 0].set_title("Weights, Affine-Quantized")
plot(axs[1, 1], weights, 'symmetric')
axs[1, 1].set_title("Weights, Symmetric-Quantized")
plt.show()

In PyTorch, you can specify the affine or symmetric scheme when initializing the Observer. Note that not all observers support both schemes.
for qscheme in [torch.per_tensor_affine, torch.per_tensor_symmetric]:
  obs = MovingAverageMinMaxObserver(qscheme=qscheme)
  for x in inputs: obs(x)
  print(f"Qscheme: {qscheme} | {obs.calculate_qparams()}")
# >>>>>
# Qscheme: torch.per_tensor_affine | (tensor([0.0101]), tensor([139], dtype=torch.int32))
# Qscheme: torch.per_tensor_symmetric | (tensor([0.0109]), tensor([128]))
Per-tensor and per-channel quantization schemes
Quantization parameters can be calculated for the layer's entire weight tensor as a whole, or separately for each channel. In per-tensor quantization, the same clipping range is applied to all the channels in a layer:
[Figure 3: Per-channel quantization uses one set of qparams for each channel; per-tensor uses the same qparams for the entire tensor. (https://pytorch.org/assets/images/quantization-practice/per-channel-tensor.svg)]
For weight quantization, symmetric per-channel quantization provides better accuracy; per-tensor quantization performs poorly, possibly due to the high variance in conv weights across channels introduced by batchnorm folding [3].
from torch.quantization.observer import MovingAveragePerChannelMinMaxObserver
obs = MovingAveragePerChannelMinMaxObserver(ch_axis=0) # calculate qparams for all `C` channels separately
for x in inputs: obs(x)
print(obs.calculate_qparams())
# >>>>>
# (tensor([0.0090, 0.0075, 0.0055]), tensor([125, 187, 82], dtype=torch.int32))
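Similarly (again a sketch, not part of the original snippet), the per-channel qparams can be applied with torch.quantize_per_channel; axis 0 matches the ch_axis=0 used above:

scales, zero_points = obs.calculate_qparams()
# convert dtypes explicitly for compatibility; axis=0 quantizes each of the C rows with its own qparams
xq = torch.quantize_per_channel(inputs[0], scales.double(), zero_points.long(), 0, torch.quint8)
print(xq.int_repr())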
Backend engine
Currently, quantized operators run on x86 machines via the FBGEMM backend, or use QNNPACK primitives on ARM machines. Backend support for server GPUs (via TensorRT and cuDNN) is coming soon. Learn more about extending quantization to custom backends: RFC-0019.
backend = 'fbgemm' if x86 else 'qnnpack'  # 'x86' stands for a boolean you set based on your target CPU
qconfig = torch.quantization.get_default_qconfig(backend)
torch.backends.quantized.engine = backend
QConfig
QConfig stores the Observers and the quantization schemes used to quantize activations and weights.
Be sure to pass the Observer class (not an instance), or a callable that returns an Observer instance. Use with_args() to override the default arguments.
my_qconfig = torch.quantization.QConfig(
activation=MovingAverageMinMaxObserver.with_args(qscheme=torch.per_tensor_affine),
weight=MovingAveragePerChannelMinMaxObserver.with_args(qscheme=torch.qint8)
)
# >>>>>
# QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.MovingAverageMinMaxObserver'>, qscheme=torch.per_tensor_affine){}, weight=functools.partial(<class 'torch.ao.quantization.observer.MovingAveragePerChannelMinMaxObserver'>, qscheme=torch.qint8){})
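As a rough sketch of where a QConfig fits downstream (the toy model and the default fbgemm qconfig below are assumptions for illustration, not part of the original snippet), eager-mode post-training static quantization looks roughly like this:

import torch.nn as nn

model = nn.Sequential(
    torch.quantization.QuantStub(),    # marks where fp32 -> int8 conversion happens
    nn.Conv2d(3, 16, 3),
    nn.ReLU(),
    torch.quantization.DeQuantStub(),  # marks where int8 -> fp32 conversion happens
).eval()
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')  # or a custom QConfig such as my_qconfig
torch.quantization.prepare(model, inplace=True)    # inserts observers
# ... run representative calibration data through the model here ...
torch.quantization.convert(model, inplace=True)    # swaps modules for their quantized counterparts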