PyTorch Quantization in Practice (1)
2022-06-30 21:43:00 【Breeze_】
Translated from https://pytorch.org/blog/quantization-in-practice/
Quantization is a cheap and easy way to make your deep neural network run faster and with lower memory requirements. PyTorch offers several different approaches to quantize models. In this blog post, we'll (quickly) lay a foundation for quantization in deep learning, and then look at how each technique works in practice. Finally, we'll end with recommendations from the literature for using quantization in your workflows.
Principles of quantization
If someone asks you what time it is, you don't answer "10:14:34.430705"; you say "a quarter past ten".
The essence of quantization is information compression; in deep networks it refers to reducing the numerical precision of the weights and/or activations.
Over-parameterized deep neural networks (DNNs) have more degrees of freedom, which makes them good candidates for information compression [1]. When you quantize a model, two things generally happen: the model gets smaller and runs more efficiently. Hardware vendors explicitly allow for faster processing of 8-bit data (compared to 32-bit data), resulting in higher throughput. A smaller model has a lower memory footprint and power consumption [2], which is crucial for deployment at the edge.
Mapping function
The mapping function maps values from floating-point space to integer space. A commonly used mapping function is the linear transformation $Q(r)=\operatorname{round}(r/S+Z)$, where $r$ is the input and $S$ and $Z$ are the quantization parameters.
To convert back to floating-point space, the corresponding inverse function is $\tilde{r}=(Q(r)-Z)\cdot S$. In general $\tilde{r}\neq r$, and the difference between them constitutes the quantization error.
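To make the round trip concrete, here is a minimal hand-rolled sketch; the values of S and Z below are illustrative, not calibrated:

import torch

r = torch.tensor([-1.21, 0.0, 0.43, 2.5])         # floating-point input
S, Z = 0.02, 110                                   # assumed scale and zero-point, for illustration only
q = torch.clamp(torch.round(r / S + Z), 0, 255)    # quantize into the unsigned 8-bit range [0, 255]
r_tilde = (q - Z) * S                              # dequantize back to floating point
print(q, r_tilde, (r - r_tilde).abs())             # the last term is the quantization error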
Quantization parameters
The mapping function is parameterized by the scaling factor $S$ and the zero-point $Z$. $S$ is the ratio of the input range to the output range: $S=\frac{\beta-\alpha}{\beta_{q}-\alpha_{q}}$, where $[\alpha, \beta]$ is the clipping range of the input, i.e. the boundaries of permissible inputs, and $[\alpha_q, \beta_q]$ is the range in quantized space that it is mapped to. For 8-bit quantization, the output range satisfies $\beta_{q}-\alpha_{q} \leq 2^{8}-1$.
$Z$ acts as a bias to ensure that a 0 in the input space maps perfectly to a 0 in the quantized space: $Z=-\left(\frac{\alpha}{S}-\alpha_{q}\right)$.
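For example (the clipping range below is an assumed one, purely for illustration), computing the qparams for an unsigned 8-bit output:

alpha, beta = -3.0, 5.0                    # assumed input clipping range
alpha_q, beta_q = 0, 255                   # quantized output range for unsigned 8-bit integers
S = (beta - alpha) / (beta_q - alpha_q)    # scaling factor
Z = -(alpha / S - alpha_q)                 # zero-point, so that 0.0 in input space maps to an integer
print(S, round(Z))                         # roughly 0.0314 and 96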
Calibration
The process of choosing the input clipping range is known as calibration. The simplest technique (and the default in PyTorch) is to record the running minimum and maximum values and assign them to $\alpha$ and $\beta$. TensorRT also uses entropy minimization (KL divergence), mean-square-error minimization, or percentiles of the input range.
In PyTorch, Observer modules (docs, code) collect statistics on the input values and calculate the quantization parameters $S$ and $Z$. Different calibration schemes produce different quantized outputs, and it is best to verify empirically which scheme works best for your application and architecture (more on that later).
import torch
from torch.quantization.observer import MinMaxObserver, MovingAverageMinMaxObserver, HistogramObserver
C, L = 3, 4
normal = torch.distributions.normal.Normal(0,1)
inputs = [normal.sample((C, L)), normal.sample((C, L))]
print(inputs)
# >>>>>
# [tensor([[-0.0590, 1.1674, 0.7119, -1.1270],
# [-1.3974, 0.5077, -0.5601, 0.0683],
# [-0.0929, 0.9473, 0.7159, -0.4574]]),
# tensor([[-0.0236, -0.7599, 1.0290, 0.8914],
# [-1.1727, -1.2556, -0.2271, 0.9568],
# [-0.2500, 1.4579, 1.4707, 0.4043]])]
observers = [MinMaxObserver(), MovingAverageMinMaxObserver(), HistogramObserver()]
for obs in observers:
  for x in inputs: obs(x)
  print(obs.__class__.__name__, obs.calculate_qparams())
# >>>>>
# MinMaxObserver (tensor([0.0112]), tensor([124], dtype=torch.int32))
# MovingAverageMinMaxObserver (tensor([0.0101]), tensor([139], dtype=torch.int32))
# HistogramObserver (tensor([0.0100]), tensor([106], dtype=torch.int32))
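As a quick follow-up sketch (not in the original post), the qparams produced by an observer can be passed directly to torch.quantize_per_tensor; this reuses obs (the last observer in the loop above) and inputs:

scale, zero_point = obs.calculate_qparams()   # qparams from the last observer in the loop above
xq = torch.quantize_per_tensor(inputs[0], float(scale), int(zero_point), torch.quint8)
print(xq.int_repr())    # the underlying uint8 values
print(xq.dequantize())  # approximate reconstruction of the original floats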
Affine and symmetric quantization schemes
Affine or asymmetric quantization schemes assign the input range to the observed minimum and maximum values. Affine schemes generally offer tighter clipping ranges and are useful for quantizing non-negative activations (the input range does not need to contain negative values if the input tensors are never negative). The range is calculated as $\alpha=\min(r),\ \beta=\max(r)$. Affine quantization leads to more computationally expensive inference when used for weight tensors [3].
Symmetric quantization schemes center the input range around 0, which eliminates the need to compute a zero-point offset. The range is calculated as $-\alpha=\beta=\max(|\min(r)|, |\max(r)|)$. For skewed signals (like non-negative activations) this can result in poor quantization resolution, because the clipping range includes values that never show up in the input.
In short: affine (asymmetric) quantization gives tighter ranges for non-negative activations but is computationally more expensive when applied to weight tensors, while symmetric quantization is cheaper but can waste resolution on non-negative activations.
import numpy as np
import matplotlib.pyplot as plt

act = torch.distributions.pareto.Pareto(1, 10).sample((1,1024)).flatten()
weights = torch.distributions.normal.Normal(0, 0.12).sample((3, 64, 7, 7)).flatten()

def get_symmetric_range(x):
  beta = torch.max(x.max(), x.min().abs())
  return -beta.item(), beta.item()

def get_affine_range(x):
  return x.min().item(), x.max().item()

def plot(plt, data, scheme):
  boundaries = get_affine_range(data) if scheme == 'affine' else get_symmetric_range(data)
  a, _, _ = plt.hist(data, density=True, bins=100)
  ymin, ymax = np.quantile(a[a>0], [0.25, 0.95])
  plt.vlines(x=boundaries, ls='--', colors='purple', ymin=ymin, ymax=ymax)
fig, axs = plt.subplots(2,2)
plot(axs[0, 0], act, 'affine')
axs[0, 0].set_title("Activation, Affine-Quantized")
plot(axs[0, 1], act, 'symmetric')
axs[0, 1].set_title("Activation, Symmetric-Quantized")
plot(axs[1, 0], weights, 'affine')
axs[1, 0].set_title("Weights, Affine-Quantized")
plot(axs[1, 1], weights, 'symmetric')
axs[1, 1].set_title("Weights, Symmetric-Quantized")
plt.show()

In PyTorch, you can specify the affine or symmetric scheme when initializing the Observer. Note that not all observers support both schemes.
for qscheme in [torch.per_tensor_affine, torch.per_tensor_symmetric]:
  obs = MovingAverageMinMaxObserver(qscheme=qscheme)
  for x in inputs: obs(x)
  print(f"Qscheme: {qscheme} | {obs.calculate_qparams()}")
# >>>>>
# Qscheme: torch.per_tensor_affine | (tensor([0.0101]), tensor([139], dtype=torch.int32))
# Qscheme: torch.per_tensor_symmetric | (tensor([0.0109]), tensor([128]))
Per-tensor and per-channel quantization schemes
Quantization parameters can be calculated for the layer's entire weight tensor as a whole, or separately for each channel. In per-tensor quantization, the same clipping range is applied to all the channels in a layer:
[Figure 3: Per-channel quantization uses one set of qparams for each channel; per-tensor uses the same qparams for the entire tensor. (https://pytorch.org/assets/images/quantization-practice/per-channel-tensor.svg)]
For weight quantization, symmetric per-channel quantization provides better accuracy; per-tensor quantization performs poorly, possibly due to the high variance in conv weights across channels introduced by batchnorm folding [3].
from torch.quantization.observer import MovingAveragePerChannelMinMaxObserver
obs = MovingAveragePerChannelMinMaxObserver(ch_axis=0) # calculate qparams for all `C` channels separately
for x in inputs: obs(x)
print(obs.calculate_qparams())
# >>>>>
# (tensor([0.0090, 0.0075, 0.0055]), tensor([125, 187, 82], dtype=torch.int32))
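Similarly (again a sketch, not part of the original snippet), the per-channel qparams can be applied with torch.quantize_per_channel; axis 0 matches the ch_axis=0 used above:

scales, zero_points = obs.calculate_qparams()
# convert dtypes explicitly for compatibility; axis=0 quantizes each of the C rows with its own qparams
xq = torch.quantize_per_channel(inputs[0], scales.double(), zero_points.long(), 0, torch.quint8)
print(xq.int_repr())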
Backend engine
Currently, quantized operators run on x86 machines via the FBGEMM backend, or use QNNPACK primitives on ARM machines. Backend support for server GPUs (via TensorRT and cuDNN) is coming soon. Learn more about extending quantization to custom backends: RFC-0019.
backend = 'fbgemm' if x86 else 'qnnpack'  # 'x86' stands for a boolean you set based on your target CPU
qconfig = torch.quantization.get_default_qconfig(backend)
torch.backends.quantized.engine = backend
QConfig
QConfig stores the Observers and the quantization schemes used to quantize activations and weights.
Be sure to pass the Observer class (not an instance), or a callable that returns an Observer instance. Use with_args() to override the default arguments.
my_qconfig = torch.quantization.QConfig(
activation=MovingAverageMinMaxObserver.with_args(qscheme=torch.per_tensor_affine),
weight=MovingAveragePerChannelMinMaxObserver.with_args(qscheme=torch.qint8)
)
# >>>>>
# QConfig(activation=functools.partial(<class 'torch.ao.quantization.observer.MovingAverageMinMaxObserver'>, qscheme=torch.per_tensor_affine){}, weight=functools.partial(<class 'torch.ao.quantization.observer.MovingAveragePerChannelMinMaxObserver'>, qscheme=torch.qint8){})
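As a rough sketch of where a QConfig fits downstream (the toy model and the default fbgemm qconfig below are assumptions for illustration, not part of the original snippet), eager-mode post-training static quantization looks roughly like this:

import torch.nn as nn

model = nn.Sequential(
    torch.quantization.QuantStub(),    # marks where fp32 -> int8 conversion happens
    nn.Conv2d(3, 16, 3),
    nn.ReLU(),
    torch.quantization.DeQuantStub(),  # marks where int8 -> fp32 conversion happens
).eval()
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')  # or a custom QConfig such as my_qconfig
torch.quantization.prepare(model, inplace=True)    # inserts observers
# ... run representative calibration data through the model here ...
torch.quantization.convert(model, inplace=True)    # swaps modules for their quantized counterparts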