当前位置：网站首页>Quantitative calculation research

Quantitative calculation research

2022-07-03 11:53:00 【Diros1g】

1.Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding

1.1 methods

Prunes the network： Keep only some important connections ;
Quantize the weights： Share some through weight quantization weights;
Huffman coding： Further compression by Huffman coding ;
Insert picture description here

1.1.1 prune

1. Learn to connect from the normally trained network
2. Trim connections with small weights , Those less than the threshold are deleted ;
3 Retraining the web , Its parameters are obtained from the remaining sparse connections ;

1.1.2 Quantification and weight sharing

quantitative

Please add a picture description

Insert picture description here

Weight sharing

It uses a very simple K-means, Make one for each floor weight The clustering , Belong to the same cluster Of share the same weight size . One thing to watch out for ： Cross layer weight Do not share weights ;

1.1.3 Huffman code

Huffman coding will first use the frequency of characters to create a tree , Then a specific code is generated for each character through the structure of the tree , Characters with high frequency use shorter encoding , If the frequency is low, a longer code is used , This will reduce the average length of the encoded string , So as to achieve the purpose of data lossless compression .

1.2 effect

Pruning： Reduce the number of connections to the original 1/13~1/9;
Quantization： Every connection from the original 32bits Reduced to 5bits;

Final effect ：

hold AlextNet Compressed 35 times , from 240MB, Reduce to 6.9MB;
hold VGG-16 Compressed 49 north , from 552MB Reduce to 11.3MB;
The calculation speed is the original 3~4 times , Energy consumption is the original 3~7 times ;

1.3 The experimental requirements

1. The weight preservation requirements of training are complete , It can't be model.state_dict(), But most of our current weight files are parameter states rather than complete models
2. It requires a complete network structure
3. Have enough training data

Reference resources ：1.【 Deep neural network compression 】Deep Compression
2. Deep compression ： Adopt pruning , Quantization training and Huffman coding to compress the depth neural network
3.Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding
4.pytorch Official documents
5. Understanding of Huffman coding (Huffman Coding)

2.torch The official quantitative calculation

https://pytorch.apachecn.org/#/
https://pytorch.org/docs/stable/quantization.html

2.0 Data type quantification Quantized Tensor

Can be stored int8/uint8/int32 Data of type , And carry with you scale、zero_point These parameters

>>> x = torch.rand(2,3, dtype=torch.float32) 
>>> x
tensor([[0.6839, 0.4741, 0.7451],
        [0.9301, 0.1742, 0.6835]])
 
>>> xq = torch.quantize_per_tensor(x, scale = 0.5, zero_point = 8, dtype=torch.quint8)
tensor([[0.5000, 0.5000, 0.5000],
        [1.0000, 0.0000, 0.5000]], size=(2, 3), dtype=torch.quint8,
       quantization_scheme=torch.per_tensor_affine, scale=0.5, zero_point=8)
 
>>> xq.int_repr()
tensor([[ 9,  9,  9],

2.1 Two quantitative methods

2.2Post Training Static Quantization, Static quantization after model training torch.quantize_per_tensor

scale （ scale ） and zero_point（ Zero position ） You need to customize it . The quantified model , Can't train （ It's not back propagation ）, Nor can you reason , After dequantization , In order to do the calculation

2.3. Post Training Dynamic Quantization, Dynamic quantification after model training ： torch.quantization.quantize_dynamic

The system automatically selects the most appropriate scale （ scale ） and zero_point（ Zero position ）, No need to customize . The quantified model , You can infer , But not training （ It's not back propagation ）

原网站

版权声明
本文为[Diros1g]所创，转载请带上原文链接，感谢
https://yzsam.com/2022/184/202207031058239091.html