
Deep learning model compression and acceleration techniques (VII): Hybrid methods

2022-07-07 20:18:00 Breeze_

Compression and acceleration of deep learning models means simplifying a model by exploiting the redundancy in its parameters and in its network structure, so as to obtain a model with fewer parameters and a more streamlined structure without affecting how well the task is performed. The compressed model requires less computation and less memory, and can therefore satisfy a wider range of application needs than the original model. As deep learning becomes increasingly popular, the strong demand for deployable models has drawn particular attention to "small models" that occupy little memory, require few computational resources, and still maintain high accuracy. Model compression and acceleration that exploit the redundancy of neural networks have aroused widespread interest in academia and industry, and new work keeps emerging.

This article is a summary of and study based on "A Survey of Compression and Acceleration of Deep Learning Models", published in the Journal of Software in 2021.

Related articles:

Deep learning model compression and acceleration techniques (I): Parameter pruning

Deep learning model compression and acceleration techniques (II): Parameter quantization

Deep learning model compression and acceleration techniques (III): Low-rank decomposition

Deep learning model compression and acceleration techniques (IV): Parameter sharing

Deep learning model compression and acceleration techniques (V): Compact networks

Deep learning model compression and acceleration techniques (VI): Knowledge distillation

Deep learning model compression and acceleration techniques (VII): Hybrid methods

Overview

Technique (type): Description
Parameter pruning (A): Design a criterion for evaluating the importance of parameters, judge the importance of network parameters against it, and delete redundant parameters
Parameter quantization (A): Quantize network parameters from 32-bit full-precision floating point to lower bit widths
Low-rank decomposition (A): Decompose high-dimensional parameter tensors into sparse low-dimensional tensors
Parameter sharing (A): Map the network's internal parameters with structured matrices or clustering methods
Compact networks (B): Design new lightweight networks at three levels: convolution kernels, special layers, and overall network structure
Knowledge distillation (B): Distill knowledge from a larger teacher model into a smaller student model
Hybrid methods (A+B): Combine several of the methods above

A: compresses parameters; B: compresses structure

Hybrid methods

Definition

A hybrid method combines several of the commonly used model compression and acceleration techniques described above.

Characteristics

Hybrid methods can combine the strengths of the individual compression and acceleration techniques and further improve the compression and acceleration effect; they are expected to remain an important research direction in deep learning model compression and acceleration.

1. Combining parameter pruning and parameter quantization

  • Ullrich et al. [165] use a regularization term based on soft weight sharing to achieve parameter quantization and parameter pruning during model retraining.
  • Tung et al. [166] propose CLIP-Q (compression learning by in-parallel pruning-quantization), an integrated compression and acceleration framework that performs parameter pruning and parameter quantization in parallel.
  • Han et al. [167] propose Deep Compression, which combines parameter pruning, parameter quantization, and Huffman coding and achieves a strong compression effect; building on it with a software/hardware co-design, they further propose the Efficient Inference Engine (EIE) framework [168]. A simplified sketch of the prune-then-quantize idea appears after this list.
  • Dubey et al. [169] also combine these three techniques for network compression.
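
The following is a minimal, illustrative sketch of the prune-then-quantize idea behind approaches such as Deep Compression [167]: small-magnitude weights are set to zero, and the surviving weights are clustered so that many weights share a few centroid values. The sparsity ratio, number of clusters, and the simple Lloyd-style clustering loop are assumptions made for this example, not details taken from the papers.

```python
import torch

def prune_and_quantize(weight: torch.Tensor, sparsity: float = 0.5, n_clusters: int = 16):
    """Zero out the smallest-magnitude weights, then quantize the survivors
    by clustering them into a small set of shared values."""
    # 1. Parameter pruning: keep only the largest-magnitude weights.
    flat = weight.abs().flatten()
    threshold = flat.kthvalue(int(sparsity * flat.numel())).values
    mask = weight.abs() > threshold

    # 2. Parameter quantization: cluster surviving weights into n_clusters shared values.
    survivors = weight[mask]
    centroids = torch.linspace(survivors.min().item(), survivors.max().item(), n_clusters)
    for _ in range(10):  # a few Lloyd iterations
        assignment = (survivors.unsqueeze(1) - centroids.unsqueeze(0)).abs().argmin(dim=1)
        for k in range(n_clusters):
            members = survivors[assignment == k]
            if members.numel() > 0:
                centroids[k] = members.mean()

    # Rebuild the compressed weight: pruned entries stay zero, the rest map to centroids.
    quantized = torch.zeros_like(weight)
    quantized[mask] = centroids[assignment]
    return quantized, mask

# Usage: compress one layer's weight matrix.
w = torch.randn(256, 128)
w_compressed, mask = prune_and_quantize(w, sparsity=0.5, n_clusters=16)
```

In Deep Compression, Huffman coding is additionally applied to the stored cluster indices; that step is omitted here.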

2. Combining parameter pruning and parameter sharing

  • Louizos et al. [170] adopt a Bayesian approach: sparsity is introduced through the prior distribution to prune the network, and posterior uncertainty is used to determine the optimal fixed-point precision for encoding the weights.
  • Ji et al. [171] prune by reordering input/output dimensions, clustering small, irregularly distributed weights into structured groups, which yields better hardware utilization and higher sparsity.
  • Zhang et al. [172] use a regularizer that not only encourages sparsity but also learns which parameter groups should share a common value, explicitly identifying highly correlated neurons; a sketch of such a joint penalty follows this list.
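
Below is a rough illustration of a training-time penalty that jointly encourages sparsity (for pruning) and parameter tying (for sharing), loosely inspired by the idea in Zhang et al. [172]. The specific form of the penalty, its coefficients, and the use of learnable centroids are assumptions made for this example rather than the formulation from the paper.

```python
import torch

def sparsity_and_sharing_penalty(weight: torch.Tensor,
                                 centroids: torch.Tensor,
                                 l1_coeff: float = 1e-4,
                                 tie_coeff: float = 1e-4) -> torch.Tensor:
    """L1 term pushes weights toward zero (prunable); the tying term pulls each
    weight toward its nearest shared centroid (parameter sharing)."""
    l1_term = weight.abs().sum()
    # Distance of every weight to its closest shared value.
    dist = (weight.flatten().unsqueeze(1) - centroids.unsqueeze(0)).abs()
    tie_term = dist.min(dim=1).values.pow(2).sum()
    return l1_coeff * l1_term + tie_coeff * tie_term

# Usage inside a training loop (the centroids themselves can be learnable):
weight = torch.randn(64, 64, requires_grad=True)
centroids = torch.nn.Parameter(torch.linspace(-1.0, 1.0, 8))
task_loss = weight.pow(2).mean()  # stand-in for the real task loss
loss = task_loss + sparsity_and_sharing_penalty(weight, centroids)
loss.backward()
```

After training, weights near zero can be deleted and the remaining weights snapped to their nearest centroid, so each group of tied weights is stored only once.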

3. Combining parameter quantization and knowledge distillation

  • Polino et al. [173] propose a quantized training method that adds a knowledge distillation loss: a floating-point model and a quantized model are maintained; the quantized model is used for the forward pass to compute the loss and the gradients, which are then used to update the floating-point model; before each forward pass, the quantized model is refreshed from the updated floating-point model. A simplified sketch of this scheme follows this list.
  • Mishra et al. [174] propose using a high-precision teacher model to guide the training of a low-precision student model, with three schemes: jointly training the teacher model and the quantized student model; using a pre-trained teacher model to guide a quantized student model trained from scratch; and pre-training both the teacher and the student model, quantizing the student, and then fine-tuning it under the teacher's guidance.
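
Below is a minimal sketch of quantized distillation in the spirit of [173] and [174]: full-precision master weights are kept, a quantized copy is used for the forward pass, and the training loss mixes the hard-label loss with the teacher's softened outputs. The bit width, temperature, mixing weight, and the simple uniform quantizer are all values assumed for this example, and the straight-through treatment of the quantizer is simplified.

```python
import torch
import torch.nn.functional as F

def quantize_tensor(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Uniform symmetric quantization to the given bit width."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    return torch.round(w / scale).clamp(-qmax, qmax) * scale

def quantized_distillation_step(student, teacher, x, y, optimizer,
                                T: float = 4.0, alpha: float = 0.7, bits: int = 4):
    # Snapshot the full-precision master weights, then quantize the student in
    # place so the forward pass (and the resulting gradients) use quantized weights.
    fp_weights = {n: p.detach().clone() for n, p in student.named_parameters()}
    for p in student.parameters():
        p.data = quantize_tensor(p.data, bits)

    student_logits = student(x)
    with torch.no_grad():
        teacher_logits = teacher(x)

    # Distillation loss on softened teacher targets, plus the hard-label loss.
    kd_loss = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                       F.softmax(teacher_logits / T, dim=1),
                       reduction="batchmean") * (T * T)
    ce_loss = F.cross_entropy(student_logits, y)
    loss = alpha * kd_loss + (1 - alpha) * ce_loss

    optimizer.zero_grad()
    loss.backward()

    # Restore the full-precision master weights and apply the update to them;
    # the next step re-quantizes from these updated weights.
    for n, p in student.named_parameters():
        p.data = fp_weights[n]
    optimizer.step()
    return loss.item()
```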

References

Main reference: Gao H, Tian YL, Xu FY, Zhong S. A survey of compression and acceleration of deep learning models [J]. Journal of Software, 2021, 32(01): 68-92. DOI: 10.13328/j.cnki.jos.006096.

[165] Ullrich K, Meeds E, Welling M. Soft weight-sharing for neural network compression. arXiv Preprint arXiv: 1702.04008, 2017.

[166] Tung F, Mori G. CLIP-Q: Deep network compression learning by in-parallel pruning-quantization. In: Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition. 2018. 7873-7882.

[167] Han S, Mao H, Dally WJ. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv Preprint arXiv: 1510.00149, 2015.

[168] Han S, Liu X, Mao H, et al. EIE: Efficient inference engine on compressed deep neural network. ACM SIGARCH Computer Architecture News, 2016, 44(3): 243-254.

[169] Dubey A, Chatterjee M, Ahuja N. Coreset-based neural network compression. In: Proc. of the European Conf. on Computer Vision (ECCV). 2018. 454-470.

[170] Louizos C, Ullrich K, Welling M. Bayesian compression for deep learning. In: Advances in Neural Information Processing Systems. 2017. 3288-3298.

[171] Ji Y, Liang L, Deng L, et al. TETRIS: Tile-matching the tremendous irregular sparsity. In: Advances in Neural Information Processing Systems. 2018. 4115-4125.

[172] Zhang D, Wang H, Figueiredo M, et al. Learning to share: Simultaneous parameter tying and sparsification in deep learning. In: Proc. of the 6th Int’l Conf. on Learning Representations. 2018.

[173] Polino A, Pascanu R, Alistarh D. Model compression via distillation and quantization. arXiv Preprint arXiv: 1802.05668, 2018.

[174] Mishra A, Marr D. Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy. arXiv Preprint arXiv: 1711.05852, 2017.


Copyright notice
This article was written by [Breeze_]. Please include a link to the original article when reprinting. Thank you.
https://yzsam.com/2022/188/202207071808237217.html