A quick overview of Transformer quantization papers at EMNLP 2020

I spent a few days reading three EMNLP 2020 papers on Transformer model quantization. Here are quick notes on their main points.
Fully Quantized Transformer for Machine Translation
This paper is a collaboration between Huawei Noah's Ark Lab and the University of Montreal, published in the Findings track.
Paper link: https://www.aclweb.org/anthology/2020.findings-emnlp.1.pdf

Method
Activation quantization is applied to the computationally intensive operators in the Transformer, and all weights except the biases are quantized. Our group has done the same in practice.
Quantization-aware training (QAT) is used, with the straight-through estimator for back-propagation.
Since the value distribution differs across channels, a separate quantization scale parameter is learned for each channel.
Zero-value handling: padding does not need special treatment because it is masked out. After ReLU and the attention softmax, the lower bound of the quantization range is set to 0. All quantization nodes are inserted before dropout.
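To make this concrete, here is a minimal PyTorch sketch of per-channel fake quantization with a straight-through estimator (my own illustration, not the paper's code; the paper learns the scales, while this sketch derives them from the channel max):

```python
import torch

def fake_quantize_per_channel(w, num_bits=8):
    """Uniform fake quantization with one scale per output channel.
    Rounding is bypassed in the backward pass (straight-through)."""
    qmax = 2 ** (num_bits - 1) - 1
    # one scale per output channel (row); the paper learns these scales
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    w_int = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    w_q = w_int * scale
    # straight-through estimator: forward uses w_q, backward sees identity
    return w + (w_q - w).detach()

w = torch.randn(512, 512, requires_grad=True)
loss = fake_quantize_per_channel(w, num_bits=8).sum()
loss.backward()   # gradients reach w as if no rounding had happened
```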
Pruning is structured pruning, since unstructured (sparsity) pruning needs hardware or library support and is more troublesome. Only the FFN parameters are pruned, and not with the traditional percentage threshold: after ReLU, the max value of each column of the first FFN layer's output is computed, and a column is pruned if its max falls below a threshold, set to the variance of all column max values multiplied by a constant. The advantage is that the pruning ratio is not fixed; each layer is pruned dynamically, some more and some less.
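Here is a rough sketch of how that dynamic column-pruning rule could look (an assumption based on the description above, not the paper's code):

```python
import torch

def prune_ffn_columns(post_relu_acts, w2, c=0.5):
    """Drop FFN hidden columns whose post-ReLU max over a calibration
    batch is below c * variance of all column maxima; the matching rows
    of the first FFN weight would be removed the same way."""
    col_max = post_relu_acts.amax(dim=0)      # (d_ff,)
    threshold = c * col_max.var()
    keep = col_max >= threshold               # boolean column mask
    return w2[:, keep], keep

acts = torch.randn(1024, 2048).relu()   # calibration output of the first FFN layer
w2 = torch.randn(512, 2048)             # second FFN layer weight
w2_pruned, mask = prune_ffn_columns(acts, w2)
print(f"kept {int(mask.sum())} of {mask.numel()} FFN columns")
```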
Experimental results

You can see that post-training quantization (PTQ) still loses a bit of accuracy, while QAT at 8 or 6 bits drops very little; 4 bits hurts a lot. This also matches our group's practical experience.
Evaluation
There is not much innovation in this paper, but then it only ended up in Findings, so that is fair enough. The quantization placements and the tricks are easy to think of, and the per-channel scales actually matter little in practice; without this trick the results are still good. The FFN pruning also does not remove many parameters in the end.
Extremely Low Bit Transformer Quantization for On-Device Neural Machine Translation
This paper is from Samsung, also published in the Findings track.
Paper link: https://www.aclweb.org/anthology/2020.findings-emnlp.433.pdf

Method
Instead of the more common uniform quantization (i.e., mapping a floating-point range proportionally onto an integer range), this paper uses binary-code quantization: each parameter vector w is expressed as a linear combination of q (the number of quantization bits) binary vectors of the same dimension, w ≈ a1·b1 + a2·b2 + … + aq·bq, with each bi ∈ {-1, +1}^n. A matrix-vector product then decomposes into binary dot products, since wᵀx ≈ Σi ai·(biᵀx), which can be computed efficiently with lookup tables. For details see Samsung's earlier paper: BiQGEMM: Matrix Multiplication with Lookup Table For Binary-Coding-based Quantized DNNs.
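To make the decomposition concrete, here is a greedy sketch of binary-coding quantization (a simplified illustration, not Samsung's exact algorithm):

```python
import torch

def binary_code_quantize(w, num_bits=3):
    """Approximate w by a sum of num_bits scaled sign vectors,
    fitting each term to the remaining residual."""
    residual = w.clone()
    alphas, codes = [], []
    for _ in range(num_bits):
        b = torch.sign(residual)
        b[b == 0] = 1.0                   # avoid zero codes
        alpha = (residual * b).mean()     # least-squares scale for this term
        alphas.append(alpha)
        codes.append(b)
        residual = residual - alpha * b
    return alphas, codes

w = torch.randn(1000)
alphas, codes = binary_code_quantize(w, num_bits=3)
w_hat = sum(a * b for a, b in zip(alphas, codes))
print("relative error:", ((w - w_hat).norm() / w.norm()).item())
```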
For the embeddings, different numbers of quantization bits are used for words of different frequency; the specific scheme is spelled out in the paper's pseudocode.

In short, the more frequent a word, the more quantization bits it gets. Most words are very rare: roughly the top 1% of words account for 95% of the token occurrences, so giving those few words more bits costs almost nothing. Each word vector also gets its own quantization parameters, which accounts for the different spatial distributions of different word vectors.
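As an illustration of the idea, here is a sketch that buckets words by cumulative frequency and assigns more bits to the frequent ones (the thresholds and bit levels are made up for illustration, not the paper's actual allocation):

```python
import numpy as np

def assign_embedding_bits(word_freqs, bit_levels=(8, 4, 2, 1),
                          cum_thresholds=(0.4, 0.7, 0.9)):
    """Sort words by frequency; words covering the first 40% of token mass
    get 8 bits, the next 30% get 4 bits, and so on (illustrative numbers)."""
    order = np.argsort(-word_freqs)                       # most frequent first
    cum = np.cumsum(word_freqs[order]) / word_freqs.sum()
    bits = np.empty(len(word_freqs), dtype=int)
    for rank, word_id in enumerate(order):
        bits[word_id] = bit_levels[np.searchsorted(cum_thresholds, cum[rank])]
    return bits

freqs = np.random.zipf(2.0, size=32000).astype(float)     # Zipf-like vocabulary
vals, counts = np.unique(assign_embedding_bits(freqs), return_counts=True)
print(dict(zip(vals.tolist(), counts.tolist())))          # words per bit width
```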
Different numbers of quantization bits are also used for the different types of attention in the encoder and decoder.
As for the training strategy, during finetuning a new quantization stage is switched on every 2000 steps, to save training time.
Experimental results

It can be seen that quantizing the embeddings has a relatively large impact, especially below 2 bits, and quantizing everything to 2 bits is bad. With the method in this paper, the loss stays within 1 point in most configurations, which is a good result.
Evaluation
This paper uses non-uniform quantization, which is somewhat more complicated to implement, and it says little about how to accelerate the matrix multiplication. It also never compares uniform against non-uniform quantization, so it is unclear whether the gains come mainly from non-uniform quantization or from assigning different bit widths by word frequency and attention type. Finally, there are too many hand-designed factors, such as having to decide the number of bits for each attention type, which makes it inflexible to use.
TernaryBERT: Distillation-aware Ultra-low Bit BERT
This paper is from Huawei Noah's Ark Lab and was published in the EMNLP main conference.
Paper link: https://www.aclweb.org/anthology/2020.emnlp-main.37.pdf

Method

As shown in the figure above, this paper adds distillation on top of quantization.
First, the weights are quantized with ternary quantization, trained with TWN or LAT; see Lu Hou's papers for the details: https://houlu369.github.io/. The activations still use ordinary min-max quantization.
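For reference, here is a minimal sketch of the standard TWN ternarization rule (based on the original TWN paper, not this paper's code):

```python
import torch

def twn_ternarize(w):
    """Ternary Weight Networks rule: weights become {-alpha, 0, +alpha},
    with the threshold approximated as 0.7 * mean(|w|)."""
    delta = 0.7 * w.abs().mean()
    mask = (w.abs() > delta).float()
    alpha = (w.abs() * mask).sum() / mask.sum().clamp(min=1.0)
    w_t = alpha * torch.sign(w) * mask
    # straight-through estimator keeps the full-precision weights trainable
    return w + (w_t - w).detach()

w = torch.randn(768, 768, requires_grad=True)
print("distinct values:", torch.unique(twn_ternarize(w).detach()).numel())  # 3
```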
For the quantization parameters, each row of the embedding matrix gets its own set, while every other weight matrix uses one set for the whole matrix.
Distillation uses three losses: hidden states, attention, and logits. The teacher is full precision; there are two students, the first of which is full precision and identical in architecture to the teacher. TWN or LAT then ternarizes this student to obtain the quantized network, the losses are computed, and back-propagation updates the full-precision student's parameters.
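A schematic version of the three distillation terms might look like this (the layer matching and loss weighting are simplified assumptions, not the paper's exact formulation):

```python
import torch.nn.functional as F

def distillation_loss(s_hidden, t_hidden, s_attn, t_attn, s_logits, t_logits):
    """MSE on hidden states and attention scores, plus a soft loss on logits;
    gradients flow back to the full-precision student."""
    loss_hidden = sum(F.mse_loss(s, t) for s, t in zip(s_hidden, t_hidden))
    loss_attn = sum(F.mse_loss(s, t) for s, t in zip(s_attn, t_attn))
    loss_logits = F.kl_div(F.log_softmax(s_logits, dim=-1),
                           F.softmax(t_logits, dim=-1),
                           reduction="batchmean")
    return loss_hidden + loss_attn + loss_logits
```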
There are two more tricks: one is data augmentation, and the other is initializing the full-precision student from the finetuned teacher.
Experimental results

In the GLUE experiments, the compression ratio reaches up to 23x, and the results look quite a bit better than ordinary Q-BERT and Q8-BERT, only about 1 point worse than BERT. Note that the activations are only quantized to 8 bits throughout; pushing them lower would presumably not work well.
Evaluation
The results on GLUE and SQuAD look good and the compression ratio is very high, but several techniques are combined: distillation, data augmentation, model initialization, and so on. The ablation experiments show that removing distillation and data augmentation hurts a lot, so it is unclear how much advantage TWN and LAT actually have over ordinary min-max quantization in practice. Finally, the paper does not do machine translation, only classification tasks; it would be far more persuasive if it also worked well on the harder generation tasks.
Summary
Looking at them together, these three papers use three completely different quantization approaches. The first is the easiest to implement, and is also how TensorFlow quantizes: 8 bits is almost lossless, but it does not work much lower. The last two are about ultra-low-bit quantization, one using binary-code and one using TWN or LAT, with very high compression ratios. The second paper assigns different bit widths to different parameters by word frequency or importance, which keeps good results even at ultra-low bits. The third quantizes directly to ultra-low bits but recovers performance through a series of techniques such as distillation; unfortunately it does not do machine translation, which leaves some doubt.
In the end only the third one made the main conference. In any case, there is a lot worth learning from all three.

About the author: godweiyang (same handle on Zhihu), master's degree in computer science from East China Normal University, working on natural language processing and deep learning. I like sharing technology and knowledge, and look forward to exchanging ideas with you~
If you have any questions, leave a message in the comments section, or reach out on WeChat for more in-depth discussion~
