A quick overview of Transformer quantization papers at EMNLP 2020

I spent a few days reading three EMNLP 2020 papers on Transformer model quantization. Here are quick notes on their main points.
Fully Quantized Transformer for Machine Translation
This paper is a collaboration between Huawei Noah's Ark Lab and the University of Montreal, published in the Findings track.
Paper link: https://www.aclweb.org/anthology/2020.findings-emnlp.1.pdf

Method
Activation quantization is applied to the computationally intensive operators in the Transformer, and all weights except the biases are quantized. Our group has done the same in practice.
Quantization-aware training (QAT) is used, with the straight-through estimator for back-propagation.
Since the value distribution differs across channels, a separate quantization scale parameter is learned for each channel.
Zero-value handling: padding does not need special treatment because it is masked out. After ReLU and the attention softmax, the lower bound of the quantization range is set to 0. All quantization nodes are inserted before dropout.
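To make this concrete, here is a minimal PyTorch sketch of per-channel fake quantization with a straight-through estimator (my own illustration, not the paper's code; the paper learns the scales, while this sketch derives them from the channel max):

```python
import torch

def fake_quantize_per_channel(w, num_bits=8):
    """Uniform fake quantization with one scale per output channel.
    Rounding is bypassed in the backward pass (straight-through)."""
    qmax = 2 ** (num_bits - 1) - 1
    # one scale per output channel (row); the paper learns these scales
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    w_int = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    w_q = w_int * scale
    # straight-through estimator: forward uses w_q, backward sees identity
    return w + (w_q - w).detach()

w = torch.randn(512, 512, requires_grad=True)
loss = fake_quantize_per_channel(w, num_bits=8).sum()
loss.backward()   # gradients reach w as if no rounding had happened
```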
Pruning is structured pruning, since unstructured (sparsity) pruning needs hardware or library support and is more troublesome. Only the FFN parameters are pruned, and not with the traditional percentage threshold: after ReLU, the max value of each column of the first FFN layer's output is computed, and a column is pruned if its max falls below a threshold, set to the variance of all column max values multiplied by a constant. The advantage is that the pruning ratio is not fixed; each layer is pruned dynamically, some more and some less.
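Here is a rough sketch of how that dynamic column-pruning rule could look (an assumption based on the description above, not the paper's code):

```python
import torch

def prune_ffn_columns(post_relu_acts, w2, c=0.5):
    """Drop FFN hidden columns whose post-ReLU max over a calibration
    batch is below c * variance of all column maxima; the matching rows
    of the first FFN weight would be removed the same way."""
    col_max = post_relu_acts.amax(dim=0)      # (d_ff,)
    threshold = c * col_max.var()
    keep = col_max >= threshold               # boolean column mask
    return w2[:, keep], keep

acts = torch.randn(1024, 2048).relu()   # calibration output of the first FFN layer
w2 = torch.randn(512, 2048)             # second FFN layer weight
w2_pruned, mask = prune_ffn_columns(acts, w2)
print(f"kept {int(mask.sum())} of {mask.numel()} FFN columns")
```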
Experimental results

You can see that post-training quantization (PTQ) still loses a bit of accuracy, while QAT at 8 or 6 bits drops very little; 4 bits hurts a lot. This also matches our group's practical experience.
Evaluation
There is not much innovation in this paper, but then it only ended up in Findings, so that is fair enough. The quantization placements and the tricks are easy to think of, and the per-channel scales actually matter little in practice; without this trick the results are still good. The FFN pruning also does not remove many parameters in the end.
Extremely Low Bit Transformer Quantization for On-Device Neural Machine Translation
This paper is from Samsung, also published in the Findings track.
Paper link: https://www.aclweb.org/anthology/2020.findings-emnlp.433.pdf

Method
Instead of the more common uniform quantization (i.e., mapping a floating-point range proportionally onto an integer range), this paper uses binary-code quantization: each parameter vector w is expressed as a linear combination of q (the number of quantization bits) binary vectors of the same dimension, w ≈ a1·b1 + a2·b2 + … + aq·bq, with each bi ∈ {-1, +1}^n. A matrix-vector product then decomposes into binary dot products, since wᵀx ≈ Σi ai·(biᵀx), which can be computed efficiently with lookup tables. For details see Samsung's earlier paper: BiQGEMM: Matrix Multiplication with Lookup Table For Binary-Coding-based Quantized DNNs.
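To make the decomposition concrete, here is a greedy sketch of binary-coding quantization (a simplified illustration, not Samsung's exact algorithm):

```python
import torch

def binary_code_quantize(w, num_bits=3):
    """Approximate w by a sum of num_bits scaled sign vectors,
    fitting each term to the remaining residual."""
    residual = w.clone()
    alphas, codes = [], []
    for _ in range(num_bits):
        b = torch.sign(residual)
        b[b == 0] = 1.0                   # avoid zero codes
        alpha = (residual * b).mean()     # least-squares scale for this term
        alphas.append(alpha)
        codes.append(b)
        residual = residual - alpha * b
    return alphas, codes

w = torch.randn(1000)
alphas, codes = binary_code_quantize(w, num_bits=3)
w_hat = sum(a * b for a, b in zip(alphas, codes))
print("relative error:", ((w - w_hat).norm() / w.norm()).item())
```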
For the embeddings, different numbers of quantization bits are used for words of different frequency; the specific scheme is spelled out in the paper's pseudocode.

In short, the more frequent a word, the more quantization bits it gets. Most words are very rare: roughly the top 1% of words account for 95% of the token occurrences, so giving those few words more bits costs almost nothing. Each word vector also gets its own quantization parameters, which accounts for the different spatial distributions of different word vectors.
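As an illustration of the idea, here is a sketch that buckets words by cumulative frequency and assigns more bits to the frequent ones (the thresholds and bit levels are made up for illustration, not the paper's actual allocation):

```python
import numpy as np

def assign_embedding_bits(word_freqs, bit_levels=(8, 4, 2, 1),
                          cum_thresholds=(0.4, 0.7, 0.9)):
    """Sort words by frequency; words covering the first 40% of token mass
    get 8 bits, the next 30% get 4 bits, and so on (illustrative numbers)."""
    order = np.argsort(-word_freqs)                       # most frequent first
    cum = np.cumsum(word_freqs[order]) / word_freqs.sum()
    bits = np.empty(len(word_freqs), dtype=int)
    for rank, word_id in enumerate(order):
        bits[word_id] = bit_levels[np.searchsorted(cum_thresholds, cum[rank])]
    return bits

freqs = np.random.zipf(2.0, size=32000).astype(float)     # Zipf-like vocabulary
vals, counts = np.unique(assign_embedding_bits(freqs), return_counts=True)
print(dict(zip(vals.tolist(), counts.tolist())))          # words per bit width
```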
Different numbers of quantization bits are also used for the different types of attention in the encoder and decoder.
As for the training strategy, during finetuning a new quantization stage is switched on every 2000 steps, to save training time.
Experimental results

It can be seen that quantizing the embeddings has a relatively large impact, especially below 2 bits, and quantizing everything to 2 bits is bad. With the method in this paper, the loss stays within 1 point in most configurations, which is a good result.
Evaluation
This paper uses non-uniform quantization, which is somewhat more complicated to implement, and it says little about how to accelerate the matrix multiplication. It also never compares uniform against non-uniform quantization, so it is unclear whether the gains come mainly from non-uniform quantization or from assigning different bit widths by word frequency and attention type. Finally, there are too many hand-designed factors, such as having to decide the number of bits for each attention type, which makes it inflexible to use.
TernaryBERT: Distillation-aware Ultra-low Bit BERT
This paper is from Huawei Noah's Ark Lab and was published in the EMNLP main conference.
Paper link: https://www.aclweb.org/anthology/2020.emnlp-main.37.pdf

Method

As shown in the figure above, this paper adds distillation on top of quantization.
First, the weights are quantized with ternary quantization, trained with TWN or LAT; see Lu Hou's papers for the details: https://houlu369.github.io/. The activations still use ordinary min-max quantization.
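For reference, here is a minimal sketch of the standard TWN ternarization rule (based on the original TWN paper, not this paper's code):

```python
import torch

def twn_ternarize(w):
    """Ternary Weight Networks rule: weights become {-alpha, 0, +alpha},
    with the threshold approximated as 0.7 * mean(|w|)."""
    delta = 0.7 * w.abs().mean()
    mask = (w.abs() > delta).float()
    alpha = (w.abs() * mask).sum() / mask.sum().clamp(min=1.0)
    w_t = alpha * torch.sign(w) * mask
    # straight-through estimator keeps the full-precision weights trainable
    return w + (w_t - w).detach()

w = torch.randn(768, 768, requires_grad=True)
print("distinct values:", torch.unique(twn_ternarize(w).detach()).numel())  # 3
```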
For the quantization parameters, each row of the embedding matrix gets its own set, while every other weight matrix uses one set for the whole matrix.
Distillation uses three losses: hidden states, attention, and logits. The teacher is full precision; there are two students, the first of which is full precision and identical in architecture to the teacher. TWN or LAT then ternarizes this student to obtain the quantized network, the losses are computed, and back-propagation updates the full-precision student's parameters.
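A schematic version of the three distillation terms might look like this (the layer matching and loss weighting are simplified assumptions, not the paper's exact formulation):

```python
import torch.nn.functional as F

def distillation_loss(s_hidden, t_hidden, s_attn, t_attn, s_logits, t_logits):
    """MSE on hidden states and attention scores, plus a soft loss on logits;
    gradients flow back to the full-precision student."""
    loss_hidden = sum(F.mse_loss(s, t) for s, t in zip(s_hidden, t_hidden))
    loss_attn = sum(F.mse_loss(s, t) for s, t in zip(s_attn, t_attn))
    loss_logits = F.kl_div(F.log_softmax(s_logits, dim=-1),
                           F.softmax(t_logits, dim=-1),
                           reduction="batchmean")
    return loss_hidden + loss_attn + loss_logits
```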
There are two more tricks: one is data augmentation, and the other is initializing the full-precision student from the finetuned teacher.
Experimental results

In the GLUE experiments, the compression ratio reaches up to 23x, and the results look quite a bit better than ordinary Q-BERT and Q8-BERT, only about 1 point worse than BERT. Note that the activations are only quantized to 8 bits throughout; pushing them lower would presumably not work well.
Evaluation
The results on GLUE and SQuAD look good and the compression ratio is very high, but several techniques are combined: distillation, data augmentation, model initialization, and so on. The ablation experiments show that removing distillation and data augmentation hurts a lot, so it is unclear how much advantage TWN and LAT actually have over ordinary min-max quantization in practice. Finally, the paper does not do machine translation, only classification tasks; it would be far more persuasive if it also worked well on the harder generation tasks.
Summary
Looking at them together, these three papers use three completely different quantization approaches. The first is the easiest to implement, and is also how TensorFlow quantizes: 8 bits is almost lossless, but it does not work much lower. The last two are about ultra-low-bit quantization, one using binary-code and one using TWN or LAT, with very high compression ratios. The second paper assigns different bit widths to different parameters by word frequency or importance, which keeps good results even at ultra-low bits. The third quantizes directly to ultra-low bits but recovers performance through a series of techniques such as distillation; unfortunately it does not do machine translation, which leaves some doubt.
In the end only the third one made the main conference. In any case, there is a lot worth learning from all three.

About the author: godweiyang (same handle on Zhihu), master's degree in computer science from East China Normal University, working on natural language processing and deep learning. I like sharing technology and knowledge, and look forward to exchanging ideas with you~
If you have any questions, leave a message in the comments section, or reach out on WeChat for more in-depth discussion~
