ICML 2022 | Exploring the Best Architectures and Training Methods for Language Models
2022-07-07 06:22:00 【PaperWeekly】
Author | Zhu Yaoming
Affiliation | ByteDance AI Lab
Research direction | Machine translation
This article introduces two papers published at ICML 2022, both mainly from Google researchers. Both are highly practical analysis papers: unlike the model-innovation papers common in the field, they examine the architectures and training methods of existing NLP language models, explore their strengths and weaknesses in different scenarios, and distill empirical rules.
Here is a summary of the two papers' main experimental conclusions:
1. The first paper finds that although the encoder-decoder architecture has dominated machine translation, a well-designed language model (LM) can match the encoder-decoder's translation performance once the model is large enough. Moreover, the LM performs better in zero-shot settings and on low-resource language translation, and produces fewer off-target outputs on high-resource translation.
2. The second paper finds that without finetuning, a causal decoder-only LM trained with full language modeling has the best zero-shot performance; with multitask prompted finetuning, an encoder-decoder architecture trained with masked language modeling has the best zero-shot performance.
Paper 1
The first paper explores how well a plain language model can perform on machine translation, and whether it can rival the mainstream encoder-decoder architecture.
Paper title:
Examining Scaling and Transfer of Language Model Architectures for Machine Translation
Paper link:
https://arxiv.org/abs/2202.00528
Venue:
ICML 2022
Affiliations:
University of Edinburgh, Google
1.1 Introduction
The authors observe that the encoder-decoder architecture is the absolute mainstream in machine translation, while research on using a pure language model (LM) for translation is scarce, and its performance, strengths, and weaknesses have rarely been analyzed. The paper explores a variety of LM configurations for machine translation and conducts a systematic performance analysis.
1.2 Method overview
The LM the authors use for machine translation is shown in the figure below. X and Y denote the source-language input and the target-language input respectively; the two are concatenated and fed into the LM. To prevent information leakage during translation, the authors replace the LM's original attention mask with a prefix-LM mask or a causal-LM mask; the black regions are masked out (top left). The bottom right compares these with the mainstream encoder-decoder mask.
▲ The LM the authors use for translation. The core idea is to concatenate the source and target languages, and to apply masks on the attention map to prevent information leakage.
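To make the two masks concrete, here is a minimal NumPy sketch (the function names are mine, not the paper's; `True` marks an allowed attention edge, i.e., the white cells in the figure, and `False` the black masked cells):

```python
import numpy as np

def causal_lm_mask(src_len: int, tgt_len: int) -> np.ndarray:
    """CausalLM mask: every position in the concatenated [X; Y]
    sequence attends only to itself and earlier positions."""
    n = src_len + tgt_len
    return np.tril(np.ones((n, n), dtype=bool))

def prefix_lm_mask(src_len: int, tgt_len: int) -> np.ndarray:
    """PrefixLM mask: the source prefix X is fully visible
    (bidirectional), while the target Y remains causal."""
    mask = causal_lm_mask(src_len, tgt_len)
    mask[:src_len, :src_len] = True  # bidirectional attention inside X
    return mask
```

Under the causal mask a source token cannot attend to later source tokens; the prefix mask lifts that restriction inside the source while keeping target generation autoregressive.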
Besides the two masking mechanisms, the authors compare two representation schemes: TopOnly, the common scheme that feeds only the network's top-layer representations to the output layer, and layer-wise [1], which coordinates the representations of all layers.
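As a rough sketch of the difference (the softmax-weighted sum follows the layer-wise coordination idea of [1]; the function names and exact weighting here are illustrative, not the paper's implementation):

```python
import numpy as np

def top_only(layer_states):
    # TopOnly: feed only the final layer's hidden states onward
    return layer_states[-1]

def layer_wise(layer_states, scalar_weights):
    # Layer-wise: a softmax-weighted sum over every layer's hidden states
    w = np.exp(scalar_weights - np.max(scalar_weights))
    w = w / w.sum()
    return sum(wi * h for wi, h in zip(w, layer_states))
```

In the layer-wise scheme the scalar weights would be learned jointly with the rest of the model.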
The authors also compare a TrgOnly variant of the LM, whose loss function models only the generation of the target language, rather than also modeling source-side generation.
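The TrgOnly variant can be thought of as a per-position loss mask over the concatenated sequence (a sketch under my own naming, not the paper's code):

```python
import numpy as np

def token_loss_mask(src_len: int, tgt_len: int, trg_only: bool) -> np.ndarray:
    """1.0 where a position contributes to the training loss.
    TrgOnly zeroes out the source positions, so the loss models
    only target-language generation."""
    mask = np.ones(src_len + tgt_len)
    if trg_only:
        mask[:src_len] = 0.0
    return mask
```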
1.3 Experiments
The baseline model uses a Transformer-base-like architecture, with the main experiments on WMT14 En-Fr and En-De, WMT19 En-Zh, and the authors' own En-De dataset. The authors also study how scaling up the parameter count affects LM-based machine translation.
▲ The key figure of the paper: BLEU score vs. parameter count for the LM variants and the encoder-decoder on two datasets.
The main conclusions are:
1. When the model is small, architecture has the greatest impact on performance. Here, models with the right inductive biases translate best. The inductive biases the authors identify fall into four categories: 1) a prefix-LM-style mask that makes the source side fully visible; 2) using TopOnly rather than layer-wise representations for the output layer; 3) preferring deep LMs over wide LMs; 4) a loss that models only target-language generation rather than also modeling source-side generation.
2. Different models show different scaling behavior, but the gap narrows at large parameter scales.
3. Sentence length has little influence on the LM's scaling behavior.
4. The encoder-decoder architecture beats all LM variants in computational efficiency (measured in FLOPs), which also explains why it is the absolute mainstream in machine translation.
The authors also test the LMs' machine translation performance in zero-shot scenarios and find:
1. The PrefixLM mask performs well in zero-shot settings; the CausalLM mask is not suited to them.
2. On low-resource languages, the LM transfers across languages better than the encoder-decoder architecture.
3. On high-resource translation, the LM and the encoder-decoder each have their merits. Overall, the LM has better zero-shot performance and produces fewer off-target translations (i.e., output in the wrong language).
Paper 2
The second paper explores the zero-shot performance of large-scale pretrained LMs, examining which architecture and which pretraining objective are most effective.
Paper title:
What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?
Paper link:
https://arxiv.org/abs/2204.05832
Code link:
https://github.com/bigscience-workshop/architecture-objective
Venue:
ICML 2022
Affiliations:
Google Brain, HuggingFace, LightOn, Allen NLP, LPENS
2.1 Research goal
The authors compare language models trained with three architectures (causal decoder-only, non-causal decoder-only, encoder-decoder) and two pretraining objectives (autoregressive and masked language modeling) on zero-shot NLP tasks. They also split the evaluation into two scenarios, depending on whether a multitask prompted finetuning step is applied.
2.2 Method overview
The paper's method is straightforward: take the three basic architectures (causal decoder LM, non-causal decoder LM, encoder-decoder) and the two training objectives (full language modeling, i.e., the standard LM objective, and masked language modeling, i.e., LM training with a masking mechanism), and test the performance of every combination.
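A minimal sketch of how training examples differ under the two objectives (token-level only and purely illustrative; real MLM variants such as span corruption differ in detail):

```python
import random

def full_lm_example(tokens):
    # Full language modeling: predict every token from its left context
    return tokens[:-1], tokens[1:]          # (inputs, targets)

def masked_lm_example(tokens, mask_token="[MASK]", ratio=0.15, seed=0):
    # Masked language modeling: corrupt some positions and compute
    # the loss only at the corrupted positions
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < ratio:
            inputs.append(mask_token)
            targets.append(tok)             # loss computed here
        else:
            inputs.append(tok)
            targets.append(None)            # no loss at this position
    return inputs, targets
```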
Meanwhile, in the evaluation phase, besides directly testing zero-shot performance, the authors also test each architecture + objective combination on new tasks after adding multitask prompted finetuning. For multitask finetuning, see the T0 paper [2].
▲ The attention patterns (attention maps) of the three basic architectures. In short, in a causal decoder LM each token attends only to the tokens before it, while the other two allow bidirectional attention (over the input prefix, or in the encoder).
2.3 Experiments
The evaluation tasks follow the T0 paper [2] and the EleutherAI LM Evaluation Harness [3]. (Note: together these two benchmarks provide hundreds of evaluation tasks for large language models such as T5 and GPT-3.)
The evaluation tasks are numerous, so here are only the authors' conclusions:
1. Without finetuning, the causal decoder LM architecture with full language modeling training gives the model the best zero-shot generalization. This is consistent with the striking zero-shot NLG performance of models like GPT-3.
2. With multitask finetuning, the encoder-decoder architecture with masked language modeling training gives the best zero-shot generalization. The authors also find that architectures that finetune well on a single task tend to generalize better in the multitask setting.
3. A decoder-only LM has lower overhead for adaptation or task transfer than an encoder-decoder, i.e., it migrates to new tasks more easily.
Some brief comments
Both papers are practical empirical studies. Overall, I find the first paper more enlightening: it shows that once the model is large enough, a decoder-only LM can also handle machine translation well. Its zero-shot translation performance in particular is excellent. Given the large body of work demonstrating the strong generalization ability of LMs, the LM may well have the potential to become the mainstream model for very-low-resource or minority-language translation (e.g., classical Chinese). In short, the LM's ceiling in machine translation remains to be explored.
The second paper is a typical Google exercise in brute-force scale. Its tone leans toward "plug and play", and it can serve as a model-selection guide for researchers and engineers choosing on demand. I am actually a bit curious why Google chose to publish this paper at ICML'22; to me it seems a better fit for a journal like TACL or JAIR.
References
[1] https://proceedings.neurips.cc/paper/2018/hash/4fb8a7a22a82c80f2c26fe6c1e0dcbb3-Abstract.html
[2] https://arxiv.org/abs/2110.08207
[3] https://github.com/EleutherAI/lm-evaluation-harness