
Luke Zettlemoyer, Head of Meta AI Seattle Research | At a Trillion Parameters, Will Large Models Keep Getting Bigger?

2022-07-06 00:38:00 Zhiyuan community

 
Reading guide: Pre-trained language models keep getting larger. While marveling at their capabilities, people cannot help but ask: will the scale of language models continue to grow in the future?
 
Luke Zettlemoyer, Research Director at Meta AI Seattle and Paul G. Allen Professor in the School of Computer Science & Engineering at the University of Washington, gave a keynote speech titled "Large Language Models: Will they keep getting bigger? And, how will we use them if they do?", introducing his team's work on large-scale language models. Professor Zettlemoyer discussed the following three questions:
 
    1) Will the scale of language models continue to grow?
    2) How can we make the best use of language models?
    3) What other supervision signals can be used for model pre-training?
 
Compiled by: Hu Xueyu, Xiong Yuxuan
Edited by: Li Mengjia
 

As language models keep growing, "will the scale of language models continue to grow" and "how should we use language models" have become central questions for researchers. The figure below plots the parameter counts of pre-trained language models over time: the x-axis is time, the y-axis is parameter count, and models are clearly getting bigger.

A language model defines a distribution over sentences: given the words seen so far, it can predict which words are likely to come next. Although bidirectional models such as BERT exist, this article only discusses language models that process text from left to right.
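As a concrete illustration, here is a minimal sketch of left-to-right next-word prediction using the Hugging Face transformers library; the model name "gpt2" is just a convenient stand-in, not a model discussed in the talk.

```python
# Minimal sketch: a causal (left-to-right) language model assigns a
# probability to each candidate next token given the prefix seen so far.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prefix = "The weather today is"
inputs = tokenizer(prefix, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # (1, seq_len, vocab_size)
next_token_probs = logits[0, -1].softmax(dim=-1)

# The model factorizes P(w_1..w_n) = prod_i P(w_i | w_<i); here we simply
# inspect the distribution over the next word.
top = torch.topk(next_token_probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx.item()):>12s}  {p.item():.3f}")
```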
 
Take OpenAI's GPT series as an example: zero-shot learning lets a language model handle arbitrary natural language processing (NLP) tasks by casting them as sequence-to-sequence problems. For example, given a description of the task as input, the model classifies the text using the knowledge acquired during pre-training.
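A minimal sketch of this task-as-text idea is shown below; the prompt template and the `lm_score` interface are illustrative assumptions, not anything specific from the talk.

```python
# Sketch: zero-shot classification by describing the task in the prompt and
# comparing candidate answers by their language-model score.
def build_prompt(review: str) -> str:
    return (
        "Classify the sentiment of the movie review as positive or negative.\n"
        f"Review: {review}\n"
        "Sentiment:"
    )

def zero_shot_classify(lm_score, review: str,
                       labels=("positive", "negative")) -> str:
    """lm_score(prompt, continuation) -> log P(continuation | prompt), assumed interface."""
    prompt = build_prompt(review)
    return max(labels, key=lambda label: lm_score(prompt, " " + label))
```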

Will the scale of language models continue to grow?

 

As the figure below shows, language model parameter counts are approaching one trillion, and it is increasingly difficult to train all of these parameters at once: without enough data or compute, the model cannot be fully trained. Recently, models such as PaLM and Chinchilla have tried to keep model size in check by trading parameters for more data and compute.
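For intuition, the Chinchilla work is often summarized by the rule of thumb of roughly 20 training tokens per parameter under a fixed compute budget; the back-of-the-envelope sketch below uses that figure purely for illustration.

```python
# Back-of-the-envelope sketch of the compute-optimal trade-off: with a fixed
# compute budget, a smaller model trained on more tokens can beat a larger,
# under-trained one. The ~20 tokens/parameter ratio is the commonly cited
# Chinchilla rule of thumb; numbers here are illustrative only.
def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    return n_params * tokens_per_param

for n_params in (70e9, 175e9, 1e12):
    print(f"{n_params/1e9:7.0f}B params -> ~{chinchilla_tokens(n_params)/1e12:.1f}T training tokens")
```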

Professor Zettlemoyer pointed out that if we really want to keep making models bigger, compromises eventually have to be made: instead of large, dense neural networks, we turn to sparsity, using different parts of the model to handle different inputs (for example, Google's Switch model). Even with the most advanced GPU clusters, the demand for compute is approaching the limits of the hardware, so innovation has to happen at the level of the model architecture.

When a dense model is trained, every input passes through the entire model and every parameter participates in the computation. Researchers have proposed data parallelism, model parallelism, and pipeline parallelism to speed this up by distributing the computation across the nodes of a GPU cluster. Even so, each input still requires a large amount of computation, especially for very large models. Professor Zettlemoyer pointed out that conditional computation with sparse models lets only a subset of the parameters participate in the computation for each input, saving substantial compute.
A Transformer block consists of a self-attention layer and a fully connected (feed-forward) layer; the input passes through the self-attention layer and then into the feed-forward layer. As language models scale up, more than 95% of the parameters come from the feed-forward layers, so Professor Zettlemoyer's team focused on sparsifying them.
Activation functions such as ReLU already sparsify the network to some extent, so training large language models already involves a degree of conditional computation. In practice, however, we cannot know in advance which part of the network should be computed, nor can we exploit efficient dense matrix multiplication. Professor Zettlemoyer's team therefore partitioned the network: the feed-forward layer is split into several parts, different parts are placed on different processors, and a dedicated routing mechanism decides where each input goes.
In the mixture-of-experts model shown in the figure below, inputs such as "Dogs bark" and "Cats purr" are fed to the model during training. The model routes different words to different expert networks, and after the computation the outputs are reassembled in the original order. Professor Zettlemoyer pointed out that the real challenge is to make this process efficient without sacrificing quality. To that end, the team relies on expert specialization and expert balancing, so that different experts perform different computations, the load is spread evenly, and the hardware stays saturated.
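The sketch below shows the basic routing idea with top-1 gating: each token goes to one expert feed-forward network and the outputs are scattered back into the original token order. It is a minimal illustration, not Meta's implementation.

```python
# Minimal sketch of top-1 expert routing in a mixture-of-experts FFN layer.
import torch
import torch.nn as nn

class Top1MoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                       # x: (n_tokens, d_model)
        expert_ids = self.router(x).argmax(dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_ids == e              # tokens routed to expert e
            if mask.any():
                out[mask] = expert(x[mask])     # results return to original positions
        return out
```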
The standard practice is to add auxiliary loss terms during training to keep the routing balanced, and to use a capacity factor to prevent any expert from being overloaded. This introduces extra loss functions and hyperparameters whose settings must be kept stable. In 2021, Professor Zettlemoyer's team proposed BASE Layers, whose main features are: (1) no additional loss functions or hyperparameters; (2) roughly 90% of the computation speed of data parallelism; (3) each token is assigned to exactly one expert; (4) all experts are balanced and saturated. To balance the experts, the model replaces learned balancing with an algorithmic one: routing is formalized as a linear assignment problem and solved with an auction algorithm.
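To make the linear assignment idea concrete, here is a minimal sketch of balanced token-to-expert assignment. BASE Layers solve this with an auction algorithm; the sketch instead uses SciPy's Hungarian solver for the same assignment problem, purely for illustration.

```python
# Sketch: assign T tokens to E experts so that each expert gets exactly T/E
# tokens while maximizing total routing score (a linear assignment problem).
import numpy as np
from scipy.optimize import linear_sum_assignment

def balanced_assignment(scores: np.ndarray) -> np.ndarray:
    """scores: (T, E) router affinities, with T divisible by E.
    Returns the expert id assigned to each token."""
    T, E = scores.shape
    capacity = T // E
    # Replicate each expert column `capacity` times so the problem is one-to-one.
    cost = -np.repeat(scores, capacity, axis=1)       # (T, E * capacity)
    token_idx, slot_idx = linear_sum_assignment(cost)
    assignment = np.empty(T, dtype=int)
    assignment[token_idx] = slot_idx // capacity      # map slots back to experts
    return assignment

scores = np.random.rand(8, 4)                         # 8 tokens, 4 experts
print(balanced_assignment(scores))                    # each expert appears exactly twice
```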
The figure above compares three different approaches, measuring pre-training quality by perplexity; the lower the perplexity, the better the trained model. The blue line represents BASE Layers, which matches comparable dense models and does even better at larger scales. Its performance is on par with Sparse MoE and slightly better than Switch.
Next, Professor Zettlemoyer introduced DEMix Layers, the 2021 work of Gururangan et al. The researchers collected a large set of documents, obtained representations with RoBERTa, and clustered them by similarity; the visualization shows that language data is heterogeneous. As the figure shows, medical literature mostly falls into the red cluster, while computer science literature and news also form their own clusters, indicating that the data has inherent structure. The takeaway: computation can be split according to the structure of the data, enabling expert specialization.
DEMix uses domain-specific experts with a fixed routing process: each domain is handled by its own expert, and experts can be mixed, added, or removed as needed, with an emphasis on rapid adaptation. Training assumes metadata indicating the domain of each example, so no other form of balancing is required. With conditional computation, the domain experts can be trained in parallel; although there are more parameters than in a dense model, training is faster. To evaluate the approach, the researchers collected data from several new domains, again used perplexity as the metric, capped the training time of every model at 48 hours, and compared against four dense baselines: GPT-3 small, medium, large, and XL.
 
The researchers assume that the domain of each example is known at both training and test time. The experiments show that DEMix is most effective at smaller model scales; as the number of parameters grows, the gap between dense models and DEMix narrows.
A closer look at the largest model shows that DEMix performs worse than the dense model on the news and Reddit datasets. How to partition the domains is, however, a subjective choice; the researchers relied on metadata here, and how best to segment the data into domains deserves further study.
  • Feature 1: Mixing experts
Suppose the domain of the input is unknown; DEMix can then mix its experts. One method is to run a small amount of data through all experts and compute a posterior distribution over domains, so that experts can be mixed without any additional parameters (a sketch of this mixing appears after the three features below).
For 100 input sequences, the researchers visualized the distribution over domain experts. Each expert is strongly associated with its own domain, but more heterogeneous domains (such as the WebText dataset) also relate to other domains. When the researchers analyzed the expert weights for new domains, they found the correlations more spread out than for domains seen during training, meaning the model tends to draw on several experts when it encounters a new domain.
The experiments show that DEMix with mixed experts achieves the best results on new domains.
  • Feature 2: Adding experts
DEMix can freeze the rest of the model, compute the posterior distribution over domains for new data, add a new expert, and continue training it on that data starting from the parameters of the existing expert with the highest posterior probability.
  • Feature 3: Removing experts
Experts for certain domains (such as hate speech) may be unwanted at test time; removing them means, in a sense, that the model no longer draws on those domains.
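Returning to Feature 1, here is a minimal sketch of the parameter-free mixing step: estimate a posterior over domains from a small held-out block under each domain expert, then use those weights to mix the experts' next-token distributions. The `log_likelihoods` and `expert_probs` inputs are assumed interfaces, not the authors' code.

```python
# Sketch of DEMix-style expert mixing via a posterior over domains.
import numpy as np

def domain_posterior(log_likelihoods: np.ndarray, prior: np.ndarray | None = None) -> np.ndarray:
    """log_likelihoods: (n_experts,) log p(held-out block | domain expert)."""
    n = len(log_likelihoods)
    log_prior = np.log(prior) if prior is not None else -np.log(n) * np.ones(n)
    log_post = log_likelihoods + log_prior
    log_post -= np.logaddexp.reduce(log_post)     # normalize in log space
    return np.exp(log_post)

def mixed_next_token_probs(expert_probs: np.ndarray, posterior: np.ndarray) -> np.ndarray:
    """expert_probs: (n_experts, vocab) next-token distributions; posterior: (n_experts,)."""
    return posterior @ expert_probs               # weighted mixture over experts
```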
The researchers replaced the feed-forward layers with mixture-of-experts models such as BASE Layers and DEMix Layers, which train more efficiently at small and medium scales. These simple methods avoid fine-tuning, and DEMix can take on additional modules, but it is not yet clear which approach scales best.
How can we make the best use of language models?

 

Given a common-sense question such as "What would you put meat on top of to cook it?", GPT-3 answers "hot dog bun". That is not what we expect; we want something like "a hot pan". The example looks silly, but it helps illustrate a real problem with language models. When researchers test models on common-sense questions such as those in the CommonsenseQA dataset, the model may make the wrong choice. In the multiple-choice setting, "frying pan" is one of the options, and it is correct, but the same thing can be expressed in many different ways; for any given object, a language model can generate many different surface forms.
Professor Zettlemoyer showed the language model's 10 most likely outputs for this question: some are completely wrong, but the rest are correct or at least related to the correct answer. The four options highlighted in blue in the figure express the same concept; they are simply different ways of saying "pan".
Notice that when there are multiple ways of stating the same thing, these equivalent statements compete with one another, which leads to the surface form competition problem. Because a language model is a probability distribution over strings, if "stove" has probability 0.9, then "frying pan" cannot have probability greater than 0.1.
Take language modeling on the COPA dataset as an example. Given the premise "The bar closed because", we must choose the more plausible of the hypotheses "it was too crowded" and "it was three in the morning". If we feed the premise directly into the model and compare the probabilities of hypothesis 1 and hypothesis 2, hypothesis 1 scores higher, because "crowded" is simply more associated with "bar"; but that is the wrong answer. To prevent such mistakes, the scores need to be adjusted. This is the purpose of conditional pointwise mutual information (PMI): choose a broader context as the domain premise, for example "because", and rescale each answer by its conditional probability under that domain premise.
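A minimal sketch of this PMI-style rescoring is shown below; the `lm_logprob` interface and the default domain premise are illustrative assumptions rather than the exact formulation used in the work.

```python
# Sketch: rank candidate answers by P(answer | premise) divided by
# P(answer | domain premise), so generically likely phrasings do not win by default.
def pmi_score(lm_logprob, premise: str, answer: str, domain_premise: str = "because") -> float:
    """lm_logprob(context, continuation) -> log P(continuation | context), assumed interface."""
    return lm_logprob(premise, answer) - lm_logprob(domain_premise, answer)

def choose_answer(lm_logprob, premise: str, candidates: list[str]) -> str:
    return max(candidates, key=lambda a: pmi_score(lm_logprob, premise, a))
```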
The researchers tested models of different sizes, up to 175B parameters, on several datasets. The experiments compare different scoring adjustments, and PMI achieves the best results in most cases.
Professor Zettlemoyer then introduced noisy channel language model prompting. The noisy channel model dates from before the deep learning era: instead of producing the answer given the question, a channel model computes the probability of the question given the answer.
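The contrast between the two scoring directions can be sketched as follows; `lm_logprob` is again an assumed interface, and the uniform label prior is an illustrative simplification.

```python
# Sketch: direct scoring computes P(label | input); channel scoring computes
# P(input | label) * P(label) and ranks labels by that quantity instead.
import math

def direct_score(lm_logprob, text: str, label: str) -> float:
    return lm_logprob(text, " " + label)

def channel_score(lm_logprob, text: str, label: str, label_prior: float = 1.0) -> float:
    # P(text | label) * P(label); a uniform prior over labels is assumed here.
    return lm_logprob(label, " " + text) + math.log(label_prior)
```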
 
When people fine-tune a language model, they often tune only part of it, since fine-tuning all parameters is too expensive. Current schemes include prompt tuning, head tuning, and transformation tuning; the channel model can be introduced into prompt tuning.
The experiments show that prompt tuning with a channel model is significantly better than direct fine-tuning, head tuning is the most competitive baseline, and noisy-channel prompt tuning achieves the best overall results, especially on the Yahoo and DBPedia datasets.
Professor Zettlemoyer summarized:
 
1) Although language models can compute the probability of a statement, surface form competition may lead to the wrong answer;
2) PMI adjusts the probability of an answer by evaluating how much information the answer itself encodes;
3) Directly learning a noisy channel prompting model achieves even better results.
Other supervision methods for model pre-training
Professor Zettlemoyer briefly reviewed multilingual and multimodal pre-trained models, while emphasizing the importance of open science. Today's models accept a variety of supervision signals. Recent work includes XGLM, a generative multilingual model; HTLM, a model for document-level formatted text; CM3, a multimodal model; and InCoder, a model for generating or infilling code. In addition, OPT is the first openly released language model at GPT-3 scale.
In HTLM, document structure is used both for pre-training and for prompting; for example, for a summarization task the <title> tag can serve as the prompt. Massive amounts of HTML documents are also available for pre-training.
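The idea behind such structured prompting can be sketched as below; the template and the mask token are illustrative and do not reproduce HTLM's exact prompt format.

```python
# Sketch: wrap the document in markup and leave the field we want the model
# to fill (here the <title>, standing in for a summary) masked.
def html_summary_prompt(article: str, mask_token: str = "<mask>") -> str:
    return (
        "<html>\n"
        f"  <head><title>{mask_token}</title></head>\n"
        f"  <body>{article}</body>\n"
        "</html>"
    )

print(html_summary_prompt("Scientists report a new result on ..."))
```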
HTLM can even complete the prompt automatically; for example, the model in the figure below generates HTML tags such as <title> and <body> from the text content alone.
In CM3, the researchers discretize the images inside the <img> tags of HTML documents into token sequences, forming multimodal sequences that mix text and image tokens. These can then be treated like ordinary sequences by a language model, enabling zero-shot learning across modalities.
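Conceptually, the preprocessing step looks like the sketch below; `image_tokenizer` stands in for a learned discretizer (for example a VQ-style model), and everything here is illustrative rather than CM3's actual pipeline.

```python
# Sketch: flatten an HTML document containing an <img> into one token
# sequence that interleaves text tokens with discretized image tokens.
def to_multimodal_sequence(text_before: str, image, text_after: str,
                           text_tokenizer, image_tokenizer) -> list[int]:
    tokens = []
    tokens += text_tokenizer(text_before)
    tokens += image_tokenizer(image)     # the image becomes a run of discrete codes
    tokens += text_tokenizer(text_after)
    return tokens                        # trained like any other token sequence
```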
InCoder is a model that can generate missing code or comments; it is trained on a large corpus of permissively licensed open-source code.
Professor Zettlemoyer noted that pre-trained models can take many kinds of supervision signals: images, text, and other data can all be discretized for pre-training. Open science and model sharing are of great importance.

Conclusion:

 
Finally, Professor Zettlemoyer returned to the three questions raised at the beginning of the talk. He said that model scale is still growing and researchers will keep investing more compute, and conditional computation may yet see a major breakthrough; we still have not found the best way to use language models, and both zero-shot and few-shot learning are worth studying, although the choice also depends on the performance required and how openly the code is shared; and text data is not strictly necessary: researchers can also use other structures or modalities to provide supervision signals.
 


Copyright notice
This article was created by [Zhiyuan community]; please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/187/202207060032470717.html