Adapter: a technique for adapting pre-trained models to continual learning
2022-07-06 02:06:00 【weixin_42001089】
Preface
If you have been following pre-training for a while, the adapter is a technique worth paying attention to. There has been quite a lot of recent work in this area, all against the same background: how to continuously inject knowledge into a large model without forgetting what it learned before.
Today I will walk through two of the latest related works to open up your thinking ~ interested readers can look up more related papers.
Of course, the idea is not particularly new; there have been many similar ideas before. So as you read, study the design on one hand, but more importantly study what problem it wants to solve and the general direction of thinking. In short, don't let yourself be boxed in by current adapter research; the author also offers some discussion at the end of the article. In fact, the original plan was not to title this article "Adapter: a technique for adapting pre-trained models to continual learning" but to write about the broader direction of continual learning. That direction is too big, though, and the papers introduced here are all variations on the adapter idea (one class of solutions), so the scope was narrowed accordingly; I will write a separate article on model continual learning when I have time. The author believes this is the starting point for studying the field's most essential problem in the future; even the currently hot multimodal models are only an attempt and cannot solve the essential problem (see the end of the article for the reasons; this represents a personal view only).
To be precise, what I hope you take away from this article is not a specific model design, but the realization that there is another direction worth studying: continual knowledge learning for pre-trained models. If it inspires a good idea, that is the greatest value of this article.
Problem
Before we begin, let's look at the main problems in this setting. Previously, new knowledge was injected into a model mostly through multi-task learning or curriculum learning (multi-stage training); for example, the dialogue model PLATO-2 uses multi-stage training. But these approaches have some obvious shortcomings:
(1) Knowledge cannot be injected continuously. Suppose we already have a trained model and a new piece of knowledge needs to be injected. If we train only on the dataset for this new knowledge, the model is bound to fit the current dataset without restraint and forget what it learned before. The usual fix is to mix the previous datasets with the current one and retrain the whole model. That works; ERNIE-2, for example, does exactly this. But the drawback is also obvious: the cost is too high, since every new piece of knowledge requires updating the whole model on all of the previous knowledge as well (see the sketch after this list).
(2) Model representations are severely coupled. Multiple kinds of knowledge are entangled in the same representation, which makes it hard to study the effect of each kind of knowledge. You have surely noticed the trend toward unified models. Narrowly, within single-modality models, tasks are being merged: in NLP, datasets for NER, sentiment classification, and even tables are collected to train one unified model. Broadly, multimodal models show the same unification: images, text, and even speech are collected to train one unified model. In short, the recipe is to gather whatever data can be gathered and train one big model, relying on big data and big models to build an intelligent model that accommodates all kinds of knowledge. But that digresses; interested readers can see an earlier article by the blogger. Back to this article's topic: under this trend, it becomes all the more important to study the relationships between different kinds of knowledge, and the first step toward studying those relationships is to find a way to decouple the representation of each kind of knowledge.
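To make shortcoming (1) concrete, here is a minimal, purely illustrative sketch (not from any of the papers discussed) contrasting naive sequential fine-tuning with the ERNIE-2-style fix of mixing the old data back in. The toy model and random datasets are made-up stand-ins.

```python
import torch
from torch import nn
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

def train(model, loader, epochs=1, lr=1e-3):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            loss = nn.functional.cross_entropy(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()

# Toy stand-ins for "old knowledge" and "new knowledge" datasets.
old_data = TensorDataset(torch.randn(256, 32), torch.randint(0, 4, (256,)))
new_data = TensorDataset(torch.randn(64, 32), torch.randint(0, 4, (64,)))
model = nn.Linear(32, 4)

# Naive: train only on the new data -> the model drifts toward it and
# forgets the old distribution (catastrophic forgetting).
train(model, DataLoader(new_data, batch_size=16))

# The replay-style fix: mix the old data back in and retrain the whole
# model. No forgetting, but every new piece of knowledge costs a full
# pass over all previous data.
train(model, DataLoader(ConcatDataset([old_data, new_data]),
                        batch_size=16, shuffle=True))
```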
With the requirements roughly understood, let's look at the adapter solution through two papers.
If it has to be summed up in one plain sentence, I would say: freeze the big head, train the small head.
K-adapter
Paper link: https://arxiv.org/pdf/2002.01808.pdf
We do not want to forget previous knowledge, yet we also do not want to retrain with the previous data. The easiest way to achieve this is simply not to touch the well-trained weights: if the weights do not change, the knowledge is still there!!!
Following this logic, we can fix the previous weights and, for each piece of new knowledge, train only a corresponding small part of model weights added on top. That is what I meant at the start by "freeze the big head, train the small head."
At this point some readers may smile: isn't this just like pretrain + finetune? Yes, exactly. Hold the "previous weights" big head fixed and train only your own small part of the weights, which is usually not large. That small part of the model is called an adapter, and each task has its own adapter to fit it.
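Here is a minimal sketch of "freeze the big head, train the small head" as a generic recipe (not code from the paper; the backbone architecture and adapter shape are arbitrary stand-ins):

```python
import torch
from torch import nn

backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2)                       # stands in for the pre-trained model
adapter = nn.Sequential(nn.Linear(64, 16), nn.ReLU(), nn.Linear(16, 64))

for p in backbone.parameters():         # hold the big head fixed:
    p.requires_grad = False             # its knowledge cannot be forgotten

# Only the adapter's (much smaller) weights receive gradients.
opt = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

x = torch.randn(8, 10, 64)              # dummy batch (batch, seq, hidden)
out = adapter(backbone(x))              # frozen features -> trainable head
out.sum().backward()                    # gradients flow into adapter only
opt.step()
```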
As the figure in the paper shows, (a) represents multi-task learning, where all the data is learned together, while (b) is the method of this paper, with multiple adapters: Adapter1, Adapter2, and so on.
We could actually stop here, since the core idea is covered, but the following explains how the adapter itself is designed. The paper gives a ready-made scheme, so you don't even need to design your own module.
Specifically, the structure combines N transformer layers, 2 projection layers, and 1 residual connection; each adapter actually contains K such combinations, which is the origin of the paper's title, K-adapter. For the experiments and more details, see the original paper.
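Below is a minimal sketch of that structure, assuming the recipe just described (N transformer layers sandwiched between two projection layers, wrapped in a residual connection, stacked K times). All sizes here are invented for illustration; the real configuration is in the paper.

```python
import torch
from torch import nn

class AdapterLayer(nn.Module):
    def __init__(self, hidden=768, bottleneck=128, n_transformer=2, nhead=4):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)       # projection layer 1
        self.blocks = nn.TransformerEncoder(            # N transformer layers
            nn.TransformerEncoderLayer(d_model=bottleneck, nhead=nhead,
                                       batch_first=True),
            num_layers=n_transformer)
        self.up = nn.Linear(bottleneck, hidden)         # projection layer 2

    def forward(self, h):
        return h + self.up(self.blocks(self.down(h)))   # residual connection

# A full adapter stacks K such combinations while the backbone stays frozen.
adapter = nn.Sequential(*[AdapterLayer() for _ in range(3)])  # K = 3 here
out = adapter(torch.randn(2, 10, 768))                  # (batch, seq, hidden)
```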
Adapter-Bot
Paper link: https://arxiv.org/pdf/2008.12579.pdf
This paper applies the adapter idea to dialogue models. Its background is as follows: there are many kinds of dialogue datasets, e.g., knowledge-grounded dialogue skills, emotional dialogue skills, and so on. Each kind of data can be understood as one kind of knowledge, and what this paper wants to solve is how to train a single dialogue model that integrates multiple kinds of knowledge. If you are interested in dialogue models, see an earlier article by the author.
The big idea stays the same: freeze the big head, train the small head.
Concretely, the big head is a frozen backbone dialogue model; the small head is a set of trainable, independent residual adapters plus a trainable dialogue manager.
By analogy with K-adapter, the residual adapters here play the same role as the adapters in K-adapter, except that the design adopts residual connections.
The so-called dialogue manager here is actually a classification task: it classifies which dialogue-skill dataset the current conversation belongs to. Its role is to supervise the residual adapters so that each one better fits its own data distribution.
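A minimal sketch of this routing idea follows (not the paper's code; the GRU backbone, module sizes, and skill count are all illustrative assumptions):

```python
import torch
from torch import nn

hidden, n_skills = 64, 3
backbone = nn.GRU(hidden, hidden, batch_first=True)  # frozen dialogue model
for p in backbone.parameters():
    p.requires_grad = False

# One trainable residual adapter per dialogue skill (knowledge, emotion, ...).
adapters = nn.ModuleList(
    nn.Sequential(nn.Linear(hidden, 16), nn.ReLU(), nn.Linear(16, hidden))
    for _ in range(n_skills))

# The dialogue manager: a classifier over skills; during training it would
# be supervised with cross-entropy against the true skill/dataset labels.
manager = nn.Linear(hidden, n_skills)

x = torch.randn(4, 12, hidden)          # dummy dialogue encodings
h, _ = backbone(x)
skill = manager(h.mean(dim=1)).argmax(dim=-1)  # pick a skill per dialogue

# Apply the chosen adapter with a residual connection around it.
out = torch.stack([h[i] + adapters[int(skill[i])](h[i])
                   for i in range(x.size(0))])
```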
Summary
(1) Freezing the big head and updating only the small adapters can certainly address forgetting, but not 100%. What qualifies the big head to be the big head is that it is sufficiently capable; in other words, the first version of the big head already contains a great deal of knowledge. Suppose your big head is not very good to begin with, and now a new piece of knowledge arrives that is large and very hard to learn. Then the previous boss was never really qualified to be the boss and still needs to be retrained. So whether the big head in your hands is strong enough matters a great deal.
(2) Note the differences and connections between the problem discussed here and multi-task learning (including multimodal learning, etc.). One argument given by multi-task learning researchers is that multiple tasks can influence each other, i.e., each task's performance improves through the influence of related tasks, so they should be trained together and interact; here, by contrast, we keep the weights for other tasks unchanged. There is actually nothing contradictory in this: they are two different things, and each has its place.
(3) A bit more discussion. You may have sensed that the adapter approach above is not necessarily the best way. Take us humans: we are born a blank sheet and then learn and accumulate knowledge bit by bit, becoming stronger and stronger. So perhaps we should not start from one big model and stack small pieces on top, but jump out of that frame and look at the problem differently: we (small at first) now have many models with various abilities, so how do we integrate these models, or how do we let a system continuously learn from them? Note that this is fundamentally different from the adapter idea above: the adapter's premise is that small knowledge is pushed on top of something big, but in fact there should be no size discrimination. The designed model should be able to continuously absorb the knowledge of large models, so the design space here is very large, and we can even borrow ideas such as distillation. The adapter concept is by no means narrow, and knowledge fusion is a valuable topic. Today's multimodal models are essentially going all-in from the start, preparing as much knowledge data of every kind as possible, but the author believes this is ultimately a stopgap: new knowledge and data appear in an endless stream every day, and in this age of information explosion it is never possible to enumerate everything by preparing in advance. The ability to learn continuously is necessary, so studying sustainable model learning is very valuable, and I sincerely hope to see more results in this field in the future.
Follow
Welcome to follow, see you next time ~
WeChat official account:
GitHub: https://github.com/Mryangkaitong
Zhihu: