[NLP] Pre-training models - GPT-1
2022-07-01 13:46:00 【Coriander Chrysanthemum】
Background
Without further ado, here are the links to the three papers: GPT-1: Improving Language Understanding by Generative Pre-Training, GPT-2: Language Models are Unsupervised Multitask Learners, GPT-3: Language Models are Few-Shot Learners. Li Mu also has a video on Bilibili introducing the GPT models: GPT, GPT-2, GPT-3 paper reading.
First, a timeline of language models after the Transformer appeared: Transformer: 2017/06, GPT-1: 2018/06, BERT: 2018/10, GPT-2: 2019/02, GPT-3: 2020/05. Judging from the order in which these models appeared, the competition was indeed fierce.
GPT1
There are many tasks in NLP, such as question answering, semantic similarity, and text classification. Although unlabelled text corpora are abundant, labelled data for these specific tasks is scarce, which makes it hard to train task-specific models. The GPT-1 paper shows that generatively pre-training a language model on a large unlabelled text corpus, followed by discriminative fine-tuning on each specific task, yields good results on these tasks. Compared with previous methods, GPT-1 uses task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture. Think of word2vec here: it is also a pre-trained model, but we still have to build a task-specific neural network on top of it, whereas GPT-1 does not need one. The effectiveness of this approach is demonstrated on a wide range of natural language understanding benchmarks: the general, task-agnostic model outperforms discriminatively trained models that use architectures crafted for each task, significantly improving the state of the art on 9 of the 12 tasks studied.
Before this, the most successful pre-training approach was of course word2vec. The paper raises two major problems: 1. given an unlabelled corpus, what kind of optimization objective (loss function) should be chosen? There are candidate objectives such as language modelling and machine translation, but no single objective performs best across all tasks; 2. how can the learned text representations be effectively transferred to downstream tasks?
GPT-1's approach is semi-supervised: learn a language model on a large unlabelled corpus, then fine-tune it on the downstream task. At that point, the better-known architectures for language modelling were essentially RNNs and the Transformer. Compared with RNNs, the features learned by a Transformer transfer more robustly. As the paper puts it: "This model choice provides us with a more structured memory for handling long-term dependencies in text, compared to alternatives like recurrent networks, resulting in robust transfer performance across diverse tasks." During transfer, it uses task-specific input representations: "During transfer, we utilize task-specific input adaptations derived from traversal-style approaches, which process structured text input as a single contiguous sequence of tokens."
Model structure
Training consists of two stages. The first stage learns a high-capacity language model on a large corpus. The second is the fine-tuning stage, in which the model is adapted to a discriminative task using labelled data.
Unsupervised pre-training
Given an unlabelled corpus of tokens $\mathcal{U} = \{u_1, \cdots, u_n\}$, a standard language modelling objective maximizes the following likelihood:
$$L_1(\mathcal{U}) = \sum_{i} \log P\left(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta\right)$$
where $\Theta$ denotes the parameters of the GPT-1 model, and the previous $k$ tokens are used to predict the probability of the $i$-th token; $k$ is the context window size, a hyperparameter. Summing the log-probabilities over every position in the corpus gives the objective $L_1$. The log is used so that we add log-probabilities instead of multiplying many probabilities together, which would underflow to zero. In other words, the model maximizes the likelihood of the corpus.
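As a minimal sketch (not the paper's code), this windowed objective can be written in PyTorch; `model` here is a hypothetical autoregressive decoder that maps a sequence of token ids to next-token logits over the vocabulary.

```python
import torch
import torch.nn.functional as F

def lm_log_likelihood(model, tokens, k):
    """L1: sum over i of log P(u_i | u_{i-k}, ..., u_{i-1}).

    tokens: LongTensor of shape (seq_len,) holding token ids u_1..u_n.
    k:      context window size (a hyperparameter).
    model:  callable mapping (1, <=k) ids to (1, <=k, vocab) logits (hypothetical).
    """
    total = torch.tensor(0.0)
    for i in range(1, tokens.size(0)):
        context = tokens[max(0, i - k):i].unsqueeze(0)   # the previous (up to) k tokens
        logits = model(context)[:, -1, :]                # prediction for position i
        total = total + F.log_softmax(logits, dim=-1)[0, tokens[i]]
    return total  # maximize this (in practice, minimize its negative)
```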
The paper uses a multi-layer Transformer decoder. Because of the mask in the Transformer decoder, each position can only attend to the tokens before the current one; the attention weights for the following tokens are set to 0.
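A small sketch of how that mask works, assuming the usual additive-mask convention: future positions get a score of minus infinity before the softmax, so their attention weight comes out as 0. The helper name is illustrative.

```python
import torch

def causal_mask(seq_len):
    """True above the diagonal, i.e. at the future positions that must be hidden."""
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

seq_len = 5
scores = torch.randn(seq_len, seq_len)                            # raw attention scores
scores = scores.masked_fill(causal_mask(seq_len), float("-inf"))  # hide future tokens
weights = torch.softmax(scores, dim=-1)   # each row attends only to itself and earlier tokens
```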
For prediction, let $U = (u_{-k}, \cdots, u_{-1})$ be the context vector of tokens, $n$ the number of transformer decoder layers, $W_e$ the token embedding matrix, and $W_p$ the position embedding matrix. Predicting the next word given the context $U$ then proceeds as follows:
$$h_0 = U W_e + W_p$$
$$h_l = \mathrm{transformer\_block}(h_{l-1}), \quad \forall l \in [1, n]$$
$$P(u) = \mathrm{softmax}(h_n W_e^{T})$$
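Below is a hedged sketch of these three equations in PyTorch. The class name and defaults are illustrative (they loosely follow the paper's 12-layer, 768-dimensional setup), and `nn.TransformerEncoderLayer` with a causal mask stands in for the paper's masked decoder block; note that the output projection reuses $W_e$, as in the formula.

```python
import torch
import torch.nn as nn

class GPT1Sketch(nn.Module):
    """h0 = U·We + Wp;  h_l = transformer_block(h_{l-1});  P(u) = softmax(h_n·We^T)."""

    def __init__(self, vocab_size=40000, max_len=512, d_model=768, n_layers=12, n_heads=12):
        super().__init__()
        self.W_e = nn.Embedding(vocab_size, d_model)   # token embedding matrix W_e
        self.W_p = nn.Embedding(max_len, d_model)      # position embedding matrix W_p
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)                   # stand-in for masked decoder blocks
        )

    def forward(self, token_ids):                      # token_ids: (batch, seq)
        seq_len = token_ids.size(1)
        pos = torch.arange(seq_len, device=token_ids.device)
        h = self.W_e(token_ids) + self.W_p(pos)        # h_0
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                     device=token_ids.device), diagonal=1)
        for block in self.blocks:                      # h_l = transformer_block(h_{l-1})
            h = block(h, src_mask=mask)
        logits = h @ self.W_e.weight.T                 # h_n · W_e^T (tied output weights)
        return torch.softmax(logits, dim=-1)           # P(u)
```

In practice one would return the logits and feed them to a cross-entropy loss; returning $P(u)$ here just keeps the correspondence with the equations.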
Supervised fine-tuning
After training a model with the objective $L_1$, we can adapt its parameters to a supervised task. This trained model is what we call the pre-trained model. Suppose we have a labelled corpus $\mathcal{C}$, where each sample is a sequence of input tokens $x^1, \cdots, x^m$ with label $y$. Passing the input through the pre-trained model gives the final transformer block's activation $h_l^m$, which is fed into a linear output layer added according to the task to predict $y$, as in the following formula:
$$P\left(y \mid x^{1}, \ldots, x^{m}\right)=\operatorname{softmax}\left(h_{l}^{m} W_{y}\right)$$
This gives the following objective for the fine-tuning stage:
$$L_{2}(\mathcal{C})=\sum_{(x, y)} \log P\left(y \mid x^{1}, \ldots, x^{m}\right)$$
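As a minimal sketch of this step, assuming we already have the final transformer block's activations: take the state at the last token, $h_l^m$, project it with the new matrix $W_y$, and train with cross-entropy (the negative of $L_2$). Names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, n_classes = 768, 2
W_y = nn.Linear(d_model, n_classes, bias=False)    # the only new parameters, W_y

def finetune_loss(hidden_states, labels):
    """hidden_states: (batch, seq, d_model), output of the last transformer block.
    labels:        (batch,) gold class indices y."""
    h_lm = hidden_states[:, -1, :]                 # h_l^m, the state at the final token
    logits = W_y(h_lm)                             # h_l^m · W_y; softmax gives P(y | x^1..x^m)
    return F.cross_entropy(logits, labels)         # negative of L2, averaged over the batch
```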
The paper also notes that adding the language modelling objective $L_1$, i.e. predicting the next word of the sequence, as an auxiliary term in the fine-tuning objective works even better, giving the following objective:
$$L_{3}(\mathcal{C})=L_{2}(\mathcal{C})+\lambda * L_{1}(\mathcal{C})$$
where $\lambda$ is a hyperparameter (the paper sets it to 0.5).
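A hedged sketch of the combined objective: during fine-tuning, the language-modelling loss is computed on the same labelled sequences and added with weight $\lambda$. It assumes the model returns both next-token logits and classification logits; names are illustrative.

```python
import torch.nn.functional as F

lam = 0.5   # the hyperparameter lambda

def combined_loss(lm_logits, token_ids, cls_logits, labels):
    """L3 = L2 + lambda * L1, written as losses to minimize."""
    # L1 term: predict token t+1 from tokens up to t (shifted cross-entropy)
    vocab = lm_logits.size(-1)
    lm_loss = F.cross_entropy(lm_logits[:, :-1, :].reshape(-1, vocab),
                              token_ids[:, 1:].reshape(-1))
    # L2 term: the classification loss from the previous sketch
    cls_loss = F.cross_entropy(cls_logits, labels)
    return cls_loss + lam * lm_loss
```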
Overall, the only extra parameters needed during fine-tuning are $W_y$ (plus embeddings for the delimiter tokens). So how are the inputs of different NLP tasks represented and processed with the pre-trained model? Look at the figure from the paper first:
Take text classification as an example: add the specified delimiter tokens, feed the sequence into the Transformer, and follow it with a Linear layer. The second task is Entailment, i.e. the NLI (natural language inference) task introduced earlier. The other tasks splice their inputs together as shown in the figure above.
In general, by constructing different types of inputs for the pre-trained model, multiple NLP tasks can be handled while the pre-trained model itself stays unchanged.
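A hedged sketch of these traversal-style input transformations: each task is serialized into one token sequence using start, delimiter, and extract tokens, so the pre-trained model never changes. The special-token strings and helper names here are illustrative, not the paper's actual vocabulary.

```python
START, DELIM, EXTRACT = "<s>", "<$>", "<e>"    # illustrative special tokens

def classification_input(text):
    # single text span: start + text + extract
    return [START, *text.split(), EXTRACT]

def entailment_input(premise, hypothesis):
    # premise and hypothesis joined by a delimiter
    return [START, *premise.split(), DELIM, *hypothesis.split(), EXTRACT]

def similarity_inputs(a, b):
    # no inherent ordering, so both orderings are run through the model
    # and their final states are combined before the linear layer
    return [entailment_input(a, b), entailment_input(b, a)]

def multiple_choice_inputs(context, answers):
    # one sequence per candidate answer; each final state is scored, then softmaxed
    return [[START, *context.split(), DELIM, *ans.split(), EXTRACT] for ans in answers]
```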