[NLP] Pre-training models - GPT-1
2022-07-01 13:46:00 【Coriander Chrysanthemum】
Background
Without further ado, here are the links to the three papers: GPT-1: Improving Language Understanding by Generative Pre-Training, GPT-2: Language Models are Unsupervised Multitask Learners, GPT-3: Language Models are Few-Shot Learners. Li Mu also has a video on Bilibili introducing the GPT models: GPT, GPT-2, GPT-3 paper reading.
First, a timeline of language models after the Transformer appeared: Transformer: 2017/06, GPT-1: 2018/06, BERT: 2018/10, GPT-2: 2019/02, GPT-3: 2020/05. Judging from the order in which these models appeared, the competition was indeed fierce.
GPT1
There are many tasks in NLP, such as question answering, semantic similarity, and text classification. Although unlabelled text corpora are abundant, labelled data for these specific tasks is scarce, which makes it hard to train task-specific models. The GPT-1 paper shows that generatively pre-training a language model on a large unlabelled text corpus, followed by discriminative fine-tuning on each specific task, yields good results on these tasks. Compared with previous methods, GPT-1 uses task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture. Think of word2vec here: it is also a pre-trained model, but we still have to build a task-specific neural network on top of it, whereas GPT-1 does not need one. The effectiveness of this approach is demonstrated on a wide range of natural language understanding benchmarks: the general, task-agnostic model outperforms discriminatively trained models that use architectures crafted for each task, significantly improving the state of the art on 9 of the 12 tasks studied.
Before this, the most successful pre-training approach was of course word2vec. The paper raises two major problems: 1. given an unlabelled corpus, what kind of optimization objective (loss function) should be chosen? There are candidate objectives such as language modelling and machine translation, but no single objective performs best across all tasks; 2. how can the learned text representations be effectively transferred to downstream tasks?
GPT-1's approach is semi-supervised: learn a language model on a large unlabelled corpus, then fine-tune it on the downstream task. At that point, the better-known architectures for language modelling were essentially RNNs and the Transformer. Compared with RNNs, the features learned by a Transformer transfer more robustly. As the paper puts it: "This model choice provides us with a more structured memory for handling long-term dependencies in text, compared to alternatives like recurrent networks, resulting in robust transfer performance across diverse tasks." During transfer, it uses task-specific input representations: "During transfer, we utilize task-specific input adaptations derived from traversal-style approaches, which process structured text input as a single contiguous sequence of tokens."
Model structure
Training consists of two stages. The first stage learns a high-capacity language model on a large corpus. The second is the fine-tuning stage, in which the model is adapted to a discriminative task using labelled data.
Unsupervised pre-training
Given an unlabelled corpus of tokens $\mathcal{U} = \{u_1, \cdots, u_n\}$, a standard language modelling objective maximizes the following likelihood:
$$L_1(\mathcal{U}) = \sum_{i} \log P\left(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta\right)$$
where $\Theta$ denotes the parameters of the GPT-1 model, and the previous $k$ tokens are used to predict the probability of the $i$-th token; $k$ is the context window size, a hyperparameter. Summing the log-probabilities over every position in the corpus gives the objective $L_1$. The log is used so that we add log-probabilities instead of multiplying many probabilities together, which would underflow to zero. In other words, the model maximizes the likelihood of the corpus.
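As a minimal sketch (not the paper's code), this windowed objective can be written in PyTorch; `model` here is a hypothetical autoregressive decoder that maps a sequence of token ids to next-token logits over the vocabulary.

```python
import torch
import torch.nn.functional as F

def lm_log_likelihood(model, tokens, k):
    """L1: sum over i of log P(u_i | u_{i-k}, ..., u_{i-1}).

    tokens: LongTensor of shape (seq_len,) holding token ids u_1..u_n.
    k:      context window size (a hyperparameter).
    model:  callable mapping (1, <=k) ids to (1, <=k, vocab) logits (hypothetical).
    """
    total = torch.tensor(0.0)
    for i in range(1, tokens.size(0)):
        context = tokens[max(0, i - k):i].unsqueeze(0)   # the previous (up to) k tokens
        logits = model(context)[:, -1, :]                # prediction for position i
        total = total + F.log_softmax(logits, dim=-1)[0, tokens[i]]
    return total  # maximize this (in practice, minimize its negative)
```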
The paper uses a multi-layer Transformer decoder. Because of the mask in the Transformer decoder, each position can only attend to the tokens before the current one; the attention weights for the following tokens are set to 0.
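A small sketch of how that mask works, assuming the usual additive-mask convention: future positions get a score of minus infinity before the softmax, so their attention weight comes out as 0. The helper name is illustrative.

```python
import torch

def causal_mask(seq_len):
    """True above the diagonal, i.e. at the future positions that must be hidden."""
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

seq_len = 5
scores = torch.randn(seq_len, seq_len)                            # raw attention scores
scores = scores.masked_fill(causal_mask(seq_len), float("-inf"))  # hide future tokens
weights = torch.softmax(scores, dim=-1)   # each row attends only to itself and earlier tokens
```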
For prediction, let $U = (u_{-k}, \cdots, u_{-1})$ be the context vector of tokens, $n$ the number of transformer decoder layers, $W_e$ the token embedding matrix, and $W_p$ the position embedding matrix. Predicting the next word given the context $U$ then proceeds as follows:
$$h_0 = U W_e + W_p$$
$$h_l = \mathrm{transformer\_block}(h_{l-1}), \quad \forall l \in [1, n]$$
$$P(u) = \mathrm{softmax}(h_n W_e^{T})$$
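Below is a hedged sketch of these three equations in PyTorch. The class name and defaults are illustrative (they loosely follow the paper's 12-layer, 768-dimensional setup), and `nn.TransformerEncoderLayer` with a causal mask stands in for the paper's masked decoder block; note that the output projection reuses $W_e$, as in the formula.

```python
import torch
import torch.nn as nn

class GPT1Sketch(nn.Module):
    """h0 = U·We + Wp;  h_l = transformer_block(h_{l-1});  P(u) = softmax(h_n·We^T)."""

    def __init__(self, vocab_size=40000, max_len=512, d_model=768, n_layers=12, n_heads=12):
        super().__init__()
        self.W_e = nn.Embedding(vocab_size, d_model)   # token embedding matrix W_e
        self.W_p = nn.Embedding(max_len, d_model)      # position embedding matrix W_p
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)                   # stand-in for masked decoder blocks
        )

    def forward(self, token_ids):                      # token_ids: (batch, seq)
        seq_len = token_ids.size(1)
        pos = torch.arange(seq_len, device=token_ids.device)
        h = self.W_e(token_ids) + self.W_p(pos)        # h_0
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                     device=token_ids.device), diagonal=1)
        for block in self.blocks:                      # h_l = transformer_block(h_{l-1})
            h = block(h, src_mask=mask)
        logits = h @ self.W_e.weight.T                 # h_n · W_e^T (tied output weights)
        return torch.softmax(logits, dim=-1)           # P(u)
```

In practice one would return the logits and feed them to a cross-entropy loss; returning $P(u)$ here just keeps the correspondence with the equations.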
Supervised fine-tuning
After training a model with the objective $L_1$, we can adapt its parameters to a supervised task. This trained model is what we call the pre-trained model. Suppose we have a labelled corpus $\mathcal{C}$, where each sample is a sequence of input tokens $x^1, \cdots, x^m$ with label $y$. Passing the input through the pre-trained model gives the final transformer block's activation $h_l^m$, which is fed into a linear output layer added according to the task to predict $y$, as in the following formula:
$$P\left(y \mid x^{1}, \ldots, x^{m}\right)=\operatorname{softmax}\left(h_{l}^{m} W_{y}\right)$$
This gives the following objective for the fine-tuning stage:
$$L_{2}(\mathcal{C})=\sum_{(x, y)} \log P\left(y \mid x^{1}, \ldots, x^{m}\right)$$
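As a minimal sketch of this step, assuming we already have the final transformer block's activations: take the state at the last token, $h_l^m$, project it with the new matrix $W_y$, and train with cross-entropy (the negative of $L_2$). Names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, n_classes = 768, 2
W_y = nn.Linear(d_model, n_classes, bias=False)    # the only new parameters, W_y

def finetune_loss(hidden_states, labels):
    """hidden_states: (batch, seq, d_model), output of the last transformer block.
    labels:        (batch,) gold class indices y."""
    h_lm = hidden_states[:, -1, :]                 # h_l^m, the state at the final token
    logits = W_y(h_lm)                             # h_l^m · W_y; softmax gives P(y | x^1..x^m)
    return F.cross_entropy(logits, labels)         # negative of L2, averaged over the batch
```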
The paper also notes that adding the language modelling objective $L_1$, i.e. predicting the next word of the sequence, as an auxiliary term in the fine-tuning objective works even better, giving the following objective:
$$L_{3}(\mathcal{C})=L_{2}(\mathcal{C})+\lambda * L_{1}(\mathcal{C})$$
where $\lambda$ is a hyperparameter (the paper sets it to 0.5).
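A hedged sketch of the combined objective: during fine-tuning, the language-modelling loss is computed on the same labelled sequences and added with weight $\lambda$. It assumes the model returns both next-token logits and classification logits; names are illustrative.

```python
import torch.nn.functional as F

lam = 0.5   # the hyperparameter lambda

def combined_loss(lm_logits, token_ids, cls_logits, labels):
    """L3 = L2 + lambda * L1, written as losses to minimize."""
    # L1 term: predict token t+1 from tokens up to t (shifted cross-entropy)
    vocab = lm_logits.size(-1)
    lm_loss = F.cross_entropy(lm_logits[:, :-1, :].reshape(-1, vocab),
                              token_ids[:, 1:].reshape(-1))
    # L2 term: the classification loss from the previous sketch
    cls_loss = F.cross_entropy(cls_logits, labels)
    return cls_loss + lam * lm_loss
```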
Overall, the only extra parameters needed during fine-tuning are $W_y$ (plus embeddings for the delimiter tokens). So how are the inputs of different NLP tasks represented and processed with the pre-trained model? Look at the figure from the paper first:
Take text classification as an example: add the specified delimiter tokens, feed the sequence into the Transformer, and follow it with a Linear layer. The second task is Entailment, i.e. the NLI (natural language inference) task introduced earlier. The other tasks splice their inputs together as shown in the figure above.
In general, by constructing different types of inputs for the pre-trained model, multiple NLP tasks can be handled while the pre-trained model itself stays unchanged.
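A hedged sketch of these traversal-style input transformations: each task is serialized into one token sequence using start, delimiter, and extract tokens, so the pre-trained model never changes. The special-token strings and helper names here are illustrative, not the paper's actual vocabulary.

```python
START, DELIM, EXTRACT = "<s>", "<$>", "<e>"    # illustrative special tokens

def classification_input(text):
    # single text span: start + text + extract
    return [START, *text.split(), EXTRACT]

def entailment_input(premise, hypothesis):
    # premise and hypothesis joined by a delimiter
    return [START, *premise.split(), DELIM, *hypothesis.split(), EXTRACT]

def similarity_inputs(a, b):
    # no inherent ordering, so both orderings are run through the model
    # and their final states are combined before the linear layer
    return [entailment_input(a, b), entailment_input(b, a)]

def multiple_choice_inputs(context, answers):
    # one sequence per candidate answer; each final state is scored, then softmaxed
    return [[START, *context.split(), DELIM, *ans.split(), EXTRACT] for ans in answers]
```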