GPT (Improving Language Understanding by Generative Pre-Training) paper notes
2022-06-30 09:37:00 【A grain of sand in the vast sea of people】
Contents
1. Brief introduction to the paper
2. Differences and connections between GPT and ELMo
3. Contributions and major improvements
3.1. Semi-supervised learning with the Transformer
3.2. Task-specific input transformations for GPT during fine-tuning
Paper: GPT: Improving Language Understanding by Generative Pre-Training
Code: GPT
1. Brief introduction to the paper
GPT is short for "Generative Pre-Training", i.e. generative pre-training. GPT adopts a two-stage process: in the first stage a language model is used for pre-training, and in the second stage the model is fine-tuned to solve downstream tasks. The following figure shows GPT's pre-training process.
2. Differences and connections between GPT and ELMo
(1) Similarity: like ELMo, GPT is a two-stage model.
(2) Differences: first, the feature extractor is no longer an RNN but a Transformer, whose feature-extraction ability is stronger than an RNN's; second, although GPT's pre-training still takes language modeling as the target task, it is unidirectional (left-to-right), whereas ELMo combines forward and backward language models.
3. Contributions and major improvements
3.1. Semi-supervised learning with the Transformer
Why
The problem with supervised learning (here, supervised learning means training the model only on large amounts of manually labeled data):
- Most deep learning methods require large amounts of manually labeled data, but in practice we do not have that much labeled data, which limits their applicability in many domains.
- Models that can learn linguistic information from unlabeled data provide a valuable alternative to collecting more annotations, which can be time-consuming and expensive.
- Even when considerable supervision is available, learning good representations in an unsupervised fashion can provide a significant performance boost. So far, the most compelling evidence for this is the widespread use of pre-trained word embeddings to improve performance on a range of NLP tasks.
How
We use a two-stage training procedure. First, we use a language modeling objective on unlabeled data to learn the initial parameters of the neural network model. Then we fine-tune these parameters on the labeled data of the downstream task.
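For reference, the two training objectives and their combination as written in the paper (U is the unlabeled corpus, k the context window size, Θ the model parameters, C the labeled fine-tuning corpus, and λ the weight of the auxiliary language-modeling loss):
L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)
L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \ldots, x^m)
L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})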
The left part of the figure below shows the Transformer model structure; the right part shows the GPT model.
GPT's overall flow is shown in the right figure: the token embeddings and the position embeddings are summed as the input and passed through 12 layers of masked multi-head attention and feed-forward blocks (each with layer normalization), producing the predicted vectors; the vector of the last token will be used as the input for subsequent fine-tuning.
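To make this flow concrete, here is a minimal single-head numpy sketch of the forward pass (layer normalization, multi-head splitting, dropout, and the output projection to the vocabulary are omitted; all parameter names and dimensions are illustrative, not taken from the released code):
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(h, w_q, w_k, w_v):
    # h: (seq_len, d_model); single head for simplicity
    q, k, v = h @ w_q, h @ w_k, h @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])            # (seq_len, seq_len)
    mask = np.tril(np.ones_like(scores, dtype=bool))   # position i may only attend to j <= i
    scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v

def feed_forward(h, w1, b1, w2, b2):
    return np.maximum(0, h @ w1 + b1) @ w2 + b2        # ReLU here for brevity (GPT uses GELU)

def gpt_forward(token_ids, params):
    # Input = token embedding + position embedding
    h = params["wte"][token_ids] + params["wpe"][: len(token_ids)]
    for blk in params["blocks"]:                       # GPT stacks 12 such blocks
        h = h + masked_self_attention(h, blk["w_q"], blk["w_k"], blk["w_v"])
        h = h + feed_forward(h, blk["w1"], blk["b1"], blk["w2"], blk["b2"])
    return h                                           # h[-1] is the last token's vector, fed to the fine-tuning head

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, vocab, ctx, n_layers = 16, 50, 8, 2             # toy sizes, not the paper's
    rand = lambda *s: rng.normal(0.0, 0.02, s)
    params = {
        "wte": rand(vocab, d),
        "wpe": rand(ctx, d),
        "blocks": [{"w_q": rand(d, d), "w_k": rand(d, d), "w_v": rand(d, d),
                    "w1": rand(d, 4 * d), "b1": np.zeros(4 * d),
                    "w2": rand(4 * d, d), "b2": np.zeros(d)} for _ in range(n_layers)],
    }
    out = gpt_forward(np.array([3, 7, 1, 9]), params)
    print(out.shape)                                   # (4, 16)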
3.2. Task-specific input transformations for GPT during fine-tuning
(Left) The Transformer decoder structure and training objectives. (Right) The input transformations used for fine-tuning on different tasks. All structured inputs are converted into token sequences that are processed by the pre-trained model, followed by a linear+softmax layer.
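As a rough illustration of these transformations (the special token names below are placeholders for the learned start, delimiter, and extract tokens; they are not the identifiers used in the released code):
# Hypothetical placeholders for the learned start, delimiter and extract tokens.
START, DELIM, EXTRACT = "<start>", "<delim>", "<extract>"

def classification_input(text):
    # A single text, e.g. sentiment classification
    return [START] + text + [EXTRACT]

def entailment_input(premise, hypothesis):
    # Premise and hypothesis concatenated with a delimiter in between
    return [START] + premise + [DELIM] + hypothesis + [EXTRACT]

def similarity_inputs(text_a, text_b):
    # No inherent ordering: both orderings are encoded, and their final
    # hidden states are combined before the linear layer
    return [[START] + text_a + [DELIM] + text_b + [EXTRACT],
            [START] + text_b + [DELIM] + text_a + [EXTRACT]]

def multiple_choice_inputs(context, answers):
    # One sequence per candidate answer; each is scored independently
    # and the scores are normalized with a softmax
    return [[START] + context + [DELIM] + ans + [EXTRACT] for ans in answers]

if __name__ == "__main__":
    print(entailment_input(["a", "man", "sleeps"], ["a", "person", "rests"]))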
4. Other
4.1. Why GPT can be trained in parallel even though it is an autoregressive (AR) language model
The key is masked self-attention: the autoregressive property is realized with the masked self-attention described below.
The purpose of masking is to ensure that a position cannot see the positions to its right. For example, in the figure below the horizontal axis is the Query and the vertical axis is the Key; the hollow cells are the masked values.
Token 1 cannot see any earlier token; token 2 can see token 1; token 3 can see tokens 1 and 2; and so on. So to generate the sequence
1->2->3->4->5->6
autoregressively, we implement it with the masked self-attention below.
In this way autoregression can be computed in parallel, unlike the serial way in which an RNN or LSTM realizes autoregression.
Code implementation
import numpy as np

def test1():
    nd = 6  # number of destination (query) positions
    ns = 6  # number of source (key) positions
    i = np.arange(nd)[:, None]   # row index: query position
    j = np.arange(ns)            # column index: key position
    # True where query position i is allowed to attend to key position j;
    # with nd == ns this is simply the lower-triangular condition i >= j.
    m = i >= j - ns + nd
    print(m)

if __name__ == '__main__':
    test1()
Output:
[[ True False False False False False]
[ True True False False False False]
[ True True True False False False]
[ True True True True False False]
[ True True True True True False]
[ True True True True True True]]
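To connect this mask back to attention, here is a sketch of how such a boolean mask is typically applied to the raw attention scores before the softmax (the function and variable names are illustrative; multiplying by the mask and subtracting a large constant from the masked positions is one common way to do it):
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def apply_causal_mask(scores):
    # scores: raw attention scores of shape (nd, ns)
    nd, ns = scores.shape
    i = np.arange(nd)[:, None]
    j = np.arange(ns)
    b = (i >= j - ns + nd).astype(scores.dtype)  # same mask as test1() above
    # Keep the allowed positions and push the masked positions to a large
    # negative value so that they become ~0 after the softmax.
    return scores * b - 1e10 * (1.0 - b)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scores = rng.normal(size=(6, 6))
    probs = softmax(apply_causal_mask(scores))
    # Row k has non-zero weight only on columns 0..k, so every position of
    # the sequence is computed in parallel in a single matrix operation.
    print(np.round(probs, 2))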