GPT (improving language understanding generative pre training) paper notes
2022-06-30 09:37:00 【A grain of sand in the vast sea of people】
Contents
1. Brief introduction of the paper
2. Differences and connections between GPT and ELMo
3. Contributions and major improvements
3.1. Semi-supervised learning with the Transformer
3.2. Transforming the input to GPT for different tasks in the fine-tuning phase
Paper: GPT: Improving Language Understanding by Generative Pre-Training
Code: GPT
1. Brief introduction of the paper
GPT is short for "Generative Pre-Training", i.e. generative pre-training. GPT adopts a two-stage process: in the first stage a language model is pre-trained on unlabeled text, and in the second stage the model is fine-tuned to solve downstream tasks. The following figure shows the pre-training process of GPT.
2. Differences and connections between GPT and ELMo
(1) Similarity: like ELMo, GPT is a two-stage model (pre-training followed by task-specific adaptation).
(2) Differences: first, the feature extractor is not an RNN but a Transformer, whose feature-extraction ability is stronger than that of an RNN; second, although GPT still uses a language model as the pre-training objective, it is a unidirectional (left-to-right) language model, whereas ELMo combines representations from both directions.
3. Contributions and major improvements
3.1. Semi-supervised learning with the Transformer
Why
The problem with supervised learning (here, supervised learning means training a model only on large amounts of manually labeled data):
- Most deep learning methods require large amounts of manually labeled data, but in practice such data is scarce, which limits their applicability in many domains.
- Much of the linguistic information a model needs can be learned from unlabeled text; obtaining it only through manual annotation is time-consuming and expensive.
- Even when plenty of supervision is available, learning good representations in an unsupervised way can provide a significant performance boost. So far, the most convincing evidence is the widespread use of pre-trained word embeddings to improve performance on a range of NLP tasks.
How
We use a two-stage training procedure. First, a language-modeling objective is used on unlabeled data to learn the initial parameters of the neural network. Then the model is fine-tuned on the labeled data of the downstream task.
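For reference, the two training objectives from the paper, plus the combined fine-tuning objective that keeps language modeling as an auxiliary loss (written in LaTeX notation):
Pre-training: L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)
Fine-tuning:  L_2(\mathcal{C}) = \sum_{(x,y)} \log P(y \mid x^1, \ldots, x^m)
Combined:     L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})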
In the figure below, the left part is the Transformer model structure diagram and the right part is the GPT model.

The overall GPT flow is shown in the right-hand figure: the token embeddings and the position embeddings are summed and used as input, then passed through 12 layers of masked multi-head attention and feed-forward networks (with layer normalization in between), producing a predicted vector at each position. The vector at the last position is used as the input for subsequent fine-tuning.
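To make the shapes concrete, here is a minimal NumPy sketch of this flow. The attention and feed-forward blocks are left as identity placeholders, and the sizes, layer-norm placement, and weight tying are simplifications for illustration rather than the paper's exact configuration.
import numpy as np

n_layer, n_ctx, d_model, vocab = 12, 6, 64, 1000   # illustrative sizes, not the paper's

wte = np.random.randn(vocab, d_model) * 0.02       # token embedding table
wpe = np.random.randn(n_ctx, d_model) * 0.02       # learned position embeddings

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def masked_attention_block(x):
    # placeholder for masked multi-head self-attention (identity here, shapes only)
    return x

def feed_forward_block(x):
    # placeholder for the position-wise feed-forward network (identity here)
    return x

def gpt_forward(token_ids):
    # input representation: token embedding + position embedding
    h = wte[token_ids] + wpe[np.arange(len(token_ids))]
    # 12 decoder blocks: masked attention and feed forward, with residuals and layer norm
    for _ in range(n_layer):
        h = h + masked_attention_block(layer_norm(h))
        h = h + feed_forward_block(layer_norm(h))
    logits = layer_norm(h) @ wte.T   # project each position back onto the vocabulary
    return h, logits                 # h[-1] (last position) feeds the fine-tuning head

h, logits = gpt_forward(np.array([1, 2, 3, 4, 5, 0]))
print(h.shape, logits.shape)         # (6, 64) (6, 1000)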
3.2. Transforming the input to GPT for different tasks in the fine-tuning phase

(Left) Transformer decoder structure and training objectives. (Right) The input transformations used to fine-tune on different tasks. All structured inputs are converted into token sequences processed by the pre-trained model, followed by a linear + softmax layer.
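As a hedged illustration of these transformations (the token names below are placeholders; in the paper the start, delimiter, and extract tokens are randomly initialized learned embeddings, not literal strings), each task's structured input is serialized into a single token sequence:
# Placeholder special tokens standing in for the learned start / delimiter / extract embeddings
START, DELIM, EXTRACT = "<s>", "$", "<e>"

def classification_input(text):
    # single text: <s> text <e>; the hidden state at <e> feeds the linear + softmax layer
    return [START] + text + [EXTRACT]

def entailment_input(premise, hypothesis):
    # text pair: <s> premise $ hypothesis <e>
    return [START] + premise + [DELIM] + hypothesis + [EXTRACT]

def similarity_inputs(text_a, text_b):
    # similarity has no inherent order, so both orderings are encoded and their states combined
    return [entailment_input(text_a, text_b), entailment_input(text_b, text_a)]

print(entailment_input(["a", "man", "runs"], ["someone", "is", "moving"]))
# ['<s>', 'a', 'man', 'runs', '$', 'someone', 'is', 'moving', '<e>']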
4. Other
4.1. Why GPT can be parallelized even though it is an autoregressive language model (AR model)
The key lies in masked self-attention: the autoregressive property is realized through the masked self-attention described below.
The purpose of masking is to ensure that a position on the left cannot see positions to its right. Take the following figure as an example: the horizontal axis is the Query, the vertical axis is the Key, and the hollow cells are the masked values.
When the input is 1, it cannot see anything earlier; when the input is 2, it can see 1; when the input is 3, it can see 1 and 2; and so on. The ordering we want to realize,
1->2->3->4->5->6
is achieved with the masked self-attention below. In this way, autoregression can be computed in parallel, unlike an RNN or LSTM, which realizes autoregression serially.

Code implementation
import numpy as np

def test1():
    nd = 6  # number of destination (query) positions
    ns = 6  # number of source (key) positions
    i = np.arange(nd)[:, None]   # column vector of query indices
    j = np.arange(ns)            # row vector of key indices
    # lower-triangular causal mask: position i may attend to position j only if j <= i
    m = i >= j - ns + nd
    print(m)

if __name__ == '__main__':
    test1()
Output results
[[ True False False False False False]
 [ True  True False False False False]
 [ True  True  True False False False]
 [ True  True  True  True False False]
 [ True  True  True  True  True False]
 [ True  True  True  True  True  True]]
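To see how this boolean mask enters the attention computation, here is a small follow-up sketch (not from the original post, and the helper name is hypothetical): masked positions in the score matrix are pushed to a large negative value before the softmax, so each position attends only to itself and earlier positions, while all rows are still computed in one parallel matrix operation.
import numpy as np

def causal_attention_weights(q, k):
    nd, ns = q.shape[0], k.shape[0]
    scores = q @ k.T / np.sqrt(q.shape[-1])          # raw attention scores, all positions at once
    mask = np.arange(nd)[:, None] >= np.arange(ns) - ns + nd
    scores = np.where(mask, scores, -1e9)            # hide future positions
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    return weights / weights.sum(-1, keepdims=True)  # row-wise softmax

np.random.seed(0)
x = np.random.randn(6, 8)                            # toy queries/keys: 6 positions, dim 8
print(np.round(causal_attention_weights(x, x), 2))   # upper triangle is (numerically) zero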