GPT (Improving Language Understanding by Generative Pre-Training) paper notes
2022-06-30 09:37:00 【A grain of sand in the vast sea of people】
Catalog
1. Brief introduction of the paper
2. Differences and connections between GPT and ELMo
3. Contributions and major improvements
3.1. Semi-supervised learning with Transformer
3.2. Transforming GPT's input for different tasks in the fine-tuning phase
Paper: GPT: Improving Language Understanding by Generative Pre-Training
Code: GPT
1. Brief introduction of the paper
GPT is short for "Generative Pre-Training", i.e., generative pre-training. GPT adopts a two-stage process: in the first stage, a language model is pre-trained; in the second stage, the model is fine-tuned to solve downstream tasks. The figure below shows GPT's pre-training process.
2. Differences and connections between GPT and ELMo
(1) Similarity: GPT and ELMo are both two-stage models.
(2) Differences: First, the feature extractor is no longer an RNN but a Transformer, whose feature-extraction ability is stronger than an RNN's. Second, although GPT's pre-training still uses language modeling as the target task, it is unidirectional: each position can only attend to its left context, whereas ELMo combines a forward and a backward language model to capture context from both directions.
3. Contributions and major improvements
3.1. Semi-supervised learning with Transformer
Why
The problem with purely supervised learning (here, supervised learning means training a model only on large amounts of manually labeled data):
- Most deep learning methods require large amounts of manually labeled data, but in practice we do not have that much, which limits their applicability in many domains.
- Much linguistic information can be learned from unlabeled data with unsupervised learning; collecting manually labeled data to learn it instead is time-consuming and expensive.
- Even when plenty of supervision is available, learning good representations in an unsupervised way can provide a significant performance boost. So far, the most convincing evidence is the widespread use of pre-trained word embeddings to improve performance on a range of NLP tasks.
How
We use a two-stage training method. First, we use a language-modeling objective to learn the initial parameters of the neural network from unlabeled data. Then, we fine-tune the model on labeled data from the downstream task.
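For reference, the paper's two objectives can be written as follows (restated in the paper's notation: $\mathcal{U}$ is the unlabeled corpus, $k$ the context window size, $\mathcal{C}$ the labeled downstream dataset, and $\lambda$ the weight on the auxiliary language-modeling loss):

$$L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)$$

$$L_2(\mathcal{C}) = \sum_{(x,y)} \log P(y \mid x^1, \ldots, x^m)$$

$$L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})$$

Pre-training maximizes $L_1$; fine-tuning maximizes $L_3$, which keeps language modeling as an auxiliary objective to improve generalization and speed up convergence.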
The figure below shows the Transformer model structure on the left and the GPT model on the right.

The overall GPT process is shown in the figure on the right: the token embeddings and position embeddings are summed as the input, which then passes through 12 layers of Masked Multi-Head Attention and Feed Forward sublayers (with Layer Norm in between), yielding the predicted vectors; the vector of the last token serves as the input for subsequent fine-tuning.
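To make this data flow concrete, below is a minimal PyTorch sketch of the stack just described. It is not the original implementation: the 12 layers, 768-dimensional states, 12 heads, and 4x feed-forward width follow the paper, while vocab_size and max_len are illustrative placeholders, and PyTorch's stock TransformerEncoderLayer stands in for the paper's decoder block.

import torch
import torch.nn as nn

class MiniGPT(nn.Module):
    def __init__(self, vocab_size=40000, max_len=512,
                 d_model=768, n_heads=12, n_layers=12):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)  # token embedding
        self.pos_emb = nn.Embedding(max_len, d_model)     # learned position embedding
        block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            activation='gelu', batch_first=True)
        self.blocks = nn.TransformerEncoder(block, n_layers)
        self.head = nn.Linear(d_model, vocab_size)        # next-token prediction

    def forward(self, ids):  # ids: (batch, seq_len) token indices
        seq_len = ids.size(1)
        pos = torch.arange(seq_len, device=ids.device)
        x = self.tok_emb(ids) + self.pos_emb(pos)         # sum of the two embeddings
        # causal mask: position i may only attend to positions <= i
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(ids.device)
        h = self.blocks(x, mask=mask)                     # 12 masked self-attention layers
        return self.head(h)                               # logits over the vocabulary

For example, MiniGPT()(torch.randint(0, 40000, (2, 16))) returns logits of shape (2, 16, 40000); the hidden state of the last token is what fine-tuning builds on.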
3.2. Transforming GPT's input for different tasks in the fine-tuning phase

(Left) Transformer decoder structure and training objectives. (Right) The input transformations used for fine-tuning on different tasks. All structured inputs are converted into token sequences processed by the pre-trained model, followed by a linear + softmax layer.
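As a toy illustration of these transformations: each task's input is linearized into one (or several) token sequences bracketed by special tokens. The START, DELIM, and EXTRACT strings below are illustrative stand-ins, not the actual tokens from the released code.

START, DELIM, EXTRACT = "<s>", "$", "<e>"

def classification(text):
    # a single sequence: start token, text, extract token
    return [START, text, EXTRACT]

def entailment(premise, hypothesis):
    # premise and hypothesis joined by a delimiter token
    return [START, premise, DELIM, hypothesis, EXTRACT]

def similarity(text_a, text_b):
    # no inherent ordering, so both orderings are processed and their
    # final representations are added before the linear layer
    return [[START, text_a, DELIM, text_b, EXTRACT],
            [START, text_b, DELIM, text_a, EXTRACT]]

def multiple_choice(context, answers):
    # one sequence per candidate answer; a softmax over the
    # per-sequence scores selects the answer
    return [[START, context, DELIM, a, EXTRACT] for a in answers]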
4. Other
4.1. Why can GPT be trained in parallel, even though it is an autoregressive (AR) language model?
The key is Masked Self-Attention: autoregression is realized with the masked self-attention described below.
The purpose of masking is to ensure that a position cannot see the positions to its right. For example, in the figure below, the horizontal axis is the Query and the vertical axis is the Key; the hollow cells are the masked values.
At position 1, the model can see nothing before it; at position 2, it can see 1; at position 3, it can see 1 and 2; and so on. To realize the generation order
1 -> 2 -> 3 -> 4 -> 5 -> 6
we implement it with the masked self-attention below. In this way, autoregression can be trained in parallel, unlike an RNN or LSTM, which realizes autoregression serially.

Code implementation

import numpy as np

def test1():
    # nd: number of destination (query) positions; ns: number of source (key) positions
    nd = 6
    ns = 6
    i = np.arange(nd)[:, None]  # column vector of query indices
    j = np.arange(ns)           # row vector of key indices
    # m[i][j] is True when query position i may attend to key position j;
    # with nd == ns this is a lower-triangular causal mask
    m = i >= j - ns + nd
    print(m)

if __name__ == '__main__':
    test1()
Output results
[[ True False False False False False]
[ True True False False False False]
[ True True True False False False]
[ True True True True False False]
[ True True True True True False]
[ True True True True True True]]
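A brief note on how such a boolean mask is typically applied (a sketch, not the paper's code): before the softmax, the attention scores at masked positions are replaced by a large negative number, so those positions receive (near-)zero attention weight.

import numpy as np

def masked_softmax(scores, mask):
    # scores: (nd, ns) raw attention logits; mask: boolean, True = may attend
    scores = np.where(mask, scores, -1e9)       # masked positions -> large negative
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)    # each row sums to 1 over visible keys

For example, with the mask m printed above and all-zero scores, row i of masked_softmax(np.zeros((6, 6)), m) is uniform over keys 0..i.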