GPT (Improving Language Understanding by Generative Pre-Training) paper notes
2022-06-30 09:37:00 【A grain of sand in the vast sea of people】
Contents
1. Brief introduction to the paper
2. Differences and connections between GPT and ELMo
3. Contributions and major improvements
3.1. Semi-supervised learning with the Transformer
3.2. Task-specific input transformations for GPT during fine-tuning
Paper: GPT: Improving Language Understanding by Generative Pre-Training
Code: GPT
1. Brief introduction to the paper
GPT is short for "Generative Pre-Training", i.e. generative pre-training. GPT adopts a two-stage process: in the first stage a language model is used for pre-training, and in the second stage the model is fine-tuned to solve downstream tasks. The following figure shows GPT's pre-training process.
2. Differences and connections between GPT and ELMo
(1) Similarity: like ELMo, GPT is a two-stage model.
(2) Differences: first, the feature extractor is no longer an RNN but a Transformer, whose feature-extraction ability is stronger than an RNN's; second, although GPT's pre-training still takes language modeling as the target task, it is unidirectional (left-to-right), whereas ELMo combines forward and backward language models.
3. Contributions and major improvements
3.1. Semi-supervised learning with the Transformer
Why
The problem with supervised learning (here, supervised learning means training the model only on large amounts of manually labeled data):
- Most deep learning methods require large amounts of manually labeled data, but in practice we do not have that much labeled data, which limits their applicability in many domains.
- Models that can learn linguistic information from unlabeled data provide a valuable alternative to collecting more annotations, which can be time-consuming and expensive.
- Even when considerable supervision is available, learning good representations in an unsupervised fashion can provide a significant performance boost. So far, the most compelling evidence for this is the widespread use of pre-trained word embeddings to improve performance on a range of NLP tasks.
How
We use a two-stage training procedure. First, we use a language modeling objective on unlabeled data to learn the initial parameters of the neural network model. Then we fine-tune these parameters on the labeled data of the downstream task.
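For reference, the two training objectives and their combination as written in the paper (U is the unlabeled corpus, k the context window size, Θ the model parameters, C the labeled fine-tuning corpus, and λ the weight of the auxiliary language-modeling loss):
L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)
L_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x^1, \ldots, x^m)
L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})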
The left part of the figure below shows the Transformer model structure; the right part shows the GPT model.
GPT's overall flow is shown in the right figure: the token embeddings and the position embeddings are summed as the input and passed through 12 layers of masked multi-head attention and feed-forward blocks (each with layer normalization), producing the predicted vectors; the vector of the last token will be used as the input for subsequent fine-tuning.
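To make this flow concrete, here is a minimal single-head numpy sketch of the forward pass (layer normalization, multi-head splitting, dropout, and the output projection to the vocabulary are omitted; all parameter names and dimensions are illustrative, not taken from the released code):
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(h, w_q, w_k, w_v):
    # h: (seq_len, d_model); single head for simplicity
    q, k, v = h @ w_q, h @ w_k, h @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])            # (seq_len, seq_len)
    mask = np.tril(np.ones_like(scores, dtype=bool))   # position i may only attend to j <= i
    scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v

def feed_forward(h, w1, b1, w2, b2):
    return np.maximum(0, h @ w1 + b1) @ w2 + b2        # ReLU here for brevity (GPT uses GELU)

def gpt_forward(token_ids, params):
    # Input = token embedding + position embedding
    h = params["wte"][token_ids] + params["wpe"][: len(token_ids)]
    for blk in params["blocks"]:                       # GPT stacks 12 such blocks
        h = h + masked_self_attention(h, blk["w_q"], blk["w_k"], blk["w_v"])
        h = h + feed_forward(h, blk["w1"], blk["b1"], blk["w2"], blk["b2"])
    return h                                           # h[-1] is the last token's vector, fed to the fine-tuning head

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, vocab, ctx, n_layers = 16, 50, 8, 2             # toy sizes, not the paper's
    rand = lambda *s: rng.normal(0.0, 0.02, s)
    params = {
        "wte": rand(vocab, d),
        "wpe": rand(ctx, d),
        "blocks": [{"w_q": rand(d, d), "w_k": rand(d, d), "w_v": rand(d, d),
                    "w1": rand(d, 4 * d), "b1": np.zeros(4 * d),
                    "w2": rand(4 * d, d), "b2": np.zeros(d)} for _ in range(n_layers)],
    }
    out = gpt_forward(np.array([3, 7, 1, 9]), params)
    print(out.shape)                                   # (4, 16)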
3.2. Task-specific input transformations for GPT during fine-tuning
(Left) The Transformer decoder structure and training objectives. (Right) The input transformations used for fine-tuning on different tasks. All structured inputs are converted into token sequences that are processed by the pre-trained model, followed by a linear+softmax layer.
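As a rough illustration of these transformations (the special token names below are placeholders for the learned start, delimiter, and extract tokens; they are not the identifiers used in the released code):
# Hypothetical placeholders for the learned start, delimiter and extract tokens.
START, DELIM, EXTRACT = "<start>", "<delim>", "<extract>"

def classification_input(text):
    # A single text, e.g. sentiment classification
    return [START] + text + [EXTRACT]

def entailment_input(premise, hypothesis):
    # Premise and hypothesis concatenated with a delimiter in between
    return [START] + premise + [DELIM] + hypothesis + [EXTRACT]

def similarity_inputs(text_a, text_b):
    # No inherent ordering: both orderings are encoded, and their final
    # hidden states are combined before the linear layer
    return [[START] + text_a + [DELIM] + text_b + [EXTRACT],
            [START] + text_b + [DELIM] + text_a + [EXTRACT]]

def multiple_choice_inputs(context, answers):
    # One sequence per candidate answer; each is scored independently
    # and the scores are normalized with a softmax
    return [[START] + context + [DELIM] + ans + [EXTRACT] for ans in answers]

if __name__ == "__main__":
    print(entailment_input(["a", "man", "sleeps"], ["a", "person", "rests"]))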
4. Other
4.1. Why GPT can be trained in parallel even though it is an autoregressive (AR) language model
The key is masked self-attention: the autoregressive property is realized with the masked self-attention described below.
The purpose of masking is to ensure that a position cannot see the positions to its right. For example, in the figure below the horizontal axis is the Query and the vertical axis is the Key; the hollow cells are the masked values.
Token 1 cannot see any earlier token; token 2 can see token 1; token 3 can see tokens 1 and 2; and so on. So to generate the sequence
1->2->3->4->5->6
autoregressively, we implement it with the masked self-attention below.
In this way autoregression can be computed in parallel, unlike the serial way in which an RNN or LSTM realizes autoregression.
Code implementation
import numpy as np

def test1():
    nd = 6  # number of destination (query) positions
    ns = 6  # number of source (key) positions
    i = np.arange(nd)[:, None]   # row index: query position
    j = np.arange(ns)            # column index: key position
    # True where query position i is allowed to attend to key position j;
    # with nd == ns this is simply the lower-triangular condition i >= j.
    m = i >= j - ns + nd
    print(m)

if __name__ == '__main__':
    test1()
Output:
[[ True False False False False False]
[ True True False False False False]
[ True True True False False False]
[ True True True True False False]
[ True True True True True False]
[ True True True True True True]]
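To connect this mask back to attention, here is a sketch of how such a boolean mask is typically applied to the raw attention scores before the softmax (the function and variable names are illustrative; multiplying by the mask and subtracting a large constant from the masked positions is one common way to do it):
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def apply_causal_mask(scores):
    # scores: raw attention scores of shape (nd, ns)
    nd, ns = scores.shape
    i = np.arange(nd)[:, None]
    j = np.arange(ns)
    b = (i >= j - ns + nd).astype(scores.dtype)  # same mask as test1() above
    # Keep the allowed positions and push the masked positions to a large
    # negative value so that they become ~0 after the softmax.
    return scores * b - 1e10 * (1.0 - b)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scores = rng.normal(size=(6, 6))
    probs = softmax(apply_causal_mask(scores))
    # Row k has non-zero weight only on columns 0..k, so every position of
    # the sequence is computed in parallel in a single matrix operation.
    print(np.round(probs, 2))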