[NLP] Pre-trained Models - GPT-1
2022-07-01 13:46:00 【Coriander Chrysanthemum】
Background
Without further ado, here are the links to the three papers: GPT1: Improving Language Understanding by Generative Pre-Training, GPT2: Language Models are Unsupervised Multitask Learners, GPT3: Language Models are Few-Shot Learners. Teacher Li Mu also has a video on Bilibili introducing the GPT models: GPT, GPT-2, GPT-3 Paper Reading [Intensive Reading].
First, let's lay out the timeline of language models after the Transformer appeared: Transformer: 2017/06, GPT-1: 2018/06, BERT: 2018/10, GPT-2: 2019/02, GPT-3: 2020/05. Judging by the order in which these models appeared, the competition was indeed fierce.
GPT-1
We know that there are many tasks in NLP, such as question answering, semantic similarity, and text classification. Although unlabeled text corpora are abundant, labeled data for learning these specific tasks is scarce, which makes it difficult to train task-specific models. The GPT-1 paper shows that generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task, yields strong results on these tasks. In contrast to previous approaches, GPT-1 uses task-aware input transformations during fine-tuning to achieve effective transfer while requiring minimal changes to the model architecture. Here we can think of word2vec: although it is also a pre-trained model, we still have to build a task-specific neural network on top of it, whereas GPT-1 does not need one. The effectiveness of this approach is demonstrated on a wide range of benchmarks for natural language understanding: the general task-agnostic model outperforms discriminatively trained models that use architectures specifically crafted for each task, significantly improving the state of the art on 9 of the 12 tasks studied.
Before this, the most successful pre-trained model was of course word2vec. The paper raises two major questions: 1. What kind of objective should be optimized on a given unlabeled corpus? There are candidate objectives such as language modeling and machine translation, but no single objective performs best across all tasks. 2. How can the learned text representations be effectively transferred to downstream tasks?
GPT-1's approach is semi-supervised: learn a language model on a large unlabeled corpus, then fine-tune it on downstream tasks. At that point, the better-known choices for a language model were RNNs and the Transformer. Compared with an RNN, the features learned by a Transformer are more robust. As the paper puts it: "This model choice provides us with a more structured memory for handling long-term dependencies in text, compared to alternatives like recurrent networks, resulting in robust transfer performance across diverse tasks." During transfer it uses task-specific input representations: "During transfer, we utilize task-specific input adaptations derived from traversal-style approaches, which process structured text input as a single contiguous sequence of tokens."
Model structure
Training consists of two stages. The first stage learns a high-capacity language model on a large corpus. This is followed by a fine-tuning stage, in which the model is adapted to a discriminative task using labeled data.
Unsupervised pre-training
Given an unsupervised corpus of tokens $\mathcal{U} = \{ u_1,\cdots,u_n \}$, a standard language modeling objective is used to maximize the following likelihood:
$$L_{1}(\mathcal{U})=\sum_{i} \log P\left(u_{i} \mid u_{i-k}, \ldots, u_{i-1} ; \Theta\right)$$
where $\Theta$ denotes the parameters of the GPT-1 model, and the preceding $k$ tokens are used to predict the probability of the $i$-th token; $k$ is the context window size, a hyperparameter. Summing over every position in the corpus gives the objective $L_1$. The $\log$ is used so that probabilities are added rather than multiplied, which would otherwise shrink toward zero. In other words, the model maximizes the likelihood of the corpus.
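To make the objective concrete, here is a minimal sketch (my own illustration, not code from the paper) of computing $L_1$ for a single sequence, assuming some model has already produced next-token logits. The function name `lm_objective` and all tensor shapes are invented for the example:

```python
import torch
import torch.nn.functional as F

def lm_objective(logits, tokens):
    """Compute sum_i log P(u_i | previous tokens) for one sequence.

    logits: (seq_len, vocab_size), where logits[i] predicts token i+1
    tokens: (seq_len,) token ids of the sequence
    Returns the summed log-likelihood (the quantity to maximize).
    """
    log_probs = F.log_softmax(logits[:-1], dim=-1)            # predictions for positions 1..seq_len-1
    targets = tokens[1:]                                       # the tokens actually being predicted
    return log_probs.gather(-1, targets.unsqueeze(-1)).sum()   # sum of log P(u_i | context)

# toy usage: random logits for a 10-token sequence over a 50-word vocabulary
logits = torch.randn(10, 50)
tokens = torch.randint(0, 50, (10,))
print(lm_objective(logits, tokens))
```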
The paper uses a multi-layer Transformer decoder. Because of the mask in the Transformer decoder, each position can only attend to the tokens before it when extracting features; the attention weights on the following tokens are effectively set to 0.
For prediction, let $U = (u_{-k},\cdots,u_{-1})$ be the context vector of tokens, let $n$ be the number of transformer decoder layers, let $W_e$ be the token embedding matrix, and let $W_p$ be the position embedding matrix. Predicting the next word given the context $U$ then proceeds as follows:
$$h_0 = UW_e+W_p$$
$$h_l = \mathrm{transformer\_block}(h_{l-1}), \quad \forall l\in[1, n]$$
$$P(u)=\mathrm{softmax}(h_nW_e^{T})$$
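These three formulas map directly onto a decoder-only forward pass. Below is a minimal PyTorch sketch, assuming each transformer block is approximated by `nn.TransformerEncoderLayer` with a causal attention mask; the class name `MiniGPT` and all sizes are illustrative, not the paper's actual configuration:

```python
import torch
import torch.nn as nn

class MiniGPT(nn.Module):
    """Decoder-only sketch of h0 = U We + Wp, h_l = block(h_{l-1}), P(u) = softmax(h_n We^T)."""
    def __init__(self, vocab_size=1000, d_model=128, n_layers=2, n_heads=4, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)   # W_e
        self.pos_emb = nn.Embedding(max_len, d_model)      # W_p
        # an encoder layer plus a causal mask behaves like a decoder-only block
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
            for _ in range(n_layers)
        ])

    def forward(self, tokens):                              # tokens: (batch, seq_len)
        seq_len = tokens.size(1)
        pos = torch.arange(seq_len, device=tokens.device)
        h = self.tok_emb(tokens) + self.pos_emb(pos)        # h_0 = U W_e + W_p
        # causal mask: -inf above the diagonal so each position only sees earlier tokens
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf"),
                                       device=tokens.device), diagonal=1)
        for block in self.blocks:                           # h_l = transformer_block(h_{l-1})
            h = block(h, src_mask=causal)
        logits = h @ self.tok_emb.weight.T                  # h_n W_e^T (output tied to the embedding)
        return logits.softmax(dim=-1)                       # P(u)

model = MiniGPT()
probs = model(torch.randint(0, 1000, (1, 16)))              # next-token distribution at each position
print(probs.shape)                                          # torch.Size([1, 16, 1000])
```

Note how the output projection reuses the embedding weights, matching the $h_n W_e^{T}$ term in the formula.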
Supervised fine-tuning (fine-tuning)
After training a model with the objective $L_1$, we can adapt its parameters to a supervised task. This trained model is what we call the pre-trained model. Suppose we have a labeled corpus $\mathcal{C}$, where each sample is a sequence of input tokens $x^1,\cdots,x^m$ with a corresponding label $y$. The input is passed through our pre-trained model to obtain the final transformer block's activation $h_l^m$. This $h_l^m$ is then fed into a linear output layer, added according to the task, to predict $y$, as in the following formula:
$$P\left(y \mid x^{1}, \ldots, x^{m}\right)=\operatorname{softmax}\left(h_{l}^{m} W_{y}\right)$$
The objective of the fine-tuning model is as follows:
$$L_{2}(\mathcal{C})=\sum_{(x, y)} \log P\left(y \mid x^{1}, \ldots, x^{m}\right)$$
The paper also notes that adding the language modeling objective $L_1$ (predicting the next word in each sequence) as an auxiliary objective during fine-tuning works better, giving the following combined objective:
$$L_{3}(\mathcal{C})=L_{2}(\mathcal{C})+\lambda * L_{1}(\mathcal{C})$$
where $\lambda$ is a hyperparameter.
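As a rough illustration of how $L_2$, the auxiliary $L_1$, and the combined $L_3$ fit together during fine-tuning, here is a sketch that uses random tensors to stand in for the pre-trained model's outputs. All shapes, the value of $\lambda$, and the variable names are invented for the example; note that `cross_entropy` is the negative of the log-likelihoods above, so minimizing it maximizes them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# hypothetical sizes: a batch of 8 labeled sequences of length 16, 1000-token vocab, 3 classes
batch, seq_len, d_model, vocab, n_classes = 8, 16, 128, 1000, 3
lam = 0.5                                           # the lambda hyperparameter in L3

W_y = nn.Linear(d_model, n_classes, bias=False)     # the new task-specific parameters

# stand-ins for what the pre-trained model would produce on the labeled corpus C
h = torch.randn(batch, seq_len, d_model)            # transformer activations h_l
lm_logits = torch.randn(batch, seq_len - 1, vocab)  # next-token logits from the LM head
tokens = torch.randint(0, vocab, (batch, seq_len))
labels = torch.randint(0, n_classes, (batch,))

# L2: classification loss from the last token's activation h_l^m through W_y
cls_logits = W_y(h[:, -1])                           # corresponds to softmax(h_l^m W_y)
L2 = F.cross_entropy(cls_logits, labels)             # negative mean log P(y | x^1..x^m)

# L1: auxiliary language-modeling loss on the same sequences
L1 = F.cross_entropy(lm_logits.reshape(-1, vocab), tokens[:, 1:].reshape(-1))

L3 = L2 + lam * L1                                   # combined fine-tuning objective
print(L3)
```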
Overall, the only extra parameters needed during fine-tuning are $W_y$ (together with embeddings for the delimiter tokens). So how is the input of each NLP task represented so that it can be processed by the pre-trained model? Look at this figure first:
Take the text classification task as an example: add the specified delimiter tokens, feed the sequence into the Transformer, and follow it with a linear layer. The second task is Entailment, i.e. the NLI task we introduced earlier. The other tasks are spliced together as shown in the figure above.
In general, by constructing different input formats for the pre-trained model, multiple NLP tasks can be handled while the pre-trained model itself remains unchanged.
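To make the input construction concrete, here is a toy sketch of how sequences for classification, entailment, and similarity might be assembled. The token ids for the start/delim/extract tokens and the `encode` tokenizer are placeholders made up for illustration; GPT-1 actually uses byte-pair encoding and learned special-token embeddings:

```python
# hypothetical ids for the start, delimiter, and extract tokens
START, DELIM, EXTRACT = 1, 2, 3

def encode(text):
    # placeholder tokenizer for the sketch; GPT-1 really uses byte-pair encoding
    return [10 + (ord(c) % 90) for c in text]

def classification_input(text):
    # Start + Text + Extract
    return [START] + encode(text) + [EXTRACT]

def entailment_input(premise, hypothesis):
    # Start + Premise + Delim + Hypothesis + Extract
    return [START] + encode(premise) + [DELIM] + encode(hypothesis) + [EXTRACT]

def similarity_inputs(a, b):
    # similarity has no inherent ordering, so both orders are produced
    # and their final activations are combined before the linear layer
    return entailment_input(a, b), entailment_input(b, a)

print(entailment_input("it is raining", "the ground is wet"))
```

In every format, the activation at the final extract token is what gets fed into the linear layer $W_y$.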