Prefix-Tuning: Optimizing Continuous Prompts for Generation
2022-07-28 03:41:00 【HDU-Dade】
Reference resources
Prefix-Tuning: Optimizing Continuous Prompts for Generation — an explanation by the paper's author
In-context Learning

Advantages
- Just write a different prompt for each task; no task-specific training is required.
Disadvantages
- It cannot exploit very large training sets: GPT-3 has a bounded context window that holds only a limited number of tokens, so when the training set is longer than the context window, in-context learning cannot make full use of it.
- Prompts have to be written by hand, and hand-written prompts may not be optimal.
- It does not extend well from GPT-3 to smaller models.
Prefix-tuning

- Freeze the pretrained language model and optimize only the prefix; each task stores just its own very small prefix, so the cost stays low as the number of tasks grows (see the rough parameter counts after this list).
- The prefix is trained rather than manually specified.
- In-context learning only works with very large models; prefix-tuning extends prompting to smaller models.
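To make the storage argument concrete, here is a rough back-of-the-envelope count with assumed numbers (prefix length 10 and a GPT-2-medium-sized model; the exact figures in the paper differ, but the task-specific share is on the order of 0.1%):

```python
# Rough, assumed numbers (not taken from the paper's tables) to illustrate per-task storage:
# a prefix stores one key and one value vector per layer and prefix position.
n_layers, hidden, prefix_len = 24, 1024, 10            # roughly GPT-2 medium sized
prefix_params = 2 * n_layers * prefix_len * hidden     # 491,520 parameters per task
full_model_params = 345_000_000                        # approximate GPT-2 medium size
print(f"one task: {prefix_params:,} prefix parameters "
      f"({prefix_params / full_model_params:.2%} of the full model)")
print(f"100 tasks: {100 * prefix_params:,} prefix parameters "
      f"vs {100 * full_model_params:,} for 100 fine-tuned copies")
```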
Related Work

Tuning the top k layers
Tuning the top k layers is a common practice when fine-tuning large models; usually k equals 1 or 2. Even so, roughly 20% of the parameters end up being tuned, because the language-model head, which contains many parameters, must also be tuned. A minimal sketch follows.
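A minimal sketch of top-k tuning, assuming the HuggingFace transformers GPT-2 classes (illustrative only, not the paper's code):

```python
# Freeze everything, then unfreeze only the last k Transformer blocks and the LM head.
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
k = 2

for param in model.parameters():
    param.requires_grad = False

for block in model.transformer.h[-k:]:          # top k Transformer blocks
    for param in block.parameters():
        param.requires_grad = True

# GPT-2 ties lm_head.weight to the input embedding matrix, so unfreezing the head
# also makes the (large) embedding matrix trainable.
for param in model.lm_head.parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} of {total:,} ({trainable / total:.1%})")
```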
Adapter-tuning (also known as lightweight fine-tuning)
Another effective way to adapt the language model to downstream tasks: freeze the pretrained parameters and insert small trainable MLP (adapter) layers between the layers of the LM, as sketched below.
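A minimal adapter-block sketch, assuming a PyTorch bottleneck MLP with a residual connection (this illustrates the general idea of adapter-tuning; the dimensions are illustrative):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small trainable bottleneck MLP inserted after a frozen Transformer sub-layer."""
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)   # project down
        self.up = nn.Linear(bottleneck_dim, hidden_dim)     # project back up
        self.act = nn.ReLU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # The residual connection keeps the frozen model's behaviour when the adapter output is small.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

adapter = Adapter(hidden_dim=768)
x = torch.randn(2, 16, 768)        # (batch, seq_len, hidden)
print(adapter(x).shape)            # torch.Size([2, 16, 768])
```

Only the adapter parameters are trained; the surrounding LM layers stay frozen.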
Prefix-tuning: intuition

Three increasingly expressive options:
- Optimize discrete instructions (prompt tokens)
- Optimize continuous word embeddings
- Optimize the prefix activations of all layers
Fine-tuning

Concatenate $x$ and $y$ to obtain $z = [x; y]$. An autoregressive LM computes the activation vector $h_i$ at each time step, so $h_i$ is a function of $z_i$ and the activations in its left context.
The training objective is to maximize the sum of the log-probabilities of the tokens in the output $y$ (written out below).
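Written out, this is the paper's log-likelihood objective, where $\phi$ are the pretrained LM parameters and $Y_{idx}$ denotes the positions of $y$ within $z$:

```latex
\max_{\phi} \; \log p_{\phi}(y \mid x) \;=\; \sum_{i \in Y_{idx}} \log p_{\phi}(z_i \mid h_{<i})
```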

Tasks:
- Table-to-text: the input $x$ is a linearized table and the output $y$ is a short textual description (a sketch of such a linearization follows this list);
- Autoregressive model: at time step $i$, the hidden state vectors of all Transformer layers are concatenated as $h_i$ and used to predict the next token;
- An encoder-decoder framework can also be adopted overall (e.g., for summarization).
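For illustration, a hypothetical linearization helper (the function name and record format are assumptions, loosely in the style of the E2E table-to-text data) might look like this:

```python
def linearize_table(table: dict) -> str:
    """Flatten attribute-value pairs into the single string fed to the LM as x."""
    return " | ".join(f"{key}: {value}" for key, value in table.items())

x = linearize_table({"name": "Starbucks", "type": "coffee shop", "area": "city centre"})
# x == "name: Starbucks | type: coffee shop | area: city centre"
y = "Starbucks is a coffee shop in the city centre."   # one possible target description
```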
Instead of optimizing discrete tokens, we can optimize the prompt as continuous word embeddings, whose effect propagates upward to all Transformer activation layers and rightward to subsequent tokens. This is more expressive than a discrete prompt, which must match the embedding of some real word. At the same time, it is less expressive than intervening on the activations of all layers, which avoids long-range dependencies and includes more tunable parameters. Prefix-Tuning therefore optimizes the activations of all layers corresponding to the prefix.
With a prefix added, the autoregressive model becomes $z = [\text{PREFIX}; x; y]$, and the encoder-decoder model becomes $z = [\text{PREFIX}; x; \text{PREFIX}'; y]$;
The position indices of the prefix, $x$, and $y$ parts are denoted $P_{idx}$, $X_{idx}$, and $Y_{idx}$, respectively;
Prefix-tuning initializes a trainable matrix $P_\theta \in \mathbb{R}^{|P_{idx}| \times \dim(h_i)}$, i.e., prefix length by the dimension of the activation vector $h_i$, which stores the prefix parameters:
- For tokens in the prefix part, the activation is taken directly from the trainable matrix: $h_i = P_\theta[i,:]$ for $i \in P_{idx}$;
- For all other tokens, $h_i$ is computed by the pretrained language model, whose parameters remain fixed.
A minimal sketch of this setup is given below.
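Below is a minimal illustrative sketch, assuming HuggingFace transformers and GPT-2, of how a trainable prefix can be fed to a frozen LM as key/value activations via past_key_values. This is not the authors' released code, and the exact past_key_values format may differ across library versions:

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

class PrefixEncoder(nn.Module):
    """Trainable matrix P_theta: one key and one value vector per layer and prefix position."""
    def __init__(self, prefix_len, n_layers, n_heads, head_dim):
        super().__init__()
        self.n_layers = n_layers
        # 2 * n_layers because each layer needs both a key prefix and a value prefix.
        self.prefix = nn.Parameter(
            torch.randn(2 * n_layers, n_heads, prefix_len, head_dim) * 0.02)

    def forward(self, batch_size):
        # Broadcast over the batch: each entry is (batch, n_heads, prefix_len, head_dim).
        p = self.prefix.unsqueeze(1).expand(-1, batch_size, -1, -1, -1)
        return tuple((p[2 * i], p[2 * i + 1]) for i in range(self.n_layers))

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
for param in model.parameters():              # freeze the pretrained LM
    param.requires_grad = False

cfg = model.config
prefix_len = 10
prefix_encoder = PrefixEncoder(prefix_len, cfg.n_layer, cfg.n_head, cfg.n_embd // cfg.n_head)

inputs = tokenizer("name: Starbucks | type: coffee shop", return_tensors="pt")
past = prefix_encoder(batch_size=inputs["input_ids"].size(0))
# The attention mask must also cover the prefix positions.
mask = torch.cat([torch.ones(1, prefix_len, dtype=torch.long), inputs["attention_mask"]], dim=1)
out = model(input_ids=inputs["input_ids"], attention_mask=mask,
            past_key_values=past, labels=inputs["input_ids"])
out.loss.backward()                           # gradients flow only into prefix_encoder.prefix
```

In practice the paper also reparametrizes $P_\theta$ through a smaller matrix and an MLP for training stability, which this sketch omits.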
Results (table-to-text)

table-to-text
Prefix-tuning performs better than adapter-tuning and fine-tuning.





Application: Personalization

