
Interpretation of MAML (Model-Agnostic Meta-Learning)

2022-06-22 11:00:00 Qianyu QY

Paper: proceedings.mlr.press/v70/finn17a/finn17a.pdf

5.1 Introduction

Model-Agnostic: applicable to any model trained with gradient descent, and usable for different learning tasks (e.g., classification, regression, policy-gradient RL).

Meta-Learning: train a model on a large number of learning tasks so that it can learn a new task from only a small number of training samples (i.e., fast fine-tuning). Different tasks have different models.

The method must fuse prior experience with a small amount of new information, while avoiding overfitting to the new task.

The core of the method is to train the model's initial parameters so that the model reaches its best performance on a new task after only a few gradient-update steps on a small number of samples from that task.

5.2 Method

First, Algorithm 1 below is used to learn the initialization of the network weights; the network is then fine-tuned on the new task.
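Algorithm 1 itself is not reproduced in this copy; paraphrased from the MAML paper, it reads:

```
Algorithm 1  MAML
Require: p(τ): distribution over tasks
Require: α, β: step-size hyperparameters
 1: randomly initialize θ
 2: while not done do
 3:   sample a batch of tasks τ_i ~ p(τ)
 4:   for all τ_i do
 5:     evaluate ∇_θ L_{τ_i}(f_θ) with K samples of τ_i
 6:     compute adapted parameters: θ'_i = θ − α ∇_θ L_{τ_i}(f_θ)
 7:   end for
 8:   update θ ← θ − β ∇_θ Σ_{τ_i ~ p(τ)} L_{τ_i}(f_{θ'_i})
 9: end while
```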

Interpreting Algorithm 1:

The purpose of MAML: learn initialization weights for the network so that training for only one or a few steps on a new task already yields good results.

Analogy with ordinary deep learning: the training tasks in Algorithm 1 play the role that the training set plays in ordinary deep learning, and the new task on which the initialized weights are fine-tuned plays the role of the test set.

Formally: let $\theta$ be the network's initialization weights. On a new task $\tau_i$, one gradient-update step on the task's training set yields the weights $\theta'_i$. With the updated weights $\theta'_i$, compute the loss $\mathcal{L}_i(\theta'_i)$ on the test set of $\tau_i$. MAML aims to minimize the sum of these losses over the different new tasks $\tau_i$:
$$\min_{\theta} \sum_{\tau_i \sim p(\tau)} \mathcal{L}_i(\theta'_i)$$
Taking this sum as the total loss, perform gradient descent on the network weights $\theta$:
$$\theta \leftarrow \theta - \beta \nabla_{\theta} \sum_{\tau_i \sim p(\tau)} \mathcal{L}_i(\theta'_i) = \theta - \beta \sum_{\tau_i \sim p(\tau)} \nabla_{\theta} \mathcal{L}_i(\theta'_i)$$
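To make the two nested gradients concrete, here is a minimal sketch with a one-parameter linear model, where the inner step and the exact meta-gradient (including the second-derivative term) can be written in closed form. The task family, learning rates, and function names are illustrative assumptions, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task():
    # Hypothetical toy task family: regress y = a * x, slope a varies per task.
    a = rng.uniform(1.0, 3.0)
    xs, xq = rng.normal(size=10), rng.normal(size=10)
    return (xs, a * xs), (xq, a * xq)   # (support set, query set)

def grad(w, x, y):
    # d/dw of the loss mean((w*x - y)^2)
    return 2.0 * np.mean(x * (w * x - y))

def hess(x):
    # d^2/dw^2 of the same loss; constant in w for a linear model
    return 2.0 * np.mean(x * x)

def maml_outer_step(w, tasks, alpha=0.05, beta=0.01):
    # One meta-update: theta <- theta - beta * sum_i grad_theta L_i(theta'_i)
    meta_grad = 0.0
    for (xs, ys), (xq, yq) in tasks:
        w_i = w - alpha * grad(w, xs, ys)            # inner step on support set
        # chain rule: dL_i(w_i)/dw = (dw_i/dw) * dL_i/dw_i,
        # where dw_i/dw = 1 - alpha * (second derivative of the support loss)
        meta_grad += (1.0 - alpha * hess(xs)) * grad(w_i, xq, yq)
    return w - beta * meta_grad

w = 0.0
for _ in range(200):
    w = maml_outer_step(w, [make_task() for _ in range(5)])
```

After meta-training, one inner step on a new task's support set should yield a much lower query loss than adapting from an arbitrary initialization.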
Computing $\nabla_{\theta} \mathcal{L}_i (\theta'_i)$:

Borrowing the notation of Hung-yi Lee's lecture notes, let $\phi = \theta$ and $\hat\theta = \theta'_i$, so that $\nabla_{\theta} \mathcal{L}_i (\theta'_i) = \nabla_{\phi} l(\hat\theta)$. $\nabla_{\phi} l(\hat\theta)$ can be decomposed as follows.
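The decomposition (shown as a slide image in the lecture and not reproduced here) is presumably the componentwise chain rule:

$$\frac{\partial l(\hat\theta)}{\partial \phi_i} = \sum_j \frac{\partial \hat\theta_j}{\partial \phi_i} \, \frac{\partial l(\hat\theta)}{\partial \hat\theta_j}$$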

Here $\hat{\theta}$ is computed from $\phi$ as follows.
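With inner learning rate $\varepsilon$ (the $\alpha$ of Algorithm 1), the one-step update is presumably:

$$\hat\theta = \phi - \varepsilon \nabla_{\phi} l(\phi), \qquad \hat\theta_j = \phi_j - \varepsilon \frac{\partial l(\phi)}{\partial \phi_j}$$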

Each partial derivative in $\nabla_{\phi} l(\hat\theta)$ is computed by the following formula.
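Differentiating the one-step update $\hat\theta_j = \phi_j - \varepsilon \, \partial l(\phi)/\partial \phi_j$ (with inner learning rate $\varepsilon$) gives each entry, a reconstruction of the formula referenced above:

$$\frac{\partial \hat\theta_j}{\partial \phi_i} = \begin{cases} 1 - \varepsilon \dfrac{\partial^2 l(\phi)}{\partial \phi_i \, \partial \phi_j}, & i = j \\ -\varepsilon \dfrac{\partial^2 l(\phi)}{\partial \phi_i \, \partial \phi_j}, & i \neq j \end{cases}$$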

Computing the second derivatives is very time-consuming, so the MAML paper proposes a first-order approximation: assume the second derivatives are 0, which simplifies the formula as follows.
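Under this assumption, the derivative entries above collapse to an identity map, a reconstruction of the simplified formula:

$$\frac{\partial \hat\theta_j}{\partial \phi_i} \approx \begin{cases} 1, & i = j \\ 0, & i \neq j \end{cases} \quad\Rightarrow\quad \frac{\partial l(\hat\theta)}{\partial \phi_i} \approx \frac{\partial l(\hat\theta)}{\partial \hat\theta_i}$$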

After this simplification, $\nabla_{\phi} l(\hat\theta) \rightarrow \nabla_{\hat\theta} l(\hat\theta)$, and the original gradient-descent update becomes:
$$\theta \leftarrow \theta - \beta \sum_{\tau_i \sim p(\tau)} \nabla_{\theta'_i} \mathcal{L}_i (\theta'_i)$$
That is, the gradient is computed directly at each updated $\theta'_i$ and then applied to the pre-update $\theta$.
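Continuing the one-parameter toy model from before (again with illustrative task family and hyperparameters), the first-order variant simply drops the $(1 - \alpha \cdot \text{Hessian})$ factor from the meta-gradient:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_task():
    # Same illustrative task family: y = a * x with a per-task slope a.
    a = rng.uniform(1.0, 3.0)
    xs, xq = rng.normal(size=10), rng.normal(size=10)
    return (xs, a * xs), (xq, a * xq)   # (support set, query set)

def grad(w, x, y):
    # d/dw of the loss mean((w*x - y)^2)
    return 2.0 * np.mean(x * (w * x - y))

def fomaml_outer_step(w, tasks, alpha=0.05, beta=0.01):
    # First-order approximation: the gradient evaluated at theta'_i is
    # applied to theta directly; no second-derivative factor appears.
    meta_grad = 0.0
    for (xs, ys), (xq, yq) in tasks:
        w_i = w - alpha * grad(w, xs, ys)   # inner step on support set
        meta_grad += grad(w_i, xq, yq)      # gradient w.r.t. theta'_i only
    return w - beta * meta_grad

w = 0.0
for _ in range(200):
    w = fomaml_outer_step(w, [make_task() for _ in range(5)])
```

On this toy family the first-order update still drives $w$ toward an initialization from which one inner step adapts well, at the cost of one fewer derivative per task.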

Questions:

1. Why sample many tasks randomly, in a loop, for learning?

Answer: Constructing enough distinct tasks lets the network train thoroughly, so that when it faces a new task only a few update steps are needed to reach good results.

2. Why do the first and the second gradient computation use different samples from the same task, i.e., the support set and the query set?

Answer: The former is the task's training set, used to compute $\theta'_i$; the latter is the task's test set, used to compute the loss.
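A minimal sketch of such a split for a single task (array shapes and variable names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# One task's samples: the support set drives the inner update that produces
# theta'_i, while the query set is held out to compute the outer (meta) loss.
x = rng.normal(size=(20, 8))       # 20 samples with 8 features
y = rng.integers(0, 5, size=20)    # 5-way classification labels

perm = rng.permutation(20)
support_idx, query_idx = perm[:5], perm[5:]   # e.g., 5 support samples
x_support, y_support = x[support_idx], y[support_idx]
x_query, y_query = x[query_idx], y[query_idx]
```

Keeping the two sets disjoint means $\theta'_i$ is scored on samples it never adapted on, so the meta-update rewards within-task generalization rather than memorization of the support samples.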

3. What is the advantage over first pre-training on many tasks (computing the gradient only once each time) and then fine-tuning on the new task?

Answer: Pre-training optimizes the network's performance across all tasks simultaneously; when this single optimal model is fine-tuned on a new task, it may get stuck in a local optimum. MAML instead optimizes the model's performance after a few training steps on a new task, i.e., it considers the future optimum, so it does not become optimal on some tasks at the cost of falling into suboptima on others.

For more details, see:

https://zhuanlan.zhihu.com/p/57864886

https://www.bilibili.com/video/BV1w4411872t?p=7&vd_source=383540c0e1a6565a222833cc51962ed9
