
Interpretation of MAML (Model-Agnostic Meta-Learning)

2022-06-22 11:00:00 Qianyu QY

Paper: proceedings.mlr.press/v70/finn17a/finn17a.pdf

5.1 Introduction

Model-Agnostic: applicable to any model trained with gradient descent, and usable for different learning tasks (e.g., classification, regression, policy-gradient RL).

Meta-Learning: train a model on a large number of learning tasks so that it can learn a new task from only a small number of training samples (i.e., fast fine-tuning). Different tasks have different models.

The method must fuse prior experience with a small amount of new information, while avoiding overfitting to the new task.

The core of the method is to train the model's initial parameters so that the model reaches its best performance on a new task after only a few gradient-update steps on a small number of samples from that task.

5.2 Method

First, Algorithm 1 below is used to learn the initialization of the network weights; the network is then fine-tuned on the new task.
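Algorithm 1 itself is not reproduced in this copy; paraphrased from the MAML paper, it reads:

```
Algorithm 1  MAML
Require: p(τ): distribution over tasks
Require: α, β: step-size hyperparameters
 1: randomly initialize θ
 2: while not done do
 3:   sample a batch of tasks τ_i ~ p(τ)
 4:   for all τ_i do
 5:     evaluate ∇_θ L_{τ_i}(f_θ) with K samples of τ_i
 6:     compute adapted parameters: θ'_i = θ − α ∇_θ L_{τ_i}(f_θ)
 7:   end for
 8:   update θ ← θ − β ∇_θ Σ_{τ_i ~ p(τ)} L_{τ_i}(f_{θ'_i})
 9: end while
```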

Interpreting Algorithm 1:

The purpose of MAML: learn initialization weights for the network so that training for only one or a few steps on a new task already yields good results.

Analogy with ordinary deep learning: the training tasks in Algorithm 1 play the role that the training set plays in ordinary deep learning, and the new task on which the initialized weights are fine-tuned plays the role of the test set.

Formally: let $\theta$ be the network's initialization weights. On a new task $\tau_i$, one gradient-update step on the task's training set yields the weights $\theta'_i$. With the updated weights $\theta'_i$, compute the loss $\mathcal{L}_i(\theta'_i)$ on the test set of $\tau_i$. MAML aims to minimize the sum of these losses over the different new tasks $\tau_i$:
$$\min_{\theta} \sum_{\tau_i \sim p(\tau)} \mathcal{L}_i(\theta'_i)$$
Taking this sum as the total loss, perform gradient descent on the network weights $\theta$:
$$\theta \leftarrow \theta - \beta \nabla_{\theta} \sum_{\tau_i \sim p(\tau)} \mathcal{L}_i(\theta'_i) = \theta - \beta \sum_{\tau_i \sim p(\tau)} \nabla_{\theta} \mathcal{L}_i(\theta'_i)$$
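To make the two nested gradients concrete, here is a minimal sketch with a one-parameter linear model, where the inner step and the exact meta-gradient (including the second-derivative term) can be written in closed form. The task family, learning rates, and function names are illustrative assumptions, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task():
    # Hypothetical toy task family: regress y = a * x, slope a varies per task.
    a = rng.uniform(1.0, 3.0)
    xs, xq = rng.normal(size=10), rng.normal(size=10)
    return (xs, a * xs), (xq, a * xq)   # (support set, query set)

def grad(w, x, y):
    # d/dw of the loss mean((w*x - y)^2)
    return 2.0 * np.mean(x * (w * x - y))

def hess(x):
    # d^2/dw^2 of the same loss; constant in w for a linear model
    return 2.0 * np.mean(x * x)

def maml_outer_step(w, tasks, alpha=0.05, beta=0.01):
    # One meta-update: theta <- theta - beta * sum_i grad_theta L_i(theta'_i)
    meta_grad = 0.0
    for (xs, ys), (xq, yq) in tasks:
        w_i = w - alpha * grad(w, xs, ys)            # inner step on support set
        # chain rule: dL_i(w_i)/dw = (dw_i/dw) * dL_i/dw_i,
        # where dw_i/dw = 1 - alpha * (second derivative of the support loss)
        meta_grad += (1.0 - alpha * hess(xs)) * grad(w_i, xq, yq)
    return w - beta * meta_grad

w = 0.0
for _ in range(200):
    w = maml_outer_step(w, [make_task() for _ in range(5)])
```

After meta-training, one inner step on a new task's support set should yield a much lower query loss than adapting from an arbitrary initialization.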
Computing $\nabla_{\theta} \mathcal{L}_i (\theta'_i)$:

Borrowing the notation of Hung-yi Lee's lecture notes, let $\phi = \theta$ and $\hat\theta = \theta'_i$, so that $\nabla_{\theta} \mathcal{L}_i (\theta'_i) = \nabla_{\phi} l(\hat\theta)$. $\nabla_{\phi} l(\hat\theta)$ can be decomposed as follows.
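The decomposition (shown as a slide image in the lecture and not reproduced here) is presumably the componentwise chain rule:

$$\frac{\partial l(\hat\theta)}{\partial \phi_i} = \sum_j \frac{\partial \hat\theta_j}{\partial \phi_i} \, \frac{\partial l(\hat\theta)}{\partial \hat\theta_j}$$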

Here $\hat{\theta}$ is computed from $\phi$ as follows.
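With inner learning rate $\varepsilon$ (the $\alpha$ of Algorithm 1), the one-step update is presumably:

$$\hat\theta = \phi - \varepsilon \nabla_{\phi} l(\phi), \qquad \hat\theta_j = \phi_j - \varepsilon \frac{\partial l(\phi)}{\partial \phi_j}$$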

Each partial derivative in $\nabla_{\phi} l(\hat\theta)$ is computed by the following formula.
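Differentiating the one-step update $\hat\theta_j = \phi_j - \varepsilon \, \partial l(\phi)/\partial \phi_j$ (with inner learning rate $\varepsilon$) gives each entry, a reconstruction of the formula referenced above:

$$\frac{\partial \hat\theta_j}{\partial \phi_i} = \begin{cases} 1 - \varepsilon \dfrac{\partial^2 l(\phi)}{\partial \phi_i \, \partial \phi_j}, & i = j \\ -\varepsilon \dfrac{\partial^2 l(\phi)}{\partial \phi_i \, \partial \phi_j}, & i \neq j \end{cases}$$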

Computing the second derivatives is very time-consuming, so the MAML paper proposes a first-order approximation: assume the second derivatives are 0, which simplifies the formula as follows.
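Under this assumption, the derivative entries above collapse to an identity map, a reconstruction of the simplified formula:

$$\frac{\partial \hat\theta_j}{\partial \phi_i} \approx \begin{cases} 1, & i = j \\ 0, & i \neq j \end{cases} \quad\Rightarrow\quad \frac{\partial l(\hat\theta)}{\partial \phi_i} \approx \frac{\partial l(\hat\theta)}{\partial \hat\theta_i}$$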

After this simplification, $\nabla_{\phi} l(\hat\theta) \rightarrow \nabla_{\hat\theta} l(\hat\theta)$, and the original gradient-descent update becomes:
$$\theta \leftarrow \theta - \beta \sum_{\tau_i \sim p(\tau)} \nabla_{\theta'_i} \mathcal{L}_i (\theta'_i)$$
That is, the gradient is computed directly at each updated $\theta'_i$ and then applied to the pre-update $\theta$.
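Continuing the one-parameter toy model from before (again with illustrative task family and hyperparameters), the first-order variant simply drops the $(1 - \alpha \cdot \text{Hessian})$ factor from the meta-gradient:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_task():
    # Same illustrative task family: y = a * x with a per-task slope a.
    a = rng.uniform(1.0, 3.0)
    xs, xq = rng.normal(size=10), rng.normal(size=10)
    return (xs, a * xs), (xq, a * xq)   # (support set, query set)

def grad(w, x, y):
    # d/dw of the loss mean((w*x - y)^2)
    return 2.0 * np.mean(x * (w * x - y))

def fomaml_outer_step(w, tasks, alpha=0.05, beta=0.01):
    # First-order approximation: the gradient evaluated at theta'_i is
    # applied to theta directly; no second-derivative factor appears.
    meta_grad = 0.0
    for (xs, ys), (xq, yq) in tasks:
        w_i = w - alpha * grad(w, xs, ys)   # inner step on support set
        meta_grad += grad(w_i, xq, yq)      # gradient w.r.t. theta'_i only
    return w - beta * meta_grad

w = 0.0
for _ in range(200):
    w = fomaml_outer_step(w, [make_task() for _ in range(5)])
```

On this toy family the first-order update still drives $w$ toward an initialization from which one inner step adapts well, at the cost of one fewer derivative per task.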

Questions:

1. Why sample many tasks randomly, in a loop, for learning?

Answer: Constructing enough distinct tasks lets the network train thoroughly, so that when it faces a new task only a few update steps are needed to reach good results.

2. Why do the first and the second gradient computation use different samples from the same task, i.e., the support set and the query set?

Answer: The former is the task's training set, used to compute $\theta'_i$; the latter is the task's test set, used to compute the loss.
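A minimal sketch of such a split for a single task (array shapes and variable names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# One task's samples: the support set drives the inner update that produces
# theta'_i, while the query set is held out to compute the outer (meta) loss.
x = rng.normal(size=(20, 8))       # 20 samples with 8 features
y = rng.integers(0, 5, size=20)    # 5-way classification labels

perm = rng.permutation(20)
support_idx, query_idx = perm[:5], perm[5:]   # e.g., 5 support samples
x_support, y_support = x[support_idx], y[support_idx]
x_query, y_query = x[query_idx], y[query_idx]
```

Keeping the two sets disjoint means $\theta'_i$ is scored on samples it never adapted on, so the meta-update rewards within-task generalization rather than memorization of the support samples.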

3. What is the advantage over first pre-training on many tasks (computing the gradient only once each time) and then fine-tuning on the new task?

Answer: Pre-training optimizes the network's performance across all tasks simultaneously; when this single optimal model is fine-tuned on a new task, it may get stuck in a local optimum. MAML instead optimizes the model's performance after a few training steps on a new task, i.e., it considers the future optimum, so it does not become optimal on some tasks at the cost of falling into suboptima on others.

For more details, see:

https://zhuanlan.zhihu.com/p/57864886

https://www.bilibili.com/video/BV1w4411872t?p=7&vd_source=383540c0e1a6565a222833cc51962ed9
