[Multi-task optimization] DWA, DTP, GradNorm (CVPR 2019, ECCV 2018, ICML 2018)
2022-08-01 20:00:00 【chad_lee】
Optimization of multi-task learning models
With multiple tasks come multiple losses. A common MTL model loss simply sums the per-task losses directly:
$$L = \sum_i L_i$$
This approach has an obvious problem: different tasks have different label distributions, and their loss magnitudes also differ, so the whole model can easily be dominated by whichever task has the largest loss. The simplest fix is a weighted loss with manually designed weights:
$$L = \sum_i w_i \cdot L_i$$
But this weight stays fixed over the entire training run, while the appropriate weights may differ across training stages. Dynamic weighting is:
$$L = \sum_i w_i(t, \theta) \cdot L_i$$
Here $t$ is the training step and $\theta$ denotes the model's other parameters. However, this approach is not necessarily better than manually designed weights.
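As a minimal sketch of how such a weighted objective is assembled in practice (the per-task losses and the surrounding training loop are assumed, and the function name is illustrative):

```python
import torch

def weighted_mtl_loss(task_losses, weights):
    """Combine per-task losses into one objective: L = sum_i w_i * L_i.

    task_losses: list of scalar loss tensors, one per task
    weights:     list of floats; fixed once for manual weighting, or
                 recomputed every step for dynamic schemes w_i(t, theta)
    """
    return sum(w * l for w, l in zip(weights, task_losses))
```

The methods below differ only in how `weights` is recomputed at each step.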
Some methods for designing $w_i(t, \theta)$:
《End-to-End Multi-Task Learning with Attention》 CVPR 2019
Dynamic Weight Averaging (DWA), proposed in《End-to-End Multi-Task Learning with Attention》(CVPR 2019), is defined by the following core formulas:
$$r_n(t-1) = \frac{L_n(t-1)}{L_n(t-2)}, \qquad w_i(t) = \frac{N \exp\left(r_i(t-1) / T\right)}{\sum_n \exp\left(r_n(t-1) / T\right)}$$
$L_n(t-1)$ is the training loss of task $n$ at step $t-1$, so $r_n(t-1)$ measures how fast that loss is falling: the smaller $r_n(t-1)$, the faster the task is training (it has started to converge; loss = 0 would mean it is finished).
$w_i(t)$ is the weight of each task's loss. Intuitively, the faster a task's loss converges, the smaller its weight; how evenly the weights are spread is controlled by the temperature coefficient $T$ (a larger $T$ makes the weights more uniform).
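A minimal sketch of the DWA weight computation, assuming the caller keeps a per-task history of scalar loss values (the `loss_history` structure is illustrative; the paper uses $T = 2$):

```python
import math

def dwa_weights(loss_history, T=2.0):
    """Dynamic Weight Averaging: w_i(t) = N * exp(r_i(t-1)/T) / sum_n exp(r_n(t-1)/T).

    loss_history: dict mapping task name -> list of past scalar losses
    T:            temperature; larger T spreads the weights more evenly
    Returns a dict of weights summing to N (the number of tasks).
    """
    N = len(loss_history)
    # r_n(t-1) = L_n(t-1) / L_n(t-2); use 1.0 until two past losses exist
    r = {task: (h[-1] / h[-2] if len(h) >= 2 else 1.0)
         for task, h in loss_history.items()}
    denom = sum(math.exp(v / T) for v in r.values())
    return {task: N * math.exp(v / T) / denom for task, v in r.items()}
```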
《Dynamic task prioritization for multitask learning》 ECCV 2018
DTP (Dynamic Task Prioritization):
$$w_i(t) = -\left(1 - k_i(t)\right)^{\gamma_i} \log\left(k_i(t)\right)$$
$k_i(t)$ is a KPI measuring task $i$ at step $t$, taking values between 0 and 1; in a classification task, for example, the KPI could be the accuracy on the training set. It reflects how well the model fits that task, and $\gamma_i$ is a manually tuned coefficient. The intuition resembles focal loss: the better a task is already doing, the smaller the weight it receives.
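A minimal sketch of the DTP weight for a single task (the clamping epsilon is an added safeguard, not part of the paper's formula):

```python
import math

def dtp_weight(kpi, gamma=1.0, eps=1e-6):
    """Dynamic Task Prioritization: w_i(t) = -(1 - k_i(t))^gamma * log(k_i(t)).

    kpi:   performance measure in (0, 1), e.g. training-set accuracy
    gamma: manually tuned exponent
    """
    kpi = min(max(kpi, eps), 1.0 - eps)  # keep log() finite
    return -((1.0 - kpi) ** gamma) * math.log(kpi)
```

As with focal loss, a task whose KPI is close to 1 gets a weight close to 0, so training effort shifts toward the harder tasks.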
《Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks》 ICML 2018
The most influential method is GradNorm. Its core idea is more involved than DWA and DTP above; the key points are:

- Not only should the speed of loss convergence be considered; the magnitudes of the losses themselves should also be pushed as close together as possible.
- Different tasks should train at similar speeds (which relates to their gradients).
Starting from the parameterized dynamic weighting $L = \sum_i w_i(t, \theta) \cdot L_i$, the authors additionally define a gradient loss on the training weights $w_i(t, \theta)$ (that is, a loss used to optimize the weights of the training losses).
$w_i(t)$ is initialized to 1 (or a hyperparameter) at the start and then optimized with the gradient loss.
First, take the 2-norm of the gradient for task $i$ at step $t$, along with the average over all tasks:
$$G_W^{(i)}(t) = \left\| \nabla_W \left( w_i(t) \, L_i(t) \right) \right\|_2, \qquad \bar{G}_W(t) = \mathrm{AVG}\left( G_W^{(i)}(t) \right)$$
Here $W$ is a subset of the model parameters, the parameter set to which gradient normalization is applied; typically the last shared-parameter layer of the model is chosen. Then compute the training speed of each task's loss:
$$\tilde{L}_i(t) = L_i(t) / L_i(0), \qquad r_i(t) = \tilde{L}_i(t) / \mathrm{AVG}\left( \tilde{L}_i(t) \right)$$
$r_i(t)$ measures the training speed of task $i$: the larger $r_i(t)$, the more slowly the task is training. This is close in spirit to DWA, except that the ratio is taken against the loss at the first step rather than the previous step as in DWA.
The final gradient loss is:
$$L_{\mathrm{grad}}\left(t ; w_i(t)\right) = \sum_i \left| G_W^{(i)}(t) - \bar{G}_W(t) \cdot \left[ r_i(t) \right]^\alpha \right|_1$$
- $\bar{G}_W(t) \cdot \left[r_i(t)\right]^\alpha$ is the ideal gradient magnitude after normalization. **The gradient loss here is used only to update $w_i(t)$.** After the update, the weights are renormalized so that $\sum_i w_i(t) = N$, where $N$ is the number of tasks.
- $\alpha$ is a hyperparameter that sets the strength of the restoring force, i.e., how strongly each task's training speed is pulled back toward the average. If the tasks differ greatly in complexity, so that their learning dynamics diverge widely, a larger $\alpha$ should be used for stronger rate balancing; conversely, for several similar tasks a smaller $\alpha$ is preferable.
- From the definition of the gradient loss: the larger $r_i(t)$ (i.e., the slower the task is training), the larger its target gradient norm $\bar{G}_W(t)\left[r_i(t)\right]^\alpha$, and thus the larger the gradient loss until the task catches up. Meanwhile, $\left| G_W^{(i)}(t) - \bar{G}_W(t) \right|$ reflects the gap in loss magnitude: whether $G_W^{(i)}(t)$ is too large or too small, the gradient loss increases.
- So the gradient loss wants: (1) the loss magnitudes of different tasks to be close; and (2) different tasks to train at similar (convergence) speeds.
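A minimal PyTorch sketch of one GradNorm update, assuming `w` is a learnable weight vector and `W_shared` is the parameter list of the last shared layer; the names and the learning rate `lr_w` are illustrative, not the paper's reference code:

```python
import torch

def gradnorm_step(task_losses, w, W_shared, initial_losses, alpha=1.5, lr_w=0.025):
    """One update of the task weights w via the gradient loss L_grad.

    task_losses:    list of scalar loss tensors L_i(t) from the current batch
    w:              torch.nn.Parameter of shape [N], the task weights w_i(t)
    W_shared:       list of shared-layer parameters (the set W)
    initial_losses: list of scalar values L_i(0) recorded at the first step
    """
    N = len(task_losses)
    # G_W^(i)(t) = || grad_W( w_i(t) * L_i(t) ) ||_2, kept differentiable w.r.t. w
    G = []
    for i, L_i in enumerate(task_losses):
        grads = torch.autograd.grad(w[i] * L_i, W_shared,
                                    retain_graph=True, create_graph=True)
        G.append(torch.norm(torch.cat([g.flatten() for g in grads]), 2))
    G = torch.stack(G)

    # r_i(t): relative inverse training rate (plain constants, no gradient flows here)
    L_tilde = torch.tensor([L.item() / L0 for L, L0 in zip(task_losses, initial_losses)])
    r = L_tilde / L_tilde.mean()

    # L_grad = sum_i | G_i - G_bar * r_i^alpha |; the target is treated as a constant
    target = (G.mean() * r ** alpha).detach()
    L_grad = (G - target).abs().sum()

    # the gradient loss updates only w, not the model parameters
    w_grad = torch.autograd.grad(L_grad, w)[0]
    with torch.no_grad():
        w -= lr_w * w_grad
        w *= N / w.sum()  # renormalize so that sum_i w_i(t) = N
    return L_grad.item()
```

In a full training loop, this step runs alongside the usual backward pass on $\sum_i w_i(t) L_i(t)$ that updates the model parameters themselves.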