[Multi-task model] Progressive Layered Extraction: A Novel Multi-Task Learning Model for Personalized (RecSys'20)

2022-08-01 20:01:00 【chad_lee】

In Tencent's video recommendation team, the modeling targets cover several different user behaviors: clicks, shares, comments, and so on. For each request, the ranking score of every candidate is computed with the formula:
$\text{score}=p_{VTR}^{w_{VTR}} \times p_{VCR}^{w_{VCR}} \times p_{SHR}^{w_{SHR}} \times \ldots \times p_{CMR}^{w_{CMR}} \times f(\text{video\_len})$
where each $w$ is a hyperparameter that expresses the relative importance of the corresponding target.
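As a rough illustration of the scoring formula, the sketch below multiplies the per-task predictions, each raised to its weight. The function name `ranking_score`, the argument `length_factor` (standing in for $f(\text{video\_len})$, whose form the post does not give), and all sample numbers are assumptions made for the example.

```python
def ranking_score(probs, weights, length_factor=1.0):
    """Compute prod_k p_k ** w_k, scaled by a precomputed f(video_len)."""
    score = length_factor
    for task, p in probs.items():
        # Each prediction is raised to its importance weight before multiplying.
        score *= p ** weights[task]
    return score

# Hypothetical predictions and weights for three targets.
probs = {"VTR": 0.5, "VCR": 0.25, "SHR": 0.04}
weights = {"VTR": 1.0, "VCR": 0.5, "SHR": 0.5}
score = ranking_score(probs, weights)
```

A weight of 0 removes a target from the ranking entirely, while a larger weight makes the score more sensitive to that target's prediction.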

The targets often have complex relationships with one another, so modeling several of them at the same time frequently produces a seesaw phenomenon, where improving one task hurts another. This is the negative transfer problem in multi-task learning.

CGC

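In theory, MMoE has an optimal state in which features are selected automatically, but reaching it depends on two things: (1) whether the gates actually learn to select; (2) whether the experts produce diverse features (if all experts output similar features, the gates have nothing to choose from).

The Customized Gate Control (CGC) proposed in this paper makes the problem somewhat easier by splitting the experts into a large shared group and small task-specific groups: there are shared experts, and each task also has its own dedicated experts. This lowers the difficulty of learning.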

In this way the experts $E_A$ are trained only by task A and the experts $E_B$ only by task B, which at least guarantees each task a baseline.

The input is $x$, and the output of task $k$ is
$y^{k}(x)=t^{k}\left(g^{k}(x)\right)$
where $t^k$ is the tower network of this task and $g^{k}(x)$ is the output of the gating network of task $k$:
$g^{k}(x)=w^{k}(x) S^{k}(x)$
where $x$ is the raw input and $w^{k}(x)$ is a weighting function that gives the weight of each expert. It is the output of a softmax:
$w^{k}(x)=\operatorname{Softmax}\left(W_{g}^{k} x\right)$
where $W_{g}^{k} \in R^{\left(m_{k}+m_{s}\right) \times d}$, $m_k$ is the number of experts specific to task $k$, $m_s$ is the number of shared experts, and $d$ is the dimension of the input. $S^{k}(x)$ is the selected matrix, formed by concatenating the output vectors of the task-$k$ experts and the shared experts:
$S^{k}(x)=\left[E_{(k, 1)}^{T}, E_{(k, 2)}^{T}, \ldots, E_{\left(k, m_{k}\right)}^{T}, E_{(s, 1)}^{T}, E_{(s, 2)}^{T}, \ldots, E_{\left(s, m_{s}\right)}^{T}\right]^{T}$
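The gating equations above can be sketched in NumPy for a single task and a single input vector. This is only an illustration under stated assumptions: the single-layer ReLU experts, the helper names (`softmax`, `cgc_gate`, `make_expert`), and the sizes `d`, `h`, `m_k`, `m_s` are all made up for the example, and the tower $t^k$ is left out.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cgc_gate(x, task_experts, shared_experts, W_g):
    """Return g^k(x) = w^k(x) S^k(x) and the gate weights w^k(x) for one task."""
    # S^k(x): task-k expert outputs first, then shared experts -> (m_k + m_s, h)
    S = np.stack([E(x) for E in task_experts + shared_experts])
    # w^k(x): softmax over the m_k + m_s selected experts -> (m_k + m_s,)
    w = softmax(W_g @ x)
    return w @ S, w

rng = np.random.default_rng(0)
d, h, m_k, m_s = 8, 4, 2, 3      # hypothetical sizes

def make_expert():
    # Assumed expert: one linear layer followed by ReLU.
    W = rng.normal(size=(h, d))
    return lambda x: np.maximum(W @ x, 0.0)

task_experts = [make_expert() for _ in range(m_k)]
shared_experts = [make_expert() for _ in range(m_s)]
W_g = rng.normal(size=(m_k + m_s, d))
x = rng.normal(size=d)

g, w = cgc_gate(x, task_experts, shared_experts, W_g)
```

Because the gate of task $k$ only sees its own experts and the shared ones, gradients from task $k$ never reach the experts of another task, which is the separation the section describes.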

PLE

Splitting off small task-specific groups brings a problem of its own: the auxiliary supervision signal from the other tasks now contributes little, because the only difference from independent models is the shared expert, whose capacity is limited. PLE therefore stacks several layers of expert networks to make the shared experts stronger.

Optimization method

Multi-objective optimization usually assigns a different weight to each subtask and sums the weighted losses:
$L\left(\theta_{1}, \ldots, \theta_{K}, \theta_{s}\right)=\sum_{k=1}^{K} \omega_{k} L_{k}\left(\theta_{k}, \theta_{s}\right)$
This paper additionally looks more closely at the fact that the tasks do not share the same training sample space.

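For example, a user can share or comment only after clicking. The paper handles this in the loss: the tasks are trained jointly, and when the loss of each task is computed, samples from the same sample space are merged and samples outside the task's own sample space are ignored, so each task still uses only the samples of its own sample space. My understanding is that within a single model update, the SHR loss and the CTR loss are not both used at once.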
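One way to read "ignore samples outside the task's own sample space" is a per-task mask on the per-sample loss. The sketch below takes that reading. It is not confirmed as the paper's exact formulation, and the name `masked_bce`, the binary cross-entropy choice, and the sample values are assumptions.

```python
import numpy as np

def masked_bce(p, y, mask):
    """Mean binary cross-entropy over the samples where mask == 1."""
    eps = 1e-12                                   # avoids log(0)
    per_sample = -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return (mask * per_sample).sum() / max(mask.sum(), 1)

# Hypothetical batch of four impressions.
click = np.array([1, 0, 1, 0])                    # CTR labels
share = np.array([1, 0, 0, 0])                    # SHR labels
p_ctr = np.array([0.9, 0.2, 0.7, 0.1])            # predicted click probability
p_shr = np.array([0.6, 0.5, 0.3, 0.5])            # predicted share probability

ctr_mask = np.ones(4)                             # every impression counts for CTR
shr_mask = click.astype(float)                    # sharing is only defined after a click

loss_ctr = masked_bce(p_ctr, click, ctr_mask)
loss_shr = masked_bce(p_shr, share, shr_mask)
```

Here the SHR prediction on unclicked impressions has no effect on `loss_shr`, so the SHR branch receives no gradient from samples outside its sample space.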

The paper also gives each task a dynamic loss weight. If task $k$ has the initial loss weight $\omega_{k, 0}$, its loss weight at epoch $t$ is:
$\omega_{k}^{(t)}=\omega_{k, 0} \times \gamma_{k}^{t}$
where $\gamma_{k}$ is the update ratio of task $k$, so the weight is rescaled by a factor of $\gamma_{k}$ at every epoch.
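A small sketch of the weight schedule combined with the weighted loss sum, assuming $\gamma_{k}^{t}$ is read as $\gamma_{k}$ raised to the power $t$. The function names and all numbers are hypothetical.

```python
def loss_weight(w0, gamma, t):
    """Weight of one task at epoch t: w0 * gamma ** t."""
    return w0 * gamma ** t

def total_loss(losses, w0, gamma, t):
    """Weighted sum of the per-task losses at epoch t."""
    return sum(loss_weight(w0[k], gamma[k], t) * losses[k] for k in losses)

# Hypothetical initial weights, update ratios, and per-task losses.
w0 = {"ctr": 1.0, "shr": 0.5}
gamma = {"ctr": 1.0, "shr": 2.0}
losses = {"ctr": 0.2, "shr": 0.1}

shr_weight_epoch3 = loss_weight(w0["shr"], gamma["shr"], 3)   # 0.5 * 2**3
combined = total_loss(losses, w0, gamma, 3)                   # 1.0 * 0.2 + 4.0 * 0.1
```

A ratio above 1 makes a task count more as training proceeds, a ratio below 1 fades it out, and a ratio of exactly 1 recovers the static weighting.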