[Multi-task model] Progressive Layered Extraction: A Novel Multi-Task Learning Model for Personalized (RecSys'20)
2022-08-01 20:01:00 【chad_lee】
Tencent's video recommendation team models several kinds of user behavior as objectives: clicks, shares, comments, and so on. For each request, the ranking score of every candidate is computed with the formula:
$$\text{score} = p_{VTR}^{w_{VTR}} \times p_{VCR}^{w_{VCR}} \times p_{SHR}^{w_{SHR}} \times \ldots \times p_{CMR}^{w_{CMR}} \times f(\text{video\_len})$$
where each $w$ is a hyperparameter that expresses the relative importance of its objective.
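As a rough illustration of the score formula, here is a minimal Python sketch. The function name `ranking_score`, the dictionary layout, and the example numbers are all made up for illustration. The post does not say what $f$ looks like, so `f` defaults to a placeholder that returns 1.

```python
def ranking_score(probs, weights, video_len, f=lambda n: 1.0):
    """Multiply each predicted probability raised to its weight, then apply f(video_len).

    probs / weights: dicts keyed by task name (e.g. "VTR", "SHR").
    f: placeholder for the video-length term; its real form is not given in the post.
    """
    score = f(video_len)
    for task, p in probs.items():
        score *= p ** weights[task]
    return score


# Hypothetical predictions and weights for one candidate video.
probs = {"VTR": 0.5, "VCR": 0.25, "SHR": 0.1}
weights = {"VTR": 1.0, "VCR": 0.5, "SHR": 0.0}

print(ranking_score(probs, weights, video_len=120))
```

A weight of 0 turns a factor into 1, so that objective drops out of the product. Larger weights make the score more sensitive to that objective.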
Multiple objectives often have complex relationships with one another, so modeling them jointly often produces the seesaw phenomenon: one task improves while another gets worse, i.e. negative transfer between tasks.
CGC
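In theory, MMOE can reach the optimum in which features are selected automatically, but this depends on two things: (1) whether the gates can actually learn to select; (2) whether the experts produce diverse features (if all experts output similar things, the gates have nothing to choose from).
The Customized Gate Control (CGC) proposed in this paper makes the problem somewhat easier by splitting the experts into what is common and what is task-specific: there is a group of shared experts, and each task also has its own dedicated experts.
This way the experts of task A (E_A) are trained only by task A and the experts of task B (E_B) only by task B, which gives each task at least a guaranteed baseline.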
The input is $x$, and the output for task $k$ is
$$y^{k}(x) = t^{k}\left(g^{k}(x)\right)$$
where $t^{k}$ is the tower network of task $k$ and $g^{k}(x)$ is the output of task $k$'s gating network:
$$g^{k}(x) = w^{k}(x)\, S^{k}(x)$$
Here $x$ is the raw input and $w^{k}(x)$ is a weighting function that gives the weight of each expert. It is the output of a softmax:
$$w^{k}(x) = \operatorname{Softmax}\left(W_{g}^{k} x\right)$$
where $W_{g}^{k} \in R^{(m_{k}+m_{s}) \times d}$, $d$ is the input dimension, and $m_{k}$ and $m_{s}$ are the numbers of task-$k$-specific experts and shared experts, respectively. $S^{k}(x)$ is the selected matrix, formed by concatenating the output vectors of all these experts:
$$S^{k}(x) = \left[E_{(k,1)}^{T}, E_{(k,2)}^{T}, \ldots, E_{(k,m_{k})}^{T}, E_{(s,1)}^{T}, E_{(s,2)}^{T}, \ldots, E_{(s,m_{s})}^{T}\right]^{T}$$
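The gating equations above can be sketched in NumPy as follows. This is only a forward-pass illustration under assumptions: each expert is a hypothetical single linear layer with ReLU, and the dimensions, expert counts, and random weights are invented. The function name `cgc_gate` is also made up here.

```python
import numpy as np


def softmax(z):
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def cgc_gate(x, task_experts, shared_experts, W_g):
    """Return g^k(x) = w^k(x) S^k(x) for one task k.

    task_experts / shared_experts: lists of callables mapping x to a vector.
    W_g: gate matrix of shape (m_k + m_s, d).
    """
    # S^k(x): one row per expert, task-specific experts first, then shared ones.
    S = np.stack([expert(x) for expert in task_experts + shared_experts])
    w = softmax(W_g @ x)     # one weight per expert, summing to 1
    return w @ S             # weighted sum of the expert outputs


rng = np.random.default_rng(0)
d, h, m_k, m_s = 8, 4, 2, 3  # input dim, expert output dim, expert counts (assumed)


def make_expert():
    # Hypothetical expert: a single linear layer followed by ReLU.
    W = rng.normal(size=(h, d))
    return lambda x: np.maximum(W @ x, 0.0)


experts_k = [make_expert() for _ in range(m_k)]
experts_s = [make_expert() for _ in range(m_s)]
W_g = rng.normal(size=(m_k + m_s, d))
x = rng.normal(size=d)

g_k = cgc_gate(x, experts_k, experts_s, W_g)
print(g_k.shape)  # (4,)
```

The tower $t^k$ would then take `g_k` as its input. A second task would call `cgc_gate` with its own experts and gate matrix but the same `experts_s`.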
PLE
Splitting off task-specific experts brings its own problem, though: the auxiliary supervision signals from the other tasks now contribute less (the only difference from fully independent models is the one group of shared experts, whose capacity is limited). PLE therefore stacks several layers of expert networks to make the shared experts stronger.
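A rough NumPy sketch of stacking such layers for two tasks is given below. The post only summarizes the architecture, so the wiring here is an assumption based on my reading of the paper: each task gate mixes that task's experts with the shared experts, and the shared branch has a gate over every expert in the layer. All names, sizes, and random weights are hypothetical.

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def gate(x, expert_outputs, W):
    """Softmax-weighted sum of expert outputs; W has one row per expert."""
    return softmax(W @ x) @ np.stack(expert_outputs)


def extraction_layer(inputs, params):
    """One layer. `inputs` maps "A", "B", "shared" to that branch's input vector."""
    # Each branch runs its own experts on its own input (linear + ReLU here).
    outs = {b: [np.maximum(W @ inputs[b], 0.0) for W in params["experts"][b]]
            for b in inputs}
    nxt = {}
    for b in ("A", "B"):
        # Task gate: the task's own experts plus the shared experts.
        nxt[b] = gate(inputs[b], outs[b] + outs["shared"], params["gates"][b])
    # Shared gate: sees every expert in the layer.
    nxt["shared"] = gate(inputs["shared"],
                         outs["A"] + outs["B"] + outs["shared"],
                         params["gates"]["shared"])
    return nxt


def init_layer(rng, d_in, d_out, n_experts=2):
    branches = ("A", "B", "shared")
    experts = {b: [rng.normal(size=(d_out, d_in)) for _ in range(n_experts)]
               for b in branches}
    gates = {"A": rng.normal(size=(2 * n_experts, d_in)),
             "B": rng.normal(size=(2 * n_experts, d_in)),
             "shared": rng.normal(size=(3 * n_experts, d_in))}
    return {"experts": experts, "gates": gates}


rng = np.random.default_rng(0)
x = rng.normal(size=8)
layers = [init_layer(rng, 8, 6), init_layer(rng, 6, 4)]

# Every branch starts from the raw input; later layers consume the gated outputs.
state = {"A": x, "B": x, "shared": x}
for params in layers:
    state = extraction_layer(state, params)

print(state["A"].shape, state["B"].shape)  # (4,) (4,)
```

Only `state["A"]` and `state["B"]` would be passed to the task towers. The shared output of the last layer is computed here for simplicity but is not needed.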
Optimization method
Multi-objective optimization typically assigns a different weight to each subtask and takes a weighted sum of the losses:
$$L(\theta_{1}, \ldots, \theta_{K}, \theta_{s}) = \sum_{k=1}^{K} \omega_{k} L_{k}(\theta_{k}, \theta_{s})$$
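This paper additionally handles the fact that the tasks do not share the same training sample space:
For example, a user can share or comment only after clicking. The paper handles this at the loss level: the tasks are trained jointly, and when computing each task's loss, samples belonging to the same sample space are pooled and samples outside the task's own sample space are ignored, so each task still trains only on samples from its own sample space. My understanding is that a single model update does not use the SHR loss and the CTR loss at the same time.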
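One way to realize "ignore the samples outside a task's own sample space" is an indicator mask on the per-sample loss. The NumPy sketch below illustrates that idea and is not the paper's code. The function name `masked_bce`, the use of binary cross-entropy, and the toy labels and predictions are all assumptions.

```python
import numpy as np


def masked_bce(y_true, y_pred, in_space):
    """Mean binary cross-entropy over the samples with in_space == 1 only."""
    y_true, y_pred, in_space = map(np.asarray, (y_true, y_pred, in_space))
    eps = 1e-12  # avoids log(0)
    per_sample = -(y_true * np.log(y_pred + eps)
                   + (1 - y_true) * np.log(1 - y_pred + eps))
    # Samples outside the task's sample space contribute nothing to the loss.
    return (per_sample * in_space).sum() / max(in_space.sum(), 1)


# Four impressions; only the first two were clicked.
clicked = np.array([1, 1, 0, 0])
shared = np.array([1, 0, 0, 0])            # share label, meaningful only after a click
p_click = np.array([0.9, 0.6, 0.2, 0.1])   # hypothetical CTR predictions
p_share = np.array([0.7, 0.2, 0.5, 0.5])   # hypothetical SHR predictions

ctr_loss = masked_bce(clicked, p_click, np.ones(4))  # CTR uses every impression
shr_loss = masked_bce(shared, p_share, clicked)      # SHR uses clicked impressions only
print(ctr_loss, shr_loss)
```

With this mask, the SHR predictions on the two unclicked impressions have no effect on `shr_loss`, so they produce no gradient for the share task.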
The paper also gives each task a dynamic loss weight. If task $k$ starts with loss weight $\omega_{k,0}$, then its loss weight at epoch $t$ is:
$$\omega_{k}^{(t)} = \omega_{k,0} \times \gamma_{k}^{t}$$
where $\gamma_{k}$ is the update ratio of task $k$, raised to the power $t$.
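The weighted loss and the epoch-dependent weights described above can be combined in a few lines of Python. The task names, initial weights, and update ratios below are hypothetical values chosen only to show the schedule.

```python
def task_weight(w0, gamma, t):
    """Loss weight of one task at epoch t: w0 * gamma ** t."""
    return w0 * gamma ** t


def total_loss(task_losses, w0, gamma, t):
    """Weighted sum of the per-task losses with epoch-dependent weights."""
    return sum(task_weight(w0[k], gamma[k], t) * loss
               for k, loss in task_losses.items())


w0 = {"CTR": 1.0, "SHR": 0.5}     # hypothetical initial weights
gamma = {"CTR": 1.0, "SHR": 1.1}  # hypothetical update ratios

for t in range(3):
    print(t, {k: round(task_weight(w0[k], gamma[k], t), 4) for k in w0})
```

A ratio of 1 keeps a task's weight constant, a ratio above 1 increases that task's influence over the epochs, and a ratio below 1 reduces it.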