当前位置:网站首页>[Multi-task learning] Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts KDD18
[Multi-task learning] Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts KDD18
2022-08-01 20:01:00 【chad_lee】
Understand at the model level,We often spend a lot of energy on a single goal“Find strong features”和“Remove redundant features”输入到模型,提高模型效果.那么切换到MTL时,每个task所需要的“强特”and exclusionary“Negative”是不同的,MTLThe purpose is for eachtask Find their strong and negative specials as much as possible.
Understand at the optimization level,多个task同时优化模型,某些taskwill dominate the optimization process of the model,drowned out the otherstask.
Understand from the perspective of supervisory signals,MTL不仅仅是任务,It is also a data augmentation,相当于每个task多了k-1A supervisory signal to aid learning,Some features can be derived from otherstask学的更好.Monitor the quality of the signal andtasksimilarity between them,不相似的taskInstead, it's noise.
#SB、MOE、MMOE
《Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts》Google KDD 2018
share-bottom
Multiple tasks share the same bottom network,The bottom network outputs the feature vector of a sample,Each subtask picks up a small by itselfNN tower.
优点:简单,并且模型过拟合的风险小(Because it is not easy to overfit multiple tasks at the same time,It can be said that multiple tasks supervise each other and penalize overfitting);Multitasking is more relevant,Complement each other the better.
缺点:If the connection between tasks is not strong(矛盾、冲突),Then the optimization direction for the underlying network may be the opposite.
Bottom output:f(x),子任务tower:$h^k_x ,Output for each subtask: ,Output for each subtask: ,Output for each subtask:y_x^k = h^k_x(f(x)) $
One-gate-MoE
将inputInput to three independentexpert(3个nn),同时将input输入到gate,gate输出每个expert被选择的概率,然后将三个expert的输出加权求和,输出给tower:
y k = h k ( ∑ i = 1 n g i f i ( x ) ) , y^{k}=h^{k}\left(\sum_{i=1}^{n} g_{i} f_{i}(x)\right) \text { ,} yk=hk(i=1∑ngifi(x)) ,
其中 g ( ) g() g() is a multi-classification module,且 ∑ i = 1 n g ( x ) i = 1 \sum_{i=1}^{n} g(x)_{i}=1 ∑i=1ng(x)i=1 ,$f_{i}(), i=1, \cdots, n 是 n 个 e x p e r t n e t w o r k , k 表示 k 个任务, 是n个expert network,k表示k个任务, 是n个expertnetwork,k表示k个任务,h^k$means afterNN tower.
So here is equivalent to givinginput当作query,给bottomThe output adds oneattention,Look at the formula to change the soup without changing the medicine,$h^k $outside the parentheses,这就导致不同的towerThe input is still the same,没有解决task冲突问题.
但是MOEIt can solve the problem of domain adaptation,用于cross domain:
MMOE
因此很自然,每个 $h_k $ Don't put it outside parentheses,不同的 h k h_k hkThe input is different.
所以每个task各分配一个gate,这样gateThe role is no longerattention了,Rather, it's personalized for eachtaskSelect important features,Filter redundant features:
f k ( x ) = ∑ i = 1 n g i k ( x ) f i ( x ) g k ( x ) = softmax ( W g k x ) \begin{aligned} f^{k}(x) &=\sum_{i=1}^{n} g_{i}^{k}(x) f_{i}(x) \\ \ \ g^{k}(x) &=\operatorname{softmax}\left(W_{gk}x\right) \end{aligned} fk(x) gk(x)=i=1∑ngik(x)fi(x)=softmax(Wgkx)
其中gis a linear change+softmax.
There is a situation in theory,gate能给每个task筛选特征,As for whether the model can be optimized to this situation,不好说.
实验
边栏推荐
猜你喜欢
随机推荐
9月备考PMP,应该从哪里备考?
数据库系统原理与应用教程(071)—— MySQL 练习题:操作题 110-120(十五):综合练习
分享一个适用于MCU项目的代码框架
deploy zabbix
30-day question brushing plan (5)
OSPO 五阶段成熟度模型解析
使用微信公众号给指定微信用户发送信息
Compse编排微服务实战
通配符 SSL/TLS 证书
ssh & scp
17、负载均衡
Redis 做签到统计
30天刷题计划(五)
【多任务学习】Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts KDD18
手撸代码,Redis发布订阅机制实现
BN BatchNorm + BatchNorm的替代新方法KNConvNets
油猴hook小脚本
AcWing 797. 差分
Does LabVIEW really close the COM port using VISA Close?
18、分布式配置中心nacos