[Multi-task learning] Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts KDD18

2022-08-01 20:01:00 【chad_lee】

At the model level: for a single objective we usually spend a lot of effort on "finding strong features" and "removing redundant features" before feeding the input to the model, in order to improve its performance. Once we switch to MTL, the strong features each task needs, and the harmful ("negative") features it should exclude, differ from task to task. The goal of MTL is to find, as far as possible, the strong and negative features of each task.
At the optimization level: several tasks optimize the model at the same time, and some tasks can dominate the optimization process and drown out the others.
From the perspective of supervision signals: MTL is not only a set of tasks but also a form of data augmentation. Each task effectively gains k-1 extra supervision signals to aid its learning, and some features can be learned better from other tasks. The quality of these signals depends on the similarity between tasks; dissimilar tasks act as noise instead.
# SB, MoE, MMoE

"Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts" (Google, KDD 2018)

Shared-Bottom

Multiple tasks share the same bottom network, which outputs a feature vector for each sample; each subtask then attaches its own small NN tower on top.

Pros: it is simple, and the risk of overfitting is low (it is hard to overfit several tasks at once, so the tasks can be said to supervise one another and penalize overfitting). The more related the tasks are, the more they complement each other.
Cons: if the tasks are only weakly related (contradictory or conflicting), their optimization directions for the shared bottom network may be opposite.
Bottom output: $f(x)$; tower of subtask $k$: $h^k$; output of subtask $k$: $y^k = h^k(f(x))$.
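
The shared-bottom formula can be sketched in a few lines of NumPy. This is only an illustrative forward pass under assumptions that are not in the original post: the layer sizes, the single linear + ReLU bottom, the one-layer towers, and all names (`W_bottom`, `W_towers`, `shared_bottom_forward`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
d_in, d_hidden, n_tasks = 8, 4, 2

# Shared bottom f: a single linear layer + ReLU used by every task (assumption).
W_bottom = rng.normal(size=(d_in, d_hidden))
# One tower h^k per task: here just a linear map to a scalar (assumption).
W_towers = [rng.normal(size=(d_hidden, 1)) for _ in range(n_tasks)]

def shared_bottom_forward(x):
    """Return [y^1, ..., y^K] with y^k = h^k(f(x))."""
    shared = np.maximum(x @ W_bottom, 0.0)   # f(x), shape (batch, d_hidden)
    return [shared @ W for W in W_towers]    # each y^k has shape (batch, 1)

x = rng.normal(size=(5, d_in))
outputs = shared_bottom_forward(x)
```

Every tower reads the same `shared` tensor, so gradients from all tasks flow into `W_bottom`; that is where conflicting tasks can pull in opposite directions.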

One-gate-MoE

The input is fed to three independent experts (three NNs) and, at the same time, to a gate. The gate outputs the probability of selecting each expert; the three expert outputs are then combined as a weighted sum and passed to the towers:
$y^{k}=h^{k}\left(\sum_{i=1}^{n} g(x)_{i} f_{i}(x)\right),$
where $g(\cdot)$ is a multi-class classification module with $\sum_{i=1}^{n} g(x)_{i}=1$, $f_{i}(\cdot), i=1, \cdots, n$ are the $n$ expert networks, $k$ indexes the tasks, and $h^k$ is the NN tower that follows.

This amounts to treating the input as a query and adding attention over the bottom outputs. Judging by the formula, though, it is old wine in a new bottle: $h^k$ sits outside the parentheses, so every tower still receives the same input, and the task-conflict problem is not solved.
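
A minimal NumPy sketch of the one-gate formula makes the shared input explicit. The sizes, the linear + ReLU experts, the linear softmax gate, and all names (`W_experts`, `W_gate`, `omoe_forward`) are assumptions made for illustration, not details from the post.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical sizes: 3 experts as in the text, 2 tasks.
d_in, d_expert, n_experts, n_tasks = 8, 4, 3, 2

W_experts = rng.normal(size=(n_experts, d_in, d_expert))   # one matrix per expert f_i
W_gate = rng.normal(size=(d_in, n_experts))                # the single shared gate g
W_towers = [rng.normal(size=(d_expert, 1)) for _ in range(n_tasks)]

def omoe_forward(x):
    # f_i(x) for every expert: shape (batch, n_experts, d_expert)
    expert_out = np.maximum(np.einsum("bd,ndh->bnh", x, W_experts), 0.0)
    gate = softmax(x @ W_gate)                           # g(x), each row sums to 1
    mixed = np.einsum("bn,bnh->bh", gate, expert_out)    # sum_i g(x)_i f_i(x)
    return gate, mixed, [mixed @ W for W in W_towers]    # y^k = h^k(mixed)

x = rng.normal(size=(5, d_in))
gate, mixed, outputs = omoe_forward(x)
```

There is only one `mixed` tensor, and every tower consumes it, which is the point the text makes about $h^k$ sitting outside the parentheses.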

However, MoE can address domain adaptation and is used in cross-domain settings.

MMOE

The natural fix is to stop applying every $h^k$ to the same mixture: different $h^k$ should receive different inputs.

So each task is assigned its own gate. The gate then no longer acts as attention; instead it selects the important features and filters out the redundant ones for each task individually:
$\begin{aligned} f^{k}(x) &=\sum_{i=1}^{n} g^{k}(x)_{i} f_{i}(x) \\ g^{k}(x) &=\operatorname{softmax}\left(W_{gk} x\right) \end{aligned}$
where each gate $g^k$ is a linear transformation followed by a softmax, and the output of task $k$ is $y^k = h^k(f^k(x))$.
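
The per-task gating can be sketched in NumPy as follows. As a hedged illustration only: the sizes, the linear + ReLU experts, the one-layer towers, and all names (`W_gates`, `mmoe_forward`, and so on) are hypothetical and not taken from the post or the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)    # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical sizes, chosen only for illustration.
d_in, d_expert, n_experts, n_tasks = 8, 4, 3, 2

W_experts = rng.normal(size=(n_experts, d_in, d_expert))  # shared experts f_i
W_gates = rng.normal(size=(n_tasks, d_in, n_experts))     # one gate matrix W_gk per task
W_towers = rng.normal(size=(n_tasks, d_expert, 1))        # one tower h^k per task

def mmoe_forward(x):
    # f_i(x): shape (batch, n_experts, d_expert)
    expert_out = np.maximum(np.einsum("bd,ndh->bnh", x, W_experts), 0.0)
    # g^k(x) = softmax(W_gk x): shape (batch, n_tasks, n_experts)
    gates = softmax(np.einsum("bd,kdn->bkn", x, W_gates))
    # f^k(x) = sum_i g^k(x)_i f_i(x): shape (batch, n_tasks, d_expert)
    mixed = np.einsum("bkn,bnh->bkh", gates, expert_out)
    # y^k = h^k(f^k(x)): shape (batch, n_tasks, 1)
    outputs = np.einsum("bkh,kho->bko", mixed, W_towers)
    return gates, mixed, outputs

x = rng.normal(size=(5, d_in))
gates, mixed, outputs = mmoe_forward(x)
```

Compared with the one-gate version, the only structural change is the extra task axis on the gate: `mixed[:, k]` is now a different mixture for each tower.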

In theory there exists a solution in which the gates select features for each task; whether training actually drives the model to that solution is hard to say.
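
One way to probe this empirically, offered as a suggestion rather than something from the post, is to look at the gate distributions of a trained model: the average weight each task puts on each expert, and the entropy of each task's gate. The helper `gate_usage` and the toy input below are hypothetical.

```python
import numpy as np

def gate_usage(gates):
    """gates: array of shape (batch, n_tasks, n_experts); each last-axis row sums to 1.

    Returns the mean gate weight per (task, expert) and the mean entropy per task.
    """
    mean_weights = gates.mean(axis=0)                         # (n_tasks, n_experts)
    # Small constant avoids log(0) when a gate weight is exactly zero.
    entropy = -(gates * np.log(gates + 1e-12)).sum(axis=-1)   # (batch, n_tasks)
    return mean_weights, entropy.mean(axis=0)                 # second item: (n_tasks,)

# Toy input: perfectly uniform gates over 3 experts for 2 tasks.
uniform = np.full((4, 2, 3), 1.0 / 3.0)
mean_weights, entropy = gate_usage(uniform)
```

If the rows of `mean_weights` are nearly identical across tasks, or the entropy stays close to `log(n_experts)`, the gates have not learned task-specific selection.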

Experiments

[Figure: experimental results]

Copyright notice
This article was written by [chad_lee]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/213/202208011953404145.html
