Paper reading: Multi-task learning: MMoE
2022-07-25 17:31:00 【xieyan0811】
Overview
English title: Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts
Chinese title: Task-relationship modeling in multi-gate mixture-of-experts multi-task learning
Paper link: https://dl.acm.org/doi/pdf/10.1145/3219819.3220007
Field: deep learning, multi-task learning
Published: 2018
Authors: Jiaqi Ma (University of Michigan), Google
Venue: KDD
Citations: 137
Code and data: https://github.com/drawbridge/keras-mmoe
Read on: 2022-07-24
Notes
Multi-task learning is generally used when several tasks share the same input features: one model learns multiple tasks at once and predicts multiple labels in a single pass. This saves training and inference time, and also saves the space needed to store separate models.
The previous mainstream approach is a shared bottom network, with a separate task-specific network on top for each task. The problem is that if the tasks are not strongly related, they may pull the shared parameters in different directions. Although in theory multiple tasks can assist each other by providing more information, in practice the result is often worse than training a separate model for each task.
Introduction
The effectiveness of multi-task learning generally depends on the correlation between tasks. The MMoE (Multi-gate Mixture-of-Experts) proposed in this paper improves on the earlier MoE method. It mainly addresses the problem of task correlation while optimizing multiple objectives at the same time, for example predicting both whether a user buys and how satisfied the user is.
The main problems encountered in the research are: how to measure the relatedness of different tasks, and how to keep the model from becoming too large and complex because of multitasking.
Contributions
- Proposes the MMoE structure, in which a gating-based upper network is built so that the model can automatically adjust how parameters are shared across tasks.
- Designs a method for generating synthetic data, in order to better measure the impact of task relatedness on modeling.
- Achieves better results on the experimental datasets and addresses large-scale training on real-world data.
Method
(Figure 1: (a) Shared-Bottom model, (b) One-gate MoE model, (c) Multi-gate MoE model)
The previous approach is shown in figure 1(a): the bottom Shared-Bottom network shares its parameters, and the upper layer uses a two-tower or multi-tower structure to adapt to the different tasks:
y_k = h^k(f(x))
where k indexes a specific task, f(x) is the bottom model, and h^k is the upper (tower) model.
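As a concrete illustration of the Shared-Bottom formula above, here is a minimal numpy sketch (illustrative only, not the paper's implementation; all sizes and names are assumptions): one shared bottom f(x) feeding one small tower h_k per task.

```python
import numpy as np

# Shared-Bottom sketch: every task reads the same shared representation f(x).
rng = np.random.default_rng(0)
d_in, d_hidden, n_tasks = 8, 16, 2

W_bottom = rng.normal(size=(d_in, d_hidden))                     # shared bottom f
W_towers = [rng.normal(size=(d_hidden, 1)) for _ in range(n_tasks)]  # one tower h_k per task

x = rng.normal(size=(4, d_in))              # batch of 4 examples
f_x = np.maximum(0.0, x @ W_bottom)         # f(x), ReLU bottom
y = [f_x @ Wk for Wk in W_towers]           # y_k = h_k(f(x))
print([yk.shape for yk in y])               # [(4, 1), (4, 1)]
```

All tasks pull on `W_bottom` during training, which is exactly where the conflict described above arises when tasks are weakly related.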
Figure 1(b) shows the MoE model (also written as OMoE, one-gate MoE). It uses multiple expert networks as the bottom layer, computes gating values from the input to set the proportion contributed by each expert, and then feeds the combined result to the upper network.
y = sum_{i=1}^{n} g(x)_i f_i(x)
where g is the gate and n is the number of experts; the formula combines the outputs of the individual experts. For each example, only part of the network is effectively activated.
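The one-gate mixture above can be sketched in numpy as follows (a minimal illustration; the layer sizes and variable names are assumptions, not the paper's code):

```python
import numpy as np

def softmax(z):
    # numerically stable softmax along the last axis
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_in, d_exp, n_experts = 8, 16, 4
W_experts = [rng.normal(size=(d_in, d_exp)) for _ in range(n_experts)]
W_gate = rng.normal(size=(d_in, n_experts))          # single gate shared by all tasks

x = rng.normal(size=(4, d_in))
experts = np.stack([np.maximum(0.0, x @ We) for We in W_experts], axis=1)  # (4, n, d_exp)
g = softmax(x @ W_gate)                              # (4, n); each row sums to 1
moe_out = (g[:, :, None] * experts).sum(axis=1)      # sum_i g(x)_i f_i(x)
```

Because the softmax weights depend on the input x, different examples get different expert mixtures, which is the "only part of the network is activated" behavior.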
Figure 1(c) is the network structure proposed in this paper, MMoE. The difference from MoE is that it computes a different gate for each task, setting the expert proportions per task.
y_k = h^k(f^k(x)),  where  f^k(x) = sum_{i=1}^{n} g^k(x)_i f_i(x)  and  g^k(x) = softmax(W_{gk} x)
where W_gk is a trainable matrix used to select experts based on the input.
Each gating network linearly divides the input space into n regions, each corresponding to an expert. MMoE decides how the regions managed by the different gates overlap. If the tasks are less related, sharing experts is penalized, and the gating networks of the different tasks learn to use different experts.
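Putting the MMoE formulas together, a minimal numpy sketch of the multi-gate forward pass might look like this (illustrative only; sizes and names are assumptions, see the linked keras-mmoe repo for a real implementation):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_in, d_exp, n_experts, n_tasks = 8, 16, 4, 2
W_experts = [rng.normal(size=(d_in, d_exp)) for _ in range(n_experts)]
W_gates = [rng.normal(size=(d_in, n_experts)) for _ in range(n_tasks)]  # one W_gk per task
W_towers = [rng.normal(size=(d_exp, 1)) for _ in range(n_tasks)]

x = rng.normal(size=(4, d_in))
experts = np.stack([np.maximum(0.0, x @ We) for We in W_experts], axis=1)  # shared f_i(x)

y = []
for Wg, Wt in zip(W_gates, W_towers):
    g_k = softmax(x @ Wg)                             # g^k(x) = softmax(W_gk x), shape (4, n)
    f_k = (g_k[:, :, None] * experts).sum(axis=1)     # f^k(x) = sum_i g^k(x)_i f_i(x)
    y.append(f_k @ Wt)                                # y_k = h_k(f^k(x))
```

The experts are shared across tasks exactly as in MoE; only the cheap per-task gate `W_gk` is duplicated, which is why MMoE adds task-specific flexibility at almost no parameter cost.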
Experiments
Synthetic data experiments
Synthetic data makes it easier to compare the effect of different levels of task correlation. The comparison on synthetic data is shown in figure 4:
(Figure 4: performance of the models on synthetic data at different task correlations)
- For all models, tasks with high correlation train to better results.
- At every correlation level, MMoE outperforms the OMoE and Shared-Bottom models; when the two tasks are identical (correlation 1), MMoE and OMoE perform almost the same.
- Both MoE-based models clearly outperform Shared-Bottom and converge faster, which suggests the MoE structure makes the model easier to train.
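The idea of generating tasks with a controlled correlation can be sketched in a simplified form. The paper's actual recipe also adds sinusoidal nonlinear terms; the sketch below keeps only the linear core, and all names and sizes here are illustrative: two unit weight vectors with cosine similarity rho produce two regression labels from the same inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, rho = 16, 10000, 0.5

# Build two unit vectors whose cosine similarity is exactly rho.
u1 = rng.normal(size=d); u1 /= np.linalg.norm(u1)
u2 = rng.normal(size=d)
u2 -= (u2 @ u1) * u1; u2 /= np.linalg.norm(u2)      # orthogonal to u1
w1 = u1
w2 = rho * u1 + np.sqrt(1 - rho**2) * u2            # cos(w1, w2) = rho

# Two tasks: same inputs, weight vectors at the chosen angle, small noise.
X = rng.normal(size=(n, d))
y1 = X @ w1 + 0.1 * rng.normal(size=n)
y2 = X @ w2 + 0.1 * rng.normal(size=n)
print(np.corrcoef(y1, y2)[0, 1])                    # close to rho
```

Sweeping rho from 0 to 1 then gives a family of task pairs with known relatedness, which is what makes the figure-4 comparison possible.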
Real-data experiments
Census income data
Using census income data, two groups of experiments were carried out. The first group trains two tasks at once: whether income exceeds 50K, and marital status; the second group trains education level together with marital status. There are 199,523 training examples. The training results are as follows:
(Table: results of the two census-income task groups)
Large-scale content recommendation
Trained on hundreds of millions of recommendation records from Google. The objectives to optimize are engagement-related goals, such as click-through rate and dwell time, and satisfaction-related goals. Evaluation uses AUC and R-squared scores. The results are shown in table 3:
(Table 3: results on the large-scale recommendation tasks)
Figure 6 shows the contribution (gate weight) of each expert to each task:
(Figure 6: distribution of gate weights across experts for each task)