Classic Reading of Multi-Task Learning: The MMoE Model
2022-06-11 16:41:00 【kaiyuan_ sjtu】

Author | Zhihu blogger "The green maple brushes the bank"
Compiled by | NewBeeNLP
Today's topic is MMoE, the classic multi-task recommendation model that Google published at KDD 2018.
Paper: Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts
Link: https://dl.acm.org/doi/10.1145/3219819.3220007
The paper's contribution lies mainly in model structure and design rather than in training or optimization strategy, so this walkthrough focuses on the architecture.
1. Why multi-task modeling is needed
In general, single-task modeling focuses on CTR alone, whether in e-commerce or in news recommendation. Over time, however, problems appear, such as clickbait in news and clicks that never turn into purchases in e-commerce. Can this kind of problem be avoided with a multi-task model?
What is a multi-task model: multi-task learning aims to build a single model that learns multiple objectives and tasks simultaneously, for example predicting both CTR and reading time for news, or both CTR and purchase conversion rate in e-commerce.
However, the relationships between tasks usually have a strong effect on the prediction quality of a multi-task model; in other words, traditional multi-task models are sensitive to task relationships (the experiments in Sections 3.2-3.3 of the paper support this). It is therefore important to study the modeling tradeoffs between task-specific objectives and inter-task relationships.
2. Related work
Multi-task learning frameworks mainly adopt the shared-bottom structure: different tasks share the bottom hidden layers, and separate towers are built on top for the individual tasks. The advantage is fewer parameters; the disadvantage is that different tasks run into optimization conflicts during training.
Other structures therefore exist, for example: not sharing parameters between two tasks but constraining them with an L2-norm penalty; learning a hidden-layer embedding for each task and then combining them; or obtaining task-specific hidden-layer parameters through a tensor factorization model. See Section 2.1 of the original paper for the related references.
Compared with the shared-bottom structure, these alternatives introduce many more parameters. They do alleviate the task-optimization-conflict problem, but they face an industrial issue that cannot be sidestepped: large-scale serving in real production environments.
Drawing on the MoE model, this paper proposes the Multi-gate Mixture-of-Experts (MMoE) model. Compared with shared-bottom, it is better in both expressiveness and trainability, and more effective on real-world data.
3. Model architecture
[Figure 1 of the paper: (a) Shared-bottom model, (b) One-gate MoE model, (c) Multi-gate MoE model.]
Shared-bottom Multi-task Model
As shown in figure (a), suppose there are K tasks (K=2 in the figure); the upper part consists of K towers, where tower k is a function h^k, k = 1, 2, ..., K. The shared-bottom layers form the shared bottom representation, expressed as a function f. The output for task k is y_k = h^k(f(x)).
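As a reference point for the variants below, here is a minimal sketch of a shared-bottom multi-task model, assuming PyTorch; the layer sizes and the ReLU bottom/towers are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SharedBottom(nn.Module):
    # y_k = h^k(f(x)): one shared bottom f, one tower h^k per task
    def __init__(self, input_dim=100, bottom_dim=16, tower_dim=8, num_tasks=2):
        super().__init__()
        # f(x): bottom layers shared by every task
        self.bottom = nn.Sequential(nn.Linear(input_dim, bottom_dim), nn.ReLU())
        # h^k: one small tower and output head per task
        self.towers = nn.ModuleList(
            nn.Sequential(nn.Linear(bottom_dim, tower_dim), nn.ReLU(), nn.Linear(tower_dim, 1))
            for _ in range(num_tasks)
        )

    def forward(self, x):
        shared = self.bottom(x)                            # f(x)
        return [tower(shared) for tower in self.towers]    # [y_1, ..., y_K]
```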
Original Mixture-of-Experts (MoE) Model
The formula is y = Σ_{i=1}^{n} g(x)_i · f_i(x), with Σ_{i=1}^{n} g(x)_i = 1, where g(x)_i is the weight assigned to expert i. f_i is the i-th expert network, i ∈ {1, 2, ..., n}. g is the gating network, i.e. the mechanism that combines the underlying expert networks: it produces a probability distribution over the n experts, and the final output is the weighted sum of all expert outputs, similar to model ensembling in machine learning.
Although MoE was originally developed as an ensemble of multiple individual models, Eigen et al. and Shazeer et al. turned it into a basic building block, the MoE layer. An MoE layer has the same structure as the MoE model but takes the output of the previous layer as its input and feeds its output into the next layer; the whole model is then trained end to end. In other words, MoE becomes a small component inside a larger model.
Figure (b) shows the One-gate Mixture-of-Experts (OMoE) model, which has only a single gating network; it is used as a baseline in the experiments below.
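For concreteness, here is a minimal functional sketch of the MoE combination above, assuming PyTorch; `experts` is any list of n expert networks with matching output shapes, and `gate_linear` is a hypothetical nn.Linear from the input dimension to n.

```python
import torch

def moe_forward(x, experts, gate_linear):
    # g(x): softmax over the n raw gate scores
    gate = torch.softmax(gate_linear(x), dim=-1)               # (batch, n_experts)
    # f_i(x) for every expert, stacked along a new expert axis
    expert_out = torch.stack([f(x) for f in experts], dim=1)   # (batch, n_experts, out_dim)
    # y = Σ_i g(x)_i · f_i(x)
    return (gate.unsqueeze(-1) * expert_out).sum(dim=1)        # (batch, out_dim)
```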
Multi-gate Mixture-of-Experts (MMoE) Model
As shown in figure (c), this is the structure proposed in the paper. Compared with the shared-bottom multi-task model, it is designed to capture differences between tasks without requiring many more parameters. The key idea is to replace the shared bottom with a MoE layer and to add a separate gating network for each task. For task k:

y_k = h^k(f^k(x)), where f^k(x) = Σ_{i=1}^{n} g^k(x)_i · f_i(x).

The gating network is implemented as a linear transformation of the input followed by a softmax:

g^k(x) = softmax(W_{gk} x),

where W_{gk} ∈ R^{n×d} is a trainable matrix, n is the number of expert networks, and d is the feature dimension.
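Putting the pieces together, here is a minimal MMoE sketch, again assuming PyTorch; the default dimensions mirror the synthetic-data experiment later in the post, but the ReLU experts/towers and the bias-free gates are illustrative assumptions, not the paper's exact production setup. With a single gate shared by all tasks, the same code reduces to OMoE.

```python
import torch
import torch.nn as nn

class MMoE(nn.Module):
    def __init__(self, input_dim=100, num_experts=8, expert_dim=16, tower_dim=8, num_tasks=2):
        super().__init__()
        # f_i: expert networks shared by all tasks
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(input_dim, expert_dim), nn.ReLU())
            for _ in range(num_experts)
        )
        # g^k(x) = softmax(W_gk x): one lightweight linear gate per task
        self.gates = nn.ModuleList(
            nn.Linear(input_dim, num_experts, bias=False) for _ in range(num_tasks)
        )
        # h^k: one tower per task
        self.towers = nn.ModuleList(
            nn.Sequential(nn.Linear(expert_dim, tower_dim), nn.ReLU(), nn.Linear(tower_dim, 1))
            for _ in range(num_tasks)
        )

    def forward(self, x):
        expert_out = torch.stack([f(x) for f in self.experts], dim=1)   # (B, n, expert_dim)
        outputs = []
        for gate, tower in zip(self.gates, self.towers):
            weights = torch.softmax(gate(x), dim=-1).unsqueeze(-1)      # g^k(x): (B, n, 1)
            f_k = (weights * expert_out).sum(dim=1)                     # f^k(x): (B, expert_dim)
            outputs.append(tower(f_k))                                  # y_k = h^k(f^k(x))
        return outputs
```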
Each gating network learns, through training, to choose input-dependent weights over the expert networks, which makes it possible to share parameters flexibly in multi-task learning.
In the extreme case where a gating network selects only a single expert's output, each gate effectively partitions the input space linearly into n regions (n being the number of experts), each region corresponding to one expert. Each expert network then serves one task, and the model degenerates into a combination of single-task models.
In short, MMoE models task relationships in a sophisticated way by deciding how the partitions induced by the different gates overlap.
If the tasks are less related, sharing experts is penalized and the gating networks of those tasks learn to use different experts; the model thus captures both the relatedness and the differences between tasks. Compared with the shared-bottom model, MMoE only adds a few gating networks, whose parameter count is negligible.
4. Experiments
Synthetic datasets
Task relatedness cannot easily be varied in real data, so to explore how task relatedness affects model performance the paper constructs synthetic datasets (see Section 3.2 for details), where relatedness is measured by the Pearson correlation coefficient of the labels.
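As a rough illustration only, here is one way to synthesize two regression tasks whose label correlation is controlled by a parameter p, loosely following the spirit of Section 3.2; the nonlinearity and noise level below are arbitrary choices, not the paper's exact construction.

```python
import numpy as np

def make_two_tasks(n=10000, d=100, p=0.8, scale=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # Two orthonormal direction vectors; p sets the cosine similarity of the task weights.
    q, _ = np.linalg.qr(rng.normal(size=(d, 2)))
    u1, u2 = q[:, 0], q[:, 1]
    w1 = scale * u1
    w2 = scale * (p * u1 + np.sqrt(1.0 - p**2) * u2)
    x = rng.normal(size=(n, d))

    def label(w):
        s = x @ w
        return s + np.sin(2.0 * s + 1.0) + 0.1 * rng.normal(size=n)  # nonlinearity + noise

    y1, y2 = label(w1), label(w2)
    return x, y1, y2, np.corrcoef(y1, y2)[0, 1]  # last value: measured Pearson correlation
```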
Model settings
The input dimension is 100, with 8 expert networks, each with hidden size 16. On top there are 2 tasks, each with a tower of hidden size 8, so the parameter count is 100×16 (input-to-hidden weights of one expert) ×8 (number of experts) + 16×8 (weights of one tower) ×2 (number of tasks).
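For reference, the arithmetic above works out as follows (weights only, biases ignored; the gate line is an extra term for the MMoE variant since its per-task gates are small):

```python
input_dim, n_experts, expert_hidden, n_tasks, tower_hidden = 100, 8, 16, 2, 8
expert_params = input_dim * expert_hidden * n_experts   # 100 * 16 * 8 = 12,800
tower_params = expert_hidden * tower_hidden * n_tasks   # 16 * 8 * 2   = 256
gate_params = input_dim * n_experts * n_tasks            # 100 * 8 * 2  = 1,600 (MMoE gates)
print(expert_params + tower_params, gate_params)         # 13056 1600
```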
Experimental results
[Figure omitted: performance of Shared-Bottom, OMoE, and MMoE on the synthetic data under different task correlations.]
Experimental conclusions
For all models, performance on data with high task correlation is better than on data with low correlation.
The performance gap of MMoE across different correlations is much smaller than that of the OMoE and Shared-Bottom models. The trend is especially clear when comparing MMoE with OMoE: in the extreme case where the two tasks are identical, the two perform almost the same; but as the correlation between tasks decreases, OMoE degrades noticeably while MMoE is barely affected. It is therefore important to use task-specific gates to model task differences in low-correlation settings.
In terms of average performance, both MoE models outperform the Shared-Bottom model in all scenarios, which suggests that the MoE structure itself brings additional benefits. Based on this observation, the MoE models are also easier to train than the Shared-Bottom model.
Model trainability
For large models, we care greatly about whether they are trainable, for example whether they are robust to hyperparameter settings and to model initialization. The paper therefore studies the robustness of each model to randomness in the data and in initialization: the experiment is repeated many times under each setting, each time generating data from the same distribution with a different random seed and initializing the model differently, and the resulting task losses are observed.
Experimental results
[Figure omitted: histogram of final losses over the repeated runs for the three models.]
Conclusions:
The variance of the Shared-Bottom model is much larger than that of the MoE-based models, which suggests that the Shared-Bottom model has more poor-quality local minima than the MoE-based models.
Second, when the task correlation is 1, the performance variance of OMoE is similarly robust to that of MMoE; but when the task correlation drops to 0, the robustness of OMoE degrades sharply. The only difference between MMoE and OMoE is the multi-gate structure, which verifies the effectiveness of the multi-gate structure in resolving the bad local minima caused by conflicts between dissimilar tasks.
The lowest losses of all three models are comparable. This is not surprising: neural networks are universal approximators in theory, so with enough capacity there should exist a "right" Shared-Bottom model that learns both tasks well. What matters, however, is the distribution over 200 independent runs. The paper points out that for larger and more complex models (for example, when the shared-bottom network is an RNN), the chance of obtaining the "right" model may be even lower. The conclusion: it is still desirable to explicitly model task relationships.
5. Large-scale serving
The model was deployed at Google on a content platform with hundreds of millions of users. The business scenario: given the item a user is currently consuming, recommend a list of related items to consume next.
The two tasks are:
An engagement-related objective, such as click-through rate and engagement time.
A satisfaction-related objective, such as like rate.
The training data consists of hundreds of billions of implicit user feedback events, such as clicks and likes. If each task were trained separately, each model would need to learn billions of parameters, so compared with learning multiple objectives separately, the Shared-Bottom architecture has the advantage of a smaller model size; in fact, such a Shared-Bottom model was already used in production.
Training uses roughly 100 billion examples with batch_size=1024; results are shown at 2M, 4M, and 6M training steps.
[Results omitted: comparison of the models at 2M, 4M, and 6M training steps.]
As shown, MMoE performs best.
[Figure omitted: for each task, the distribution of gating-network weights over the expert networks.]
6. Conclusion
The paper proposes a new multi-task modeling paradigm, MMoE, whose advantages are as follows:
It handles scenarios with low task relatedness better.
At the same level of task relatedness, it outperforms the common Shared-Bottom model and reaches a lower loss.
It has computational advantages: the gating networks are usually lightweight, and the expert networks are shared across all tasks. Computational efficiency can be improved further by making the gating network a sparse top-k gate (a minimal sketch follows below).
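As a hedged illustration of that last point, here is a tiny top-k gating sketch in the spirit of Shazeer et al.'s sparsely-gated MoE (not something the MMoE paper itself implements): only the k highest-scoring experts per example receive non-zero weight, so only those experts need to be evaluated.

```python
import torch

def topk_gate(logits: torch.Tensor, k: int = 2) -> torch.Tensor:
    # logits: (batch, n_experts) raw gate scores, e.g. W_g x
    top_vals, top_idx = logits.topk(k, dim=-1)
    sparse = torch.full_like(logits, float("-inf")).scatter(-1, top_idx, top_vals)
    return torch.softmax(sparse, dim=-1)   # zero weight outside the top-k experts
```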