
Multi-task learning classics: the MMoE model

2022-06-11 16:41:00 kaiyuan_sjtu


Author  |  Zhihu blogger @青枫拂岸

Compiled by  |  NewBeeNLP

Today we cover MMoE, the classic multi-task recommendation model published by Google at KDD 2018.

  • Paper: Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts

  • Link: https://dl.acm.org/doi/10.1145/3219819.3220007

The paper's contribution lies mainly in model structure and design rather than in training strategy, so this walkthrough focuses on the architecture.

1. Why multi-task modeling is needed

Single-task models usually focus on CTR alone, whether in e-commerce or in news recommendation. Over time this causes problems such as clickbait headlines in news and clicks without purchases in e-commerce. Can a multi-task model avoid this kind of problem?

What is a multi-task model? Multi-task learning aims to build a single model that learns multiple objectives and tasks at the same time.

For example, predicting both CTR and reading time for news, or CTR and purchase conversion rate in e-commerce.

However, the relationship between the tasks usually has a large effect on the prediction quality of a multi-task model; in other words, traditional multi-task models are sensitive to task relatedness. Sections 3.2-3.3 of the paper demonstrate this experimentally.

It is therefore important to study the modeling tradeoff between task-specific objectives and inter-task relationships.

2. Related work

Most multi-task learning frameworks adopt the shared-bottom structure: different tasks share the bottom hidden layers, and separate towers are built on top for each task. The advantage is a smaller number of parameters; the disadvantage is that the tasks run into optimization conflicts with each other during training.

Other structures have therefore been proposed, for example: not sharing parameters between the two tasks but constraining them with an L2 norm; learning a hidden-layer embedding for each task and then combining them; or obtaining task-specific hidden-layer parameters via a tensor factorization model. See Section 2.1 of the paper for details and references.

Compared with the shared-bottom structure, these methods introduce many more parameters. They do alleviate the task optimization conflicts, but they face an industrial problem that cannot be sidestepped: large-scale serving in a production environment.

Drawing on the MoE model, this paper proposes the Multi-gate Mixture-of-Experts (MMoE) model. Compared with shared-bottom, it is stronger in both expressiveness and trainability, and it is more effective in real-world settings.

3. Model architecture

[Figure: model architectures, (a) Shared-Bottom, (b) One-gate MoE, (c) Multi-gate MoE]

Shared-bottom Multi-task Model

As shown in panel (a), suppose there are K tasks; the top of the network has K towers (K=2 in the figure), where the k-th tower is a function h^k, k=1,2,...,K. The shared-bottom layers form a shared function f, so the output for task k is y_k = h^k(f(x)).
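As a concrete reference, here is a minimal PyTorch sketch of a shared-bottom model; the layer sizes and the two-task setup are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class SharedBottom(nn.Module):
    """Shared-bottom multi-task model: y_k = h^k(f(x))."""

    def __init__(self, input_dim=100, bottom_dim=16, tower_dim=8, num_tasks=2):
        super().__init__()
        # Shared bottom network f, used by every task
        self.bottom = nn.Sequential(nn.Linear(input_dim, bottom_dim), nn.ReLU())
        # One task-specific tower h^k per task
        self.towers = nn.ModuleList([
            nn.Sequential(nn.Linear(bottom_dim, tower_dim), nn.ReLU(),
                          nn.Linear(tower_dim, 1))
            for _ in range(num_tasks)
        ])

    def forward(self, x):
        shared = self.bottom(x)                          # f(x)
        return [tower(shared) for tower in self.towers]  # [y_1, ..., y_K]
```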

Original Mixture-of-Experts (MoE) Model

Its formula is

y = Σ_{i=1}^{n} g(x)_i f_i(x),

where f_i, i = 1, 2, ..., n are the expert networks and g is the gating network that combines them. More specifically, g(x) produces a probability distribution over the n experts, and the output is the weighted sum of all expert outputs, much like model ensembling in machine learning.

Although MoE was originally developed as an ensemble of multiple individual models, Eigen et al. and Shazeer et al. turned it into a basic building block, the MoE layer. An MoE layer has the same structure as the MoE model, but it takes the output of the previous layer as its input and feeds its output to the next layer, and the whole model is trained end to end. In other words, MoE becomes one component inside a larger model.

Panel (b) shows the One-gate Mixture-of-Experts (OMoE) model, which has only a single gating network; it serves as a baseline in the experiments below.

Multi-gate Mixture-of-Experts (MMoE) Model

As shown in panel (c), the structure proposed in this paper is designed, in contrast to the shared-bottom multi-task model, to capture task differences without requiring many more parameters. The key is to replace the shared bottom with a MoE layer and to add a separate gating network for each task. For task k the output is

y_k = h^k(f^k(x)), where f^k(x) = Σ_{i=1}^{n} g^k(x)_i f_i(x).

Each gating network is a softmax over a linear transformation of the input:

g^k(x) = softmax(W_{gk} x),

where W_{gk} ∈ R^{n×d} is a trainable matrix, n is the number of expert networks, and d is the input feature dimension.
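Putting the pieces together, here is a minimal PyTorch sketch of the MMoE forward pass described above; the layer sizes are illustrative, and the gates are single linear layers followed by softmax, as in the formula:

```python
import torch
import torch.nn as nn

class MMoE(nn.Module):
    """MMoE: y_k = h^k(f^k(x)), with f^k(x) = sum_i g^k(x)_i * f_i(x)."""

    def __init__(self, input_dim=100, num_experts=8, expert_dim=16,
                 tower_dim=8, num_tasks=2):
        super().__init__()
        # n expert networks f_i, shared by every task
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(input_dim, expert_dim), nn.ReLU())
            for _ in range(num_experts)
        ])
        # One gating matrix W_gk (n x d) per task, followed by softmax
        self.gates = nn.ModuleList([
            nn.Linear(input_dim, num_experts, bias=False)
            for _ in range(num_tasks)
        ])
        # One task-specific tower h^k per task
        self.towers = nn.ModuleList([
            nn.Sequential(nn.Linear(expert_dim, tower_dim), nn.ReLU(),
                          nn.Linear(tower_dim, 1))
            for _ in range(num_tasks)
        ])

    def forward(self, x):
        # Expert outputs stacked to shape (batch, num_experts, expert_dim)
        expert_out = torch.stack([expert(x) for expert in self.experts], dim=1)
        outputs = []
        for gate, tower in zip(self.gates, self.towers):
            weights = torch.softmax(gate(x), dim=-1)                 # g^k(x)
            mixed = (weights.unsqueeze(-1) * expert_out).sum(dim=1)  # f^k(x)
            outputs.append(tower(mixed))                             # y_k
        return outputs

# e.g. MMoE()(torch.randn(4, 100)) returns one prediction tensor per task
```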

Each gating network learns, conditioned on the input, how to weight the experts for its own task, which lets the model share parameters flexibly across tasks.

In the extreme case where each gating network selects exactly one expert, each gate effectively partitions the input space into n regions (n being the number of experts), one expert per region; if every expert ends up serving a single task, the model degenerates into a combination of single-task models.

In short, MMoE models task relationships in a sophisticated way by deciding how the partitions induced by the different gates overlap.

If the tasks are weakly related, sharing experts is penalized and the gating networks of those tasks learn to use different experts. The model therefore captures both the relatedness and the differences of the tasks. Compared with the shared-bottom model, MMoE adds only a few gating networks, whose parameter count is negligible.

4. Experiments

Artificially constructed data sets

Task relatedness cannot easily be varied in real data, so to study how task correlation affects model quality the paper constructs synthetic datasets (see Section 3.2 for details); correlation is measured with the Pearson correlation coefficient of the labels.
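The exact generation procedure is in Section 3.2 of the paper; as a rough, simplified sketch of the idea (the paper additionally adds sinusoidal non-linear terms and calibrated constants), two regression tasks can be synthesized whose label correlation is controlled through the cosine similarity p between their weight vectors:

```python
import numpy as np

def make_correlated_tasks(n=10000, d=100, p=0.5, noise=0.01, seed=0):
    """Two regression tasks whose relatedness is controlled by cos(w1, w2) = p."""
    rng = np.random.default_rng(seed)
    # Two orthogonal unit vectors spanning the label directions
    u1, u2 = np.linalg.qr(rng.normal(size=(d, 2)))[0].T
    w1 = u1
    w2 = p * u1 + np.sqrt(1.0 - p ** 2) * u2   # cosine similarity with w1 is p
    x = rng.normal(size=(n, d))
    y1 = x @ w1 + noise * rng.normal(size=n)
    y2 = x @ w2 + noise * rng.normal(size=n)
    return x, y1, y2
```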

Model settings

The input dimension is 100, there are 8 expert networks, each with a hidden size of 16, and 2 tasks on top, each with a tower of hidden size 8. The parameter count is therefore roughly 100×16 (each expert's input-to-hidden weights) × 8 (number of experts) + 16×8 (each tower's weights) × 2 (number of tasks).
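Filling in that arithmetic (bias terms ignored): 100 × 16 × 8 = 12,800 expert parameters plus 16 × 8 × 2 = 256 tower parameters, roughly 13k parameters in total.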

Experimental results

[Figure: performance of Shared-Bottom, OMoE and MMoE on synthetic data with different task correlations]

Experimental conclusions

  1. For all models, performance on highly correlated data is better than on weakly correlated data.

  2. The performance gap of MMoE across different correlation levels is much smaller than that of OMoE and Shared-Bottom. The trend is especially clear when comparing MMoE with OMoE: in the extreme case where the two tasks are identical, MMoE and OMoE perform almost the same; as task correlation decreases, OMoE degrades noticeably while MMoE is barely affected. Using task-specific gates to model task differences is therefore important when correlation is low.

  3. In terms of average performance, both MoE models outperform the Shared-Bottom model in all scenarios, which shows that the MoE structure itself brings additional benefit. Based on this observation, the MoE models are also more trainable than the Shared-Bottom model.

Model trainability

For large networks we care a great deal about whether the model is easy to train, for example whether it is robust to hyperparameter settings and to model initialization.

The paper therefore studies how robust the models are to randomness in the data and in model initialization. Each setting is run many times: every run generates data from the same distribution with a different random seed and uses a different model initialization, and the distribution of each task's final loss is observed.
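As an illustration of this protocol, here is a sketch that reuses the MMoE and make_correlated_tasks sketches above; the training loop and its hyperparameters are simplified assumptions, not the paper's setup:

```python
import numpy as np
import torch
import torch.nn.functional as F

# Reuses the MMoE class and make_correlated_tasks sketch defined earlier.
def final_losses(seed, p=0.5, steps=1000, lr=1e-3):
    """Train one run and return the final MSE of each task."""
    torch.manual_seed(seed)                            # different model init per run
    x, y1, y2 = make_correlated_tasks(p=p, seed=seed)  # different data per run
    x = torch.tensor(x, dtype=torch.float32)
    ys = [torch.tensor(y, dtype=torch.float32).unsqueeze(-1) for y in (y1, y2)]
    model = MMoE()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        loss = sum(F.mse_loss(pred, y) for pred, y in zip(model(x), ys))
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return [F.mse_loss(pred, y).item() for pred, y in zip(model(x), ys)]

# Repeat with many seeds and look at the spread of the final losses
losses = np.array([final_losses(seed) for seed in range(20)])
print("mean:", losses.mean(axis=0), "std:", losses.std(axis=0))
```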

Experimental results

[Figure: distribution of final losses across repeated runs for Shared-Bottom, OMoE and MMoE]

Conclusions:

  1. The performance of the Shared-Bottom model has a much larger variance than the MoE-based models, which suggests that Shared-Bottom has more poor-quality local minima than the MoE-based models.

  2. When task correlation is 1, the performance variance of OMoE is similar to that of MMoE; but when the correlation drops to 0, the robustness of OMoE degrades sharply. Since the only difference between MMoE and OMoE is the multi-gate structure, this confirms that the multi-gate structure helps avoid the poor local minima caused by conflicts between dissimilar tasks.

  3. The lowest loss reached by all three models is comparable. This is not surprising: neural networks are in theory universal approximators, so with enough capacity there should exist a "right" Shared-Bottom model that learns both tasks well. But this is the distribution over 200 independent runs, and the paper points out that for bigger and more complex models (for example, when the shared-bottom network is an RNN) the chance of finding the "right" model becomes even smaller. The conclusion is that explicitly modeling task relationships is still desirable.

5. Large-scale serving

The model was deployed at Google on a content platform with hundreds of millions of users. The business scenario: based on what the user is currently consuming, recommend a list of related items to consume next.

The two tasks are:

  • An engagement-related objective, such as click-through rate and engagement time

  • A satisfaction-related objective, such as like rate

The training data consists of hundreds of billions of implicit user feedback records such as clicks and likes. Trained separately, each task's model would need to learn billions of parameters, so compared with learning the objectives independently the Shared-Bottom architecture has the advantage of a much smaller model; such a Shared-Bottom model was in fact already running in production.

Training uses on the order of 100 billion examples with batch_size = 1024; results are reported at 2M, 4M and 6M training steps.

[Figure: results of the compared models at 2M, 4M and 6M training steps]

MMoE performs best.

[Figure: the distribution of each task's gating-network coefficients over the expert networks]

6. Conclusion

The paper proposes a new multi-task modeling paradigm, MMoE, whose advantages are as follows:

  • It handles scenarios with low task relatedness better.

  • At the same level of task relatedness, it performs better than the common Shared-Bottom model and reaches a lower loss.

  • It is computationally attractive: the gating networks are lightweight and the expert networks are shared across all tasks. Even higher efficiency is possible by making the gating network a sparse top-k gate, as in the sketch below.
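A minimal sketch of the top-k idea, in the spirit of Shazeer et al.'s sparse gating (the masking scheme and the value of k below are illustrative simplifications):

```python
import torch

def topk_gate(logits: torch.Tensor, k: int = 2) -> torch.Tensor:
    """Sparse gate: softmax over only the k largest logits, zero weight elsewhere."""
    topk_vals, topk_idx = logits.topk(k, dim=-1)
    masked = torch.full_like(logits, float("-inf"))
    masked.scatter_(-1, topk_idx, topk_vals)   # keep top-k logits, mask the rest
    return torch.softmax(masked, dim=-1)       # masked experts get exactly zero weight

# e.g. with 8 experts, only 2 receive non-zero weight per example,
# so the other experts' forward passes can be skipped at serving time
weights = topk_gate(torch.randn(4, 8), k=2)
```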


Copyright notice: this article was created by [kaiyuan_sjtu]; please include the original link when reposting: https://yzsam.com/2022/162/202206111631578045.html