
Multi-task learning classics: the MMoE model

2022-06-11 16:41:00 kaiyuan_sjtu


Author  |  Zhihu blogger @青枫拂岸

Compiled by  |  NewBeeNLP

Today we cover MMoE, the classic multi-task recommendation model published by Google at KDD 2018.

  • Paper: Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts

  • Link: https://dl.acm.org/doi/10.1145/3219819.3220007

The paper's contribution lies mainly in model structure and design rather than in training strategy, so this walkthrough focuses on the architecture.

1. Why multi-task modeling is needed

Single-task models usually focus on CTR alone, whether in e-commerce or in news recommendation. Over time this causes problems such as clickbait headlines in news and clicks without purchases in e-commerce. Can a multi-task model avoid this kind of problem?

What is a multi-task model? Multi-task learning aims to build a single model that learns multiple objectives and tasks at the same time.

For example, predicting both CTR and reading time for news, or CTR and purchase conversion rate in e-commerce.

However, the relationship between the tasks usually has a large effect on the prediction quality of a multi-task model; in other words, traditional multi-task models are sensitive to task relatedness. Sections 3.2-3.3 of the paper demonstrate this experimentally.

It is therefore important to study the modeling tradeoff between task-specific objectives and inter-task relationships.

2. Related work

Most multi-task learning frameworks adopt the shared-bottom structure: different tasks share the bottom hidden layers, and separate towers are built on top for each task. The advantage is a smaller number of parameters; the disadvantage is that the tasks run into optimization conflicts with each other during training.

Other structures have therefore been proposed, for example: not sharing parameters between the two tasks but constraining them with an L2 norm; learning a hidden-layer embedding for each task and then combining them; or obtaining task-specific hidden-layer parameters via a tensor factorization model. See Section 2.1 of the paper for details and references.

Compared with the shared-bottom structure, these methods introduce many more parameters. They do alleviate the task optimization conflicts, but they face an industrial problem that cannot be sidestepped: large-scale serving in a production environment.

Drawing on the MoE model, this paper proposes the Multi-gate Mixture-of-Experts (MMoE) model. Compared with shared-bottom, it is stronger in both expressiveness and trainability, and it is more effective in real-world settings.

3. Model architecture

[Figure: model architectures, (a) Shared-Bottom, (b) One-gate MoE, (c) Multi-gate MoE]

Shared-bottom Multi-task Model

As shown in panel (a), suppose there are K tasks; the top of the network has K towers (K=2 in the figure), where the k-th tower is a function h^k, k=1,2,...,K. The shared-bottom layers form a shared function f, so the output for task k is y_k = h^k(f(x)).
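As a concrete reference, here is a minimal PyTorch sketch of a shared-bottom model; the layer sizes and the two-task setup are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class SharedBottom(nn.Module):
    """Shared-bottom multi-task model: y_k = h^k(f(x))."""

    def __init__(self, input_dim=100, bottom_dim=16, tower_dim=8, num_tasks=2):
        super().__init__()
        # Shared bottom network f, used by every task
        self.bottom = nn.Sequential(nn.Linear(input_dim, bottom_dim), nn.ReLU())
        # One task-specific tower h^k per task
        self.towers = nn.ModuleList([
            nn.Sequential(nn.Linear(bottom_dim, tower_dim), nn.ReLU(),
                          nn.Linear(tower_dim, 1))
            for _ in range(num_tasks)
        ])

    def forward(self, x):
        shared = self.bottom(x)                          # f(x)
        return [tower(shared) for tower in self.towers]  # [y_1, ..., y_K]
```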

Original Mixture-of-Experts (MoE) Model

Its formula is

y = Σ_{i=1}^{n} g(x)_i f_i(x),

where f_i, i = 1, 2, ..., n are the expert networks and g is the gating network that combines them. More specifically, g(x) produces a probability distribution over the n experts, and the output is the weighted sum of all expert outputs, much like model ensembling in machine learning.

Although MoE was originally developed as an ensemble of multiple individual models, Eigen et al. and Shazeer et al. turned it into a basic building block, the MoE layer. An MoE layer has the same structure as the MoE model, but it takes the output of the previous layer as its input and feeds its output to the next layer, and the whole model is trained end to end. In other words, MoE becomes one component inside a larger model.

Panel (b) shows the One-gate Mixture-of-Experts (OMoE) model, which has only a single gating network; it serves as a baseline in the experiments below.

Multi-gate Mixture-of-Experts (MMoE) Model

As shown in panel (c), the structure proposed in this paper is designed, in contrast to the shared-bottom multi-task model, to capture task differences without requiring many more parameters. The key is to replace the shared bottom with a MoE layer and to add a separate gating network for each task. For task k the output is

y_k = h^k(f^k(x)), where f^k(x) = Σ_{i=1}^{n} g^k(x)_i f_i(x).

Each gating network is a softmax over a linear transformation of the input:

g^k(x) = softmax(W_{gk} x),

where W_{gk} ∈ R^{n×d} is a trainable matrix, n is the number of expert networks, and d is the input feature dimension.
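Putting the pieces together, here is a minimal PyTorch sketch of the MMoE forward pass described above; the layer sizes are illustrative, and the gates are single linear layers followed by softmax, as in the formula:

```python
import torch
import torch.nn as nn

class MMoE(nn.Module):
    """MMoE: y_k = h^k(f^k(x)), with f^k(x) = sum_i g^k(x)_i * f_i(x)."""

    def __init__(self, input_dim=100, num_experts=8, expert_dim=16,
                 tower_dim=8, num_tasks=2):
        super().__init__()
        # n expert networks f_i, shared by every task
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(input_dim, expert_dim), nn.ReLU())
            for _ in range(num_experts)
        ])
        # One gating matrix W_gk (n x d) per task, followed by softmax
        self.gates = nn.ModuleList([
            nn.Linear(input_dim, num_experts, bias=False)
            for _ in range(num_tasks)
        ])
        # One task-specific tower h^k per task
        self.towers = nn.ModuleList([
            nn.Sequential(nn.Linear(expert_dim, tower_dim), nn.ReLU(),
                          nn.Linear(tower_dim, 1))
            for _ in range(num_tasks)
        ])

    def forward(self, x):
        # Expert outputs stacked to shape (batch, num_experts, expert_dim)
        expert_out = torch.stack([expert(x) for expert in self.experts], dim=1)
        outputs = []
        for gate, tower in zip(self.gates, self.towers):
            weights = torch.softmax(gate(x), dim=-1)                 # g^k(x)
            mixed = (weights.unsqueeze(-1) * expert_out).sum(dim=1)  # f^k(x)
            outputs.append(tower(mixed))                             # y_k
        return outputs

# e.g. MMoE()(torch.randn(4, 100)) returns one prediction tensor per task
```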

Each gating network learns, conditioned on the input, how to weight the experts for its own task, which lets the model share parameters flexibly across tasks.

In the extreme case where each gating network selects exactly one expert, each gate effectively partitions the input space into n regions (n being the number of experts), one expert per region; if every expert ends up serving a single task, the model degenerates into a combination of single-task models.

In short, MMoE models task relationships in a sophisticated way by deciding how the partitions induced by the different gates overlap.

If the tasks are weakly related, sharing experts is penalized and the gating networks of those tasks learn to use different experts. The model therefore captures both the relatedness and the differences of the tasks. Compared with the shared-bottom model, MMoE adds only a few gating networks, whose parameter count is negligible.

4. Experiments

Artificially constructed data sets

Task relatedness cannot easily be varied in real data, so to study how task correlation affects model quality the paper constructs synthetic datasets (see Section 3.2 for details); correlation is measured with the Pearson correlation coefficient of the labels.
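The exact generation procedure is in Section 3.2 of the paper; as a rough, simplified sketch of the idea (the paper additionally adds sinusoidal non-linear terms and calibrated constants), two regression tasks can be synthesized whose label correlation is controlled through the cosine similarity p between their weight vectors:

```python
import numpy as np

def make_correlated_tasks(n=10000, d=100, p=0.5, noise=0.01, seed=0):
    """Two regression tasks whose relatedness is controlled by cos(w1, w2) = p."""
    rng = np.random.default_rng(seed)
    # Two orthogonal unit vectors spanning the label directions
    u1, u2 = np.linalg.qr(rng.normal(size=(d, 2)))[0].T
    w1 = u1
    w2 = p * u1 + np.sqrt(1.0 - p ** 2) * u2   # cosine similarity with w1 is p
    x = rng.normal(size=(n, d))
    y1 = x @ w1 + noise * rng.normal(size=n)
    y2 = x @ w2 + noise * rng.normal(size=n)
    return x, y1, y2
```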

Model settings

The input dimension is 100, there are 8 expert networks, each with a hidden size of 16, and 2 tasks on top, each with a tower of hidden size 8. The parameter count is therefore roughly 100×16 (each expert's input-to-hidden weights) × 8 (number of experts) + 16×8 (each tower's weights) × 2 (number of tasks).
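Filling in that arithmetic (bias terms ignored): 100 × 16 × 8 = 12,800 expert parameters plus 16 × 8 × 2 = 256 tower parameters, roughly 13k parameters in total.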

Experimental results

[Figure: performance of Shared-Bottom, OMoE and MMoE on synthetic data with different task correlations]

Experimental conclusions

  1. For all models, performance on highly correlated data is better than on weakly correlated data.

  2. The performance gap of MMoE across different correlation levels is much smaller than that of OMoE and Shared-Bottom. The trend is especially clear when comparing MMoE with OMoE: in the extreme case where the two tasks are identical, MMoE and OMoE perform almost the same; as task correlation decreases, OMoE degrades noticeably while MMoE is barely affected. Using task-specific gates to model task differences is therefore important when correlation is low.

  3. In terms of average performance, both MoE models outperform the Shared-Bottom model in all scenarios, which shows that the MoE structure itself brings additional benefit. Based on this observation, the MoE models are also more trainable than the Shared-Bottom model.

Model trainability

For large networks we care a great deal about whether the model is easy to train, for example whether it is robust to hyperparameter settings and to model initialization.

The paper therefore studies how robust the models are to randomness in the data and in model initialization. Each setting is run many times: every run generates data from the same distribution with a different random seed and uses a different model initialization, and the distribution of each task's final loss is observed.
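As an illustration of this protocol, here is a sketch that reuses the MMoE and make_correlated_tasks sketches above; the training loop and its hyperparameters are simplified assumptions, not the paper's setup:

```python
import numpy as np
import torch
import torch.nn.functional as F

# Reuses the MMoE class and make_correlated_tasks sketch defined earlier.
def final_losses(seed, p=0.5, steps=1000, lr=1e-3):
    """Train one run and return the final MSE of each task."""
    torch.manual_seed(seed)                            # different model init per run
    x, y1, y2 = make_correlated_tasks(p=p, seed=seed)  # different data per run
    x = torch.tensor(x, dtype=torch.float32)
    ys = [torch.tensor(y, dtype=torch.float32).unsqueeze(-1) for y in (y1, y2)]
    model = MMoE()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        loss = sum(F.mse_loss(pred, y) for pred, y in zip(model(x), ys))
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        return [F.mse_loss(pred, y).item() for pred, y in zip(model(x), ys)]

# Repeat with many seeds and look at the spread of the final losses
losses = np.array([final_losses(seed) for seed in range(20)])
print("mean:", losses.mean(axis=0), "std:", losses.std(axis=0))
```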

Experimental results

[Figure: distribution of final losses across repeated runs for Shared-Bottom, OMoE and MMoE]

Conclusions:

  1. The performance of the Shared-Bottom model has a much larger variance than the MoE-based models, which suggests that Shared-Bottom has more poor-quality local minima than the MoE-based models.

  2. When task correlation is 1, the performance variance of OMoE is similar to that of MMoE; but when the correlation drops to 0, the robustness of OMoE degrades sharply. Since the only difference between MMoE and OMoE is the multi-gate structure, this confirms that the multi-gate structure helps avoid the poor local minima caused by conflicts between dissimilar tasks.

  3. The lowest loss reached by all three models is comparable. This is not surprising: neural networks are in theory universal approximators, so with enough capacity there should exist a "right" Shared-Bottom model that learns both tasks well. But this is the distribution over 200 independent runs, and the paper points out that for bigger and more complex models (for example, when the shared-bottom network is an RNN) the chance of finding the "right" model becomes even smaller. The conclusion is that explicitly modeling task relationships is still desirable.

5. Large-scale serving

The model was deployed at Google on a content platform with hundreds of millions of users. The business scenario: based on what the user is currently consuming, recommend a list of related items to consume next.

The two tasks are:

  • An engagement-related objective, such as click-through rate and engagement time

  • A satisfaction-related objective, such as like rate

The training data consists of hundreds of billions of implicit user feedback records such as clicks and likes. Trained separately, each task's model would need to learn billions of parameters, so compared with learning the objectives independently the Shared-Bottom architecture has the advantage of a much smaller model; such a Shared-Bottom model was in fact already running in production.

Training uses on the order of 100 billion examples with batch_size = 1024; results are reported at 2M, 4M and 6M training steps.

[Figure: results of the compared models at 2M, 4M and 6M training steps]

MMoE performs best.

[Figure: the distribution of each task's gating-network coefficients over the expert networks]

6. Conclusion

The paper proposes a new multi-task modeling paradigm, MMoE, whose advantages are as follows:

  • It handles scenarios with low task relatedness better.

  • At the same level of task relatedness, it performs better than the common Shared-Bottom model and reaches a lower loss.

  • It is computationally attractive: the gating networks are lightweight and the expert networks are shared across all tasks. Even higher efficiency is possible by making the gating network a sparse top-k gate, as in the sketch below.
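A minimal sketch of the top-k idea, in the spirit of Shazeer et al.'s sparse gating (the masking scheme and the value of k below are illustrative simplifications):

```python
import torch

def topk_gate(logits: torch.Tensor, k: int = 2) -> torch.Tensor:
    """Sparse gate: softmax over only the k largest logits, zero weight elsewhere."""
    topk_vals, topk_idx = logits.topk(k, dim=-1)
    masked = torch.full_like(logits, float("-inf"))
    masked.scatter_(-1, topk_idx, topk_vals)   # keep top-k logits, mask the rest
    return torch.softmax(masked, dim=-1)       # masked experts get exactly zero weight

# e.g. with 8 experts, only 2 receive non-zero weight per example,
# so the other experts' forward passes can be skipped at serving time
weights = topk_gate(torch.randn(4, 8), k=2)
```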


Copyright notice: this article was created by [kaiyuan_sjtu]; please include the original link when reposting: https://yzsam.com/2022/162/202206111631578045.html