当前位置：网站首页>[recommendation system 02] deepfm, youtubednn, DSSM, MMOE

[recommendation system 02] deepfm, youtubednn, DSSM, MMOE

2022-07-07 10:35:00 【ECCUSXR】

1 DeepFM

Online advertising data set ： Criteo Labs

describe ： Contains millions of click feedback records of display advertisements , This data can be used as the click through rate (CTR) The basis of the forecast .

The dataset has 40 Features , The first column is the label , The value of 1 Indicates that the advertisement has been clicked , And value 0 Indicates that the advertisement has not been clicked . Other features include 13 individual dense The characteristics and 26 individual sparse features .

1.1 Feature Engineering

Dense features ： Also known as numerical characteristics , For example, salary 、 Age . In this tutorial Dense The feature performs two operations ：
- MinMaxScaler normalization , Make its value in [0,1] Between
- Discretize it into new Sparse features
- Sparse features ： Also known as category type feature , Such as gender 、 Education . In this tutorial Sparse Feature directly LabelEncoder Coding operation , Map the original category string to a number , In the model, we will generate Embedding vector .

1.2 Torch-RecHub frame

Torch-RecHub The framework is mainly based on PyTorch and sklearn, Easy to use and expand 、 It can reproduce the practical recommendation model in the industry , Highly modular , Support common Layer, Support common sorting models 、 Recall model 、 Multi task learning ;
Usage method ： Use DataGenerator Build a data loader , By building lightweight models , And model training is carried out based on the unified trainer , Finally, complete the model evaluation .

1.3 The background

DeepFM Huawei Noah Ark laboratory in 2017 The model proposed in .

FM（Factorization Machines, Factorizer ） It is intended to solve the problem that it is difficult to train model parameters in the scene of sparse data .FM As a recommendation algorithm, it is widely used in recommendation systems and computational advertising , It is usually used to predict the click through rate CTR（click-through rate） And conversion CVR（conversion rate）.

https://zhuanlan.zhihu.com/p/342803984

solve DNN（ All connected neural networks ） limitations ： The network parameter is too large , take One Hot The feature is transformed into Dense Vector
FNN（ Feedforward neural networks ） and PNN（ Probabilistic neural networks ）： Use pre trained FM modular , Connect to DNN It's formed on the surface FNN Model , After that Embedding layer and hidden layer1 Add a product layer , Use product layer Replace FM Pre training layer , formation PNN Model ( Don't understand, )

2 FM part

FM Layer It is mainly composed of first-order features and second-order features , after Sigmoid obtain logitcs
Model formula ：
-
FM The model formula of is a general fitting equation , Different loss functions can be used to solve regression、classification Other questions ,FM New samples can be predicted in linear time .
advantage ：
1. The vector inner product is used as the weight of the cross feature , It can be used when the data is very sparse , Effectively train the weight of cross features （ Because you don't need two features that are not zero at the same time ）
2. Through formula optimization , obtain O(nk) The computational complexity of , Computational efficiency is very high
3. Although the overall feature space in the recommended scenario is very large , however FM The training and prediction only need to deal with the non-zero features in the samples , This also improves the speed of model training and online prediction .
4. Due to the high computational efficiency of the model , And it can automatically mine long tail low-frequency materials in sparse scenes , Applicable to recall 、 There are three stages of rough and fine discharge . When applied at different stages , Sample construction 、 Fitting objectives and online services are different “
shortcoming ： Only the second-order intersection of features can be displayed , There is nothing we can do about higher-order crossover .

3 Deep part

Model composition ：
1. Use full connection to connect Dense Embedding Input to Hidden Layer, solve DNN The problem of parameter explosion in
2. Embedding The output of the layer is to put all id Class features correspond to embedding The vectors are connected together , And type in DNN in
Model formula ：

4 YoutubeDNN

YouTubeDNN Realization - You know

according to User embedding and item imbedding Use nearest neighbor search Method recall , stay softmax Using negative sampling .

Input data is highly sparse , Use embeding and average pooling Handle , Get users to watch / Search interest

Discrete data can be used embeding Handle , Continuous data can be processed by normalization or bucket division

skip-gram： The word vector is trained by predicting the context word with the head word , The input of the model is the central word , Obtain sequence samples in the form of sliding window , After getting the central word , According to the matrix multiplication of word vector , Get the word vector of the head word , Then multiply it by the context matrix , Get the similarity between the central word and each word , adopt softmax Get similar probability , And choose the one with the greatest probability index Output .
Convert a word into one-hot The vector is multiplied by the embedded matrix to obtain the embedded vector , Send in softmax Get the prediction .
Build loss function （ Suppose the dimension has 10000 individual ）：

Insert picture description here

YoutubeDNN training： The right part is similar to skip-gram, But the central word is used directly embedding obtain , The left part splices the user's feature vectors into a large vector and then DNN Dimension reduction

5 DSSM

Deep Structured Semantic Models The twin tower model

Insert picture description here

stay NLP Medium negative sampling is not target In the sample .k Usually take 5-20, It cannot be uniform sampling , Even sampling will get many words with high frequency , It can be done in the following way ：

Insert picture description here

6 The concept of multitasking

First, what is multi task learning ？

In general classification tasks , Usually, each instance corresponds to one label. As shown in the figure below . The first i Instances only correspond to the second class .

Insert picture description here

However , In multi classification learning , An instance will correspond to multiple label.
Insert picture description here

Loss function ：

[ Failed to transfer the external chain picture , The origin station may have anti-theft chain mechanism , It is suggested to save the pictures and upload them directly (img-JR9CDPYb-1656385819493)(C:\Users\96212\AppData\Roaming\Typora\typora-user-images\image-20220628104734434.png)]

m Sample size
j For the first time j individual lable
Multitasking learning will share the same low-order features
For multi task learning , We can try a large enough neural network to handle all tasks

7 MMOE

Task relationship modeling in multi expert mixed multi task learning

The paper https://dl.acm.org/doi/10.1145/3219819.3220007

MMOE Information about the model ：

Video introduction Of youtube Address ：https://www.youtube.com/watch?v=Dweg47Tswxw
keras Framework implementations The open source address of ：https://github.com/drawbridge/k

Multitask model ： Learn commonalities and differences between different tasks , It can improve the quality and efficiency of modeling .
Multi task model design pattern ：
1. Hard Parameter Sharing Method ： The bottom layer is the shared hidden layer , Learn the common pattern of each task , The upper layer uses some specific full connection layers to learn specific task patterns
2. Soft Parameter Sharing Method ： Underlying does not use shared shared bottom, It's more than one tower, To different tower Assign different weights
3. Task sequence dependency modeling ： This is suitable for different tasks with a certain sequence dependency

Insert picture description here

Hybrid expert system （MoE） It's a neural network , It's one of them combine Model of . Data generation methods applicable to data sets are different . Different from the general neural network, it separates and trains multiple models according to the data , Each model is called Experts , and Door control module Used to choose which expert to use , The actual output of the model is the combination of the output of each model and the weight of the gating model . Each expert model can adopt different functions （ Various linear or nonlinear functions ）. Hybrid expert system is to integrate multiple models into a single task .

Hybrid expert system has two architectures ：competitive MoE and cooperative MoE.competitive MoE The local area of the data in is forcibly concentrated in each discrete space of the data , and cooperative MoE There are no mandatory restrictions .

MoE Model principle ： Based on multiple Expert Aggregate output , Through the gating network mechanism （ Attention networks ） Get each Expert The weight of
be based on OMOE Model , Every Expert Tasks have a gated network
characteristic ：
1. Avoid task conflicts , Adjust according to different door controls , Select those that are helpful for the current task Expert Combine
2. Build relationships between tasks
3. Flexible parameter sharing
4. The model can converge quickly during training

原网站

版权声明
本文为[ECCUSXR]所创，转载请带上原文链接，感谢
https://yzsam.com/2022/188/202207070830198586.html