
Introduction to the classic Wide & Deep model, with a TensorFlow 2 implementation

2022-06-26 21:34:00 Romantic data analysis

Goal:

Introduce the classic deep recommendation model Wide & Deep. The full paper is titled 《Wide & Deep Learning for Recommender Systems》; Google proposed the Wide & Deep model in 2016.


Content:

The Zhihu article this is based on is concise and clear, so it is excerpted directly here; see the original post on Zhihu.

This article introduces the classic deep recommendation model Wide & Deep. The full paper is titled 《Wide & Deep Learning for Recommender Systems》.

One. Model introduction

The architecture of the wide & deep model is shown in the figure below.

[Figure: Wide & Deep model architecture]
As the figure shows, the wide & deep model is split into a wide part and a deep part.

  • The wide part is a simple linear model. It is not limited to single raw features; in practice crossed features are used as well. For example, users who buy book A often also buy book B, so the two features are strongly correlated and their combination can be trained as a cross feature.
  • The deep part is a feedforward neural network.
  • The linear model and the feedforward network are trained jointly:
    • the wide part can be understood as a linear model that provides memorization,
    • the deep part is good at generalization,
    • combining the two gives the model both memorization and generalization, so its recommendations are more accurate.

Two. Recommender system architecture

[Figure: Overview of the recommender system]

When a user request arrives, the recommender system first picks O(100) items the user may be interested in out of a very large item corpus (the retrieval, or recall, stage). These O(100) items are then fed into the model for ranking, and the top N items by model score are returned to the user. Meanwhile, the user's responses to the displayed items (clicks, purchases, and so on) are logged together with the user features, context features, and item features; after processing, these logs yield new training data for the model. The paper focuses on the ranking model built on the Wide & Deep architecture.

Three. Wide part

The wide part is simply a linear model y = w^T x + b. Here y is the prediction target, x = [x1, x2, …, xd] is a vector of d features, w = [w1, w2, …, wd] are the model parameters, and b is the bias. The d features include both the raw input features and transformed features.

One of the most important transformed features is the cross-product transformation. Suppose x1 is gender (x1 = 0 for male, x1 = 1 for female) and x2 is a preference (x2 = 0 for not liking watermelon, x2 = 1 for liking it). We can then use x1 and x2 to construct a new feature x3 = (x1 AND x2): x3 = 1 means a girl who likes watermelon, and x3 = 0 otherwise. The resulting x3 is the cross-product transformation. Its purpose is to capture the effect of the feature interaction on the prediction target and to add nonlinearity to the linear model.

This step amounts to manually extracting important interactions between pairs of features. In TensorFlow this is done with the function tf.feature_column.crossed_column; a minimal sketch follows.
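
As a minimal sketch of this API, using the gender/watermelon example above (the feature names are assumptions for illustration, not from the original code):

import tensorflow as tf

# Hypothetical columns for the gender x hobby example above
gender = tf.feature_column.categorical_column_with_vocabulary_list(
    'gender', ['male', 'female'])
likes_watermelon = tf.feature_column.categorical_column_with_vocabulary_list(
    'likes_watermelon', ['no', 'yes'])
# The cross of the two features is hashed into a fixed number of buckets
gender_x_hobby = tf.feature_column.crossed_column(
    [gender, likes_watermelon], hash_bucket_size=10)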

Four. Deep part

The deep part is a feedforward neural network. High-dimensional sparse categorical features are first transformed into low-dimensional dense vectors (an embedding operation) and then fed into the hidden layers of the network for training. Each hidden layer computes the following:

a^{(l+1)} = f(W^{(l)} a^{(l)} + b^{(l)})

Here f is the activation function (e.g. ReLU), a^{(l)} is the output of the previous hidden layer, W^{(l)} is the parameter matrix to be trained, and b^{(l)} is the bias.
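
A tiny numeric illustration of one hidden-layer step (shapes and values here are made up):

import numpy as np

a = np.array([0.5, -1.2, 3.0])        # output of the previous layer, a^(l)
W = np.random.randn(4, 3)             # trainable weights W^(l)
b = np.zeros(4)                       # bias b^(l)
a_next = np.maximum(0.0, W @ a + b)   # f = ReLU, giving a^(l+1)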

Five. Training wide and deep together

The outputs of the wide and deep parts are combined by a weighted sum and then trained jointly with a logistic loss function. The wide part is typically trained with FTRL, and the deep part with AdaGrad.
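
TensorFlow ships a canned estimator for exactly this joint training scheme; a hedged sketch follows (wide_columns and deep_columns are placeholders for the crossed and embedding columns built in this article):

import tensorflow as tf

estimator = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=wide_columns,  # wide part
    linear_optimizer='Ftrl',              # FTRL for the wide part
    dnn_feature_columns=deep_columns,     # deep part
    dnn_optimizer='Adagrad',              # AdaGrad for the deep part
    dnn_hidden_units=[128, 128])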

For a logistic regression problem, the prediction formula is:

P(Y = 1 | x) = σ( w_wide^T [x, φ(x)] + w_deep^T a^{(l_f)} + b )

where σ is the sigmoid function, φ(x) denotes the cross-product transformations of x, and a^{(l_f)} is the final hidden-layer activation of the deep part.

Six. System implementation

The implementation of the recommender system is divided into three stages: data generation, model training, and model serving, as shown below.

[Figure: Data generation, model training, and model serving pipeline]

(1) Data generation stage

In this stage, the most recent N days of user and item logs are used to generate training data. Each displayed item gets a target label, e.g. 1 if the user clicked it and 0 if not. A minimal sketch of this labeling follows.
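
A minimal sketch of the labeling logic, assuming a hypothetical impression log (the record format is an assumption for illustration):

# Each record: (item id, whether the user clicked it) -- hypothetical log
impressions = [('item_1', True), ('item_2', False), ('item_3', False)]
samples = [{'item': item, 'label': int(clicked)}
           for item, clicked in impressions]
# -> [{'item': 'item_1', 'label': 1}, {'item': 'item_2', 'label': 0}, ...]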

The Vocabulary Generation step in the figure handles data transformation: categorical features are converted to their integer IDs, continuous real-valued features are mapped into [0, 1] according to their cumulative distribution, and so on. A sketch of both transformations follows.
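
A hedged sketch of the two transformations (the vocabulary and sample values are made up for illustration):

import numpy as np

# Categorical feature -> integer id via a vocabulary
vocab = {'beijin': 0, 'shanghai': 1, 'shenzhen': 2}
city_id = vocab['shanghai']            # -> 1

# Continuous feature -> [0, 1] via the empirical cumulative distribution
values = np.array([3.2, 7.5, 1.1, 9.8, 4.4])
def ecdf_normalize(x, samples):
    # fraction of observed samples <= x approximates the cumulative probability
    return float(np.mean(samples <= x))
ecdf_normalize(4.4, values)            # -> 0.6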

(2) Model training stage

The data generation stage produces training samples consisting of sparse features, dense features, and labels; these samples are fed into the model for training, as shown below.

[Figure: Wide & Deep model structure used for ranking in the paper]

The wide part contains the features produced by the cross-product transformation. In the deep part, the categorical features first pass through an embedding layer and are then concatenated with the dense features; the result goes through 3 hidden layers and is finally combined with the wide part and passed through a sigmoid output.

The paper also mentions that because Google's training set exceeds 500 billion samples, retraining on all samples from scratch each time is very costly and slow. To solve this, a new model is initialized with the embedding parameters and the linear-model weights of the old model.
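
One hedged way to express this warm start with TensorFlow's estimator API (the checkpoint path, variable-name regex, and column lists are assumptions, not from the paper):

import tensorflow as tf

ws = tf.estimator.WarmStartSettings(
    ckpt_to_initialize_from='/path/to/previous/model',  # placeholder path
    vars_to_warm_start='.*(embedding|linear).*')        # regex over variable names
estimator = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=wide_columns,   # placeholder wide columns
    dnn_feature_columns=deep_columns,      # placeholder deep columns
    dnn_hidden_units=[128, 128],
    warm_start_from=ws)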

(3) Model serving stage

Once training checks out, the model can go online. For each user request, the server first selects a candidate set of items the user may be interested in, then scores these candidates with the model, sorts the predicted scores from high to low, and returns the top N of the ranking to the user.

  • Note that the Wide & Deep model is used in the ranking layer, not in the retrieval layer. Why? Because training and inference with the Wide & Deep model involve a fair amount of computation and take considerable time; using it in the retrieval stage over a large-scale item corpus would be prohibitively expensive.
  • Only after the retrieval layer has cut the candidates from millions down to hundreds can the Wide & Deep model rank the ~100 recalled items and return the top N (e.g. N = 20). A minimal serving sketch follows.
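
A minimal serving-side sketch of this ranking step (model, candidate_features, and candidate_ids are placeholders for a trained Keras model and its inputs):

import numpy as np

def rank_candidates(model, candidate_features, candidate_ids, n=20):
    # One score per recalled item, sorted from high to low
    scores = model.predict(candidate_features).reshape(-1)
    order = np.argsort(-scores)
    return [candidate_ids[i] for i in order[:n]]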

Seven. Summary

The Wide & Deep model is used in the ranking layer and is split into a wide part and a deep part.
* The wide part can be understood as a linear model that provides memorization.
* The deep part is good at generalization.
* Combining the two gives the model both memorization and generalization, so its recommendations are more accurate.


Eight. Code

First implement the deep part, then the wide part, then concatenate the two outputs and apply a final sigmoid activation.

deep part

1、 Embed the categorical features:

import tensorflow as tf

# genre features vocabulary
genre_vocab = ['beijin', 'shanghai', 'shenzhen', 'chengdu', 'xian', 'suzhou', 'guangzhou']

GENRE_FEATURES = {
    'city': genre_vocab
}

# all categorical features
categorical_columns = []
for feature, vocab in GENRE_FEATURES.items():
    cat_col = tf.feature_column.categorical_column_with_vocabulary_list(
        key=feature, vocabulary_list=vocab)
    # embed each categorical feature into a 10-dimensional dense vector
    emb_col = tf.feature_column.embedding_column(cat_col, 10)
    categorical_columns.append(emb_col)

2、 Concatenate the embedding vectors with the ordinary numerical features and feed them into an MLP. (The snippet assumes an inputs dict and a user_numerical_columns list; a hedged sketch of those prerequisites follows.)
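
A minimal hedged sketch of the prerequisites assumed by the snippet below (the feature names are made up, not from the original repo):

# Placeholders assumed by the deep-part snippet below
user_numerical_columns = [
    tf.feature_column.numeric_column('age'),        # hypothetical feature
    tf.feature_column.numeric_column('avg_spend'),  # hypothetical feature
]
inputs = {
    'city': tf.keras.Input(name='city', shape=(), dtype='string'),
    'age': tf.keras.Input(name='age', shape=(), dtype='float32'),
    'avg_spend': tf.keras.Input(name='avg_spend', shape=(), dtype='float32'),
}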

# deep part for all input features
deep = tf.keras.layers.DenseFeatures(user_numerical_columns + categorical_columns)(inputs)
deep = tf.keras.layers.Dense(128, activation='relu')(deep)
deep = tf.keras.layers.Dense(128, activation='relu')(deep)

wide part

1、 Manually pick out pairs of important, mutually correlated features and cross them, using TensorFlow's tf.feature_column.crossed_column:

# movie_col and rated_movie are categorical columns defined elsewhere in the repo
tf.feature_column.crossed_column([movie_col, rated_movie], 10000)

2、 Then multi-hot encode the crossed feature:

# multi-hot encode the hashed cross of the two categorical columns
crossed_feature = tf.feature_column.indicator_column(
    tf.feature_column.crossed_column([movie_col, rated_movie], 10000))

3、 Turn the sparse feature into a dense vector:

# wide part for cross feature
wide = tf.keras.layers.DenseFeatures(crossed_feature)(inputs)

wide+deep

The outputs of the two parts are concatenated and fed into a single neuron with a sigmoid activation, which produces the final predicted score.

# concatenate the deep and wide outputs, then one sigmoid neuron for the score
both = tf.keras.layers.concatenate([deep, wide])
output_layer = tf.keras.layers.Dense(1, activation='sigmoid')(both)
model = tf.keras.Model(inputs, output_layer)
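
To reproduce the metrics in the log below, a hedged compile step (the optimizer choice is an assumption; this simple Keras version does not reproduce the paper's FTRL/AdaGrad split):

model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy',
             tf.keras.metrics.AUC(curve='ROC'),   # logged as auc
             tf.keras.metrics.AUC(curve='PR')])   # logged as auc_1
# model.fit(train_dataset, epochs=5)  # train_dataset is a placeholder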

Program output:

Note: this uses the same data as the previous 3 articles. Results:

Epoch 1/5
5319/5319 [==============================] - 115s 21ms/step - loss: 67599.8828 - accuracy: 0.5150 - auc: 0.5041 - auc_1: 0.4693
Epoch 2/5
5319/5319 [==============================] - 114s 21ms/step - loss: 0.6549 - accuracy: 0.6526 - auc: 0.7150 - auc_1: 0.6806
Epoch 3/5
5319/5319 [==============================] - 118s 22ms/step - loss: 0.6326 - accuracy: 0.6722 - auc: 0.7363 - auc_1: 0.7065
Epoch 4/5
5319/5319 [==============================] - 116s 22ms/step - loss: 0.6173 - accuracy: 0.6792 - auc: 0.7410 - auc_1: 0.7133
Epoch 5/5
5319/5319 [==============================] - 113s 21ms/step - loss: 0.6067 - accuracy: 0.6840 - auc: 0.7435 - auc_1: 0.7176
1320/1320 [==============================] - 21s 15ms/step - loss: 0.6998 - accuracy: 0.5645 - auc: 0.5718 - auc_1: 0.5391


Test Loss 0.6997529864311218, Test Accuracy 0.5645247101783752, Test ROC AUC 0.5717922449111938, Test PR AUC 0.539068877696991

As you can see, the model becomes more accurate over the course of training, though training is also fairly time-consuming.
On the test set, however, accuracy does not improve much over the training results. Possible reasons: the test set may contain many new users with no prior behavioral data, or there may simply not be enough training data, so the model's accuracy is not very high.

Nine. Complete code on GitHub

Address: https://github.com/jiluojiluo/recommenderSystemForFlowerShop
