Keras Deep Learning Practice: Recommendation System Data Encoding
2022-07-27 13:48:00 【Hope Xiaohui】
0. Preface
In "Autoencoders Explained in Detail", we introduced the necessity of data encoding and, using image encoding as an example, implemented the autoencoder (AutoEncoder) and several of its variants. A recommendation system uses customer and product information to recommend items of interest to users based on their interest profiles and purchase behavior. In this section, we will encode the users and movies in a movie-rating dataset.
1. The Necessity of Data Encoding in Recommendation Systems
To understand why data encoding is necessary in a recommendation system, consider the scenario of recommending movies to users. As in text analysis, if we encoded every movie/user individually, then with thousands of movies we would end up with a vector of thousands of dimensions for each movie.
Encoding users in a lower-dimensional space according to their viewing habits lets us group movies by similarity, which helps us discover the movies a user is more likely to watch. The same idea applies to e-commerce recommendation systems and to recommending products to supermarket customers.
2. Recommendation System Data Encoding
In this section, we again consider recommending movies to users. Such a database may contain millions of users and thousands of movies, so we cannot encode each of them individually. This is where data encoding comes in handy. One of the most popular encoding techniques in recommendation systems is matrix factorization. Next, we will see how it works and use it to generate embeddings for users and movies.
2.1 Encoding Users and Movies in the Recommendation System
The principle behind encoding users and movies is as follows: considering users' preferences for different movies, if two users like similar movies, their encoding vectors should be similar; by the same logic, if two movies are similar, for example they belong to the same genre or share a similar cast, their encoding vectors should also be similar.
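To illustrate this principle with cosine similarity (using hypothetical toy rating rows, not data from the actual dataset): two users whose rating patterns agree should score higher similarity than two users with opposite tastes. A minimal sketch:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical rating rows for three users over five movies
user_a = np.array([5.0, 4.0, 1.0, 0.0, 5.0])  # likes movies 0, 1, 4
user_b = np.array([4.0, 5.0, 0.0, 1.0, 4.0])  # similar taste to user_a
user_c = np.array([0.0, 1.0, 5.0, 4.0, 0.0])  # roughly opposite taste

assert cosine(user_a, user_b) > cosine(user_a, user_c)
```

The same comparison, applied to learned low-dimensional encoding vectors rather than raw rating rows, is exactly what the model in this section makes possible.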
2.2 Dataset Introduction
The dataset used for model training contains user information and ratings for movies the users have watched. The first column is the user ID, the second is the movie ID, the third is the user's rating of the movie, and the last is a timestamp. The figure below shows a sample of the downloaded data.

The dataset can be downloaded from the following link: https://pan.baidu.com/s/1yYQw6uuXVsj9PHsT68rx1w, extraction code: ifjr.
2.3 Encoding Strategy for the Recommendation System
Before implementing the model, we first outline the workflow of the encoding strategy, which recommends new movies based on the history of movies a user has watched:
- Load the dataset and assign IDs to users and movies
- Convert each user and each movie into a 32-dimensional encoding vector
- Use the Keras functional API to compute the dot product of the 32-dimensional movie and user vectors:
    - If there are 1,000,000 users and 10,000 movies, the encoded movie matrix has size 10000 × 32 and the user matrix has size 1000000 × 32
    - The dot product of the two then has size 1000000 × 10000
- Flatten the dot-product output and pass it through a fully connected layer, then connect it to the output layer; the output layer uses a linear activation function and outputs values in the range 1 to 5, representing the predicted rating the user would give the movie
- Fit the model
- Extract the embedding weights of the users and movies separately
- Find movies similar to a given movie by computing the similarity between a movie the user is interested in and the other movies in the dataset
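The matrix shapes in the steps above can be checked with a small NumPy sketch (tiny toy sizes stand in for the 1,000,000 users and 10,000 movies):

```python
import numpy as np

rng = np.random.default_rng(0)

n_users, n_movies, n_factors = 6, 4, 32  # toy stand-ins for 1,000,000 and 10,000
user_matrix = rng.normal(size=(n_users, n_factors))
movie_matrix = rng.normal(size=(n_movies, n_factors))

# Dot product of every user vector with every movie vector;
# with the real sizes this would be a 1000000 x 10000 score matrix
scores = user_matrix @ movie_matrix.T
assert scores.shape == (n_users, n_movies)
```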
Next, we encode users and movies into embedding vectors in the recommendation system.
2.4 Implementing the Encoding Model of the Recommendation System
(1) First, import the dataset and the required libraries:
import numpy as np
import pandas as pd
from keras.layers import Input, Embedding, Dense, Flatten, dot
from keras.models import Model
from keras.optimizers import Adam

# Load the ratings file (tab separated, no header row)
column_names = ['User', 'Movies', 'rating', 'timestamp']
ratings = pd.read_csv('u.data', sep='\t', names=column_names)
print(ratings.head())
The first few rows of the dataset, printed with the head() method, are as follows:
User Movies rating timestamp
0 196 242 3 881250949
1 186 302 3 891717742
2 22 377 1 878887116
3 244 51 2 880606923
4 166 346 1 886397596
(2) Convert the users and movies into categorical variables, creating two new columns, User2 and Movies2:
ratings['User2'] = ratings['User'].astype('category')
ratings['Movies2'] = ratings['Movies'].astype('category')
(3) Assign a unique ID to each user and each movie:
users = ratings.User.unique()
movies = ratings.Movies.unique()
print(len(users))
print(len(movies))

# Map raw IDs to contiguous indices, and back
userid2idx = {o: i for i, o in enumerate(users)}
moviesid2idx = {o: i for i, o in enumerate(movies)}
idx2userid = {i: o for i, o in enumerate(users)}
idx2moviesid = {i: o for i, o in enumerate(movies)}
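The mappings are mutual inverses: `userid2idx` turns a raw user ID into a contiguous index, and `idx2userid` turns the index back. A quick check on toy stand-in IDs (the first five user IDs from the sample shown earlier):

```python
# Toy stand-ins for ratings.User.unique(); the dicts are built
# exactly as in the snippet above
users = [196, 186, 22, 244, 166]
userid2idx = {o: i for i, o in enumerate(users)}
idx2userid = {i: o for i, o in enumerate(users)}

assert userid2idx[196] == 0              # first unique ID gets index 0
assert userid2idx[22] == 2
assert idx2userid[userid2idx[22]] == 22  # round trip recovers the raw ID
```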
(4) Add the mapped IDs to the table, overwriting the User2 and Movies2 columns created above:
ratings['Movies2'] = ratings.Movies.apply(lambda x: moviesid2idx[x])
ratings['User2'] = ratings.User.apply(lambda x: userid2idx[x])
print(ratings.head())
Using head() again, the data after adding the new columns looks as follows:
User Movies rating timestamp User2 Movies2
0 196 242 3 881250949 0 0
1 186 302 3 891717742 1 1
2 22 377 1 878887116 2 2
3 244 51 2 880606923 3 3
4 166 346 1 886397596 4 4
(5) Define embeddings for each user ID and movie ID:
n_users = ratings.User.nunique()
n_movies = ratings.Movies.nunique()
The code above extracts the number of distinct users and distinct movies in the dataset. Next, we define the function embedding_input, which takes an ID as input and converts it into an embedding vector of dimension n_out, where n_in is the number of possible input values:
def embedding_input(name, n_in, n_out):
    inp = Input(shape=(1,), dtype='int64', name=name)
    return inp, Embedding(n_in, n_out, input_length=1)(inp)
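Conceptually, the Embedding layer created here is a trainable lookup table: each ID selects one row of an n_in × n_out weight matrix. A minimal NumPy sketch of the mechanics (toy sizes, random weights standing in for the trained ones):

```python
import numpy as np

n_in, n_out = 5, 3  # toy vocabulary size and embedding dimension
rng = np.random.default_rng(1)
table = rng.normal(size=(n_in, n_out))  # stands in for the trainable weight matrix

def embed(idx):
    # An embedding lookup is just a row selection from the weight matrix
    return table[idx]

vec = embed(2)
assert vec.shape == (n_out,)
assert np.array_equal(vec, table[2])
```

During training, backpropagation updates only the rows of the table that were looked up, which is what makes embeddings practical for large ID spaces.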
Next, extract a 32-dimensional encoding vector for each user and each movie:
n_factors = 32
user_in, u = embedding_input('user_in', n_users, n_factors)
movie_in, a = embedding_input('article_in', n_movies, n_factors)
(6) Build the neural network model:
import keras.backend as K

def rmse(y_true, y_pred):
    # Root mean squared error metric
    return K.sqrt(K.mean(K.pow(y_true - y_pred, 2)))

# Dot product of the user and movie embeddings
x = dot([u, a], axes=1)
x = Flatten()(x)
x = Dense(500, activation='relu')(x)
x = Dense(1)(x)
model = Model([user_in, movie_in], x)
adam = Adam(lr=0.01)
model.compile(adam, loss='mse', metrics=[rmse])
model.summary()
The model's architecture summary is as follows:
Model: "functional_1"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
user_in (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
article_in (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
embedding (Embedding) (None, 1, 32) 30176 user_in[0][0]
__________________________________________________________________________________________________
embedding_1 (Embedding) (None, 1, 32) 53824 article_in[0][0]
__________________________________________________________________________________________________
dot (Dot) (None, 32, 32) 0 embedding[0][0]
embedding_1[0][0]
__________________________________________________________________________________________________
flatten (Flatten) (None, 1024) 0 dot[0][0]
__________________________________________________________________________________________________
dense (Dense) (None, 500) 512500 flatten[0][0]
__________________________________________________________________________________________________
dense_1 (Dense) (None, 1) 501 dense[0][0]
==================================================================================================
Total params: 597,001
Trainable params: 597,001
Non-trainable params: 0
__________________________________________________________________________________________________
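As a sanity check, the parameter counts in the summary can be reproduced by hand. The embedding sizes imply 943 distinct users and 1682 distinct movies (30176 / 32 and 53824 / 32), and the Flatten layer feeds 32 × 32 = 1024 values into the 500-unit dense layer:

```python
n_users, n_movies, n_factors = 943, 1682, 32  # sizes implied by the summary

user_emb = n_users * n_factors                 # 30,176
movie_emb = n_movies * n_factors               # 53,824
dense = (n_factors * n_factors) * 500 + 500    # 1024 inputs * 500 units + biases
dense_out = 500 + 1                            # 500 inputs + 1 bias

assert user_emb == 30176
assert movie_emb == 53824
assert dense == 512500
assert user_emb + movie_emb + dense + dense_out == 597001  # total params
```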
(7) Fit the model:
model.fit([ratings.User2,ratings.Movies2], ratings.rating,
epochs=50,
batch_size=128)
(8) Extract the vectors of the users and movies:
# Extracting user vectors
model.get_weights()[0]
# Extracting movie vectors
model.get_weights()[1]
(9) Finally, we verify that similar movies have similar embeddings. Cosine similarity is the usual measure of similarity between embeddings. The following code compares the movie at index 600 with the first 600 movies:
from sklearn.metrics.pairwise import cosine_similarity

movie_vectors = model.get_weights()[1]
print(np.argmax(cosine_similarity(movie_vectors[600].reshape(1, -1),
                                  movie_vectors[:600])))
The code above prints the index of the movie most similar to the movie at position 600:
89
Checking the movie ID list confirms that the movie most similar to movie 600 is indeed movie 89.
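The same nearest-neighbor check can be wrapped in a small helper. The sketch below (the `most_similar` function is our own illustration, not part of the original code) normalizes the embedding rows so cosine similarity becomes a plain dot product, then compares the query row against all rows before it, mirroring the comparison of movie 600 with movies 0..599 above:

```python
import numpy as np

def most_similar(embeddings, query_idx, top_k=1):
    # Normalize rows; cosine similarity is then a plain dot product
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e[:query_idx] @ e[query_idx]       # similarity against earlier rows
    return np.argsort(sims)[::-1][:top_k]     # indices of the most similar rows

# Hypothetical embedding matrix: make row 7 a scaled copy of row 3,
# so their cosine similarity is exactly 1
rng = np.random.default_rng(2)
emb = rng.normal(size=(10, 4))
emb[7] = 2.0 * emb[3]

assert most_similar(emb, 7, top_k=1)[0] == 3
```

Applied to `model.get_weights()[1]`, this would return the indices of the movies whose embeddings are closest to a given movie's embedding.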
Related links
Keras Deep Learning Practice (1): Fundamentals of Neural Networks and the Model Training Process
Keras Deep Learning Practice (2): Building Neural Networks with Keras
Keras Deep Learning Practice (7): Convolutional Neural Networks Explained and Implemented
Keras Deep Learning Practice (16): Autoencoders Explained in Detail