01_ Movie recommendation (content-based)_ Item profile

2022-07-27 10:11:00 【Big data Da Wenxi】

Content-based movie recommendation: item profiles (object portraits)

Steps for building an item profile:

- Use the tags of each movie in tags.csv as that movie's candidate keywords
- Use TF·IDF to compute the tfidf value of each movie's tags, and select the TOP-N keywords as the movie's profile tags
- Add the movie's genre words directly as profile tags

Feature extraction based on TF-IDF

As mentioned earlier, the feature tags of an item profile mainly come from structured data such as a movie's director and actors or a book's author and publisher. Extracting features from such data, and in particular computing the feature vector, is relatively simple: for example, each genre of a work can be encoded directly as a 0 or 1 state.
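The 0/1 encoding of structured features can be sketched as follows. This is only an illustration: the function name `multi_hot` and the genre vocabulary are assumptions made for the example, not part of the original code.

```python
def multi_hot(genres, vocabulary):
    """Return a 0/1 vector marking which vocabulary genres the movie has."""
    present = set(genres)
    return [1 if genre in present else 0 for genre in vocabulary]


# Hypothetical genre vocabulary and movie, for illustration only
vocabulary = ["Action", "Adventure", "Comedy", "Drama"]
vector = multi_hot(["Adventure", "Comedy"], vocabulary)
print(vector)  # [0, 1, 1, 0]
```

Each position of the vector corresponds to one genre, so two movies can be compared position by position.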

Other features, however, come from text such as a movie's synopsis, its reviews, or a book's abstract. This is called unstructured data. It is just as much a feature of the item, but it is hard to quantify, that is, it is hard to define how its feature vector should be computed.

Techniques from natural language processing and information retrieval are therefore needed to quantify unstructured data such as users' text comments and other textual content, so that a more complete item profile or user profile can be built.

TF-IDF is one of the most widely used algorithms in natural language processing. It can be used to extract keywords from a target document and compute their weights; these weights are combined to form the document's feature vector.

Algorithm principle

TF-IDF is a method from natural language processing for computing the weight of a word or phrase in a document. It is the product of term frequency (TF) and inverse document frequency (IDF). TF is the number of times a given word appears in a document. This count is usually normalized so that it is not biased toward long documents (the same word is likely to have a higher raw count in a long document than in a short one, whether or not the word is important). IDF measures how generally important a word is: the IDF of a word is obtained by dividing the total number of documents by the number of documents containing the word, and then taking the logarithm of the quotient.

TF-IDF is based on the following assumption: a word that appears frequently in the target document and rarely in other documents can be used to tell the target document apart. The assumption has two parts:

- high frequency in this document;
- low frequency in other documents.

The calculation therefore splits into two parts, term frequency (TF) and inverse document frequency (IDF), and the weight of each word in a document is set from TF and IDF together.

TF is the frequency of a word within a document. Suppose the document set contains $N$ documents, the number of documents containing keyword $k_i$ is $n_i$, $f_{ij}$ is the number of times keyword $k_i$ appears in document $d_j$, and $f_{dj}$ is the total number of words in document $d_j$. The term frequency $TF_{ij}$ of $k_i$ in $d_j$ is defined as: $TF_{ij}=\frac {f_{ij}}{f_{dj}}$ . Dividing by $f_{dj}$ is the normalization that keeps the value from being biased toward long documents (the same word is likely to have a higher raw count in a long document than in a short one, whether or not the word is important).

IDF measures how generally important a word is, based on how many documents in the whole set contain it. Taking the logarithm of the ratio between the total number of documents and the number of documents containing $k_i$ gives the inverse document frequency $IDF_i$ of keyword $k_i$ : $IDF_i=\log\frac {N}{n_i}$

The weight of a word computed from TF and IDF is: $w_{ij}=TF_{ij}\cdot IDF_{i}=\frac {f_{ij}}{f_{dj}}\cdot\log\frac {N}{n_i}$

Conclusion: a word's TF-IDF increases with the number of times it occurs in the document, and decreases as the number of documents in the whole set that contain the word grows.

Usage: to extract the keywords (feature tags) of a target document, compute the TF-IDF of every word in the document, compare the values, and take the $k$ words with the largest TF-IDF. Their weights form the feature vector that represents the target document.

Note: stop words in the document, such as "is" and "of", carry no meaning for the document's central idea. Filter them out during word segmentation, then compute the TF-IDF values of the remaining words.
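The definitions above can be put together in a few lines of plain Python. This is a minimal sketch, not the implementation used later in this article: the function name `tfidf_top_k`, the toy corpus, and the choice of a base-10 logarithm are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_top_k(documents, k, stop_words=frozenset()):
    """Return, for each tokenized document, its k highest TF-IDF (word, weight) pairs."""
    # Drop stop words before any counting
    docs = [[t for t in doc if t not in stop_words] for doc in documents]
    n_docs = len(docs)  # N
    # n_i: number of documents that contain each word
    doc_freq = Counter(word for doc in docs for word in set(doc))
    results = []
    for doc in docs:
        counts = Counter(doc)  # f_ij for every word in this document
        total = len(doc)       # f_dj
        weights = {
            word: (f / total) * math.log10(n_docs / doc_freq[word])
            for word, f in counts.items()
        }
        top = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]
        results.append(top)
    return results


# Hypothetical toy corpus: three already-tokenized documents
corpus = [
    ["the", "pirate", "ship", "pirate"],
    ["the", "captain", "ship"],
    ["the", "treasure"],
]
top_words = tfidf_top_k(corpus, k=1, stop_words={"the"})
print([pairs[0][0] for pairs in top_words])  # ['pirate', 'captain', 'treasure']
```

A word such as "ship" that occurs in two of the three documents gets a lower weight than a word that occurs in only one, which is the behaviour the assumption behind TF-IDF asks for.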

Algorithm example

Take TF-IDF computed over movie reviews, with the movie "Pirates of the Caribbean: The Curse of the Black Pearl" as the example. Suppose the movie has 1000 reviews in total and one of those reviews contains 200 words. The most frequent words in that review are "pirates", "captain" and "freedom", appearing 20, 15 and 10 times respectively, and the numbers of reviews that mention these three words are 1000, 500 and 100. The ranking of the three words as keywords is computed as follows.

1. Filter out the stop words in the review and compute the term frequency of the remaining words. For the three most frequent words:
   - the term frequency of "pirates" is 20/200=0.1
   - the term frequency of "captain" is 15/200=0.075
   - the term frequency of "freedom" is 10/200=0.05
2. Compute the inverse document frequency of each word (base-10 logarithm):
   - the IDF of "pirates" is log(1000/1000)=0
   - the IDF of "captain" is log(1000/500)≈0.3
   - the IDF of "freedom" is log(1000/100)=1
3. Multiply the results of steps 1 and 2 to get the TF-IDF values: "pirates" is 0, "captain" is 0.0225, "freedom" is 0.05.

Comparing these values, the keywords of the review in order of importance are "freedom", "captain", "pirates". Arranging the TF-IDF values of these words as weights in the corresponding order gives the feature vector of the review. This vector is used to represent the review, and the size of the component in each dimension corresponds to the importance of that attribute.
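The arithmetic of this example can be checked with a short script. It is only a sketch of the hand calculation; the variable names are chosen for illustration. Note that the unrounded weight of "captain" is about 0.0226, because the text rounds its IDF to 0.3 before multiplying.

```python
import math

total_reviews = 1000    # N
words_in_review = 200   # f_dj
# word: (occurrences in this review, number of reviews containing the word)
stats = {"pirates": (20, 1000), "captain": (15, 500), "freedom": (10, 100)}

weights = {
    word: (count / words_in_review) * math.log10(total_reviews / n_reviews)
    for word, (count, n_reviews) in stats.items()
}

# Rank the words from most to least important
ranking = sorted(weights, key=weights.get, reverse=True)
print(ranking)  # ['freedom', 'captain', 'pirates']
for word in ranking:
    print(word, round(weights[word], 4))
```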

Each review vector in the movie's full review set is multiplied by a specific coefficient and the results are summed, giving the movie's overall review vector. Combined with the movie's basic attributes, this forms the item profile of the movie. User profiles are built in the same way, and any of several methods can then be used to compute the similarity between item profiles and user profiles in order to make recommendations to users.
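A rough sketch of this last step is given below. The text does not say which coefficients or which similarity measure to use, so equal coefficients and cosine similarity are assumptions here, and the review vectors and the user profile are made-up toy values.

```python
import math


def combine(vectors, coefficients):
    """Weighted sum of sparse vectors given as {word: weight} dicts."""
    combined = {}
    for vector, coef in zip(vectors, coefficients):
        for word, weight in vector.items():
            combined[word] = combined.get(word, 0.0) + coef * weight
    return combined


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either one is all zeros."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical review vectors of one movie and a hypothetical user profile
review_vectors = [
    {"freedom": 0.05, "captain": 0.0225},
    {"freedom": 0.03, "treasure": 0.04},
]
movie_vector = combine(review_vectors, coefficients=[0.5, 0.5])  # simple average
user_profile = {"freedom": 1.0, "treasure": 0.5}
print(cosine_similarity(movie_vector, user_profile))
```

The closer the result is to 1, the better the movie's overall review vector matches the user's profile.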

Load the dataset

import pandas as pd
import numpy as np
'''
- Use the tags of each movie in tags.csv as that movie's candidate keywords
- Use TF·IDF to compute the tfidf value of each movie's tags, and select the TOP-N keywords as the movie's profile tags
- Add the movie's genre words directly as profile tags
'''

def get_movie_dataset():
    # Load the tags for all movies
    # all-tags.csv comes from the ml-latest dataset
    # ml-latest-small has too little tag data, so ml-latest is used to expand it
    _tags = pd.read_csv("datasets/ml-latest-small/all-tags.csv", usecols=range(1, 3)).dropna()
    tags = _tags.groupby("movieId").agg(list)

    # Load the movie list dataset
    movies = pd.read_csv("datasets/ml-latest-small/movies.csv", index_col="movieId")
    # Split the genre string into a list of genre words
    movies["genres"] = movies["genres"].apply(lambda x: x.split("|"))
    # Attach the tag data to each movie; movies without tags get NaN
    movies_index = set(movies.index) & set(tags.index)
    new_tags = tags.loc[list(movies_index)]
    ret = movies.join(new_tags)

    # Build the movie dataset with four fields: movieId, title, genres, tags
    # If a movie has no tag data, use an empty list instead
    # Each x is (movieId, title, genres, tag); tag is a list, or NaN when missing
    movie_dataset = pd.DataFrame(
        map(
            lambda x: (x[0], x[1], x[2], x[2] + x[3]) if isinstance(x[3], list) else (x[0], x[1], x[2], []),
            ret.itertuples(),
        ),
        columns=["movieId", "title", "genres", "tags"],
    )

    movie_dataset.set_index("movieId", inplace=True)
    return movie_dataset

movie_dataset = get_movie_dataset()
print(movie_dataset)

Extract the TOP-N keywords with TF·IDF and build the movie profiles

from gensim.models import TfidfModel

import pandas as pd
import numpy as np

from pprint import pprint

# ......

def create_movie_profile(movie_dataset):
    '''
    Use tfidf to analyze the tags and extract the top-n keywords
    :param movie_dataset:
    :return:
    '''
    dataset = movie_dataset["tags"].values

    from gensim.corpora import Dictionary
    # Build a bag-of-words dictionary from the dataset: it collects every word,
    # counts frequencies, and maps each word to an integer index
    dct = Dictionary(dataset)
    # Convert each row into a list of (word index, word count) pairs
    corpus = [dct.doc2bow(line) for line in dataset]
    # Train the TF-IDF model, which computes the TF-IDF values
    # Note: gensim's defaults use a base-2 logarithm for IDF and L2-normalize each
    # vector, so the values differ in scale from the hand calculation above
    model = TfidfModel(corpus)

    movie_profile = {}
    for i, mid in enumerate(movie_dataset.index):
        # Get the TF-IDF vector for this row
        vector = model[corpus[i]]
        # Take the top-n keywords by TF-IDF value
        movie_tags = sorted(vector, key=lambda x: x[1], reverse=True)[:30]
        # Map each keyword index back to the word itself
        movie_profile[mid] = dict(map(lambda x:(dct[x[0]], x[1]), movie_tags))

    return movie_profile

movie_dataset = get_movie_dataset()
pprint(create_movie_profile(movie_dataset))

Complete the keywords with the genre words

from gensim.models import TfidfModel

import pandas as pd
import numpy as np

from pprint import pprint

# ......

def create_movie_profile(movie_dataset):
    '''
    Use tfidf to analyze the tags and extract the top-n keywords
    :param movie_dataset:
    :return:
    '''
    dataset = movie_dataset["tags"].values

    from gensim.corpora import Dictionary
    # Build a bag-of-words dictionary from the dataset: it collects every word,
    # counts frequencies, and maps each word to an integer index
    dct = Dictionary(dataset)
    # Convert each row into a list of (word index, word count) pairs
    corpus = [dct.doc2bow(line) for line in dataset]
    # Train the TF-IDF model, which computes the TF-IDF values
    model = TfidfModel(corpus)

    _movie_profile = []
    for i, data in enumerate(movie_dataset.itertuples()):
        mid = data[0]
        title = data[1]
        genres = data[2]
        vector = model[corpus[i]]
        movie_tags = sorted(vector, key=lambda x: x[1], reverse=True)[:30]
        topN_tags_weights = dict(map(lambda x: (dct[x[0]], x[1]), movie_tags))
        # Add the genre words, each with weight 1.0
        for g in genres:
            topN_tags_weights[g] = 1.0
        topN_tags = list(topN_tags_weights)
        _movie_profile.append((mid, title, topN_tags, topN_tags_weights))

    movie_profile = pd.DataFrame(_movie_profile, columns=["movieId", "title", "profile", "weights"])
    movie_profile.set_index("movieId", inplace=True)
    return movie_profile

movie_dataset = get_movie_dataset()
pprint(create_movie_profile(movie_dataset))

To quickly find the movies that match a given keyword, build an inverted index over the profile tags of the items.

Introduction to inverted indexes

Data is usually stored with the item ID as the index, which is used to look up the item's other information.

An inverted index instead uses the item's other data as the index, which is used to look up the list of matching item IDs.

# ......

# Build the tag -> movies inverted index

def create_inverted_table(movie_profile):
    inverted_table = {}
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for mid, weights in movie_profile["weights"].items():
        for tag, weight in weights.items():
            # Get the list stored under tag (creating an empty one if missing) and append to it
            inverted_table.setdefault(tag, []).append((mid, weight))
    return inverted_table

# movie_profile is the DataFrame returned by create_movie_profile above
movie_profile = create_movie_profile(get_movie_dataset())
inverted_table = create_inverted_table(movie_profile)
pprint(inverted_table)
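Once the inverted index exists, finding the movies for a keyword is a dictionary lookup followed by a sort. The helper below is a hypothetical addition for illustration, not part of the original code, and it runs on a small hand-made index with the same shape as the one built by create_inverted_table (tag mapped to a list of (movieId, weight) pairs).

```python
def top_movies_for_tag(inverted_table, tag, n=10):
    """Return up to n (movieId, weight) pairs for a tag, highest weight first."""
    postings = inverted_table.get(tag, [])
    return sorted(postings, key=lambda pair: pair[1], reverse=True)[:n]


# Hypothetical toy index, for illustration only
toy_table = {
    "pirate": [(1, 0.42), (2, 0.91), (3, 0.10)],
    "Adventure": [(1, 1.0), (3, 1.0)],
}
print(top_movies_for_tag(toy_table, "pirate", n=2))  # [(2, 0.91), (1, 0.42)]
print(top_movies_for_tag(toy_table, "western"))      # []
```

Because genre words are stored with weight 1.0, they rank at or above any TF-IDF keyword when the two kinds of tag are mixed in one query.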