01_ Movie recommendation (content-based)_ Item profile
2022-07-27 10:11:00 【Big data Da Wenxi】
Content-based movie recommendation: item profiles (sometimes rendered "object portraits")
Steps for building an item profile:
- Use the tags attached to each movie in tags.csv as that movie's candidate keywords
- Use TF-IDF to compute a TF-IDF value for each of a movie's tags, and select the TOP-N keywords as the movie's profile labels
- Add the movie's genre words directly to its profile labels
Feature extraction based on TF-IDF
As mentioned earlier, the feature labels of an item profile mainly come from structured data, such as a film's director and actors or a book's author and publisher. Feature extraction for these, and in particular computing their feature vectors, is relatively simple: a work's category, for example, can be encoded directly as a 0 or 1 state.
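The 0/1 encoding of a structured feature can be sketched as follows. This is a minimal illustration; the genre vocabulary, the sample movie, and the helper name `encode_genres` are all made up for the example.

```python
def encode_genres(genres, vocabulary):
    """Return a 0/1 vector: 1 where the movie has the genre, 0 otherwise."""
    present = set(genres)
    return [1 if g in present else 0 for g in vocabulary]

# Hypothetical genre vocabulary and movie, for illustration only.
vocabulary = ["Action", "Adventure", "Comedy", "Drama"]
movie_genres = ["Adventure", "Action"]

vector = encode_genres(movie_genres, vocabulary)
print(vector)
```

Each position of the vector corresponds to one genre in the vocabulary, so two movies can be compared position by position.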
Other features, however, are text: a film's synopsis, its reviews, a book's abstract. This is called unstructured data. It should also count as a feature label of the item, but it is hard to quantify, that is, hard to define when computing the item's feature vector.
It is therefore necessary to use techniques from natural language processing and information retrieval to quantify unstructured data, such as users' text comments or other text content, in order to build a more complete item profile or user profile.
TF-IDF is one of the most widely used algorithms in natural language processing. It can be used to extract keywords from a target document and compute a weight for each of them; combining these weights gives the document's feature vector.
Algorithm principle
TF-IDF is a method from natural language processing for computing the weight of a word or phrase in a document. It is the product of term frequency (Term Frequency, TF) and inverse document frequency (Inverse Document Frequency, IDF). TF is the number of times a given word appears in a document. This count is usually normalized so that it is not biased toward long documents (the same word may have a higher raw count in a long document than in a short one, whether or not the word is important). IDF measures how common a word is across the whole collection: the IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing the word and then taking the logarithm of the quotient.
The TF-IDF algorithm rests on one assumption: if a word appears frequently in the target document and rarely in other documents, the word is useful for distinguishing the target document. Two things matter in this assumption:
- high frequency in this document;
- low frequency in other documents.
The calculation therefore splits into two parts, term frequency (TF) and inverse document frequency (IDF), and the weight of a word in a document is set from TF and IDF together.
TF is the frequency of a word in a document. Suppose the document set contains $N$ documents, the number of documents containing keyword $k_i$ is $n_i$, $f_{ij}$ is the number of times $k_i$ appears in document $d_j$, and $f_{dj}$ is the total number of words in $d_j$. The term frequency $TF_{ij}$ of $k_i$ in $d_j$ is defined as: $TF_{ij}=\frac{f_{ij}}{f_{dj}}$. Dividing by $f_{dj}$ is the normalization that keeps the measure from favoring long documents.
IDF measures how common a word is across the collection. It starts from the share of documents in the set that contain the word; inverting that share and taking the logarithm gives the inverse document frequency $IDF_i$ of keyword $k_i$: $IDF_i=\log\frac{N}{n_i}$
The weight of a word computed from TF and IDF is: $w_{ij}=TF_{ij}\cdot IDF_i=\frac{f_{ij}}{f_{dj}}\cdot\log\frac{N}{n_i}$
Conclusion: a word's TF-IDF increases with the number of times the word occurs in the document and decreases with the number of documents in the whole set that contain the word.
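The definitions above translate almost line by line into code. The sketch below is an illustration only: the toy corpus and the function names are invented, and the base-10 logarithm is an assumption chosen to match the worked example in this article.

```python
import math
from collections import Counter

def tf(term, doc):
    """TF_ij = f_ij / f_dj: occurrences of the term divided by document length."""
    return Counter(doc)[term] / len(doc)

def idf(term, corpus):
    """IDF_i = log10(N / n_i); assumes the term occurs in at least one document."""
    n_i = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / n_i)

def tfidf(term, doc, corpus):
    """w_ij = TF_ij * IDF_i."""
    return tf(term, doc) * idf(term, corpus)

# Hypothetical toy corpus of already tokenised documents.
corpus = [
    ["pirate", "ship", "pirate", "sea"],
    ["pirate", "captain"],
    ["pirate", "love"],
    ["space", "ship"],
]
doc = corpus[0]
weights = {term: tfidf(term, doc, corpus) for term in set(doc)}
for term, weight in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
    print(term, round(weight, 4))
```

In this toy corpus "pirate" is the most frequent word in the first document, yet "sea" gets the largest weight because it appears in no other document.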
Usage: to extract keywords (feature labels) from a target document, compute the TF-IDF of every word in the document, compare the values, and take the $k$ words with the largest TF-IDF to form the document's feature vector, which then represents the document.
Note: stop words (Stop Words) in the document, such as "is" and "of", carry no meaning for the document's central idea. Filter them out during word segmentation, then compute TF-IDF for the remaining words.
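Stop-word filtering followed by top-k selection might look like the sketch below. The stop-word list, the sample sentences, and the function name `top_k_keywords` are assumptions for illustration; the function expects `doc` to be one of the documents in `corpus` and to contain at least one word that is not a stop word.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "is", "of"}  # illustrative list, far from complete

def top_k_keywords(doc, corpus, k, stop_words=STOP_WORDS):
    """Return the k (term, weight) pairs with the largest TF-IDF in doc."""
    # Drop stop words everywhere before any counting takes place.
    filtered_corpus = [[t for t in d if t not in stop_words] for d in corpus]
    filtered_doc = [t for t in doc if t not in stop_words]
    n_docs = len(filtered_corpus)
    weights = {}
    for term, count in Counter(filtered_doc).items():
        n_i = sum(1 for d in filtered_corpus if term in d)
        weights[term] = (count / len(filtered_doc)) * math.log10(n_docs / n_i)
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Hypothetical tokenised documents.
corpus = [
    ["the", "captain", "of", "the", "ship", "is", "a", "pirate"],
    ["the", "pirate", "is", "free"],
    ["a", "story", "of", "the", "sea"],
]
keywords = top_k_keywords(corpus[0], corpus, k=2)
print(keywords)
```

Without the filter, "the" would dominate the raw counts of the first document; with it, only content words compete for the top-k slots.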
Worked example
Take computing TF-IDF for film reviews as an example, using the film "Pirates of the Caribbean: The Curse of the Black Pearl". Suppose the film has 1000 reviews in total and one of those reviews contains 200 words. The most frequent words in that review are "pirates", "captain", and "free", which appear 20, 15, and 10 times respectively. Across all reviews, the number of reviews that mention each of these 3 words is 1000, 500, and 100. The ranking of the 3 words as keywords is computed as follows.
Filter out the stop words in the review and compute the term frequency of the remaining words. For the three most frequent words:
- the term frequency of "pirates" is 20/200=0.1
- the term frequency of "captain" is 15/200=0.075
- the term frequency of "free" is 10/200=0.05
Compute the inverse document frequency of each word (base-10 logarithm):
- the IDF of "pirates" is log(1000/1000)=0
- the IDF of "captain" is log(1000/500)≈0.3
- the IDF of "free" is log(1000/100)=1
Multiplying each term frequency by the corresponding IDF gives the TF-IDF values: "pirates" is 0, "captain" is 0.0225, "free" is 0.05.
Comparing these values, the keywords of the review, in order of importance, are "free", "captain", "pirates". Arranging the words' TF-IDF values as weights in that order gives the feature vector of the review. This vector represents the review, and the size of each component reflects the importance of the corresponding attribute.
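The arithmetic of this example can be checked with a few lines of Python. The numbers are the ones given in the example; the variable names are only for illustration, and the base-10 logarithm is inferred from the IDF values quoted.

```python
import math

total_reviews = 1000       # reviews of the film
words_in_review = 200      # words in the review being analysed
term_counts = {"pirates": 20, "captain": 15, "free": 10}
reviews_containing = {"pirates": 1000, "captain": 500, "free": 100}

scores = {}
for term, count in term_counts.items():
    tf = count / words_in_review
    idf = math.log10(total_reviews / reviews_containing[term])
    scores[term] = tf * idf

# Keywords ordered from most to least important.
ranking = sorted(scores, key=scores.get, reverse=True)
for term in ranking:
    print(f"{term}: {scores[term]:.4f}")
```

The value for "captain" comes out near 0.0226 rather than 0.0225 because the hand calculation rounds log(2) to 0.3; the ranking is unaffected.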
Multiply every review vector in the film's review set by a chosen coefficient and sum the results to get the film's combined review vector, then combine it with the film's basic attributes to build the film's item profile. User profiles are built in the same way. Various methods can then be used to compute the similarity between an item profile and a user profile and make recommendations to users.
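A rough sketch of combining review vectors and comparing the result with a user profile is shown below. The article does not fix the coefficients or the similarity measure, so the equal coefficients, the choice of cosine similarity, and all sample vectors here are assumptions.

```python
import math

def combine(vectors, coefficients):
    """Weighted sum of sparse vectors stored as {term: weight} dicts."""
    combined = {}
    for vec, coef in zip(vectors, coefficients):
        for term, weight in vec.items():
            combined[term] = combined.get(term, 0.0) + coef * weight
    return combined

def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical review vectors for one film and a hypothetical user profile.
review_vectors = [
    {"free": 0.05, "captain": 0.0225},
    {"captain": 0.04, "ship": 0.03},
]
coefficients = [0.5, 0.5]  # equal weighting, chosen only for the example
movie_vector = combine(review_vectors, coefficients)
user_vector = {"captain": 1.0, "ship": 0.5}

similarity = cosine(movie_vector, user_vector)
print(movie_vector, round(similarity, 3))
```

Storing the vectors as dicts keeps only the non-zero components, which suits keyword profiles where each film uses a small part of the vocabulary.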
Load the dataset
import pandas as pd

'''
- Use the tags attached to each movie in tags.csv as the movie's candidate keywords
- Use TF-IDF to compute a TF-IDF value for each of a movie's tags and select the TOP-N keywords as the movie's profile labels
- Add the movie's genre words directly to its profile labels
'''

def get_movie_dataset():
    # Load the tags of all movies.
    # all-tags.csv comes from the ml-latest dataset;
    # ml-latest-small has too little tag data, so all-tags.csv is used to expand it.
    _tags = pd.read_csv("datasets/ml-latest-small/all-tags.csv", usecols=range(1, 3)).dropna()
    tags = _tags.groupby("movieId").agg(list)

    # Load the movie list dataset
    movies = pd.read_csv("datasets/ml-latest-small/movies.csv", index_col="movieId")
    # Split the genre string into a list of genre words
    movies["genres"] = movies["genres"].apply(lambda x: x.split("|"))

    # Attach the tag data to each movie; movies without tags get NaN
    movies_index = set(movies.index) & set(tags.index)
    new_tags = tags.loc[list(movies_index)]
    ret = movies.join(new_tags)

    # Build the movie dataset with four fields: movieId, title, genres, tags.
    # tags holds the genre words plus the user tags;
    # a movie with no tag data (NaN after the join) gets an empty list.
    # map(function, iterable)
    movie_dataset = pd.DataFrame(
        map(
            lambda x: (x[0], x[1], x[2], x[2] + x[3]) if isinstance(x[3], list) else (x[0], x[1], x[2], []),
            ret.itertuples(),
        ),
        columns=["movieId", "title", "genres", "tags"],
    )
    movie_dataset.set_index("movieId", inplace=True)
    return movie_dataset

movie_dataset = get_movie_dataset()
print(movie_dataset)
Extract the TOP-N keywords with TF-IDF and build movie profiles
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
import pandas as pd
import numpy as np
from pprint import pprint

# ......

def create_movie_profile(movie_dataset):
    '''
    Use TF-IDF to analyse the tags and extract the top-N keywords
    :param movie_dataset:
    :return:
    '''
    dataset = movie_dataset["tags"].values

    # Build a dictionary from the dataset: every distinct word gets an integer id
    # and word statistics are collected, so words can be looked up by id
    dct = Dictionary(dataset)
    # Convert each row to bag-of-words form: a list of (word id, count) pairs
    corpus = [dct.doc2bow(line) for line in dataset]
    # Train the TF-IDF model, which computes the TF-IDF values
    model = TfidfModel(corpus)

    movie_profile = {}
    for i, mid in enumerate(movie_dataset.index):
        # TF-IDF vector of the i-th movie: a list of (word id, weight) pairs
        vector = model[corpus[i]]
        # Keep the top-N keywords by TF-IDF value (N = 30)
        movie_tags = sorted(vector, key=lambda x: x[1], reverse=True)[:30]
        # Map the word ids back to the words themselves
        movie_profile[mid] = dict(map(lambda x: (dct[x[0]], x[1]), movie_tags))

    return movie_profile

movie_dataset = get_movie_dataset()
pprint(create_movie_profile(movie_dataset))
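To see what the pipeline above does without installing gensim, the same steps (word ids, bag-of-words, TF-IDF, top-N) can be sketched in plain Python. This is not gensim's implementation: it uses the TF and base-10 IDF formulas from this article, whereas gensim's default weighting is believed to use a base-2 logarithm and unit-length normalization, so the numbers will differ. The function name `build_profiles` and the sample tags are invented.

```python
import math
from collections import Counter

def build_profiles(tag_lists, top_n=2):
    """Map each document index to its top_n {tag: tf-idf} entries."""
    # Assign an integer id to every distinct tag, in order of first appearance.
    token2id = {}
    for tags in tag_lists:
        for tag in tags:
            token2id.setdefault(tag, len(token2id))
    id2token = {i: t for t, i in token2id.items()}

    # Bag-of-words: one list of (tag id, count) pairs per document.
    corpus = [sorted(Counter(token2id[t] for t in tags).items()) for tags in tag_lists]

    # Number of documents in which each tag id occurs.
    doc_freq = Counter(tid for bow in corpus for tid, _ in bow)
    n_docs = len(corpus)

    profiles = {}
    for i, bow in enumerate(corpus):
        total = sum(count for _, count in bow)
        vector = [(tid, count / total * math.log10(n_docs / doc_freq[tid])) for tid, count in bow]
        top = sorted(vector, key=lambda x: x[1], reverse=True)[:top_n]
        profiles[i] = {id2token[tid]: weight for tid, weight in top}
    return profiles

# Hypothetical tag lists, one per movie (genre words included, as in the dataset above).
tag_lists = [
    ["pirate", "pirate", "sea", "Adventure"],
    ["space", "alien", "Adventure"],
    ["romance", "sea"],
]
profiles = build_profiles(tag_lists)
print(profiles)
```

A movie with an empty tag list simply gets an empty profile, which matches how the dataset treats movies without tag data.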
Completing the profile keywords
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
import pandas as pd
import numpy as np
from pprint import pprint

# ......

def create_movie_profile(movie_dataset):
    '''
    Use TF-IDF to analyse the tags and extract the top-N keywords
    :param movie_dataset:
    :return:
    '''
    dataset = movie_dataset["tags"].values

    # Build a dictionary from the dataset: every distinct word gets an integer id
    # and word statistics are collected, so words can be looked up by id
    dct = Dictionary(dataset)
    # Convert each row to bag-of-words form: a list of (word id, count) pairs
    corpus = [dct.doc2bow(line) for line in dataset]
    # Train the TF-IDF model, which computes the TF-IDF values
    model = TfidfModel(corpus)

    _movie_profile = []
    for i, data in enumerate(movie_dataset.itertuples()):
        mid = data[0]
        title = data[1]
        genres = data[2]
        vector = model[corpus[i]]
        # Keep the top-N keywords by TF-IDF value (N = 30)
        movie_tags = sorted(vector, key=lambda x: x[1], reverse=True)[:30]
        topN_tags_weights = dict(map(lambda x: (dct[x[0]], x[1]), movie_tags))
        # Add the genre words and set their weight to 1.0
        for g in genres:
            topN_tags_weights[g] = 1.0
        topN_tags = list(topN_tags_weights.keys())
        _movie_profile.append((mid, title, topN_tags, topN_tags_weights))

    movie_profile = pd.DataFrame(_movie_profile, columns=["movieId", "title", "profile", "weights"])
    movie_profile.set_index("movieId", inplace=True)
    return movie_profile

movie_dataset = get_movie_dataset()
# Keep the result in a variable so that later steps can use it
movie_profile = create_movie_profile(movie_dataset)
pprint(movie_profile)
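The effect of adding genre words with weight 1.0 is easiest to see on a single movie. In the sketch below the weights and the helper name `merge_genres` are made up; only the merge rule follows the code above.

```python
def merge_genres(tag_weights, genres, genre_weight=1.0):
    """Return a new {keyword: weight} dict with every genre set to genre_weight."""
    merged = dict(tag_weights)
    for genre in genres:
        # A genre that already appears among the tags is overwritten, as in the code above.
        merged[genre] = genre_weight
    return merged

# Hypothetical TF-IDF weights for one movie's top tags, plus its genres.
tag_weights = {"pirate": 0.62, "Adventure": 0.11, "sea": 0.08}
genres = ["Adventure", "Action"]

weights = merge_genres(tag_weights, genres)
profile = list(weights)
print(profile)
print(weights)
```

Because genre words are assigned 1.0 regardless of their TF-IDF value, a word like "Adventure" that is common across movies, and therefore has a low IDF, still ends up as a strong profile label.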
To find the matching movies quickly for a given keyword, build an inverted index over the labels of the item profiles.
Introduction to inverted indexes
Data is usually stored with the item ID as the index, which is used to retrieve the item's other information.
An inverted index instead uses the item's other data as the index, and retrieves the list of IDs of the items that carry it.
# ......

'''
Build the tag -> movies inverted index
'''
def create_inverted_table(movie_profile):
    inverted_table = {}
    # Series.iteritems() was removed in pandas 2.0; items() works in old and new versions
    for mid, weights in movie_profile["weights"].items():
        for tag, weight in weights.items():
            # Fetch the list stored under tag, creating an empty list if the tag is new,
            # then record the movie and its weight for this tag
            inverted_table.setdefault(tag, []).append((mid, weight))
    return inverted_table

# movie_profile is the DataFrame returned by create_movie_profile
inverted_table = create_inverted_table(movie_profile)
pprint(inverted_table)
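How such an inverted table is queried can be sketched with a small stand-in for the real profiles. The sample profiles and the helper `lookup` are hypothetical; the table layout, `{tag: [(movieId, weight), ...]}`, is the one produced above.

```python
def build_inverted_table(profiles):
    """profiles: {movie_id: {tag: weight}} -> {tag: [(movie_id, weight), ...]}"""
    inverted = {}
    for movie_id, weights in profiles.items():
        for tag, weight in weights.items():
            inverted.setdefault(tag, []).append((movie_id, weight))
    return inverted

def lookup(inverted, tag, limit=10):
    """Movies carrying the tag, highest weight first; empty list for unknown tags."""
    return sorted(inverted.get(tag, []), key=lambda x: x[1], reverse=True)[:limit]

# Hypothetical movie profiles keyed by movie id.
profiles = {
    1: {"Adventure": 1.0, "pirate": 0.6},
    2: {"Adventure": 1.0, "space": 0.7},
    3: {"pirate": 0.9, "Comedy": 1.0},
}
inverted = build_inverted_table(profiles)
pirate_movies = lookup(inverted, "pirate")
print(pirate_movies)
```

A lookup touches only the list stored under one tag, so matching movies for a keyword does not require scanning every profile.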