当前位置:网站首页>K-means based user portrait clustering model
K-means based user portrait clustering model
2022-07-01 21:37:00 【Desperately_ petty thief】
One 、 summary
Use K-means The main purpose of user clustering is to realize the film recommendation system of user portrait , The recommendation consists of two parts , The first part realizes movie recommendation through collaborative filtering , The second part , Build user portrait to realize movie recommendation , The implementation is realized through static attributes : User's gender , Age , regional , role ( Student 、 Office worker 、 unemployed )、 regional 、 Marital status , Use clustering (K-mens) The algorithm divides the crowd , Realize the movie recommendation of the same crowd , Secondly, dynamic properties : By recording user behavior data , Search record usage word2vec The model realizes similar search keywords and recommends movies .( I will explain the recommendation system later )
The language used in the text is python3.6 The above implementation , The tool used is jupyter-notebook Development , Mainly used python Of sklearn、pandas、faker、matplotlib Wait for the library to realize .
Abstract : This article mainly describes the second part of the use of K-means Clustering builds user portrait model to realize the recommendation of movies loved by the same users .
Two 、 theory
K-means It is an unsupervised clustering algorithm , The main idea is to randomly select K Objects as the initial clustering center . Then calculate the distance between each object and each seed cluster center , To find similar objects .
3、 ... and 、 Realization
- Data acquisition
We need data information before clustering : Including name, 、 Age 、 Gender 、 Address 、 Type of work 、 Marital status, etc , But our thorny problem , Where does the data come from , At present, there are two ways to obtain data : First, you can use web crawler technology to crawl information related personal information , The second kind python Have a powerful library faker It can help us realize the automatic simulation generation of user information .faker You can refer to the blog I wrote before ~ faker Generating user information
from faker import Faker
import pandas as pd
import numpy as np
from collections import OrderedDict
fake = Faker(locale='zh_CN')
sex_dict = OrderedDict([(' male ', 0.52), (' Woman ', 0.48)])
married_dict = OrderedDict([(' unmarried ', 0.4), (' married ', 0.5), (' Divorce ', 0.1), ])
work_dict = OrderedDict([(' on-the-job ', 0.7), (' freelance ', 0.3), ])
type_dict = OrderedDict([(' white collar ', 0.45), (' Teachers' ', 0.1), (' Worker ', 0.2), (' Civil servants ', 0.1), (' sales ', 0.15)])
name = []
age = []
address = []
sex = []
work = []
married = []
type = []
# Number of people generated
num = 1000
# Generate name age address
for i in range(num):
name.append(fake.name())
age.append(fake.random_int(min=16,max=45))
address.append(fake.province())
sex.append(fake.random_element(sex_dict))
work.append(fake.random_element(work_dict))
type.append(fake.random_element(type_dict))
married.append(fake.random_element(married_dict))
data = {'name':name,'age':age,'sex':sex,'address':address,'work':work,'married':married,'type':type}
users = pd.DataFrame.from_dict(data)
- Data processing
Above, we have successfully used faker The library generates user information , But let's think about , It is impossible for computer algorithms to understand the meaning of Chinese , So how to make kmeans The model knows the meaning of these data ? We need to vectorize the data generated above , There are many methods of text vectorization , for example :one-hot、tfidf wait , Of course, different vectorization methods have their own advantages and disadvantages , If you are interested, you can learn about it alone , There is no detailed answer here .
In this article, we use one-hot The idea is to transform the data of different dimensions into data vectors by dictionary mapping .
sex_map = {' male ':0, ' Woman ':1}
married_map = {' unmarried ':0,' married ':1,' Divorce ':2}
work_map = {' on-the-job ':0, ' freelance ':1}
type_map = {' white collar ':0,' Teachers' ':1,' Worker ':2,' Civil servants ':3,' sales ':4}
addr_map = {' Shanghai ':0,' Yunnan Province ':1,' Inner Mongolia Autonomous Region ':2,' The Beijing municipal ':3,' Taiwan Province ':4,' Jilin Province ':5,' Sichuan Province ':6,' tianjin ':7,' Ningxia Hui Autonomous Region ':8
,' Anhui Province ':9,' Shandong Province ':10,' Shanxi Province ':11,' Guangdong province, ':12,
' Guangxi Zhuang Autonomous Region ':13,' Xinjiang Uygur Autonomous Region ':14,' Jiangsu Province ':15,' Jiangxi Province ':16,' Hebei Province ':17,' Henan province ':18,' Zhejiang Province ':19,' Hainan ':20,' Hubei province ':21,' Hunan province ':22,' Macao SAR ':23,' Gansu Province ':24,
' Fujian Province ':25,' Tibet Autonomous Region ': 26,' Guizhou Province ':27,' Liaoning Province ':28,' Chongqing City ':29,' Shaanxi Province ':30,' Qinghai Province ':31,' Hong Kong Special Administrative Region ':32,' Heilongjiang Province ':33}
And then use pandas Medium map Method vectorize each dimension data according to the above dictionary type , After conversion, as shown in the figure below :
users['married'] = users['married'].map(married_map)
- Model implementation
We use sklearn Machine learning library creation kmeans Model , What we need to pay attention to is that we are using kmeans Before classification , The data is normalized once .
The purpose of data standardization is to readjust the value of each dimension of data , Make the final data vector fall in [0,1] Within the interval , After standardized treatment , The original data is transformed into dimensionless index evaluation value , Each index value is at the same quantity level , Comprehensive evaluation and analysis can be carried out . If the original index value is directly used for analysis , It will highlight the role of indicators with higher values in comprehensive analysis , Relatively weaken the role of indicators with low numerical level .
I will standardize the steps and kmeans The model makes a connection, so they are all put together , The implementation code is as follows :
model = Pipeline([
('BN',preprocessing.StandardScaler()),('KMS',KMeans())
])
model.fit(X)
# Save the model
joblib.dump(model, "./model/model.pkl")
model = joblib.load("./model/model.pkl")
y_pred = model.predict(X)
# Store the prediction results in the vectorized total data
users['lable'] = y_pred
- The model is saved and tested
model = joblib.load("./model/model.pkl")
y_pred = model.predict([X[0]])
y_pred- Visualization of clustering results
Use adviz The principle of visualization is to map a series of points in multi-dimensional space to two-dimensional space through nonlinear method .
import matplotlib.pyplot as plt
from pandas.plotting import radviz
"""
adviz The visualization principle is a visualization technology that maps a series of points in multi-dimensional space to two-dimensional space through nonlinear method , It is a multi-dimensional visualization method based on the design idea of circular parallel coordinate system . circular m Bar radius representation m Dimensional space , Use a point in the coordinate system to represent many information objects , Its realization principle refers to the force balance theorem of objects in physics .
mm Points in dimensional space {Ai1,Ai2,...,AimAi1,Ai2,...,Aim} The position mapped to the two-dimensional visual space is determined by the spring gravity analysis model .
First, divide a circle into mm Share , The number of bisectors is determined by the data dimension , Record the bisection points as {R1,R2,...,RnR1,R2,...,Rn}, A spring is fixed on each bisection point , The elastic coefficient of each spring is different , Spring jj The coefficient of elasticity is AijAij, That is the first. jj Value of dimension , all mm The other end of a spring is fixed on a small circle , If the small circle finally reaches an equilibrium position Bi={Xi,Yi}TBi={Xi,Yi}T, Then the elastic balance point {Xi,Yi}{Xi,Yi} Namely m Dimensional space points { Ai1,Ai2,...,AimAi1,Ai2,...,Aim} Projection point in two-dimensional space .
Blog :https://blog.csdn.net/Haiyang_Duan/article/details/78985225
"""
plt.figure(' User portrait clustering ',figsize=(15, 7.5), dpi=80)
plt.title('radviz')
radviz(users.loc[:,'age':], class_column='lable')
plt.show()
边栏推荐
- idea中类中显示成员变量和方法
- 月入1W+的自媒体达人都会用到的运营工具
- PCB线路板塞孔工艺的那些事儿~
- 新版Free手机、PC、平板、笔记本四端网站缩略展示图在线一键生成网站源码
- 股票手机开户哪个app好,安全性较高的
- Gaussdb (for MySQL):partial result cache, which accelerates the operator by caching intermediate results
- leetcode刷题:栈与队列06(前 K 个高频元素)
- 杰理之烧录都使用 VBAT 供电,供电电压 4.2V【篇】
- 基于K-means的用户画像聚类模型
- Practical project notes (I) -- creation of virtual machine
猜你喜欢
随机推荐
Introduction à l'ingénierie logicielle (sixième édition) notes d'examen de Zhang haifan
PCB线路板塞孔工艺的那些事儿~
深度学习 神经网络基础
2022年低压电工考试试题及答案
String类型转换BigDecimal、Date类型
《QTreeView+QAbstractItemModel自定义模型》:系列教程之三[通俗易懂]
面试题:MySQL的union all和union有什么区别、MySQL有哪几种join方式(阿里面试题)[通俗易懂]
杰理之、产线装配环节【篇】
新牛牛盲盒微信小程序源码_支持流量变现,带完整素材图片
2022熔化焊接与热切割上岗证题目模拟考试平台操作
工控设备安全加密的意义和措施
【智能QbD风险评估工具】上海道宁为您带来LeanQbD介绍、试用、教程
编译原理复习笔记
[multithreading] realize the singleton mode (hungry and lazy) realize the thread safe singleton mode (double validation lock)
极客DIY开源方案分享——数字幅频均衡功率放大器设计(实用的嵌入式电子设计作品软硬件综合实践)
UVM教程
Go — 相关依赖对应的exe
NIO与传统IO的区别
Internship: complex JSON format data compilation interface
基于YOLOv5的口罩佩戴检测方法

![[NOIP2013]积木大赛 [NOIP2018]道路铺设 贪心/差分](/img/d1/a56231cd4eb3cc1d91d8a55048ccfe.png)







