Preprocessing a dataset: using LabelEncoder to convert all IDs so they start from 0
2022-07-03 02:39:00 【strawberry47】
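IDs in recommendation datasets usually start from 1, or are just arbitrary strings of digits, so every time I process one I need an extra user2id step, which is a real hassle.
It is easier to convert the dataset once before using it and save the user2id dictionary for later lookups.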
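To make the idea concrete, here is a minimal sketch of what LabelEncoder does to a handful of IDs. The raw IDs below are made up for illustration and are not taken from any dataset.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical raw user IDs: not consecutive and not starting from 0
raw_ids = np.array([196, 186, 22, 196, 244])

encoder = LabelEncoder()
encoded = encoder.fit_transform(raw_ids)

# classes_ holds the sorted unique raw IDs; the position of each raw ID
# in classes_ is its new, 0-based ID
raw2id = {int(raw): idx for idx, raw in enumerate(encoder.classes_)}

print(encoded)  # new IDs, one per input row
print(raw2id)   # raw ID -> new ID
```

The encoder also keeps the reverse direction: `encoder.inverse_transform(encoded)` returns the original raw IDs.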
Two things to note:
- Change sep to the delimiter used by your dataset (' ', '\t')
- Change names to the column names of your dataset
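As a small sketch of these two parameters, the snippet below reads the same rows once with a tab delimiter and once with a space delimiter. The sample rows and the `load_ratings` helper are assumptions made for illustration; only `sep` and `names` come from the notes above.

```python
import io
import pandas as pd

COLUMNS = ['user_id', 'item_id', 'rating', 'time']

def load_ratings(source, sep, names):
    """Hypothetical helper: read a headerless ratings file with the given delimiter."""
    return pd.read_csv(source, header=None, sep=sep, names=names)

# Illustrative rows in a tab-separated layout
tab_text = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n"
# The same rows, but separated by single spaces
space_text = tab_text.replace('\t', ' ')

df_tab = load_ratings(io.StringIO(tab_text), sep='\t', names=COLUMNS)
df_space = load_ratings(io.StringIO(space_text), sep=' ', names=COLUMNS)

print(df_tab)
print(df_tab.equals(df_space))
```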
The full code is as follows:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def load_mat():
    data_path = '../dataset/ml-100k/u.data'
    df_data = pd.read_csv(data_path, header=None, sep='\t',
                          names=['user_id', 'item_id', 'rating', 'time'])

    # Map raw user IDs to consecutive integers starting from 0
    lbe_user = LabelEncoder()
    converted_user = lbe_user.fit_transform(df_data['user_id'])

    # Do the same for item IDs (turn them into discrete 0-based labels)
    lbe_item = LabelEncoder()
    converted_item = lbe_item.fit_transform(df_data['item_id'])

    converted_data = pd.DataFrame()
    converted_data['user_id'] = converted_user
    converted_data['item_id'] = converted_item
    converted_data['rating'] = df_data['rating']

    # Mapping from raw ID to new ID. classes_ is sorted, so the position
    # of each raw ID in classes_ is exactly its encoded value.
    user2id = {user: idx for idx, user in enumerate(lbe_user.classes_)}
    item2id = {item: idx for idx, item in enumerate(lbe_item.classes_)}
    return converted_data, user2id, item2id


def save(converted_data, user2id, item2id):
    sorted_data = converted_data.sort_values(by=['user_id'])
    sorted_data.to_csv('../dataset/ml-100k/data_converted', header=False, index=False)
    # np.save pickles the dict; read it back with
    # np.load(path, allow_pickle=True).item()
    np.save('../dataset/ml-100k/user2id.npy', user2id)
    np.save('../dataset/ml-100k/item2id.npy', item2id)
    print('successfully saved')


if __name__ == '__main__':
    converted_data, user2id, item2id = load_mat()
    save(converted_data, user2id, item2id)
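For the later lookups, the saved dictionary has to be read back. The following is a minimal sketch of that round trip. It uses a small made-up dictionary and a temporary directory rather than the real `../dataset/ml-100k/` files, and the `id2user` reverse mapping is an addition for illustration.

```python
import os
import tempfile
import numpy as np

# Hypothetical mapping standing in for the real user2id dictionary
user2id = {1: 0, 2: 1, 5: 2}

path = os.path.join(tempfile.mkdtemp(), 'user2id.npy')
np.save(path, user2id)  # the dict is stored as a pickled 0-d object array

# allow_pickle=True is required for object arrays; .item() unwraps the dict
loaded = np.load(path, allow_pickle=True).item()

# Reverse mapping, to go from a new ID back to the raw ID
id2user = {new_id: raw_id for raw_id, new_id in loaded.items()}

print(loaded[5])   # new ID of raw user 5
print(id2user[0])  # raw ID behind new ID 0
```

Since loading a pickled file can execute arbitrary code, this should only be done with files you created yourself; saving the mapping as JSON or CSV is a possible alternative.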