Preprocessing a dataset: using LabelEncoder to re-index all IDs from 0
2022-07-03 02:39:00 【strawberry47】
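In datasets used for recommendation algorithms, the IDs usually start at 1 or are arbitrary strings of digits, so every time I process one I have to add an extra user2id step, which is a real hassle.
It is simpler to convert the IDs once, before using the dataset, and to save the user2id dictionary so the mapping can be looked up later.
A few things to note:
- Change sep to the delimiter of your dataset (' ', '\t')
- Change names to the column names of your dataset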
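The two notes above can be sketched as a small helper that takes the delimiter and column names as arguments instead of hard-coding them. This is only an illustration: the function name `encode_ids`, its `id_columns` parameter, and the toy comma-separated data are assumptions made for this example and are not part of the original script.

```python
import io

import pandas as pd
from sklearn.preprocessing import LabelEncoder


def encode_ids(source, sep, names, id_columns):
    """Read a delimited file and re-index the given ID columns from 0.

    Hypothetical helper for illustration; not part of the original script.
    """
    df = pd.read_csv(source, header=None, sep=sep, names=names)
    encoders = {}
    for col in id_columns:
        encoder = LabelEncoder()
        # fit_transform learns the sorted unique IDs and maps them to 0..n-1.
        df[col] = encoder.fit_transform(df[col])
        encoders[col] = encoder
    return df, encoders


# Made-up comma-separated data standing in for a real ratings file.
toy_csv = io.StringIO("10,500,4\n20,300,5\n10,300,3\n")
toy_df, toy_encoders = encode_ids(
    toy_csv,
    sep=",",
    names=["user_id", "item_id", "rating"],
    id_columns=["user_id", "item_id"],
)
print(toy_df)
```

For a tab-separated file such as u.data, the same call would use sep='\t' and the four column names used in the script.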
The code is as follows:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def load_mat():
    data_path = '../dataset/ml-100k/u.data'
    df_data = pd.read_csv(data_path, header=None, sep='\t',
                          names=['user_id', 'item_id', 'rating', 'time'])

    # Map raw user IDs to consecutive integers starting at 0.
    lbe_user = LabelEncoder()
    converted_user = lbe_user.fit_transform(df_data['user_id'])

    # Do the same for item IDs (turn them into discrete indices).
    lbe_item = LabelEncoder()
    converted_item = lbe_item.fit_transform(df_data['item_id'])

    # Keep only the encoded IDs and the rating; the 'time' column is dropped.
    converted_data = pd.DataFrame({
        'user_id': converted_user,
        'item_id': converted_item,
        'rating': df_data['rating'],
    })

    # Raw ID -> encoded ID. classes_ is sorted, so the value at
    # position i is encoded as i; no per-ID transform call is needed.
    user2id = {raw: idx for idx, raw in enumerate(lbe_user.classes_.tolist())}
    item2id = {raw: idx for idx, raw in enumerate(lbe_item.classes_.tolist())}
    return converted_data, user2id, item2id


def save(converted_data, user2id, item2id):
    sorted_data = converted_data.sort_values(by=['user_id'])
    # header=False writes the rows without a header line.
    sorted_data.to_csv('../dataset/ml-100k/data_converted', header=False, index=False)
    # np.save pickles each dict as a 0-d object array; reload it with
    # np.load(path, allow_pickle=True).item()
    np.save('../dataset/ml-100k/user2id.npy', user2id)
    np.save('../dataset/ml-100k/item2id.npy', item2id)
    print('successfully saved')


if __name__ == '__main__':
    converted_data, user2id, item2id = load_mat()
    save(converted_data, user2id, item2id)
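To look the mapping up later, the saved dictionary has to be read back. Because np.save stores a dict as a pickled 0-d object array, loading it needs allow_pickle=True followed by .item(). The sketch below uses a made-up three-entry mapping and a temporary directory instead of the real ../dataset/ml-100k files, and the inverse dictionary `id2user` is an addition for illustration only.

```python
import os
import tempfile

import numpy as np

# Made-up mapping in the same shape as user2id: raw ID -> encoded index.
user2id = {1: 0, 5: 1, 9: 2}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "user2id.npy")
    np.save(path, user2id)  # stored as a pickled 0-d object array
    # allow_pickle=True is required for object arrays; .item() unwraps the dict.
    loaded = np.load(path, allow_pickle=True).item()

# Invert the mapping to recover raw IDs from encoded indices.
id2user = {index: raw_id for raw_id, index in loaded.items()}
print(loaded[5], id2user[2])
```

Only load pickled .npy files that you created yourself, since unpickling untrusted data can execute arbitrary code.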