Kaggle Competition - Two Sigma Connect: Rental Listing Inquiries
2022-07-06 09:16:00 【想成为风筝】
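Kaggle competition link: Two Sigma Connect: Rental Listing Inquiries
The goal is to predict how popular a rental listing will be from the data on a rental website. (This is a classification problem; the data include categorical, integer, and text variables.)
Random forest model
Modeling and prediction are done with sklearn. The dataset can be downloaded from the competition's official page.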
import numpy as np
import pandas as pd
import zipfile  # the official dataset ships as zip archives, so open it with zipfile
import os
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

# List the downloaded files (change the path to your own)
for dirname, _, filenames in os.walk(r'E:\Kaggle\Kaggle_dataset01\two_sigma'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Read the JSON files directly from the zip archives
train_df = pd.read_json(zipfile.ZipFile(r'E:\Kaggle\Kaggle_dataset01\two_sigma\train.json.zip').open('train.json'))
test_df = pd.read_json(zipfile.ZipFile(r'E:\Kaggle\Kaggle_dataset01\two_sigma\test.json.zip').open('test.json'))
# A custom data preprocessing function: derives simple numeric features
# and returns only the columns used by the model.
def data_preprocessing(data):
    created = pd.to_datetime(data['created'])
    data['created_year'] = created.dt.year
    data['created_month'] = created.dt.month
    data['created_day'] = created.dt.day
    # Word count of the description, and the number of listed features and photos
    data['num_description_words'] = data['description'].apply(lambda x: len(x.split(' ')))
    data['num_features'] = data['features'].apply(len)
    data['num_photos'] = data['photos'].apply(len)
    new_data = data[['created_year', 'created_month', 'created_day', 'num_description_words', 'num_features', 'num_photos', 'bathrooms', 'bedrooms', 'latitude', 'longitude', 'price']]
    return new_data
train_x = data_preprocessing(train_df)
train_y = train_df['interest_level']
test_x = data_preprocessing(test_df)

X_train, X_val, y_train, y_val = train_test_split(train_x, train_y, test_size=0.33)  # split into training and validation sets
clf = RandomForestClassifier(n_estimators=1000)  # random forest model
clf.fit(X_train, y_train)
y_val_pred = clf.predict_proba(X_val)
print(log_loss(y_val, y_val_pred))  # validation log loss
y_test_predict = clf.predict_proba(test_x)
# Map each class label to its column in the predicted-probability array
labels2idx = {label: i for i, label in enumerate(clf.classes_)}
sub = pd.DataFrame()
sub['listing_id'] = test_df['listing_id']
for label in labels2idx.keys():
    sub[label] = y_test_predict[:, labels2idx[label]]
# Save the submission file
#sub.to_csv('submission.csv', index=False)  # competition submission file
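The validation score above comes from `log_loss`, which expects the probability columns in the same order as `clf.classes_` (sorted label order). The following is a small sketch of that calculation on made-up numbers, not competition data: the four rows of `y_prob` are invented for illustration, and the label names are assumed to be the three `interest_level` values.

```python
import numpy as np
from sklearn.metrics import log_loss

# Assumed label set, in the sorted order a fitted classifier would expose as clf.classes_
classes = ["high", "low", "medium"]

# Invented true labels and predicted probabilities (each row sums to 1)
y_true = ["low", "medium", "high", "low"]
y_prob = np.array([
    [0.1, 0.7, 0.2],
    [0.2, 0.3, 0.5],
    [0.6, 0.2, 0.2],
    [0.1, 0.8, 0.1],
])

# Library result, passing the column order explicitly
score = log_loss(y_true, y_prob, labels=classes)

# Manual result: average negative log of the probability given to the true class
col = {label: i for i, label in enumerate(classes)}
manual = -np.mean([np.log(y_prob[row, col[label]]) for row, label in enumerate(y_true)])

print(score, manual)
```

Both numbers should agree, which shows that the metric only rewards the probability assigned to the correct class, so a mismatch between column order and label order would silently corrupt the score.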
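Running the code above, the random forest does not perform very well. Some readers may ask why the data were not normalized beforehand. Random forests do not need normalized inputs, because tree splits depend only on the ordering of feature values and not on their scale, so the step was skipped. If you want to, try it yourself and verify. To make the random forest model more robust, consider improving the feature engineering to obtain better features.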
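For readers who want to check the normalization claim, here is a minimal sketch on synthetic data rather than the competition dataset. The data from `make_classification`, the choice of `MinMaxScaler`, the helper `validation_log_loss`, and all sizes and seeds are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic 3-class data (an assumption, standing in for the real listings)
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
# Stretch one column so the features live on very different scales
X[:, 0] *= 1000.0

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.33, random_state=0)


def validation_log_loss(train_features, val_features):
    """Fit a fixed-seed random forest and return its validation log loss."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(train_features, y_train)
    return log_loss(y_val, clf.predict_proba(val_features), labels=clf.classes_)


# Fit the scaler on the training split only, then apply it to both splits
scaler = MinMaxScaler().fit(X_train)
raw_loss = validation_log_loss(X_train, X_val)
scaled_loss = validation_log_loss(scaler.transform(X_train), scaler.transform(X_val))

print("raw:", raw_loss, "scaled:", scaled_loss)
```

With the same seed, the two losses should be essentially the same, since min-max scaling preserves the ordering of each feature. The same comparison can be run on `train_x` from this post by swapping in the real features.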