Kaggle competition Two Sigma Connect: Rental Listing Inquiries
2022-07-06 12:00:00 【Want to be a kite】
Kaggle competition page: Two Sigma Connect: Rental Listing Inquiries
The task is to predict how popular a listing is from the data on a rental website. (This is a classification problem. The data contains categorical variables, integer variables, and text variables.)
Random forest model
Modeling and prediction are done with sklearn. The dataset can be downloaded from the official competition page.
import numpy as np
import pandas as pd
import zipfile  # The official dataset is distributed as zip files, so open it with zipfile
import os
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
# List the files in the data directory (change this to your own path)
for dirname, _, filenames in os.walk(r'E:\Kaggle\Kaggle_dataset01\two_sigma'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
train_df = pd.read_json(zipfile.ZipFile(r'E:\Kaggle\Kaggle_dataset01\two_sigma\train.json.zip').open('train.json'))
test_df = pd.read_json(zipfile.ZipFile(r'E:\Kaggle\Kaggle_dataset01\two_sigma\test.json.zip').open('test.json'))
# Custom preprocessing function: derives simple numeric features from the raw columns.
def data_preprocessing(data):
    # Date parts of the listing creation time
    data['created_year'] = pd.to_datetime(data['created']).dt.year
    data['created_month'] = pd.to_datetime(data['created']).dt.month
    data['created_day'] = pd.to_datetime(data['created']).dt.day
    # Counts taken from the text and list columns
    data['num_description_words'] = data['description'].apply(lambda x: len(x.split(' ')))
    data['num_features'] = data['features'].apply(len)
    data['num_photos'] = data['photos'].apply(len)
    New_data = data[['created_year', 'created_month', 'created_day',
                     'num_description_words', 'num_features', 'num_photos',
                     'bathrooms', 'bedrooms', 'latitude', 'longitude', 'price']]
    return New_data
train_x = data_preprocessing(train_df)
train_y = train_df['interest_level']
test_x = data_preprocessing(test_df)
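# Hold out 33% of the training data for validation
X_train, X_val, y_train, y_val = train_test_split(train_x, train_y, test_size=0.33)
clf = RandomForestClassifier(n_estimators=1000)  # Random forest model
clf.fit(X_train, y_train)
y_val_pred = clf.predict_proba(X_val)
# Multi-class log loss on the validation set; the probability columns follow clf.classes_
print(log_loss(y_val, y_val_pred, labels=clf.classes_))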
y_test_predict = clf.predict_proba(test_x)
# Map each class label to its column in the predict_proba output
labels2idx = {label: i for i, label in enumerate(clf.classes_)}
sub = pd.DataFrame()
sub['listing_id'] = test_df['listing_id']
for label in labels2idx.keys():
    sub[label] = y_test_predict[:, labels2idx[label]]
# Save the submission
#sub.to_csv('submission.csv', index=False)  # Competition submission!
If you run the code above, you will find that the random forest does not perform very well. Some people will ask why the data was not normalized. Normalization is not needed for a random forest: each tree splits on thresholds of individual features, so rescaling a feature does not change which splits are possible. That is why I skipped it. If you want to try it anyway, you can verify this yourself. To get a better and more robust random forest model, focus on improving the feature engineering and building better features!
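The claim about normalization can be checked with a small experiment. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` in place of the competition data, a smaller forest than the one above, and a helper named `forest_log_loss` that is made up for this example. It trains the same random forest once on raw features and once on standardized features and compares the validation log loss.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: synthetic 3-class data stands in for the rental listings.
X, y = make_classification(n_samples=1500, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X[:, 0] *= 1000.0  # put one feature on a much larger scale (like price vs. latitude)

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.33, random_state=0)


def forest_log_loss(train_features, val_features):
    """Fit a forest with a fixed seed and return its validation log loss."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(train_features, y_tr)
    return log_loss(y_va, clf.predict_proba(val_features), labels=clf.classes_)


# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler().fit(X_tr)
loss_raw = forest_log_loss(X_tr, X_va)
loss_scaled = forest_log_loss(scaler.transform(X_tr), scaler.transform(X_va))

print(f"log loss, raw features:          {loss_raw:.4f}")
print(f"log loss, standardized features: {loss_scaled:.4f}")
```

The two numbers should be essentially the same, since standardization only shifts and rescales each feature and keeps the ordering of its values. Any small difference would come from floating-point rounding rather than from the scaling itself.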