Kaggle competition: Two Sigma Connect: Rental Listing Inquiries
2022-07-06 12:00:00 【Want to be a kite】
Kaggle competition page: Two Sigma Connect: Rental Listing Inquiries
The task is to predict how popular a listing is from the data published on the rental website. This is a classification problem, and the data contains categorical variables, integer variables, and text variables.
Random forest model
Use sklearn to build the model and make predictions. The data set can be downloaded from the competition's official page.
import numpy as np
import pandas as pd
import zipfile  # The official data set is shipped as zip archives; zipfile opens them
import os
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
for dirname, _, filenames in os.walk(r'E:\Kaggle\Kaggle_dataset01\two_sigma'):  # Change your path
    for filename in filenames:
        print(os.path.join(dirname, filename))
train_df = pd.read_json(zipfile.ZipFile(r'E:\Kaggle\Kaggle_dataset01\two_sigma\train.json.zip').open('train.json'))
test_df = pd.read_json(zipfile.ZipFile(r'E:\Kaggle\Kaggle_dataset01\two_sigma\test.json.zip').open('test.json'))
# A custom preprocessing function that turns the raw listing columns into numeric features.
def data_preprocessing(data):
    # Date parts of the listing creation time
    data['created_year'] = pd.to_datetime(data['created']).dt.year
    data['created_month'] = pd.to_datetime(data['created']).dt.month
    data['created_day'] = pd.to_datetime(data['created']).dt.day
    # Simple counts taken from the text and list columns
    data['num_description_words'] = data['description'].apply(lambda x: len(x.split(' ')))
    data['num_features'] = data['features'].apply(len)
    data['num_photos'] = data['photos'].apply(len)
    New_data = data[['created_year', 'created_month', 'created_day',
                     'num_description_words', 'num_features', 'num_photos',
                     'bathrooms', 'bedrooms', 'latitude', 'longitude', 'price']]
    return New_data
train_x = data_preprocessing(train_df)
train_y = train_df['interest_level']
test_x = data_preprocessing(test_df)
X_train, X_val, y_train, y_val = train_test_split(train_x, train_y, test_size=0.33)  # Split into training and validation sets
clf = RandomForestClassifier(n_estimators=1000)  # Random forest model
clf.fit(X_train, y_train)
y_val_pred = clf.predict_proba(X_val)
print(log_loss(y_val, y_val_pred))  # Validation log loss; print it so it is visible when run as a script
y_test_predict = clf.predict_proba(test_x)
# Map each class label to its column in the predict_proba output
labels2idx = {label: i for i, label in enumerate(clf.classes_)}
sub = pd.DataFrame()
sub['listing_id'] = test_df['listing_id']
for label in labels2idx.keys():
    sub[label] = y_test_predict[:, labels2idx[label]]
# Save the submission
#sub.to_csv('submission.csv',index=False) # Competition submission !
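The log_loss call in the script above scores the predicted class probabilities. As a rough illustration of what that number means, the sketch below recomputes multi-class log loss by hand and compares it with sklearn. The labels and probabilities are made-up toy values, not competition data, and manual_log_loss is a hypothetical helper written only for this example.

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy labels and predicted probabilities (made up for illustration).
# Columns follow alphabetical class order, which is how clf.classes_ is sorted.
classes = ['high', 'low', 'medium']
y_true = ['low', 'medium', 'high', 'low']
y_prob = np.array([
    [0.1, 0.7, 0.2],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
    [0.3, 0.4, 0.3],
])


def manual_log_loss(y_true, y_prob, classes):
    """Mean negative log of the probability assigned to the true class."""
    col = {label: i for i, label in enumerate(classes)}
    picked = np.array([y_prob[row, col[label]] for row, label in enumerate(y_true)])
    return float(-np.mean(np.log(picked)))


manual = manual_log_loss(y_true, y_prob, classes)
reference = log_loss(y_true, y_prob, labels=classes)
print(manual, reference)
```

A lower value is better: the metric only looks at the probability given to the correct class, so confident wrong predictions are penalized heavily.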
Running the code above shows that the random forest does not perform very well. Some readers will ask why the data was not normalized first. A random forest does not need normalized inputs, because each tree splits on feature thresholds and is not affected by the scale of a feature, so I skipped that step. If you want to check, try it yourself. If you want a more robust random forest model, consider improving the feature engineering part to get better features!
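If you do want to verify the point about normalization, one possible check is sketched below. It uses synthetic data from make_classification as a stand-in for the rental features (an assumption, so the real data may behave slightly differently), trains the same forest on raw and standardized inputs with a fixed random_state, and measures how often the two models agree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the rental features (assumption).
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
# Put the columns on very different scales, as price and latitude would be.
X = X * np.array([1, 10, 100, 1000, 1, 10, 100, 1000])

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.33, random_state=0)

# Fit the scaler on the training split only, then train two identical forests.
scaler = StandardScaler().fit(X_tr)
raw = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
scaled = RandomForestClassifier(n_estimators=50, random_state=0).fit(
    scaler.transform(X_tr), y_tr)

# Share of validation rows where both models predict the same class.
agreement = np.mean(raw.predict(X_va) == scaled.predict(scaler.transform(X_va)))
print(f"share of identical validation predictions: {agreement:.3f}")
```

Standardization is a monotonic transform of each column, so the trees should find the same partitions; small disagreements, if any, would come from floating-point rounding rather than from the scaling itself.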