Kaggle competition Two Sigma Connect: Rental Listing Inquiries
2022-07-06 12:00:00 【Want to be a kite】
Kaggle competition page: Two Sigma Connect: Rental Listing Inquiries
Using the listing data from a rental website, predict how popular each listing is. (This is a classification problem; the target used below is the interest_level column. The data contains categorical variables, integer variables, and text variables.)
Random forest model
Use sklearn to build the model and make predictions. The dataset can be downloaded from the competition's official page.
import numpy as np
import pandas as pd
import zipfile  # The official dataset files are zip archives; zipfile opens them
import os
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
for dirname, _, filenames in os.walk(r'E:\Kaggle\Kaggle_dataset01\two_sigma'):  # Change this to your own path
    for filename in filenames:
        print(os.path.join(dirname, filename))
train_df = pd.read_json(zipfile.ZipFile(r'E:\Kaggle\Kaggle_dataset01\two_sigma\train.json.zip').open('train.json'))
test_df = pd.read_json(zipfile.ZipFile(r'E:\Kaggle\Kaggle_dataset01\two_sigma\test.json.zip').open('test.json'))
# Custom preprocessing function: derives simple numeric features from the raw listing fields.
def data_preprocessing(data):
    # Split the creation timestamp into year, month and day
    data['created_year'] = pd.to_datetime(data['created']).dt.year
    data['created_month'] = pd.to_datetime(data['created']).dt.month
    data['created_day'] = pd.to_datetime(data['created']).dt.day
    # Count words in the description and items in the features and photos lists
    data['num_description_words'] = data['description'].apply(lambda x: len(x.split(' ')))
    data['num_features'] = data['features'].apply(len)
    data['num_photos'] = data['photos'].apply(len)
    New_data = data[['created_year', 'created_month', 'created_day', 'num_description_words', 'num_features', 'num_photos', 'bathrooms', 'bedrooms', 'latitude', 'longitude', 'price']]
    return New_data
train_x = data_preprocessing(train_df)
train_y = train_df['interest_level']
test_x = data_preprocessing(test_df)
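X_train, X_val, y_train, y_val = train_test_split(train_x, train_y, test_size=0.33)  # Hold out 33% of the training data for validation
clf = RandomForestClassifier(n_estimators=1000)  # Random forest with 1000 trees
clf.fit(X_train, y_train)
y_val_pred = clf.predict_proba(X_val)  # Class probabilities; columns follow clf.classes_
print(log_loss(y_val, y_val_pred, labels=clf.classes_))  # Validation log loss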
y_test_predict = clf.predict_proba(test_x)
labels2idx = {label: i for i, label in enumerate(clf.classes_)}  # Map each class label to its probability column
sub = pd.DataFrame()
sub['listing_id'] = test_df['listing_id']
for label in labels2idx.keys():
    sub[label] = y_test_predict[:, labels2idx[label]]
# Save the submission
#sub.to_csv('submission.csv',index=False) # Competition submission !
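The validation score in the code above comes from sklearn's log_loss. As a rough illustration of what that number means, the sketch below recomputes a multiclass log loss by hand and compares it with sklearn. The helper multiclass_log_loss, the label names, and the probability values are all made up for the example; they are not taken from the competition data.

```python
import numpy as np
from sklearn.metrics import log_loss


def multiclass_log_loss(y_true, proba, classes):
    """Mean negative log of the probability assigned to each true class."""
    class_to_col = {label: i for i, label in enumerate(classes)}
    cols = np.array([class_to_col[label] for label in y_true])
    # Pick, for every row, the predicted probability of its true class
    p_true = proba[np.arange(len(y_true)), cols]
    # Clip to avoid taking log(0)
    p_true = np.clip(p_true, 1e-15, 1.0)
    return float(-np.mean(np.log(p_true)))


# Hypothetical labels and predictions; columns follow the sorted class order
classes = ["high", "low", "medium"]
y_true = ["low", "high", "medium", "low"]
proba = np.array([
    [0.10, 0.70, 0.20],
    [0.60, 0.30, 0.10],
    [0.20, 0.30, 0.50],
    [0.05, 0.90, 0.05],
])

manual = multiclass_log_loss(y_true, proba, classes)
reference = log_loss(y_true, proba, labels=classes)
print(manual, reference)
```

A lower value means the model put more probability on the correct class, so confident wrong predictions are penalized heavily.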
Running the random forest code above shows that the model does not perform very well. Some people will ask why the data was not normalized. Tree-based models such as random forests split on feature thresholds, so they do not need normalized inputs, and I skipped that step. If you want to normalize anyway, try it and verify the effect yourself. If you want a more robust random forest model, consider improving the feature engineering to get better features!
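One way to check the normalization claim is sketched below. It does not use the rental data: synthetic data from make_classification stands in for the real features, and the sample sizes, the scale factor of 1000, and the random_state values are arbitrary assumptions. Two forests with the same seed are trained, one on raw features and one on standardized features, and their predictions on a held-out split are compared.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the rental features (an assumption)
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X[:, 0] *= 1000.0  # Put one feature on a much larger scale, like price vs. bedrooms

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

# Fit the scaler on the training split only, then apply it to both splits
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

# Same hyperparameters and seed; the only difference is the feature scaling
rf_raw = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
rf_scaled = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr_scaled, y_tr)

# Fraction of held-out rows where both forests predict the same class
agreement = float(np.mean(rf_raw.predict(X_te) == rf_scaled.predict(X_te_scaled)))
print("prediction agreement:", agreement)
```

Standardization preserves the ordering of values within each feature, so the trees should choose equivalent splits and the agreement should be at or very close to 1.0; small differences can still arise from floating-point rounding.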