iDataCourse Lab | Issue 8: Building a Singapore Listing Price Prediction Model
2022-06-27 21:00:00 【Data science artificial intelligence】
iDataCourse: idatacourse.cn
Domain: consumer
Overview: The data comes from Airbnb homestay listings in Singapore: 7907 records with 16 fields. In this lab we use Python plotting libraries to visualize the data, looking at the value distribution of each feature and the relationships between features. We then build a regression model that predicts a listing's price from features such as its longitude, latitude, room type, and region.
Data:
./dataset/listings.csv
1. Data preparation
1.1 Dataset introduction
The data comes from Airbnb homestay listings in Singapore: 7907 records with 16 fields. In this lab we use Python plotting libraries to visualize the dataset, looking at the value distribution of each feature and the relationships between features. We then build a regression model that predicts a listing's price from features such as its longitude, latitude, room type, and region. The meaning of each field is shown in the following table:
| Name | Meaning |
|---|---|
| id | Listing ID |
| name | Listing name |
| host_id | Host ID |
| host_name | Host name |
| neighbourhood_group | Region the listing belongs to |
| neighbourhood | Administrative area |
| latitude | Latitude |
| longitude | Longitude |
| room_type | Room type (entire home, private room, shared room) |
| price | Price |
| minimum_nights | Minimum number of nights |
| number_of_reviews | Number of reviews |
| last_review | Date of the last review |
| reviews_per_month | Average number of reviews per month |
| calculated_host_listings_count | Number of listings owned by the host |
| availability_365 | Number of days per year the listing is available |
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')
import warnings
warnings.filterwarnings('ignore')
plt.rcParams['font.sans-serif']='SimHei'
%config InlineBackend.figure_format = 'svg'
%matplotlib inline
1.2 Reading the data
First load the data and get a general sense of it.
flat_data = pd.read_csv('./dataset/listings.csv')
The read_csv() function in Pandas reads a CSV file and returns a DataFrame. The DataFrame's shape attribute gives the size of the dataset, and its head() method shows the first n rows (5 by default).
print(flat_data.shape)
flat_data.head()
(7907, 16)
The DataFrame's info() method prints a summary of the object: each column's name, dtype, and non-null count, plus the dimensions of the frame and its memory usage.
flat_data.info()
Three fields have missing values: the listing name (name), the date of the last review (last_review), and the average number of reviews per month (reviews_per_month), so the missing values need to be handled.
# Look at the three review-related columns: number of reviews, last review date, reviews per month
flat_data[['number_of_reviews','last_review','reviews_per_month']]
last_review and reviews_per_month are missing where number_of_reviews is 0, that is, where the listing has no reviews. During preprocessing, before modeling, the last_review column can be dropped and the missing reviews_per_month values filled with 0.
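The claim that the review fields are missing only for listings without reviews can be checked directly rather than by eye. The following is a minimal sketch on a hypothetical four-row stand-in for flat_data (the values are invented for illustration); on the real data the same two expressions would be applied to flat_data itself.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for flat_data; values are invented.
toy = pd.DataFrame({
    "number_of_reviews": [0, 3, 0, 12],
    "last_review": [None, "2019-05-01", None, "2019-08-20"],
    "reviews_per_month": [np.nan, 0.4, np.nan, 1.5],
})

# Rows where either review field is missing.
missing_rows = toy["last_review"].isna() | toy["reviews_per_month"].isna()

# True only if every row with a missing review field has zero reviews.
all_missing_have_no_reviews = bool(
    (toy.loc[missing_rows, "number_of_reviews"] == 0).all()
)
print(all_missing_have_no_reviews)
```

If the check printed False on the real data, filling reviews_per_month with 0 would no longer be justified for every row.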
2. Statistics and visualization
2.1 Bar chart of the number of distinct values per categorical feature
Use the barplot() function in Seaborn to draw a bar chart showing how many distinct values each categorical feature has.
plt.figure(figsize=(8,5))
# Count the distinct values of each categorical feature
count_uniq = []
columns = ['neighbourhood_group', 'neighbourhood', 'room_type']
for column in columns:
    # Number of distinct values in this column
    count_uniq.append(flat_data[column].nunique())
print(count_uniq)
sns.barplot(x=columns, y=count_uniq, palette='Set3')
plt.title('Number of distinct values per categorical feature')
As the figure shows, neighbourhood_group has 5 distinct values, neighbourhood has 43, and room_type has 3.
2.2 Price distribution histogram
Use the distplot() function in Seaborn to draw a histogram of the price distribution. (distplot() is deprecated in recent Seaborn releases; histplot() is its replacement.)
plt.figure(figsize=(8,5))
sns.distplot(flat_data["price"])  # histogram with a density curve
plt.title('Price distribution histogram')
Overall, prices range from 0 to 10,000, but very few listings are priced between 1,000 and 10,000; most are priced below 1,000.
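A histogram with such a long right tail is hard to read precisely, so a few summary numbers can back up the visual impression. The sketch below uses a hypothetical price series (invented values, not the article's data) to show the share of prices below 1,000, some quantiles, and the skewness; with the real data, prices would be flat_data["price"].

```python
import pandas as pd

# Hypothetical prices, invented to mimic a long right tail.
prices = pd.Series([50, 80, 95, 120, 150, 180, 220, 300, 1200, 9000])

# Fraction of listings priced below 1000.
share_below_1000 = (prices < 1000).mean()
print(share_below_1000)  # 0.8 for this toy series

# Median and upper quantiles show how far the tail stretches.
print(prices.quantile([0.5, 0.9, 0.99]))

# Positive skewness indicates a right-skewed (long-tailed) distribution.
print(prices.skew())
```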
2.3 Number of listings by room type
Use the countplot() function in Seaborn to draw a bar chart showing the number of listings of each room type.
plt.figure(figsize=(8,5))
sns.countplot(x=flat_data['room_type'], palette='Set2')
plt.title('Number of listings by room type')
Entire homes are the most common listing type, followed by private rooms; shared rooms are the least common. Entire homes and private rooms make up most of the listings, which may reflect greater demand for them.
2.4 Bar chart of listings by region
Use the countplot() function in Seaborn to draw a bar chart showing how listings are distributed across regions.
plt.figure(figsize=(8,5))
sns.countplot(x=flat_data["neighbourhood_group"])
plt.title('Number of listings by region')
Most listings are in the Central Region, followed by the West, East, and North-East regions; the North Region has the fewest.
2.5 Bar chart of room types by region
Use the countplot() function in Seaborn to draw a grouped bar chart showing the room types in each region.
plt.figure(figsize=(8,5))
sns.countplot(data=flat_data, x='room_type', hue='neighbourhood_group')
plt.title('Room types by region')
Entire homes are the most common type in the Central Region, while private rooms are the most common type in every other region. The vast majority of shared rooms are also in the Central Region, possibly because prices there are higher.
2.6 Box plot of prices by region
Use the boxplot() function in Seaborn to draw a box plot of listing prices in each region.
plt.figure(figsize=(8,5))
sns.boxplot(x='neighbourhood_group',
            y='price',
            data=flat_data[flat_data['price'] <= 500]  # only listings priced at 500 or below
            )
plt.title('Box plot of prices by region')
The box plot shows that prices in the Central Region are spread over a wider range, and the typical price level (the box plot marks the median, not the mean) is higher than elsewhere. The North Region has the lowest typical price.
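The comparison read off the box plot can also be put in numbers. The following sketch, on a hypothetical toy frame (the region labels and prices are assumptions for illustration), applies the same 500 price cap and then computes the median, mean, and count per region.

```python
import pandas as pd

# Hypothetical toy data; region labels and prices are invented.
toy = pd.DataFrame({
    "neighbourhood_group": ["Central Region"] * 4 + ["North Region"] * 3,
    "price": [100, 200, 300, 2000, 50, 60, 70],
})

# Same filter as the box plot: keep listings priced at 500 or below.
capped = toy[toy["price"] <= 500]

# Median (the line inside each box), mean, and number of listings per region.
summary = capped.groupby("neighbourhood_group")["price"].agg(
    ["median", "mean", "count"]
)
print(summary)
```

Reporting the count alongside the median makes it visible how many listings the cap removed from each group.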
2.7 Box plot of price by room type
Use the boxplot() function in Seaborn to draw a box plot showing the relationship between room type and price.
plt.figure(figsize=(8,5))
sns.boxplot(x='room_type',
            y='price',
            data=flat_data[flat_data['price'] <= 500]  # only listings priced at 500 or below
            )
plt.title('Box plot of price by room type')
Entire homes span a wider price range and are generally priced higher than the other two types; shared rooms have the lowest prices.
2.8 Scatter plot of listing locations
Use the scatterplot() function in Seaborn to draw a scatter plot of the longitude and latitude of each listing.
plt.figure(figsize=(10,7))
# x-axis: longitude, y-axis: latitude
sns.scatterplot(x=flat_data['longitude'], y=flat_data['latitude'],
                hue=flat_data['neighbourhood_group'])
plt.title('Scatter plot of listing locations')
Orange marks listings in the Central Region, green the East Region, red the West Region, purple the North-East Region, and blue the North Region. Listings in the Central Region are numerous and densely packed; the North Region has the fewest listings and they are relatively scattered.
2.9 Scatter plot of the price distribution
Use the scatterplot() function in Seaborn to draw a scatter plot showing how prices are distributed geographically.
# Visualize prices by location
plt.figure(figsize=(10,7))
# x-axis: longitude, y-axis: latitude
sns.scatterplot(x=flat_data['longitude'], y=flat_data['latitude'],
                hue=flat_data['price'])
plt.title('Scatter plot of the price distribution')
Most of the higher-priced listings are in the Central and West regions; the East, North-East, and North regions have very few.
3. Data preprocessing
3.1 Dropping unnecessary columns
Call the DataFrame's drop() method with axis=1 to delete the columns that are not used for modeling: id, name, host_id, host_name, last_review, and neighbourhood.
# Drop the columns that are not needed
flat_data = flat_data.drop(['id', 'name', 'host_id', 'host_name', 'last_review', 'neighbourhood'],
                           axis=1)
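3.2 Handling missing values
Call the DataFrame's fillna() method to fill missing values with 0. With name and last_review already dropped, the only column that still has missing values is reviews_per_month.
# Fill missing values with 0 (the missing reviews_per_month entries)
flat_data = flat_data.fillna(0)
# Confirm that no missing values remain
flat_data.isnull().sum()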
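Calling fillna(0) on the whole frame fills every column, which is fine here but would silently hide gaps if another column had them. As a hedged alternative, the sketch below fills only reviews_per_month by passing a dict to fillna(); the toy frame and its gap in minimum_nights are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the gap in minimum_nights is invented for illustration.
toy = pd.DataFrame({
    "reviews_per_month": [np.nan, 0.4, np.nan],
    "minimum_nights": [1.0, np.nan, 3.0],
})

# A dict restricts the fill to the named column only.
filled = toy.fillna({"reviews_per_month": 0})

# reviews_per_month is complete; the other gap is still visible.
print(filled.isnull().sum())
```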
3.3 Numeric encoding
Import the LabelEncoder class from the preprocessing module of sklearn.
from sklearn.preprocessing import LabelEncoder
cols = ["neighbourhood_group", "room_type"]  # columns that need numeric encoding
for col in cols:
    # Create a new LabelEncoder object named le
    le = LabelEncoder()
    # fit() builds the mapping between category values and integer codes
    le.fit(flat_data[col])
    # transform() replaces each category value with its integer code
    flat_data[col] = le.transform(flat_data[col])
flat_data.head()
The neighbourhood_group and room_type columns are now encoded as integers such as 0, 1, and 2.
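Because the loop above reuses the name le, the mapping from codes back to labels is lost once it finishes. The sketch below shows how the mapping can be inspected through classes_ and reversed with inverse_transform(); the room type labels are assumed values for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed room type labels, for illustration only.
room_types = ["Private room", "Entire home/apt", "Shared room", "Private room"]

le = LabelEncoder()
codes = le.fit_transform(room_types)

# classes_ is sorted, and each label's position is its integer code.
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform() recovers the original labels from the codes.
print(le.inverse_transform(codes))
```

The codes follow alphabetical order and carry no real ranking; tree-based models such as LightGBM can usually cope with that, whereas linear models would normally call for one-hot encoding instead.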
4. Building the LightGBM model
4.1 Log transform
The log transform is a common feature-engineering technique. For strictly positive data with a long-tailed distribution, it spreads out the low end and compresses the high end, which tempers the extreme skew of the distribution and often improves model performance. Here it is applied to the target column, price.
# Remove rows whose price is 0 (the logarithm is undefined there)
flat_data = flat_data[flat_data['price'] > 0].copy()
flat_data['price'] = np.log10(flat_data['price'])
flat_data.head()
flat_data['price'].describe()
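To make the effect of the transform concrete, the following sketch applies log10 to a few hypothetical prices spanning three orders of magnitude and then inverts it with 10 ** x, which is the same inversion needed later to turn predictions back into prices.

```python
import numpy as np
import pandas as pd

# Hypothetical prices spanning three orders of magnitude.
prices = pd.Series([10.0, 100.0, 1000.0, 10000.0])

# After log10, each tenfold increase becomes a step of 1.
log_prices = np.log10(prices)
print(log_prices.tolist())

# The transform is inverted by raising 10 to the transformed value.
restored = 10 ** log_prices
print(np.allclose(restored, prices))
```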
4.2 Splitting features and target
X = flat_data.drop('price', axis=1)
y = flat_data['price']
4.3 Building the model
import lightgbm
# Create an LGBMRegressor object named model
model = lightgbm.LGBMRegressor()
# Candidate hyperparameter values
params = {'n_estimators': [10, 20, 30, 50, 100, 200, 500],  # number of base learners
          'subsample': [0.6, 0.7, 0.8, 0.9, 1.0],  # fraction of rows sampled per tree (only active when subsample_freq > 0)
          'colsample_bytree': [0.6, 0.7, 0.8, 0.9, 1.0],  # fraction of features sampled per tree
          'learning_rate': [0.01, 0.03, 0.1, 0.2, 0.3],  # learning rate
          'reg_lambda': [0, 0.1, 0.2, 0.5, 0.7, 0.9, 1]  # L2 regularization
          }
# Randomized hyperparameter search
from sklearn.model_selection import RandomizedSearchCV
# Create a RandomizedSearchCV object
# cv: number of cross-validation folds (default 5)
lgbm_search_cv = RandomizedSearchCV(model, params, cv=5, scoring='neg_mean_absolute_error')
Call fit() on lgbm_search_cv, passing in X and y, to run the search.
lgbm_search_cv.fit(X,y)
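It helps to know how much of the search space a randomized search actually covers. RandomizedSearchCV tries n_iter parameter settings, and n_iter defaults to 10. The sketch below counts the full grid with ParameterGrid and draws 10 settings with ParameterSampler; random_state=0 is an assumption added here for reproducibility and is not set in the article's code.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same candidate values as the search above.
params = {
    'n_estimators': [10, 20, 30, 50, 100, 200, 500],
    'subsample': [0.6, 0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': [0.6, 0.7, 0.8, 0.9, 1.0],
    'learning_rate': [0.01, 0.03, 0.1, 0.2, 0.3],
    'reg_lambda': [0, 0.1, 0.2, 0.5, 0.7, 0.9, 1],
}

# Size of the full grid: 7 * 5 * 5 * 5 * 7 combinations.
n_combinations = len(ParameterGrid(params))
print(n_combinations)  # 6125

# Draw 10 settings, as a randomized search with the default n_iter would.
sampled = list(ParameterSampler(params, n_iter=10, random_state=0))
print(len(sampled))
```

With 10 of 6125 combinations tried, results can vary noticeably between runs; passing n_iter and random_state to RandomizedSearchCV is one way to widen the search and make it repeatable.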
The best_estimator_ attribute holds the best model found.
lgbm_search_cv.best_estimator_
lgbm_search_cv.best_score_ holds the best score, which is the negative of the MAE; taking abs() gives the model's MAE. Because the target was log-transformed, this MAE is in log10(price) units, not in price units.
abs(lgbm_search_cv.best_score_)
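An absolute error on the log10 scale equals |log10(predicted / true)|, so it is easier to interpret as a multiplicative factor than as a price difference. A minimal sketch, using a hypothetical MAE of 0.2 (not the article's result):

```python
# Hypothetical MAE on the log10(price) scale; not the article's result.
mae_log10 = 0.2

# An error of e in log10 units means the prediction is off by a factor of 10 ** e.
typical_factor = 10 ** mae_log10
print(round(typical_factor, 3))  # 1.585
```

Read this way, an MAE of 0.2 would mean predictions are typically off by a factor of about 1.6 in either direction.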
The best_estimator_.feature_importances_ attribute gives the importance of each feature.
# Feature importances, sorted in descending order
pd.Series(lgbm_search_cv.best_estimator_.feature_importances_, index=X.columns).sort_values(ascending=False)
# Draw a bar chart of the feature importances with barplot()
plt.figure(figsize=(8,5))
sns.barplot(x=lgbm_search_cv.best_estimator_.feature_importances_, y=X.columns, palette="Set2")
Longitude and latitude are the most important features, with little difference between the two. Next come availability_365 (days available per year) and calculated_host_listings_count (number of listings owned by the host). The least important feature is neighbourhood_group.
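With LightGBM's default importance_type of 'split', feature_importances_ counts how many times each feature is used in a split, so the raw numbers depend on the number of trees. Converting them to shares makes them easier to compare. The sketch below uses hypothetical counts, not the values produced in this lab.

```python
import pandas as pd

# Hypothetical split counts; not the values produced in this lab.
importances = pd.Series({
    "longitude": 900,
    "latitude": 850,
    "availability_365": 600,
    "neighbourhood_group": 50,
})

# Share of all splits that use each feature, largest first.
shares = (importances / importances.sum()).sort_values(ascending=False)
print(shares.round(3))
```

Split counts tend to favor continuous features with many distinct values, such as coordinates, over low-cardinality ones, which is worth keeping in mind when reading the ranking.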
# Convert back to the original price scale and compute the absolute error
# (in-sample: X was also used to train the model)
y_true = 10**y
y_predict = 10**(lgbm_search_cv.best_estimator_.predict(X))
absolute_error = abs(y_true.values - y_predict)  # absolute error in price units
# Collect the results in a DataFrame
pd.DataFrame({"true": y_true, "predict": y_predict, "absolute error": absolute_error}).head()
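The errors above are computed on the same rows the model was trained on, so they are likely to look better than errors on unseen listings. A hold-out split gives a more honest estimate. The sketch below is an illustration under stated assumptions: the data is synthetic, and scikit-learn's GradientBoostingRegressor stands in for LGBMRegressor so that the example needs no extra dependency.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data: two features and a log10-scale target.
X = rng.uniform(0, 1, size=(400, 2))
y = 2.0 + 0.5 * X[:, 0] + rng.normal(0, 0.1, size=400)

# Keep 20% of the rows aside; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Stand-in model (the lab itself uses lightgbm.LGBMRegressor).
model = GradientBoostingRegressor(random_state=0)
model.fit(X_train, y_train)

# MAE in original units, after undoing the log10 transform.
mae_train = mean_absolute_error(10 ** y_train, 10 ** model.predict(X_train))
mae_test = mean_absolute_error(10 ** y_test, 10 ** model.predict(X_test))
print(mae_train, mae_test)
```

In the lab's setting, the same pattern would mean splitting X and y before calling lgbm_search_cv.fit() and reporting the error on the held-out part.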
5. Summary
We first read the dataset and looked at its basic information to get a general understanding of it. We then summarized and visualized the data, drawing a histogram of the price distribution, a bar chart of the number of listings by room type, a box plot of prices by region, a scatter plot of listing locations, and more. Next we preprocessed the data, including missing-value handling, numeric encoding, and a log transform of the target column. Finally we built a LightGBM regression model, tuned its hyperparameters with a randomized search, and looked at the MAE of the best model.
iDataCourse is a big data and artificial intelligence course and resource platform for colleges and universities. It provides course resources, data resources, and case-based lab resources to help universities build big data and artificial intelligence programs, develop curricula, and strengthen teaching capacity.