iDataCourse experiment | Issue 8: Building a Singapore Airbnb price prediction model

2022-06-27 21:00:00 【Data science artificial intelligence】

iDataCourse: idatacourse.cn

Domain: consumption

Overview: The data covers Airbnb listings in Singapore: 7,907 records with 16 fields. In this experiment we use Python plotting libraries to analyze the data visually, looking at the value distribution of each feature and the relationships between features. We then build a regression model that predicts a listing's price from its longitude, latitude, room type, administrative region, and other features.

Data:

./dataset/listings.csv

1. Data preparation

1.1 About the dataset

The dataset contains Airbnb listings in Singapore: 7,907 records with 16 fields. We visualize it with Python plotting libraries to examine the value distribution of each feature and the relationships between features, then build a regression model that predicts the listing price from longitude, latitude, room type, administrative region, and other features. The meaning of each field is shown in the following table:

Field	Meaning
id	Listing ID
name	Listing name
host_id	Host ID
host_name	Host name
neighbourhood_group	Region group the listing belongs to
neighbourhood	Administrative area
latitude	Latitude
longitude	Longitude
room_type	Room type (entire home, private room, shared room)
price	Price
minimum_nights	Minimum number of nights
number_of_reviews	Number of reviews
last_review	Date of the last review
reviews_per_month	Average number of reviews per month
calculated_host_listings_count	Number of listings the host has
availability_365	Number of days per year the listing is available

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('white')

import warnings
warnings.filterwarnings('ignore')
plt.rcParams['font.sans-serif']='SimHei'
%config InlineBackend.figure_format = 'svg'
%matplotlib inline

1.2 Reading the data

Load the data first to get a first look at it.

flat_data = pd.read_csv('./dataset/listings.csv')

The read_csv() function in pandas reads a CSV file and returns the result as a DataFrame. The DataFrame's shape attribute gives the size of the dataset, and the head() method shows the first n rows (5 by default).

print(flat_data.shape)
flat_data.head()

(7907, 16)
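The same calls can be tried without the dataset file. The following is a minimal sketch that reads a tiny inline CSV; the two rows and the name demo are made up for illustration and are not taken from listings.csv.

```python
import io
import pandas as pd

# Made-up two-row CSV used only for illustration (not from listings.csv)
csv_text = """id,room_type,price
1,Private room,80
2,Entire home/apt,200
"""

# read_csv() also accepts file-like objects, so no file on disk is needed
demo = pd.read_csv(io.StringIO(csv_text))

print(demo.shape)    # (rows, columns); shape is an attribute, not a method
print(demo.head(1))  # head(n) returns the first n rows
```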

Calling the DataFrame's info() method prints a summary of the DataFrame: each column's name, data type (dtype), and non-null count, along with the dimensions of the frame and the memory it occupies.

flat_data.info()

Three fields in the dataset have missing values: the listing name (name), the date of the last review (last_review), and the average number of reviews per month (reviews_per_month). These missing values need to be handled.

# View the number of reviews, last review date, and average monthly reviews columns
flat_data[['number_of_reviews','last_review','reviews_per_month']]

The last_review and reviews_per_month fields are missing because those listings have 0 reviews. During preprocessing before modeling, we can drop the last_review column and fill the missing reviews_per_month values with 0.
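The claim that reviews_per_month is missing only where there are no reviews can be checked directly rather than by eye. The sketch below does this on a small made-up frame (the values and the names demo and cleaned are illustrative assumptions); on the real data the same expressions would be applied to flat_data.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; the real data lives in ./dataset/listings.csv
demo = pd.DataFrame({
    "number_of_reviews": [0, 3, 0, 12],
    "last_review": [None, "2019-05-01", None, "2019-08-11"],
    "reviews_per_month": [np.nan, 0.4, np.nan, 1.2],
})

# Number of missing values in each column
print(demo.isnull().sum())

# Rows where reviews_per_month is missing
missing = demo["reviews_per_month"].isnull()
# True only if every such row has zero reviews
only_when_no_reviews = bool((demo.loc[missing, "number_of_reviews"] == 0).all())
print(only_when_no_reviews)

# Drop last_review and fill only the reviews_per_month column with 0
cleaned = demo.drop(columns="last_review").fillna({"reviews_per_month": 0})
print(cleaned)
```

Passing a dict to fillna() restricts the fill to the named column, which avoids silently writing 0 into other columns that happen to contain missing values.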

2. Statistics and visualization

2.1 Bar chart of the number of distinct values per categorical feature

Use Seaborn's barplot() function to draw a bar chart showing how many distinct values each categorical feature takes.

plt.figure(figsize=(8,5))

# Count the distinct values of each categorical feature
count_uniq = []
columns = ['neighbourhood_group','neighbourhood', 'room_type']
for column in columns:
    # Number of distinct values in this column
    count_uniq.append(flat_data[column].nunique())
print(count_uniq)

sns.barplot(x=columns, y=count_uniq,palette='Set3')
plt.title('Number of distinct values per categorical feature')

The figure shows that the region group has 5 distinct values, the administrative area has 43, and the room type has 3.

2.2 Histogram of the price distribution

Use Seaborn's distplot() function to draw a histogram showing the distribution of prices.

plt.figure(figsize=(8,5))
# distplot() is deprecated since seaborn 0.11; sns.histplot(flat_data["price"], kde=True) is the newer alternative
sns.distplot(flat_data["price"])  # Histogram
plt.title('Histogram of the price distribution')

Overall, prices range from 0 to 10,000, but very few listings are priced between 1,000 and 10,000; most are below 1,000.
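A histogram of such a long-tailed column is hard to read precisely, so it can help to back it up with a few numbers. The following sketch uses a made-up price series (the values are illustrative, not from the dataset) to show the share of listings under 1,000, a few quantiles, and the skewness.

```python
import pandas as pd

# Made-up prices for illustration only
prices = pd.Series([40, 65, 80, 120, 150, 200, 350, 900, 4000, 10000])

# Fraction of listings priced below 1000 (mean of a boolean mask)
share_below_1000 = (prices < 1000).mean()
print(share_below_1000)

# Median and upper quantiles show where the bulk of the prices sits
print(prices.quantile([0.5, 0.9, 0.99]))

# A positive skewness indicates a long right tail
skewness = prices.skew()
print(skewness)
```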

2.3 Number of listings by room type

Use Seaborn's countplot() function to draw a bar chart showing the number of listings of each room type.

plt.figure(figsize=(8,5))
# Keyword arguments work across seaborn versions; positional data arguments fail on newer ones
sns.countplot(x='room_type', data=flat_data, palette='Set2')
plt.title('Number of listings by room type')

Entire-home listings are the most numerous, followed by private rooms, and shared rooms are the fewest. Entire homes and private rooms make up most of the listings and may be the more popular types.

2.4 Bar chart of listings by region

Use Seaborn's countplot() function to draw a bar chart showing how the listings are distributed across regions.

plt.figure(figsize=(8,5))
sns.countplot(x='neighbourhood_group', data=flat_data)
plt.title('Number of listings by region')

The results show that most listings are in the Central region, followed by the West, East, and North-East regions; the North region has the fewest.

2.5 Bar chart of room types by region

Use Seaborn's countplot() function to draw a bar chart showing the room types in each region.

plt.figure(figsize=(8,5))
sns.countplot(data = flat_data,x='room_type',hue='neighbourhood_group')
plt.title('Room types by region')

The Central region has the most entire-home listings, while in the other regions private rooms are the most common type. The vast majority of shared rooms are in the Central region, possibly because prices there are higher.
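The counts behind a grouped bar chart like this one can also be tabulated directly. Here is a minimal sketch using pd.crosstab() on a made-up frame; the rows and the names demo, counts, and top_type are illustrative assumptions.

```python
import pandas as pd

# Made-up rows for illustration only
demo = pd.DataFrame({
    "neighbourhood_group": ["Central Region", "Central Region", "North Region",
                            "Central Region", "East Region", "North Region"],
    "room_type": ["Entire home/apt", "Shared room", "Private room",
                  "Entire home/apt", "Private room", "Private room"],
})

# Rows: region, columns: room type, cells: number of listings
counts = pd.crosstab(demo["neighbourhood_group"], demo["room_type"])
print(counts)

# Most common room type in each region
top_type = counts.idxmax(axis=1)
print(top_type)
```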

2.6 Box plot of prices by region

Use Seaborn's boxplot() function to draw a box plot showing listing prices in each region.

plt.figure(figsize=(8,5))
sns.boxplot(x = 'neighbourhood_group',
            y = 'price',
            data = flat_data[flat_data['price']<=500] # Analyze only listings priced at 500 or less
           )
plt.title('Box plot of prices by region')

The box plot shows that prices in the Central region are more widely spread, and the typical price there is also higher than in the other regions. The typical price in the North region is the lowest.
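The line inside each box of a box plot is the median, not the mean, so statements about the "average" price are best confirmed numerically. The sketch below computes both per region on a made-up frame (values and names are illustrative), using the same price cap of 500 as the plot.

```python
import pandas as pd

# Made-up rows for illustration only
demo = pd.DataFrame({
    "neighbourhood_group": ["Central", "Central", "Central", "North", "North", "North"],
    "price": [100, 200, 450, 60, 70, 110],
})

# Same filter as the box plot: keep listings priced at 500 or less
capped = demo[demo["price"] <= 500]

# Median (the line in the box) and mean for each region
summary = capped.groupby("neighbourhood_group")["price"].agg(["median", "mean"])
print(summary)
```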

2.7 Box plot of price by room type

Use Seaborn's boxplot() function to draw a box plot showing the relationship between room type and price.

plt.figure(figsize=(8,5))
sns.boxplot(x = 'room_type',
            y = 'price',
            data = flat_data[flat_data['price']<=500] # Analyze only listings priced at 500 or less
           )
plt.title('Box plot of price by room type')

Entire-home listings have the widest price spread, and their typical price is higher than that of the other two types. Shared rooms have the lowest typical price.

2.8 Scatter plot of listing locations

Use Seaborn's scatterplot() function to draw a scatter plot showing the longitude and latitude of each listing.

plt.figure(figsize=(10,7))
# The x-axis is longitude and the y-axis is latitude
sns.scatterplot(x='longitude', y='latitude',
                hue='neighbourhood_group', data=flat_data)
plt.title('Scatter plot of listing locations')

Orange marks listings in the Central region, green the East region, red the West region, purple the North-East region, and blue the North region. Listings in the Central region are numerous and densely packed, while those in the North region are the fewest and relatively scattered.

2.9 Scatter plot of the price distribution

Use Seaborn's scatterplot() function to draw a scatter plot showing how prices are distributed geographically.

# Visualize prices
plt.figure(figsize=(10,7))
# The x-axis is longitude and the y-axis is latitude
sns.scatterplot(x='longitude', y='latitude',
                hue='price', data=flat_data)
plt.title('Scatter plot of the price distribution')

Most of the higher-priced listings are in the Central and West regions; the East, North-East, and North regions have very few.

3. Data preprocessing

3.1 Dropping unneeded columns

Call the DataFrame's drop() method with axis=1 to delete columns that are not needed for modeling, such as the listing ID (id), listing name (name), and host ID (host_id).

# Drop columns that are not needed for modeling
flat_data = flat_data.drop(['id', 'name','host_id','host_name', 'last_review', 'neighbourhood'],
                           axis=1)

3.2 Handling missing values

Call the DataFrame's fillna() method to fill missing values with 0.

# Fill missing data with 0 (the missing average monthly review counts become 0)
flat_data = flat_data.fillna(0)

# Confirm that no missing values remain
flat_data.isnull().sum()

3.3 Label encoding

Import the LabelEncoder class from the preprocessing module of sklearn.

from sklearn.preprocessing import LabelEncoder
cols = ["neighbourhood_group","room_type"] # Columns that need to be encoded as numbers

for col in cols:
    # Create a new LabelEncoder object named le
    le = LabelEncoder()
    # Call fit() to build the mapping between category values and integer codes
    le.fit(flat_data[col])
    # Call transform() to replace each value with its integer code
    flat_data[col] = le.transform(flat_data[col])

flat_data.head()

The neighbourhood_group and room_type columns are converted to integer codes such as 0, 1, and 2.
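To see which integer stands for which category, the fitted encoder's classes_ attribute can be inspected. The following sketch runs LabelEncoder on a short made-up list of room types (the list and the names codes and mapping are illustrative); LabelEncoder assigns codes in sorted order of the labels, so the numbers carry no meaning beyond identifying the category.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample of room types for illustration only
room_types = ["Private room", "Entire home/apt", "Shared room", "Private room"]

le = LabelEncoder()
codes = le.fit_transform(room_types)  # fit() and transform() in one step

# classes_ holds the labels in sorted order; the position is the integer code
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform() maps the codes back to the original labels
restored = list(le.inverse_transform(codes))
print(restored)
```

Tree-based models such as LightGBM split on thresholds, so an arbitrary integer coding is usually workable for them, whereas linear models would typically need one-hot encoding instead.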

4. LightGBM model building

4.1 Log transformation

A log transformation is a common feature engineering technique. For long-tailed data with values greater than 0, taking the logarithm tempers the extreme tail: it gives the low end of the range more room and compresses the high end, which makes the overall distribution more balanced and can in turn improve the model.
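The compression effect and the way back to real prices can be seen on a handful of numbers. This is a small sketch with made-up prices (illustrative values only): log10 maps a 200-fold price range onto a span of about 2.3, and raising 10 to the transformed value recovers the original price.

```python
import numpy as np

# Made-up prices for illustration only
prices = np.array([50.0, 100.0, 1000.0, 10000.0])

# Forward transform: each factor of 10 in price becomes a step of 1
log_prices = np.log10(prices)
print(log_prices)

# Range before and after the transform
raw_ratio = prices.max() / prices.min()          # largest price relative to smallest
log_span = log_prices.max() - log_prices.min()   # equals log10(raw_ratio)
print(raw_ratio, log_span)

# Inverse transform: 10 ** log10(price) gives back the price
restored = 10 ** log_prices
print(restored)
```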

# Remove listings with a price of 0 (log10 is undefined at 0)
flat_data = flat_data[flat_data['price']>0].copy()  # copy() avoids assigning into a slice of the original frame
flat_data['price'] = np.log10(flat_data['price'])
flat_data.head()

flat_data['price'].describe()

4.2 Splitting target and features

X = flat_data.drop('price', axis=1)
y = flat_data['price']

4.3 model building

import lightgbm

# Create a new LGBMRegressor object named model
model = lightgbm.LGBMRegressor()

# Candidate values for each hyperparameter
params = {'n_estimators': [10,20,30,50,100,200,500], # Number of base learners
               'subsample': [0.6, 0.7, 0.8, 0.9, 1.0], # Fraction of rows sampled during training (only takes effect when subsample_freq > 0)
               'colsample_bytree': [0.6, 0.7, 0.8, 0.9, 1.0], # Fraction of features sampled for each tree
               'learning_rate' : [0.01,0.03,0.1,0.2,0.3], # Learning rate
               'reg_lambda':[0,0.1,0.2,0.5,0.7,0.9,1] # L2 regularization strength
              }

# Randomized hyperparameter search
from sklearn.model_selection import RandomizedSearchCV
# Create a new RandomizedSearchCV object
# cv: number of cross-validation folds (default 5)
lgbm_search_cv = RandomizedSearchCV(model, params, cv=5, scoring='neg_mean_absolute_error')
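RandomizedSearchCV does not try every combination: by default it samples n_iter=10 candidates from the grid, and without a random_state the sampled candidates differ between runs. The sketch below, which needs no model fitting, uses scikit-learn's ParameterGrid and ParameterSampler to show how large the full grid is and what a reproducible sample of 10 candidates looks like; the names n_total and sampled and the seed 0 are illustrative choices.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same candidate values as the search space above
params = {'n_estimators': [10, 20, 30, 50, 100, 200, 500],
          'subsample': [0.6, 0.7, 0.8, 0.9, 1.0],
          'colsample_bytree': [0.6, 0.7, 0.8, 0.9, 1.0],
          'learning_rate': [0.01, 0.03, 0.1, 0.2, 0.3],
          'reg_lambda': [0, 0.1, 0.2, 0.5, 0.7, 0.9, 1]}

# Size of the full grid: 7 * 5 * 5 * 5 * 7 combinations
n_total = len(ParameterGrid(params))
print(n_total)

# A reproducible random sample of 10 candidates, as a randomized search would draw
sampled = list(ParameterSampler(params, n_iter=10, random_state=0))
for candidate in sampled[:3]:
    print(candidate)
```

Passing n_iter and random_state to RandomizedSearchCV in the same way would trade search time against coverage and make the search repeatable.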

Call the fit() method of lgbm_search_cv, passing in X and y for training.

lgbm_search_cv.fit(X,y)

Use best_estimator_ to view the best model.

lgbm_search_cv.best_estimator_

lgbm_search_cv.best_score_ returns the best score, which is the negative of the MAE. Taking its absolute value with abs() gives the model's MAE.

abs(lgbm_search_cv.best_score_)
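Because the target was transformed with log10, this MAE is measured in log10(price) units rather than in currency. An MAE of m in log space corresponds to predictions that are typically off by a multiplicative factor of 10**m. The arithmetic is sketched below with a hypothetical MAE of 0.15, which is an assumed value for illustration and not the result of this experiment.

```python
# Hypothetical MAE in log10(price) units (assumed for illustration, not a real result)
mae_log10 = 0.15

# Typical multiplicative error: predictions are off by about this factor
factor = 10 ** mae_log10
print(round(factor, 4))

# For a listing whose true price is 100, a typical prediction would fall
# roughly between price / factor and price * factor
true_price = 100.0
low, high = true_price / factor, true_price * factor
print(round(low, 1), round(high, 1))
```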

Use best_estimator_.feature_importances_ to see the importance of each feature.

# Feature importances, sorted in descending order
pd.Series(lgbm_search_cv.best_estimator_.feature_importances_, index=X.columns).sort_values(ascending=False)

# Call barplot() to draw a bar chart of the feature importances
plt.figure(figsize=(8,5))
sns.barplot(x=lgbm_search_cv.best_estimator_.feature_importances_,y=X.columns,palette="Set2")

The most important features are the listing's longitude and latitude, which differ little in importance. Next come the number of days per year the listing is available (availability_365) and the number of listings the host has (calculated_host_listings_count). The least important feature is the region group (neighbourhood_group).

# Convert back to actual prices and compute the absolute error
y_true = 10**y
y_predict = 10**(lgbm_search_cv.best_estimator_.predict(X))
absolute_error = abs(y_true.values-y_predict) # Absolute error in price units
# Collect the results in a DataFrame
pd.DataFrame({"true":y_true, "predict": y_predict , "absolute error":absolute_error}).head()

5. Summary

We first read the dataset and looked at its basic information to get a general understanding of it. We then computed statistics and visualized the data, drawing a histogram of the price distribution, a bar chart of the number of listings by room type, box plots of prices by region, a scatter plot of listing locations, and so on. Next we preprocessed the data, including handling missing values, label encoding, and log-transforming the target column. Finally we built a LightGBM regression model, tuned its hyperparameters with randomized search, and examined the MAE of the best model.

iDataCourse is a big data and artificial intelligence course and resource platform for colleges and universities. The platform provides authoritative course resources, data resources, and case-study experiment resources to help universities build big data and artificial intelligence programs, develop curricula, and strengthen teaching capacity.

Copyright notice
This article was written by [Data science artificial intelligence]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/178/202206271840139588.html