
Second-hand housing data analysis and prediction system

2022-07-02 19:59:00 【Data analysis cases】

author | leo

With the progress of science and technology, information has become an important driver of technological development. Analyzing massive data lets us better serve future production and daily life, adjust strategy in time, and prepare for a rainy day.

Today we present a comprehensive, multi-dimensional data analysis scenario: a second-hand housing data analysis and prediction system. It walks through the full analysis process: data acquisition, data preprocessing, data analysis, visualization, and the generation of analysis results.

01 Data acquisition

There are two ways to obtain data: purchasing it from a reliable data provider, or collecting it with technical means such as a Python web crawler.

Dataset link: https://pan.baidu.com/s/1-rGGM6tuoDbxtaG9gV4B2w (extraction code: ftvk)

Crawler implementation: the relevant fields are extracted mainly with the requests library and XPath parsing.
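The XPath extraction step can be sketched roughly as follows. This is only an illustration: the markup, class names, and field names are hypothetical, and to stay offline and dependency-free it parses a well-formed snippet with the standard library's ElementTree (which supports a subset of XPath) instead of fetching a real page with requests.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed listing markup. A real page would be fetched with
# requests and is usually parsed with a full XPath engine such as lxml.
SAMPLE_HTML = """
<ul>
  <li class="item"><span class="region">Haidian</span><span class="size">89.5</span><span class="price">720</span></li>
  <li class="item"><span class="region">Xicheng</span><span class="size">60.0</span><span class="price">680</span></li>
</ul>
"""


def parse_listings(html):
    """Return one dict per listing, extracted with XPath-style queries."""
    root = ET.fromstring(html)
    rows = []
    for item in root.findall(".//li[@class='item']"):
        rows.append({
            "Region": item.find("./span[@class='region']").text,
            "Size": float(item.find("./span[@class='size']").text),
            "Price": float(item.find("./span[@class='price']").text),
        })
    return rows


listings = parse_listings(SAMPLE_HTML)
print(listings)
```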

02 Data loading

Combine the extracted information into a single dataset:

2.1 Import the required Python packages

2.2 Load data
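A minimal loading sketch, assuming the dataset is a CSV file. The file path is not given in the article, so a small in-memory sample with hypothetical rows and column names stands in for it here.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the downloaded file. With the real
# dataset this would be a call such as pd.read_csv(<path to the CSV file>).
SAMPLE_CSV = """Id,Region,Size,Price,Elevator,Floor,Year
1,Haidian,89.5,720,yes,12,2005
2,Xicheng,110.0,1250,,6,1998
3,Chaoyang,95.0,650,no,5,2010
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))
print(df.head())  # quick look at the first rows
```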

Data preview:

Checking the basic information of the data is a very important step in data analysis: you need to look at the data types, the missing values, and so on.

Inspecting the data shows that the Elevator field has a large number of missing values, and that the maximum and minimum of the Size field (the floor area of the house) are 1019 square meters and 2 square meters. Common sense says these are outliers.

However, the missing Elevator values may simply be information that was never collected or uploaded, so we leave them alone for now. The oversized areas are dealt with in a later analysis step.
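The checks described above might look like the following sketch. The rows are hypothetical; only the Elevator and Size column names and the 2 and 1019 square meter extremes come from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that reproduce the two issues named in the text.
df = pd.DataFrame({
    "Size": [2.0, 89.5, 110.0, 1019.0],
    "Elevator": ["yes", np.nan, np.nan, "no"],
})

df.info()                                    # dtypes and non-null counts
missing = df.isna().sum()                    # missing values per column
size_range = df["Size"].agg(["min", "max"])  # extremes hint at outliers
print(missing, size_range, sep="\n")
```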

03 Data analysis

3.1 Add the unit price field

This field holds the price per square meter, which gives the analysis that follows more to work with.

The data above shows that the ID field is meaningless for analysis. We extract the key fields by reordering the columns, and compute the unit price as the total price divided by the floor area in square meters.
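As a sketch of this step, assuming the total price is stored in a column named Price and the new field is called PerPrice (both names are assumptions, as are the sample rows):

```python
import pandas as pd

# Hypothetical rows: Price is the total price, Size the area in square meters.
df = pd.DataFrame({
    "Id": [101, 102],
    "Region": ["Haidian", "Xicheng"],
    "Size": [80.0, 50.0],
    "Price": [640.0, 550.0],
})

# Unit price = total price / floor area.
df["PerPrice"] = df["Price"] / df["Size"]

# Reorder the columns and leave out Id, which carries no analytical meaning.
df = df[["Region", "Size", "Price", "PerPrice"]]
print(df)
```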

3.2 Regional feature analysis

Regional analysis mainly uses the pandas groupby method to group listings by region, then computes summary statistics of total price and unit price for each region, and finally visualizes them with Seaborn bar charts and box plots to draw conclusions.
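The grouping step could be sketched like this. The column names (Region, Price, PerPrice) and the rows are hypothetical.

```python
import pandas as pd

# Hypothetical listings.
df = pd.DataFrame({
    "Region": ["Haidian", "Haidian", "Xicheng", "Xicheng", "Chaoyang"],
    "Price": [640.0, 800.0, 550.0, 1100.0, 450.0],
    "PerPrice": [8.0, 9.0, 11.0, 11.0, 6.0],
})

# One row per region: listing count, mean total price, mean unit price.
summary = (
    df.groupby("Region")
      .agg(count=("Price", "size"),
           mean_price=("Price", "mean"),
           mean_per_price=("PerPrice", "mean"))
      .sort_values("mean_per_price", ascending=False)
      .reset_index()
)
print(summary)
```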

Visualization code:
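The original plotting code is not reproduced here. A rough equivalent, assuming Seaborn and Matplotlib are installed and using hypothetical data and column names, might be:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical listings.
df = pd.DataFrame({
    "Region": ["Haidian", "Haidian", "Xicheng", "Xicheng", "Chaoyang"],
    "Price": [640.0, 800.0, 550.0, 1100.0, 450.0],
    "PerPrice": [8.0, 9.0, 11.0, 11.0, 6.0],
})

fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(8, 12))

# Mean unit price per region.
mean_per_price = df.groupby("Region")["PerPrice"].mean().reset_index()
sns.barplot(x="Region", y="PerPrice", data=mean_per_price, ax=ax1)

# Number of listings per region.
sns.countplot(x="Region", data=df, ax=ax2)

# Distribution of total price per region.
sns.boxplot(x="Region", y="Price", data=df, ax=ax3)

fig.tight_layout()
```

Calling plt.show() or fig.savefig(...) afterwards displays or stores the three panels.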

Resulting figures:

Summary of analysis results:

a. Average unit price of second-hand houses (per square meter): the district with the highest average price is Xicheng District, at about 110,000 yuan per square meter. Xicheng is the most prosperous area in Beijing and has a concentration of key middle schools, so it is plausible that its prices are the highest. Dongcheng District is second at about 100,000 yuan per square meter, followed by Haidian District at about 85,000 yuan per square meter. The remaining districts are all below 80,000 yuan per square meter.

b. Number of second-hand houses: Haidian District and Chaoyang District have the most listings, each close to 3,000 units, followed by Fengtai District.

c. Total price distribution of second-hand houses: the box plots show that the median total price in each district is below 10 million yuan, but the spread is large. The highest outlier in Xicheng District reaches 60 million yuan, which shows that the total price distribution is far from ideal.

3.3 Floor area (Size) analysis

A histogram shows the distribution of floor area, and a scatter plot describes the relationship between house price and area.

Summary of analysis results:

The visualizations above show that floor areas are concentrated around 100 square meters. The long tail in the plot shows that some units are very large, but there are only a few of them.

The scatter plot shows that the relationship between house price and area is roughly linear. This matches common sense: the larger the area, the higher the price.

Outlier analysis:

Filtering with the expression above reveals some listings with an area of less than 10 square meters but a price of more than 10 million yuan.
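The original filter expression is not shown. A filter matching the description might look like this sketch, which assumes Price is recorded in units of 10,000 yuan (so 1000 means 10 million yuan) and uses hypothetical rows:

```python
import pandas as pd

# Hypothetical rows; Price is assumed to be in units of 10,000 yuan.
df = pd.DataFrame({
    "Size": [5.0, 89.5, 8.0, 120.0],
    "Price": [1500.0, 720.0, 300.0, 1300.0],
})

# Tiny area combined with a very high total price is suspicious.
suspicious = df[(df["Size"] < 10) & (df["Price"] > 1000)]
print(suspicious)
```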

Compare with the first rows of the data:

Comparing the two results shows that the fields in the first result set are misaligned. Checking the house category also shows that the listings with tiny areas are mostly villas, which are outside the scope of an analysis of second-hand commercial housing, so these rows can be deleted.

The following expression reveals a small number of very large properties whose unit price is far below the market price.

Further investigation suggests that these listings are most likely office buildings. They are also outside the scope of this analysis and need to be removed. Finally, the data is filtered with the following expression.
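The exact expressions are not reproduced in the article. One way the final filter could be written is sketched below. The Layout labels, the MAX_SIZE threshold, and the rows are all hypothetical and would need to be chosen from the real data.

```python
import pandas as pd

# Hypothetical rows and category labels.
df = pd.DataFrame({
    "Layout": ["2 room 1 hall", "villa", "3 room 2 hall", "1 room 0 hall"],
    "Size": [89.5, 5.0, 110.0, 1019.0],
    "Price": [720.0, 1500.0, 1250.0, 1700.0],
})

MAX_SIZE = 1000  # hypothetical cut-off for implausibly large listings

is_villa = df["Layout"].str.contains("villa")
is_oversized = df["Size"] > MAX_SIZE

# Keep only rows that are neither villas nor oversized properties.
cleaned = df[~is_villa & ~is_oversized].reset_index(drop=True)
print(cleaned)
```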

Run the visual analysis again:

As the figure shows, the abnormal data has largely disappeared.

3.4 House layout analysis

A Seaborn count plot shows the number of houses of each layout.

Analysis results:

The main layouts are 2 rooms with 1 hall, 3 rooms with 1 hall, 2 rooms with 2 halls, and 3 rooms with 2 halls. The layout names are not standardized, which hinders their later use in machine learning, so the field needs feature engineering.
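One possible form of that feature engineering is to split the layout string into numeric room and hall counts. The sketch below uses the labels as they appear in this article; the raw data presumably uses different labels, so the pattern is an assumption and would need adjusting.

```python
import pandas as pd

# Hypothetical layout strings, including one that does not follow the pattern.
layouts = pd.Series(["2 room 1 hall", "3 room 2 hall", "villa"], name="Layout")

# Pull the two numbers out into named columns; unmatched rows become NaN.
parts = layouts.str.extract(r"(?P<Rooms>\d+)\s*room\s*(?P<Halls>\d+)\s*hall")
parts = parts.apply(pd.to_numeric)
print(parts)
```

Rows that do not match the pattern stay missing, which makes non-standard layouts easy to find and handle separately.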

3.5 Renovation status analysis

Use the value_counts() method to count the number of houses in each renovation status:
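A minimal sketch of this count, with a hypothetical Renovation column and hypothetical category labels:

```python
import pandas as pd

# Hypothetical renovation labels.
df = pd.DataFrame({
    "Renovation": ["fine", "simple", "fine", "rough", "other", "fine", "simple"],
})

# Number of listings per renovation status, most frequent first.
counts = df["Renovation"].value_counts()
print(counts)
```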

Use a count plot, a bar chart, and a box plot to visualize the four renovation types above:

Analysis results:

Finely renovated houses are the most numerous, followed by simply renovated ones. In terms of price, unfinished (bare-shell) houses are the most expensive, followed by finely renovated ones.

3.6 Elevator analysis

The info() function reports the non-null count and data type of each field, so problem fields can be found quickly.

The output shows a large number of missing values in the Elevator field. The options are as follows:

a. Delete the rows with null values.

b. Fill the missing values, for example with the median, the mean, or Lagrange interpolation.

One piece of common sense should not be ignored here: a building with more than 6 floors must have an elevator, while one with 6 floors or fewer generally does not, so 6 floors can be used as the filter condition. Note, however, that judging by the Floor field can also be problematic, because Floor represents the floor of the unit rather than the height of the whole building, so the rule is only an approximation.
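The 6-floor rule could be applied to the missing values as in the sketch below. The "yes" and "no" labels and the rows are hypothetical, and existing values are left untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; NaN marks a missing Elevator value.
df = pd.DataFrame({
    "Floor": [3, 18, 6, 7, 25],
    "Elevator": [np.nan, np.nan, "no", np.nan, "yes"],
})

# Compute the mask once so both assignments refer to the original gaps.
missing = df["Elevator"].isna()

# Approximate rule from the text: above 6 floors means an elevator.
df.loc[missing & (df["Floor"] > 6), "Elevator"] = "yes"
df.loc[missing & (df["Floor"] <= 6), "Elevator"] = "no"
print(df)
```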

Visualize the Elevator field:

Analysis results:

According to the results, most second-hand houses have elevators. The main reason is that Beijing has many people and little land, so high-rise buildings are common.

3.7 Construction year analysis

Conditioned on renovation status and on whether there is an elevator, use FacetGrid to analyze the construction year. The following results can be obtained:

a. There is no elevator data for second-hand houses built before 1980, which suggests that elevators were not widely installed before then.

b. Overall, second-hand house prices trend upward over time.

c. Houses built after 2000 show an obvious price rise compared with those built before 2000.

3.8 Floor analysis

Use a count plot to analyze the number of listings on each floor:

Analysis results:

The plot shows that listings on the 6th floor are the most numerous, but that alone does not mean the floor has a large effect on the house price. Floor preference is also tied to folk culture: as the saying "seven up, eight down" goes, the 7th floor may be more popular, while the 4th and 18th floors are generally unpopular. In addition, mid- and high-floor units have better views, so their prices are relatively high.

3.9 House price prediction

This example mainly uses linear regression and random forest models for prediction. For reasons of space, feature processing is not demonstrated here.
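The comparison can be sketched with scikit-learn as below. This is not the article's code: the data is synthetic, the features are hypothetical, and ExtraTreesRegressor is assumed to be what the article calls the extreme random forest, so the scores will not match the results reported in this section.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: price depends non-linearly on size and on year.
rng = np.random.default_rng(0)
n = 400
size = rng.uniform(30, 200, n)
year = rng.integers(1980, 2020, n)
price = 5 * size + 0.02 * size**2 + 3 * (year - 1980) + rng.normal(0, 20, n)

X = np.column_stack([size, year])
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.25, random_state=0
)

models = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "extra_trees": ExtraTreesRegressor(n_estimators=50, random_state=0),
}

# Fit each model and record mean squared error and R^2 on held-out data.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = {
        "mse": mean_squared_error(y_test, pred),
        "r2": r2_score(y_test, pred),
    }

for name, result in scores.items():
    print(name, result)
```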

Results:

Linear regression has a mean squared error of 5.87E8 and an R² score of 0.482. The three random forest variants all score above 0.65 on R², and among them the extreme random forest (extremely randomized trees) model has the best prediction ability. Linear regression predicts clearly worse than the random forest models.

04 Summary

This case performs a comprehensive analysis and visualization of second-hand housing data using common data analysis methods, and covers the whole data analysis workflow. Working through it, you can master the basic, classic Python data analysis techniques.

Of course, more analysis dimensions could be added. Now that you have followed along, are you eager to try it yourself?
