Second-hand housing data analysis and prediction system
2022-07-02 19:59:00 【Data analysis cases】
author | leo
As science and technology advance, information has become a key driver of that progress. Analyzing massive data sets lets us better serve future production and daily life, adjust strategy in time, and prepare for a rainy day.
Today we present a comprehensive, multi-dimensional data analysis scenario: a second-hand housing data analysis and prediction system. It walks through every stage of the data analysis process: data acquisition, data preprocessing, data analysis, visualization, and the generation of analysis results.

01 Data acquisition
There are two ways to obtain data: purchasing it from a reliable data provider, or collecting it with technical means such as a Python web crawler.
Dataset link: https://pan.baidu.com/s/1-rGGM6tuoDbxtaG9gV4B2w (extraction code: ftvk)
Crawler implementation: mainly uses the requests library and XPath parsing to extract the relevant fields.
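The crawler code itself is not shown, so the following is only a minimal sketch of the requests + XPath approach. The HTML structure, class names, field names, and the helper functions fetch_page and parse_listings are all hypothetical; a real listing site will need its own XPath expressions.

```python
import requests
from lxml import etree

# Hypothetical listing markup; a real site uses different tags and class names.
SAMPLE_HTML = """
<ul>
  <li class="house"><span class="region">Haidian</span><span class="size">89.5</span><span class="price">650</span></li>
  <li class="house"><span class="region">Chaoyang</span><span class="size">102.0</span><span class="price">720</span></li>
</ul>
"""


def fetch_page(url, timeout=10):
    """Download one listing page and return its HTML text."""
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()
    return response.text


def parse_listings(html):
    """Extract one dict per listing using XPath expressions."""
    tree = etree.HTML(html)
    rows = []
    for node in tree.xpath('//li[@class="house"]'):
        rows.append({
            "Region": node.xpath('./span[@class="region"]/text()')[0].strip(),
            "Size": float(node.xpath('./span[@class="size"]/text()')[0]),
            "Price": float(node.xpath('./span[@class="price"]/text()')[0]),
        })
    return rows


# Parse the embedded sample instead of hitting the network.
records = parse_listings(SAMPLE_HTML)
print(records)
```

In a real run, parse_listings(fetch_page(url)) would be called once per listing page and the resulting rows written to a CSV file.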

02 Data loading
Consolidate the information extracted above:
2.1 Import the relevant Python packages

2.2 Load data

Data preview:

Check the basic information of the data. This is a very important step in data analysis: you need to review the data types, missing values, and so on.
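A minimal sketch of loading and inspecting the data with pandas. The inline CSV stands in for the real file, and the column names other than Elevator, Size, and Floor (as well as the "yes"/"no" values) are assumptions made for illustration.

```python
import io

import pandas as pd

# Tiny stand-in for the scraped CSV; a real run would pass a file path instead.
CSV_TEXT = """Id,Region,Size,Price,Floor,Elevator
1,Haidian,89.5,650,6,
2,Chaoyang,102.0,720,18,yes
3,Xicheng,2.0,1200,3,
4,Fengtai,1019.0,900,5,no
"""

df = pd.read_csv(io.StringIO(CSV_TEXT))

print(df.head())                      # first rows of the table
df.info()                             # column dtypes and non-null counts
missing = df.isnull().sum()           # missing values per column
size_stats = df["Size"].describe()    # min / max expose implausible areas
print(missing)
print(size_stats)
```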


Inspecting the data shows that the Elevator field has a large number of missing values, and that the maximum and minimum of the Size field (the floor area) are 1019 square meters and 2 square meters. Common sense suggests that these are outliers.
The missing Elevator values may simply be information that was never collected or uploaded, so we leave them alone for now. The extreme areas are handled in a later analysis step.
03 Data analysis
3.1 Add the average house price field
This field gives the average price per square meter, which provides more evidence for the analysis that follows.


The data above shows that the ID field is meaningless for analysis. The key fields are extracted by reordering the column names, and the unit price of a house is computed as the total price divided by the floor area in square meters.
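The calculation just described can be sketched as follows. The column names Id, Region, Price, and PerPrice are assumptions; only Size is named in the text.

```python
import pandas as pd

df = pd.DataFrame({
    "Id": [1, 2, 3],
    "Region": ["Xicheng", "Haidian", "Chaoyang"],
    "Size": [50.0, 80.0, 100.0],       # floor area in square meters
    "Price": [550.0, 680.0, 700.0],    # total price, in units of 10,000 yuan
})

# Unit price = total price / floor area.
df["PerPrice"] = df["Price"] / df["Size"]

# Keep only the fields needed for analysis; Id is dropped by omission.
columns = ["Region", "Size", "Price", "PerPrice"]
df = df[columns]
print(df)
```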
3.2 Regional feature analysis
Regional feature analysis uses the groupby method in pandas to group listings by district, computes summary statistics of total price and unit price for each district, and then visualizes them with Seaborn bar charts and box plots to draw the final conclusions.
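A minimal sketch of the grouping step on made-up data. The column names Region, Price, and PerPrice are assumptions, and the numbers are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Region": ["Xicheng", "Xicheng", "Haidian", "Haidian", "Chaoyang"],
    "Price": [1100.0, 900.0, 800.0, 900.0, 600.0],
    "PerPrice": [12.0, 10.0, 8.0, 9.0, 6.0],
})

# One row per district: listing count, mean total price, mean unit price.
summary = (
    df.groupby("Region")
      .agg(
          Count=("Price", "size"),
          MeanPrice=("Price", "mean"),
          MeanPerPrice=("PerPrice", "mean"),
      )
      .sort_values("MeanPerPrice", ascending=False)
      .reset_index()
)
print(summary)
```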


Visualization code:
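The original plotting code is not available, so this is a hedged sketch of what such a figure could look like with Seaborn: a bar chart of mean unit price, a count plot of listings, and a box plot of total price per district. The data and column names are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "Region": ["Xicheng", "Xicheng", "Haidian", "Haidian", "Chaoyang", "Chaoyang"],
    "Price": [1100.0, 900.0, 800.0, 900.0, 600.0, 650.0],
    "PerPrice": [12.0, 10.0, 8.0, 9.0, 6.0, 6.5],
})

# Mean unit price per district, highest first.
mean_unit = (
    df.groupby("Region", as_index=False)["PerPrice"]
      .mean()
      .sort_values("PerPrice", ascending=False)
)

fig, (ax1, ax2, ax3) = plt.subplots(3, 1, figsize=(10, 12))
sns.barplot(x="Region", y="PerPrice", data=mean_unit, ax=ax1)
ax1.set_title("Average unit price by district")
sns.countplot(x="Region", data=df, ax=ax2)
ax2.set_title("Number of listings by district")
sns.boxplot(x="Region", y="Price", data=df, ax=ax3)
ax3.set_title("Total price distribution by district")
fig.tight_layout()
```

In an interactive session, drop the matplotlib.use("Agg") line and call plt.show(), or save the figure with fig.savefig.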

Final output figures:

Summary of analysis results:
a. Average unit price of second-hand houses (per square meter): the district with the highest average price is Xicheng District, at 110,000 yuan per square meter. The main reason is that Xicheng is the most prosperous area in Beijing and also has a concentration of key middle schools, so it is reasonable that its prices are the highest. Dongcheng District is second at 100,000 yuan per square meter, followed by Haidian District at 85,000 yuan per square meter. All remaining districts are below 80,000 yuan per square meter.
b. Number of second-hand houses: Haidian District and Chaoyang District have the most listings, each close to 3,000, followed by Fengtai District.
c. Total price distribution of second-hand houses: the box plot shows that the median total price in every district is below 10 million yuan. The dispersion is high, and the highest outlier in Xicheng District reaches 60 million yuan, which shows that the distribution of total prices is not ideal.
3.3 Floor area (Size) analysis
A histogram shows the distribution of floor area, and a scatter plot describes the correlation between house price and area.


Summary of analysis results:
The visualizations above show that house sizes cluster around 100 square meters. The long tail of the distribution shows that there are some very large units, but only a small number.
The scatter plot shows that the relationship between price and area is roughly linear, which matches common sense: the larger the area, the higher the price.
Outlier analysis:

Filtering with the expression above reveals listings whose area is under 10 square meters but whose asking price exceeds 10 million yuan.
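The filter expression is not shown in the text. A minimal sketch of such a filter, assuming columns named Layout, Size, and Price (with Price in units of 10,000 yuan) and made-up rows, could be:

```python
import pandas as pd

df = pd.DataFrame({
    "Layout": ["2 room 1 hall", "villa", "3 room 2 hall", "villa"],
    "Size": [89.0, 5.0, 120.0, 8.0],            # square meters
    "Price": [650.0, 1500.0, 980.0, 2100.0],    # units of 10,000 yuan
})

# Listings under 10 square meters that still cost more than 10 million yuan.
tiny_but_expensive = df.loc[(df["Size"] < 10) & (df["Price"] > 1000)]
print(tiny_but_expensive)

# The regular rows, for comparison against the suspicious ones.
print(df.head())
```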

Compare with the first few rows of the data:

Comparing the two results shows that the fields in the first result set are misaligned. Checking the house type also shows that these very small listings are mostly villas, which do not belong in an analysis of second-hand commercial housing, so such rows can be deleted.
The following expression reveals a small number of very large properties whose unit price is far below the market price.


Further investigation suggests that these listings are most likely office buildings, which are outside the scope of this analysis and need to be removed. Finally, the data above is filtered out with the following expression.
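A hedged sketch of this clean-up step. The 1000 square meter threshold, the "villa" and "office" labels, and the column names are assumptions chosen to match the description, not the author's actual expression.

```python
import pandas as pd

df = pd.DataFrame({
    "Layout": ["2 room 1 hall", "villa", "3 room 2 hall", "office"],
    "Size": [89.0, 5.0, 120.0, 1019.0],          # square meters
    "Price": [650.0, 1500.0, 980.0, 1700.0],     # units of 10,000 yuan
})
df["PerPrice"] = df["Price"] / df["Size"]

# Very large properties: here the unit price is far below the other rows.
huge = df.loc[df["Size"] > 1000]
print(huge)

# Drop villas and oversized (likely office) listings in one pass.
cleaned = df[(df["Layout"] != "villa") & (df["Size"] < 1000)].reset_index(drop=True)
print(cleaned)
```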

Run the visual analysis again:

As the figure shows, the abnormal data has essentially disappeared.
3.4 Housing layout analysis
A Seaborn count plot shows how many listings there are for each house layout.


Analysis results:
The main layouts are 2 rooms 1 hall, 3 rooms 1 hall, 2 rooms 2 halls, and 3 rooms 2 halls. The layout names are not standardized, which makes them unsuitable for later machine learning, so this field needs feature engineering.
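One possible way to turn layout labels into numeric features is to extract the room and hall counts with a regular expression. This is only a sketch: the label format "N room M hall" mirrors the wording in this article, and the raw data may well use a different format that needs a different pattern.

```python
import pandas as pd

layouts = pd.Series(
    ["2 room 1 hall", "3 room 1 hall", "2 room 2 hall", "3 room 2 hall", "villa"],
    name="Layout",
)

# Named groups become column names; labels that do not match yield NaN.
parts = layouts.str.extract(r"^(?P<Rooms>\d+)\s*room\s*(?P<Halls>\d+)\s*hall$")

# Convert the extracted digit strings to numbers (NaN stays NaN).
features = parts.apply(pd.to_numeric)
print(features)
```

Rows that come out as NaN (such as the non-standard "villa" label here) can then be dropped or handled separately.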
3.5 Analysis of housing renovation status
Use the value_counts() method to count the number of houses in each renovation state:
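A minimal sketch of that count. The column name Renovation and the four category labels are assumptions used for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Renovation": ["fine", "fine", "simple", "rough", "fine", "other", "simple"],
})

# Frequency of each renovation state, most common first.
counts = df["Renovation"].value_counts()
print(counts)
```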

Use a count plot, a bar chart, and a box plot to visualize the four renovation categories above:



Analysis results:
Finely decorated houses are the most numerous among the second-hand listings, followed by simply decorated ones. In terms of price, unfinished (bare-shell) houses are the most expensive, followed by finely decorated ones.
3.6 Elevator analysis
The info() function reports the non-null count and data type of each field, so problematic fields can be spotted quickly.

The output shows a large number of missing values in the Elevator field. The options are:
a. Delete the rows with null values.
b. Fill them with a replacement value: the median, the mean, Lagrange interpolation, and so on.
One piece of common sense should not be ignored here: a building with more than 6 floors must have an elevator, while one with 6 floors or fewer generally does not, so 6 floors can serve as the filling criterion. Note that judging by the Floor field can itself be problematic, because Floor represents the floor the unit is on, not the height of the whole building, so the result is only a reference.
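The 6-floor rule can be sketched as follows. The "yes"/"no" values are assumptions about how the Elevator field is encoded, and only missing entries are filled so that recorded values are left untouched.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Floor": [3, 6, 7, 18, 5, 24],
    "Elevator": [np.nan, "no", np.nan, "yes", np.nan, np.nan],
})

# Remember which rows were missing before any filling happens.
missing = df["Elevator"].isna()

# More than 6 floors: assume an elevator; 6 floors or fewer: assume none.
df.loc[missing & (df["Floor"] > 6), "Elevator"] = "yes"
df.loc[missing & (df["Floor"] <= 6), "Elevator"] = "no"
print(df)
```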

Visualize the Elevator field:


Analysis results:
According to the results, second-hand houses with elevators are the majority, mainly because Beijing has a large population and limited land, so high-rise buildings are common.
3.7 Analysis of construction year


Using FacetGrid to analyze the construction year, conditioned on renovation status and on whether there is an elevator, yields the following results:
a. Second-hand houses built before 1980 have no elevator data, which suggests that elevators were not widely installed before then.
b. Overall, second-hand house prices increase with the construction year.
c. Houses built after 2000 show an obvious price increase compared with those built before 2000.
3.8 Floor analysis
Analyze the distribution of listings across floors with a count plot:


Analysis results:
The plot shows that 6-floor listings are the most numerous, but this does not mean that floor has a strong impact on house price. Floor preference is also tied to folk culture: as the saying goes, "seven up, eight down", so the 7th floor may be more popular, while the 4th and 18th floors are generally unpopular. In addition, middle and upper floors have a relatively good view, so their prices are relatively high.
3.9 House price prediction
This example mainly uses linear regression and random forest models for prediction. For reasons of space, feature processing is not shown here.
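Since neither the features nor the model code are shown, the following is only a sketch of how linear regression can be compared with tree ensembles in scikit-learn. The data is synthetic, the feature set (size, year, floor) and hyperparameters are assumptions, and the scores it prints will not match the figures reported for the real data.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned housing features.
rng = np.random.default_rng(0)
n = 500
size = rng.uniform(30, 200, n)
year = rng.integers(1980, 2018, n)
floor = rng.integers(1, 30, n)
# Price depends on size (partly nonlinearly) and year, plus noise.
price = 5.0 * size + 0.02 * size ** 2 + 3.0 * (year - 1980) + rng.normal(0, 30, n)

X = np.column_stack([size, year, floor])
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=0
)

models = {
    "LinearRegression": LinearRegression(),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=0),
    "ExtraTrees": ExtraTreesRegressor(n_estimators=100, random_state=0),
}

# Fit each model and record its test-set MSE and R^2.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "mse": mean_squared_error(y_test, pred),
        "r2": r2_score(y_test, pred),
    }
    print(f"{name}: MSE={results[name]['mse']:.2f}, R2={results[name]['r2']:.3f}")
```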


Results:
Linear regression yields a mean squared error of 5.87E8 and an R² score of 0.482. The three random forest variants all score above 0.65 in R², and among them the extremely randomized forest model predicts best. The predictive ability of linear regression is clearly lower than that of the random forest models.
04 Summary
This case applies common data analysis methods to carry out a comprehensive analysis and visualization of second-hand housing data. It reflects the complete data analysis workflow and is a way to master basic, classic Python data analysis techniques.
There are of course more analysis dimensions that could be added. Now that you have followed along, are you eager to try?