Second-hand housing data analysis and prediction system
2022-07-02 19:59:00 【Data analysis cases】
author | leo
As science and technology advance, information has become an important driver of further development. Analyzing data at scale helps us plan future production and daily life, adjust strategy in time, and prepare for a rainy day.
Today we present a comprehensive, multi-dimensional data analysis case: a second-hand housing data analysis and prediction system. It walks through the whole analysis workflow: data acquisition, data preprocessing, data analysis, visualization, and reporting of the results.
01 Data acquisition
There are two ways to obtain data: purchasing it from a reliable data provider, or collecting it by technical means such as a Python web crawler.
Dataset link: https://pan.baidu.com/s/1-rGGM6tuoDbxtaG9gV4B2w (extraction code: ftvk)
Crawler implementation: the relevant fields are extracted mainly with the requests library and XPath parsing.
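The crawler itself is not shown in the article. A minimal sketch of the requests + XPath approach might look like the following. The XPath expressions, CSS class names, field names, and the sample HTML are all hypothetical placeholders; the real ones depend on the markup of the site being crawled.

```python
import requests
from lxml import html

# Hypothetical XPath expressions: adapt them to the target site's real markup.
ROW_XPATH = '//li[@class="listing"]'
FIELD_XPATHS = {
    "Title": './/a[@class="title"]/text()',
    "Price": './/span[@class="price"]/text()',
    "Size": './/span[@class="size"]/text()',
}


def parse_listings(page_source):
    """Extract one dict per listing from an HTML string."""
    tree = html.fromstring(page_source)
    rows = []
    for node in tree.xpath(ROW_XPATH):
        row = {}
        for field, expression in FIELD_XPATHS.items():
            values = node.xpath(expression)
            # A missing field becomes None instead of raising an error.
            row[field] = values[0].strip() if values else None
        rows.append(row)
    return rows


def fetch_listings(url):
    """Download one page and parse it (this performs a network request)."""
    response = requests.get(url, timeout=10, headers={"User-Agent": "Mozilla/5.0"})
    response.raise_for_status()
    return parse_listings(response.text)


# Offline demonstration on a made-up HTML fragment.
sample = """
<ul>
  <li class="listing"><a class="title">Unit A</a><span class="price">780</span><span class="size">75</span></li>
  <li class="listing"><a class="title">Unit B</a><span class="price">705</span></li>
</ul>
"""
rows = parse_listings(sample)
print(rows)
```

Keeping parsing separate from downloading makes the XPath logic easy to check on saved pages before running it against a live site.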
02 Data loading
Combine the extracted information:
2.1 Import the required Python packages
2.2 Load data
Data preview:
Check the basic information of the data first. This is an important step in data analysis: you need to look at the data types, missing values, and so on.
Inspecting the data shows that the Elevator field has many missing values, and that the maximum and minimum of the Size field (the floor area of the house) are 1019 square meters and 2 square meters. Common sense says these are outliers.
The missing Elevator values may simply be information that was never collected or uploaded, so we leave them untouched for now. The extreme areas are handled in the analysis step below.
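The inspection described above can be sketched as follows. The rows are made-up toy data standing in for the real file (which would typically be loaded with `pd.read_csv` from a local path); `Elevator`, `Floor`, and `Size` are named in the article, while `Price` is an assumed column name.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; the real data comes from the downloaded dataset.
df = pd.DataFrame({
    "Elevator": ["Yes", np.nan, "No", np.nan, np.nan],
    "Floor": [18, 6, 5, 24, 6],
    "Size": [89.0, 2.0, 1019.0, 120.5, 60.0],
    "Price": [780.0, 1200.0, 900.0, 1500.0, 420.0],
})

df.info()                            # data types and non-null counts per column
missing = df.isnull().sum()          # number of missing values per column
size_stats = df["Size"].describe()   # min and max expose implausible areas

print(missing)
print(size_stats[["min", "max"]])
```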
03 Data analysis
3.1 Add the unit price field
This field holds the price per square meter, which makes it possible to compare average prices across areas in the following analysis.
The data above shows that the ID field carries no meaning for the analysis. The key fields are kept by rearranging the column names, and the unit price of a house is computed as total price / floor area in square meters.
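A minimal sketch of this step, on toy rows. The column names `Id`, `Region`, `Price`, and `PerPrice` are assumptions (the article only names the ID and Size fields), and the total price is assumed to be in units of 10,000 yuan.

```python
import pandas as pd

# Toy rows for illustration only.
df = pd.DataFrame({
    "Id": [101, 102, 103],
    "Region": ["Xicheng", "Haidian", "Chaoyang"],
    "Size": [60.0, 80.0, 100.0],      # floor area in square meters
    "Price": [660.0, 680.0, 700.0],   # total price, assumed unit: 10,000 yuan
})

# Unit price = total price / floor area.
df["PerPrice"] = df["Price"] / df["Size"]

# Drop Id by selecting only the columns of interest, in the desired order.
columns = ["Region", "Size", "Price", "PerPrice"]
df = df[columns]
print(df)
```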
3.2 Regional feature analysis
Regional feature analysis uses the pandas groupby method to group listings by region, computes summary statistics of total price and unit price per region, and visualizes them with Seaborn bar charts and box plots to reach the conclusions below.
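The grouping step might be sketched like this. The data is made up, and the column names `Region`, `Price`, and `PerPrice` are assumptions rather than the article's confirmed schema.

```python
import pandas as pd

# Toy rows for illustration only.
df = pd.DataFrame({
    "Region": ["Xicheng", "Xicheng", "Haidian", "Haidian", "Haidian"],
    "Price": [1100.0, 900.0, 850.0, 800.0, 900.0],
    "PerPrice": [11.0, 10.0, 8.5, 8.0, 9.0],
})

# One row per region: listing count, mean total price, mean unit price.
summary = (
    df.groupby("Region")
      .agg(Count=("Price", "count"),
           MeanPrice=("Price", "mean"),
           MeanPerPrice=("PerPrice", "mean"))
      .sort_values("MeanPerPrice", ascending=False)
      .reset_index()
)
print(summary)
```

A table like `summary` can then be passed to `seaborn.barplot` (for example with `x="Region"` and `y="MeanPerPrice"`), while `seaborn.boxplot` works on the ungrouped rows to show the spread of total price per region.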
Visualization code:
Output figure:
Summary of the analysis results:
a. Average unit price of second-hand houses (per square meter): the district with the highest average price is Xicheng, at about 110,000 yuan per square meter. The main reason is that Xicheng is the most prosperous district in Beijing and has a concentration of key middle schools, so it is reasonable that its prices are the highest. Dongcheng is second at about 100,000 yuan per square meter, followed by Haidian at 85,000; the remaining districts are below 80,000.
b. Number of second-hand houses: Haidian and Chaoyang have the most listings, each close to 3,000, followed by Fengtai.
c. Total price distribution of second-hand houses: the box plot shows that the median total price in every district is below 10 million yuan. The dispersion is high, with the largest outlier in Xicheng reaching 60 million, which shows that the distribution of total price is far from ideal.
3.3 Floor area (Size) analysis
A histogram shows the distribution of floor area, and a scatter plot shows the relationship between house price and area.
Summary of the analysis results:
The plots show that house sizes are concentrated around 100 square meters. The long tail shows that there are some very large units, but only a few.
The scatter plot shows that house price and area are roughly linearly related, which matches common sense: the larger the area, the higher the price.
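The visual impression of a roughly linear relationship can also be checked numerically. The sketch below uses made-up values and the assumed column names `Size` and `Price`; it computes the Pearson correlation and a least-squares line.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only.
df = pd.DataFrame({
    "Size": [50.0, 80.0, 100.0, 150.0, 200.0],
    "Price": [300.0, 520.0, 610.0, 900.0, 1250.0],
})

# Pearson correlation between area and total price (1.0 means perfectly linear).
r = df["Size"].corr(df["Price"])

# Least-squares line: Price is approximated by slope * Size + intercept.
slope, intercept = np.polyfit(df["Size"], df["Price"], 1)

print(f"r = {r:.3f}, slope = {slope:.2f}, intercept = {intercept:.2f}")
```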
Outlier analysis:
Filtering the data reveals some listings with an area under 10 square meters but a price above 10 million yuan.
Compare them with the first rows of the dataset:
Comparing the two results shows that the fields in the first result set are misaligned. Checking the house category shows that these very small listings are mostly villas, which fall outside an analysis of second-hand commercial housing, so these rows can be deleted.
A second filter finds a small number of very large properties whose unit price is far below the market price.
Further investigation suggests that these listings are most likely office buildings. They are also outside the scope of this analysis and need to be removed, so a final filter drops both groups.
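The filter expressions did not survive in the article. A sketch of size-based filtering, assuming the columns `Size` and `Price`, could look like this. The 10 square meter lower bound comes from the text; the 1000 square meter upper bound is an assumed cutoff for illustration.

```python
import pandas as pd

# Toy rows for illustration only.
df = pd.DataFrame({
    "Size": [89.0, 5.0, 120.0, 1019.0],
    "Price": [780.0, 1080.0, 1500.0, 1700.0],   # assumed unit: 10,000 yuan
})

MIN_SIZE = 10     # from the text: areas under 10 square meters are suspicious
MAX_SIZE = 1000   # assumed cutoff for implausibly large areas

# Inspect the suspicious rows before deleting anything.
too_small = df[df["Size"] < MIN_SIZE]
too_large = df[df["Size"] > MAX_SIZE]
print(too_small)
print(too_large)

# Keep only rows inside the plausible range.
cleaned = df[(df["Size"] >= MIN_SIZE) & (df["Size"] <= MAX_SIZE)].reset_index(drop=True)
print(cleaned)
```

Looking at the flagged rows first is what reveals the misaligned fields and out-of-scope property types described above.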
Visualize the data again:
As the figure shows, the abnormal data has essentially disappeared.
3.4 Layout analysis
A Seaborn count plot shows the number of houses of each layout.
Analysis results:
The main layouts are 2 rooms 1 hall, 3 rooms 1 hall, 2 rooms 2 halls, and 3 rooms 2 halls. The layout names are not standardized, which hinders their later use in machine learning, so they need feature engineering.
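One possible way to turn layout names into numeric features is sketched below. It assumes a `Layout` column whose standard values look like "2室1厅" (2 rooms, 1 hall); the non-standard labels in the toy data are hypothetical examples of names that do not follow that pattern.

```python
import pandas as pd

# Toy rows for illustration only; the last two labels are hypothetical non-standard names.
df = pd.DataFrame({"Layout": ["2室1厅", "3室2厅", "4房间2卫", "叠拼别墅"]})

# Split "<rooms>室<halls>厅" into two columns; rows that do not match become NaN.
parts = df["Layout"].str.extract(r"^(?P<Rooms>\d+)室(?P<Halls>\d+)厅$")

# Keep only the rows with a standard layout name and attach the numeric features.
standard = df[parts["Rooms"].notna()].copy()
standard["Rooms"] = parts.loc[standard.index, "Rooms"].astype(int)
standard["Halls"] = parts.loc[standard.index, "Halls"].astype(int)
print(standard)
```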
3.5 Renovation status analysis
Use the value_counts() method to count the houses in each renovation state:
Use a count plot, a bar chart, and a box plot to visualize the four renovation types:
Analysis results:
Finely decorated houses are the most numerous, followed by simply decorated ones. In terms of price, unfinished (rough) houses are the most expensive, followed by finely decorated ones.
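The counting and price comparison can be sketched as follows. The column name `Renovation`, its category labels, and all values are assumptions made for illustration, not the dataset's real contents.

```python
import pandas as pd

# Toy rows for illustration only.
df = pd.DataFrame({
    "Renovation": ["Fine", "Fine", "Fine", "Simple", "Simple", "Rough", "Other"],
    "Price": [800.0, 900.0, 1000.0, 600.0, 700.0, 1200.0, 500.0],
})

# Number of listings per renovation state, most frequent first.
counts = df["Renovation"].value_counts()

# Mean total price per renovation state, most expensive first.
mean_price = df.groupby("Renovation")["Price"].mean().sort_values(ascending=False)

print(counts)
print(mean_price)
```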
3.6 Elevator analysis
The info() function reports the non-null count and data type of each field, so problem fields can be found quickly.
The output shows a large number of missing values in the Elevator field. The options are:
a. Delete the rows with null values.
b. Fill the missing values, for example with the median, the mean, or Lagrange interpolation.
A piece of common sense helps here: a building with more than 6 floors must have an elevator, while one with 6 floors or fewer usually does not, so 6 floors can serve as the cutoff. Note that using the Floor field for this is itself imperfect, because Floor is the floor the unit is on, not the height of the whole building, so it is only a rough guide.
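The floor-based fill rule could be sketched like this. The labels "Yes" and "No" are placeholder category values, and the rule inherits the caveat above: `Floor` is only a rough proxy for building height.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only.
df = pd.DataFrame({
    "Floor": [18, 6, 5, 24, 7],
    "Elevator": ["Yes", np.nan, "No", np.nan, np.nan],
})

# Fill only the missing values; existing labels are left untouched.
missing = df["Elevator"].isna()
df.loc[missing & (df["Floor"] > 6), "Elevator"] = "Yes"
df.loc[missing & (df["Floor"] <= 6), "Elevator"] = "No"
print(df)
```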
Visualize the Elevator field:
Analysis results:
The results show that second-hand houses with elevators are in the majority, mainly because Beijing has a large population and little land, so high-rise buildings are common.
3.7 Construction year analysis
Conditioning on renovation status and on whether there is an elevator, use FacetGrid to analyze the construction year. The results are:
a. Second-hand houses built before 1980 have no elevator data, which suggests that elevators were not widely installed before then.
b. Overall, second-hand house prices rise with construction year.
c. Houses built after 2000 are noticeably more expensive than those built before 2000.
3.8 Floor analysis
A count plot shows how listings are distributed across floors:
Analysis results:
The plot shows that 6th-floor listings are the most numerous, but this alone does not mean that the floor has a large impact on price. Floor preference is also tied to folk culture. As the saying goes, "seven up, eight down", so the 7th floor may be more popular, while the 4th and 18th floors are generally unpopular. In addition, middle and high floors have better views, so their prices are relatively high.
3.9 House price prediction
This example uses linear regression and random forest models for prediction. For reasons of space, feature processing is not shown here.
Results:
Linear regression has a mean squared error of 5.87E8 and an R² score of 0.482. The three random-forest-type models all have R² scores above 0.65, and the extremely randomized trees model predicts best. Linear regression predicts clearly worse than the random forest models.
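The modelling code is not included in the article. The sketch below shows how such a comparison is commonly set up with scikit-learn, on synthetic data generated inside the script. The features, the target formula, and the choice of two tree ensembles are assumptions for illustration, so the scores it prints are unrelated to the figures reported above.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only: price depends nonlinearly on area.
rng = np.random.default_rng(0)
n = 500
size = rng.uniform(30, 200, n)
year = rng.integers(1980, 2018, n)
price = 5 * size + 0.02 * size**2 + 3 * (year - 1980) + rng.normal(0, 20, n)

X = np.column_stack([size, year])
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=0
)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "extra_trees": ExtraTreesRegressor(n_estimators=100, random_state=0),
}

# Fit each model and score it on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = {
        "mse": mean_squared_error(y_test, pred),
        "r2": r2_score(y_test, pred),
    }
    print(name, scores[name])
```

Scoring every model on the same held-out split keeps the mean squared error and R² values directly comparable.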
04 Summary
This case applies common data analysis methods to analyze and visualize second-hand housing data, covering the whole data analysis workflow. Working through it is a way to learn the basic, classic Python data analysis techniques.
Of course, more analysis dimensions could be added. Now that you have read this far, are you eager to try?