Second-hand housing data analysis and prediction system
2022-07-02 19:59:00 【Data analysis cases】
author | leo
As science and technology advance, information has become an important driver of further development. Analyzing massive amounts of data lets us serve future production and daily life better, adjust strategy in time, and prepare for a rainy day.
Today we present a comprehensive, multi-dimensional data analysis scenario: a second-hand housing data analysis and prediction system. The system walks through the whole data analysis process: data acquisition, data preprocessing, data analysis, visualization, and the generation of analysis results.
01 Data acquisition
There are two ways to obtain data: purchasing it through a reliable data channel, or collecting it by technical means such as a Python web crawler.
Dataset link:
https://pan.baidu.com/s/1-rGGM6tuoDbxtaG9gV4B2w (extraction code: ftvk)
Crawler implementation: the relevant fields are extracted mainly with the requests library and XPath parsing.
02 Data loading
Combine the extracted information into a single dataset:
2.1 Import the required Python packages
2.2 Load data
Data preview:
Check the basic information of the data. This is a very important step in data analysis: you need to look at the data types, the missing values, and so on.
Inspecting the data shows that the Elevator field has a large number of missing values, and that the Size field (the floor area of the house) has a maximum of 1019 square meters and a minimum of 2 square meters. Common sense suggests that these are outliers.
The missing elevator values may simply be information that was never collected or uploaded, so we leave them alone for now. The unusually large and small areas are handled in a later analysis step.
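The inspection described above can be sketched roughly as follows. This is a minimal illustration on a hand-made sample, not the article's original code: the `Elevator` and `Size` column names come from the text, while the `Price` column, the `yes`/`no` labels, and all values are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Tiny hand-made sample; the values are illustrative, not taken from the real dataset.
df = pd.DataFrame({
    "Size": [2.0, 89.5, 120.0, 1019.0],
    "Elevator": ["yes", np.nan, np.nan, "no"],   # labels are hypothetical
    "Price": [1200.0, 560.0, 900.0, 1500.0],     # hypothetical column
})

# Count missing values per column, most affected column first.
missing = df.isna().sum().sort_values(ascending=False)
print(missing)

# Range of the Size column: extreme minima and maxima hint at outliers.
size_min, size_max = df["Size"].min(), df["Size"].max()
print(size_min, size_max)
```

On the real dataset, `df.info()` and `df.describe()` give the same kind of overview in one call each.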
03 Data analysis
3.1 Add the average house price field
This field holds the price per square meter of each house, which gives the following analysis steps more to work with.
Looking at the data above, the ID field carries no meaning for the analysis, so we keep only the key fields by re-selecting and reordering the columns. The unit price of a house is simply its total price divided by its area in square meters.
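As a rough sketch of this step, assuming the total price is stored in a column named `Price` (in units of 10,000 yuan) and the new field is called `PerPrice`; both names, like the `Id` and `Region` columns and the sample values, are assumptions made for illustration:

```python
import pandas as pd

# Illustrative sample; column names other than Size are assumed.
df = pd.DataFrame({
    "Id": [101, 102],
    "Region": ["Xicheng", "Haidian"],
    "Size": [50.0, 100.0],       # area in square meters
    "Price": [550.0, 850.0],     # total price, assumed to be in units of 10,000 yuan
})

# Unit price = total price / area.
df["PerPrice"] = df["Price"] / df["Size"]

# Drop Id by selecting and reordering only the columns needed for analysis.
columns = ["Region", "Size", "Price", "PerPrice"]
df = df[columns]
print(df)
```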
3.2 Regional feature analysis
The regional analysis groups the data by region with the pandas groupby method, computes summary statistics of total price and unit price for each region, and then visualizes them with Seaborn bar charts and box plots to reach the final conclusions.
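The grouping part of this step might look like the sketch below. It covers only the aggregation, not the plots, and the column names (`Region`, `Price`, `PerPrice`) and sample values are assumptions rather than the article's actual data.

```python
import pandas as pd

# Illustrative sample; names and values are assumed.
df = pd.DataFrame({
    "Region": ["Xicheng", "Xicheng", "Haidian", "Haidian", "Haidian"],
    "Price": [1100.0, 900.0, 800.0, 900.0, 850.0],
    "PerPrice": [11.0, 10.0, 8.0, 9.0, 8.5],
})

# One row per region: listing count, median total price, mean unit price.
summary = (
    df.groupby("Region")
      .agg(
          listings=("Price", "count"),
          median_price=("Price", "median"),
          mean_per_price=("PerPrice", "mean"),
      )
      .sort_values("mean_per_price", ascending=False)
)
print(summary)
```

Sorting by the mean unit price puts the most expensive regions first, which is a convenient order for a bar chart.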
Visualization code:
Output figure:
Summary of analysis results :
a. Average price of second-hand houses (per square meter): the district with the highest average price is Xicheng, at about 110,000 yuan per square meter. The main reason is that Xicheng is the most prosperous area of Beijing and is also home to many key middle schools, so its top ranking is reasonable. Dongcheng comes second at about 100,000 yuan per square meter, followed by Haidian at about 85,000 yuan per square meter. All remaining districts are below 80,000 yuan per square meter.
b. Number of second-hand houses: Haidian and Chaoyang have the most listings, each close to 3,000, followed by Fengtai.
c. Total price distribution of second-hand houses: the box plot shows that the median total price in every district is below 10 million yuan, while the dispersion is high. The most extreme value in Xicheng reaches 60 million yuan, which indicates that the distribution of total prices is far from ideal.
3.3 Building area Size analysis
A histogram shows the distribution of house area, and a scatter plot describes the relationship between house price and area.
Summary of analysis results:
The plots show that house sizes are concentrated around 100 square meters. The long tail in the plot indicates that some very large units exist, but only in small numbers.
The scatter plot shows that the relationship between house price and area is roughly linear, which matches common sense: the larger the area, the higher the price.
Outlier analysis:
Filtering the data reveals some listings with an area under 10 square meters but a selling price above 10 million yuan.
Compare them with the first rows of the full dataset:
Comparing the two results shows that the fields in the first result set are misaligned. Checking the house type shows that the listings with tiny areas are mostly villas, which are not second-hand commercial housing, so these rows can be deleted.
A further filter shows a small number of very large properties whose unit price is far below the market price.
Further investigation suggests that these listings are most likely office buildings, which are outside the scope of this analysis and need to be removed. Finally, both groups of abnormal rows are filtered out.
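The filtering described above could be sketched like this. The 10 square meter and 10 million yuan (1000 in units of 10,000 yuan) thresholds come from the text; the 1000 square meter upper bound, the `Layout` column, and its `villa`/`office` labels are assumptions for illustration only.

```python
import pandas as pd

# Illustrative sample; Layout labels and values are assumed.
df = pd.DataFrame({
    "Layout": ["2-bedroom", "villa", "3-bedroom", "office"],
    "Size": [89.0, 5.0, 130.0, 1019.0],
    "Price": [560.0, 1200.0, 980.0, 1500.0],   # assumed units of 10,000 yuan
})

# Tiny area but very high total price: likely misaligned fields.
suspicious_small = df[(df["Size"] < 10) & (df["Price"] > 1000)]

# Very large area (assumed threshold): likely not residential.
suspicious_large = df[df["Size"] > 1000]

# Keep only rows that pass both checks.
cleaned = df[(df["Layout"] != "villa") & (df["Size"] < 1000)]
print(cleaned)
```

Inspecting `suspicious_small` and `suspicious_large` before dropping anything makes it easier to confirm that the rows really are out of scope.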
Run the visual analysis again:
As the figure shows, the abnormal data has essentially disappeared.
3.4 Housing pattern analysis
A Seaborn count plot shows how many houses there are of each layout.
Analysis results:
The main layouts are 2 bedrooms with 1 living room, 3 bedrooms with 1 living room, 2 bedrooms with 2 living rooms, and 3 bedrooms with 2 living rooms. The layout names are not standardized, which is a problem for later machine learning, so this field needs feature engineering.
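One possible way to turn the layout text into numeric features is sketched below. It assumes the raw strings look like `2室1厅` (rooms followed by living rooms); that format, the third non-matching sample value, and the output column names `rooms` and `halls` are assumptions, since the article does not show its feature processing.

```python
import pandas as pd

# Assumed raw format; the third value is a made-up example that does not match.
layouts = pd.Series(["2室1厅", "3室2厅", "4房间2卫"], name="Layout")

# Extract room and living-room counts; rows that do not match become NaN.
parts = layouts.str.extract(r"(?P<rooms>\d+)室(?P<halls>\d+)厅")
parts = parts.astype("float")
print(parts)
```

Rows left as NaN are the non-standard names, which can then be dropped or mapped by hand.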
3.5 Analysis of housing renovation status
Use the value_counts() method to count the number of houses in each renovation state:
Use a count plot, a bar chart, and a box plot to visualize the four renovation types above:
Analysis results:
Finely decorated houses are the most numerous, followed by simply decorated ones. In terms of price, unfinished (bare-shell) houses are the most expensive, followed by finely decorated ones.
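The counting and price comparison for this section can be sketched as follows. The `Renovation` and `Price` column names, the English labels, and the numbers are all made up for illustration and do not reproduce the article's results.

```python
import pandas as pd

# Illustrative sample; labels and values are assumed.
df = pd.DataFrame({
    "Renovation": ["fine", "fine", "simple", "rough", "fine", "simple"],
    "Price": [600.0, 700.0, 500.0, 900.0, 650.0, 450.0],
})

# Number of listings per renovation state, most common first.
counts = df["Renovation"].value_counts()

# Mean total price per renovation state, most expensive first.
mean_price = (
    df.groupby("Renovation")["Price"].mean().sort_values(ascending=False)
)
print(counts)
print(mean_price)
```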
3.6 Elevator analysis
The info() method reports the non-null count and data type of every field, so problematic fields can be found quickly.
The output shows a large number of missing values in the elevator field. The options are as follows:
a. Delete the rows with null values.
b. Replace the nulls with a fill value: the median, the mean, Lagrange interpolation, and so on.
One piece of common sense should not be ignored here: a building with more than 6 floors must have an elevator, while one with 6 floors or fewer generally does not, so 6 floors can serve as the screening condition. Note that judging by the Floor field can still be wrong, because Floor is the floor the unit is on, not the height of the whole building, so the result is only a reference.
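The 6-floor rule above can be sketched as a fill for missing values only. The `Floor` and `Elevator` names follow the text, while the `yes`/`no` labels and the sample values are assumptions; as the text notes, the rule is only an approximation.

```python
import numpy as np
import pandas as pd

# Illustrative sample; labels and values are assumed.
df = pd.DataFrame({
    "Floor": [3, 18, 6, 25, 5],
    "Elevator": [np.nan, np.nan, "no", "yes", np.nan],
})

# Only fill rows where the elevator value is missing; known values are kept.
missing = df["Elevator"].isna()
df.loc[missing & (df["Floor"] > 6), "Elevator"] = "yes"
df.loc[missing & (df["Floor"] <= 6), "Elevator"] = "no"
print(df)
```

Computing the `missing` mask once, before any assignment, ensures that the second rule does not touch values filled by the first.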
Visualize the elevator field:
Analysis results:
According to the results, second-hand houses with elevators are more numerous. The main reason is that Beijing has a large population and little land, so high-rise buildings are common.
3.7 Analysis of construction year
Using Seaborn's FacetGrid to analyze the construction year, conditioned on renovation state and on whether there is an elevator, gives the following results:
a. There is no elevator data for second-hand houses built before 1980, which suggests that elevators were not widely installed before then.
b. Overall, second-hand house prices rise with the construction year.
c. Houses built after 2000 show an obvious price increase compared with those built before 2000.
3.8 Floor analysis
Analyze the number of listings on each floor with a count plot:
Analysis results:
The plot shows that 6-floor listings are the most numerous, but that alone does not tell us how strongly the floor affects the price. Floor preference is also tied to folk culture: following the saying "seven up, eight down", the 7th floor may be more popular, while the 4th and 18th floors are generally unpopular. In addition, mid- and high-level floors have a better view, so their prices are relatively high.
3.9 House price prediction
This example mainly uses linear regression and random forest models for prediction. For reasons of space, feature processing is not demonstrated here.
Results:
The mean squared error of the linear regression is 5.87E8 and its R-squared score is 0.482. The three random-forest-family models all score above 0.65 on R-squared, with the extremely randomized forest model predicting best. The predictive ability of linear regression is clearly lower than that of the random forest models.
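To make the two metrics concrete, here is a minimal sketch that fits a one-feature linear regression with NumPy least squares and computes the mean squared error and R-squared by hand. It is not the article's model: the data is synthetic, the fit uses plain NumPy rather than whichever library the author used, and the metrics are computed on the training data for brevity.

```python
import numpy as np

# Synthetic data: price grows linearly with size, plus noise (all values assumed).
rng = np.random.default_rng(0)
size = rng.uniform(40, 200, size=200)
price = 5.0 * size + rng.normal(0, 50, size=200)

# Design matrix with an intercept column, solved by ordinary least squares.
X = np.column_stack([np.ones_like(size), size])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
pred = X @ coef

# Mean squared error: average squared residual.
mse = np.mean((price - pred) ** 2)

# R-squared: 1 - residual sum of squares / total sum of squares.
ss_res = np.sum((price - pred) ** 2)
ss_tot = np.sum((price - price.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(mse, r2)
```

Because the mean squared error is in squared price units, its size depends on the scale of the target, whereas R-squared is scale-free and easier to compare across models.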
04 Summary
This case applies common data analysis methods to analyze and visualize second-hand housing data comprehensively, and it reflects the complete data analysis workflow. Working through it is a good way to master the basic, classic Python data analysis techniques.
Of course, more analysis dimensions can be added. If you are learning, are you eager to give it a try?