
How to quickly obtain and analyze housing prices in your city?

2022-06-23 17:42:00  The Way of Data People

On December 20, the central bank authorized the National Interbank Funding Center to publish the latest Loan Prime Rate (LPR): the 1-year LPR is 3.8%, 5 basis points (BP) lower than the previous period, and the 5-year-plus LPR is 4.65%, unchanged from the previous period.

After 19 consecutive months of standing pat, the 1-year LPR quote has been lowered for the first time this year! But the 5-year LPR quote remains unchanged, so for mortgage borrowers who chose floating rates, nothing has substantially changed.

Although the 5-year LPR quote remains unchanged because of the current "houses are for living in, not for speculation" keynote of real estate regulation, a house you cannot afford is still a house you cannot afford: housing prices in the core areas of first-tier cities are still strong. So how can you get the current housing prices in your own city? Python can help!

Take Guangzhou, where the author currently lives, as an example. Because residential land supply in first-tier cities is tight and little new stock is released each year, second-hand housing prices reflect local prices more accurately and truthfully. We can therefore use Python to crawl the second-hand listing information for Guangzhou from the web and analyze it. Here we crawl data from Lianjia.

1. Environment preparation

1.1 Tools

Because we need to run code and view the results for analysis at the same time, we choose Anaconda + Jupyter Notebook to write and run the code.

Anaconda is an open-source Python distribution that bundles a wealth of scientific packages and their dependencies. It is the first choice for data analysis, sparing you from manually installing the various dependency packages with pip.

Jupyter Notebook is a web application that lets users combine descriptive text, mathematical equations, code, and visualizations into a single, easily shared document. It has become an essential tool for data analysis and machine learning, because it lets the analyst concentrate on explaining the whole analysis process to the reader instead of shuffling between documents.

The easiest way to install Jupyter Notebook is through Anaconda: Anaconda ships with Jupyter Notebook, available in the default environment.

1.2 Modules

The main modules (libraries) used for crawling and analyzing the listing data include:

  • requests: crawl the web page information
  • time: set a rest interval after each crawl
  • BeautifulSoup: parse the crawled pages and extract data from them
  • re: perform regular expression operations
  • pandas, numpy: process the data
  • matplotlib: visualize the analyzed data
  • sklearn: cluster the data

Import these modules (libraries):

import requests
import time
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

Note that in Jupyter Notebook this module-import cell must be run first; otherwise the code blocks that follow will fail with missing-dependency errors.

2. Build the crawler and grab the data

2.1 Analyze the web page

Before starting to crawl, first examine the URL structure and the structure of the target data on the page.

2.1.1. URL structure analysis

The URL of Lianjia's second-hand housing list page is structured as follows:

http://gz.lianjia.com/ershoufang/pg1/

where gz represents the city, ershoufang is the channel name, and pg1 is the page number. We only crawl the second-hand housing channel in Guangzhou, so the front part never changes; only the trailing page number varies, increasing monotonically from 1 to 100 in steps of 1.

Therefore, we split the URL into two parts:

  • Fixed part: http://gz.lianjia.com/ershoufang/pg
  • Variable part: 1-100

For the variable part, we can use a for loop to generate the numbers 1-100, concatenate each with the fixed part in front to form a URL to crawl, then fetch each URL in turn and accumulate and save the page content.
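For example, the full list of page URLs could be generated in one line (a quick sketch):

#  Sketch: generate the 100 list-page URLs
urls = ['http://gz.lianjia.com/ershoufang/pg' + str(i) + '/' for i in range(1, 101)]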

2.1.2. Page structure analysis

The data we need to crawl includes: the location of the listing, total price, unit price, layout, area, orientation, decoration, floor, building age, building type, follower count, and so on. With the element inspector in the browser's developer tools, we can see that this data lives in the info div on the page:

Breaking down the DOM of the div that holds the required data:

The listing's location is in the positionInfo tag; its layout, area, orientation, floor, building age, building type, and other attributes are in the houseInfo tag; its price information is in the priceInfo tag; and its follower information is in the followInfo tag.

2.2 Build the crawler

To make our requests look like normal browser traffic, we need to set header information in the HTTP request; otherwise the crawler is easily blocked. Plenty of ready-made header sets are available online, or you can capture your own with a tool such as HttpWatch.

#  Set the request header information; a browser User-Agent is essential,
#  as requests otherwise identifies itself as python-requests
headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
    'Referer': 'http://www.baidu.com/link?url=_andhfsjjjKRgEWkj7i9cFmYYGsisrnm2A-TN3XZDQXxvGsM9k9ZZSnikW2Yds4s&wd=&eqid=c3435a7d00006bd600000003582bfd1f'
}

Use a for loop to generate the numbers 1-100, convert each to a string, and concatenate it with the fixed URL part above to form the URL to crawl; then fetch each page and accumulate the crawled content, pausing 0.5 seconds between every two page requests.

#  Fixed part of the second-hand housing list page URL
url = 'http://gz.lianjia.com/ershoufang/pg'

#  Loop over the list pages and accumulate the crawled page content
html = b''
for i in range(1, 101):
    entireURL = url + str(i) + '/'
    res = requests.get(url=entireURL, headers=headers)
    html += res.content
    #  Pause between page requests
    time.sleep(0.5)

Because the saved page data is too large for Jupyter Notebook to display in full, you can temporarily reduce the number of pages, for example to pages 1-2, and run the cell to verify that the crawl succeeds:

The page information has been crawled successfully.

2.3 Extract information

The crawled pages cannot be read and have data extracted from them directly; they must be parsed first. We use the BeautifulSoup module to parse the pages into the structure we would see when viewing the page source in a browser.

#  Parse the crawled page content
htmlResolve = BeautifulSoup(html, 'html.parser')

After parsing, extract the needed data according to the page-structure analysis above. We extract five kinds of information in turn: the total price (priceInfo), unit price (unitPrice), location (positionInfo), attributes (houseInfo), and follower information (followInfo) of each listing.

Extract the parts of the page inside div elements whose class attribute is priceInfo, and use a for loop to store each listing's total price in the array tp.

#  Extract the total price information of the listings
price = htmlResolve.find_all("div", attrs={"class": "priceInfo"})
tp = []
for p in price:
    totalPrice = p.span.string
    tp.append(totalPrice)

The unit price, location, attribute, and follower information are extracted in the same way as the total price.
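As a minimal sketch of those four extractions (the class names follow the page-structure analysis above; using get_text() to grab each div's full text is an assumption about Lianjia's markup at the time, and the array names upi, pi, hi, fi match the data table construction in section 3.1):

#  Sketch: extract the remaining four fields into arrays
unitprice = htmlResolve.find_all("div", attrs={"class": "unitPrice"})
upi = [u.get_text() for u in unitprice]    #  unit price strings

position = htmlResolve.find_all("div", attrs={"class": "positionInfo"})
pi = [p.get_text() for p in position]      #  location strings

houseinfo = htmlResolve.find_all("div", attrs={"class": "houseInfo"})
hi = [h.get_text() for h in houseinfo]     #  attribute strings, separated by vertical bars

followinfo = htmlResolve.find_all("div", attrs={"class": "followInfo"})
fi = [f.get_text() for f in followinfo]    #  follower strings, separated by slashes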

Pick a few of these arrays and check the extracted data:
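For example, to check the first 20 entries of the total price array:

#  View the first 20 entries of the total price array
tp[0:20]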

Looking at the first 20 entries of the total price array, the results look normal; the extraction succeeded. The other arrays are checked in the same way.

3. Process the data and construct features

3.1 Create data table

Use the pandas module to combine the extracted total price, unit price, location, attributes, and follower information of the listings into a DataFrame for the analysis that follows.

#  Create data table 
house = pd.DataFrame({"totalprice": tp, "unitprice": upi, "positioninfo": pi, "houseinfo": hi, "followinfo": fi})

After creating it, view the contents and check the structure of the data table:
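For example, a quick preview:

#  Preview the first few rows of the data table
house.head()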

The data table has been built successfully.

3.2 Construct features

Although we have turned the extracted information into a structured DataFrame, the table is still rough: attributes such as layout, area, and orientation are all lumped into the listing attribute field (houseinfo) and cannot be used directly. We therefore need to process the table further.

Based on the data we need, we construct new features from the fields in the table to make the next stage of analysis easier.

From the houseinfo field, construct new features: layout, area, orientation, decoration, floor, building age, and building type.

From the followinfo field, construct new features: follower count and publish date.

The feature-construction method here is column splitting: each large original field is cut into new fields at the finest useful granularity.

Because each listing's attribute values are separated by vertical bars, we simply split the attribute information on the vertical bar.

#  Construct features from the listing attribute information
houseinfo_split = pd.DataFrame((x.split('|') for x in house.houseinfo), index=house.index, columns=["huxing", "mianji", "chaoxiang", "zhuangxiu", "louceng", "louling", "louxing", "spec"])

Construct features from the listings' follower information the same way; note that the separator here is a slash rather than a vertical bar.

#  Construct features from the listing follower information
followinfo_split = pd.DataFrame((y.split('/') for y in house.followinfo), index=house.index, columns=["guanzhu", "fabu"])

After construction, check the resulting feature tables:
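For example (run each line in its own cell so the output displays):

#  Preview the constructed feature tables
houseinfo_split.head()
followinfo_split.head()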

Join the constructed feature fields onto the original data table, so that they can be used together with the rest of the data during analysis.

#  Join the constructed feature fields onto the original data table
house = pd.merge(house, houseinfo_split, right_index=True, left_index=True)
house = pd.merge(house, followinfo_split, right_index=True, left_index=True)

Check the data table after the join:

The joined table now contains both the original information and the newly constructed feature values.

3.3 Data processing and cleaning

In the table produced by the feature construction and join, some fields mix numbers with Chinese text, and the data is stored as text, so it cannot be used directly.

3.3.1. Data processing

The data processing work here is to extract the numbers from those strings. There are two approaches: one, like the column splitting above, uses the text that follows the number as a separator to split it out; the other extracts the numbers with regular expressions.

With the first method, extract the number from the listing unit price field by splitting on the character '元' (yuan) that follows it.

#  Construct a numeric feature from the listing unit price information
unitprice_num_split = pd.DataFrame((z.split('元') for z in house.unitprice), index=house.index, columns=["danjia_num", "danjia_danwei"])

#  Join the constructed feature fields onto the original data table
house = pd.merge(house, unitprice_num_split, right_index=True, left_index=True)

With the second method, extract the numbers from the area and follower-count fields.

#  Define a function that extracts the first number from a string
def get_num(string):
    return re.findall(r"\d+\.?\d*", string)[0]

#  Extract the number from the listing area information
house["mianji_num"] = house["mianji"].apply(get_num)

#  Extract the number from the listing follower information
house["guanzhu_num"] = house["guanzhu"].apply(get_num)

3.3.2. Data cleaning

The extracted data needs cleaning before use; the common operations here are stripping whitespace and converting formats.

Strip the whitespace from both ends of the extracted unit price, area, and follower-count values.

#  Remove the spaces at both ends of the extracted numeric field 
house["danjia_num"] = house["danjia_num"].map(str.strip)
house["mianji_num"] = house["mianji_num"].map(str.strip)
house["guanzhu_num"] = house["guanzhu_num"].map(str.strip)

Convert all numeric fields (including the total price) to float, to make the subsequent analysis easier.

#  Convert the numeric fields to float
house["danjia_num"] = house["danjia_num"].str.replace(',', '').astype(float)
house["totalprice"] = house["totalprice"].astype(float)
house["mianji_num"] = house["mianji_num"].astype(float)
house["guanzhu_num"] = house["guanzhu_num"].astype(float)

Check the data table after processing and cleaning:
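For example, a quick check that the conversions took effect (the dtypes should all be float64):

#  Confirm the types of the numeric fields after conversion
house[['totalprice', 'danjia_num', 'mianji_num', 'guanzhu_num']].dtypes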

The numbers in the unit price, area, and follower information have been extracted, and the total price, unit price, area, and follower values have all been correctly converted to float.

4. Analyze the data and visualize the output

With the crawled data extracted, cleaned, and processed, the final dataset is ready for analysis.

Here we focus on exploring how housing prices relate to area and follower counts, and use the Matplotlib module to draw 2D charts as visual output.

4.1 Distribution of listing areas

4.1.1. View the range

View the range of areas in the crawled data on second-hand homes for sale in Guangzhou.

#  View the range of the listing area data
house["mianji_num"].min(), house["mianji_num"].max()

The resulting area range is (18.3, 535.23) square meters.

4.1.2. Group the data

Based on the area range, group the area data. Here we use an interval of 50, dividing the areas into 11 groups, and count the number of listings in each group.

#  Group the listing area data
bins = [0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500, 550]
group_mianji = ['Under 50', '50-100', '100-150', '150-200', '200-250', '250-300', '300-350', '350-400', '400-450', '450-500', '500-550']
house['group_mianji'] = pd.cut(house['mianji_num'], bins, labels=group_mianji)

#  Tally the number of listings in each area group
group_mianji = house.groupby('group_mianji')['group_mianji'].agg(len)

The grouped count has the same effect as the following SQL statement:

select group_mianji, count(*) from house group by group_mianji;
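In pandas itself, the same tally can also be obtained with value_counts (a sketch; sort=False keeps the category order):

#  Equivalent tally with value_counts
house['group_mianji'].value_counts(sort=False)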

As you can see, most listings are concentrated between 50 and 150 square meters, which matches common sense.

4.1.3. Plot the distribution

Use the Matplotlib module to draw a distribution chart of the listing counts grouped by area; the numpy module is used to build the y-axis group positions.

#  Draw the distribution chart of listing areas
plt.rc('font', family='STXihei', size=15)
ygroup = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
plt.barh(ygroup, group_mianji, color='#205bc3', alpha=0.8, align='center', edgecolor='white')
plt.ylabel('Area group (unit: square meters)')
plt.xlabel('Count')
plt.title('Distribution of listing areas')
plt.legend(['Count'], loc='upper right')
plt.grid(color='#92a1a2', linestyle='--', linewidth=1, axis='y', alpha=0.4)
plt.yticks(ygroup, ('Under 50', '50-100', '100-150', '150-200', '200-250', '250-300', '300-350', '350-400', '400-450', '450-500', '500-550'))

#  View the plotted distribution map 
plt.show()

View the resulting chart of the listing area distribution:

Once the processed data is output visually, the analysis gets twice the result with half the effort.

4.1.4. Data analysis

Among the crawled data on second-hand homes for sale in Guangzhou, listings of 50-100 square meters are the most numerous, followed by 100-150 square meters. The count falls as the area grows, and there is also a fair number of small homes under 50 square meters.

4.2 Distribution of total listing prices

4.2.1. View the range

View the range of total prices in the crawled data on second-hand homes for sale in Guangzhou.

#  View the range of the listing total price data
house["totalprice"].min(), house["totalprice"].max()

The resulting total price range is (29.0, 3500.0), in units of 10,000 yuan.

4.2.2. Group the data

Based on the total price range, group the total price data. Here we use an interval of 500, dividing the total prices into 7 groups, and count the number of listings in each group.

#  Group the listing total price data
bins = [0, 500, 1000, 1500, 2000, 2500, 3000, 3500]
group_totalprice = ['Under 500', '500-1000', '1000-1500', '1500-2000', '2000-2500', '2500-3000', '3000-3500']
house['group_totalprice'] = pd.cut(house['totalprice'], bins, labels=group_totalprice)

#  Tally the number of listings in each total price group
group_totalprice = house.groupby('group_totalprice')['group_totalprice'].agg(len)

As you can see, most listings have a total price under 5 million yuan.

4.2.3. Plot the distribution

Use the Matplotlib module to draw a distribution chart of the listing counts grouped by total price.

#  Draw the distribution chart of total listing prices
plt.rc('font', family='STXihei', size=15)
ygroup = np.array([1, 2, 3, 4, 5, 6, 7])
plt.barh(ygroup, group_totalprice, color='#205bc3', alpha=0.8, align='center', edgecolor='white')
plt.ylabel('Total price group (unit: 10,000 yuan)')
plt.xlabel('Count')
plt.title('Distribution of total listing prices')
plt.legend(['Count'], loc='upper right')
plt.grid(color='#92a1a2', linestyle='--', linewidth=1, axis='y', alpha=0.4)
plt.yticks(ygroup, ('Under 500', '500-1000', '1000-1500', '1500-2000', '2000-2500', '2500-3000', '3000-3500'))

#  View the plotted distribution map 
plt.show()

View the resulting chart of the total price distribution:

4.2.4. Data analysis

Among the crawled data on second-hand homes for sale in Guangzhou, listings priced under 5 million yuan are by far the most numerous, far outnumbering those above 5 million. Guangzhou seems to be the friendliest first-tier city: at least buying a home here is easier than in the other three first-tier cities.

The highest price seen here is 35 million yuan, which of course is not the real ceiling of Guangzhou housing prices. Lianjia only allows crawling 100 pages of data, and we cannot obtain the records that are not displayed, so not all second-hand listings have been crawled; besides, top-end luxury homes are not listed on public platforms at all.

4.3 Distribution of listing follower counts

The analysis of the follower-count distribution proceeds like the area and total price analyses above, so the steps are not expanded here; a minimal sketch of the grouping step follows, and then we look directly at the resulting distribution chart:
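The bin edges below are purely illustrative assumptions based on the observed range:

#  Sketch: group the follower-count data (bin edges are assumptions)
bins = [0, 200, 400, 600, 800, 1000, 5000]
group_guanzhu = ['Under 200', '200-400', '400-600', '600-800', '800-1000', '1000-5000']
house['group_guanzhu'] = pd.cut(house['guanzhu_num'], bins, labels=group_guanzhu)
group_guanzhu = house.groupby('group_guanzhu')['group_guanzhu'].agg(len)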

Because four listings have more than 1,000 followers, one of them even reaching 5,000, the distribution is very unbalanced; most listings have fewer than 1,000 followers.

Note that the follower counts cannot accurately represent a listing's popularity. In real business, a hot listing is highly sought after and may be sold right after going online, gathering few followers precisely because it sells so fast. We ignore such complications for now, so the follower data is for reference only.

4.4 Cluster analysis of the listings

Finally, we use the machine learning library sklearn to run a cluster analysis on the crawled data of second-hand homes for sale in Guangzhou, over total price, area, and follower count. The listings are divided into categories according to their similarity along these three dimensions.

#  Cluster on the listings' total price, area, and follower count
house_type = np.array(house[['totalprice', 'mianji_num', 'guanzhu_num']])

#  Set the number of clusters (centroids) to 3
cls_house = KMeans(n_clusters=3)

#  Fit the model to compute the clustering
cls_house = cls_house.fit(house_type)

View the coordinates of each cluster center, and label each listing in the original data table with its category.

#  View the cluster center coordinates
cls_house.cluster_centers_

#  Label each listing in the original data table with its cluster
house['label'] = cls_house.labels_

View the coordinates of the cluster centers:

Based on the three clusters' center coordinates in total price, area, and follower count, the second-hand homes for sale fall into three categories:
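A quick way to characterize the three categories from the labeled table is to average each cluster's features (a sketch):

#  Sketch: summarize each cluster by the mean of its three features
house.groupby('label')[['totalprice', 'mianji_num', 'guanzhu_num']].mean()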

This classification runs counter to everyday experience. Looking at the center coordinates, the third cluster deviates wildly, so we export the data table to Excel to inspect the underlying data.

#  Write the data table to Excel
house.to_excel('ershouHousePrice.xlsx')

When it finishes, you will find an Excel file named ershouHousePrice in the project directory. Open the file and examine the records whose follower count exceeds 1,000.

They are all second-hand listings in Huajing New Town and the Agricultural Institute compound, which looks suspiciously like inflated follower counts. Interested readers can exclude this portion of the data and rerun the analysis to see what changes.
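As a sketch of that exclusion, using the 1,000-follower threshold observed above:

#  Sketch: drop the suspected inflated records before re-analysis
house_clean = house[house['guanzhu_num'] <= 1000]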

Still, from the analysis above, and from the perspective of marketing and regional market characteristics, in Guangzhou's second-hand housing market, homes with a medium total price and medium area mostly beat low-price, small-area homes on lot, location, transportation, medical care, education, and commercial facilities. First-tier cities sit in regions with more abundant resources, and their siphoning power is far stronger than that of non-first-tier cities. In addition, the medium tier of Guangzhou's second-hand market is priced lower than in other first-tier cities. In summary, Guangzhou second-hand homes with a medium total price and medium area attract the most user attention.

5. Conclusion

As mentioned above, because Lianjia only allows crawling 100 pages of data, we cannot crawl the records that are not displayed, so the data here does not cover all second-hand listings and should not be used for commercial reference.

The purpose of this article is to throw out a brick to attract jade: to give everyone a taste of web crawling and data analysis, and a base from which to explore more deeply and divergently. For example, since we can obtain second-hand listing information, we can obtain new-home information as well; all it takes is observing the URL and page structure of the new-home channel. Or we could analyze the listings by different features and from different angles. And so on.

All the code in this article can be obtained from my GitHub or Gitee:

https://github.com/hugowong88/ershouHouseDA_gz
https://gitee.com/hugowong88/ershou-house-da_gz

Please follow 『The Way of Data People』

Practicing data technology, exploring data thinking, and following cutting-edge news in the data industry: the way of data people.
