Crawler obtains real estate data
2022-07-06 22:19:00 【sunpro518】
This post crawls data released by the National Bureau of Statistics, taking real estate data as an example. There were some problems along the way, but they were eventually solved. The code was run on 2022-02-11.
The basic idea is to fetch the data with Python's requests library. Analyzing the web page [1] shows that it is loaded dynamically. Digging through the network traffic reveals that the content is loaded by a jQuery (XHR) request (I don't know how others spotted it; I just went through the requests one by one and picked the one whose name looked plausible).
Taking real estate investment as an example, the js request found is:
https://data.stats.gov.cn/easyquery.htm?m=QueryData&dbcode=hgyd&rowcode=zb&colcode=sj&wds=%5B%5D&dfwds=%5B%7B%22wdcode%22%3A%22zb%22%2C%22valuecode%22%3A%22A0601%22%7D%5D&k1=1644573450992&h=1
Opening this request in the browser shows that it returns a JSON result containing all the data we want.
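The long query string in the captured URL is easier to read once URL-decoded. A quick sketch with the standard library (using the URL captured above) shows the actual parameters the endpoint receives:

```python
from urllib.parse import urlsplit, parse_qs

url = ('https://data.stats.gov.cn/easyquery.htm?m=QueryData&dbcode=hgyd'
       '&rowcode=zb&colcode=sj&wds=%5B%5D'
       '&dfwds=%5B%7B%22wdcode%22%3A%22zb%22%2C%22valuecode%22%3A%22A0601%22%7D%5D'
       '&k1=1644573450992&h=1')

# parse_qs URL-decodes each value; it returns lists, so take the first element
params = {k: v[0] for k, v in parse_qs(urlsplit(url).query).items()}
print(params['wds'])    # an empty JSON array: []
print(params['dfwds'])  # a JSON array selecting the indicator code A0601
print(params['k1'])     # a millisecond timestamp: 1644573450992
```

This makes it clear that `wds` and `dfwds` are JSON strings and `k1` is just a millisecond timestamp, which is what the crawler code below reproduces.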
With this idea, the crawler code can basically be written [1]!
The reference blog above was written in 2018; testing in 2022 turned up two problems:
- A dynamic code is required on the first visit.
- The site's certificate fails verification, so direct access reports an error 400.
For the first problem, making the requests call with a cookie [2] solves it. The cookie is the Cookie field in the Request Headers. Note that in my tests, copying only the value caused problems, while copying the whole field directly worked; I'm not sure which detail I missed.
For the second problem, disabling SSL verification [3][4] bypasses the check, but requests then emits an InsecureRequestWarning. The fix is to suppress the warning [5].
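Suppressing the warning can be done a little more precisely than in the code below: passing the specific warning class to disable_warnings silences only that warning instead of every urllib3 warning. A minimal sketch:

```python
import warnings
import urllib3

# Register an "ignore" filter for just InsecureRequestWarning;
# calling disable_warnings() with no argument would silence all urllib3 warnings
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Demonstrate that the warning is now swallowed: nothing gets recorded
with warnings.catch_warnings(record=True) as caught:
    warnings.warn("unverified HTTPS request", urllib3.exceptions.InsecureRequestWarning)
print(len(caught))  # 0: the warning was filtered out
```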
Putting the above together, the implementation code is as follows:
# Use the requests library
import requests
import time
import urllib3

# Millisecond timestamp for the k1 parameter
def gettime():
    return int(round(time.time() * 1000))

# Build the query parameter dictionary
url = 'https://data.stats.gov.cn/easyquery.htm'
keyvalue = {}
keyvalue['m'] = 'QueryData'
keyvalue['dbcode'] = 'hgnd'
keyvalue['rowcode'] = 'zb'
keyvalue['colcode'] = 'sj'
keyvalue['wds'] = '[]'
# keyvalue['dfwds'] = '[]'
# The line above is changed to the one below
keyvalue['dfwds'] = '[{"wdcode":"zb","valuecode":"A0301"}]'
keyvalue['k1'] = str(gettime())

# Copy the cookie from the request headers in the browser
cookie = '_trs_uv=kzi8jtu8_6_ct3f; JSESSIONID=I5DoNfCzk0knnmQ8j4w7K498Qt_TmpGxdGafdPQqYsL7FzlhyfPn!1909598655; u=6'

# Suppress the warning caused by skipping certificate verification
urllib3.disable_warnings()

headers = {
    'Cookie': cookie,
    'Host': 'data.stats.gov.cn',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}

# headers carries the cookie, params is the query parameter dictionary,
# and verify=False skips certificate verification
r = requests.get(url, headers=headers, params=keyvalue, verify=False)

# Print the status code
print(r.status_code)
# Print the result
print(r.text)
Some screenshots of the result are shown below (images omitted here).
The result has not been parsed yet, but it is JSON, so it just needs decoding!
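For reference, here is a sketch of the decoding step. The field names below (returndata, datanodes, wds) match what I remember of the easyquery response, but you should verify them against your own output; the sample payload is a hand-made stand-in with a made-up value, not real data.

```python
import json

# Hand-made sample in the shape the easyquery endpoint returned for me;
# a real response contains many more nodes and extra fields
sample = '''{
  "returncode": 200,
  "returndata": {
    "datanodes": [
      {"code": "zb.A030101_sj.2021",
       "data": {"data": 123.4, "hasdata": true},
       "wds": [{"wdcode": "zb", "valuecode": "A030101"},
               {"wdcode": "sj", "valuecode": "2021"}]}
    ]
  }
}'''

result = json.loads(sample)  # with requests, r.json() does this step for you
rows = []
for node in result['returndata']['datanodes']:
    # Each node carries its indicator (zb) and period (sj) codes in wds
    wd = {w['wdcode']: w['valuecode'] for w in node['wds']}
    rows.append((wd['zb'], wd['sj'], node['data']['data']))

print(rows)  # one (indicator, period, value) tuple per data node
```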
References
[1] Python crawls the relevant data of the National Bureau of Statistics (original)
[2] Advanced usage of requests in crawlers (making data requests with cookies)
[3] Solutions for the Python web crawler error "SSL: CERTIFICATE_VERIFY_FAILED"
[4] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1076)
[5] InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised (solution)