Crawler obtains real estate data
2022-07-06 22:19:00 【sunpro518】
This post walks through crawling data published by the National Bureau of Statistics, taking real estate data as the example. A few problems came up along the way, but all were eventually solved. The code was last run on 2022-02-11.
The basic idea is to fetch the data with Python's requests library. Inspecting the page [1] shows that the content is loaded dynamically by JavaScript (an AJAX request). Finding the right request in the network panel took some trial and error: I simply went through the entries one by one and picked the ones whose names looked plausible.
Take real estate investment as the example:

The underlying request turns out to be:
https://data.stats.gov.cn/easyquery.htm?m=QueryData&dbcode=hgyd&rowcode=zb&colcode=sj&wds=%5B%5D&dfwds=%5B%7B%22wdcode%22%3A%22zb%22%2C%22valuecode%22%3A%22A0601%22%7D%5D&k1=1644573450992&h=1
Opening this URL in the browser shows that the request returns a JSON result containing all the data we want.
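To see what the captured request actually asks for, you can URL-decode the wds/dfwds query parameters; they are just percent-encoded JSON strings. A quick sketch using only the standard library (the raw string below is the dfwds value from the URL above):

```python
import json
from urllib.parse import unquote

# The dfwds parameter as it appears in the captured request URL
raw = '%5B%7B%22wdcode%22%3A%22zb%22%2C%22valuecode%22%3A%22A0601%22%7D%5D'

# URL-decode, then parse the JSON payload
dfwds = json.loads(unquote(raw))
print(dfwds)  # [{'wdcode': 'zb', 'valuecode': 'A0601'}]
```

So dfwds selects the indicator dimension (zb) by code, A0601 here being the one in the captured real estate request, and k1 is just a millisecond timestamp.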
Following this idea, the crawler code can basically be written [1]!
The reference blog above was written in 2018; testing in 2022 revealed two problems:
- The first visit requires a dynamically generated code, so a bare request fails.
- The address fails certificate verification, so direct access reports an error (400).
The first problem can be solved by attaching a cookie to the requests request [2].
The cookie is the Cookie field from the Request Headers in the browser's developer tools. Note that in my tests, using "copy value" caused problems, while copying the raw text directly worked fine; I'm not sure which detail makes the difference.
For the second problem, disabling SSL verification [3][4] bypasses the check, but then requests raises an InsecureRequestWarning; that warning can in turn be suppressed [5].
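A slightly narrower variant (my own refinement, not from the original post) suppresses only the InsecureRequestWarning category instead of silencing every urllib3 warning:

```python
import warnings
import urllib3
from urllib3.exceptions import InsecureRequestWarning

# Ignore only the warning triggered by verify=False,
# leaving other urllib3 warnings visible
urllib3.disable_warnings(InsecureRequestWarning)

# Equivalent filter via the standard warnings module
warnings.simplefilter('ignore', InsecureRequestWarning)
```

This keeps unrelated warnings (e.g. about deprecated features) from being hidden along the way.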
Putting the above together, the implementation code is as follows:
import requests
import time

# Millisecond timestamp, used as the k1 request parameter
def gettime():
    return int(round(time.time() * 1000))

# Query parameters passed to the easyquery endpoint
url = 'https://data.stats.gov.cn/easyquery.htm'
keyvalue = {}
keyvalue['m'] = 'QueryData'
keyvalue['dbcode'] = 'hgnd'
keyvalue['rowcode'] = 'zb'
keyvalue['colcode'] = 'sj'
keyvalue['wds'] = '[]'
# keyvalue['dfwds'] = '[]'
# The default above is replaced with a concrete indicator code:
keyvalue['dfwds'] = '[{"wdcode":"zb","valuecode":"A0301"}]'
keyvalue['k1'] = str(gettime())

# Cookie copied from the browser's Request Headers
cookie = '_trs_uv=kzi8jtu8_6_ct3f; JSESSIONID=I5DoNfCzk0knnmQ8j4w7K498Qt_TmpGxdGafdPQqYsL7FzlhyfPn!1909598655; u=6'

# Suppress the warning caused by skipping certificate verification
import urllib3
urllib3.disable_warnings()

headers = {
    'Cookie': cookie,
    'Host': 'data.stats.gov.cn',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}

# headers carries the cookie, params is the query dictionary,
# and verify=False skips certificate verification
r = requests.get(url, headers=headers, params=keyvalue, verify=False)

# Print the status code and the response body
print(r.status_code)
print(r.text)
(Screenshots of the raw output omitted.)
The result is not parsed yet, but it is plain JSON, so decoding it is straightforward!
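As a sketch of that decoding step: the data points appear to sit under returndata.datanodes, each node pairing a value with its dimension codes. The field names below are assumptions read off a captured response, and the sample uses a placeholder value rather than real data, so verify against actual output before relying on it:

```python
import json

# Hand-made sample mimicking the assumed shape of the easyquery
# response; 123.4 is a dummy value, not real statistics data
sample = json.dumps({
    "returncode": 200,
    "returndata": {
        "datanodes": [
            {"code": "zb.A0301_sj.2021",
             "data": {"data": 123.4, "hasdata": True, "strdata": "123.4"},
             "wds": [{"wdcode": "zb", "valuecode": "A0301"},
                     {"wdcode": "sj", "valuecode": "2021"}]}
        ]
    }
})

def parse_easyquery(text):
    """Flatten datanodes into (indicator, period, value) tuples."""
    doc = json.loads(text)
    rows = []
    for node in doc["returndata"]["datanodes"]:
        if not node["data"]["hasdata"]:
            continue  # skip nodes with no reported value
        dims = {w["wdcode"]: w["valuecode"] for w in node["wds"]}
        rows.append((dims.get("zb"), dims.get("sj"), node["data"]["data"]))
    return rows

print(parse_easyquery(sample))  # [('A0301', '2021', 123.4)]
```

The same function can be pointed at r.text from the request above, since r.text is exactly such a JSON document.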
References:
[1] Python crawling data from the National Bureau of Statistics (original post)
[2] Advanced usage of requests in crawlers (making data requests with a cookie)
[3] Solutions to the Python web crawler error "SSL: CERTIFICATE_VERIFY_FAILED"
[4] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1076)
[5] InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. (solution)