Obtaining real estate data with a crawler
2022-07-06 22:19:00 【sunpro518】
This post crawls data published by the National Bureau of Statistics, using real estate data as the example. A few problems came up along the way, but they were all solved in the end. The code was run on 2022-02-11.
The basic idea is to fetch the data with Python's requests library. Inspecting the page [1] shows that the data is loaded dynamically. The browser's Network panel reveals that it is loaded by a JS query request. (I don't know how the experts spot it; I just went through the requests one by one and picked the one whose name looked plausible.)
Taking real estate investment as the example:

The JS request turns out to be:
https://data.stats.gov.cn/easyquery.htm?m=QueryData&dbcode=hgyd&rowcode=zb&colcode=sj&wds=%5B%5D&dfwds=%5B%7B%22wdcode%22%3A%22zb%22%2C%22valuecode%22%3A%22A0601%22%7D%5D&k1=1644573450992&h=1
Opening it in the browser shows that the request returns a JSON result, and all the data we want is in there.
With this in hand, the crawler code is basically straightforward to write [1].
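To see what the request above actually sends, the query string can be unpacked with the standard library. This is a minimal sketch, not part of the original post; the helper name `decode_query` is made up for illustration, and the meaning of each parameter is not documented here, only its decoded value.

```python
import json
from urllib.parse import urlsplit, parse_qs

# The request URL found in the browser (copied from the text above)
url = ("https://data.stats.gov.cn/easyquery.htm?m=QueryData&dbcode=hgyd"
       "&rowcode=zb&colcode=sj&wds=%5B%5D"
       "&dfwds=%5B%7B%22wdcode%22%3A%22zb%22%2C%22valuecode%22%3A%22A0601%22%7D%5D"
       "&k1=1644573450992&h=1")


def decode_query(request_url):
    """Return the query parameters of request_url as a plain dict (hypothetical helper)."""
    query = urlsplit(request_url).query
    # parse_qs percent-decodes the values and returns a list per key;
    # every key occurs once in this URL, so keep only the first value.
    return {key: values[0] for key, values in parse_qs(query).items()}


params = decode_query(url)
for key, value in params.items():
    print(key, "=", value)

# 'wds' and 'dfwds' are JSON strings embedded in the query
dfwds = json.loads(params["dfwds"])
print(dfwds)
```

The decoded values show that `wds` and `dfwds` are small JSON documents, which is why the crawler code passes them as strings in its parameter dictionary.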
The blog post referenced above was written in 2018. When I tested in 2022, two problems came up:
- The first visit requires a dynamic code.
- The address fails certificate verification, so direct access reports error 400.
For the first problem, sending the request with a cookie [2] solves it.
The cookie is the Cookie value under Request Headers. Note that in my tests the browser's "Copy value" option gave a problematic result, whereas selecting and copying the text directly worked. I don't know which detail I missed.
For the second problem, disabling SSL verification [3][4] bypasses the check, but it then reports InsecureRequestWarning; the fix for that is to suppress the warning [5].
Putting the above together, the implementation is as follows:
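One way to check that a copied cookie string came through intact is to split it into name/value pairs and look at each one. The sketch below is an illustration only; `parse_cookie_header` is a hypothetical helper, and the sample string is the one used later in this post. The resulting dict could also be handed to requests through its `cookies=` argument instead of a raw `Cookie` header.

```python
def parse_cookie_header(header_value):
    """Split a raw Cookie header value into a {name: value} dict (hypothetical helper)."""
    cookies = {}
    for pair in header_value.split(';'):
        # Each pair looks like "name=value"; split only on the first '='
        name, sep, value = pair.strip().partition('=')
        if sep and name:
            cookies[name] = value
    return cookies


# Cookie string copied from the browser's Request Headers
cookie = ('_trs_uv=kzi8jtu8_6_ct3f; '
          'JSESSIONID=I5DoNfCzk0knnmQ8j4w7K498Qt_TmpGxdGafdPQqYsL7FzlhyfPn!1909598655; '
          'u=6')

cookie_dict = parse_cookie_header(cookie)
for name, value in cookie_dict.items():
    print(name, '->', value)
```

If a pair is missing or a value looks truncated here, the copy from the browser is the likely culprit.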
# Uses the requests library
import time

import requests
import urllib3


# Return the current time as a millisecond timestamp
def gettime():
    return int(round(time.time() * 1000))


# Build the query as a parameter dictionary
url = 'https://data.stats.gov.cn/easyquery.htm'
keyvalue = {}
keyvalue['m'] = 'QueryData'
keyvalue['dbcode'] = 'hgnd'
keyvalue['rowcode'] = 'zb'
keyvalue['colcode'] = 'sj'
keyvalue['wds'] = '[]'
# keyvalue['dfwds'] = '[]'
# The line above is replaced by the one below
keyvalue['dfwds'] = '[{"wdcode":"zb","valuecode":"A0301"}]'
keyvalue['k1'] = str(gettime())

# Cookie copied from the web page's Request Headers
cookie = '_trs_uv=kzi8jtu8_6_ct3f; JSESSIONID=I5DoNfCzk0knnmQ8j4w7K498Qt_TmpGxdGafdPQqYsL7FzlhyfPn!1909598655; u=6'

# Suppress the warning caused by skipping certificate verification
urllib3.disable_warnings()

headers = {
    'Cookie': cookie,
    'Host': 'data.stats.gov.cn',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}

# headers carries the cookie, params is the request parameter dictionary,
# and verify=False skips certificate verification
r = requests.get(url, headers=headers, params=keyvalue, verify=False)

# Print the status code
print(r.status_code)
# Print the result
print(r.text)
Part of the result is shown in the screenshot below:
The result has not been parsed yet, but decoding it as JSON is all that is needed.
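Decoding the result could look roughly like the sketch below. It is not from the original post: the field names (`returndata`, `datanodes`, `wds`, `data`, `hasdata`) are assumptions about the response layout and should be checked against the real `r.text`, and the sample payload is a hand-written placeholder with made-up values, not real statistics. `extract_rows` is a hypothetical helper.

```python
import json

# Placeholder payload standing in for r.text; structure and values are assumed
sample_text = '''
{"returncode": 200,
 "returndata": {"datanodes": [
   {"code": "zb.A030101_sj.2021",
    "data": {"data": 123.4, "hasdata": true},
    "wds": [{"wdcode": "zb", "valuecode": "A030101"},
            {"wdcode": "sj", "valuecode": "2021"}]},
   {"code": "zb.A030101_sj.2020",
    "data": {"data": 0, "hasdata": false},
    "wds": [{"wdcode": "zb", "valuecode": "A030101"},
            {"wdcode": "sj", "valuecode": "2020"}]}
 ]}}
'''


def extract_rows(text):
    """Return (indicator, period, value) tuples from the assumed response layout."""
    payload = json.loads(text)
    rows = []
    for node in payload["returndata"]["datanodes"]:
        # Map each dimension code ('zb' = indicator, 'sj' = time) to its value
        dims = {wd["wdcode"]: wd["valuecode"] for wd in node["wds"]}
        # Use None where the node is flagged as having no data
        value = node["data"]["data"] if node["data"].get("hasdata") else None
        rows.append((dims.get("zb"), dims.get("sj"), value))
    return rows


rows = extract_rows(sample_text)
for row in rows:
    print(row)
```

With the live response, `extract_rows(r.text)` would be the equivalent call, provided the real layout matches these assumptions.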
[1] Crawling data from the National Bureau of Statistics with Python (original)
[2] Advanced usage of requests in crawlers (making data requests with cookies)
[3] Solutions for the Python web crawler error "SSL: CERTIFICATE_VERIFY_FAILED"
[4] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1076)
[5] Solution for "InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised."