Using a crawler to obtain real estate data
2022-07-06 22:19:00 【sunpro518】
This post crawls data published by the National Bureau of Statistics, taking real estate data as the example. There were some problems along the way, but they were all solved in the end. The code was run on 2022-2-11.
The basic idea is to fetch the data with Python's requests library. Analyzing the web page [1] shows that the content is loaded dynamically. The request that carries the content can be found in the browser's Network panel; it is loaded by a JS request (I don't know how the experts find it; I just went through the requests one by one and picked the one whose name looked reasonably plausible).
Taking real estate investment as an example, the JS request turns out to be:
https://data.stats.gov.cn/easyquery.htm?m=QueryData&dbcode=hgyd&rowcode=zb&colcode=sj&wds=%5B%5D&dfwds=%5B%7B%22wdcode%22%3A%22zb%22%2C%22valuecode%22%3A%22A0601%22%7D%5D&k1=1644573450992&h=1
Opening this URL in the browser shows that the request returns a JSON result. All the data we want is in there.
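To make the request easier to read, the query string of the URL quoted above can be decoded with the standard library. This is only a small sketch: `REQUEST_URL` and `decode_query` are names made up here, and nothing is sent over the network.

```python
import json
from urllib.parse import urlsplit, parse_qs

# The request URL quoted above, split across lines for readability.
REQUEST_URL = (
    "https://data.stats.gov.cn/easyquery.htm?m=QueryData&dbcode=hgyd"
    "&rowcode=zb&colcode=sj&wds=%5B%5D"
    "&dfwds=%5B%7B%22wdcode%22%3A%22zb%22%2C%22valuecode%22%3A%22A0601%22%7D%5D"
    "&k1=1644573450992&h=1"
)


def decode_query(url):
    """Return the query parameters of `url` as a plain dict of strings."""
    query = urlsplit(url).query
    # parse_qs percent-decodes values and returns a list per key; keep the first.
    return {key: values[0] for key, values in parse_qs(query).items()}


params = decode_query(REQUEST_URL)
# The dfwds parameter is itself a JSON string, so it can be decoded once more.
dfwds = json.loads(params["dfwds"])

for name, value in params.items():
    print(f"{name:8} {value}")
```

Decoded this way, `dfwds` is a one-element list holding a `wdcode`/`valuecode` pair, and `k1` is a 13-digit number that looks like a millisecond timestamp.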
Following this idea, you can basically write the crawler code [1].
The reference blog above was written in 2018; when I tested it in 2022, there were two problems:
- A dynamic code is required on the first visit.
- The access address fails certificate verification, and accessing it directly reports error 400.
For the first problem, sending the requests request with a cookie [2] solves it.
The cookie is the Cookie value in the Request Headers. Note that in my tests the browser's "Copy value" option produced a string that did not work, while selecting the text and copying it directly did. I don't know which detail I missed.
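Since a badly copied cookie string is easy to miss, one possible sanity check is to split the copied value into name/value pairs and look at what came through. The sketch below uses a hypothetical helper, `parse_cookie_header`, and placeholder values rather than a real session.

```python
def parse_cookie_header(raw):
    """Split a raw Cookie header value into a {name: value} dict."""
    cookies = {}
    for pair in raw.split(";"):
        pair = pair.strip()
        if not pair or "=" not in pair:
            continue  # skip empty or malformed fragments
        name, value = pair.split("=", 1)  # a value may itself contain '='
        cookies[name.strip()] = value.strip()
    return cookies


# Placeholder values only; paste the string from your own browser session.
raw_cookie = "_trs_uv=PLACEHOLDER_UV; JSESSIONID=PLACEHOLDER_SESSION!12345; u=6"
cookies = parse_cookie_header(raw_cookie)
print(sorted(cookies))
```

If an expected name such as `JSESSIONID` is missing from the result, the copy was probably truncated. The resulting dict can also be handed to requests through its `cookies=` argument instead of a hand-written `Cookie` header.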
For the second problem, disabling SSL verification [3][4] bypasses the check, but requests will then report an InsecureRequestWarning. The solution to that is to suppress the warning [5].
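For comparison, the same trade-off can be expressed with the standard library alone. This is not the approach used in this post (which relies on requests with `verify=False`); it is a sketch of the equivalent idea using `ssl` and `urllib.request`. The helpers `make_unverified_context` and `fetch` are hypothetical names, the UTF-8 decoding is an assumption, and `fetch` is defined but not called here.

```python
import ssl
import urllib.request


def make_unverified_context():
    """Build an SSL context that skips certificate and hostname checks."""
    context = ssl.create_default_context()
    # check_hostname must be switched off before verify_mode can be CERT_NONE.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


def fetch(url, context, headers=None, timeout=10):
    """Fetch `url` with the given SSL context and return the body as text."""
    request = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(request, context=context, timeout=timeout) as resp:
        # Assumption: the body is UTF-8 encoded.
        return resp.read().decode("utf-8")


context = make_unverified_context()
print(context.verify_mode == ssl.CERT_NONE)
```

Skipping verification removes the protection against a tampered connection, so it is best limited to public, read-only data like this.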
Putting the above together, the implementation code is as follows:
# I use the requests library
import time

import requests
import urllib3

# Suppress the warning caused by skipping certificate verification
urllib3.disable_warnings()


# Return the current timestamp in milliseconds
def gettime():
    return int(round(time.time() * 1000))


url = 'https://data.stats.gov.cn/easyquery.htm'

# Put the query parameters into a dictionary
keyvalue = {}
keyvalue['m'] = 'QueryData'
keyvalue['dbcode'] = 'hgnd'
keyvalue['rowcode'] = 'zb'
keyvalue['colcode'] = 'sj'
keyvalue['wds'] = '[]'
# keyvalue['dfwds'] = '[]'
# The line above is changed to the one below
keyvalue['dfwds'] = '[{"wdcode":"zb","valuecode":"A0301"}]'
keyvalue['k1'] = str(gettime())

# The cookie copied from the web page's Request Headers
cookie = '_trs_uv=kzi8jtu8_6_ct3f; JSESSIONID=I5DoNfCzk0knnmQ8j4w7K498Qt_TmpGxdGafdPQqYsL7FzlhyfPn!1909598655; u=6'

headers = {
    'Cookie': cookie,
    'Host': 'data.stats.gov.cn',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}

# headers carries the cookie, params is the request parameter dictionary,
# and verify=False skips certificate verification
r = requests.get(url, headers=headers, params=keyvalue, verify=False)

# Print the status code
print(r.status_code)
# Print the result
print(r.text)
Some screenshots of the results are as follows:
The result has not been parsed yet, but decoding it as JSON is all that is needed!
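As a rough illustration of that last step, the sketch below decodes a JSON string and pulls out code/value pairs. The sample payload is a made-up stand-in for `r.text`: its field names (`returndata`, `datanodes`, `code`, `data`, `strdata`) and values are assumptions for this example, not output copied from the site, so it is worth printing the top-level keys of the real response first and adjusting accordingly. `extract_rows` is a hypothetical helper.

```python
import json

# Illustrative stand-in for r.text. Field names and values are assumptions
# made for this sketch, not real data from the site.
sample_text = """
{
  "returncode": 200,
  "returndata": {
    "datanodes": [
      {"code": "zb.EXAMPLE01_sj.2021", "data": {"strdata": "1.5"}},
      {"code": "zb.EXAMPLE01_sj.2020", "data": {"strdata": "1.2"}}
    ]
  }
}
"""


def extract_rows(text):
    """Decode a JSON response body and return (code, value) tuples."""
    payload = json.loads(text)
    # Use .get with defaults so a differently shaped payload yields no rows
    # instead of raising KeyError.
    nodes = payload.get("returndata", {}).get("datanodes", [])
    rows = []
    for node in nodes:
        rows.append((node.get("code"), node.get("data", {}).get("strdata")))
    return rows


rows = extract_rows(sample_text)
for code, value in rows:
    print(code, value)
```

With the real response, `r.json()` from requests gives the same decoded structure as `json.loads(r.text)`.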
[1] Crawling data from the National Bureau of Statistics with Python (original)
[2] Advanced usage of requests in crawlers (making data requests with cookies)
[3] Solutions for the Python web crawler error "SSL: CERTIFICATE_VERIFY_FAILED"
[4] certificate verify failed: self signed certificate in certificate chain (_ssl.c:1076)
[5] InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. Solution