3. Hands-on practice: crawling the Baidu results page for a specified search term (a simple page collector)
2022-08-04 23:39:00 【beyond proverb】
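The first post in this series already mentioned the User-Agent header. It identifies the carrier of a request, that is, which browser (or other client) is being used to access the server. This matters a great deal when writing a crawler.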
① UA detection
The portal's server checks the identity of the carrier of each request. If that identity is a browser, the request is treated as normal. If it is not based on any browser, the request is treated as abnormal, that is, as coming from a crawler, and the server is likely to reject it.
② UA disguise
Disguise the crawler's request-carrier identity as that of a browser.
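To make the two ideas concrete, here is a minimal sketch of what a server-side UA check might look like and why a browser-like User-Agent gets past it. The function `looks_like_browser` and its token list are hypothetical and invented for illustration; the checks that real sites such as Baidu run are not public and are probably more elaborate.

```python
# Hypothetical, simplified server-side check. This is NOT Baidu's actual logic.
BROWSER_TOKENS = ("Mozilla/", "Chrome/", "Safari/", "Firefox/")


def looks_like_browser(user_agent: str) -> bool:
    """Return True if the User-Agent string contains a typical browser token."""
    return any(token in user_agent for token in BROWSER_TOKENS)


# The kind of default identifier an HTTP library sends (version number is illustrative).
crawler_ua = "python-requests/2.28.1"
# A browser identifier, as copied from Chrome's developer tools.
browser_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
)

print(looks_like_browser(crawler_ua))  # False: would be flagged as a crawler
print(looks_like_browser(browser_ua))  # True: passes as a normal browser request
```

UA disguise simply means sending the second string instead of the first.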
Project
Project overview: the user enters a keyword, and the results page that the Baidu search engine returns for that keyword is downloaded and saved locally.
Steps:
① Open Baidu, search for any keyword, and look at the address bar.
For example, searching for beyond gives this address:
https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&tn=baidu&wd=beyond&oq=%25E9%25BB%2584%25E5%25AE%25B6%25E9%25A9%25B9&rsv_pq=86cafe360003cde6&rsv_t=6497SlvSbubKeEQiJKGnLL%2BCucYyWr9OJTHOTd0x%2Bbx0%2BViW%2FN75Q0avW1M&rqlang=cn&rsv_enter=1&rsv_dl=tb&rsv_sug3=6&rsv_sug1=4&rsv_sug7=100&rsv_sug2=0&rsv_btype=t&inputT=964&rsv_sug4=965
The only part that actually matters is https://www.baidu.com/s?wd=beyond
Entering that shorter URL on its own still returns the same results page from the server (the same holds for other search engines). Here beyond is a variable parameter, and variable parameters should be placed in a dictionary.
② With the URL sorted out, we need the identifier of a browser carrier. Taking Chrome as an example: open any website (for example https://www.baidu.com/s?wd=beyond), press F12 to open the developer tools, press F5 to resend the request to the server, then click any entry under Name in the Network tab to find the User-Agent. Mine, for example, is:
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36
This string is the identifier that the Chrome browser sends to the server with its requests.
③ In the get method, pass in the User-Agent and the keyword entered by the user (both as dictionaries).
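The three steps can be tried out offline before any request is sent. The sketch below uses the standard library's `urllib` rather than `requests`, purely so that it runs without a network connection or third-party packages. The helper names `minimal_search_url` and `build_search_request` are hypothetical, and the long URL is a shortened version of the address-bar example above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit
from urllib.request import Request

USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
)


def minimal_search_url(address_bar_url: str) -> str:
    """Step 1: keep only the wd parameter of a Baidu search URL."""
    parts = urlsplit(address_bar_url)
    keyword = parse_qs(parts.query)["wd"][0]
    query = urlencode({"wd": keyword})
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


def build_search_request(keyword: str, user_agent: str) -> Request:
    """Steps 2 and 3: put the keyword and the User-Agent into dictionaries."""
    params = {"wd": keyword}
    headers = {"User-Agent": user_agent}
    # urlencode percent-encodes non-ASCII keywords as UTF-8.
    url = "https://www.baidu.com/s?" + urlencode(params)
    # The Request object is only built here; nothing is sent.
    return Request(url, headers=headers)


long_url = (
    "https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&tn=baidu"
    "&wd=beyond&rqlang=cn&rsv_enter=1"
)
print(minimal_search_url(long_url))  # https://www.baidu.com/s?wd=beyond

request = build_search_request("黄家驹", USER_AGENT)
print(request.full_url)  # https://www.baidu.com/s?wd=%E9%BB%84%E5%AE%B6%E9%A9%B9
# urllib stores header names capitalized, hence "User-agent".
print(request.get_header("User-agent"))
```

With `requests`, the `params` and `headers` arguments of `requests.get` take the same two dictionaries and do the encoding for you.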
Complete code
import requests

if __name__ == '__main__':
    # UA disguise: send a browser's User-Agent as the request-carrier identifier
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
    }
    # Specify the URL; the query string is built from the parameters below
    url = 'https://www.baidu.com/s?'
    # Handle the parameters carried by the URL: wrap them in a dictionary
    keyword = input("please input a word:")
    param = {
        'wd': keyword
    }
    # Send a request to the URL; requests encodes the parameters and appends them.
    # If headers (the User-Agent) is not passed in, the server does not return the
    # results page, which indicates that Baidu uses UA detection against crawlers.
    response = requests.get(url=url, params=param, headers=headers)
    # Get the response body as text
    page = response.text
    filename = keyword + ".html"
    # Persistent storage: write the page returned by the server to the given local path
    with open('E:/Jupyter_workspace/study/python/' + filename, 'w', encoding='utf-8') as fp:
        fp.write(page)
    print(filename, "saved successfully")
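After the program runs and a keyword is entered, it prints a confirmation message and the results page is saved as <keyword>.html in the specified directory.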
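One practical caveat with the storage step: the keyword is used directly as the file name, so a keyword containing characters such as `/` or `?` would make `open` fail on Windows. The following is a minimal sketch of one way to guard against that. The helpers `safe_filename` and `save_page` are hypothetical, and a temporary directory stands in for the real output path.

```python
import re
import tempfile
from pathlib import Path


def safe_filename(keyword: str) -> str:
    """Replace characters that Windows forbids in file names, then add .html."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", keyword).strip()
    # Fall back to a fixed name if nothing usable is left.
    return (cleaned or "page") + ".html"


def save_page(page: str, keyword: str, directory: Path) -> Path:
    """Write the page text as UTF-8 into directory and return the file path."""
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / safe_filename(keyword)
    target.write_text(page, encoding="utf-8")
    return target


# Demo with placeholder page text instead of a real server response.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_page("<html>demo</html>", "what is c/c++?", Path(tmp) / "pages")
    print(saved.name)  # what is c_c++_.html
    print(saved.read_text(encoding="utf-8"))  # <html>demo</html>
```

In the full program, `save_page(response.text, keyword, Path('E:/Jupyter_workspace/study/python'))` would take the place of the `with open(...)` block.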