3. Hands-on: crawling the Baidu result page for a specified search term (a simple page collector)
2022-08-04 23:39:00 【beyond proverb】
The first post in this series already mentioned User-Agent. It indicates the identity of the request carrier, that is, which browser is being used to access the server. This matters a great deal for crawlers.
① UA detection
The portal server checks the identity of the request carrier. If that identity indicates a browser, the request is treated as a normal request. If the identity is not based on any browser, the request is treated as abnormal, that is, as coming from a crawler, and the server is likely to reject it.
② UA spoofing
Disguise the crawler's request carrier identity as that of a browser.
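The two ideas above can be illustrated with a small sketch. The function looks_like_browser and the marker list are hypothetical and only stand in for whatever check a real server performs; actual detection rules are site-specific and usually more elaborate. The only library fact assumed is that requests identifies itself as python-requests/<version> when no User-Agent header is supplied (the version number below is just an example).

```python
# Hypothetical server-side UA check, for illustration only.
# Real sites apply their own rules; this is not Baidu's actual logic.
BROWSER_MARKERS = ("Mozilla/", "Chrome/", "Safari/", "Firefox/")


def looks_like_browser(user_agent: str) -> bool:
    """Return True if the User-Agent string contains a common browser marker."""
    return any(marker in user_agent for marker in BROWSER_MARKERS)


# Default identity sent by the requests library (example version number).
crawler_ua = "python-requests/2.28.1"

# Identity copied from a real Chrome session, as in this post.
browser_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
)

print(looks_like_browser(crawler_ua))  # False: would be flagged as a crawler
print(looks_like_browser(browser_ua))  # True: passes as a normal browser request
```

UA spoofing simply means sending the second string instead of the first, so that a check of this kind lets the request through.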
Project
Project overview: the user enters a keyword, and the result page that the Baidu search engine returns for that keyword is downloaded and saved locally.
Steps:
① Open Baidu, search for any keyword, and look at the address bar.
For example, when I search for beyond, the address bar shows https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&tn=baidu&wd=beyond&oq=%25E9%25BB%2584%25E5%25AE%25B6%25E9%25A9%25B9&rsv_pq=86cafe360003cde6&rsv_t=6497SlvSbubKeEQiJKGnLL%2BCucYyWr9OJTHOTd0x%2Bbx0%2BViW%2FN75Q0avW1M&rqlang=cn&rsv_enter=1&rsv_dl=tb&rsv_sug3=6&rsv_sug1=4&rsv_sug7=100&rsv_sug2=0&rsv_btype=t&inputT=964&rsv_sug4=965
The only part that is actually needed is https://www.baidu.com/s?wd=beyond. Entering that URL on its own returns the same result page from the server (the same holds for other search engines). Here beyond is a variable parameter, and variable parameters should be placed in a dictionary.
② With the URL sorted out, we need the User-Agent string of a browser. Taking Chrome as an example: open any website (for example https://www.baidu.com/s?wd=beyond), press F12 to open the developer tools, press F5 to resend the request to the server, then click any entry in the Name column of the Network tab to find the User-Agent value. Mine is User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36. This string is what identifies the request as coming from the Chrome browser.
③ Pass the User-Agent and the keyword entered by the user to the get method, both in dictionary form.
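As a rough illustration of step ①, the sketch below uses only the standard library's urllib.parse to pull the wd parameter out of a long address bar URL and to rebuild the short URL from a dictionary. The long_url value is a shortened copy of the URL above, and the variable names are chosen for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Shortened copy of the address bar URL from step ① (tracking parameters trimmed).
long_url = (
    "https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&tn=baidu"
    "&wd=beyond&rsv_enter=1&rsv_dl=tb"
)

# Parse the query string into a dictionary of lists and keep only the keyword.
query = parse_qs(urlsplit(long_url).query)
keyword = query["wd"][0]

# The variable part goes into a dictionary, as the steps above recommend.
param = {"wd": keyword}
short_url = "https://www.baidu.com/s?" + urlencode(param)
print(short_url)  # https://www.baidu.com/s?wd=beyond

# Non-ASCII keywords are percent-encoded as UTF-8 automatically.
print(urlencode({"wd": "黄家驹"}))  # wd=%E9%BB%84%E5%AE%B6%E9%A9%B9
```

Keeping the keyword in a dictionary means the encoding is handled by the library rather than by hand-built strings, which is also what passing params to the get method relies on.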
Complete code
import requests

if __name__ == '__main__':
    # UA spoofing: send a browser's User-Agent string as the request carrier identity
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
    }
    # Specify the URL
    url = 'https://www.baidu.com/s?'  # https://www.baidu.com/s?word=%E9%BB%84%E5%AE%B6%E9%A9%B9
    # Collect the parameters carried by the URL in a dictionary
    keyword = input("please input a word:")
    param = {
        'wd': keyword
    }
    # Send the request to the URL; requests encodes and appends the parameters itself.
    # Without the User-Agent header the server does not return the page data for this
    # request, which shows that Baidu applies UA detection as an anti-crawler measure.
    response = requests.get(url=url, params=param, headers=headers)
    # Get the response body as text
    page = response.text
    filename = keyword + ".html"
    # Persistent storage: write the page returned by the server to the given local path
    with open('E:/Jupyter_workspace/study/python/' + filename, 'w', encoding='utf-8') as fp:
        fp.write(page)
    print(filename, "saved successfully")
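When run, the script prompts for a keyword, saves the result page as <keyword>.html, and prints a confirmation message.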
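One practical caveat with the storage step is that the keyword is used directly as the file name, so input containing characters such as / or ? would fail on Windows. The sketch below is one possible way to guard against that. The helpers safe_filename and save_page are hypothetical names introduced here, not part of the original script, and the character list assumes Windows file name restrictions.

```python
import re
import tempfile
from pathlib import Path


def safe_filename(keyword: str) -> str:
    """Replace characters that Windows forbids in file names with underscores."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", keyword).strip()
    # Fall back to a fixed name if nothing usable is left.
    return (cleaned or "result") + ".html"


def save_page(page: str, keyword: str, out_dir: Path) -> Path:
    """Write the page text to out_dir as UTF-8 and return the path written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / safe_filename(keyword)
    target.write_text(page, encoding="utf-8")
    return target


print(safe_filename("beyond"))       # beyond.html
print(safe_filename("C++/Python?"))  # C++_Python_.html

# Demonstration with a temporary directory instead of a hard-coded drive path.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_page("<html>demo</html>", "beyond", Path(tmp))
    print(saved.name, "saved successfully")
```

In the script above, the hard-coded E:/ path could then be replaced by a call such as save_page(page, keyword, Path('E:/Jupyter_workspace/study/python')).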