3. Hands-on: crawling the Baidu results page for a given search term (a simple page collector)
2022-08-04 23:39:00 【beyond proverb】
As mentioned in the first post, the User-Agent identifies the carrier of a request, that is, which browser is being used to access the server. This matters a great deal here.
① UA detection
The portal's server checks the identity of the request carrier. If the User-Agent identifies a browser, the request is treated as normal. If it does not correspond to any browser, the request is treated as abnormal, i.e. as coming from a crawler, and the server is likely to reject it.
② UA spoofing
Disguise the crawler's request carrier identity as that of a browser.
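To make the two ideas concrete, here is a minimal sketch of what a UA check might look like on the server side. The function `looks_like_browser` and the token list `BROWSER_TOKENS` are hypothetical and only illustrate the principle; the checks a real site such as Baidu performs are not described in this article. The version number in `crawler_ua` is likewise just an example.

```python
# Hypothetical, simplified UA check; a real server is likely to use more signals.
BROWSER_TOKENS = ("Mozilla/", "Chrome/", "Safari/", "Firefox/")


def looks_like_browser(user_agent: str) -> bool:
    """Return True if the User-Agent string contains a typical browser token."""
    return any(token in user_agent for token in BROWSER_TOKENS)


# By default the requests library identifies itself as "python-requests/<version>".
crawler_ua = "python-requests/2.28.1"
# UA spoofing: send a string copied from a real browser instead.
spoofed_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
)

print(looks_like_browser(crawler_ua))  # False
print(looks_like_browser(spoofed_ua))  # True
```

A check this naive is exactly what UA spoofing defeats: the server only sees the string the client chooses to send.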
Project
Project overview: the user enters a keyword, and the results page that the Baidu search engine returns for it is downloaded to the local disk.
Steps:
① Open Baidu, search for any keyword, and look at the address bar.
For example, when I search for beyond, the address bar shows https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&tn=baidu&wd=beyond&oq=%25E9%25BB%2584%25E5%25AE%25B6%25E9%25A9%25B9&rsv_pq=86cafe360003cde6&rsv_t=6497SlvSbubKeEQiJKGnLL%2BCucYyWr9OJTHOTd0x%2Bbx0%2BViW%2FN75Q0avW1M&rqlang=cn&rsv_enter=1&rsv_dl=tb&rsv_sug3=6&rsv_sug1=4&rsv_sug7=100&rsv_sug2=0&rsv_btype=t&inputT=964&rsv_sug4=965
The only part that actually matters is https://www.baidu.com/s?wd=beyond. Entering this URL on its own returns the same results page from the server (the same goes for other search engines). Here beyond is the variable parameter, and variable parameters should be put into a dictionary.

② With the URL tidied up, we need the identifying string of a browser. Taking Chrome as an example: open any site (for example https://www.baidu.com/s?wd=beyond), press F12 to open the developer tools, press F5 to resend the request to the server, and under the Network tab click any entry in the Name column to find the User-Agent. Mine, for example, is User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36. This string is how the Chrome browser identifies itself to the server.
③ In the get method, pass in the User-Agent and the keyword entered by the user (both as dictionaries).
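The three steps can be sketched without touching the network. The snippet below assumes the `requests` library is installed; `long_url` is a shortened copy of the address-bar URL from step ① (most of the `rsv_*` parameters are omitted for readability). `prepare()` only builds the request, so nothing is sent to Baidu.

```python
from urllib.parse import parse_qs, urlsplit

import requests

# Shortened copy of the address-bar URL from step 1 (assumption: extra parameters dropped).
long_url = "https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&tn=baidu&wd=beyond&rqlang=cn"

# Step 1: the only parameter that matters is wd, the search keyword.
keyword = parse_qs(urlsplit(long_url).query)["wd"][0]
param = {"wd": keyword}

# Step 2: a browser User-Agent copied from the developer tools.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
    )
}

# Step 3: pass both dictionaries; prepare() builds the request without sending it.
prepared = requests.Request(
    "GET", "https://www.baidu.com/s", params=param, headers=headers
).prepare()

print(prepared.url)                    # https://www.baidu.com/s?wd=beyond
print(prepared.headers["User-Agent"])  # the spoofed browser string
```

Inspecting the prepared request this way is a cheap check that the parameter dictionary and the header dictionary end up where you expect before a real request is made.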
Full code
import requests

if __name__ == '__main__':
    # UA spoofing: use a browser's User-Agent string as the request carrier identity
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
    }
    # Specify the URL, e.g. https://www.baidu.com/s?word=%E9%BB%84%E5%AE%B6%E9%A9%B9
    url = 'https://www.baidu.com/s'
    # Put the parameters carried by the URL into a dictionary
    keyword = input("please input a word:")
    param = {
        'wd': keyword
    }
    # Send the request to the URL; requests encodes the parameters and appends them.
    # Without the User-Agent header, the server does not return the result data,
    # which shows that Baidu applies a UA-detection anti-crawler mechanism.
    response = requests.get(url=url, params=param, headers=headers)
    # Get the response body
    page = response.text
    filename = keyword + ".html"
    # Persist: store the page returned by the server at the specified local path
    with open('E:/Jupyter_workspace/study/python/' + filename, 'w', encoding='utf-8') as fp:
        fp.write(page)
    print(filename, "saved successfully")
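The full code builds the file name directly from the user's input, so a keyword containing characters such as / or ? could fail or point at a different path, particularly on Windows. One possible hardening is sketched below; `build_output_path` and `save_page` are hypothetical helper names, not part of the original program, and the fallback name `result` is an arbitrary choice.

```python
import re
from pathlib import Path


def build_output_path(keyword: str, directory: str = ".") -> Path:
    """Map a search keyword to <directory>/<keyword>.html with unsafe characters replaced."""
    # Replace characters that Windows does not allow in file names, plus control characters.
    safe = re.sub(r'[\\/:*?"<>|\x00-\x1f]', "_", keyword).strip()
    # Fall back to a fixed name if nothing usable is left.
    return Path(directory) / f"{safe or 'result'}.html"


def save_page(page: str, keyword: str, directory: str = ".") -> Path:
    """Write the page text as UTF-8 and return the path that was written."""
    path = build_output_path(keyword, directory)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(page, encoding="utf-8")
    return path


print(build_output_path("beyond").name)  # beyond.html
print(build_output_path("a/b:c").name)   # a_b_c.html
```

In the full code, `save_page(response.text, keyword, 'E:/Jupyter_workspace/study/python/')` could stand in for the `open(...)` block.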
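Running the full code above prompts for a keyword, saves the returned page as <keyword>.html under the specified path, and prints a confirmation message.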

