3. Hands-on: crawling the Baidu results page for a given search term (a simple page collector)
2022-08-04 23:39:00 【beyond proverb】
The first post in this series already mentioned User-Agent. It indicates the identity of the request carrier, that is, which browser is being used to access the server. This point is important.
① UA detection
The portal's server checks the identity of the request carrier. If that identity belongs to a browser, the request is treated as normal. If it is not based on any browser, the request is treated as abnormal, i.e. as coming from a crawler, and the server will very likely reject it.
② UA disguise
Disguise the crawler's request carrier identity as that of a browser.
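To make UA detection more concrete, here is a toy sketch of what a server-side check might look like. The function name `looks_like_browser` and the token list are hypothetical illustrations, not Baidu's actual rules, which are not public. The version number in the crawler string is also only illustrative.

```python
# Hypothetical heuristic: real sites apply their own, unpublished rules.
BROWSER_TOKENS = ("Mozilla/", "Chrome/", "Safari/", "Firefox/")


def looks_like_browser(user_agent: str) -> bool:
    """Return True if the User-Agent string contains a common browser token."""
    return any(token in user_agent for token in BROWSER_TOKENS)


# The Chrome User-Agent quoted in this post.
chrome_ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
             "(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36")
# The requests library identifies itself as python-requests/<version> by default.
default_crawler_ua = "python-requests/2.28.1"

print(looks_like_browser(chrome_ua))           # True
print(looks_like_browser(default_crawler_ua))  # False
```

A crawler that sends the first string passes this kind of check; one that keeps the library default does not, which is why UA disguise is needed.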
Project
Project overview: the user enters a keyword, and the results page that the Baidu search engine returns for that keyword is downloaded to local disk.
Steps:
① Open Baidu, search for any keyword, and look at the address bar.
For example, I searched for beyond, and the address bar showed https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&tn=baidu&wd=beyond&oq=%25E9%25BB%2584%25E5%25AE%25B6%25E9%25A9%25B9&rsv_pq=86cafe360003cde6&rsv_t=6497SlvSbubKeEQiJKGnLL%2BCucYyWr9OJTHOTd0x%2Bbx0%2BViW%2FN75Q0avW1M&rqlang=cn&rsv_enter=1&rsv_dl=tb&rsv_sug3=6&rsv_sug1=4&rsv_sug7=100&rsv_sug2=0&rsv_btype=t&inputT=964&rsv_sug4=965
The only part that is actually needed is https://www.baidu.com/s?wd=beyond. Entering that URL on its own still returns the same results page from the server. (Other search engines behave similarly.) Here beyond is a variable parameter, and variable parameters should be placed in a dictionary.

② With the URL tidied up, we need the identity string of a browser. Taking Chrome as an example: open any site (for example https://www.baidu.com/s?wd=beyond), press F12 to open the developer tools, press F5 to resend the request, then click any entry under Name in the Network tab to find the User-Agent. Mine, for example, is User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36. This is the string Chrome uses to identify itself to servers. It is not unique to one user or machine.
③ Pass the User-Agent and the keyword entered by the user to the get method (both as dictionaries).
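The three steps can be sketched without touching the network. The example below uses only the standard library (`urllib`) rather than `requests`, so that the request can be built and inspected without being sent. The helper names `extract_keyword` and `build_request` are hypothetical, and `long_url` is a shortened version of the address-bar URL shown in step ①.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

BASE_URL = "https://www.baidu.com/s"


def extract_keyword(full_url: str) -> str:
    """Pull the wd (search keyword) parameter out of a full results URL."""
    query = parse_qs(urlsplit(full_url).query)
    return query["wd"][0]


def build_request(keyword: str, user_agent: str) -> Request:
    """Build, but do not send, a GET request carrying the keyword and the UA."""
    params = {"wd": keyword}              # variable parameter goes in a dict
    headers = {"User-Agent": user_agent}  # UA disguise goes in a dict
    return Request(BASE_URL + "?" + urlencode(params), headers=headers)


long_url = "https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&tn=baidu&wd=beyond&rqlang=cn"
ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36")

req = build_request(extract_keyword(long_url), ua)
print(req.full_url)                  # https://www.baidu.com/s?wd=beyond
print(req.get_header("User-agent"))  # urllib stores the header name as "User-agent"
```

`requests.get(url, params=..., headers=...)` performs the same URL encoding and header attachment internally, so the dictionaries carry over unchanged.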
Complete code
import requests

if __name__ == '__main__':
    # UA disguise: use a browser's User-Agent string as the carrier identity
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
    }
    # Specify the URL
    url = 'https://www.baidu.com/s?'  # e.g. https://www.baidu.com/s?word=%E9%BB%84%E5%AE%B6%E9%A9%B9
    # Handle the parameters carried by the URL: wrap them in a dictionary
    keyword = input("please input a word:")
    param = {
        'wd': keyword
    }
    # Send the request; requests encodes the parameters into the URL for us.
    # Without the User-Agent in headers, the server does not return the results
    # page, which shows that Baidu applies UA detection as an anti-crawler measure.
    # The timeout keeps the script from hanging if the server does not answer.
    response = requests.get(url=url, params=param, headers=headers, timeout=10)
    # Get the response body
    page = response.text
    filename = keyword + ".html"
    # Persist the page returned by the server to the specified local path
    with open('E:/Jupyter_workspace/study/python/' + filename, 'w', encoding='utf-8') as fp:
        fp.write(page)
    print(filename, "saved successfully")
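Running the script prompts for a keyword, saves the returned page as <keyword>.html in the specified directory, and prints a confirmation message.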
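The script above writes to a hard-coded Windows path and uses the keyword directly as the file name, which fails if the directory does not exist or the keyword contains characters such as `/` or `?`. A minimal sketch of a more portable storage step follows. The helpers `safe_filename` and `save_page` and the fallback name `page` are assumptions for illustration, not part of the original code.

```python
import re
import tempfile
from pathlib import Path


def safe_filename(keyword: str) -> str:
    """Replace characters that Windows file systems reject in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", keyword).strip()
    return (cleaned or "page") + ".html"  # fall back to "page" for an empty keyword


def save_page(html: str, keyword: str, out_dir: Path) -> Path:
    """Write html to out_dir/<keyword>.html, creating the directory if needed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / safe_filename(keyword)
    path.write_text(html, encoding="utf-8")
    return path


# Demo with a temporary directory and placeholder HTML (no network involved).
with tempfile.TemporaryDirectory() as tmp:
    saved = save_page("<html>demo</html>", "a/b?c", Path(tmp) / "pages")
    print(saved.name)  # a_b_c.html
```

In the crawler, `save_page(response.text, keyword, Path("pages"))` could stand in for the `with open(...)` block.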

