3. Hands-on: crawling the Baidu results page for a specified search term (a simple page collector)
2022-08-04 23:39:00 【beyond proverb】
The first post in this series also mentioned the User-Agent: it indicates the identity of the request carrier, that is, which browser is being used to access the server. This point is important.
① UA detection
The portal's server checks the identity of the carrier behind each request. If that identity shows the request comes from a browser, the request is treated as normal. If the identity is not based on any browser, the request is treated as abnormal, that is, as coming from a crawler, and there is a good chance the server will reject it.
② UA spoofing
Disguise the crawler's request carrier identity as that of a browser.
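To make the two ideas concrete, here is a minimal sketch that uses only Python's standard library (urllib.request) instead of the requests library, so it runs without sending anything over the network. It shows the identity a script announces by default and the identity it announces once a browser string is supplied. The helper names default_user_agent and disguised_request are hypothetical, chosen for illustration; the Chrome string is the one quoted in this post.

```python
import urllib.request

# Browser identity quoted in this post (Chrome 103 on Windows 10).
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
)


def default_user_agent() -> str:
    """Return the User-Agent urllib announces when none is supplied."""
    opener = urllib.request.OpenerDirector()
    return dict(opener.addheaders)["User-agent"]


def disguised_request(url: str) -> urllib.request.Request:
    """Build (but do not send) a request that carries a browser User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": BROWSER_UA})


if __name__ == "__main__":
    # Something like "Python-urllib/3.x": clearly not a browser.
    print(default_user_agent())
    req = disguised_request("https://www.baidu.com/s?wd=beyond")
    # urllib stores header names capitalized as "User-agent".
    print(req.get_header("User-agent"))
```

A server that performs UA detection can tell the first identity apart from a browser at a glance, which is why the second form is the one a crawler sends.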
Project
Project overview: the user enters a keyword, and the results page that the Baidu search engine returns for it is downloaded to the local machine.
Steps:
① Open Baidu, search for any keyword, and look at the address bar.
For example, searching for beyond here gives the address https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&tn=baidu&wd=beyond&oq=%25E9%25BB%2584%25E5%25AE%25B6%25E9%25A9%25B9&rsv_pq=86cafe360003cde6&rsv_t=6497SlvSbubKeEQiJKGnLL%2BCucYyWr9OJTHOTd0x%2Bbx0%2BViW%2FN75Q0avW1M&rqlang=cn&rsv_enter=1&rsv_dl=tb&rsv_sug3=6&rsv_sug1=4&rsv_sug7=100&rsv_sug2=0&rsv_btype=t&inputT=964&rsv_sug4=965
The only part that actually matters is https://www.baidu.com/s?wd=beyond. Entering that URL on its own still returns the same results page from the server (the same goes for other search engines). Here beyond is a variable parameter, and variable parameters should be put into a dictionary.
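As a rough illustration of why the variable part goes into a dictionary, the sketch below builds the short URL from a keyword and pulls the keyword back out of a long address-bar URL. It uses urllib.parse from the standard library; the function names build_search_url and extract_keyword are hypothetical, and the requests library does an equivalent encoding step internally when it is given a params dictionary.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://www.baidu.com/s"


def build_search_url(keyword: str) -> str:
    """Encode the variable parameter and append it to the fixed part of the URL."""
    return f"{BASE_URL}?{urlencode({'wd': keyword})}"


def extract_keyword(url: str) -> str:
    """Read the wd parameter back out of a results-page URL."""
    query = parse_qs(urlsplit(url).query)
    return query["wd"][0]


if __name__ == "__main__":
    print(build_search_url("beyond"))
    # Non-ASCII keywords are percent-encoded as UTF-8 automatically.
    print(build_search_url("黄家驹"))
    print(extract_keyword("https://www.baidu.com/s?ie=utf-8&f=8&wd=beyond&rqlang=cn"))
```

Keeping the keyword in a dictionary means the encoding of spaces and non-ASCII characters is handled by the library rather than by hand-built string concatenation.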

② With the URL sorted out, we need the identification string of a browser. Taking Chrome as an example: open any website (for example https://www.baidu.com/s?wd=beyond), press F12 to open the developer tools, press F5 to resend the request to the server, then click any entry under Name in the Network tab to find the User-Agent among its request headers. Mine, for example, is User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36. This string is what identifies the request as coming from the Chrome browser.
③ In the get method, pass in the User-Agent and the keyword entered by the user (both as dictionaries).
Complete code
import requests

if __name__ == '__main__':
    # UA spoofing: use a browser's User-Agent string as the request carrier identity
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
    }
    # Specify the URL, e.g. https://www.baidu.com/s?word=%E9%BB%84%E5%AE%B6%E9%A9%B9
    url = 'https://www.baidu.com/s?'
    # Handle the parameters carried by the URL: wrap them in a dictionary
    keyword = input("please input a word:")
    param = {
        'wd': keyword
    }
    # Send the request to the URL; the parameters are encoded into it during the request.
    # Without the User-Agent in headers, the server returns no page data for this request,
    # which shows that Baidu uses UA detection as an anti-crawler mechanism.
    response = requests.get(url=url, params=param, headers=headers)
    # Get the response body
    page = response.text
    filename = keyword + ".html"
    # Persistent storage: write the page returned by the server to the specified local path
    with open('E:/Jupyter_workspace/study/python/' + filename, 'w', encoding='utf-8') as fp:
        fp.write(page)
    print(filename, "saved successfully")
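The persistence step in the complete code builds the file name directly from the keyword and writes to a fixed drive path. A keyword containing characters such as / or ? would then produce an invalid file name on Windows. The following is one possible sketch of a more defensive version, not the author's method: safe_filename and save_page are hypothetical helper names, and the choice of replacing forbidden characters with an underscore is an assumption.

```python
import re
from pathlib import Path


def safe_filename(keyword: str) -> str:
    """Turn a search keyword into a file name that is valid on Windows."""
    # Replace the characters Windows forbids in file names with "_".
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", keyword).strip()
    # Fall back to a fixed name if nothing usable is left (assumption).
    return (cleaned or "result") + ".html"


def save_page(page: str, keyword: str, out_dir: Path) -> Path:
    """Write the page text to out_dir and return the path that was written."""
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    target = out_dir / safe_filename(keyword)
    target.write_text(page, encoding="utf-8")
    return target


if __name__ == "__main__":
    saved = save_page("<html></html>", "beyond", Path("pages"))
    print(saved, "saved successfully")
```

In the script, response.text would be passed as page and the directory of your choice as out_dir.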
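Running the complete code prompts for a keyword, saves the returned results page as an .html file named after that keyword under the specified path, and prints a confirmation.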

