3. Hands-on: crawling the result page for a given Baidu search term (a simple page collector)
2022-08-04 23:39:00 【beyond proverb】
The first post in this series already mentioned User-Agent: it identifies the carrier of a request, that is, which browser is being used to access the server. This matters a great deal.
① UA detection
The portal's server checks the identity of the request carrier. If the carrier identifies itself as a browser, the request is treated as normal. If the identifier does not correspond to any browser, the request is treated as abnormal, that is, as coming from a crawler, and the server will very likely reject it.
② UA spoofing
Disguise the crawler's request carrier identity as that of a browser.
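To make the two ideas concrete, here is a minimal sketch of what a UA check might look like on the server side. The function name `looks_like_browser` and the token list are hypothetical and heavily simplified; real sites combine many more signals. The `python-requests/...` string only illustrates the general shape of the default User-Agent that the requests library sends, and the version number is an example.

```python
# Hypothetical, simplified server-side UA check (not Baidu's actual logic).
BROWSER_TOKENS = ("Mozilla/", "Chrome/", "Safari/", "Firefox/")

def looks_like_browser(user_agent: str) -> bool:
    """Return True if the User-Agent string contains a common browser token."""
    return any(token in user_agent for token in BROWSER_TOKENS)

# Shape of the default UA sent by requests (version number is illustrative)
default_ua = "python-requests/2.28.1"
# A browser UA string, as used for UA spoofing
spoofed_ua = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36")

print(looks_like_browser(default_ua))  # False
print(looks_like_browser(spoofed_ua))  # True
```

A crawler that sends the second string passes this kind of check, which is all that UA spoofing amounts to.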
Project
Project overview: the user enters a keyword, and the result page that the Baidu search engine returns for it is saved to the local machine.
Steps:
① Open Baidu, search for any keyword, and look at the address bar.
For example, I searched for beyond, and the address bar showed https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&tn=baidu&wd=beyond&oq=%25E9%25BB%2584%25E5%25AE%25B6%25E9%25A9%25B9&rsv_pq=86cafe360003cde6&rsv_t=6497SlvSbubKeEQiJKGnLL%2BCucYyWr9OJTHOTd0x%2Bbx0%2BViW%2FN75Q0avW1M&rqlang=cn&rsv_enter=1&rsv_dl=tb&rsv_sug3=6&rsv_sug1=4&rsv_sug7=100&rsv_sug2=0&rsv_btype=t&inputT=964&rsv_sug4=965
The only part that actually matters is https://www.baidu.com/s?wd=beyond. Entering just this URL returns the same result page from the server. (The same holds for other search engines.) Here beyond is a variable parameter, and variable parameters should be placed in a dictionary.
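As a small illustration of putting the variable parameter into a dictionary, the sketch below builds the minimal search URL with the standard library's `urllib.parse`. The helper name `build_search_url` is my own, and `long_url` is a shortened stand-in for the full address-bar URL above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = "https://www.baidu.com/s"

def build_search_url(keyword: str) -> str:
    """Build the minimal search URL for a keyword (illustrative helper)."""
    # urlencode percent-encodes non-ASCII keywords as UTF-8
    return base_url + "?" + urlencode({"wd": keyword})

print(build_search_url("beyond"))  # https://www.baidu.com/s?wd=beyond
print(build_search_url("黄家驹"))   # https://www.baidu.com/s?wd=%E9%BB%84%E5%AE%B6%E9%A9%B9

# Going the other way: pull wd out of a longer address-bar URL
long_url = "https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&tn=baidu&wd=beyond&rqlang=cn"
print(parse_qs(urlsplit(long_url).query)["wd"])  # ['beyond']
```

Keeping the keyword in a dictionary means the encoding is handled for you, including for non-ASCII search terms.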

② With the URL sorted out, we need the identification string of a browser. Taking Chrome as an example: open any site (for example https://www.baidu.com/s?wd=beyond), press F12 to open the developer tools, press F5 to resend the request to the server, then click any entry under Name in the Network tab to find the User-Agent. Mine, for example, is User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36. This string is the identifier that this Chrome browser sends with each request.
③ Pass the User-Agent and the keyword entered by the user to the get method (both as dictionaries).
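Before sending anything, it can help to look at what requests will actually put on the wire. The sketch below is one way to do that, assuming the requests library is installed: it builds the request with `requests.Request(...).prepare()` and never contacts the server. The keyword beyond is just the example from the steps above.

```python
import requests

headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36")
}
param = {"wd": "beyond"}

# Build the request without sending it, so no network access is needed
prepared = requests.Request(
    "GET", "https://www.baidu.com/s", params=param, headers=headers
).prepare()

print(prepared.url)                    # https://www.baidu.com/s?wd=beyond
print(prepared.headers["User-Agent"])  # the browser string set above
```

`requests.get(url=..., params=..., headers=...)` prepares the request in the same way and then sends it.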
Complete code
import requests

if __name__ == '__main__':
    # UA spoofing: send the User-Agent string of a real browser
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
    }
    # Base URL; a full query looks like https://www.baidu.com/s?word=%E9%BB%84%E5%AE%B6%E9%A9%B9
    url = 'https://www.baidu.com/s?'
    # Wrap the parameters carried by the URL in a dictionary
    keyword = input("please input a word:")
    param = {
        'wd': keyword
    }
    # Send the request; requests encodes the parameters and appends them to the URL.
    # Without the User-Agent header, the server does not return the page data,
    # which shows that Baidu uses UA detection as an anti-crawler measure.
    response = requests.get(url=url, params=param, headers=headers)
    # Get the response body as text
    page = response.text
    filename = keyword + ".html"
    # Persist: write the page returned by the server to the given local path
    with open('E:/Jupyter_workspace/study/python/' + filename, 'w', encoding='utf-8') as fp:
        fp.write(page)
    print(filename, "保存成功")  # "保存成功" means "saved successfully"
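One thing the code above leaves open is a keyword containing characters such as / or ?, which cannot appear in a Windows file name and would make `open` fail. The sketch below shows one possible way to handle that. The helpers `safe_filename` and `save_page` are hypothetical additions, not part of the original program, and the replacement rule is only one reasonable choice.

```python
import re
from pathlib import Path

def safe_filename(keyword: str) -> str:
    """Replace characters that are not allowed in Windows file names with '_'."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", keyword).strip()
    # Fall back to a fixed name if nothing usable is left
    return (cleaned or "page") + ".html"

def save_page(page: str, keyword: str, out_dir: str = ".") -> Path:
    """Write the page text into out_dir and return the path of the new file."""
    target = Path(out_dir) / safe_filename(keyword)
    target.write_text(page, encoding="utf-8")
    return target

print(safe_filename("beyond"))  # beyond.html
print(safe_filename("a/b?c"))   # a_b_c.html
```

In the program above, `save_page(page, keyword, 'E:/Jupyter_workspace/study/python/')` could take the place of the `with open(...)` block.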
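Running the program with the keyword beyond saves the result page as beyond.html in the specified folder and prints a confirmation message.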

