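Anti-crawler strategies (IP proxies, random sleep intervals, scraping Bilibili video information, obtaining the real URL, handling special characters, handling timestamps, multithreading)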
2022-06-13 01:56:00 【Triumph19】
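Common anti-crawler strategies
1. Header-based detection
- Inspecting the headers of incoming requests is the anti-crawler strategy websites use most often. Many sites check the User-Agent header to judge whether the request comes from a browser; some check Referer (hotlink protection on resource sites); others check Cookie (a login is required to get more data).
2. Behavior-based detection
- Judging from user behavior whether a request comes from a crawler is another common strategy. For example, many visits from the same IP address within a short time, or the same account repeating the same action many times within a short time, may each trigger a site's anti-crawler measures.
3. Dynamically loaded data
- Some sites generate their pages dynamically with JavaScript, so the required data cannot be obtained by fetching the current page directly, which makes direct scraping harder.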
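The header checks and the frequency checks described in items 1 and 2 can be illustrated from the crawler's side. The following is only a minimal sketch: the header values, the `example.com` address, the placeholder cookie, and the helper names `random_pause` and `fetch_pages` are all assumptions made for illustration, not values taken from any particular site.

```python
import random
import time

import requests

# Hypothetical header values; copy real ones from your own browser session.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"
    ),
    "Referer": "https://example.com/",   # checked by hotlink protection
    "Cookie": "sessionid=PLACEHOLDER",   # only needed for logged-in pages
}


def random_pause(min_seconds=1.0, max_seconds=3.0):
    """Sleep for a random interval and return the delay that was used."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay


def fetch_pages(urls, headers=HEADERS, timeout=10):
    """Fetch each URL in turn, pausing a random interval after each request."""
    pages = []
    with requests.Session() as session:
        session.headers.update(headers)   # sent with every request in the session
        for url in urls:
            response = session.get(url, timeout=timeout)
            response.raise_for_status()   # raise on 4xx/5xx responses
            pages.append(response.text)
            random_pause()
    return pages
```

A random interval is used rather than a fixed one so that requests do not arrive at a perfectly regular rhythm; the 1 to 3 second range is an arbitrary starting point and should be adjusted to the target site.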
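Countermeasures against anti-crawler strategies
1. Use proxy IPs
- When a site restricts access by IP address, a proxy IP can be used. A proxy IP is an IP address that fetches network resources on the user's behalf. It hides the crawler's real IP and so gets around per-IP access limits, which keeps the crawler from being blocked by the site's anti-crawler checks.
- The requests library makes proxies easy to use: build a dictionary of proxy addresses and pass it through the proxies parameter when sending the HTTP request. To use several proxies, put all the proxy dictionaries in a list and pick one from the list at random.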
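The proxy approach described above might look like the sketch below. The proxy addresses are placeholders from a documentation-only IP range and will not work as written, and the names `PROXY_POOL`, `pick_proxy`, and `fetch_via_proxy` are hypothetical helpers introduced here for illustration.

```python
import random

import requests

# Hypothetical proxies (203.0.113.0/24 is reserved for documentation);
# replace them with proxies you are actually allowed to use.
PROXY_POOL = [
    {"http": "http://203.0.113.10:8080", "https": "http://203.0.113.10:8080"},
    {"http": "http://203.0.113.11:3128", "https": "http://203.0.113.11:3128"},
    {"http": "http://203.0.113.12:8000", "https": "http://203.0.113.12:8000"},
]


def pick_proxy(pool=PROXY_POOL):
    """Return one proxy dictionary chosen at random from the pool."""
    return random.choice(pool)


def fetch_via_proxy(url, pool=PROXY_POOL, timeout=10):
    """Send a GET request through a randomly chosen proxy."""
    proxies = pick_proxy(pool)
    response = requests.get(url, proxies=proxies, timeout=timeout)
    response.raise_for_status()   # raise on 4xx/5xx responses
    return response.text
```

Each dictionary maps a URL scheme to the proxy used for that scheme, which is the shape the `proxies` parameter of `requests.get` expects. Free proxies fail often, so in practice a caller would probably catch `requests.RequestException` and retry with a different entry from the pool.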