Crawlers: Proxy Theory Explained and Applied
2022-07-06 03:28:00 【Master Zheng fried chestnuts】
Proxies: Theory and Application
I. How Proxies Work
When writing a crawler, you will often run into the following situation: the crawler runs normally at first and captures data without trouble, but after a while requests start to fail. If you open the site in a browser, you may see a message such as "Your IP's access frequency is too high". The cause is an anti-crawling measure taken by the website.
For example, the server counts how many requests each IP sends per unit of time. If an IP exceeds a certain threshold within that time, the server rejects its requests and returns an error message to the client.
In this situation we say the requesting IP has been "banned" (blacklisted) by the server.
Because the server keys this check on the IP address, we can disguise our IP so that the server cannot tell the requests come from our local machine. That keeps the server from banning our IP, which makes it an effective counter-strategy against this anti-crawling mechanism.
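To make the server-side check concrete, here is a minimal sketch of a per-IP request counter over a sliding time window. The class name `IpRateLimiter`, the threshold of 3 requests, and the 10-second window are assumptions chosen for illustration; real sites use their own rules, which are not described in this article.

```python
from collections import defaultdict, deque


class IpRateLimiter:
    """Hypothetical per-IP sliding-window counter (names and limits are illustrative)."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of recently accepted requests

    def allow(self, ip, now):
        """Return True if a request from `ip` at time `now` (in seconds) is accepted."""
        hits = self._hits[ip]
        # Forget timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # threshold exceeded: the request is rejected
        hits.append(now)
        return True


limiter = IpRateLimiter(max_requests=3, window_seconds=10)
results = [limiter.allow("203.0.113.5", t) for t in (0, 1, 2, 3)]
other = limiter.allow("198.51.100.7", 3)   # a different IP has its own counter
later = limiter.allow("203.0.113.5", 12)   # the old requests have left the window
print(results, other, later)
```

The fourth request from the same IP is rejected, while a request from another IP at the same moment is accepted. This is why changing the apparent source IP sidesteps the check.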
1. The core purpose of a proxy
- To get around the IP-ban anti-crawling mechanism.
2. What is a proxy?
- A proxy is a proxy server. Its job is to fetch network information on behalf of a network user.
- A proxy server can be thought of as a relay station for network traffic. Normally, when we request a website, we send the request straight to the web server, and the web server sends the response data back to the client.
- Once we set up a proxy server, it acts as a bridge between our local machine and the web server. The local machine no longer sends its request directly to the web server. It sends the request to the proxy server, the proxy server forwards it to the web server, and the proxy server then passes the web server's response back to our client (the local machine). We can still access the pages on the web server as usual.
- During this process, however, the IP the web server sees is no longer our local IP. The request reaches the web server from the proxy server, not from our machine, so our IP has been disguised. This is the basic principle of a proxy.
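The relay described above can be illustrated with a toy in-memory simulation. It is only a sketch: the functions `web_server`, `proxy_server`, and the addresses are hypothetical stand-ins, and no real network traffic is involved.

```python
def web_server(peer_ip, path):
    """Stand-in web server: it only knows the address of whoever connected to it."""
    return {"path": path, "seen_ip": peer_ip}


def proxy_server(proxy_ip, path):
    """Stand-in proxy: it opens its own connection, so the server sees proxy_ip."""
    return web_server(peer_ip=proxy_ip, path=path)


LOCAL_IP = "203.0.113.5"    # assumed address of our machine (documentation range)
PROXY_IP = "198.51.100.7"   # assumed address of the proxy (documentation range)

direct = web_server(peer_ip=LOCAL_IP, path="/index.html")
relayed = proxy_server(proxy_ip=PROXY_IP, path="/index.html")
print(direct["seen_ip"], relayed["seen_ip"])
```

In the direct case the server records our local IP. In the relayed case it records only the proxy's address, although the same page is returned.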
3. What a proxy is good for:
- Getting past access restrictions placed on your own IP.
- Hiding your real IP to avoid being attacked.
4. Proxy listing sites:
- Kuaidaili (快代理)
- Xici Daili (西祠代理)
- https://m.goubanjia.com/
- https://proxy.seofangfa.com/
5. Types of proxy IP:
- http: used for URLs that use the HTTP protocol
- https: used for URLs that use the HTTPS protocol
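Since the proxy type has to match the protocol of the target URL, a small helper can pick the right entry. This is a sketch only: `proxies_for` is a hypothetical helper and the proxy addresses are placeholders. The returned dictionary is keyed by URL scheme, which is the form the `proxies` argument of `requests` expects.

```python
from urllib.parse import urlparse


def proxies_for(url, http_proxy, https_proxy):
    """Return a proxies mapping whose key matches the scheme of `url`."""
    scheme = urlparse(url).scheme
    if scheme == "http":
        return {"http": http_proxy}
    if scheme == "https":
        return {"https": https_proxy}
    raise ValueError(f"unsupported URL scheme: {scheme!r}")


# Placeholder proxy addresses (documentation range), not working proxies.
HTTP_PROXY = "http://198.51.100.7:8080"
HTTPS_PROXY = "http://198.51.100.8:8080"

plain = proxies_for("http://www.baidu.com/s?wd=IP", HTTP_PROXY, HTTPS_PROXY)
secure = proxies_for("https://www.baidu.com/s?wd=IP", HTTP_PROXY, HTTPS_PROXY)
print(plain, secure)
```

The result can be passed as `requests.get(url, proxies=...)`. If the key does not match the URL's scheme, `requests` does not route that request through the proxy.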
6. Anonymity levels of proxy IPs
(1) Transparent: the server knows the request came through a proxy and also knows the real IP behind it.
(2) Anonymous: the server knows a proxy was used but does not know the real IP.
(3) High anonymity (elite): the server does not know a proxy was used and does not know the real IP.
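One common way to tell the three levels apart is to look at the headers the target server receives. The sketch below is a heuristic built on that assumption: transparent proxies typically pass the client IP along (for example in `X-Forwarded-For`), anonymous proxies add proxy-identifying headers such as `Via` without the client IP, and high-anonymity proxies add neither. The function name and the header list are illustrative and not taken from this article.

```python
# Headers that commonly reveal a proxy (an assumed, non-exhaustive list).
PROXY_HEADERS = ("via", "x-forwarded-for", "forwarded", "proxy-connection")


def classify_anonymity(received_headers, real_ip):
    """Classify a proxy from the headers the target server received."""
    headers = {name.lower(): value for name, value in received_headers.items()}
    # Simplification: a plain substring check for the real IP in any header value.
    if any(real_ip in value for value in headers.values()):
        return "transparent"
    if any(name in headers for name in PROXY_HEADERS):
        return "anonymous"
    return "high anonymity"


REAL_IP = "203.0.113.5"  # assumed local IP (documentation range)

level_a = classify_anonymity({"Via": "1.1 proxy", "X-Forwarded-For": REAL_IP}, REAL_IP)
level_b = classify_anonymity({"Via": "1.1 proxy"}, REAL_IP)
level_c = classify_anonymity({"User-Agent": "demo"}, REAL_IP)
print(level_a, level_b, level_c)
```

In practice the headers would come from a test endpoint you control or one that echoes back the request headers it received.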
II. Using Proxies in a Crawler
1. Using a proxy:
import requests

url = 'http://www.baidu.com/s?wd=IP'
headers = {
    'User-Agent': 'Moz...'
}
# The proxies key is the URL scheme ('http' or 'https'), not 'http://';
# with a key that does not match, requests ignores the proxy.
# Proxy lists usually give addresses as ip:port; append the port your proxy uses.
proxies = {'http': 'http://183.247.199.111'}
page_text = requests.get(url=url, headers=headers, proxies=proxies).text
# Persist the page. Open ip.html: if it shows your local IP address, the proxy was not used.
with open('ip.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)
print('Saved')
2. Anti-crawling mechanism: IP bans
3. Counter-strategy: send requests through a proxy
Note: the point here is to learn the method. Most free proxy IPs do not work; for real projects, buy paid proxy IPs.
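Because many free proxies are dead, a crawler usually needs to try several before one works. The following is a minimal sketch of that idea, not a production pool: `fetch_with_rotation` and `fake_fetch` are hypothetical names, the proxy addresses are placeholders, and the fetch function is passed in so the logic can be shown without touching the network.

```python
def fetch_with_rotation(url, proxy_pool, fetch):
    """Try each proxy in order; return (proxy, result) for the first that succeeds."""
    failures = []
    for proxy in proxy_pool:
        try:
            return proxy, fetch(url, proxy)
        except Exception as exc:  # a dead or refused proxy: move on to the next one
            failures.append((proxy, exc))
    raise RuntimeError(f"all {len(failures)} proxies failed for {url}")


def fake_fetch(url, proxy):
    """Stand-in for a real request: proxies on port 1111 are treated as dead."""
    if proxy.endswith(":1111"):
        raise ConnectionError("dead proxy")
    return f"fetched {url} via {proxy}"


pool = ["http://198.51.100.7:1111", "http://198.51.100.8:2222"]  # placeholder addresses
used, body = fetch_with_rotation("http://example.com", pool, fake_fetch)
print(used, body)
```

With `requests`, the fetch function could be something like `lambda url, proxy: requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=5).text`, so that a slow or dead proxy raises an exception and the next one is tried.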