Preface
Crawlers and anti-crawler measures are a classic pair of spear and shield. A very common anti-crawler technique is IP blocking: when one IP visits too frequently in a short time, it gets rate-limited or blacklisted. My earlier backend-development posts also touch on this.
But today we are on the crawler side, so the solution is a proxy pool: send each request through a different IP, add User-Agent spoofing on top, and the traffic looks like completely normal user behavior, which avoids both rate limiting and blacklist-based blocking.
A crawler is also an I/O-bound program: running the whole process in a single thread is very slow, so multithreading can improve collection efficiency. But managing threads by hand is too much trouble, so I chose a thread pool~
Proxy pool
A complete proxy pool should be able to do the following:
- Collect proxies in bulk (or import purchased proxies through an interface, though free proxies are fine for occasional use)
- Automatically verify that proxies work after collecting them
- Store the proxies that pass validation
- Provide an interface for fetching a random proxy
- Provide management interfaces for adding and deleting proxies
Building this wheel yourself is too much trouble; the whole point of using Python is "Life is short, I use Python", and the community does not disappoint: there are plenty of open-source, easy-to-use proxy pool projects in Python. Here I chose one with 14k+ stars on GitHub, called ProxyPool.
It worked well in my tests!
There are of course many other proxy pool projects; I have not tested them, but interested readers can check the first link in the references.
Deployment and running
Project address :https://github.com/jhao104/proxy_pool
The official documentation offers two deployment methods, running from downloaded source and Docker. Since Docker is an option, Docker is of course the most convenient!
The official docker command is not quite convenient enough on its own, though, because the proxy pool also depends on a Redis service, so I wrote a docker-compose configuration for it:
version: "3"
services:
redis:
image: redis
expose:
- 6379
web:
restart: always
image: jhao104/proxy_pool
environment:
- DB_CONN=redis://redis:6379/0
ports:
- "5010:5010"
depends_on:
- redis
Save this configuration in a folder, then run the command to start the Docker containers:
docker-compose up
The port configured here is 5010, the same as the official default; change it if you need to~
Once the project is running, open http://127.0.0.1:5010 in a browser to see all the available endpoints; as the names suggest, each one is easy to understand.
{
    "url": [
        {
            "desc": "get a proxy",
            "params": "type: ''https'|''",
            "url": "/get"
        },
        {
            "desc": "get and delete a proxy",
            "params": "",
            "url": "/pop"
        },
        {
            "desc": "delete an unable proxy",
            "params": "proxy: 'e.g. 127.0.0.1:8080'",
            "url": "/delete"
        },
        {
            "desc": "get all proxy from proxy pool",
            "params": "type: ''https'|''",
            "url": "/all"
        },
        {
            "desc": "return proxy count",
            "params": "",
            "url": "/count"
        }
    ]
}
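As a quick sanity check (assuming the docker-compose setup above with the default port 5010), you can hit two of these endpoints from Python and see the pool answer:

import requests

# Ask the pool for one random proxy; the address comes back in the "proxy" field
print(requests.get("http://127.0.0.1:5010/get/").json())

# See how many proxies the pool currently holds
print(requests.get("http://127.0.0.1:5010/count").text)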
Using it in code
Because the proxy pool exposes an HTTP interface, it can in theory be used from any language.
Here I use Python.
Get a random proxy
Here I wrote two functions that wrap fetching a random proxy and deleting a proxy:
import requests

PROXY_POOL_URL = 'http://127.0.0.1:5010'

def get_proxy():
    proxy = requests.get(f"{PROXY_POOL_URL}/get/").json().get("proxy")
    return {'http': proxy, 'https': proxy}

def delete_proxy(proxy):
    requests.get(f"{PROXY_POOL_URL}/delete/?proxy={proxy}")
Get a random header
The fake_useragent library generates random User-Agent strings, simulating requests from different browsers.
from fake_useragent import UserAgent

ua = UserAgent()

def get_header():
    return {
        "Accept": "application/json, text/plain, */*",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7,th;q=0.6",
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        "Pragma": "no-cache",
        "User-Agent": ua.random
    }
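A quick check that the header changes between calls (output will vary, since fake_useragent picks a random browser each time):

for _ in range(3):
    print(get_header()["User-Agent"])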
Wrapping the network request
Since we did not buy paid proxies, we rely on the free ones the pool collects automatically, and as everyone knows the quality of free proxies is not guaranteed. So I added retry logic: once the number of failures exceeds the maximum retry count, that proxy is deleted from the pool and the request starts over with a new one~
The maximum number of retries is configured with the MAX_RETRY_COUNT variable.
import logging
from typing import Tuple

from requests import Response

logger = logging.getLogger(__name__)  # any logger works here

MAX_RETRY_COUNT = 5

def request_get(url) -> Tuple[Response, str]:
    retry_count = 1
    proxy = get_proxy()
    while retry_count <= MAX_RETRY_COUNT:
        logger.debug(f'Attempt {retry_count} - URL {url} - proxy {proxy.get("http")}')
        try:
            resp = requests.get(url, proxies=proxy, headers=get_header(), timeout=15)
            return resp, proxy.get('http')
        except Exception:
            logger.error(f'Request failed - URL {url}')
            retry_count += 1
    # All retries failed: delete this proxy from the pool and start over with a new one
    logger.warning(f'All {MAX_RETRY_COUNT} attempts failed - deleting proxy {proxy.get("http")}')
    delete_proxy(proxy.get('http'))
    return request_get(url)
This function returns a tuple of type (Response, str). Different requests may need the data in different formats, so it does not return resp.json() or resp.text; you call the function to get the raw Response and process it yourself.
It also returns the proxy server address as a str in ip:port form.
It is called like this: resp, proxy = request_get(url)
request_get only wraps the most basic way of fetching data, and the data you get back may still not be correct or usable, for example when rate limiting or a blacklist has been triggered and the response comes back empty. In that case, add a check after calling the function: if you conclude that the proxy IP has been banned, call delete_proxy to remove it from the pool.
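As a sketch of that idea (the status-code check and the wrapper name fetch_page are my own placeholders, not part of the original code), such a validation layer could look like this:

def fetch_page(url):
    resp, proxy = request_get(url)
    # Hypothetical ban check: a rate-limit status or an empty body suggests the proxy is blocked
    if resp.status_code in (403, 429) or not resp.text:
        delete_proxy(proxy)        # drop the banned proxy from the pool
        return fetch_page(url)     # try again with a fresh proxy
    return resp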
Thread pool
As mentioned in the preface, a crawler is an I/O-bound program: running everything in a single thread is very slow, so multithreading can improve collection efficiency, but managing threads by hand is too much trouble, which is why I chose a thread pool~
A thread pool is a group of pre-instantiated idle threads standing by to accept work. Creating a new thread object for every task that needs to run asynchronously is expensive; with a thread pool you add tasks to a queue and the pool assigns an available thread to each one. Thread pools help avoid creating and destroying threads unnecessarily.
I previously used the threadpool pip package to implement thread pools. It felt fine, but the crawler would occasionally hang for no obvious reason, and I never figured out what went wrong. Later I read online that threadpool is better suited to CPU-bound work...
PS: I looked at the threadpool source code; impressively, it implements a full thread pool in 421 lines of code. It is built on top of the threading module, which is acceptable.
This time I will use the thread pool implementations that come with the Python standard library. In fact, the standard library contains two kinds of "pool":
- multiprocessing.Pool
- multiprocessing.pool.ThreadPool
The similarity and difference between the two: multiprocessing.pool.ThreadPool behaves exactly like multiprocessing.Pool; the difference is that ThreadPool runs the worker logic in threads, while multiprocessing.Pool uses worker processes.
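A tiny illustration of that point (the toy len task is just for demonstration): the two classes expose the same map-style API, only the workers differ.

from multiprocessing.pool import ThreadPool

# Same interface as multiprocessing.Pool, but the 4 workers here are threads
with ThreadPool(4) as p:
    print(p.map(len, ["a", "bb", "ccc"]))   # -> [1, 2, 3]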
But I will not use either of these for now, because there is a better option.
Starting from Python 3.2, the standard library provides the concurrent.futures module, which offers two classes, ThreadPoolExecutor and ProcessPoolExecutor. They are a further abstraction over threading and multiprocessing (here we only care about the thread pool). Besides scheduling threads for us automatically, they also let us:
- get a thread's (or task's) state and return value from the main thread (see the small sketch after this list);
- know in the main thread as soon as a thread completes;
- keep the coding interface for multithreading and multiprocessing consistent.
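Here is a small sketch of the first two points (the square function is a made-up task): submit returns a Future, and as_completed yields each Future the moment its task finishes, so the main thread can read states and return values right away.

from concurrent.futures import ThreadPoolExecutor, as_completed

def square(n):
    return n * n

with ThreadPoolExecutor(4) as executor:
    futures = [executor.submit(square, n) for n in range(5)]
    for fut in as_completed(futures):        # yields futures as they finish
        print(fut.done(), fut.result())      # state and return value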
So let's look at the code.
Code
It is easy to use:
from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED

def crawl_data(page):
    ...

pool = ThreadPoolExecutor(8)
logger.info('Thread pool started')
tasks = [pool.submit(crawl_data, page) for page in range(1, 100)]
wait(tasks, return_when=ALL_COMPLETED)
logger.info('Thread pool finished')
Analysis of the code above:
- crawl_data is the crawler function; its actual body is omitted here
- ThreadPoolExecutor(8) creates a thread pool with 8 threads running in parallel
- The list comprehension uses pool.submit to add the tasks to the thread pool
- wait waits for all tasks in the thread pool to finish
Besides pool.submit, the executor also supports a map method for adding tasks in bulk.
It is used like this:
pool = ThreadPoolExecutor(8)
pool.map(crawl_data, range(1,100))
The second argument of map is the iterable of arguments to pass to the task function, so as many tasks are created as there are items in it~
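One detail worth noting (shown below with the same crawl_data setup): unlike submit, map also hands back the return values, as an iterator in the same order as the inputs.

results = list(pool.map(crawl_data, range(1, 100)))   # results are ordered like the inputs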
It has been very stable in my tests, haha. Using what the standard library already offers really is convenient~
Reference material
- Proxy pool recommendations: https://suyin-blog.club/2021/2G4HXBY/
- The past and present of python threadpool: https://zhangchenchen.github.io/2017/05/18/python-thread-pool/
- https://www.delftstack.com/zh/howto/python/python-threadpool-differences/
- [python] ThreadPoolExecutor thread pool: https://www.jianshu.com/p/b9b3d66aa0be