Preface
Crawlers and anti-crawler measures are a classic pair of spear and shield. A very common anti-crawler technique is IP blocking: when one IP visits too frequently in a short time, it gets rate-limited or blacklisted. My earlier backend-development posts also touch on this.
But today we are on the crawler side, so the solution is a proxy pool: send each request through a different IP, add User-Agent spoofing on top, and the traffic looks like completely normal user behavior, which avoids both rate limiting and blacklist-based blocking.
A crawler is also an I/O-bound program: running the whole process in a single thread is very slow, so multithreading can improve collection efficiency. But managing threads by hand is too much trouble, so I chose a thread pool~
Proxy pool
A complete proxy pool should be able to do the following:
- Collect proxies in bulk (or import purchased proxies through an interface, though free proxies are fine for occasional use)
- Automatically verify that proxies work after collecting them
- Store the proxies that pass validation
- Provide an interface for fetching a random proxy
- Provide management interfaces for adding and deleting proxies
Building this wheel yourself is too much trouble; the whole point of using Python is "Life is short, I use Python", and the community does not disappoint: there are plenty of open-source, easy-to-use proxy pool projects in Python. Here I chose one with 14k+ stars on GitHub, called ProxyPool.
It worked well in my tests!
There are of course many other proxy pool projects; I have not tested them, but interested readers can check the first link in the references.
Deployment and running
Project address :https://github.com/jhao104/proxy_pool
The official documentation offers two deployment methods, running from downloaded source and Docker. Since Docker is an option, Docker is of course the most convenient!
The official docker command is not quite convenient enough on its own, though, because the proxy pool also depends on a Redis service, so I wrote a docker-compose configuration for it:
version: "3"
services:
redis:
image: redis
expose:
- 6379
web:
restart: always
image: jhao104/proxy_pool
environment:
- DB_CONN=redis://redis:6379/0
ports:
- "5010:5010"
depends_on:
- redis
Save this configuration in a folder, then run the command to start the Docker containers:
docker-compose up
The port configured here is 5010, the same as the official default; change it if you need to~
Once the project is running, open http://127.0.0.1:5010 in a browser to see all the available endpoints; as the names suggest, each one is easy to understand.
{
    "url": [
        {
            "desc": "get a proxy",
            "params": "type: ''https'|''",
            "url": "/get"
        },
        {
            "desc": "get and delete a proxy",
            "params": "",
            "url": "/pop"
        },
        {
            "desc": "delete an unable proxy",
            "params": "proxy: 'e.g. 127.0.0.1:8080'",
            "url": "/delete"
        },
        {
            "desc": "get all proxy from proxy pool",
            "params": "type: ''https'|''",
            "url": "/all"
        },
        {
            "desc": "return proxy count",
            "params": "",
            "url": "/count"
        }
    ]
}
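As a quick sanity check (assuming the docker-compose setup above with the default port 5010), you can hit two of these endpoints from Python and see the pool answer:

import requests

# Ask the pool for one random proxy; the address comes back in the "proxy" field
print(requests.get("http://127.0.0.1:5010/get/").json())

# See how many proxies the pool currently holds
print(requests.get("http://127.0.0.1:5010/count").text)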
Using it in code
Because the proxy pool exposes an HTTP interface, it can in theory be used from any language.
Here I use Python.
Get a random proxy
Here I wrote two functions that wrap fetching a random proxy and deleting a proxy:
import requests

PROXY_POOL_URL = 'http://127.0.0.1:5010'

def get_proxy():
    proxy = requests.get(f"{PROXY_POOL_URL}/get/").json().get("proxy")
    return {'http': proxy, 'https': proxy}

def delete_proxy(proxy):
    requests.get(f"{PROXY_POOL_URL}/delete/?proxy={proxy}")
Get a random header
The fake_useragent library generates random User-Agent strings, simulating requests from different browsers.
from fake_useragent import UserAgent

ua = UserAgent()

def get_header():
    return {
        "Accept": "application/json, text/plain, */*",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7,th;q=0.6",
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        "Pragma": "no-cache",
        "User-Agent": ua.random
    }
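A quick check that the header changes between calls (output will vary, since fake_useragent picks a random browser each time):

for _ in range(3):
    print(get_header()["User-Agent"])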
Wrapping the network request
Since we did not buy paid proxies, we rely on the free ones the pool collects automatically, and as everyone knows the quality of free proxies is not guaranteed. So I added retry logic: once the number of failures exceeds the maximum retry count, that proxy is deleted from the pool and the request starts over with a new one~
The maximum number of retries is configured with the MAX_RETRY_COUNT variable.
import logging
from typing import Tuple

from requests import Response

logger = logging.getLogger(__name__)  # any logger works here

MAX_RETRY_COUNT = 5

def request_get(url) -> Tuple[Response, str]:
    retry_count = 1
    proxy = get_proxy()
    while retry_count <= MAX_RETRY_COUNT:
        logger.debug(f'Attempt {retry_count} - URL {url} - proxy {proxy.get("http")}')
        try:
            resp = requests.get(url, proxies=proxy, headers=get_header(), timeout=15)
            return resp, proxy.get('http')
        except Exception:
            logger.error(f'Request failed - URL {url}')
            retry_count += 1
    # All retries failed: delete this proxy from the pool and start over with a new one
    logger.warning(f'All {MAX_RETRY_COUNT} attempts failed - deleting proxy {proxy.get("http")}')
    delete_proxy(proxy.get('http'))
    return request_get(url)
This function returns a tuple of type (Response, str). Different requests may need the data in different formats, so it does not return resp.json() or resp.text; you call the function to get the raw Response and process it yourself.
It also returns the proxy server address as a str in ip:port form.
It is called like this: resp, proxy = request_get(url)
request_get only wraps the most basic way of fetching data, and the data you get back may still not be correct or usable, for example when rate limiting or a blacklist has been triggered and the response comes back empty. In that case, add a check after calling the function: if you conclude that the proxy IP has been banned, call delete_proxy to remove it from the pool.
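As a sketch of that idea (the status-code check and the wrapper name fetch_page are my own placeholders, not part of the original code), such a validation layer could look like this:

def fetch_page(url):
    resp, proxy = request_get(url)
    # Hypothetical ban check: a rate-limit status or an empty body suggests the proxy is blocked
    if resp.status_code in (403, 429) or not resp.text:
        delete_proxy(proxy)        # drop the banned proxy from the pool
        return fetch_page(url)     # try again with a fresh proxy
    return resp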
Thread pool
As mentioned in the preface, a crawler is an I/O-bound program: running everything in a single thread is very slow, so multithreading can improve collection efficiency, but managing threads by hand is too much trouble, which is why I chose a thread pool~
A thread pool is a group of pre-instantiated idle threads standing by to accept work. Creating a new thread object for every task that needs to run asynchronously is expensive; with a thread pool you add tasks to a queue and the pool assigns an available thread to each one. Thread pools help avoid creating and destroying threads unnecessarily.
I previously used the threadpool pip package to implement thread pools. It felt fine, but the crawler would occasionally hang for no obvious reason, and I never figured out what went wrong. Later I read online that threadpool is better suited to CPU-bound work...
PS: I looked at the threadpool source code; impressively, it implements a full thread pool in 421 lines of code. It is built on top of the threading module, which is acceptable.
This time I will use the thread pool implementations that come with the Python standard library. In fact, the standard library contains two kinds of "pool":
- multiprocessing.Pool
- multiprocessing.pool.ThreadPool
The similarity and difference between the two: multiprocessing.pool.ThreadPool behaves exactly like multiprocessing.Pool; the difference is that ThreadPool runs the worker logic in threads, while multiprocessing.Pool uses worker processes.
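A tiny illustration of that point (the toy len task is just for demonstration): the two classes expose the same map-style API, only the workers differ.

from multiprocessing.pool import ThreadPool

# Same interface as multiprocessing.Pool, but the 4 workers here are threads
with ThreadPool(4) as p:
    print(p.map(len, ["a", "bb", "ccc"]))   # -> [1, 2, 3]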
But I will not use either of these for now, because there is a better option.
Starting from Python 3.2, the standard library provides the concurrent.futures module, which offers two classes, ThreadPoolExecutor and ProcessPoolExecutor. They are a further abstraction over threading and multiprocessing (here we only care about the thread pool). Besides scheduling threads for us automatically, they also let us:
- get a thread's (or task's) state and return value from the main thread (see the small sketch after this list);
- know in the main thread as soon as a thread completes;
- keep the coding interface for multithreading and multiprocessing consistent.
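Here is a small sketch of the first two points (the square function is a made-up task): submit returns a Future, and as_completed yields each Future the moment its task finishes, so the main thread can read states and return values right away.

from concurrent.futures import ThreadPoolExecutor, as_completed

def square(n):
    return n * n

with ThreadPoolExecutor(4) as executor:
    futures = [executor.submit(square, n) for n in range(5)]
    for fut in as_completed(futures):        # yields futures as they finish
        print(fut.done(), fut.result())      # state and return value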
So let's look at the code.
Code
It is easy to use:
from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED

def crawl_data(page):
    ...

pool = ThreadPoolExecutor(8)
logger.info('Thread pool started')
tasks = [pool.submit(crawl_data, page) for page in range(1, 100)]
wait(tasks, return_when=ALL_COMPLETED)
logger.info('Thread pool finished')
Analysis of the code above:
- crawl_data is the crawler function; its actual body is omitted here
- ThreadPoolExecutor(8) creates a thread pool with 8 threads running in parallel
- The list comprehension uses pool.submit to add the tasks to the thread pool
- wait waits for all tasks in the thread pool to finish
Besides pool.submit, the executor also supports a map method for adding tasks in bulk.
It is used like this:
pool = ThreadPoolExecutor(8)
pool.map(crawl_data, range(1,100))
The second argument of map is the iterable of arguments to pass to the task function, so as many tasks are created as there are items in it~
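One detail worth noting (shown below with the same crawl_data setup): unlike submit, map also hands back the return values, as an iterator in the same order as the inputs.

results = list(pool.map(crawl_data, range(1, 100)))   # results are ordered like the inputs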
It has been very stable in my tests, haha. Using what the standard library already offers really is convenient~
Reference material
- Proxy pool recommendations: https://suyin-blog.club/2021/2G4HXBY/
- The past and present of python threadpool: https://zhangchenchen.github.io/2017/05/18/python-thread-pool/
- https://www.delftstack.com/zh/howto/python/python-threadpool-differences/
- [python] ThreadPoolExecutor thread pool: https://www.jianshu.com/p/b9b3d66aa0be