
Crawler notes: improving data collection efficiency with a proxy pool and a thread pool

2022-07-06 04:13:00 Programming Laboratory

Preface

Crawlers and anti-crawler measures are spear and shield. A very common anti-crawling technique is IP banning: if one IP visits too frequently in a short time, it gets rate-limited or blacklisted. My earlier backend-development posts also touched on that side of things.

Today, though, we're on the crawler side, so the answer is a proxy pool: send each request through a different IP, add User-Agent spoofing, and the traffic looks like perfectly normal user behavior, sidestepping rate limits and blacklists.

A crawler is also an IO-bound program. Running everything in a single thread is slow, so multithreading can speed up data collection, but managing threads by hand is tedious, which is why I chose a thread pool ~

Proxy pool

A complete proxy pool should provide the following features:

  • Collect proxies in bulk (or import purchased proxies through an API, though free ones are fine for occasional use)
  • Automatically validate proxies after collection
  • Store the valid proxies
  • Expose an interface for fetching a random proxy
  • Expose management interfaces for deleting and adding proxies

Building the wheel myself is too much trouble; the whole point of using Python is "life is short, I use Python". The community didn't disappoint: there are plenty of open-source, easy-to-use proxy pool projects in Python. Here I picked one with 14k+ stars on GitHub, called ProxyPool.

It held up well in my trial!

Of course there are many other proxy pool projects. I haven't tested them; interested readers can check the first link in the references.

Deployment and running

Project address :https://github.com/jhao104/proxy_pool

The official docs offer two deployment methods: running from source and Docker. Since Docker is available, Docker is of course the most convenient!

The official docker command alone isn't quite enough, though, because the proxy pool also depends on a Redis service, so I wrote a docker-compose configuration to use:

version: "3"
services:
  redis:
    image: redis
    expose:
      - 6379

  web:
    restart: always
    image: jhao104/proxy_pool
    environment:
      - DB_CONN=redis://redis:6379/0
    ports:
      - "5010:5010"
    depends_on:
      - redis

Save it in a directory of your choice, then run the command to start the containers:

docker-compose up

I kept the port at 5010, same as the official docs; change it if you need to ~

Once the project is up, open http://127.0.0.1:5010 in a browser and you get a listing of all the interfaces; each one is pretty self-explanatory.

{
  "url": [
    {
      "desc": "get a proxy",
      "params": "type: ''https'|''",
      "url": "/get"
    },
    {
      "desc": "get and delete a proxy",
      "params": "",
      "url": "/pop"
    },
    {
      "desc": "delete an unable proxy",
      "params": "proxy: 'e.g. 127.0.0.1:8080'",
      "url": "/delete"
    },
    {
      "desc": "get all proxy from proxy pool",
      "params": "type: ''https'|''",
      "url": "/all"
    },
    {
      "desc": "return proxy count",
      "params": "",
      "url": "/count"
    }
  ]
}
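Before moving on, a quick sanity check (in Python, since that's what the rest of this article uses) confirms that the pool has started collecting proxies. This is just a sketch against the /count and /get endpoints shown in the listing above, assuming the pool is running on port 5010 as configured.

# Quick sanity check against the endpoints listed above (pool assumed at 127.0.0.1:5010)
import requests

base = "http://127.0.0.1:5010"
print(requests.get(f"{base}/count").text)    # how many proxies the pool currently holds
print(requests.get(f"{base}/get/").json())   # one random proxy record, e.g. {"proxy": "1.2.3.4:8080", ...}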

Using it in code

Since the proxy pool exposes an HTTP API, it can in theory be used from any language.

Here I'll write the client code in Python.

Get random proxy

I wrote two functions that wrap fetching a random proxy and deleting a proxy:

import requests

PROXY_POOL_URL = 'http://127.0.0.1:5010'

def get_proxy():
    proxy = requests.get(f"{PROXY_POOL_URL}/get/").json().get("proxy")
    return {'http': proxy, 'https': proxy}

def delete_proxy(proxy):
    requests.get(f"{PROXY_POOL_URL}/delete/?proxy={proxy}")

Get random Header

The fake_useragent library generates random User-Agent strings to simulate requests from different browsers:

from fake_useragent import UserAgent

ua = UserAgent()

def get_header():
    return {
        "Accept": "application/json, text/plain, */*",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7,th;q=0.6",
        "Cache-Control": "no-cache",
        "Connection": "keep-alive",
        "Pragma": "no-cache",
        "User-Agent": ua.random
    }
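Each call to ua.random picks a different browser User-Agent; a quick standalone check (the exact strings will vary per run):

# Each call to ua.random yields a different browser User-Agent
from fake_useragent import UserAgent

ua = UserAgent()
for _ in range(3):
    print(ua.random)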

Network request encapsulation

Since I didn't buy a paid proxy service, I'm using the free proxies the pool collects automatically. As everyone knows, the quality of free proxies is not guaranteed, so I added retry logic: once a proxy fails more times than the maximum retry count, it is deleted from the pool and the whole fetch starts over with a new proxy ~

The maximum number of retries is configured via the MAX_RETRY_COUNT variable.

import logging
from typing import Tuple

import requests
from requests import Response

# get_proxy, get_header and delete_proxy are the helpers defined above
logger = logging.getLogger(__name__)  # stand-in logger; use whatever logging setup you prefer

MAX_RETRY_COUNT = 5

def request_get(url) -> Tuple[Response, str]:
    retry_count = 1
    proxy = get_proxy()
    while retry_count <= MAX_RETRY_COUNT:
        logger.debug(f'Attempt {retry_count} - URL {url} - proxy {proxy.get("http")}')
        try:
            resp = requests.get(url, proxies=proxy, headers=get_header(), timeout=15)
            return resp, proxy.get('http')
        except Exception:
            logger.error(f'Request failed - URL {url}')
            retry_count += 1
    # All retries failed: remove this proxy from the pool and start over with a new one
    logger.warning(f'{MAX_RETRY_COUNT} consecutive failures - deleting proxy {proxy.get("http")}')
    delete_proxy(proxy.get('http'))
    return request_get(url)

The function returns a (Response, str) tuple. Different requests may yield data in different formats, so it does not return resp.json() or resp.text; call it to get the raw response and process it yourself.

It also returns the proxy server address as a str in ip:port form.

It is called like this: resp, proxy = request_get(url)

request_get only does the most basic fetch, so the data that comes back may still be unusable, for example when rate limiting or a blacklist kicks in and the response is empty. In that case, add a check after calling the function, and if the proxy IP appears to have been banned, call delete_proxy to remove it, as in the sketch below.
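For instance, a post-check might look like this. The status code and JSON shape here are hypothetical; adapt the condition to whatever your target site actually returns.

# Hypothetical post-check; the exact condition depends on your target site
url = "http://example.com/api"                 # placeholder URL
resp, proxy = request_get(url)
if resp.status_code != 200 or not resp.json().get("data"):
    # The proxy was probably rate-limited or blacklisted: discard it and fetch again
    delete_proxy(proxy)
    resp, proxy = request_get(url)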

Thread pool

As mentioned in the preface, a crawler is an IO-bound program: single-threaded execution is slow, multithreading speeds up collection, and a thread pool saves us from managing the threads ourselves ~

A thread pool is a group of pre-instantiated idle threads standing by to accept work. Creating a new thread object for every asynchronous task is expensive; with a thread pool you add tasks to a task queue and the pool assigns an available thread to each task, avoiding the cost of constantly creating and destroying threads.
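A tiny illustration of that reuse, jumping ahead to the standard-library executor this article settles on below: 20 tasks are served by only 4 pooled worker threads, so the same thread names repeat across tasks.

# 20 tasks share 4 pooled worker threads; thread names repeat across tasks
import threading
from concurrent.futures import ThreadPoolExecutor

def task(i):
    return f"task {i} ran on {threading.current_thread().name}"

with ThreadPoolExecutor(max_workers=4) as pool:
    for line in pool.map(task, range(20)):
        print(line)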

I previously used the threadpool pip package to implement a thread pool. It felt fine, but the crawler would occasionally hang for reasons I never pinned down; material I read online later suggested that threadpool is better suited to CPU-bound work…

PS: I read the threadpool source; impressively, it implements a thread pool in 421 lines of code ~

It's built on the threading module, which is acceptable.

This time I'll use the thread pool implementations that ship with the Python standard library. In fact, the standard library has two kinds of "pool":

  • multiprocessing.Pool
  • multiprocessing.pool.ThreadPool

Similarities and differences between the two:

multiprocessing.pool.ThreadPool behaves just like multiprocessing.Pool. The difference is that multiprocessing.pool.ThreadPool runs the worker logic in threads, while multiprocessing.Pool uses worker processes.
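A minimal side-by-side sketch of that identical interface (square is just a toy function, not part of the crawler):

# Same map() interface; ThreadPool runs workers in threads, Pool in processes
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool

def square(x):
    return x * x

if __name__ == "__main__":
    with ThreadPool(4) as tp:
        print(tp.map(square, range(5)))    # [0, 1, 4, 9, 16], computed in threads
    with Pool(4) as pp:
        print(pp.map(square, range(5)))    # same result, computed in worker processes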

But I don't need either of them for now, because there's a better option.

Starting from Python 3.2, the standard library provides the concurrent.futures module, with the ThreadPoolExecutor and ProcessPoolExecutor classes. They are a further abstraction on top of threading and multiprocessing (here we only care about the thread pool). Besides scheduling threads for us automatically, they also make it possible to:

  1. Get the state of a thread (or task) from the main thread, along with its return value.
  2. Be notified in the main thread as soon as a thread finishes.
  3. Use the same coding interface for multithreading and multiprocessing.

So let's look at the code

Code

It's easy to use:

from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED

def crawl_data(page):
    ...

pool = ThreadPoolExecutor(8)
logger.info('Thread pool started')
tasks = [pool.submit(crawl_data, page) for page in range(1, 100)]
wait(tasks, return_when=ALL_COMPLETED)
logger.info('Thread pool finished')

Breaking down the code above:

  • crawl_data is the crawler function; its body is omitted here
  • ThreadPoolExecutor(8) creates a thread pool with 8 worker threads running in parallel
  • The list comprehension with pool.submit adds the tasks to the thread pool
  • wait blocks until all tasks in the pool have finished
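If you also want the return values (points 1 and 2 in the earlier list), each submit call returns a Future you can read. A minimal sketch with as_completed, assuming crawl_data returns the scraped data for a page:

# Collect return values as tasks finish; assumes crawl_data returns the page's data
from concurrent.futures import ThreadPoolExecutor, as_completed

with ThreadPoolExecutor(8) as pool:
    futures = {pool.submit(crawl_data, page): page for page in range(1, 100)}
    for future in as_completed(futures):
        page = futures[future]
        try:
            data = future.result()          # re-raises any exception raised inside the task
            print(f'page {page} done ({type(data).__name__})')
        except Exception as exc:
            print(f'page {page} failed: {exc}')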

Besides pool.submit, the map method is also supported for adding tasks in batches.

It's used like this:

pool = ThreadPoolExecutor(8)
pool.map(crawl_data, range(1,100))

The second argument of map is the iterable of arguments to pass to the task function, so as many tasks are created as there are items in it ~
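map also returns the tasks' results as an iterator, in the same order as the input, so following the snippet above you can collect them all at once:

# Results come back in input order, regardless of which task finished first
results = list(pool.map(crawl_data, range(1, 100)))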

In testing it has been very stable, haha. Sticking with the standard library really is the easy way ~

References
