
5. Scrapy middleware & distributed crawler

2022-08-04 01:33:00 Python_21.

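1. Scrapy middleware

Scrapy has two kinds of middleware:
1. Spider middleware: sits between the spider and the engine. Its main job is to process the spider's input (responses) and output (requests and items). (Rarely used.)
2. Downloader middleware: sits between the engine and the downloader. It is used to add proxies, add request headers, and integrate Selenium. (Used often.)
Both kinds live in the Scrapy project's middlewares.py file and must be enabled in settings.py before use.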

1.1 Spider middleware

A spider middleware must be enabled in the settings before it can be used.
# settings.py
SPIDER_MIDDLEWARES = {
    # middleware class path: order value (priority)
    'cnblogs.middlewares.CnblogsSpiderMiddleware': 543,
}
# middlewares.py
from scrapy import signals


class CnblogsSpiderMiddleware:
    """Not all methods need to be defined. If a method is not defined,
    Scrapy acts as if the spider middleware does not modify the passed objects."""

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy uses this method to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        """Called for each response that goes through the spider middleware
        and into the spider. Should return None or raise an exception."""
        return None

    def process_spider_output(self, response, result, spider):
        """Called with the results returned from the spider after it has
        processed the response. Must return an iterable of Request or item objects."""
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        """Called when a spider or a process_spider_input() method (from
        another spider middleware) raises an exception. Should return None
        or an iterable of Request or item objects."""
        pass

    def process_start_requests(self, start_requests, spider):
        """Called with the spider's start requests. Works like
        process_spider_output(), except that there is no associated response.
        Must return only requests (not items)."""
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

1.2 Downloader middleware

A downloader middleware must be enabled in the settings before it can be used.
# settings.py
DOWNLOADER_MIDDLEWARES = {
    
   'cnblogs.middlewares.CnblogsDownloaderMiddleware': 543,
}
# middlewares.py
from scrapy import signals


class CnblogsDownloaderMiddleware:
    """Not all methods need to be defined. If a method is not defined,
    Scrapy acts as if the downloader middleware does not modify the passed objects."""

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy uses this method to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        """Called for each request that goes through the downloader middleware.
        Must either:
        - return None: continue processing this request
        - return a Response object
        - return a Request object
        - raise IgnoreRequest: the process_exception() methods of the
          installed downloader middlewares will be called"""
        return None

    def process_response(self, request, response, spider):
        """Called with the response returned from the downloader.
        Must either:
        - return a Response object
        - return a Request object
        - raise IgnoreRequest"""
        return response

    def process_exception(self, request, exception, spider):
        """Called when a download handler or a process_request() method (from
        another downloader middleware) raises an exception.
        Must either:
        - return None: continue processing this exception
        - return a Response object: stops the process_exception() chain
        - return a Request object: stops the process_exception() chain"""
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
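
The number next to each middleware in DOWNLOADER_MIDDLEWARES decides where it sits in the chain: Scrapy calls process_request() in increasing order of that number and process_response() in decreasing order. The sketch below imitates this with plain Python only. TraceMiddleware and run_chain are hypothetical names made up for illustration, not Scrapy APIs, and the order value 100 is an arbitrary assumption.

```python
# A plain-Python imitation of downloader middleware ordering.
# Nothing here is Scrapy API; every name is hypothetical.

class TraceMiddleware:
    def __init__(self, name, trace):
        self.name = name
        self.trace = trace

    def process_request(self, request):
        self.trace.append(f"request:{self.name}")
        return None  # None means "keep going down the chain"

    def process_response(self, request, response):
        self.trace.append(f"response:{self.name}")
        return response


def run_chain(middlewares, request, download):
    """middlewares maps a middleware object to its order value, like the settings dict."""
    ordered = sorted(middlewares, key=middlewares.get)
    for mw in ordered:                # engine -> downloader: ascending order
        mw.process_request(request)
    response = download(request)
    for mw in reversed(ordered):      # downloader -> engine: descending order
        response = mw.process_response(request, response)
    return response


trace = []
chain = {
    TraceMiddleware("b", trace): 543,
    TraceMiddleware("a", trace): 100,
}
result = run_chain(chain, "https://www.cnblogs.com/", lambda req: f"body of {req}")
print(trace)
```

The middleware with the smaller number sees the request first and the response last, which is why a header-setting middleware usually wants a low number.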

1.3 Setting up a test environment

* 1. Create the project
    Command: scrapy startproject test1 C:\Users\13600\Desktop\synchro\Project\test1
New Scrapy project 'test1', using template directory 'c:\program\python38\lib\site-packages\scrapy\templates\project', created in:
   C:\Users\13600\Desktop\synchro\Project\test1

You can start your first spider with:
   cd C:\Users\13600\Desktop\synchro\Project\test1
   scrapy genspider example example.com

* 2. Open the project in PyCharm
* 3. Create the spider script
PS C:\Users\13600\Desktop\synchro\Project\test1> scrapy genspider cnblog www.cnblogs.com 
Created spider 'cnblog' using template 'basic' in module:
 test1.spiders.cnblog
* 4. Create a startup script, main.py, in the project directory
# main.py
from scrapy.cmdline import execute

execute(['scrapy', 'crawl', 'cnblog'])
* 5. Set the log level in the settings file
# settings.py
LOG_LEVEL = 'ERROR'
* 6. Enable the downloader middleware in settings.py
# settings.py
DOWNLOADER_MIDDLEWARES = {
    
   'test1.middlewares.Test1DownloaderMiddleware': 543,
}
* 7. Add test code to the downloader middleware
# middlewares.py
class Test1DownloaderMiddleware:
    ...

    # request handling
    def process_request(self, request, spider):
        print(request.url)
        return None

# If robots.txt handling (ROBOTSTXT_OBEY) is not turned off in settings.py, four requests are made:
#   http://www.cnblogs.com/robots.txt
#   https://www.cnblogs.com/robots.txt
#   http://www.cnblogs.com/
#   https://www.cnblogs.com/
# The crawler fetches robots.txt first. If the http request cannot get the data, the request is sent again over https.
* 8. Stop obeying robots.txt
# settings.py
ROBOTSTXT_OBEY = False
* 9. Change the spider's start_urls attribute to use the https protocol.
# cnblog.py

start_urls = ['https://www.cnblogs.com/']
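
For context on what step 8 switches off: with ROBOTSTXT_OBEY enabled, the crawler downloads robots.txt and checks each URL against its rules before fetching it. The standard library's urllib.robotparser can illustrate that kind of check offline. This is only a sketch: SAMPLE_ROBOTS_TXT and the "my-crawler" agent name are made-up assumptions, not the real file served by cnblogs.com, and Scrapy uses its own robots.txt handling rather than this module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# parse() accepts the file content as a list of lines, so no network access is needed.
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

allowed_home = parser.can_fetch("my-crawler", "https://www.cnblogs.com/")
allowed_private = parser.can_fetch("my-crawler", "https://www.cnblogs.com/private/page")
print(allowed_home, allowed_private)
```

Under these sample rules the home page may be fetched and anything below /private/ may not.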

1.4 Using a random request header

* 1. Read the request headers from the headers attribute of the request object.
 It is a dict-like object whose values are lists of byte strings.
# middlewares.py

class Test1DownloaderMiddleware:
    ...

    # request handling
    def process_request(self, request, spider):
        print(request.headers)
        print(request.headers['User-Agent'])
        return None

# Output:
# {
#     b'Accept': [b'text/html, application/xhtml+xml, application/xml;q=0.9, */*;q=0.8'],
#     b'Accept-Language': [b'en'],
#     b'User-Agent': [b'Scrapy/2.6.2 (+https://scrapy.org)']
# }
# b'Scrapy/2.6.2 (+https://scrapy.org)'
# The default User-Agent is Scrapy/2.6.2 ..., which immediately gives the crawler away.

* 2. Use the fake_useragent module to generate a random User-Agent string.
pip install fake_useragent
# middlewares.py
from fake_useragent import UserAgent


class Test1DownloaderMiddleware:
    ...

    # request handling
    def process_request(self, request, spider):
        # overwrite the default User-Agent with a random one
        request.headers['User-Agent'] = UserAgent().random
        return None
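
If installing fake_useragent is not an option, the same idea can be sketched with a hand-maintained pool and the standard library. Everything below is an assumption made for illustration: USER_AGENT_POOL, RandomUserAgentSketch and the SimpleNamespace stand-in for a Scrapy request are hypothetical, and a real middleware would receive a real Request object.

```python
import random
from types import SimpleNamespace

# Hypothetical pool; a real project would keep more (and more current) strings.
USER_AGENT_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/103.0.5060.134 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/15.5 Safari/605.1.15",
]


class RandomUserAgentSketch:
    def __init__(self, pool, rng=None):
        self.pool = list(pool)
        # An injectable random generator keeps the choice reproducible in tests.
        self.rng = rng or random.Random()

    def process_request(self, request, spider):
        request.headers["User-Agent"] = self.rng.choice(self.pool)
        return None  # let the request continue through the chain


# Stand-in for a request: only the headers mapping is needed here.
fake_request = SimpleNamespace(headers={})
RandomUserAgentSketch(USER_AGENT_POOL, random.Random(0)).process_request(fake_request, spider=None)
print(fake_request.headers["User-Agent"])
```

Building the pool once in the constructor also avoids creating a new generator object on every request.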

1.5 Adding a random cookie

# middlewares.py
import random


class Test1DownloaderMiddleware:
    ...

    # request handling
    def process_request(self, request, spider):
        # cookie pool
        cookie_list = [
            {'username': 'xx'},
            {'username': 'oo'},
            # ...
        ]

        # the attribute is request.cookies (plural); pick one entry at random
        request.cookies = random.choice(cookie_list)

        return None

1.6 Adding a proxy IP

# middlewares.py
class Test1DownloaderMiddleware:

    # request handling
    def process_request(self, request, spider):
        print(request.meta)
        # {'download_timeout': 180.0}  By default only the download timeout is present

        # To use a proxy, add a 'proxy' key to the request's meta dict.
        # (If the proxy fails, the request is retried.)
        request.meta['proxy'] = 'https://ip:port'
        return None

1.7 Integrating Selenium

Workflow (the spider and the middleware share one browser object for the whole run; the middleware just opens a different address each time):
1. Integrate Selenium in the spider script and create the browser object there.
2. Use the browser object in the downloader middleware's request-handling method.
3. Close the browser object in the spider script.
* 1. Copy chromedriver.exe (the Chrome driver) into the Scrapy project directory.
* 2. Create the browser object in the spider script
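import scrapy
from selenium import webdriver


class CnblogSpider(scrapy.Spider):
    name = 'cnblog'
    allowed_domains = ['www.cnblogs.com']
    start_urls = ['https://www.cnblogs.com/']

    # Integrate Selenium: one browser object shared by the whole spider.
    # (executable_path is accepted by Selenium 3 and early Selenium 4 releases;
    # later releases expect a Service object instead.)
    bro = webdriver.Chrome(executable_path='chromedriver.exe')

    # parse the data
    def parse(self, response):
        print(response.text)

    # closed() is called when the spider finishes
    def closed(self, reason):
        # quit() closes the browser and ends the chromedriver process
        self.bro.quit()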

* 3. Use the browser object in the downloader middleware
# middlewares.py
from scrapy.http import HtmlResponse


class Test1DownloaderMiddleware:
    ...

    # request handling
    def process_request(self, request, spider):
        spider.bro.get(request.url)
        spider.bro.implicitly_wait(10)
        # print(spider.bro.page_source)

        # The page was fetched here, so return a response object instead of None.
        # HtmlResponse is the response type the spider's parse() receives;
        # it takes the url, the body and the request object.
        # body must be bytes, hence .encode('utf-8'); encoding tells Scrapy how to decode it.
        response = HtmlResponse(request.url, body=spider.bro.page_source.encode('utf-8'),
                                encoding='utf-8', request=request)
        return response


1.8 Caveats

The url attribute of a request cannot be modified directly in a middleware.
Doing so raises:
AttributeError: Request.url is not modifiable, use Request.replace() instead

2. Deduplication source code

Scrapy has built-in deduplication: a URL that has already been crawled is not crawled again.
The deduplication class is configured in settings.py.
# from scrapy.dupefilters import RFPDupeFilter
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'
# The deduplication class inherits from BaseDupeFilter
class RFPDupeFilter(BaseDupeFilter):
    ...

    def request_seen(self, request: Request) -> bool:
        # every yielded request is hashed into a fingerprint (fp)
        fp = self.request_fingerprint(request)
        # if fp is already in the set, the request is not crawled again
        if fp in self.fingerprints:
            return True
        # add fp to the set
        self.fingerprints.add(fp)
        # write it to the file
        if self.file:
            self.file.write(fp + '\n')
        return False
Using the request_fingerprint function
Create a new .py file in the project directory for testing.
from scrapy.utils.request import request_fingerprint
from scrapy import Request

# create two Request objects first
url_1 = Request('https://www.baidu.con/xx?name=kid&age=18')
url_2 = Request('https://www.baidu.con/xx?age=18&name=kid')

fingerprint_1 = request_fingerprint(url_1)
fingerprint_2 = request_fingerprint(url_2)

print(fingerprint_1)  # d3625990212837cb7ef7a02c4ccd8859daa24b82
print(fingerprint_2)  # d3625990212837cb7ef7a02c4ccd8859daa24b82

# Parameter order: ?name=kid&age=18 and ?age=18&name=kid
# The query parameters are split and sorted alphabetically, a fingerprint (a SHA-1 hash)
# is computed from the result, and the value is compared against the set.
Each fingerprint is a 40-character string, so when the number of crawled URLs reaches the billions, the set takes up a lot of memory.
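
The effect of sorting the query parameters can be reproduced with the standard library alone. The function below is a simplified stand-in written for illustration, not Scrapy's actual algorithm: simple_fingerprint is a hypothetical name, it only hashes the method and the canonicalized URL, and its digests will not match the ones printed by request_fingerprint.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def simple_fingerprint(url, method="GET"):
    """Hash the method plus the URL after sorting its query parameters."""
    parts = urlsplit(url)
    # Sorting the (key, value) pairs makes the parameter order irrelevant.
    sorted_query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    canonical = urlunsplit((parts.scheme, parts.netloc, parts.path, sorted_query, ""))
    return hashlib.sha1(f"{method} {canonical}".encode("utf-8")).hexdigest()


fp_1 = simple_fingerprint("https://www.baidu.con/xx?name=kid&age=18")
fp_2 = simple_fingerprint("https://www.baidu.con/xx?age=18&name=kid")
fp_3 = simple_fingerprint("https://www.baidu.con/xx?age=19&name=kid")

print(fp_1 == fp_2)  # same parameters in a different order
print(fp_1 == fp_3)  # a different parameter value
print(len(fp_1))     # length of a SHA-1 hex digest
```

Because the hash is taken over the canonical form, two URLs that differ only in parameter order collapse to one entry in the set.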

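3. Bloom filter

3.1 Introduction

A Bloom filter is a data structure that maps elements into a bit table through several hash functions. It can quickly tell whether an element is in a set,
and it is efficient in both space and time. (It is commonly used for URL deduplication in crawlers.)
How it works: a Bloom filter allocates a bit array of m bits, all set to 0 at the start. When an element arrives,
several hash functions (h1, h2, h3, ...) compute different hash values for it,
each hash value is used as an index into the bit array, and the bit at that index is changed from 0 to 1.
To test membership, the same indexes are checked: if any of those bits is still 0, the element was never added.
The hash functions must produce values in [0, m - 1].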

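A Bloom filter takes up little space and is fast, but its answers are probabilistic rather than exact: it can report an element as present when it was never added (a false positive), but never the reverse.
In theory, the more elements are added, the higher the probability of a false positive.
In addition, data stored in a Bloom filter is hard to delete.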
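
The principle described above fits in a few lines of plain Python. This is a minimal teaching sketch, not how pybloom_live is implemented: TinyBloomFilter is a hypothetical class, the sizes (1024 bits, 3 hashes) are arbitrary assumptions, and the "several hash functions" are imitated by salting SHA-256 with a seed.

```python
import hashlib


class TinyBloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # every bit starts at 0

    def _positions(self, item):
        # Derive num_hashes indexes in [0, num_bits - 1] from one salted hash.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)  # flip the bit to 1

    def __contains__(self, item):
        # Present only if every one of its bits is 1; may be a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


bf = TinyBloomFilter()
bf.add("https://www.baidu.com")
print("https://www.baidu.com" in bf)
```

Bits are only ever turned on, which is why an added element is always found again and why removing a single element is not possible without disturbing others that share its bits.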

3.2 Installing the module

* 1. Install the dependency
    pip install bitarray
* 2. Install the Bloom filter package
    pip install pybloom_live

3.3 Fixed capacity

# BloomFilter: fixed capacity
from pybloom_live import BloomFilter

# capacity
bf = BloomFilter(capacity=1000)

# test URLs
url_1 = 'https://www.baidu.com'
url_2 = 'https://cnblogs.com'

# add the URL to the filter
bf.add(url_1)

print(url_1 in bf)  # True
print(url_2 in bf)  # False
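
To get a feel for how small such a filter is, the textbook sizing formulas can be evaluated for the capacity used above: m = -n * ln(p) / (ln 2)^2 bits and k = (m / n) * ln 2 hash functions. This is only a sketch. bloom_size is a hypothetical helper, the error rate 0.001 is an assumed value, and pybloom_live may size its internal arrays differently.

```python
import math


def bloom_size(capacity, error_rate):
    """Textbook estimate of the bit count and hash count for a Bloom filter."""
    bits = math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2))
    hashes = max(1, round(bits / capacity * math.log(2)))
    return bits, hashes


bits, hashes = bloom_size(capacity=1000, error_rate=0.001)
print(bits, hashes)
print(bits // 8, "bytes for the filter vs", 1000 * 40, "characters of hex fingerprints")
```

With these inputs the estimate is well under 2 KB, while 1000 hex fingerprints of 40 characters each need 40,000 characters before any set overhead is counted.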

3.4 Automatic scaling

# ScalableBloomFilter: grows automatically
from pybloom_live import ScalableBloomFilter

# initial_capacity: initial capacity
# error_rate: error rate
# mode: growth mode, ScalableBloomFilter.LARGE_SET_GROWTH for fast-growing sets
bloom = ScalableBloomFilter(
    initial_capacity=100,
    error_rate=0.001,
    mode=ScalableBloomFilter.LARGE_SET_GROWTH
)

# test URLs
url_1 = 'https://www.baidu.com'
url_2 = 'https://cnblogs.com'

# add the URL to the Bloom filter
bloom.add(url_1)

print(url_1 in bloom)  # True
print(url_2 in bloom)  # False

4. Custom deduplication rule

* 1. Create a new Python file for the Bloom filter, bloom_deduplication.py, in the project directory
from scrapy.dupefilters import BaseDupeFilter
from pybloom_live import ScalableBloomFilter


# A custom deduplication class inherits from BaseDupeFilter.

# Following the built-in filter:
# create a Bloom filter in __init__
# and override the request_seen method.
class CustomDeduplication(BaseDupeFilter):
    def __init__(self):
        self.bloom = ScalableBloomFilter(
            initial_capacity=100,
            error_rate=0.001,
            mode=ScalableBloomFilter.LARGE_SET_GROWTH
        )

    def request_seen(self, request):
        # get the URL from the request
        url = request.url
        if url in self.bloom:
            return True
        self.bloom.add(url)
        # not seen before: let the request through
        return False

* 2. Set DUPEFILTER_CLASS in the settings file to use the custom deduplication class.
DUPEFILTER_CLASS = 'test1.bloom_deduplication.CustomDeduplication'
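
One consequence of keying the filter on request.url is that two URLs differing only in query parameter order count as different pages, unlike with the fingerprint shown in section 2. The sketch below illustrates the difference without Scrapy or pybloom_live. All names are hypothetical: SetBackedFilter uses a plain set in place of the Bloom filter, normalize_url is one possible normalization, and SimpleNamespace objects stand in for requests.

```python
from types import SimpleNamespace
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url(url):
    """Return the URL with its query parameters sorted."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


class SetBackedFilter:
    """Same request_seen contract as the class above; a set replaces the Bloom filter."""

    def __init__(self, key_func=lambda url: url):
        self.seen = set()
        self.key_func = key_func

    def request_seen(self, request):
        key = self.key_func(request.url)
        if key in self.seen:
            return True
        self.seen.add(key)
        return False


first = SimpleNamespace(url="https://www.cnblogs.com/xx?name=kid&age=18")
second = SimpleNamespace(url="https://www.cnblogs.com/xx?age=18&name=kid")

raw = SetBackedFilter()
raw_results = [raw.request_seen(first), raw.request_seen(second)]

normalized = SetBackedFilter(key_func=normalize_url)
normalized_results = [normalized.request_seen(first), normalized.request_seen(second)]

print(raw_results, normalized_results)
```

Normalizing the URL before the membership test is one way to keep the compact Bloom filter without losing that property.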

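5. Distributed crawler

5.1 Introduction

A distributed crawler runs one crawl job on many machines to improve crawling efficiency.
The key is a shared queue.
Out of the box, Scrapy's Scheduler maintains a task queue local to the machine
(holding Request objects, their callbacks and related information),
plus a local deduplication set (holding the URLs already visited).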


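So the key to distributed crawling is to run a shared queue on a dedicated host (using Redis), then rewrite Scrapy's Scheduler so that it pushes Requests to and pulls them from the shared queue and removes duplicate requests.
In summary:
1. A shared queue.
2. A rewritten Scheduler that accesses the shared queue both for deduplication and for fetching tasks.
3. A deduplication rule for the Scheduler (using the Redis set type).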
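
The three points above can be tried out in a single process before bringing Redis in. In the sketch below, a queue.Queue plays the role of the shared queue, a set plays the role of the Redis deduplication set, and three threads play the role of three machines. Everything is an assumption made for illustration: LINKS is a made-up link graph, and schedule and worker are hypothetical helpers, not scrapy-redis code.

```python
import queue
import threading

# In-process stand-ins for what Redis would hold in a real deployment.
shared_queue = queue.Queue()
seen = set()
seen_lock = threading.Lock()
crawled = []  # (worker_name, url) pairs

# Hypothetical link graph: page -> links found on that page.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/", "/c"],
    "/c": [],
}


def schedule(url):
    """Enqueue a URL unless some worker has already scheduled it."""
    with seen_lock:
        if url in seen:
            return
        seen.add(url)
    shared_queue.put(url)


def worker(name):
    while True:
        url = shared_queue.get()
        if url is None:  # sentinel: no more work
            shared_queue.task_done()
            return
        crawled.append((name, url))
        for link in LINKS[url]:
            schedule(link)
        shared_queue.task_done()


workers = [threading.Thread(target=worker, args=(f"worker-{i}",)) for i in range(3)]
for t in workers:
    t.start()
schedule("/")           # the single start address
shared_queue.join()     # wait until every scheduled URL has been processed
for _ in workers:
    shared_queue.put(None)
for t in workers:
    t.join()

print(sorted(url for _, url in crawled))
```

Which worker handles which page changes from run to run, but each page is crawled exactly once because all workers consult the same seen set.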


5.2 Distributed crawling example

* 1. Create the Scrapy project
    Command: scrapy startproject cnblogs_distributed C:\Users\13600\Desktop\synchro\Project\cnblogs_distributed
New Scrapy project 'cnblogs_distributed', using template directory 'c:\program\python38\lib\site-packages\scrapy\templates\project', created in:
   C:\Users\13600\Desktop\synchro\Project\cnlogs_distributed

You can start your first spider with:
   cd C:\Users\13600\Desktop\synchro\Project\cnlogs_distributed
   scrapy genspider example example.com

C:\Users\13600\Desktop>
* 2. Open the Scrapy project in PyCharm and create the spider script (the spider name must not be the same as the project name)
    Command: scrapy genspider cnblogs www.cnblogs.com
PS C:\Users\13600\Desktop\synchro\Project\cnlogs_distributed\cnblogs_distributed> scrapy genspider cnblogs www.cnblogs.com 
Created spider 'cnblogs' using template 'basic' in module:
 cnblogs_distributed.spiders.cnblogs
PS C:\Users\13600\Desktop\synchro\Project\cnlogs_distributed\cnblogs_distributed> 


* 3. Install the scrapy_redis module
    Command: pip install scrapy_redis
* 4. Create main.py, the script that runs the spider, in the project directory
# main.py
from scrapy.cmdline import execute

execute(['scrapy', 'crawl', 'cnblogs'])
* 5. Modify the settings file
# Do not obey robots.txt
ROBOTSTXT_OBEY = False

# Only show error logs
LOG_LEVEL = 'ERROR'

# Global USER_AGENT
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
              'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.134 Safari/537.36 Edg/103.0.1264.71')

# Distributed crawler configuration

# Redis connection (these values are used by default if nothing is set)
# REDIS_HOST = 'localhost'  # host name
# REDIS_PORT = 6379  # port

# Use the scrapy-redis deduplication filter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Use the scrapy-redis Scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Optional: persist scraped items to Redis
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 299
}

* 6. Define the item class in items.py.
# items.py
import scrapy


class CnblogsItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    article_url = scrapy.Field()
    summary = scrapy.Field()
    content = scrapy.Field()
    
* 7. Spider script
import scrapy
from scrapy import Request
# use RedisSpider
from scrapy_redis.spiders import RedisSpider


# inherit from RedisSpider
class CnblogsSpider(RedisSpider):
    name = 'cnblogs'
    allowed_domains = ['www.cnblogs.com']
    # name of the Redis key that holds the start URLs
    redis_key = 'myspider:start_urls'

    def parse(self, response):
        # items.py is importable as a top-level module because of the sys.path change in step 8
        from items import CnblogsItem
        # get all article tags
        article_list = response.css('article.post-item')

        # iterate over the article tags
        for article in article_list:
            # create a fresh item per article so that requests do not share one object
            item = CnblogsItem()

            # get the title
            title = article.css('a.post-item-title::text').extract_first()
            item['title'] = title

            # get the article link
            article_url = article.css('a.post-item-title::attr(href)').extract_first()
            item['article_url'] = article_url

            # get the article summary
            summary = article.css('p.post-item-summary::text')[-1].extract().strip()
            item['summary'] = summary

            yield Request(article_url, callback=self.parse_detail, meta={'item': item})

    def parse_detail(self, response, **kwargs):
        # get the item object back from the response
        item = response.meta.get('item')

        # keep the HTML markup of the post body; plain text alone would lose the formatting
        content = response.css('#cnblogs_post_body').extract_first()
        item['content'] = content

        # return the data
        yield item

redis_key = 'myspider:start_urls': all machines share one start address.
Once a start address is written into Redis, whichever of the three machines takes it first executes the task and crawls that address.
The addresses it finds are then put back into Redis, and the three machines compete for them again, each executing whatever it grabs...
* 8. In the Scrapy project's __init__.py, add the project path to sys.path
# __init__.py
import os
import sys
# Add the project path to the module search path (sys.path)
BASE_PATH = os.path.dirname(__file__)
sys.path.append(BASE_PATH)
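* 9. Start the program
     The local Redis is used by default, so no configuration is needed.
     To simulate three machines running a distributed crawl, open three terminals and start the spider in each.
     Each process stands in for one machine.

     Command: scrapy crawl cnblogs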


* 10. Write the start address into Redis
127.0.0.1:6379> lpush myspider:start_urls https://www.cnblogs.com/

Crawling starts as soon as the start address is written. (The results are not printed to the terminal with print; just view the stored data directly.)


5.3 Summary

* 1. pip3 install scrapy-redis
* 2. The spider used to inherit from Spider; it now inherits from RedisSpider
* 3. Do not write start_urls = ['https://www.cnblogs.com/']
    Write redis_key = 'myspider:start_urls' instead
* 4. Configure the following in settings.py
# Redis connection
# host name
REDIS_HOST = 'localhost'
# port
REDIS_PORT = 6379

# Use the scrapy-redis deduplication filter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Distributed crawler configuration
# Use the scrapy-redis Scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Optional: persist scraped items to Redis
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 299
}
* 5. When starting the Scrapy project from the command line, add the project path to sys.path; otherwise modules inside the project cannot be found.
# __init__.py of the Scrapy project
import os
import sys
# Add the project path to the module search path (sys.path)
BASE_PATH = os.path.dirname(__file__)
sys.path.append(BASE_PATH)
* 6. Push a start address to myspider:start_urls in Redis
lpush myspider:start_urls https://www.cnblogs.com/

Copyright notice
This article was written by [Python_21.]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/216/202208040124480652.html