What is Scrapy?
- Based on Twisted, an asynchronous networking framework
- A crawler framework implemented in pure Python
- Basic structure: the 5+2 architecture, i.e. 5 components and 2 middlewares
The 5 components:
- Scrapy Engine: the engine, responsible for communication and for passing signals and data between the Scheduler, Downloader, Spiders, and Item Pipeline. This component is the "brain" of the crawler and the dispatch center of the whole framework.
- Scheduler: the scheduler, which queues Request objects. Put simply, it is a queue: it receives the requests sent by the engine, enqueues them, and hands them back to the engine (which forwards them to the Downloader) whenever the engine asks for more requests. The initial crawl URLs, and the URLs extracted from pages later on, are placed in the scheduler to wait for crawling. The scheduler also removes duplicate URLs automatically (deduplication can be turned off for specific URLs via settings, e.g. for POST request URLs).
- Downloader: the downloader, which receives the requests sent by the engine, downloads them, and returns the responses to the engine; the engine then passes them to the Spiders for processing.
- Spiders: the parser, which handles all responses, analyzes them, extracts the data needed to fill the Item fields, and submits follow-up URLs to the engine so that they enter the Scheduler again. It is also where the entry (start) URLs are defined.
- Item Pipeline: the data pipeline, where deduplication and storage logic is encapsulated. It post-processes the data obtained by the Spiders, e.g. filtering or storing it. After a page is parsed and the required data is stored in an Item, the item is sent to the pipeline, processed through a specific sequence of steps, and finally stored in a local file or a database.
The 2 middlewares:
- Downloader Middlewares: download middleware, a component for customizing and extending the download functionality. It is a specific hook between the engine and the downloader that processes the responses the Downloader returns to the engine (and the requests going the other way). For example, downloader middleware can be configured to automatically rotate the User-Agent, the proxy IP, and so on (see the sketch below).
- Spider Middlewares: crawler middleware, a specific hook between the engine and the Spider that processes the spider's input (responses) and output (items and requests). It is used to customize and extend the communication between the engine and the Spider by inserting custom code, thereby extending Scrapy's functionality.
Scrapy documentation (Chinese): https://www.osgeo.cn/scrapy/topics/spider-middleware.html
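As an illustration of the User-Agent rotation mentioned above, here is a minimal sketch of a downloader middleware; the class name, the User-Agent strings, and the priority value are illustrative assumptions, not part of any project in this article.

import random

class RandomUserAgentMiddleware:
    # A small pool of example User-Agent strings (placeholder values)
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (X11; Linux x86_64)",
    ]

    def process_request(self, request, spider):
        # Called for every request passing through the downloader middleware;
        # returning None lets Scrapy continue handling the request normally
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None

# It would then be enabled in settings.py, for example:
# DOWNLOADER_MIDDLEWARES = {
#     "myproject.middlewares.RandomUserAgentMiddleware": 543,
# }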
Installing the Scrapy framework
In a cmd window, install with pip:
pip install scrapy
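To verify the installation, running scrapy version should print the installed version without errors.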
Common problems when installing the Scrapy framework
Cannot find the win32api module (common on Windows systems):
pip install pypiwin32
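If the install succeeded, a quick check such as the following should exit without an ImportError:

python -c "import win32api"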
Creating a Scrapy crawler project
Creating a new project
scrapy startproject <project name>
Example:
scrapy startproject tubatu_scrapy_project
Project directory
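For reference, a freshly generated project (here tubatu_scrapy_project) has roughly the following layout; the exact template files may vary slightly between Scrapy versions:

tubatu_scrapy_project/
    scrapy.cfg
    tubatu_scrapy_project/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py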
scrapy.cfg: the project's configuration file; it records the location of the project settings module and deployment information
- 【settings】: points to the project's settings module, i.e. the ./tubatu_scrapy_project/settings file
- 【deploy】: deployment information
- items.py: where the item data structures are defined, i.e. which fields we want to scrape; all item definitions can go in this file
- pipelines.py: the project's pipeline file, i.e. the data-processing pipeline; data storage and cleaning logic is written here, for example storing the data in a JSON file (a minimal sketch follows this list)
- settings.py: the project's settings file, where global settings are defined, such as the crawler's USER_AGENT; common configuration items include:
- ROBOTSTXT_OBEY: whether to obey the robots.txt protocol; usually set to False
- CONCURRENT_REQUESTS: the maximum number of concurrent requests (Scrapy's default is 16; the generated settings.py shows 32 as a commented-out example)
- COOKIES_ENABLED: whether cookies are enabled (enabled by default; the generated settings.py shows COOKIES_ENABLED = False commented out)
- DOWNLOAD_DELAY: the delay between downloads, in seconds
- DEFAULT_REQUEST_HEADERS: the default request headers
- SPIDER_MIDDLEWARES: enables and orders the spider middlewares
- DOWNLOADER_MIDDLEWARES: enables and orders the downloader middlewares
- For other settings, see the official documentation
- spiders directory: contains the implementation of each crawler; the parsing rules (the crawler parsers) are written in this directory
- middlewares.py: defines the SpiderMiddleware and DownloaderMiddleware rules, e.g. customizing requests, custom data processing, proxy access, etc.
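As mentioned for pipelines.py above, storage logic lives in the pipeline. Below is a minimal sketch of a pipeline that writes items to a JSON Lines file; the class name and output file name are illustrative assumptions, and it would still need to be registered in ITEM_PIPELINES in settings.py.

import json

class JsonWriterPipeline:
    def open_spider(self, spider):
        # Called once when the spider starts: open the output file
        self.file = open("items.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes: close the file
        self.file.close()

    def process_item(self, item, spider):
        # Serialize each item as one JSON line, then pass the item along
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item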
Automatically generating a spider template file
cd into the spiders directory and run the following command to generate a crawler file:
scrapy genspider <spider name> <domain to crawl>
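For example, the spider used later in this article could be generated with:

scrapy genspider tubatu xiaoguotu.to8to.com

This creates a tubatu.py file under spiders/ containing roughly the following template (the exact contents depend on the Scrapy version); the start_urls entry is then edited by hand:

import scrapy

class TubatuSpider(scrapy.Spider):
    name = 'tubatu'
    allowed_domains = ['xiaoguotu.to8to.com']
    start_urls = ['http://xiaoguotu.to8to.com/']

    def parse(self, response):
        pass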
Running the crawler
Method 1: start from the cmd window
cd into the spiders directory and execute the following command to start the crawler:
scrapy crawl <spider name>
Method 2: start from a .py file
Create a main.py file under the project root as a startup script; running main.py starts the crawler. The code example is as follows:
Code: crawler file
import scrapy


class TubatuSpider(scrapy.Spider):
    # The spider name; it must be unique within the project
    name = 'tubatu'
    # Domains the spider is allowed to crawl
    allowed_domains = ['xiaoguotu.to8to.com']
    # URLs requested when the spider starts
    start_urls = ['https://xiaoguotu.to8to.com/pic_space1?page=1']

    # The default parsing method
    def parse(self, response):
        print(response.text)
Code: startup file
from scrapy import cmdline

# Inside a Scrapy project, this file makes it convenient to run the spiders created in the project.
# cmdline.execute() runs the crawler start command: scrapy crawl <spider name>
# execute() expects each part of the command as a separate string, e.g.
# cmdline.execute(['scrapy', 'crawl', 'tubatu']), so a single command string must be split().
cmdline.execute("scrapy crawl tubatu".split())
Running result
Sample project
Crawl information from the Tubatu (to8to) home-decoration website and store the scraped data in a local MongoDB database.
The figure below shows the project layout; the files marked in blue contain the code for this example.
tubatu.py
import scrapy
from tubatu_scrapy_project.items import TubatuScrapyProjectItem
import re


class TubatuSpider(scrapy.Spider):

    # The spider name; it must be unique within the project
    name = 'tubatu'
    # Domains the spider is allowed to crawl; URLs outside these domains are not crawled
    allowed_domains = ['xiaoguotu.to8to.com', 'wx.to8to.com', 'sz.to8to.com']
    # URLs requested when the spider starts
    start_urls = ['https://xiaoguotu.to8to.com/pic_space1?page=1']

    # The default parsing method
    def parse(self, response):
        # The xpath method can be called directly on response
        # response is an HtmlResponse object
        pic_item_list = response.xpath("//div[@class='item']")
        for item in pic_item_list[1:]:
            info = {}
            # Note the leading '.' in the xpath below: it makes the query relative to the current item.
            # xpath() alone does not return the text()/attribute content; it returns a SelectorList,
            # e.g. [<Selector xpath='.//div/a/text()' data='...'>] of type scrapy.selector.unified.SelectorList
            # content_name = item.xpath('.//div/a/text()')

            # extract() returns the data of every matched selector as a list
            # content_name = item.xpath('.//div/a/text()').extract()

            # extract_first() returns the first matched value as a str
            # Project name
            info['content_name'] = item.xpath(".//a[@target='_blank']/@data-content_title").extract_first()

            # Project URL
            info['content_url'] = "https:" + item.xpath(".//a[@target='_blank']/@href").extract_first()

            # Project id
            content_id_search = re.compile(r"(\d+)\.html")
            info['content_id'] = str(content_id_search.search(info['content_url']).group(1))

            # yield a scrapy.Request() to send an asynchronous request; the method also accepts cookies etc.
            # callback takes the method itself; do not call the method here
            yield scrapy.Request(url=info['content_url'], callback=self.handle_pic_parse, meta=info)

        if response.xpath("//a[@id='nextpageid']"):
            now_page = int(response.xpath("//div[@class='pages']/strong/text()").extract_first())
            next_page_url = "https://xiaoguotu.to8to.com/pic_space1?page=%d" % (now_page + 1)
            yield scrapy.Request(url=next_page_url, callback=self.parse)

    def handle_pic_parse(self, response):
        tu_batu_info = TubatuScrapyProjectItem()
        # Image URL
        tu_batu_info["pic_url"] = response.xpath("//div[@class='img_div_tag']/img/@src").extract_first()
        # Nickname
        tu_batu_info["nick_name"] = response.xpath("//p/i[@id='nick']/text()").extract_first()
        # Image name
        tu_batu_info["pic_name"] = response.xpath("//div[@class='pic_author']/h1/text()").extract_first()
        # Project name
        tu_batu_info["content_name"] = response.request.meta['content_name']
        # Project id
        tu_batu_info["content_id"] = response.request.meta['content_id']
        # Project URL
        tu_batu_info["content_url"] = response.request.meta['content_url']
        # yield the item to the pipelines; the pipeline must be enabled in settings.py, otherwise it receives nothing
        yield tu_batu_info
items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class TubatuScrapyProjectItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # Decoration project name
    content_name = scrapy.Field()
    # Decoration project id
    content_id = scrapy.Field()
    # Request URL
    content_url = scrapy.Field()
    # Nickname
    nick_name = scrapy.Field()
    # Image URL
    pic_url = scrapy.Field()
    # Image name
    pic_name = scrapy.Field()
pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

from pymongo import MongoClient


class TubatuScrapyProjectPipeline:

    def __init__(self):
        # Connect to the local MongoDB instance and select the target collection
        client = MongoClient(host="localhost",
                             port=27017,
                             username="admin",
                             password="123456")
        mydb = client['db_tubatu']
        self.mycollection = mydb['collection_tubatu']

    def process_item(self, item, spider):
        # Convert the item to a dict and insert it into the collection
        data = dict(item)
        self.mycollection.insert_one(data)
        return item
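To confirm that items are being stored, a quick check with pymongo (reusing the same connection parameters as the pipeline above) might look like this:

from pymongo import MongoClient

# Connection parameters copied from the pipeline above
client = MongoClient(host="localhost", port=27017, username="admin", password="123456")
collection = client['db_tubatu']['collection_tubatu']

# Print how many documents were stored and show one sample record
print(collection.count_documents({}))
print(collection.find_one())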
settings.py
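The full settings.py listing is not reproduced here. Below is a minimal sketch of the entries this project depends on; apart from the pipeline path (taken from pipelines.py above) and ROBOTSTXT_OBEY = False (recommended earlier), the values are typical defaults rather than confirmed settings from the original project.

BOT_NAME = 'tubatu_scrapy_project'

SPIDER_MODULES = ['tubatu_scrapy_project.spiders']
NEWSPIDER_MODULE = 'tubatu_scrapy_project.spiders'

# Do not obey robots.txt, as recommended above
ROBOTSTXT_OBEY = False

# Enable the MongoDB pipeline defined in pipelines.py; the number controls execution order
ITEM_PIPELINES = {
    'tubatu_scrapy_project.pipelines.TubatuScrapyProjectPipeline': 300,
}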
main.py
from scrapy import cmdline

# Inside a Scrapy project, this file makes it convenient to run the spiders created in the project.
# cmdline.execute() runs the crawler start command: scrapy crawl <spider name>
# execute() expects each part of the command as a separate string, e.g.
# cmdline.execute(['scrapy', 'crawl', 'tubatu']), so a single command string must be split().
cmdline.execute("scrapy crawl tubatu".split())