
Crawler (9) - Scrapy framework (1) | Scrapy, an asynchronous web crawler framework

2022-07-05 10:39:00 Old Ge

What is Scrapy?

  • An asynchronous processing framework based on Twisted
  • A crawler framework implemented in pure Python
  • Basic structure: the "5+2" architecture, i.e. 5 components and 2 middlewares

 

The 5 components:

  • Scrapy Engine: the engine. It handles communication among the other components and passes signals and data between Scheduler, Downloader, Spiders and Item Pipeline. This component is the "brain" of the crawler and acts as the dispatch center for the whole crawl.
  • Scheduler: the scheduler. It receives the requests sent by the engine, queues them, and hands them back to the engine when the engine asks for the next request to pass on to the Downloader. Put simply, it is a queue: the initial crawl URLs and the URLs extracted later from pages are placed in the scheduler to wait for crawling. The scheduler also deduplicates URLs automatically (deduplication can be turned off for specific URLs by configuration, e.g. for the URL of a POST request).
  • Downloader: the downloader. It receives the requests sent by the engine, downloads them, and returns the responses to the engine, which then passes them on to the Spiders for processing.
  • Spiders: the parser. It processes all responses, analyzes them and extracts the data needed to fill the Item fields, and submits any follow-up URLs to the engine so that they enter the Scheduler again. It is also where the entry URLs are defined.
  • Item Pipeline: the data pipeline, i.e. where deduplication and storage classes are encapsulated. It post-processes the data obtained by the Spiders: filtering, cleaning, storing, and so on. Once a page has been parsed and the required data placed into an Item, the item is sent to the pipeline, processed by several components in a specific order, and finally stored in a local file or a database.

 

The 2 middlewares:

  • Downloader Middlewares: downloader middleware. It can be seen as a component for customizing and extending the download functionality, a specific hook between the engine and the downloader that processes the requests passed to the Downloader and the responses it returns to the engine. For example, downloader middleware can be used to rotate the User-Agent or the IP address automatically (a sketch follows this list).
  • Spider Middlewares: spider middleware. It is a specific hook between the engine and the Spider that processes the spider's input (responses) and output (items and requests). It is used to customize and extend the communication between the engine and the Spider, extending Scrapy's functionality by inserting custom code.
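As an illustration of the downloader-middleware hook, below is a minimal sketch of a middleware that picks a random User-Agent for each request. The class name and the User-Agent strings are made up for this example; to take effect it would still have to be registered under DOWNLOADER_MIDDLEWARES in settings.py.

import random

class RandomUserAgentMiddleware:
    """Hypothetical downloader middleware: set a random User-Agent on every request."""

    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    ]

    def process_request(self, request, spider):
        # Called for every request that passes through the downloader middleware
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None  # None means: continue processing this request normally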

 Scrapy documentation (Chinese): https://www.osgeo.cn/scrapy/topics/spider-middleware.html

 

Installing the Scrapy framework

Install with pip from a cmd window:

pip install scrapy
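If the installation succeeded, running the following should print the installed version:

scrapy version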

Common problems during Scrapy installation

The win32api module cannot be found (common on Windows systems); fix it by installing pypiwin32:

pip install pypiwin32

 

Creating a Scrapy crawler project

Create a new project

scrapy startproject <project_name>

Example:

scrapy startproject tubatu_scrapy_project

 

Project directory

 

scrapy.cfg: the project's configuration file. It records the path of the project's settings module and other configuration information:

  • 【settings】: the path of the project's settings module, i.e. the ./tubatu_scrapy_project/settings.py file
  • 【deploy】: deployment information
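For reference, the scrapy.cfg that Scrapy generates for this project looks roughly like the following (the exact template may vary slightly between Scrapy versions):

[settings]
default = tubatu_scrapy_project.settings

[deploy]
#url = http://localhost:6800/
project = tubatu_scrapy_project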

 

  • items.py: where the item data structures are defined, i.e. which fields we want to scrape; all item definitions can be placed in this file
  • pipelines.py: the project's pipeline file, i.e. the data-processing pipeline; storage and cleaning logic goes here, for example the logic for saving data to a JSON file
  • settings.py: the project's settings file, where global settings are defined, for example the crawler's USER_AGENT; common configuration items include the following (a minimal sketch is shown after this list):
    • ROBOTSTXT_OBEY : whether to obey the robots.txt protocol; usually set to False
    • CONCURRENT_REQUESTS : the number of concurrent requests; the default is 16
    • COOKIES_ENABLED : whether cookie handling is enabled; the default is True
    • DOWNLOAD_DELAY : the download delay between requests
    • DEFAULT_REQUEST_HEADERS : the default request headers
    • SPIDER_MIDDLEWARES : enables/disables spider middlewares
    • DOWNLOADER_MIDDLEWARES : enables/disables downloader middlewares
    • For the rest, see the documentation link above
  • spiders directory: contains the individual spider implementations; the parsing rules, i.e. the spider parsers, are written in this directory
  • middlewares.py: defines the SpiderMiddleware and DownloaderMiddleware rules; custom requests, custom data processing, proxy access, and so on
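A minimal settings.py sketch using the options above. The values are illustrative choices, not Scrapy's defaults, and the commented-out middleware entries assume the class names that Scrapy's project template generates for this project:

# settings.py - illustrative values only
BOT_NAME = "tubatu_scrapy_project"

ROBOTSTXT_OBEY = False            # do not obey robots.txt
CONCURRENT_REQUESTS = 16          # number of concurrent requests
COOKIES_ENABLED = False           # disable cookie handling
DOWNLOAD_DELAY = 1                # wait 1 second between requests

DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
}

# Uncomment to enable custom middlewares (the number is the execution order)
# SPIDER_MIDDLEWARES = {
#     "tubatu_scrapy_project.middlewares.TubatuScrapyProjectSpiderMiddleware": 543,
# }
# DOWNLOADER_MIDDLEWARES = {
#     "tubatu_scrapy_project.middlewares.TubatuScrapyProjectDownloaderMiddleware": 543,
# }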

 

Generating a spider template file automatically

cd into the spiders directory and run the following command to generate a spider file:

scrapy genspider <spider_name> <domain_to_crawl>
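For example, the spider used in the sample project below could be generated with:

scrapy genspider tubatu xiaoguotu.to8to.com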

 

Running the crawler

Method 1: start from cmd

cd into the spiders directory and execute the following command to start the crawler:

scrapy crawl <spider_name>
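For the sample project this is:

scrapy crawl tubatu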

 

Method 2: start from a .py file

Create a main.py file in the project root, write the startup script in it, and run main.py to start the crawler. The code example is as follows:

Code - spider file

import scrapy


class TubatuSpider(scrapy.Spider):
    # The spider name; it must be unique within the project
    name = 'tubatu'
    # Domains the spider is allowed to crawl
    allowed_domains = ['xiaoguotu.to8to.com']
    # URLs the spider starts crawling from when it is launched
    start_urls = ['https://xiaoguotu.to8to.com/pic_space1?page=1']

    # The default callback used to parse responses
    def parse(self, response):
        print(response.text)

Code - startup file

from scrapy import cmdline

# Created inside the scrapy project to make running it convenient
# cmdline.execute() runs the start command: scrapy crawl <spider_name>
# execute() expects each part of the command as a separate string, e.g. cmdline.execute(['scrapy', 'crawl', 'tubatu']),
# so a single command string has to be split() first
cmdline.execute("scrapy crawl tubatu".split())

Code - running results

The sample project

Crawl decoration case information from the Tubatu (to8to) home-decoration website and store the scraped data in a local MongoDB database.

The figure in the original post shows the project layout; the files marked in blue are the ones containing the code for this example.

tubatu.py

import scrapy
from tubatu_scrapy_project.items import TubatuScrapyProjectItem
import re


class TubatuSpider(scrapy.Spider):
    # The spider name; it must be unique within the project
    name = 'tubatu'
    # Domains the spider is allowed to crawl; URLs outside them are not followed
    allowed_domains = ['xiaoguotu.to8to.com', 'wx.to8to.com', 'sz.to8to.com']
    # URLs the spider starts crawling from when it is launched
    start_urls = ['https://xiaoguotu.to8to.com/pic_space1?page=1']

    # The default callback used to parse responses
    def parse(self, response):
        # response can be used with the xpath method directly; it is an Html response object
        pic_item_list = response.xpath("//div[@class='item']")
        for item in pic_item_list[1:]:
            info = {}
            # Note: xpath is used again here, relative to the current item.
            # xpath alone only locates the text() nodes and still needs filtering; it returns a SelectorList such as:
            # [<Selector xpath='.//div/a/text()' data='...'>] <class 'scrapy.selector.unified.SelectorList'>
            # content_name = item.xpath('.//div/a/text()')

            # extract() returns the data of the matched selectors as a list
            # content_name = item.xpath('.//div/a/text()').extract()

            # extract_first() returns the first match as a str
            # Get the name of the decoration case
            info['content_name'] = item.xpath(".//a[@target='_blank']/@data-content_title").extract_first()

            # Get the URL of the case
            info['content_url'] = "https:" + item.xpath(".//a[@target='_blank']/@href").extract_first()

            # Case id, extracted from the URL
            content_id_search = re.compile(r"(\d+)\.html")
            info['content_id'] = str(content_id_search.search(info['content_url']).group(1))

            # yield an asynchronous request via scrapy.Request(); it can also carry cookies etc. (see its signature)
            # callback takes the method itself, without calling it; meta passes data to the callback
            yield scrapy.Request(url=info['content_url'], callback=self.handle_pic_parse, meta=info)

        if response.xpath("//a[@id='nextpageid']"):
            now_page = int(response.xpath("//div[@class='pages']/strong/text()").extract_first())
            next_page_url = "https://xiaoguotu.to8to.com/pic_space1?page=%d" % (now_page + 1)
            yield scrapy.Request(url=next_page_url, callback=self.parse)

    def handle_pic_parse(self, response):
        tu_batu_info = TubatuScrapyProjectItem()
        # Picture URL
        tu_batu_info["pic_url"] = response.xpath("//div[@class='img_div_tag']/img/@src").extract_first()
        # Nickname of the author
        tu_batu_info["nick_name"] = response.xpath("//p/i[@id='nick']/text()").extract_first()
        # Picture name
        tu_batu_info["pic_name"] = response.xpath("//div[@class='pic_author']/h1/text()").extract_first()
        # Case name
        tu_batu_info["content_name"] = response.request.meta['content_name']
        # Case id
        tu_batu_info["content_id"] = response.request.meta['content_id']
        # Case URL
        tu_batu_info["content_url"] = response.request.meta['content_url']
        # yield the item to the pipelines; the pipeline must be enabled in settings.py, otherwise it is not used
        yield tu_batu_info

items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class TubatuScrapyProjectItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # Decoration case name
    content_name = scrapy.Field()
    # Decoration case id
    content_id = scrapy.Field()
    # Request URL
    content_url = scrapy.Field()
    # Nickname of the author
    nick_name = scrapy.Field()
    # Picture URL
    pic_url = scrapy.Field()
    # Picture name
    pic_name = scrapy.Field()

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

from pymongo import MongoClient


class TubatuScrapyProjectPipeline:

    def __init__(self):
        # Connect to the local MongoDB instance and select the database/collection to write to
        client = MongoClient(host="localhost",
                             port=27017,
                             username="admin",
                             password="123456")
        mydb = client['db_tubatu']
        self.mycollection = mydb['collection_tubatu']

    def process_item(self, item, spider):
        # Called for every item yielded by the spider: convert it to a dict and insert it into MongoDB
        data = dict(item)
        self.mycollection.insert_one(data)
        return item

settings.py
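The original post shows settings.py as a screenshot. The change that matters for this project is enabling the pipeline (and, typically, disabling the robots.txt check); a minimal sketch, assuming the usual priority value of 300:

# settings.py (relevant entries only)
ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
    "tubatu_scrapy_project.pipelines.TubatuScrapyProjectPipeline": 300,
}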

main.py

from scrapy import cmdline

# Created inside the scrapy project to make running it convenient
# cmdline.execute() runs the start command: scrapy crawl <spider_name>
# execute() expects each part of the command as a separate string, e.g. cmdline.execute(['scrapy', 'crawl', 'tubatu']),
# so a single command string has to be split() first
cmdline.execute("scrapy crawl tubatu".split())

 
