Scrapy crawler framework
2022-07-27 16:05:00 【fresh_nam】
Preface
Scrapy is a very easy-to-use crawler framework. Its asynchronous processing lets us fetch the content we want faster. Now let's use it to crawl some pictures.
One. The goal
I want to crawl some pictures. After searching around online, I decided on http://www.mmonly.cc/mmtp/, which has a lot of nice pictures; the goal is to crawl all the pictures linked from its home page.
Two. Usage steps
1. Install scrapy
Install it with pip:
pip install scrapy
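To check that the installation succeeded, the scrapy command-line tool can print its version:
scrapy version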
2. Create project
Open cmd in the folder where you want to create the crawler project and run the following command:
scrapy startproject scrapy_demo
This creates a folder named scrapy_demo that contains the crawler project files. Opening it in PyCharm, its directory structure looks like this:
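For reference, a project generated by scrapy startproject typically has a layout like this (a sketch of the standard structure, not a screenshot of my machine):
scrapy_demo/
    scrapy.cfg            # deploy configuration file
    scrapy_demo/          # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where the spiders live
            __init__.py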
Then cd into the project folder and run the following command:
scrapy genspider MMspider www.mmonly.cc/mmtp/
Here MMspider is the spider name, which is used when starting the crawler, and the address after it is the site to crawl. After the command runs, a py file named MMspider.py appears in the spiders folder; the crawler logic will be written there later. The generated allowed_domains value needs to be changed, and the result looks like this:
MMspider.py
# -*- coding: utf-8 -*-
import scrapy


class MmspiderSpider(scrapy.Spider):
    name = 'MMspider'  # spider name
    allowed_domains = ['mmonly.cc']  # domains the spider is allowed to crawl
    start_urls = ['http://www.mmonly.cc/mmtp//']  # starting URL the spider requests first

    def parse(self, response):
        pass

3. Analyze the structure of the web page
To crawl a web page, first understand its structure:

The target site has many listing pages, every listing page contains many albums, each album contains multiple pictures, and each picture sits on its own page. So to get all the album pictures, the crawler should be designed like this: get the links to all the albums on the listing page, follow each link and fetch every picture of that album, then check whether the listing page has a next page; if it does, keep crawling until the pictures of every album on every listing page have been fetched.
4. Write the crawler
First, define the information to crawl in items.py:
items.py
import scrapy


class ScrapyDemoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    siteURL = scrapy.Field()    # URL of the picture page
    detailURL = scrapy.Field()  # URL of the original picture
    title = scrapy.Field()      # name of the picture series
    fileName = scrapy.Field()   # full path, including the file name, used to save the picture
    path = scrapy.Field()       # directory where the picture series is stored
Then write the spider that parses the pages:
MMspider.py
import os
import datetime

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from scrapy_demo.items import ScrapyDemoItem


# Inherit from CrawlSpider so we can use rule-based link extraction
class MmspiderSpider(CrawlSpider):
    name = 'MMspider'  # spider name
    base = 'D:\\python Grab pictures\\MMspider\\'  # base directory for saving pictures
    allowed_domains = ['mmonly.cc']  # domains the spider is allowed to crawl
    start_urls = ['http://www.mmonly.cc/mmtp//']

    # Rules for the listing pages: keep following the "next page" link,
    # and send every album link that matches the pattern to parse_item
    rules = (
        Rule(LinkExtractor(allow=(r'https://www.mmonly.cc/(.*?).html'),
                           restrict_xpaths=("//div[@class='ABox']")),
             callback="parse_item", follow=False),
        # '下一页' is the "next page" link text on the site
        Rule(LinkExtractor(allow=(''),
                           restrict_xpaths=(u"//a[contains(text(),'下一页')]")),
             follow=True),
    )

    def parse_item(self, response):
        item = ScrapyDemoItem()
        item['siteURL'] = response.url
        item['title'] = response.xpath('//h1/text()').extract_first()  # album title parsed with XPath
        item['path'] = self.base + item['title']  # pictures from the same album share one directory
        path = item['path']
        if not os.path.exists(path):
            os.makedirs(path)  # create the directory if it does not exist yet
        item['detailURL'] = response.xpath('//a[@class="down-btn"]/@href').extract_first()  # URL of the original picture
        num = response.xpath('//span[@class="nowpage"]/text()').extract_first()  # index of the current picture in the album
        item['fileName'] = item['path'] + '/' + str(num) + '.jpg'  # full file name for the picture
        print(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'), item['fileName'], u'parsed successfully!')
        yield item
        try:
            # Total number of pictures in the album; the "totalpage" class is an assumption
            # based on the site's pager markup and may need adjusting
            total_page = response.xpath('//span[@class="totalpage"]/text()').extract_first()
            # If this is not the last picture of the album, request the next picture page
            if num != total_page:
                next_page = response.xpath(u"//a[contains(text(),'下一页')]/@href").extract_first()
                if next_page is not None:
                    next_page = response.urljoin(next_page)
                    yield scrapy.Request(next_page, callback=self.parse_item)
        except Exception:
            pass
Does this block of code look confusing? Don't worry, let me explain it step by step.
rules holds the crawling rules I defined, and there are two of them. The first one extracts, from the div tags with class 'ABox' on each listing page, the links matching the regular expression 'https://www.mmonly.cc/(.*?).html', which gives us the link to every album on the page.
callback="parse_item" means the requested page is parsed by the parse_item() function.
The second rule fetches the next listing page: it finds the <a> tags whose text is '下一页' (next page), requests the corresponding link, and so moves the crawler on to the next page.
follow is a boolean that specifies whether links extracted from the response by this rule should be followed further. If callback is None, follow defaults to True; otherwise it defaults to False.
XPath is a query language for locating elements in HTML, and it lets us pinpoint an element quickly. For example, response.xpath('//a[@class="down-btn"]/@href').extract_first() gets the content of the href attribute of the first <a> tag whose class is down-btn.
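As a minimal standalone sketch (using a made-up HTML snippet rather than the real page), the same kind of selector can be tried out with Scrapy's Selector class:
from scrapy.selector import Selector

html = '<div><a class="down-btn" href="https://example.com/pic.jpg">download</a></div>'
sel = Selector(text=html)
print(sel.xpath('//a[@class="down-btn"]/@href').extract_first())  # prints https://example.com/pic.jpg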
The code that downloads the images lives in the pipeline file. It uses the requests library to request the picture links and save the pictures, so first install requests:
pip install requests
The code is as follows:
pipelines.py
import datetime

import requests


class ScrapyDemoPipeline(object):
    def process_item(self, item, spider):
        detailURL = item['detailURL']  # URL of the original picture
        fileName = item['fileName']    # full path where the file is saved
        try:
            print(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'), u'saving picture:', detailURL)
            print(u'file:', fileName)
            image = requests.get(detailURL)  # download the picture from the URL parsed into the item
            with open(fileName, 'wb') as f:  # open the target file in binary mode
                f.write(image.content)       # write the picture data
            print(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'), fileName, u'saved successfully!')
        except Exception as e:
            print(fileName, 'other fault:', e)
        return item
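One small note on this design: requests.get will wait indefinitely if the server stalls. If that ever becomes a problem, a timeout can be passed, for example (a sketch, not part of the original code):
image = requests.get(detailURL, timeout=30)  # give up after 30 seconds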
Finally, the pipeline also needs to be enabled in the settings:
settings.py
# Obey robots.txt rules
ROBOTSTXT_OBEY = False  # change the default True to False

# Uncomment the item pipelines section
ITEM_PIPELINES = {
    'scrapy_demo.pipelines.ScrapyDemoPipeline': 300,
}
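As an aside, Scrapy also ships with a built-in ImagesPipeline that handles downloading, deduplication and retries by itself; it is an alternative to the hand-written pipeline above, but it expects the item to carry an image_urls list rather than the fields defined earlier, and it needs Pillow installed. A minimal sketch of that alternative configuration, assuming the item were changed accordingly:
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = r'D:\python Grab pictures\MMspider'  # root directory for downloaded images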
Finally, run:
scrapy crawl MMspider
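Optionally, the standard scrapy CLI can also dump the parsed items to a file or suppress the log output, for example:
scrapy crawl MMspider -o items.json
scrapy crawl MMspider --nolog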
The crawler now runs, and we're done!
The pictures are saved as well:

Summary
Scrapy is a very powerful framework, and only part of its features are used here. I made some modifications along the way to get the crawler running, and some of those steps may not be written down above; if you run into problems, please leave a message in the comment section.