Crawler scrapy framework learning 2
2022-06-13 04:32:00 【gohxc】
Extract the data
The best way to learn how to extract data with Scrapy is to try out selectors in the Scrapy shell:
scrapy shell "http://quotes.toscrape.com/page/1/"
Using the shell, you can try selecting elements with CSS on the response object:
>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
>>> response.css('title::text').extract()
['Quotes to Scrape']
The result of running response.css('title') is a list-like SelectorList object, which represents a list of Selector objects wrapping XML/HTML elements.
Note that we appended ::text to the CSS query, meaning we want to select only the text directly inside the title element. If we don't specify ::text, we get the complete title element, including its tags.
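For example, dropping ::text returns the whole element as markup (a quick check in the same shell session):
>>> response.css('title').extract()
['<title>Quotes to Scrape</title>']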
Calling .extract() returns a list, because we are dealing with a SelectorList instance. When you know you only want the first result, you can use extract_first():
response.css('title::text').extract_first()
response.css('title::text')[0].extract()
These two lines are equivalent. However, when no element matches the selection, .extract_first() avoids an IndexError and returns None instead.
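You can see the difference with a selector that matches nothing (nosuchtag below is a made-up element name that does not exist on the page):
>>> response.css('nosuchtag::text').extract_first()   # returns None, no error
>>> response.css('nosuchtag::text')[0].extract()      # raises IndexError
Traceback (most recent call last):
  ...
IndexError: list index out of range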
Besides extract() and extract_first(), you can also use the re() method to extract with regular expressions:
>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']
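Selector lists also offer a re_first() shortcut that returns only the first match (a quick shell check):
>>> response.css('title::text').re_first(r'Q\w+')
'Quotes'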
XPath: Brief introduction
Besides CSS, Scrapy selectors also support XPath expressions:
>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').extract_first()
'Quotes to Scrape'
XPath expressions are very powerful, and they are the foundation of Scrapy selectors. In fact, CSS selectors are converted to XPath under the hood; you can see this if you read the text representation of the selector objects in the shell closely.
With XPath you can select things CSS cannot, for example: select the link that contains the text "Next Page" (see the sketch below). This makes XPath very well suited to scraping tasks, and we encourage you to learn XPath even if you already know how to construct CSS selectors; it will make scraping much easier.
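A minimal sketch of that idea in the shell, assuming the pagination link on this site contains the visible text "Next" (worth verifying on the actual page):
>>> response.xpath('//a[contains(text(), "Next")]/@href').extract_first()
'/page/2/'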
Putting it all together:
$ scrapy shell 'http://quotes.toscrape.com'
>>> quote = response.css("div.quote")[0]
>>> title = quote.css("span.text::text").extract_first()
>>> tags = quote.css("div.tags a.tag::text").extract()
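Having figured out how to extract each field, we can iterate over all the quote elements and combine them into Python dictionaries (a sketch to paste into the same shell session):
>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").extract_first()
...     author = quote.css("small.author::text").extract_first()
...     tags = quote.css("div.tags a.tag::text").extract()
...     print(dict(text=text, author=author, tags=tags))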
Extracting data in our spider
Up to now, our spider doesn't extract any data in particular; it just saves the whole HTML page to a local file. Let's integrate the extraction logic above into our spider.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
To store the scraped data in a file, run:
scrapy crawl quotes -o quotes.json
For historical reasons, Scrapy appends to a given file instead of overwriting its contents.
You can also use other formats, for example JSON Lines:
scrapy crawl quotes -o quotes.jl
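JSON Lines is stream-like: each item sits on its own line, so appending new runs keeps the file valid and you can process it without loading everything into memory. A minimal sketch of reading it back (assuming a quotes.jl produced by the command above):
import json

with open('quotes.jl') as f:
    for line in f:
        item = json.loads(line)  # one scraped item per line
        print(item['author'])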
Following links in pages
First, we extract the link to the next page:
response.css('li.next a::attr(href)').extract_first()
Now let's modify our spider so that it recursively follows the link to the next page, extracting data from each page:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
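Here response.urljoin() builds an absolute URL from the (possibly relative) href, using the URL of the current response as the base; an already absolute URL passes through unchanged. A quick shell check (the outputs assume the response came from quotes.toscrape.com):
>>> response.urljoin('/page/2/')
'http://quotes.toscrape.com/page/2/'
>>> response.urljoin('http://example.com/other')
'http://example.com/other'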
A shortcut for creating Requests
As a shortcut for creating Request objects, you can use response.follow:
next_page = response.css('li.next a::attr(href)').extract_first()
if next_page is not None:
    yield response.follow(next_page, callback=self.parse)
Unlike scrapy.Request, response.follow supports relative URLs directly, so there is no need to call urljoin. Note that response.follow just returns a Request instance; you still have to yield that Request.
You can also pass a selector to response.follow instead of a string; this selector should extract the necessary attributes:
for href in response.css('li.next a::attr(href)'):
    yield response.follow(href, callback=self.parse)
For <a> elements there is a further shortcut: response.follow uses their href attribute automatically, so the code can be shortened even more:
for a in response.css('li.next a'):
    yield response.follow(a, callback=self.parse)
More examples and patterns
import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # follow pagination links
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).extract_first().strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }
This spider starts from the main page and follows all links to author pages, calling the parse_author callback for each of them, as well as the pagination links with the parse callback, as we saw earlier.
Another interesting thing this spider demonstrates is that, even when there are many quotes by the same author, we don't need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicate requests to URLs it has already visited, which avoids hammering servers because of a programming mistake.
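If you ever do want to revisit a URL, Request accepts a dont_filter flag that bypasses this duplicate filter; a minimal sketch (the URL is purely illustrative):
# fetch a page again even though it was already crawled in this run
yield scrapy.Request('http://quotes.toscrape.com/page/1/',
                     callback=self.parse,
                     dont_filter=True)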
Using spider arguments
You can provide command line arguments to your spider by using the -a option when running it:
scrapy crawl quotes -o quotes-humor.json -a tag=humor
These arguments are passed to the spider's __init__ method and become spider attributes by default.
In this example, the value provided for the tag argument is available via self.tag. You can use this to make your spider fetch only quotes with a specific tag, building the URL based on the argument:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
If you pass the tag=humor argument to this spider, you will notice that it only visits URLs under the humor tag, such as http://quotes.toscrape.com/tag/humor.