Site data collection: scrapy usage notes
2022-07-29 10:37:00 【c01dkit】
Preface
There are many ways to collect website data. For example, the most basic requests library can fetch page information in a few lines. Using selenium to simulate clicks on a web page can bypass many anti-crawling strategies, though the code is written quite differently from the other approaches. The scrapy framework lets you split the target cleanly, and its built-in concurrent request handling can fetch information very efficiently.
This article focuses on scrapy and summarizes its basic usage for future reference.
Configuration
With python and pip configured locally, run pip install scrapy to install scrapy.
Basic usage
Creating a new project
To use scrapy, first create a project on the command line with scrapy startproject <projectname>. For example, running scrapy startproject example generates an example folder, which typically contains scrapy.cfg plus a project package with items.py, middlewares.py, pipelines.py, settings.py, and a spiders/ directory.

Adding a target site
The command line will then prompt you to cd into the example directory and run scrapy genspider to create a spider. For example, running scrapy genspider example_spider example.com creates an example_spider.py file under the spiders directory; the crawler code is written in this file.
GET request
import scrapy


class ExampleSpiderSpider(scrapy.Spider):
    name = 'example_spider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        pass
POST request
import scrapy
import json


class ExampleSpiderSpider(scrapy.Spider):
    name = 'example_spider'
    allowed_domains = ['example.com']
    urls = [
        'https://example.com/page/1/',
        'https://example.com/page/2/',
    ]

    def start_requests(self):
        for url in self.urls:
            # Send a request with 'Content-Type': 'application/x-www-form-urlencoded'
            yield scrapy.FormRequest(
                url=url,
                formdata={'arg1': 'xxx', 'arg2': 'xxx'},
                callback=self.parse,
                meta={'arg1': 1, 'arg2': 2}
            )
            # Send a request with 'Content-Type': 'application/json'
            yield scrapy.Request(
                url=url,
                method='POST',
                body=json.dumps({'arg1': 'xxx', 'arg2': 'xxx'}),
                headers={'Content-Type': 'application/json'},
                callback=self.parse,
                meta={'arg1': 1, 'arg2': 2}
            )

    def parse(self, response):
        pass
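
If the POST endpoint returns JSON (as many public APIs do), the body can be decoded directly in parse. A minimal sketch of how the parse method in the spider above could be filled in, under the assumption that the target returns a JSON object; response.json() requires Scrapy 2.2+, on older versions use json.loads(response.text) instead:

    def parse(self, response):
        # Decode the JSON response body (Scrapy 2.2+)
        data = response.json()
        # The dictionary passed via meta in the request is available again here
        arg1 = response.meta.get('arg1')
        yield {'arg1': arg1, 'result': data}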
A few points to note:
- name is the spider's name (the spidername), which you specify when running the crawler.
- allowed_domains restricts which domains may be crawled; it can be omitted.
- start_urls lists the websites to crawl. At runtime a request is sent to each one and the response is handed to the parse function. If the target URLs need to be generated dynamically, delete the start_urls variable and add a start_requests(self) method that yields scrapy.Request(url=<targetwebsite>, callback=self.parse). When the crawler runs and finds no start_urls defined, it calls this method instead.
- scrapy.Request sends a GET request. You can add a cb_kwargs parameter, which accepts a dictionary; it can then be read in parse(self, response, **kwargs) through kwargs, giving you custom parameter passing. Alternatively, use the meta parameter and read the dictionary in parse via response.meta.
- scrapy.FormRequest sends a POST request. The request body goes in formdata, and the values should be strings. meta can be used to pass parameters (cb_kwargs may also work, but this is untested).
Here is an example adapted from the official documentation:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'https://quotes.toscrape.com/page/1/',
            'https://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse, cb_kwargs={'this_url': url})

    def parse(self, response, **kwargs):
        page = response.url.split("/")[-2]
        url = kwargs['this_url']
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')
Start crawling
Run scrapy crawl <spidername> in the outermost example directory to start crawling.
Taking the official documentation as an example:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
        'https://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
response.css extracts elements, and the selectors are largely self-explanatory. You can also get the raw page source through response.text.
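
As an illustration, here are a few common selector patterns against the quotes.toscrape.com markup used above. This is only a sketch for reference, not part of the original example; the spider name is made up for the illustration:

import scrapy


class SelectorDemoSpider(scrapy.Spider):
    # Hypothetical spider used only to illustrate selector patterns
    name = "selector_demo"
    start_urls = ['https://quotes.toscrape.com/page/1/']

    def parse(self, response):
        # ::text extracts an element's text content; get() returns the first match
        first_quote = response.css('span.text::text').get()
        # getall() returns every match as a list
        all_tags = response.css('a.tag::text').getall()
        # ::attr(href) extracts an attribute value, e.g. the next-page link
        next_page = response.css('li.next a::attr(href)').get()
        # response.text is the full page source as a string
        self.log(f'{first_quote} | {len(all_tags)} tags | next page: {next_page}')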
The parse function here yields a dictionary, so you can tell scrapy where to save the output when you run it: scrapy crawl <spidername> -O <output.jl> writes the results to a file for later processing. jl stands for JSON Lines, i.e. one JSON object per line, which can be handled in python by simply iterating over the file line by line and parsing each line with json. -O overwrites the output file, while -o appends to it. You can also add -L ERROR to suppress irrelevant log output at runtime.
For continuously crawling a public API, the jl format works wonders.
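
For the later processing step, a .jl file can be read line by line. A minimal sketch, assuming the output file is named output.jl (the filename is only an example) and contains items shaped like the quotes example above:

import json

with open('output.jl', 'r', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        item = json.loads(line)  # each line is one JSON object
        print(item.get('author'), item.get('text'))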