Can't get data for duplicate urls using Scrapy framework, dont_filter=True

2022-08-03 09:32:00 The moon give me copy code

Scenario: The spider runs without errors and the XPath expression is known to be correct, yet no data comes back.

Possible cause: The spider is requesting a URL that it has already requested.

Scrapy has duplicate filtering built in, which is turned on by default.
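To confirm that filtering is the cause, it can help to have Scrapy log every request the duplicate filter drops. The fragment below is a minimal sketch, assuming a project-level settings.py. DUPEFILTER_DEBUG is a standard Scrapy setting; with its default value of False, only the first filtered duplicate is logged.

```python
# settings.py (fragment)
# Log every request dropped by the duplicate filter, not only the first one.
DUPEFILTER_DEBUG = True
```

With this enabled, a "Filtered duplicate request" line in the debug log for the URL in question points to the duplicate filter rather than to the parsing code.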

In the following example, parse2 is never called:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = "test"
    # allowed_domains = ["https://www.baidu.com/"]
    start_urls = ["https://www.baidu.com/"]

    def parse(self, response):
        # Request the start URL a second time, with parse2 as the callback.
        yield scrapy.Request(self.start_urls[0], callback=self.parse2)

    def parse2(self, response):
        print(response.url)

Scrapy first requests start_urls[0] and passes the response to parse. When parse then requests start_urls[0] again, Scrapy's built-in duplicate filter treats the URL as already seen and drops the request without sending it. That is why parse2 is not called.
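The filtering idea can be sketched in plain Python. This is a simplified model under stated assumptions, not Scrapy's actual code: the class SimpleDupeFilter and its method should_schedule are hypothetical names made up for illustration, and the fingerprint here hashes only the method and URL, whereas Scrapy's real fingerprint also canonicalizes the URL and takes the request body into account.

```python
import hashlib


class SimpleDupeFilter:
    """Toy model of a fingerprint-based duplicate filter (hypothetical, not Scrapy's code)."""

    def __init__(self):
        # Fingerprints of every request accepted so far.
        self.seen = set()

    def fingerprint(self, method, url):
        # Simplified fingerprint: a hash of the HTTP method and the URL.
        return hashlib.sha1(f"{method} {url}".encode("utf-8")).hexdigest()

    def should_schedule(self, url, method="GET", dont_filter=False):
        # A request marked dont_filter skips the duplicate check entirely.
        if dont_filter:
            return True
        fp = self.fingerprint(method, url)
        if fp in self.seen:
            return False  # duplicate: dropped silently
        self.seen.add(fp)
        return True


dupefilter = SimpleDupeFilter()
url = "https://www.baidu.com/"
results = [
    dupefilter.should_schedule(url),                    # first request: accepted
    dupefilter.should_schedule(url),                    # same URL again: dropped
    dupefilter.should_schedule(url, dont_filter=True),  # bypasses the filter
]
print(results)  # [True, False, True]
```

The dropped request raises no error, which matches the scenario above: the spider finishes cleanly, but the callback for the duplicate request never runs.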

Workaround:

Pass dont_filter=True to scrapy.Request so that Scrapy does not filter the request out as a duplicate.

import scrapy


class ExampleSpider(scrapy.Spider):
    name = "test"
    # allowed_domains = ["https://www.baidu.com/"]
    start_urls = ["https://www.baidu.com/"]

    def parse(self, response):
        # dont_filter=True tells Scrapy to skip the duplicate check for this request.
        yield scrapy.Request(self.start_urls[0], callback=self.parse2, dont_filter=True)

    def parse2(self, response):
        print(response.url)

With this change, parse2 is called as expected.


Copyright notice
This article was written by [The moon give me copy code]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/215/202208030926394956.html