Can't get data for duplicate URLs using the Scrapy framework: dont_filter=True
2022-08-03 09:32:00 【The moon give me copy code】
Symptom: the code raises no errors, and the XPath expressions have been verified to be correct, yet no data comes back.
Possible cause: you are asking Scrapy to request a duplicate URL.
Scrapy has built-in duplicate request filtering, and it is enabled by default.
In the following example, parse2 is never called:
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "test"
    # allowed_domains = ["https://www.baidu.com/"]
    start_urls = ["https://www.baidu.com/"]

    def parse(self, response):
        # Re-request the same URL -- the duplicate filter silently drops this.
        yield scrapy.Request(self.start_urls[0], callback=self.parse2)

    def parse2(self, response):
        print(response.url)
When Scrapy starts the spider, it requests start_urls[0] automatically and hands the response to parse. When you yield another request for start_urls[0] inside parse, Scrapy's scheduler recognizes it as a duplicate and silently drops it instead of scheduling it, which is why parse2 is never called.
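You can usually confirm this from Scrapy's log output. The exact wording varies by version, but the dropped request typically shows up as a DEBUG line along these lines:

[scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://www.baidu.com/> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)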
Workaround:
Pass the dont_filter=True argument to scrapy.Request so that Scrapy does not filter out the duplicate request.
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "test"
    # allowed_domains = ["https://www.baidu.com/"]
    start_urls = ["https://www.baidu.com/"]

    def parse(self, response):
        # dont_filter=True tells the scheduler to accept this request
        # even though the URL has already been seen.
        yield scrapy.Request(self.start_urls[0], callback=self.parse2, dont_filter=True)

    def parse2(self, response):
        print(response.url)
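To try it out, run the spider with Scrapy's command-line tool (this assumes the spider file lives inside a standard Scrapy project):

scrapy crawl test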
At this point, parse2 will be called normally.
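As an aside, if you want to disable duplicate filtering for the whole project rather than for a single request, Scrapy also lets you swap out the dupe filter class in settings.py. This is a much blunter instrument than dont_filter=True, so prefer the per-request flag unless you really need it everywhere. A minimal sketch:

# settings.py -- disables duplicate filtering project-wide.
# BaseDupeFilter.request_seen() always returns False, so no request
# is ever treated as a duplicate.
DUPEFILTER_CLASS = "scrapy.dupefilters.BaseDupeFilter"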