Scrapy crawler: how to handle 301/302 redirects
2022-08-02 04:00:00 【BIG_right】
Stopping redirects in Scrapy
When Scrapy crawls a page it may run into a 301/302 redirect. This is a particular problem with download links: Scrapy follows the redirect, starts downloading the file, and only returns to crawling once the download has finished. If all you want is the link the redirect points to, you need to stop the redirect.
The examples below use 302; replace it with 301 where needed, the handling is the same.
Abort redirect
    yield Request(
        url,
        meta={'dont_redirect': True, 'handle_httpstatus_list': [302]},
        callback=self.parse,
    )
If the request is yielded from inside parse, dont_filter=True must also be passed so that the duplicate filter does not drop it. For details, see Scenario Two below.
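Since 301 and 302 are handled the same way, both can be listed at once. The following is a minimal sketch, not something from the original post: `no_redirect_meta` is a hypothetical helper name (it is not part of Scrapy) that only builds the meta dictionary used above.

```python
def no_redirect_meta(statuses=(301, 302)):
    """Build a Request.meta dict that keeps Scrapy from following redirects.

    `no_redirect_meta` is a hypothetical helper, not a Scrapy API.
    """
    return {
        # Ask the redirect middleware not to follow the redirect.
        "dont_redirect": True,
        # Let responses with these status codes reach the callback.
        "handle_httpstatus_list": list(statuses),
    }


meta = no_redirect_meta()
print(meta)
```

It would then be passed as `Request(url, meta=no_redirect_meta(), callback=self.parse)`.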
Get the Location value in the response
The redirect target is stored in the Location header of the response. Read it as follows:
    location = response.headers.get("Location")
Scenario One
If the URLs come from start_urls and are requested in order, set the meta directly in the start_requests method:
    def start_requests(self):
        yield Request(
            url,
            meta={'dont_redirect': True, 'handle_httpstatus_list': [302]},
            callback=self.parse,
        )
Complete example
    import scrapy
    from scrapy import Request


    class xxSpider(scrapy.Spider):
        name = 'xx'
        allowed_domains = ['www.xxx.com']
        start_urls = ['http://www.xxx.com/download']

        def start_requests(self):
            # Stop the 302 redirect here instead of following it
            yield Request(
                self.start_urls[0],
                meta={'dont_redirect': True, 'handle_httpstatus_list': [302]},
                callback=self.parse,
            )

        def parse(self, response):
            # Read the redirect target from the Location header
            location = response.headers.get("Location")
Scenario Two
If the request is yielded from inside parse, add dont_filter=True so that the request bypasses the duplicate filter:
    yield Request(
        url,
        meta={'dont_redirect': True, 'handle_httpstatus_list': [302]},
        callback=self.parse,
        dont_filter=True,
    )
Complete example
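    import scrapy
    from scrapy import Request


    class xxSpider(scrapy.Spider):
        name = 'xx'
        allowed_domains = ['www.xxx.com']
        start_urls = ['http://www.xxx.com/download']

        def parse(self, response):
            url = "xxxxxxxxxx"
            # dont_filter=True is needed here to bypass the duplicate filter
            yield Request(
                url,
                meta={'dont_redirect': True, 'handle_httpstatus_list': [302]},
                callback=self.parse_redirect,
                dont_filter=True,
            )

        def parse_redirect(self, response):
            # Separate callback, so the 302 response does not re-enter parse and loop
            location = response.headers.get("Location")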
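The Location value read in the examples above is returned by Scrapy as bytes, and a server may send a relative path rather than a full URL. The sketch below shows one way to turn it into an absolute string URL using only the standard library. `resolve_location` is a hypothetical helper name, and the UTF-8 default is an assumption, not something stated in the original post.

```python
from urllib.parse import urljoin


def resolve_location(location, base_url, encoding="utf-8"):
    """Return the redirect target as an absolute str URL, or None if missing.

    `resolve_location` is a hypothetical helper; the default encoding
    is an assumption and may need adjusting for a particular site.
    """
    if location is None:
        # No Location header was sent with the response.
        return None
    if isinstance(location, bytes):
        # Header values arrive as bytes, so decode them first.
        location = location.decode(encoding)
    # Resolve a relative Location against the URL that was requested.
    return urljoin(base_url, location)


print(resolve_location(b"/files/a.zip", "http://www.xxx.com/download"))
```

Inside a callback it would be called as `resolve_location(response.headers.get("Location"), response.url)`.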









