Handling 301/302 redirects in a Scrapy crawler
2022-08-02 04:00:00 【BIG_right】
Aborting redirects in Scrapy
When Scrapy crawls data it may run into a 301/302 redirect. This is most troublesome with download links: Scrapy follows the redirect, starts downloading the file, and only returns to the crawl once the download has finished. If all you want is the link behind the redirect, you need to stop the redirect at that point.
The examples below use 302; 301 works the same way and can be substituted for it.
Abort redirect
    yield Request(url, meta={'dont_redirect': True, 'handle_httpstatus_list': [302]}, callback=self.parse)

If the request is yielded from inside parse rather than from start_requests, also pass dont_filter=True so that the duplicate filter does not drop it. See Scenario Two below for details.
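Since 301 and 302 are handled identically, both statuses can be listed at once. The sketch below wraps the two meta keys in a small helper. `no_redirect_meta` is a hypothetical name introduced here for illustration, not a Scrapy function, and it only builds a plain dict.

```python
def no_redirect_meta(statuses=(301, 302)):
    """Build Request meta that stops redirects from being followed.

    Hypothetical helper (not part of Scrapy); the two keys are the
    ones used in the Request call shown in this article.
    """
    return {
        # Hand the 3xx response back instead of following it.
        "dont_redirect": True,
        # Let responses with these status codes reach the callback.
        "handle_httpstatus_list": list(statuses),
    }


meta = no_redirect_meta()
print(meta)
```

In a spider it would be passed as `Request(url, meta=no_redirect_meta(), callback=self.parse)`.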
Get the Location value in the response
The redirect target is carried in the Location header of the response. Scrapy exposes it through response.headers, where the value comes back as bytes, or as None if the header is absent.
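A Location header may hold a relative reference rather than a full URL, so it is usually worth resolving it against the URL that was requested. A minimal sketch using only the standard library follows; `resolve_location` and the `/files/a.zip` path are hypothetical names made up for this example.

```python
from urllib.parse import urljoin


def resolve_location(base_url, location):
    """Return the absolute redirect target, or None if there is none.

    base_url stands for the URL that was requested; location is the raw
    header value, which may be bytes, str, or None.
    """
    if location is None:
        return None
    if isinstance(location, bytes):
        location = location.decode("utf-8")
    # urljoin leaves absolute URLs untouched and resolves relative ones.
    return urljoin(base_url, location)


absolute = resolve_location("http://www.xxx.com/download", b"/files/a.zip")
print(absolute)
```

Inside a callback, base_url would be response.url and location would be the raw header value, which is read like this: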
    location = response.headers.get("Location")

Scenario One
If the URLs come from start_urls and are requested in order, add the meta directly in the start_requests method:

    def start_requests(self):
        yield Request(url, meta={'dont_redirect': True, 'handle_httpstatus_list': [302]}, callback=self.parse)

Complete example
    import scrapy
    from scrapy import Request


    class xxSpider(scrapy.Spider):
        name = 'xx'
        allowed_domains = ['www.xxx.com']
        start_urls = ['http://www.xxx.com/download']

        def start_requests(self):
            # Abort the 302 redirect directly here
            yield Request(self.start_urls[0], meta={'dont_redirect': True, 'handle_httpstatus_list': [302]}, callback=self.parse)

        def parse(self, response):
            # Get the redirect target from the response header (bytes, or None)
            location = response.headers.get("Location")

Scenario Two
If the request is yielded from inside parse, pass dont_filter=True as well, so that Scrapy's duplicate filter does not discard it.
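The reason is duplicate filtering: by default Scrapy drops a request it considers a repeat of one it has already seen, and dont_filter=True opts a single request out of that check. The toy model below illustrates the idea only. `ToyScheduler` is a hypothetical class written for this example and is far simpler than Scrapy's real fingerprint-based filter.

```python
class ToyScheduler:
    """Simplified stand-in for a crawler's duplicate filter (not Scrapy code)."""

    def __init__(self):
        self.seen = set()   # URLs that were already accepted
        self.queue = []     # URLs waiting to be fetched

    def enqueue(self, url, dont_filter=False):
        """Queue url; return True if accepted, False if dropped as a duplicate."""
        if not dont_filter and url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True


scheduler = ToyScheduler()
first = scheduler.enqueue("http://www.xxx.com/download")
repeat = scheduler.enqueue("http://www.xxx.com/download")
forced = scheduler.enqueue("http://www.xxx.com/download", dont_filter=True)
print(first, repeat, forced)
```

In Scrapy the same opt-out is a keyword argument on the request itself: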
    yield Request(url, meta={'dont_redirect': True, 'handle_httpstatus_list': [302]}, callback=self.parse, dont_filter=True)

Complete example
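    import scrapy
    from scrapy import Request


    class xxSpider(scrapy.Spider):
        name = 'xx'
        allowed_domains = ['www.xxx.com']
        start_urls = ['http://www.xxx.com/download']

        def parse(self, response):
            url = "xxxxxxxxxx"
            # dont_filter=True is needed here. A separate callback is used because
            # callback=self.parse would re-yield this same request endlessly.
            yield Request(url, meta={'dont_redirect': True, 'handle_httpstatus_list': [302]}, callback=self.parse_redirect, dont_filter=True)

        def parse_redirect(self, response):
            # Get the redirect target from the response header (bytes, or None)
            location = response.headers.get("Location")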