Handling 301/302 redirects in a Scrapy crawler
2022-08-02 04:00:00 【BIG_right】
Stopping redirects in Scrapy
When Scrapy crawls a page it may receive a 301/302 redirect. This is a particular problem with download links: Scrapy follows the redirect, starts downloading the file, and only returns to the link you were crawling once the download has finished. In that case you need to stop the redirect.
The examples below use 302. For 301, replace 302 with 301; the approach is the same.
Abort redirect
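    yield Request(
        url,
        meta={'dont_redirect': True, 'handle_httpstatus_list': [302]},
        callback=self.parse,
    )

Here dont_redirect tells Scrapy's redirect middleware not to follow the redirect, and handle_httpstatus_list lets the 302 response reach the callback instead of being discarded as a non-2xx response.
If the request is yielded from inside parse, you also need to add dont_filter=True. See Scenario Two below for details.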
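Since the same meta works for both 301 and 302, one way to avoid repeating the dictionary is a small helper. This is only a sketch: the name no_redirect_meta is hypothetical and not part of Scrapy, and it simply builds the same dictionary shown above with both status codes listed.

```python
def no_redirect_meta(statuses=(301, 302)):
    """Build Request.meta that stops redirects and lets these statuses reach the callback.

    Hypothetical helper for illustration; it is not a Scrapy API.
    """
    return {
        "dont_redirect": True,
        "handle_httpstatus_list": list(statuses),
    }


# Usage inside a spider (requires Scrapy):
#   yield scrapy.Request(url, meta=no_redirect_meta(), callback=self.parse)
```

Passing a single code, for example no_redirect_meta((302,)), reproduces the meta used in the rest of this post.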
Get the Location value in the response
The redirect target is placed in the Location header of the response. Read it like this:

    location = response.headers.get("Location")

Scenario One
If the URLs in start_urls are crawled in order, set the meta directly in the start_requests method:

    def start_requests(self):
        yield Request(
            url,
            meta={'dont_redirect': True, 'handle_httpstatus_list': [302]},
            callback=self.parse,
        )

Complete example
    import scrapy
    from scrapy import Request


    class xxSpider(scrapy.Spider):
        name = 'xx'
        allowed_domains = ['www.xxx.com']
        start_urls = ['http://www.xxx.com/download']

        def start_requests(self):
            # Stop the 302 redirect right here
            yield Request(
                self.start_urls[0],
                meta={'dont_redirect': True, 'handle_httpstatus_list': [302]},
                callback=self.parse,
            )

        def parse(self, response):
            # Read the redirect target returned by the server
            location = response.headers.get("Location")

Scenario Two
If the request is yielded from inside parse, you also need to pass dont_filter=True so that Scrapy's duplicate filter does not drop it:

    yield Request(
        url,
        meta={'dont_redirect': True, 'handle_httpstatus_list': [302]},
        callback=self.parse,
        dont_filter=True,
    )

Complete example
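    import scrapy
    from scrapy import Request


    class xxSpider(scrapy.Spider):
        name = 'xx'
        allowed_domains = ['www.xxx.com']
        start_urls = ['http://www.xxx.com/download']

        def parse(self, response):
            url = "xxxxxxxxxx"
            # dont_filter=True is needed here. A separate callback is used
            # so that parse does not re-request the same URL endlessly.
            yield Request(
                url,
                meta={'dont_redirect': True, 'handle_httpstatus_list': [302]},
                callback=self.parse_download,
                dont_filter=True,
            )

        def parse_download(self, response):
            # Read the redirect target returned by the server
            location = response.headers.get("Location")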
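One detail about the Location value read in the examples above: Scrapy returns header values as bytes, and a server may send a relative path instead of a full URL. A minimal sketch of turning it into a usable absolute URL with the standard library is shown below; the function name resolve_location and the example paths are assumptions made for illustration.

```python
from typing import Optional
from urllib.parse import urljoin


def resolve_location(base_url: str, location: Optional[bytes]) -> Optional[str]:
    """Decode a raw Location header and resolve it against the request URL.

    Hypothetical helper for illustration; it is not a Scrapy API.
    """
    if not location:
        # No Location header was sent, so there is nothing to resolve.
        return None
    # Header values are bytes; latin-1 can decode any byte without raising.
    return urljoin(base_url, location.decode("latin-1"))


# Usage inside a callback (requires Scrapy):
#   target = resolve_location(response.url, response.headers.get("Location"))
```

Inside a spider, response.urljoin performs the same join against response.url once the value has been decoded to a string.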