Handling 301/302 redirects in a Scrapy crawler

2022-08-02 04:00:00 【BIG_right】

Scrapy aborts redirects

When Scrapy receives a 301/302 response while crawling, it follows the redirect automatically. This matters most for download links: Scrapy follows the redirect, downloads the file, and only then goes back to crawling the links you collected. In that case you need to stop the redirect.
The examples below use 302; 301 works the same way and can be substituted.

Abort redirect

yield Request(url,meta={'dont_redirect': True,'handle_httpstatus_list': [302]},callback=self.parse)
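If a server may answer with either status, both codes can be listed in `handle_httpstatus_list`. The following is a minimal sketch of a helper that builds the `meta` dict once; the names `REDIRECT_CODES` and `no_redirect_meta` are illustrative and not part of Scrapy.

```python
# Status codes we want delivered to the callback instead of being followed.
REDIRECT_CODES = [301, 302]


def no_redirect_meta(codes=REDIRECT_CODES):
    """Build a Request.meta dict that stops Scrapy from following redirects.

    'dont_redirect' tells the redirect middleware to leave the response alone;
    'handle_httpstatus_list' lets those non-2xx responses reach the callback.
    """
    return {"dont_redirect": True, "handle_httpstatus_list": list(codes)}
```

It would then be used as `yield Request(url, meta=no_redirect_meta(), callback=self.parse)`.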

If the request is yielded from inside parse, dont_filter=True also needs to be added. For details, see Scenario Two below.

Get the Location value in the response

The redirect target is placed in the Location header of the response. Here is how to read it:

location = response.headers.get("Location")
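Scrapy returns header values as bytes, and a Location header may hold a relative URL, so the raw value usually needs a little processing before it can be requested. The sketch below shows one way to do that with the standard library only; the helper name `resolve_location` and the choice of latin-1 decoding are assumptions, not something the original post prescribes.

```python
from urllib.parse import urljoin


def resolve_location(raw_location, base_url):
    """Return the redirect target as an absolute str URL, or None if absent."""
    if raw_location is None:
        return None
    # Header values arrive as bytes; accept str as well for convenience.
    if isinstance(raw_location, bytes):
        raw_location = raw_location.decode("latin-1")
    # Location may be relative, so resolve it against the requested URL.
    return urljoin(base_url, raw_location)
```

Inside a callback this could be called as `resolve_location(response.headers.get("Location"), response.url)`.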

`Scenario One`

If the URLs in start_urls are crawled in order, add the meta directly in the start_requests method:

def start_requests(self):
    yield Request(url, meta={'dont_redirect': True, 'handle_httpstatus_list': [302]}, callback=self.parse)

`Complete example`

import scrapy
from scrapy import Request


class xxSpider(scrapy.Spider):
    name = 'xx'
    allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/download']

    def start_requests(self):
        # Abort the 302 redirect directly here
        yield Request(self.start_urls[0], meta={'dont_redirect': True, 'handle_httpstatus_list': [302]}, callback=self.parse)

    def parse(self, response):
        # Get the returned redirect value
        location = response.headers.get("Location")

`Scenario Two`

If the request is yielded from inside parse, you also need to add dont_filter=True so that Scrapy's duplicate filter does not drop it:

yield Request(url,meta={'dont_redirect': True,'handle_httpstatus_list': [302]},callback=self.parse,dont_filter=True)

`Complete example`

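import scrapy
from scrapy import Request


class xxSpider(scrapy.Spider):
    name = 'xx'
    allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/download']

    def parse(self, response):
        url = "xxxxxxxxxx"
        # need to add filter here
        yield Request(url, meta={'dont_redirect': True, 'handle_httpstatus_list': [302]}, callback=self.parse_redirect, dont_filter=True)

    def parse_redirect(self, response):
        # Separate callback (the name is illustrative): sending the 302 back into
        # parse would yield the same unfiltered request again and never stop
        location = response.headers.get("Location")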
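Why dont_filter=True makes a difference can be illustrated with a toy model of a duplicate filter. This is only a sketch: the class `SeenFilter` is hypothetical, and it compares raw URLs, whereas Scrapy's real filter compares request fingerprints.

```python
class SeenFilter:
    """Toy stand-in for a duplicate filter: remembers URLs it has let through."""

    def __init__(self):
        self._seen = set()

    def allow(self, url, dont_filter=False):
        # Requests flagged dont_filter skip the check entirely.
        if dont_filter:
            return True
        # A URL that was already scheduled is dropped.
        if url in self._seen:
            return False
        self._seen.add(url)
        return True


seen = SeenFilter()
first = seen.allow("http://www.xxx.com/download")
repeat = seen.allow("http://www.xxx.com/download")
forced = seen.allow("http://www.xxx.com/download", dont_filter=True)
```

Here the second request for the same URL is dropped, while the one marked dont_filter is let through.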

Original site copyright notice
 This article was written by [BIG_right]. Please include a link to the original when reposting. Thank you.
 https://yzsam.com/2022/214/202208020322395353.html


    
        


        