Handling 301/302 redirects in a Scrapy crawler
2022-08-02 04:00:00 【BIG_right】
Stopping redirects in Scrapy
When Scrapy crawls a page, it may receive a 301/302 redirect. This is a particular problem when crawling a download link: Scrapy follows the redirect and starts downloading the file, and only returns to crawling the links you collected once the download has finished. In that case you need to stop the redirect.
The examples below use 302; 301 works the same way and can be substituted.
Abort redirect
yield Request(url, meta={'dont_redirect': True, 'handle_httpstatus_list': [302]}, callback=self.parse)
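Since 301 and 302 are handled the same way, the meta dictionary can be built once for both codes. The sketch below is only an illustration: no_redirect_meta is a hypothetical helper name, not part of Scrapy, and it simply returns the two meta keys used in the line above.

```python
def no_redirect_meta(status_codes=(301, 302)):
    """Build a Request.meta dict that stops Scrapy from following redirects.

    no_redirect_meta is a hypothetical helper, not a Scrapy API.
    """
    return {
        # Ask the redirect middleware not to follow the redirect.
        "dont_redirect": True,
        # Allow responses with these status codes to reach the callback.
        "handle_httpstatus_list": list(status_codes),
    }


if __name__ == "__main__":
    # Inside a spider this would be passed as Request(url, meta=no_redirect_meta(), ...)
    print(no_redirect_meta())
```

The returned dictionary can be passed as the meta argument of any Request that should not be redirected.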
If the request is issued with yield Request inside parse, you also need to add dont_filter=True. See Scenario Two below for details.
Get the Location value in the response
The redirect target is placed in the Location header of the response. It can be read like this:
location = response.headers.get("Location")
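Scrapy returns header values as bytes, and a Location header may hold a relative path, so the value usually needs decoding and joining with the request URL before it can be used. A minimal sketch using only the standard library follows; location_to_url is a hypothetical helper name, and the UTF-8 default is an assumption about how the server encodes the header.

```python
from urllib.parse import urljoin


def location_to_url(raw_location, base_url, encoding="utf-8"):
    """Turn a raw Location header value into an absolute URL string.

    location_to_url is a hypothetical helper; the default encoding is an assumption.
    Returns None when the header is missing.
    """
    if raw_location is None:
        return None
    if isinstance(raw_location, bytes):
        # Header values arrive as bytes, so decode them first.
        raw_location = raw_location.decode(encoding)
    # urljoin leaves absolute URLs untouched and resolves relative ones.
    return urljoin(base_url, raw_location)


if __name__ == "__main__":
    print(location_to_url(b"/files/a.zip", "http://www.xxx.com/download"))
```

In a spider callback, response.headers.get("Location") and response.url would be the two arguments. Scrapy's own response.urljoin performs the same joining step on an already decoded string.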
Scenario One
If the URLs in start_urls are crawled in order, just add the meta directly in the start_requests method:
def start_requests(self):
    yield Request(url, meta={'dont_redirect': True, 'handle_httpstatus_list': [302]}, callback=self.parse)
Complete example
import scrapy
from scrapy import Request


class xxSpider(scrapy.Spider):
    name = 'xx'
    allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/download']

    def start_requests(self):
        # Abort the 302 redirect directly here
        yield Request(self.start_urls[0], meta={'dont_redirect': True, 'handle_httpstatus_list': [302]}, callback=self.parse)

    def parse(self, response):
        # Get the returned redirect value
        location = response.headers.get("Location")
Scenario Two
If the request is issued with yield Request inside parse, you also need to add dont_filter=True:
yield Request(url, meta={'dont_redirect': True, 'handle_httpstatus_list': [302]}, callback=self.parse, dont_filter=True)
Complete example
import scrapy
from scrapy import Request


class xxSpider(scrapy.Spider):
    name = 'xx'
    allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/download']

    def parse(self, response):
        url = "xxxxxxxxxx"
        # dont_filter=True needs to be added here
        yield Request(url, meta={'dont_redirect': True, 'handle_httpstatus_list': [302]}, callback=self.parse, dont_filter=True)
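A likely reason the flag matters is Scrapy's duplicate filter: the scheduler drops a request it has already seen unless dont_filter=True is set. The toy model below only illustrates that behaviour. SeenUrlFilter is a hypothetical class that compares plain URL strings; it is not Scrapy's implementation, which works on request fingerprints.

```python
class SeenUrlFilter:
    """Toy stand-in for a duplicate filter (hypothetical, not Scrapy's code)."""

    def __init__(self):
        self.seen = set()

    def should_schedule(self, url, dont_filter=False):
        """Return True if a request for this URL would be scheduled."""
        if dont_filter:
            # The duplicate check is skipped entirely.
            return True
        if url in self.seen:
            # Already requested once, so the repeat is dropped.
            return False
        self.seen.add(url)
        return True


if __name__ == "__main__":
    dupefilter = SeenUrlFilter()
    url = "http://www.xxx.com/download"
    print(dupefilter.should_schedule(url))
    print(dupefilter.should_schedule(url))
    print(dupefilter.should_schedule(url, dont_filter=True))
```

In this model, a second request to the same URL is dropped unless dont_filter=True is passed.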