Source code analysis of the Scrapy Spider class
2022-07-28 19:53:00 【True · skysys】
For crawler basics, see: Getting started with an enterprise distributed crawler framework
The source code is shown below:
- name: the spider's name defines how Scrapy locates (and instantiates) the spider, so it must be unique. name is the most important attribute of a spider. The usual practice is to name the spider after the website (domain) it crawls. For example, a spider that crawls mywebsite.com is usually named mywebsite.
    # Base class of all spiders; user-defined spiders must inherit from this class
    class Spider(object_ref):
        name = None

        # Initialization: resolve the spider name and start_urls
        def __init__(self, name=None, **kwargs):
            if name is not None:
                self.name = name
            elif not getattr(self, 'name', None):  # a spider without a name cannot be located, so raise an error
                raise ValueError("%s must have a name" % type(self).__name__)
            # Python objects store their instance attributes in the built-in __dict__,
            # so every extra keyword argument becomes an attribute of the spider
            self.__dict__.update(kwargs)
            if not hasattr(self, 'start_urls'):
                self.start_urls = []

        # Write a message to the Scrapy log, tagged with this spider
        def log(self, message, level=log.DEBUG, **kw):
            log.msg(message, spider=self, level=level, **kw)

        # Bind the spider to a crawler; assert that it has not been bound already
        def set_crawler(self, crawler):
            assert not hasattr(self, '_crawler'), "Spider already bounded to %s" % crawler
            self._crawler = crawler

        @property
        def crawler(self):
            assert hasattr(self, '_crawler'), "Spider not bounded to any crawler"
            return self._crawler

        @property
        def settings(self):
            return self.crawler.settings

        # Read the addresses in start_urls and generate a Request object for each one;
        # Scrapy downloads them and returns the Responses. This method is called only once
        def start_requests(self):
            for url in self.start_urls:
                yield self.make_requests_from_url(url)

        # Called from start_requests(); this is the function that actually creates the Request.
        # The Request's default callback is parse() and the request method is GET
        def make_requests_from_url(self, url):
            return Request(url, dont_filter=True)

        # Default callback of a Request object: process the returned response
        # and generate Item or Request objects. Subclasses must implement this method
        def parse(self, response):
            raise NotImplementedError

        @classmethod
        def handles_request(cls, request):
            return url_is_from_spider(request.url, cls)

        def __str__(self):
            return "<%s %r at 0x%0x>" % (type(self).__name__, self.name, id(self))

        __repr__ = __str__
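The constructor logic in the listing can be tried without installing Scrapy. The following is a minimal sketch, assuming a stand-in class called MiniSpider (a hypothetical name, not part of Scrapy) that copies only the name check, the keyword-argument handling, and the start_urls default from the code above.

```python
class MiniSpider:
    """Hypothetical stand-in that mirrors only Spider.__init__ from the listing."""

    name = None

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        elif not getattr(self, "name", None):
            # Neither the class nor the caller supplied a name.
            raise ValueError("%s must have a name" % type(self).__name__)
        # Extra keyword arguments become instance attributes.
        self.__dict__.update(kwargs)
        if not hasattr(self, "start_urls"):
            self.start_urls = []


class MyWebsiteSpider(MiniSpider):
    # Class-level name, following the "name the spider after the domain" convention.
    name = "mywebsite"


# The name comes from the class; "category" is an illustrative extra argument.
spider = MyWebsiteSpider(category="books")
print(spider.name, spider.category, spider.start_urls)

# start_urls can also be passed in as a keyword argument.
other = MiniSpider(name="adhoc", start_urls=["http://mywebsite.com/"])
print(other.start_urls)

try:
    MiniSpider()  # no name at class level and none passed in
except ValueError as exc:
    print("error:", exc)
```

Because the keyword arguments are written straight into the instance dictionary, any argument passed to the constructor can later be read as an ordinary attribute of the spider.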
- allowed_domains: an optional list of the domains that the spider is allowed to crawl.
- start_urls: a tuple or list of initial URLs.
- start_requests(self): this method must return an iterable. Scrapy calls it when the spider starts crawling and no particular URLs have been specified; the default implementation generates one Request for each URL in start_urls.
- parse(self, response): the default callback of a Request object, used when a request does not specify a callback for the returned page. It processes the response and generates Item or Request objects.
- log(self, message[, level, component]): records (logs) the message with the scrapy.log.msg() method. See the logging documentation for more details.
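The way these attributes and methods work together can be sketched in plain Python. This is only an illustration under stated assumptions: ToyRequest, ToySpider, url_allowed and run are hypothetical stand-ins written for this example, the "download" step is a dictionary lookup, and the domain check is a simplified guess at the idea behind allowed_domains rather than Scrapy's own url_is_from_spider.

```python
from dataclasses import dataclass
from typing import Callable, Optional
from urllib.parse import urlparse


@dataclass
class ToyRequest:
    """Hypothetical stand-in for Scrapy's Request; only the fields used here."""
    url: str
    callback: Optional[Callable] = None
    dont_filter: bool = False


class ToySpider:
    name = "mywebsite"
    allowed_domains = ["mywebsite.com"]
    start_urls = ["http://mywebsite.com/", "http://mywebsite.com/about"]

    def start_requests(self):
        # One request per start URL, as in the listing.
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

    def make_requests_from_url(self, url):
        # No callback is set, so the default callback parse() will be used.
        return ToyRequest(url, dont_filter=True)

    def parse(self, response):
        # Default callback; in this sketch a "response" is just a dict.
        yield {"url": response["url"], "length": len(response["body"])}


def url_allowed(url, spider):
    """Simplified domain check (an assumption, not Scrapy's implementation)."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in spider.allowed_domains)


def run(spider, fake_pages):
    """Toy engine loop: 'download' each request, then call its callback."""
    items = []
    for request in spider.start_requests():
        response = {"url": request.url, "body": fake_pages.get(request.url, "")}
        callback = request.callback or spider.parse  # fall back to parse()
        items.extend(callback(response))
    return items


pages = {"http://mywebsite.com/": "<html>home</html>"}
spider = ToySpider()
items = run(spider, pages)
print(items)
print(url_allowed("http://sub.mywebsite.com/x", spider))
print(url_allowed("http://other.com/", spider))
```

In real Scrapy the engine, scheduler and downloader sit between start_requests() and the callback; the loop here collapses all of that into one function so that the fallback to parse() is easy to see.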