Introduction to the Scrapy crawler framework
2022-07-25 12:20:00 【Tota King Li】
Scrapy introduction
Scrapy is a fast, high-level screen-scraping and web-crawling framework for Python, used to crawl web sites and extract structured data from their pages. Scrapy has a wide range of uses and can be applied to data mining, monitoring, and automated testing.
The appeal of Scrapy is that it is a framework, so anyone can easily modify it to fit their own needs. It also provides base classes for many kinds of crawlers, such as BaseSpider and sitemap crawlers, and the latest version adds support for crawling web 2.0 sites.
Scrapy data flow (flow chart)
Scrapy uses Twisted as its underlying framework. What makes Twisted special is that it is event-driven and well suited to asynchronous code. Operations that block a thread include accessing files, databases, or the web; spawning a new process and handling its output (for example, running a shell command); and code that performs system-level operations (such as waiting on a system queue). Twisted provides ways to perform all of these operations without blocking code execution.
The chart below shows the Scrapy architecture components and the data flow when Scrapy runs, indicated by the red arrows in the figure.

Scrapy's data flow is controlled by the core engine (engine). The process, illustrated by the minimal spider sketch after the list below, goes like this:
- The crawler engine gets the initial requests and starts crawling.
- The engine schedules the requests with the scheduler and asks for the next request to crawl.
- The scheduler returns the next request to the engine.
- The engine sends the request to the downloader, which downloads the page data through the downloader middleware.
- Once the downloader finishes downloading the page, it returns the result to the engine.
- The engine passes the downloader's response through the middleware to the spider for processing.
- The spider processes the response and returns the extracted items, along with new requests, to the engine through the middleware.
- The engine sends the processed items to the item pipelines, returns the new requests to the scheduler, and the scheduler plans the next request to crawl.
- The process repeats until all URL requests have been crawled.
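As a concrete illustration of this cycle, here is a minimal spider sketch in the style of the official Scrapy tutorial; the demo site quotes.toscrape.com and the CSS selectors are assumptions taken from that tutorial, not from this article:

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    # The engine feeds start_urls to the scheduler, the downloader fetches them,
    # and each response comes back to parse() through the middlewares.
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Every yielded dict is an item the engine hands to the item pipelines.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Every yielded request goes back to the scheduler for the next cycle.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```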
The figure above shows the workflow of all Scrapy components; each component is described separately below:
The crawler engine (ENGINE)
The crawler engine is responsible for controlling the data flow between the components; whenever an operation triggers an event, it is handled by the engine.
Downloader
Receives download requests from the engine, downloads the page data from the network, and returns the result to the engine.
Scheduler
Receives requests from the engine, puts them into a queue, and returns them to the engine when the next request is asked for.
Spider
Receives the downloader's response data handed over by the engine, processes it, and returns items and new data requests (URLs) matching its rules to the engine.
Item pipeline (item pipeline)
Responsible for processing the data parsed by the spider and returned via the engine, and for persisting that data, for example storing it in a database or a file.
Download middleware
The download middleware is the component between the engine and the downloader. It exists in the form of hooks (plugins) and can intercept requests, process the downloading of data, and return the responses to the engine.
Spider middleware
The spider middleware is the component between the engine and the spider. It exists in the form of hooks (plugins) and can process the responses passed to the spider as well as the items and new requests returned to the engine.
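To make the item pipeline concrete, here is a minimal sketch of a pipeline that persists every item to a JSON Lines file, modeled on the JSON-writer example from the Scrapy documentation; the class name and output file are illustrative assumptions, not part of this article:

```python
# pipelines.py -- illustrative pipeline; class name and file name are assumptions.
import json

class JsonWriterPipeline:
    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.file = open("items.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider closes: release the file handle.
        self.file.close()

    def process_item(self, item, spider):
        # Called for every item the spider yields: persist it and pass it on.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

A pipeline only runs if it is enabled in settings.py through the ITEM_PIPELINES setting.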
How to create a Scrapy environment and project
# Create a virtual environment
virtualenv --no-site-packages env_name
# Enter the virtual environment folder
cd env_name
# Activate the virtual environment (on Windows, from the Scripts folder)
cd Scripts
activate
# Optionally update pip
python -m pip install -U pip
# In the virtual environment on Windows, install the Twisted wheel
pip install E:\Twisted-18.4.0-cp36-cp36m-win32.whl
# Install scrapy in the virtual environment
pip install scrapy
# Create your own project (you can create a separate project folder); switch to that folder inside the virtual environment
scrapy startproject project_name
# Create a spider
scrapy genspider spider_name allowed_domain
For example:
scrapy genspider movie movie.douban.com
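For reference, the spider file generated by the genspider command above typically looks roughly like the following skeleton (the exact template varies between Scrapy versions):

```python
# spiders/movie.py -- skeleton produced by `scrapy genspider movie movie.douban.com`
import scrapy

class MovieSpider(scrapy.Spider):
    name = "movie"
    allowed_domains = ["movie.douban.com"]
    start_urls = ["http://movie.douban.com/"]

    def parse(self, response):
        # Parsing logic goes here: extract items and yield follow-up requests.
        pass
```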
Scrapy project structure

- items.py: responsible for defining the data model
- middlewares.py: middleware
- pipelines.py: responsible for processing the data returned by the spider
- settings.py: responsible for the configuration of the whole crawler
- spiders directory: responsible for storing the crawlers that inherit from scrapy
- scrapy.cfg: basic Scrapy configuration
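As an example of the data model defined in items.py, here is a small sketch; MovieItem and its fields are illustrative assumptions, not part of the original article:

```python
# items.py -- define the data model for the crawl (fields are illustrative).
import scrapy

class MovieItem(scrapy.Item):
    title = scrapy.Field()
    rating = scrapy.Field()
    url = scrapy.Field()
```

A spider can then yield MovieItem instances, and any pipeline registered in ITEM_PIPELINES in settings.py will receive them for persistence.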