Introduction to the Scrapy crawler framework
2022-07-25 12:20:00 【Tota King Li】
Scrapy introduction
Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from their pages. Scrapy has a wide range of uses and can be applied to data mining, monitoring, and automated testing.
Part of Scrapy's appeal is that it is a framework: anyone can adapt it to their own needs with little effort. It also provides base classes for many kinds of spiders, such as BaseSpider and the sitemap spider, and recent versions add support for crawling web 2.0 sites.
Scrapy data flow (flow chart)
Scrapy is built on Twisted as its underlying framework. What makes Twisted special is that it is event-driven and well suited to asynchronous code. Operations that block a thread include accessing files, databases, or the web; spawning a new process and consuming its output (for example, running shell commands); and performing system-level operations (such as waiting on a system queue). Twisted provides ways to perform all of these without blocking code execution.
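As a loose illustration of this event-driven style (a sketch with a simulated fetch, not code from Scrapy itself), the snippet below schedules a fake page download on the Twisted reactor and handles the result in a callback instead of blocking the thread; fetch_fake and the URL are hypothetical:

```python
# Minimal sketch of Twisted's non-blocking, callback-driven style.
# fetch_fake simulates async I/O with deferLater; the URL is a placeholder.
from twisted.internet import reactor
from twisted.internet.task import deferLater

def fetch_fake(url):
    # Return a Deferred that fires after 1 second, standing in for a network call
    return deferLater(reactor, 1.0, lambda: "<html>body of %s</html>" % url)

def on_page(body):
    print("got:", body)
    reactor.stop()

fetch_fake("http://example.com").addCallback(on_page)
reactor.run()  # the event loop dispatches callbacks; the thread never blocks on I/O
```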
The chart below shows the Scrapy architecture components and the data flow when Scrapy runs; the red arrows in the figure indicate the direction of flow.

Scrapy's data flow is controlled by the execution engine (ENGINE); the process runs as follows:
- The engine obtains the initial requests from the spider and starts crawling.
- The engine schedules the requests with the scheduler and asks for the next request to crawl.
- The scheduler returns the next request to the engine.
- The engine sends the request to the downloader, passing through the downloader middleware.
- Once the downloader finishes downloading the page, it returns the result to the engine.
- The engine passes the downloader's response through the middleware to the spider for processing.
- The spider processes the response and, through the middleware, returns scraped items and new requests to the engine (see the sketch after this list).
- The engine sends the processed items to the item pipelines and the new requests to the scheduler, which schedules the next request to crawl.
- The process repeats until every url request has been crawled.
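To make the last two steps concrete, here is a minimal, hypothetical spider whose parse method hands both scraped items and new requests back to the engine; the site, selectors, and field name are illustrative only:

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # quotes.toscrape.com is a public practice site, used here only as an example
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # items flow: spider -> engine -> item pipelines
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}
        # new requests flow: spider -> engine -> scheduler
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```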
The figure above shows the workflow across all Scrapy components; each component is described below:
The crawler engine (ENGINE)
Controls the data flow between all the other components; whenever an operation triggers an event, it is handled through the engine.

Downloader
Downloads web data at the engine's request and returns the result to the engine.

Scheduler
Receives requests from the engine, queues them, and hands the next request back to the engine when asked.

Spider
Processes the response data the engine passes to it from the downloader, and returns scraped items and new requests (urls matched by its rules) to the engine.

Item pipeline
Handles the data parsed by the spider and returned via the engine, and persists it, for example by storing it in a database or a file.

Downloader middleware
Sits between the engine and the downloader and exists as hooks (plugins); it can intercept requests, process the downloaded data, and return responses to the engine.

Spider middleware
Sits between the engine and the spider and exists as hooks (plugins); it can process responses and the items and new requests returned to the engine.
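As a hedged sketch of what such a hook looks like (the class name and header value are made up, not from the article), a downloader middleware implements methods such as process_request and process_response:

```python
# Hypothetical downloader middleware: sees every request and response
# passing between the engine and the downloader.
class CustomHeaderMiddleware:
    def process_request(self, request, spider):
        request.headers["User-Agent"] = "my-crawler/1.0"  # made-up value
        return None  # None lets the request continue on to the downloader

    def process_response(self, request, response, spider):
        spider.logger.info("fetched %s (%s)", response.url, response.status)
        return response  # must return a Response (or a new Request)
```

Such a middleware takes effect only after it is registered in the DOWNLOADER_MIDDLEWARES setting in settings.py.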
How to create a Scrapy environment and project
# Create a virtual environment
virtualenv --no-site-packages <environment name>
# Enter the virtual environment folder
cd <virtual environment folder>
# Activate the virtual environment (on Windows)
cd Scripts
activate
# Optionally upgrade pip
python -m pip install -U pip
# On Windows, install Twisted from a local wheel inside the virtual environment
pip install E:\Twisted-18.4.0-cp36-cp36m-win32.whl
# Install scrapy in the virtual environment
pip install scrapy
# Create your own project (you can create a separate project folder first); switch into that folder inside the virtual environment
scrapy startproject <project name>
# Create a spider
scrapy genspider <spider name> <allowed domain>
For example:
scrapy genspider movie movie.douban.com
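For reference, the movie example generates roughly the following skeleton in spiders/movie.py (the exact template varies by Scrapy version):

```python
import scrapy

class MovieSpider(scrapy.Spider):
    name = "movie"                          # used by `scrapy crawl movie`
    allowed_domains = ["movie.douban.com"]  # off-domain requests are filtered out
    start_urls = ["http://movie.douban.com/"]

    def parse(self, response):
        pass  # extraction logic goes here
```

The spider is then run from inside the project directory with `scrapy crawl movie`.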
Scrapy project structure

- items.py defines the data models.
- middlewares.py holds the middleware.
- pipelines.py processes the data returned by the spiders.
- settings.py holds the configuration for the whole crawler.
- spiders directory stores the spiders, which inherit from Scrapy's spider classes.
- scrapy.cfg holds the basic Scrapy configuration.
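Tying items.py and pipelines.py together, a hedged sketch for the movie project might look like this; the field names, output file, and class names are illustrative, not from the article:

```python
# items.py -- the data model
import scrapy

class MovieItem(scrapy.Item):
    title = scrapy.Field()
    rating = scrapy.Field()
```

```python
# pipelines.py -- persist the data returned by spiders
import json

class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open("movies.jl", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # write one JSON object per line, then pass the item along
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()
```

The pipeline takes effect only after it is added to ITEM_PIPELINES in settings.py, e.g. {"<project name>.pipelines.JsonWriterPipeline": 300}.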