Introduction to the Scrapy crawler framework
2022-07-25 12:20:00 【Tota King Li】
Scrapy introduction
Scrapy is a fast, high-level screen-scraping and web-crawling framework for Python, used to crawl web sites and extract structured data from their pages. Scrapy has a wide range of uses and can be applied to data mining, monitoring, and automated testing.
The appeal of Scrapy is that it is a framework, so anyone can easily modify it to fit their own needs. It also provides base classes for many kinds of crawlers, such as BaseSpider and sitemap crawlers, and the latest version adds support for crawling web 2.0 sites.
Scrapy data flow (flow chart)
Scrapy uses Twisted as its underlying framework. What makes Twisted special is that it is event-driven and well suited to asynchronous code. Operations that block a thread include accessing files, databases, or the web; spawning a new process and handling its output (for example, running a shell command); and code that performs system-level operations (such as waiting on a system queue). Twisted provides ways to perform all of these operations without blocking code execution.
The chart below shows the Scrapy architecture components and the data flow when Scrapy runs, indicated by the red arrows in the figure.

Scrapy's data flow is controlled by the core engine (engine). The process, illustrated by the minimal spider sketch after the list below, goes like this:
- The crawler engine gets the initial requests and starts crawling.
- The engine schedules the requests with the scheduler and asks for the next request to crawl.
- The scheduler returns the next request to the engine.
- The engine sends the request to the downloader, which downloads the page data through the downloader middleware.
- Once the downloader finishes downloading the page, it returns the result to the engine.
- The engine passes the downloader's response through the middleware to the spider for processing.
- The spider processes the response and returns the extracted items, along with new requests, to the engine through the middleware.
- The engine sends the processed items to the item pipelines, returns the new requests to the scheduler, and the scheduler plans the next request to crawl.
- The process repeats until all URL requests have been crawled.
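As a concrete illustration of this cycle, here is a minimal spider sketch in the style of the official Scrapy tutorial; the demo site quotes.toscrape.com and the CSS selectors are assumptions taken from that tutorial, not from this article:

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    # The engine feeds start_urls to the scheduler, the downloader fetches them,
    # and each response comes back to parse() through the middlewares.
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Every yielded dict is an item the engine hands to the item pipelines.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Every yielded request goes back to the scheduler for the next cycle.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```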
The figure above shows the workflow of all Scrapy components; each component is described separately below:
The crawler engine (ENGINE)
The crawler engine is responsible for controlling the data flow between the components; whenever an operation triggers an event, it is handled by the engine.
Downloader
Receives download requests from the engine, downloads the page data from the network, and returns the result to the engine.
Scheduler
Receives requests from the engine, puts them into a queue, and returns them to the engine when the next request is asked for.
Spider
Receives the downloader's response data handed over by the engine, processes it, and returns items and new data requests (URLs) matching its rules to the engine.
Item pipeline (item pipeline)
Responsible for processing the data parsed by the spider and returned via the engine, and for persisting that data, for example storing it in a database or a file.
Download middleware
The download middleware is the component between the engine and the downloader. It exists in the form of hooks (plugins) and can intercept requests, process the downloading of data, and return the responses to the engine.
Spider middleware
The spider middleware is the component between the engine and the spider. It exists in the form of hooks (plugins) and can process the responses passed to the spider as well as the items and new requests returned to the engine.
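To make the item pipeline concrete, here is a minimal sketch of a pipeline that persists every item to a JSON Lines file, modeled on the JSON-writer example from the Scrapy documentation; the class name and output file are illustrative assumptions, not part of this article:

```python
# pipelines.py -- illustrative pipeline; class name and file name are assumptions.
import json

class JsonWriterPipeline:
    def open_spider(self, spider):
        # Called once when the spider starts: open the output file.
        self.file = open("items.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider closes: release the file handle.
        self.file.close()

    def process_item(self, item, spider):
        # Called for every item the spider yields: persist it and pass it on.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

A pipeline only runs if it is enabled in settings.py through the ITEM_PIPELINES setting.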
How to create a Scrapy environment and project
# Create a virtual environment
virtualenv --no-site-packages env_name
# Enter the virtual environment folder
cd env_name
# Activate the virtual environment (on Windows, from the Scripts folder)
cd Scripts
activate
# Optionally update pip
python -m pip install -U pip
# In the virtual environment on Windows, install the Twisted wheel
pip install E:\Twisted-18.4.0-cp36-cp36m-win32.whl
# Install scrapy in the virtual environment
pip install scrapy
# Create your own project (you can create a separate project folder); switch to that folder inside the virtual environment
scrapy startproject project_name
# Create a spider
scrapy genspider spider_name allowed_domain
For example:
scrapy genspider movie movie.douban.com
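For reference, the spider file generated by the genspider command above typically looks roughly like the following skeleton (the exact template varies between Scrapy versions):

```python
# spiders/movie.py -- skeleton produced by `scrapy genspider movie movie.douban.com`
import scrapy

class MovieSpider(scrapy.Spider):
    name = "movie"
    allowed_domains = ["movie.douban.com"]
    start_urls = ["http://movie.douban.com/"]

    def parse(self, response):
        # Parsing logic goes here: extract items and yield follow-up requests.
        pass
```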
Scrapy project structure

- items.py: responsible for defining the data model
- middlewares.py: middleware
- pipelines.py: responsible for processing the data returned by the spider
- settings.py: responsible for the configuration of the whole crawler
- spiders directory: responsible for storing the crawlers that inherit from scrapy
- scrapy.cfg: basic Scrapy configuration
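As an example of the data model defined in items.py, here is a small sketch; MovieItem and its fields are illustrative assumptions, not part of the original article:

```python
# items.py -- define the data model for the crawl (fields are illustrative).
import scrapy

class MovieItem(scrapy.Item):
    title = scrapy.Field()
    rating = scrapy.Field()
    url = scrapy.Field()
```

A spider can then yield MovieItem instances, and any pipeline registered in ITEM_PIPELINES in settings.py will receive them for persistence.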