Scrapy learning
2022-07-03 06:14:00 【Black~boy】
Scrapy introduction
1.Scrapy brief introduction
Scrapy is an asynchronous crawling framework built on Twisted and implemented in pure Python. It lets you scrape data quickly with a small amount of code.
Scrapy is a fast, high-level screen-scraping and web-crawling framework for Python, used to crawl websites and extract structured data from their pages. It has a wide range of uses, including data mining, monitoring, and automated testing. Anyone can easily adapt it to their own needs. It also provides base classes for several kinds of spiders, such as BaseSpider and sitemap spiders, and the latest version offers support for web 2.0 crawling.
2.Scrapy framework and functions
2.1 Architecture diagram

2.2 Function of each part
| Name | Function |
|---|---|
| Scrapy Engine | The core of the framework. It handles communication among the Spiders, Item Pipeline, Downloader, and Scheduler, passing signals and data between them. |
| Spiders | Process all Responses sent by the engine, extract data from them, extract URLs to follow, and submit both to the engine. |
| Scheduler | Receives the Requests sent by the engine and queues them until the engine asks for them. |
| Downloader | Downloads all Requests sent by the Scrapy Engine and returns the resulting Responses to the engine, which hands them to the Spiders for processing. |
| Item Pipeline | Handles the data (items) passed on by the engine and performs post-processing such as data analysis and data storage. |
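
The division of labor in the table can be sketched as a toy model in plain Python. This is not Scrapy's real code: `FAKE_WEB` and every function name below are made up for illustration, and the "downloads" are dictionary lookups rather than network requests. It only shows how the engine passes requests, responses, and items between the other components.

```python
from collections import deque

# Hypothetical in-memory "website" so the sketch needs no network access.
FAKE_WEB = {
    "page1": {"title": "First page", "links": ["page2"]},
    "page2": {"title": "Second page", "links": []},
}


def downloader(request):
    """Downloader: turn a request (here just a page key) into a response."""
    return {"url": request, **FAKE_WEB[request]}


def spider(response):
    """Spider: extract one item and any follow-up requests from a response."""
    yield {"item": {"url": response["url"], "title": response["title"]}}
    for link in response["links"]:
        yield {"request": link}


def item_pipeline(item, storage):
    """Item Pipeline: post-process the item, then store it."""
    item["title"] = item["title"].upper()
    storage.append(item)


def engine(start_requests):
    """Engine: move requests, responses, and items between the components."""
    scheduler = deque(start_requests)  # Scheduler: queue of pending requests
    seen = set(start_requests)         # avoid scheduling the same page twice
    storage = []
    while scheduler:
        request = scheduler.popleft()
        response = downloader(request)
        for result in spider(response):
            if "item" in result:
                item_pipeline(result["item"], storage)
            elif result["request"] not in seen:
                seen.add(result["request"])
                scheduler.append(result["request"])
    return storage


items = engine(["page1"])
print(items)
```

In real Scrapy the engine runs this loop asynchronously on top of Twisted, so many downloads are in flight at once; the sequential loop here is a simplification.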
3.Installing Scrapy
3.1 Installation command
On Windows:
pip install Scrapy

Check that the installation succeeded by creating a project:
scrapy startproject <project name>


If the project is created, Scrapy prints "You can start your first spider with:" followed by two commands:
Step 1: cd myspider
Step 2: scrapy genspider example example.com (example is the spider name; example.com is the website you want to crawl)

Here the website name is replaced with xxxx.
After writing the code, run the spider:
scrapy crawl <spider name>