Introduction to the Scrapy framework
2022-07-30 18:24:00 【Cold Lane(*_*)】
1. Introduction
1) Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from their pages. The name is read as "scrape" + "Python".
2) Scrapy is widely used in industry as an application framework for extracting structured data, in applications such as data mining, monitoring, automated testing, information processing, and historical archiving.
3) Scrapy uses the asynchronous networking library Twisted to handle network communication. It has a clear architecture and provides various middleware interfaces, so it can flexibly meet a wide range of requirements. Twisted is a popular event-driven networking framework written in Python, and it is what gives Scrapy its non-blocking, asynchronous processing.
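To make "non-blocking, asynchronous processing" concrete, here is a minimal sketch. It uses the standard-library asyncio module rather than Twisted, and the downloads are simulated with a sleep, so the function names and delays are assumptions made only for illustration; this is not how Scrapy itself is implemented.

```python
import asyncio
import time


async def fake_download(url: str, delay: float) -> str:
    # Simulated network wait. While this coroutine sleeps,
    # the event loop is free to run the other downloads.
    await asyncio.sleep(delay)
    return f"response for {url}"


async def crawl(urls, delay):
    # Start all downloads at once and wait for all of them.
    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(fake_download(u, delay) for u in urls))


def timed_crawl(urls, delay=0.3):
    # Run the crawl and measure the total wall-clock time.
    start = time.perf_counter()
    results = asyncio.run(crawl(urls, delay))
    return results, time.perf_counter() - start
```

With three URLs, the total time stays close to one delay instead of three, because the waits overlap rather than run one after another.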
2. Why use Scrapy
1. Makes it easier to build and scale large scraping projects
2. Has a built-in mechanism, called selectors, for extracting data from web pages
3. Processes requests asynchronously, which makes it very fast
4. Can adjust the crawling speed automatically through its auto-throttling mechanism
5. Is designed to be accessible to developers
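The automatic speed adjustment in point 4 is controlled from a project's settings.py through Scrapy's AutoThrottle settings. The fragment below is a sketch: the setting names are Scrapy's, but the values are illustrative assumptions and should be tuned for the target site.

```python
# settings.py (fragment): AutoThrottle configuration.
# The values are illustrative, not recommendations.

AUTOTHROTTLE_ENABLED = True            # turn the auto-throttling mechanism on
AUTOTHROTTLE_START_DELAY = 5.0         # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60.0          # upper bound for the delay under high latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per remote server
AUTOTHROTTLE_DEBUG = False             # set to True to log every throttling decision
```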
3. Scrapy features
1. Is an open-source web crawler framework that is free to use
2. Can export the scraped data in formats such as JSON, CSV, and XML
3. Has built-in support for selecting and extracting data from page source using XPath or CSS selectors
4. Is crawler-based, so it can extract data from web pages automatically
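The export formats in point 2 can be configured with the FEEDS setting (available in recent Scrapy versions). The fragment below is a sketch; the output paths are hypothetical and chosen only for illustration.

```python
# settings.py (fragment): export scraped items in several formats.
# The file paths are hypothetical examples.

FEEDS = {
    "output/items.json": {"format": "json", "encoding": "utf8"},
    "output/items.csv": {"format": "csv"},
    "output/items.xml": {"format": "xml"},
}
```

Each key is an output location and each value holds the options for that feed, so one crawl can write the same items to all three files.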
4. Benefits
1. It is easy to extend, fast, and powerful
2. It is a cross-platform application framework
3. Scrapy takes care of request scheduling and asynchronous processing
4. Scrapy has a companion service called Scrapyd that lets you upload projects and control spiders through a JSON web service
5. It can scrape any website, even one that does not offer an API for accessing its raw data
5. Flowchart
Component notes:
Scrapy engine (engine): responsible for communication, signals, and data transfer between the spider, item pipeline, downloader, and scheduler.
Scheduler (scheduler): responsible for accepting the requests sent by the engine, ordering them in a certain way, and putting them in a queue. When the engine needs a request, the scheduler hands it back to the engine.
Downloader: responsible for downloading all the requests sent by the engine and returning the responses it obtains to the engine, which hands them over to the spider for processing.
Spider (crawler): responsible for processing all responses, analyzing them and extracting the data required by the item fields, and submitting the URLs that need to be followed to the engine, from where they enter the scheduler again.
Item pipeline (pipeline): responsible for handling the items obtained by the spider and post-processing them (detailed analysis, filtering, storage, etc.).
Downloader middlewares (download middleware): components that customize and extend the download functionality.
Spider middlewares (spider middleware): components that customize and extend the communication between the engine and the spider (such as the responses going into the spider and the requests coming out of it).
The simplest crawl of a single web page flows as follows: spiders -> scheduler -> downloader -> spiders -> item pipeline
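The flow above can be sketched as a toy model in plain Python. It is not Scrapy's implementation: the in-memory FAKE_SITE, the function names, and the tuple-based messages are all assumptions made for illustration, and the middlewares are left out.

```python
from collections import deque

# Hypothetical in-memory "website": URL -> (items on the page, links to follow).
FAKE_SITE = {
    "page1": (["item-a", "item-b"], ["page2"]),
    "page2": (["item-c"], []),
}


def downloader(url):
    # Downloader: turn a request (here just a URL) into a response.
    items, links = FAKE_SITE[url]
    return {"url": url, "items": items, "links": links}


def spider_parse(response):
    # Spider: extract items and yield follow-up requests.
    for item in response["items"]:
        yield ("item", item)
    for link in response["links"]:
        yield ("request", link)


def item_pipeline(item, storage):
    # Item pipeline: post-process the item, then store it.
    storage.append(item.upper())


def engine(start_urls):
    # Engine: moves data between the scheduler, downloader, spider and pipeline.
    scheduler = deque(start_urls)  # scheduler: a simple FIFO queue of requests
    seen = set(start_urls)         # avoid scheduling the same URL twice
    storage = []
    while scheduler:               # the crawl ends when the queue is empty
        url = scheduler.popleft()
        response = downloader(url)
        for kind, value in spider_parse(response):
            if kind == "item":
                item_pipeline(value, storage)
            elif value not in seen:
                seen.add(value)
                scheduler.append(value)
    return storage
```

Calling engine(["page1"]) downloads page1, stores its two items, schedules page2 because the spider asked to follow it, and stops once the scheduler queue is empty.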
Attention! The whole program stops only when there are no requests left in the scheduler. This means that a URL that fails to download is scheduled to be downloaded again (Scrapy's retry middleware limits how many times a request is retried).
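How failed downloads are retried can be adjusted in settings.py through Scrapy's retry settings. The fragment below is a sketch; the setting names are Scrapy's, and the values shown are illustrative assumptions.

```python
# settings.py (fragment): retry behaviour for failed downloads.
# The values are illustrative.

RETRY_ENABLED = True  # let the retry middleware re-schedule failed requests
RETRY_TIMES = 2       # maximum retries per request, after the first attempt
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]  # response codes that trigger a retry
```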