What is Scrapy?
- A crawler framework built on top of Twisted, the asynchronous networking framework
- A crawler framework implemented in pure Python
- Basic structure: the "5+2" architecture, i.e. 5 components plus 2 middlewares
The 5 components:
- Scrapy Engine: the engine. It handles the communication, signal passing, and data transfer among the other components (Scheduler, Downloader, Spiders, Item Pipeline). This component is the "brain" of the crawler and the dispatch center of the whole framework.
- Scheduler: the scheduler. Essentially a request queue: it receives Request objects from the engine, enqueues them, and hands the next request back to the engine (which passes it to the Downloader) whenever the engine asks for one. Both the initial URLs and the follow-up URLs extracted from pages end up here waiting to be crawled. The scheduler also removes duplicate URLs automatically (deduplication can be disabled for specific requests when needed, for example for POST request URLs).
- Downloader: the downloader. It receives requests sent by the engine, downloads them, and returns the responses to the engine, which then passes them on to the Spiders for processing.
- Spiders: the parsers. They process all responses, analyze them, extract the data needed for the Item fields, and submit any follow-up URLs back to the engine, which sends them to the Scheduler again. Spiders are also where the entry (start) URLs are defined.
- Item Pipeline: the data pipeline, i.e. where deduplication and storage classes live. It post-processes the data produced by the Spiders: filtering, cleaning, storing, and so on. Once a page has been parsed and the required data has been put into an Item, the item is sent to the pipeline, processed by the configured pipeline classes in a specific order, and finally stored in a local file or a database.
The 2 middlewares:
- Downloader Middlewares: download middleware, a component for customizing and extending the download behavior. It is a specific hook between the engine and the downloader and processes the responses the Downloader sends back to the engine. Downloader middleware can, for example, rotate the crawler's User-Agent or proxy IP automatically; see the sketch after this list.
- Spider Middlewares: spider middleware, a specific hook between the engine and the Spider. It processes the spider's input (responses) and output (items and requests). It is used to customize and extend the communication between the engine and the Spider, i.e. to extend Scrapy by inserting custom code.
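As an illustration of the downloader-middleware hook, here is a minimal sketch of a middleware that rotates the User-Agent header on every request. The class name RandomUserAgentMiddleware and the USER_AGENT_LIST values are hypothetical examples, not part of any project in this article; such a class would be registered through DOWNLOADER_MIDDLEWARES in settings.py.

import random

# Hypothetical example list; replace with real browser User-Agent strings
USER_AGENT_LIST = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        # Called for every request passing through the downloader middleware;
        # returning None lets Scrapy continue handling the request normally
        request.headers["User-Agent"] = random.choice(USER_AGENT_LIST)
        return None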
Scrapy documentation (Chinese): https://www.osgeo.cn/scrapy/topics/spider-middleware.html
Installing the Scrapy framework
In a cmd window, install with pip:
pip install scrapy
Common problems during installation
"Cannot find win32api module" (common on Windows); fix it with:
pip install pypiwin32
Creating a Scrapy crawler project
Create a new project
scrapy startproject <project name>
Example:
scrapy startproject tubatu_scrapy_project
Project directory (a sketch of the generated layout follows this list)
- scrapy.cfg: the project's configuration file; it defines the path of the project settings module and other deployment information
  - [settings]: the path of the project settings module, i.e. the ./tubatu_scrapy_project/settings file
  - [deploy]: deployment information
- items.py: where the item data structures are defined, i.e. which fields we want to scrape; all item definitions can be put in this file
- pipelines.py: the project's pipeline file, i.e. the data-processing pipeline; storage and cleaning logic goes here, for example the logic for writing the data to a JSON file
- settings.py: the project's settings file, holding global settings such as the crawler's USER_AGENT; common options include:
  - ROBOTSTXT_OBEY: whether to obey the robots.txt protocol; usually set to False for crawling projects like this one
  - CONCURRENT_REQUESTS: maximum number of concurrent requests (default 16)
  - COOKIES_ENABLED: whether cookies are enabled (enabled by default)
  - DOWNLOAD_DELAY: download delay between requests
  - DEFAULT_REQUEST_HEADERS: default request headers
  - SPIDER_MIDDLEWARES: enables and orders spider middlewares
  - DOWNLOADER_MIDDLEWARES: enables and orders downloader middlewares
  - For other options, see the official documentation
- spiders directory: contains the individual spiders; the parsing rules, i.e. the crawler parsers, are written in this directory
- middlewares.py: where the SpiderMiddleware and DownloaderMiddleware rules are defined, e.g. customizing requests, custom data handling, proxy access, and so on
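For reference, a sketch of the layout that scrapy startproject tubatu_scrapy_project generates (the comments summarize the descriptions above):

tubatu_scrapy_project/
    scrapy.cfg                     # project configuration / deployment file
    tubatu_scrapy_project/
        __init__.py
        items.py                   # item (field) definitions
        middlewares.py             # spider / downloader middlewares
        pipelines.py               # item pipelines (storage, cleaning)
        settings.py                # project settings
        spiders/
            __init__.py            # individual spiders are added here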
Generating a spider template file automatically
cd into the spiders directory and run the following command to generate a spider file:
scrapy genspider <filename> <domain to crawl>
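For example, to generate the spider used in the sample project below (spider name and start domain taken from that example):
scrapy genspider tubatu xiaoguotu.to8to.com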
Running the crawler
Method 1: start from cmd
cd into the spiders directory and run the following command to start the crawler:
scrapy crawl <spider name>
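For example, for the spider defined below (name = 'tubatu'):
scrapy crawl tubatu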
Method 2: start from a .py file
Create a main.py file under the project root, put the startup script in it, and run main.py to start the crawler. A code example follows:
Code: the spider file
import scrapy

class TubatuSpider(scrapy.Spider):
    # The name must be unique within the project
    name = 'tubatu'
    # Domains the spider is allowed to crawl
    allowed_domains = ['xiaoguotu.to8to.com']
    # URLs requested when the spider starts
    start_urls = ['https://xiaoguotu.to8to.com/pic_space1?page=1']

    # The default parse method
    def parse(self, response):
        print(response.text)
Code: the startup file
from scrapy import cmdline

# Created inside the scrapy project to make it easier to run the project
# Use cmdline.execute() to run the start command: scrapy crawl <spider name>
# execute() expects each part of the command as a separate string, e.g.
# cmdline.execute(['scrapy', 'crawl', 'tubatu']); if the command is a single
# string, it must first be split with split()
cmdline.execute("scrapy crawl tubatu".split())
Code: running result
Sample project
Crawl listing information from the Tubatu (to8to) home-decoration website and store the scraped data in a local MongoDB database.
The figure below shows the project layout; the files highlighted in blue contain the code for this example.
tubatu.py
import scrapy
from tubatu_scrapy_project.items import TubatuScrapyProjectItem
import re

class TubatuSpider(scrapy.Spider):

    # The name must be unique within the project
    name = 'tubatu'
    # Domains the spider is allowed to crawl; URLs outside them are not followed
    allowed_domains = ['xiaoguotu.to8to.com', 'wx.to8to.com', 'sz.to8to.com']
    # URLs requested when the spider starts
    start_urls = ['https://xiaoguotu.to8to.com/pic_space1?page=1']

    # The default parse method
    def parse(self, response):
        # response can be used with xpath() directly
        # response is essentially an HTML object
        pic_item_list = response.xpath("//div[@class='item']")
        for item in pic_item_list[1:]:
            info = {}
            # Note the leading dot: it makes the xpath relative to the current item.
            # xpath() alone does not return the text() content; it returns a SelectorList
            # that still needs extracting, e.g.:
            # [<Selector xpath='.//div/a/text()' data='Get a design plan for 0 yuan, limited places'>]
            # <class 'scrapy.selector.unified.SelectorList'>
            # content_name = item.xpath('.//div/a/text()')

            # Use extract() to get the data of every matching selector; it returns a list
            # content_name = item.xpath('.//div/a/text()').extract()

            # Use extract_first() to get the first match; it returns a str
            # Get the project name and project data
            info['content_name'] = item.xpath(".//a[@target='_blank']/@data-content_title").extract_first()

            # Get the project URL
            info['content_url'] = "https:" + item.xpath(".//a[@target='_blank']/@href").extract_first()

            # Project id
            content_id_search = re.compile(r"(\d+)\.html")
            info['content_id'] = str(content_id_search.search(info['content_url']).group(1))

            # Use yield to send an asynchronous request via scrapy.Request(), which also
            # accepts cookies and other arguments (see its signature).
            # callback takes the method itself; do not call the method here
            yield scrapy.Request(url=info['content_url'], callback=self.handle_pic_parse, meta=info)

        if response.xpath("//a[@id='nextpageid']"):
            now_page = int(response.xpath("//div[@class='pages']/strong/text()").extract_first())
            next_page_url = "https://xiaoguotu.to8to.com/pic_space1?page=%d" % (now_page + 1)
            yield scrapy.Request(url=next_page_url, callback=self.parse)

    def handle_pic_parse(self, response):
        tu_batu_info = TubatuScrapyProjectItem()
        # Picture URL
        tu_batu_info["pic_url"] = response.xpath("//div[@class='img_div_tag']/img/@src").extract_first()
        # Nickname
        tu_batu_info["nick_name"] = response.xpath("//p/i[@id='nick']/text()").extract_first()
        # Picture name
        tu_batu_info["pic_name"] = response.xpath("//div[@class='pic_author']/h1/text()").extract_first()
        # Project name
        tu_batu_info["content_name"] = response.request.meta['content_name']
        # Project id
        tu_batu_info["content_id"] = response.request.meta['content_id']
        # Project URL
        tu_batu_info["content_url"] = response.request.meta['content_url']
        # yield the item to the pipelines; the pipeline must be enabled in settings.py,
        # otherwise it will not receive the item
        yield tu_batu_info
items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class TubatuScrapyProjectItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # Decoration project name
    content_name = scrapy.Field()
    # Decoration project id
    content_id = scrapy.Field()
    # Request URL
    content_url = scrapy.Field()
    # Nickname
    nick_name = scrapy.Field()
    # Picture URL
    pic_url = scrapy.Field()
    # Picture name
    pic_name = scrapy.Field()
pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

from pymongo import MongoClient

class TubatuScrapyProjectPipeline:

    def __init__(self):
        client = MongoClient(host="localhost",
                             port=27017,
                             username="admin",
                             password="123456")
        mydb = client['db_tubatu']
        self.mycollection = mydb['collection_tubatu']

    def process_item(self, item, spider):
        data = dict(item)
        self.mycollection.insert_one(data)
        return item
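To quickly check that the pipeline actually wrote the data, a small pymongo query can be run separately; this is a sketch that assumes the same connection settings, database, and collection names as in the pipeline above.

from pymongo import MongoClient

client = MongoClient(host="localhost", port=27017,
                     username="admin", password="123456")
collection = client["db_tubatu"]["collection_tubatu"]
# Print how many documents were stored and show one sample record
print(collection.count_documents({}))
print(collection.find_one())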
settings.py
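settings.py for this project would contain roughly the following. This is a sketch assembled from the options discussed earlier; the User-Agent string is illustrative, and ITEM_PIPELINES must reference the pipeline class defined above.

BOT_NAME = "tubatu_scrapy_project"

SPIDER_MODULES = ["tubatu_scrapy_project.spiders"]
NEWSPIDER_MODULE = "tubatu_scrapy_project.spiders"

# Do not obey robots.txt (as discussed above)
ROBOTSTXT_OBEY = False

# Wait between requests (seconds) to reduce load on the site
DOWNLOAD_DELAY = 1

# Illustrative default request headers
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
}

# Enable the MongoDB pipeline defined in pipelines.py
ITEM_PIPELINES = {
    "tubatu_scrapy_project.pipelines.TubatuScrapyProjectPipeline": 300,
}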
main.py
from scrapy import cmdline

# Created inside the scrapy project to make it easier to run the project
# Use cmdline.execute() to run the start command: scrapy crawl <spider name>
# execute() expects each part of the command as a separate string, e.g.
# cmdline.execute(['scrapy', 'crawl', 'tubatu']); if the command is a single
# string, it must first be split with split()
cmdline.execute("scrapy crawl tubatu".split())