Scrapy crawler framework
2022-07-27 16:05:00 【fresh_nam】
Preface
Scrapy is a very easy-to-use crawler framework. Its asynchronous processing lets us fetch the content we want faster. Now let's use it to crawl some pictures.
One. The goal
I want to crawl some pictures. After searching around online, I decided on http://www.mmonly.cc/mmtp/, which has a lot of nice pictures; the goal is to crawl all the pictures linked from its home page.
Two. Usage steps
1. Install scrapy
Install it with pip:
pip install scrapy
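To check that the installation succeeded, the scrapy command-line tool can print its version:
scrapy version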
2. Create project
Open cmd in the folder where you want to create the crawler project and run the following command:
scrapy startproject scrapy_demo
This creates a folder named scrapy_demo that contains the crawler project files. Opening it in PyCharm, its directory structure looks like this:
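For reference, a project generated by scrapy startproject typically has a layout like this (a sketch of the standard structure, not a screenshot of my machine):
scrapy_demo/
    scrapy.cfg            # deploy configuration file
    scrapy_demo/          # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where the spiders live
            __init__.py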
Then cd into the project folder and run the following command:
scrapy genspider MMspider www.mmonly.cc/mmtp/
Here MMspider is the spider name, which is used when starting the crawler, and the address after it is the site to crawl. After the command runs, a py file named MMspider.py appears in the spiders folder; the crawler logic will be written there later. The generated allowed_domains value needs to be changed, and the result looks like this:
MMspider.py
# -*- coding: utf-8 -*-
import scrapy


class MmspiderSpider(scrapy.Spider):
    name = 'MMspider'  # spider name
    allowed_domains = ['mmonly.cc']  # domains the spider is allowed to crawl
    start_urls = ['http://www.mmonly.cc/mmtp//']  # starting URL the spider requests first

    def parse(self, response):
        pass

3. Analyze the structure of the web page
To crawl a web page, first understand its structure:

The target site has many listing pages, every listing page contains many albums, each album contains multiple pictures, and each picture sits on its own page. So to get all the album pictures, the crawler should be designed like this: get the links to all the albums on the listing page, follow each link and fetch every picture of that album, then check whether the listing page has a next page; if it does, keep crawling until the pictures of every album on every listing page have been fetched.
4. Write the crawler
First, define the information to crawl in items.py:
items.py
import scrapy


class ScrapyDemoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    siteURL = scrapy.Field()    # URL of the picture page
    detailURL = scrapy.Field()  # URL of the original picture
    title = scrapy.Field()      # name of the picture series
    fileName = scrapy.Field()   # full path, including the file name, used to save the picture
    path = scrapy.Field()       # directory where the picture series is stored
Then write the spider that parses the pages:
MMspider.py
import os
import datetime

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from scrapy_demo.items import ScrapyDemoItem


# Inherit from CrawlSpider so we can use rule-based link extraction
class MmspiderSpider(CrawlSpider):
    name = 'MMspider'  # spider name
    base = 'D:\\python Grab pictures\\MMspider\\'  # base directory for saving pictures
    allowed_domains = ['mmonly.cc']  # domains the spider is allowed to crawl
    start_urls = ['http://www.mmonly.cc/mmtp//']

    # Rules for the listing pages: keep following the "next page" link,
    # and send every album link that matches the pattern to parse_item
    rules = (
        Rule(LinkExtractor(allow=(r'https://www.mmonly.cc/(.*?).html'),
                           restrict_xpaths=("//div[@class='ABox']")),
             callback="parse_item", follow=False),
        # '下一页' is the "next page" link text on the site
        Rule(LinkExtractor(allow=(''),
                           restrict_xpaths=(u"//a[contains(text(),'下一页')]")),
             follow=True),
    )

    def parse_item(self, response):
        item = ScrapyDemoItem()
        item['siteURL'] = response.url
        item['title'] = response.xpath('//h1/text()').extract_first()  # album title parsed with XPath
        item['path'] = self.base + item['title']  # pictures from the same album share one directory
        path = item['path']
        if not os.path.exists(path):
            os.makedirs(path)  # create the directory if it does not exist yet
        item['detailURL'] = response.xpath('//a[@class="down-btn"]/@href').extract_first()  # URL of the original picture
        num = response.xpath('//span[@class="nowpage"]/text()').extract_first()  # index of the current picture in the album
        item['fileName'] = item['path'] + '/' + str(num) + '.jpg'  # full file name for the picture
        print(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'), item['fileName'], u'parsed successfully!')
        yield item
        try:
            # Total number of pictures in the album; the "totalpage" class is an assumption
            # based on the site's pager markup and may need adjusting
            total_page = response.xpath('//span[@class="totalpage"]/text()').extract_first()
            # If this is not the last picture of the album, request the next picture page
            if num != total_page:
                next_page = response.xpath(u"//a[contains(text(),'下一页')]/@href").extract_first()
                if next_page is not None:
                    next_page = response.urljoin(next_page)
                    yield scrapy.Request(next_page, callback=self.parse_item)
        except Exception:
            pass
Does this block of code look confusing? Don't worry, let me explain it step by step.
rules holds the crawling rules I defined, and there are two of them. The first one extracts, from the div tags with class 'ABox' on each listing page, the links matching the regular expression 'https://www.mmonly.cc/(.*?).html', which gives us the link to every album on the page.
callback="parse_item" means the requested page is parsed by the parse_item() function.
The second rule fetches the next listing page: it finds the <a> tags whose text is '下一页' (next page), requests the corresponding link, and so moves the crawler on to the next page.
follow is a boolean that specifies whether links extracted from the response by this rule should be followed further. If callback is None, follow defaults to True; otherwise it defaults to False.
XPath is a query language for locating elements in HTML, and it lets us pinpoint an element quickly. For example, response.xpath('//a[@class="down-btn"]/@href').extract_first() gets the content of the href attribute of the first <a> tag whose class is down-btn.
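As a minimal standalone sketch (using a made-up HTML snippet rather than the real page), the same kind of selector can be tried out with Scrapy's Selector class:
from scrapy.selector import Selector

html = '<div><a class="down-btn" href="https://example.com/pic.jpg">download</a></div>'
sel = Selector(text=html)
print(sel.xpath('//a[@class="down-btn"]/@href').extract_first())  # prints https://example.com/pic.jpg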
The code that downloads the images lives in the pipeline file. It uses the requests library to request the picture links and save the pictures, so first install requests:
pip install requests
The code is as follows:
pipelines.py
import datetime

import requests


class ScrapyDemoPipeline(object):
    def process_item(self, item, spider):
        detailURL = item['detailURL']  # URL of the original picture
        fileName = item['fileName']    # full path where the file is saved
        try:
            print(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'), u'saving picture:', detailURL)
            print(u'file:', fileName)
            image = requests.get(detailURL)  # download the picture from the URL parsed into the item
            with open(fileName, 'wb') as f:  # open the target file in binary mode
                f.write(image.content)       # write the picture data
            print(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'), fileName, u'saved successfully!')
        except Exception as e:
            print(fileName, 'other fault:', e)
        return item
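One small note on this design: requests.get will wait indefinitely if the server stalls. If that ever becomes a problem, a timeout can be passed, for example (a sketch, not part of the original code):
image = requests.get(detailURL, timeout=30)  # give up after 30 seconds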
Finally, the pipeline also needs to be enabled in the settings:
settings.py
# Obey robots.txt rules
ROBOTSTXT_OBEY = False  # change the default True to False

# Uncomment the item pipelines section
ITEM_PIPELINES = {
    'scrapy_demo.pipelines.ScrapyDemoPipeline': 300,
}
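As an aside, Scrapy also ships with a built-in ImagesPipeline that handles downloading, deduplication and retries by itself; it is an alternative to the hand-written pipeline above, but it expects the item to carry an image_urls list rather than the fields defined earlier, and it needs Pillow installed. A minimal sketch of that alternative configuration, assuming the item were changed accordingly:
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = r'D:\python Grab pictures\MMspider'  # root directory for downloaded images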
Finally, run:
scrapy crawl MMspider
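Optionally, the standard scrapy CLI can also dump the parsed items to a file or suppress the log output, for example:
scrapy crawl MMspider -o items.json
scrapy crawl MMspider --nolog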
The crawler now runs, and we're done!
The pictures are saved as well:

Summary
Scrapy is a very powerful framework, and only part of its features are used here. I made some modifications along the way to get the crawler running, and some of those steps may not be written down above; if you run into problems, please leave a message in the comment section.