Crawler scrapy framework learning 2
2022-06-13 04:32:00 【gohxc】
Extract the data
The best way to learn how to extract data with Scrapy is by trying selectors in the Scrapy shell:
scrapy shell "http://quotes.toscrape.com/page/1/"Use shell, You can try CSS And response object selection elements :
>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
>>> response.css('title::text').extract()
['Quotes to Scrape']

The result of running response.css('title') is a list-like object called SelectorList, which represents a list of Selector objects wrapping XML/HTML elements.
Notice that we added ::text to the CSS query. This means we want to select only the text directly inside the element; if we don't specify ::text, we would get the full title element, including its tags.
The result of calling .extract() is a list, because we are dealing with a SelectorList instance. When you know you only want the first result, use extract_first():
response.css('title::text').extract_first()
response.css('title::text')[0].extract()
These two are equivalent. However, when it cannot find any element matching the selection, .extract_first() avoids an IndexError and returns None instead.

Besides extract() and extract_first(), you can also use the re() method to extract with regular expressions:
>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']

XPath: a brief introduction
Besides CSS, Scrapy selectors also support using XPath expressions:
>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').extract_first()
'Quotes to Scrape'

XPath expressions are very powerful, and they are the foundation of Scrapy selectors. In fact, CSS selectors are converted to XPath under the hood. If you look closely at the text representation of the selector objects in the shell, you can see this.
Using XPath, you can select things that CSS cannot, such as a link that contains the text "Next Page". This makes XPath very well suited for scraping tasks, and we encourage you to learn XPath even if you already know how to construct CSS selectors; it will make scraping much easier.
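As a minimal sketch of that idea in the shell (assuming the pagination link is labeled "Next", as it is on quotes.toscrape.com):

>>> response.xpath('//a[contains(text(), "Next")]/@href').extract_first()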
Putting it all together:
$ scrapy shell 'http://quotes.toscrape.com'
>>> quote = response.css("div.quote")[0]
>>> title = quote.css("span.text::text").extract_first()
>>> tags = quote.css("div.tags a.tag::text").extract()

Extracting data in our spider
Up to now, our spider does not extract any data in particular; it just saves the whole HTML page to a local file. Let's integrate the extraction logic above into our spider:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

To store the scraped output, run:
scrapy crawl quotes -o quotes.json

For historical reasons, Scrapy appends to a given file instead of overwriting its contents.
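If you would rather overwrite the file on each run, newer Scrapy releases (2.0 and later - this assumes a reasonably recent installation) support a capital -O flag for exactly that:

scrapy crawl quotes -O quotes.json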
You can also use other formats, such as JSON Lines:

scrapy crawl quotes -o quotes.jl

Get the links in the page
response.css('li.next a::attr(href)').extract_first()

Now let's see our spider modified to recursively follow links to the next page, extracting data from each one:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

A shortcut for creating Requests
As a shortcut for creating Request objects, you can use response.follow:
next_page = response.css('li.next a::attr(href)').extract_first()
if next_page is not None:
    yield response.follow(next_page, callback=self.parse)

Unlike scrapy.Request, response.follow supports relative URLs directly - there is no need to call urljoin. Note that response.follow just returns a Request instance; you still have to yield this Request.
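In newer Scrapy releases (2.0 and later - again an assumption about your installed version), response.follow_all can create several Requests at once from a CSS query; a minimal sketch:

# follow every matching pagination link in one call (Scrapy 2.0+)
yield from response.follow_all(css='li.next a', callback=self.parse)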
You can also pass a selector to response.follow instead of a string; this selector should extract the necessary attributes:
for href in response.css('li.next a::attr(href)'):
    yield response.follow(href, callback=self.parse)

For <a> elements there is a shortcut: response.follow uses their href attribute automatically. So the code can be shortened further:
for a in response.css('li.next a'):
    yield response.follow(a, callback=self.parse)

More examples and patterns
import scrapy

class AuthorSpider(scrapy.Spider):
    name = 'author'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # follow pagination links
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).extract_first().strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

This spider starts at the main page and follows every link to an author page, calling the parse_author callback for each of them, as well as the pagination links with the parse callback we saw earlier.
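One detail worth noting: extract_first() returns None when nothing matches, so the .strip() call above would raise an AttributeError on a missing element. A more defensive variant passes a default value (default is a standard extract_first() argument):

def extract_with_css(query):
    # fall back to an empty string when the query matches nothing
    return response.css(query).extract_first(default='').strip()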
Another interesting thing this spider demonstrates is that even if there are many quotes by the same author, we don't need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicate requests to URLs it has already visited, avoiding the problem of hammering servers because of a programming mistake.
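If you ever need to re-fetch a URL deliberately, that filter can be switched off per request; a minimal sketch using the standard dont_filter argument:

# bypass the built-in duplicate filter for this one request
yield scrapy.Request(url, callback=self.parse, dont_filter=True)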
Use spider parameters
You can provide command-line arguments to your spider by using the -a option when running it:

scrapy crawl quotes -o quotes-humor.json -a tag=humor
These arguments are passed to the Spider's __init__ method and become spider attributes by default.
In this example, the value provided for the tag argument will be available via self.tag. You can use this to make your spider fetch only quotes with a specific tag, building the URL based on the argument:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

If you pass the tag=humor argument to this spider, you will notice that it only visits URLs from the humor tag, such as http://quotes.toscrape.com/tag/humor.
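If you prefer an explicit attribute over getattr, the argument can also be captured in __init__; a minimal sketch (tag matches the name passed with -a above):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def __init__(self, tag=None, *args, **kwargs):
        # -a tag=humor arrives here as a keyword argument
        super().__init__(*args, **kwargs)
        self.tag = tag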