Crawler scrapy framework learning 2
2022-06-13 04:32:00 【gohxc】
Extract the data
The best way to learn how to extract data with Scrapy is to try out selectors in the Scrapy shell:
scrapy shell "http://quotes.toscrape.com/page/1/"
Using the shell, you can try selecting elements with CSS on the response object:
>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
>>> response.css('title::text').extract()
['Quotes to Scrape']
The result of running response.css('title') is a list-like SelectorList object, which represents a list of Selector objects wrapping XML/HTML elements.
Note that we appended ::text to the CSS query, meaning we want to select only the text directly inside the title element. If we don't specify ::text, we get the complete title element, including its tags.
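For example, dropping ::text returns the whole element as markup (a quick check in the same shell session):
>>> response.css('title').extract()
['<title>Quotes to Scrape</title>']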
Calling .extract() returns a list, because we are dealing with a SelectorList instance. When you know you only want the first result, you can use extract_first():
response.css('title::text').extract_first()
response.css('title::text')[0].extract()
These two lines are equivalent. However, when no element matches the selection, .extract_first() avoids an IndexError and returns None instead.
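You can see the difference with a selector that matches nothing (nosuchtag below is a made-up element name that does not exist on the page):
>>> response.css('nosuchtag::text').extract_first()   # returns None, no error
>>> response.css('nosuchtag::text')[0].extract()      # raises IndexError
Traceback (most recent call last):
  ...
IndexError: list index out of range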
Besides extract() and extract_first(), you can also use the re() method to extract with regular expressions:
>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']
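Selector lists also offer a re_first() shortcut that returns only the first match (a quick shell check):
>>> response.css('title::text').re_first(r'Q\w+')
'Quotes'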
XPath: Brief introduction
Besides CSS, Scrapy selectors also support XPath expressions:
>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').extract_first()
'Quotes to Scrape'
XPath expressions are very powerful, and they are the foundation of Scrapy selectors. In fact, CSS selectors are converted to XPath under the hood; you can see this if you read the text representation of the selector objects in the shell closely.
With XPath you can select things CSS cannot, for example: select the link that contains the text "Next Page" (see the sketch below). This makes XPath very well suited to scraping tasks, and we encourage you to learn XPath even if you already know how to construct CSS selectors; it will make scraping much easier.
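A minimal sketch of that idea in the shell, assuming the pagination link on this site contains the visible text "Next" (worth verifying on the actual page):
>>> response.xpath('//a[contains(text(), "Next")]/@href').extract_first()
'/page/2/'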
Putting it all together:
$ scrapy shell 'http://quotes.toscrape.com'
>>> quote = response.css("div.quote")[0]
>>> title = quote.css("span.text::text").extract_first()
>>> tags = quote.css("div.tags a.tag::text").extract()
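Having figured out how to extract each field, we can iterate over all the quote elements and combine them into Python dictionaries (a sketch to paste into the same shell session):
>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").extract_first()
...     author = quote.css("small.author::text").extract_first()
...     tags = quote.css("div.tags a.tag::text").extract()
...     print(dict(text=text, author=author, tags=tags))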
Extracting data in our spider
Up to now, our spider doesn't extract any data in particular; it just saves the whole HTML page to a local file. Let's integrate the extraction logic above into our spider.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
To store the scraped data in a file, run:
scrapy crawl quotes -o quotes.json
For historical reasons, Scrapy appends to a given file instead of overwriting its contents.
You can also use other formats, for example JSON Lines:
scrapy crawl quotes -o quotes.jl
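JSON Lines is stream-like: each item sits on its own line, so appending new runs keeps the file valid and you can process it without loading everything into memory. A minimal sketch of reading it back (assuming a quotes.jl produced by the command above):
import json

with open('quotes.jl') as f:
    for line in f:
        item = json.loads(line)  # one scraped item per line
        print(item['author'])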
Following links in pages
First, we extract the link to the next page:
response.css('li.next a::attr(href)').extract_first()
Now let's modify our spider so that it recursively follows the link to the next page, extracting data from each page:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
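Here response.urljoin() builds an absolute URL from the (possibly relative) href, using the URL of the current response as the base; an already absolute URL passes through unchanged. A quick shell check (the outputs assume the response came from quotes.toscrape.com):
>>> response.urljoin('/page/2/')
'http://quotes.toscrape.com/page/2/'
>>> response.urljoin('http://example.com/other')
'http://example.com/other'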
A shortcut for creating Requests
As a shortcut for creating Request objects, you can use response.follow:
next_page = response.css('li.next a::attr(href)').extract_first()
if next_page is not None:
    yield response.follow(next_page, callback=self.parse)
Unlike scrapy.Request, response.follow supports relative URLs directly, so there is no need to call urljoin. Note that response.follow just returns a Request instance; you still have to yield that Request.
You can also pass a selector to response.follow instead of a string; this selector should extract the necessary attributes:
for href in response.css('li.next a::attr(href)'):
    yield response.follow(href, callback=self.parse)
For <a> elements there is a further shortcut: response.follow uses their href attribute automatically, so the code can be shortened even more:
for a in response.css('li.next a'):
    yield response.follow(a, callback=self.parse)
More examples and patterns
import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # follow pagination links
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).extract_first().strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }
This spider starts from the main page and follows all links to author pages, calling the parse_author callback for each of them, as well as the pagination links with the parse callback, as we saw earlier.
Another interesting thing this spider demonstrates is that, even when there are many quotes by the same author, we don't need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicate requests to URLs it has already visited, which avoids hammering servers because of a programming mistake.
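If you ever do want to revisit a URL, Request accepts a dont_filter flag that bypasses this duplicate filter; a minimal sketch (the URL is purely illustrative):
# fetch a page again even though it was already crawled in this run
yield scrapy.Request('http://quotes.toscrape.com/page/1/',
                     callback=self.parse,
                     dont_filter=True)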
Using spider arguments
You can provide command line arguments to your spider by using the -a option when running it:
scrapy crawl quotes -o quotes-humor.json -a tag=humor
These arguments are passed to the spider's __init__ method and become spider attributes by default.
In this example, the value provided for the tag argument is available via self.tag. You can use this to make your spider fetch only quotes with a specific tag, building the URL based on the argument:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
If you pass the tag=humor argument to this spider, you will notice that it only visits URLs under the humor tag, such as http://quotes.toscrape.com/tag/humor.