Crawler scrapy framework learning 2
2022-06-13 04:32:00 【gohxc】
Extract the data
The best way to learn how to extract data with Scrapy is to try selectors in the Scrapy shell:
scrapy shell "http://quotes.toscrape.com/page/1/"
Once inside the shell, you can try selecting elements using CSS with the response object:
>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
>>> response.css('title::text').extract()
['Quotes to Scrape']
The result of running response.css('title') is a list-like object called SelectorList, which represents a list of Selector objects wrapping XML/HTML elements.
Note that we added ::text to the CSS query; this means we want to select only the text directly inside the <title> element. If we don't specify ::text, we get the complete title element, including its tags.
The result of calling .extract() is a list, because we are dealing with a SelectorList instance. When you know you only want the first result, use extract_first():
response.css('title::text').extract_first()
response.css('title::text')[0].extract()
These two lines are equivalent. However, when it cannot find any element matching the selection, .extract_first() avoids an IndexError and returns None.
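A minimal sketch of the difference, assuming a selector (h6 here) that matches nothing on this page:
>>> response.css('h6::text').extract_first()    # no match: returns None
>>> response.css('h6::text')[0].extract()       # no match: raises an exception
Traceback (most recent call last):
  ...
IndexError: list index out of range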
Besides extract() and extract_first(), you can also use the re() method to extract using regular expressions:
>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']
XPath: a brief introduction
Besides CSS, Scrapy selectors also support using XPath expressions:
>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').extract_first()
'Quotes to Scrape'
XPath expressions are very powerful and are the foundation of Scrapy selectors. In fact, CSS selectors are converted to XPath under the hood. If you look closely at the text representation of the selector objects in the shell, you can see this.
With XPath you can select things that CSS cannot, such as a link containing the text "Next Page". This makes XPath very well suited to scraping tasks, and we encourage you to learn it even if you already know how to construct CSS selectors; it will make scraping much easier.
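For example, a quick sketch of such a query in the shell; the exact link markup is an assumption about the quotes.toscrape.com page, and the output shown is indicative:
>>> response.xpath('//a[contains(text(), "Next")]/@href').extract_first()
'/page/2/'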
Putting it all together:
$ scrapy shell 'http://quotes.toscrape.com'
>>> quote = response.css("div.quote")[0]
>>> title = quote.css("span.text::text").extract_first()
>>> tags = quote.css("div.tags a.tag::text").extract()
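On the first page of quotes.toscrape.com, these variables should hold values roughly like the following (the first quote there is attributed to Albert Einstein; output abbreviated):
>>> title
'"The world as we have created it is a process of our thinking. ..."'
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']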
Extracting data in our spider
So far, the spider doesn't extract any data in particular; it just saves the whole HTML page to a local file. Let's integrate the extraction logic above into our spider:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
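If you run this spider with scrapy crawl quotes, each extracted item should show up in the log, roughly in this form (abbreviated; the exact log layout may differ between Scrapy versions):
[scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': '"The world as we have created it..."', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}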
The simplest way to store the scraped data is with a feed export:
scrapy crawl quotes -o quotes.json
This generates a quotes.json file containing all scraped items. For historical reasons, Scrapy appends to a given file instead of overwriting its contents.
You can also use other formats, for example JSON Lines:
scrapy crawl quotes -o quotes.jl
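Each line of a JSON Lines file is a standalone JSON object, which makes the format easy to append to and to process record by record. A hypothetical two-record quotes.jl could look like:
{"text": "The world as we have created it...", "author": "Albert Einstein", "tags": ["change", "deep-thoughts"]}
{"text": "It is our choices...", "author": "J.K. Rowling", "tags": ["abilities", "choices"]}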
Following links in the page
First, extract the link we want to follow, the one to the next page:
response.css('li.next a::attr(href)').extract_first()
Now let's see our spider, modified to recursively follow the link to the next page and extract data from each page:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
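The response.urljoin call builds an absolute URL from the (possibly relative) href extracted from the page. For instance, in the shell on the first page:
>>> response.urljoin('/page/2/')
'http://quotes.toscrape.com/page/2/'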
A shortcut for creating Requests
As a shortcut for creating Request objects, you can use response.follow:
next_page = response.css('li.next a::attr(href)').extract_first()
if next_page is not None:
    yield response.follow(next_page, callback=self.parse)
Unlike scrapy.Request, response.follow supports relative URLs directly, so there is no need to call urljoin. Note that response.follow just returns a Request instance; you still need to yield this Request.
You can also pass a selector to response.follow instead of a string; this selector should extract the necessary attributes:
for href in response.css('li.next a::attr(href)'):
    yield response.follow(href, callback=self.parse)
For <a> elements there is a shortcut: response.follow uses their href attribute automatically, so the code can be shortened further:
for a in response.css('li.next a'):
    yield response.follow(a, callback=self.parse)
More examples and patterns
import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # follow pagination links
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).extract_first().strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }
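One caveat about the extract_with_css helper above: extract_first() returns None when the query matches nothing, and calling .strip() on None raises AttributeError. extract_first() accepts a default parameter, so a slightly more defensive sketch of the helper is:
def extract_with_css(query):
    # fall back to an empty string when the query matches nothing
    return response.css(query).extract_first(default='').strip()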
This spider starts at the main page, follows all the links to author pages (calling the parse_author callback for each of them), and also follows the pagination links with the parse callback, as we saw before.
Another interesting thing this spider demonstrates is that, even when there are many quotes from the same author, we don't need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicate requests to URLs it has already visited, avoiding the problem of hammering servers because of a programming mistake.
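If you ever do need to revisit a URL on purpose, scrapy.Request accepts a dont_filter argument that bypasses this duplicate filter, for example:
# deliberately skip Scrapy's duplicate-request filter for this one request
yield scrapy.Request(next_page, callback=self.parse, dont_filter=True)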
Using spider arguments
You can provide command line arguments to your spider by using the -a option when running it:
scrapy crawl quotes -o quotes-humor.json -a tag=humor
These arguments are passed to the spider's __init__ method and become spider attributes by default.
In this example, the value provided for the tag argument is available via self.tag. You can use this to make your spider fetch only quotes with a specific tag, building the URL based on the argument:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
If you pass the tag=humor argument to this spider, you'll notice it only visits URLs under the humor tag, such as http://quotes.toscrape.com/tag/humor.