Site data collection: scrapy usage notes
2022-07-29 10:37:00 【c01dkit】
Preface
There are many ways to collect website data. For example, the most basic requests library can fetch page information in a few lines. Using selenium to simulate clicks on a web page can bypass many anti-crawling strategies, though the code is written quite differently from the other approaches. The scrapy framework lets you split the target cleanly, and its built-in concurrent request handling can fetch information very efficiently.
This article focuses on scrapy and summarizes its basic usage for future reference.
Configuration
With python and pip configured locally, run pip install scrapy to install scrapy.
Basic usage
Creating a new project
To use scrapy, first create a project on the command line with scrapy startproject <projectname>. For example, running scrapy startproject example generates an example folder, which typically contains scrapy.cfg plus a project package with items.py, middlewares.py, pipelines.py, settings.py, and a spiders/ directory.

Adding a target site
The command line will then prompt you to cd into the example directory and run scrapy genspider to create a spider. For example, running scrapy genspider example_spider example.com creates an example_spider.py file under the spiders directory; the crawler code is written in this file.
GET request
import scrapy


class ExampleSpiderSpider(scrapy.Spider):
    name = 'example_spider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        pass
POST request
import scrapy
import json


class ExampleSpiderSpider(scrapy.Spider):
    name = 'example_spider'
    allowed_domains = ['example.com']
    urls = [
        'https://example.com/page/1/',
        'https://example.com/page/2/',
    ]

    def start_requests(self):
        for url in self.urls:
            # Send a request with 'Content-Type': 'application/x-www-form-urlencoded'
            yield scrapy.FormRequest(
                url=url,
                formdata={'arg1': 'xxx', 'arg2': 'xxx'},
                callback=self.parse,
                meta={'arg1': 1, 'arg2': 2}
            )
            # Send a request with 'Content-Type': 'application/json'
            yield scrapy.Request(
                url=url,
                method='POST',
                body=json.dumps({'arg1': 'xxx', 'arg2': 'xxx'}),
                headers={'Content-Type': 'application/json'},
                callback=self.parse,
                meta={'arg1': 1, 'arg2': 2}
            )

    def parse(self, response):
        pass
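
If the POST endpoint returns JSON (as many public APIs do), the body can be decoded directly in parse. A minimal sketch of how the parse method in the spider above could be filled in, under the assumption that the target returns a JSON object; response.json() requires Scrapy 2.2+, on older versions use json.loads(response.text) instead:

    def parse(self, response):
        # Decode the JSON response body (Scrapy 2.2+)
        data = response.json()
        # The dictionary passed via meta in the request is available again here
        arg1 = response.meta.get('arg1')
        yield {'arg1': arg1, 'result': data}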
A few points to note:
- name is the spider's name (the spidername), which you specify when running the crawler.
- allowed_domains restricts which domains may be crawled; it can be omitted.
- start_urls lists the websites to crawl. At runtime a request is sent to each one and the response is handed to the parse function. If the target URLs need to be generated dynamically, delete the start_urls variable and add a start_requests(self) method that yields scrapy.Request(url=<targetwebsite>, callback=self.parse). When the crawler runs and finds no start_urls defined, it calls this method instead.
- scrapy.Request sends a GET request. You can add a cb_kwargs parameter, which accepts a dictionary; it can then be read in parse(self, response, **kwargs) through kwargs, giving you custom parameter passing. Alternatively, use the meta parameter and read the dictionary in parse via response.meta.
- scrapy.FormRequest sends a POST request. The request body goes in formdata, and the values should be strings. meta can be used to pass parameters (cb_kwargs may also work, but this is untested).
Here is an example adapted from the official documentation:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'https://quotes.toscrape.com/page/1/',
            'https://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse, cb_kwargs={'this_url': url})

    def parse(self, response, **kwargs):
        page = response.url.split("/")[-2]
        url = kwargs['this_url']
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')
Start crawling
Run scrapy crawl <spidername> in the outermost example directory to start crawling.
Taking the official documentation as an example:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
        'https://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
response.css extracts elements, and the selectors are largely self-explanatory. You can also get the raw page source through response.text.
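
As an illustration, here are a few common selector patterns against the quotes.toscrape.com markup used above. This is only a sketch for reference, not part of the original example; the spider name is made up for the illustration:

import scrapy


class SelectorDemoSpider(scrapy.Spider):
    # Hypothetical spider used only to illustrate selector patterns
    name = "selector_demo"
    start_urls = ['https://quotes.toscrape.com/page/1/']

    def parse(self, response):
        # ::text extracts an element's text content; get() returns the first match
        first_quote = response.css('span.text::text').get()
        # getall() returns every match as a list
        all_tags = response.css('a.tag::text').getall()
        # ::attr(href) extracts an attribute value, e.g. the next-page link
        next_page = response.css('li.next a::attr(href)').get()
        # response.text is the full page source as a string
        self.log(f'{first_quote} | {len(all_tags)} tags | next page: {next_page}')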
The parse function here yields a dictionary, so you can tell scrapy where to save the output when you run it: scrapy crawl <spidername> -O <output.jl> writes the results to a file for later processing. jl stands for JSON Lines, i.e. one JSON object per line, which can be handled in python by simply iterating over the file line by line and parsing each line with json. -O overwrites the output file, while -o appends to it. You can also add -L ERROR to suppress irrelevant log output at runtime.
For continuously crawling a public API, the jl format works wonders.
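
For the later processing step, a .jl file can be read line by line. A minimal sketch, assuming the output file is named output.jl (the filename is only an example) and contains items shaped like the quotes example above:

import json

with open('output.jl', 'r', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        item = json.loads(line)  # each line is one JSON object
        print(item.get('author'), item.get('text'))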