Site data collection: Scrapy usage notes
2022-07-29 10:37:00 【c01dkit】
Preface
There are many ways to collect website data. The most basic is requests, which can fetch page content in just a few lines. Simulating page clicks with selenium can bypass many anti-crawling measures, and the code reads quite differently from the other approaches. The scrapy framework lets you split the task into clear pieces, and its built-in asynchronous request scheduling makes collection very efficient.
This article focuses on scrapy and summarizes its basic usage for later reference.
Configuration
With python and pip already set up locally, run pip install scrapy to install scrapy.
Basic use
New project
To use scrapy, first create a project on the command line with scrapy startproject <projectname>. For example, running scrapy startproject example generates an example folder.
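As a rough sketch of what gets generated (the exact files can vary slightly between Scrapy versions):

    example/
        scrapy.cfg            # deploy/configuration file
        example/              # the project's Python module
            __init__.py
            items.py          # item definitions
            middlewares.py    # spider/downloader middlewares
            pipelines.py      # item pipelines
            settings.py       # project settings
            spiders/          # spiders live here
                __init__.py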

Add target site
The command output will also prompt you to enter the example directory and run scrapy genspider to create a spider. For example, running scrapy genspider example_spider example.com creates an example_spider.py file under spiders/. The spider's code is written in this file.
GET request
import scrapy


class ExampleSpiderSpider(scrapy.Spider):
    name = 'example_spider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        pass
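The generated parse is empty. As a minimal sketch of how it might be filled in (purely illustrative; example.com is just the placeholder domain from above), the callback could extract the page title from the GET response and follow any links it finds:

    import scrapy


    class ExampleSpiderSpider(scrapy.Spider):
        name = 'example_spider'
        allowed_domains = ['example.com']
        start_urls = ['http://example.com/']

        def parse(self, response):
            # Yield an item built from the response
            yield {'title': response.css('title::text').get()}
            # Follow links on the page; response.follow resolves relative URLs
            for href in response.css('a::attr(href)').getall():
                yield response.follow(href, callback=self.parse)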
POST request
import json

import scrapy


class ExampleSpiderSpider(scrapy.Spider):
    name = 'example_spider'
    allowed_domains = ['example.com']
    urls = [
        'https://example.com/page/1/',
        'https://example.com/page/2/',
    ]

    def start_requests(self):
        for url in self.urls:
            # Send a POST with 'Content-Type: application/x-www-form-urlencoded'
            yield scrapy.FormRequest(
                url=url,
                formdata={'arg1': 'xxx', 'arg2': 'xxx'},
                callback=self.parse,
                meta={'arg1': 1, 'arg2': 2},
            )
            # Send a POST with 'Content-Type: application/json'
            yield scrapy.Request(
                url=url,
                method='POST',
                body=json.dumps({'arg1': 'xxx', 'arg2': 'xxx'}),
                headers={'Content-Type': 'application/json'},
                callback=self.parse,
                meta={'arg1': 1, 'arg2': 2},
            )

    def parse(self, response):
        pass
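The parse stub above is also left empty. A hedged sketch of what it might look like when the POST endpoint returns JSON (the response handling here is an assumption for illustration, not part of the original code):

    import json

    import scrapy


    class ExampleSpiderSpider(scrapy.Spider):
        name = 'example_spider'
        # ... start_requests() as in the snippet above ...

        def parse(self, response):
            # Values passed via meta= on the request come back on response.meta
            arg1 = response.meta['arg1']
            arg2 = response.meta['arg2']
            # If the server answers with JSON, decode the body text
            data = json.loads(response.text)
            yield {'arg1': arg1, 'arg2': arg2, 'payload': data}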
A few things to note:
- name is the spider's name (the spidername); you pass this name when running the crawler.
- allowed_domains restricts which domains may be crawled; it can be omitted.
- start_urls lists the pages to crawl. At run time a request is sent to each of them in turn and the response is handed to the parse function. If the target URLs need to be generated dynamically, delete the start_urls variable and add a start_requests(self) method that yields scrapy.Request(url=<targetwebsite>, callback=self.parse) for each target; when the spider starts and finds no start_urls defined, this method is called instead.
- scrapy.Request sends a GET request. You can add a cb_kwargs parameter, which takes a dict that can be read in parse(self, response, **kwargs) through kwargs, giving you custom argument passing. You can also use the meta parameter and then read the dict in parse via response.meta.
- scrapy.FormRequest sends a POST request. The request body goes in formdata, and the values should be strings. meta can be used to pass parameters (cb_kwargs might work as well, but I have not tested it).
Here is an example adapted from the official documentation:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'https://quotes.toscrape.com/page/1/',
            'https://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse,
                                 cb_kwargs={'this_url': url})

    def parse(self, response, **kwargs):
        page = response.url.split("/")[-2]
        url = kwargs['this_url']  # value passed through cb_kwargs on the request
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')
Start crawling
Run scrapy crawl <spidername> in the outermost example directory to start crawling.
Take the example from the official documentation:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
        'https://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
response.css extracts elements with CSS selectors; its meaning is fairly self-explanatory. The raw text of the response can also be obtained through response.text.
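A few selector patterns that come up constantly, as a sketch written against the quotes page above (adjust the selectors to your own markup):

    def parse(self, response):
        # ::text selects node text, ::attr(name) selects an attribute value
        first_text = response.css('span.text::text').get()        # first match or None
        all_tags = response.css('div.tags a.tag::text').getall()  # list of all matches
        links = response.css('a::attr(href)').getall()
        # XPath is available as an alternative to CSS
        first_text_xpath = response.xpath('//span[@class="text"]/text()').get()
        yield {'text': first_text, 'tags': all_tags, 'links': links}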
Here the parse function yields a dictionary, and you can tell scrapy where to save the results at run time: scrapy crawl <spidername> -O <output.jl> writes the yielded items to a file for later processing. jl is JSON Lines, i.e. one JSON object per line, which can be processed in Python by iterating over the file line by line and decoding each line with json. -O overwrites the output file, while -o appends to it. You can also add -L ERROR to suppress irrelevant log output.
For continuously scraping a public API, the jl format works wonders.
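A minimal sketch of the line-by-line processing mentioned above, assuming the quotes spider was run with -O output.jl (filename and keys follow that example; adjust as needed):

    import json

    with open('output.jl', encoding='utf-8') as f:
        for line in f:
            item = json.loads(line)  # one JSON object per line
            print(item['author'], item['text'])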