Site data collection: scrapy usage notes
2022-07-29 10:37:00 【c01dkit】
Preface
There are many ways to collect website data. The most basic approach uses requests, which can fetch page content in just a few lines. Using selenium to simulate clicks in the browser can bypass many anti-crawling measures, although the coding approach differs from the other methods. The scrapy framework lets you split the crawling task up cleanly, and its built-in concurrency fetches information very efficiently.
This article focuses on scrapy and summarizes its basic usage for future reference.
Configuration
Once python and pip are configured locally, scrapy can be installed with pip install scrapy.
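As a quick sanity check, the installation can be verified by importing the package and printing the installed version:

import scrapy

# Print the installed scrapy version; any recent release should work for the examples below
print(scrapy.__version__)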
Basic usage
New project
To use scrapy, first create a project by running scrapy startproject <projectname> on the host command line. For example, running scrapy startproject example generates an example folder containing the project skeleton.

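The figure of the generated folder is not reproduced here; as a rough sketch, a project created with scrapy startproject example typically looks like this (file names follow the standard scrapy template):

example/
    scrapy.cfg            # deploy configuration file
    example/              # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where spiders are placed
            __init__.py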
Add a target site
The command-line output will also prompt you to enter the example directory and run scrapy genspider to create a spider. For example, after running scrapy genspider example_spider example.com, an example_spider.py file is created under spiders. The crawler code is written in this file.
GET request
import scrapy


class ExampleSpiderSpider(scrapy.Spider):
    name = 'example_spider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        pass
POST request
import json

import scrapy


class ExampleSpiderSpider(scrapy.Spider):
    name = 'example_spider'
    allowed_domains = ['example.com']
    urls = [
        'https://example.com/page/1/',
        'https://example.com/page/2/',
    ]

    def start_requests(self):
        for url in self.urls:
            # Send a 'Content-Type: application/x-www-form-urlencoded' request
            yield scrapy.FormRequest(
                url=url,
                formdata={'arg1': 'xxx', 'arg2': 'xxx'},
                callback=self.parse,
                meta={'arg1': 1, 'arg2': 2},
            )
            # Send a 'Content-Type: application/json' request
            yield scrapy.Request(
                url=url,
                method='POST',
                body=json.dumps({'arg1': 'xxx', 'arg2': 'xxx'}),
                headers={'Content-Type': 'application/json'},
                callback=self.parse,
                meta={'arg1': 1, 'arg2': 2},
            )

    def parse(self, response):
        pass
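As a minimal sketch of the receiving side (assuming the target endpoint actually returns a JSON object), the pass in the parse method above could be replaced with something like the following, which reads the values passed through meta and decodes the response body:

    def parse(self, response):
        # Values attached to the request via meta= come back on response.meta
        arg1 = response.meta.get('arg1')
        # For a JSON API, decode the body with the json module imported above
        data = json.loads(response.text)
        self.logger.info('arg1=%s, response keys=%s', arg1, list(data))
        yield data

Yielding the decoded dict lets scrapy's feed export (see the output options below) write it straight to an output file.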
A few things to note:
- name is the spider's name, i.e. the spidername you specify when running the crawler.
- allowed_domains restricts which domains may be crawled; it is optional.
- start_urls lists the URLs to crawl. At runtime a request is sent to each of them in turn and the responses are passed to the parse function. If the target URLs need to be generated dynamically, delete the start_urls variable and add a start_requests(self) method instead (it should yield scrapy.Request(url=<targetwebsite>, callback=self.parse) as its return values). When the crawler runs and finds that start_urls is not defined, this method is called.
- scrapy.Request is used to send GET requests. You can add a cb_kwargs parameter that accepts a dict, which can then be retrieved through kwargs in parse(self, response, **kwargs), allowing custom parameters to be passed along. Alternatively, use the meta parameter and read the dict back in parse via response.meta.
- scrapy.FormRequest is used to send POST requests. The request body goes in formdata, and the parameter values should be strings. meta can be used to pass parameters along (cb_kwargs may also work, but this was not tested).
Here is an example adapted from the official documentation:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'https://quotes.toscrape.com/page/1/',
            'https://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse, cb_kwargs={'this_url': url})

    def parse(self, response, **kwargs):
        page = response.url.split("/")[-2]
        url = kwargs['this_url']
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')
Start crawling
Run scrapy crawl <spidername> in the outermost example directory to start crawling.
Taking the official documentation as an example:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
        'https://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
response.css extracts elements with CSS selectors, and its meaning is largely self-evident. The raw page text can also be obtained through response.text.
The parse function here yields a dict, so an output file can be specified at run time: scrapy crawl <spidername> -O <output.jl> saves the items to a file for later processing. jl is jsonlines, i.e. one JSON object per line, which can be handled in python by simply iterating over the file line by line together with the json module. Here -O overwrites the output file, while -o appends to it. You can also add -L ERROR to suppress irrelevant log output at run time.
For continuously crawling public APIs, the jl format works wonders.
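As a small sketch of that post-processing step (the file name output.jl and the fields author/text are simply taken from the example above), the jsonlines output can be consumed like this:

import json

# Read the jsonlines file produced by: scrapy crawl quotes -O output.jl
with open('output.jl', encoding='utf-8') as f:
    for line in f:
        item = json.loads(line)
        print(item['author'], item['text'])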