Site data collection - scrapy usage notes
2022-07-29 10:37:00 【c01dkit】
Preface
There are many ways to collect data from websites. The most basic is requests, which can fetch a page in just a few lines of code. Using selenium to simulate clicks in a real browser can bypass many anti-scraping strategies, although the coding style differs from the other approaches. The scrapy framework lets you split the target into clear pieces, and its built-in concurrency fetches information very efficiently.
This article focuses on scrapy and summarizes its basic usage for later reference.
Configuration
Once python and pip are configured locally, run pip install scrapy to install scrapy.
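A quick sanity check, as a minimal sketch, is to import the package and print its version:

import scrapy

# if this import succeeds, scrapy is installed; print the version for reference
print(scrapy.__version__)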
Basic use
New project
To use scrapy, first create a project on the command line with scrapy startproject <projectname>. For example, running scrapy startproject example generates an example folder.
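The generated layout typically looks like this (a rough sketch; the exact files may vary slightly across Scrapy versions):

example/
    scrapy.cfg            # deploy configuration file
    example/              # the project's Python module
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/          # directory where spiders are placed
            __init__.py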

Add target site
The command output also prompts you to cd into the example directory and run scrapy genspider to create a spider. For example, running scrapy genspider example_spider example.com creates an example_spider.py file under spiders/. The crawler's code is written in this file.
GET request
import scrapy


class ExampleSpiderSpider(scrapy.Spider):
    name = 'example_spider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        pass
POST request
import json

import scrapy


class ExampleSpiderSpider(scrapy.Spider):
    name = 'example_spider'
    allowed_domains = ['example.com']
    urls = [
        'https://example.com/page/1/',
        'https://example.com/page/2/',
    ]

    def start_requests(self):
        for target in self.urls:
            # send a 'Content-Type: application/x-www-form-urlencoded' request
            yield scrapy.FormRequest(
                url=target,
                formdata={'arg1': 'xxx', 'arg2': 'xxx'},
                callback=self.parse,
                meta={'arg1': 1, 'arg2': 2},
            )
            # send a 'Content-Type: application/json' request
            yield scrapy.Request(
                url=target,
                method='POST',
                body=json.dumps({'arg1': 'xxx', 'arg2': 'xxx'}),
                headers={'Content-Type': 'application/json'},
                callback=self.parse,
                meta={'arg1': 1, 'arg2': 2},
            )

    def parse(self, response):
        pass
A few things to note:
- name is the spider's name (the spidername); you pass it on the command line when running the crawler.
- allowed_domains restricts the domains that may be crawled; it can be omitted.
- start_urls lists the sites to crawl; at runtime a request is sent to each of them and the responses are handed to the parse function. If the target URLs need to be generated dynamically, delete start_urls and add a start_requests(self) method that yields scrapy.Request(url=<targetwebsite>, callback=self.parse). When the crawler starts and finds no start_urls defined, it calls this method instead.
- scrapy.Request sends a GET request. You can attach a cb_kwargs parameter (a dict) and read it back in parse(self, response, **kwargs) via kwargs, which gives custom argument passing. Alternatively, pass a meta dict and read it in parse via response.meta (see the sketch below).
- scrapy.FormRequest sends a POST request; the request body goes in formdata, and its values should be strings. meta can be used to pass arguments here as well (cb_kwargs probably works too, but this was not tested).
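For the meta case, a minimal sketch of the callback (assuming the arg1/arg2 keys from the POST example above):

    def parse(self, response):
        # inside the spider class: the dict passed via meta= is available as response.meta
        arg1 = response.meta.get('arg1')
        arg2 = response.meta.get('arg2')
        self.log(f'{response.url}: arg1={arg1}, arg2={arg2}')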
Here is an example adapted from the official documentation:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'https://quotes.toscrape.com/page/1/',
            'https://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse, cb_kwargs={'this_url': url})

    def parse(self, response, **kwargs):
        page = response.url.split("/")[-2]
        url = kwargs['this_url']  # the URL passed in via cb_kwargs
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')
Start crawling
Run scrapy crawl <spidername> in the outermost example directory to start crawling.
Take the official documentation as an example:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://quotes.toscrape.com/page/1/',
        'https://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
response.css extracts elements with CSS selectors, and its meaning is largely self-explanatory. The raw page text can also be obtained via response.text.
Here the parse function yields dictionaries, so an output file can be specified at run time: scrapy crawl <spidername> -O <output.jl> saves the items to a file for later processing. .jl is JSON Lines, one JSON object per line, which is easy to handle in python by iterating over the file line by line together with the json module. -O overwrites the output file, while -o appends to it. You can also add -L ERROR to suppress irrelevant log output at runtime.
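For example, a short sketch for processing such a file, assuming the quotes spider above was run with scrapy crawl quotes -O quotes.jl:

import json

# each line of a .jl file is a standalone JSON object
with open('quotes.jl', encoding='utf-8') as f:
    for line in f:
        item = json.loads(line)
        print(item['author'], item['tags'])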
When repeatedly crawling a public API, the jl format works wonders.