Basic use of scrapy
2022-07-30 18:24:00 【Cold Lane(*_*)】
Install Scrapy directly from cmd (the command line).
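A minimal install command (assuming pip is available on your PATH; use pip3 if that is how Python 3's pip is named on your machine):

pip install scrapy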

After the installation finishes, open PyCharm.
Create a new project in the command line

Here demo01 is the project's name; it can be anything you like. Then press Enter.
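The project-creation command (using demo01 as the project name, as in the rest of this article) looks like this:

scrapy startproject demo01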

The project is now created. You could cd into the project directory, but here we simply open it in PyCharm through File instead.
File - Open - demo01

Click OK, then click This Window to open the project in the current window; after closing and reopening PyCharm it is still the same project. If you click New Window instead, a new window opens, and after a restart it is still the previous project.

Next, create a spider:

scrapy genspider baidu baidu.com

Here the first baidu is the spider's name, followed by the domain name. Once it is created, a new baidu.py appears inside the spiders directory. Now open this .py file.

name is the spider's name, allowed_domains holds the allowed domains, and start_urls is where the crawl starts; change it to the URL you want to start crawling from. The parse() method below is what parses the responses.
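The generated baidu.py looks roughly like the sketch below (what scrapy genspider produces; the exact template may differ slightly between Scrapy versions):

import scrapy

class BaiduSpider(scrapy.Spider):
    name = 'baidu'                      # spider name, used by: scrapy crawl baidu
    allowed_domains = ['baidu.com']     # requests outside this domain are filtered out
    start_urls = ['http://baidu.com/']  # the crawl starts from these URLs

    def parse(self, response):
        # parse the downloaded response here; yield items or further requests
        pass

Next, adjust the project's settings.py: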
# Scrapy settings for demo01 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'demo01' # name of the crawler project
SPIDER_MODULES = ['demo01.spiders']
NEWSPIDER_MODULE = 'demo01.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0' # the User-Agent can be changed; set it here or in DEFAULT_REQUEST_HEADERS below
# Obey robots.txt rules
ROBOTSTXT_OBEY = False # whether to obey the robots.txt protocol; set it to False for this crawler
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 8 # maximum number of concurrent requests, default is 16; lowered here
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1 # download delay: how long to wait between requests to the same site (reduces the crawl rate)
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False # cookies are enabled by default; uncomment this line to disable them
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
# default request headers, similar to the headers parameter of requests.get()
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0'
}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'demo01.middlewares.Demo01SpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'demo01.middlewares.Demo01DownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'demo01.pipelines.Demo01Pipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Back to baidu.py.

Then run the project from PyCharm's terminal with scrapy crawl baidu.
The first part of the output shows your Scrapy configuration, version, and so on.
After that, you can see that what gets extracted is a dictionary.
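For reference, a minimal spider whose parse() yields a dictionary could look like the sketch below; the fields title and url are illustrative, not taken from the original screenshots:

import scrapy

class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['baidu.com']
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        # yield a plain dict; Scrapy logs each yielded item during the crawl
        yield {
            'title': response.xpath('//title/text()').get(),
            'url': response.url,
        }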
Create a run.py here (for example, in the project's root directory).

Enter the following, then right-click the file and run it:

from scrapy import cmdline
cmdline.execute('scrapy crawl baidu'.split())
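Running the crawl through run.py like this starts the spider through PyCharm's normal Run/Debug configuration, so you can set breakpoints and debug the spider instead of launching it from the terminal each time.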

The output is simple and clear, the colors are different, and you can directly see the content you want to crawl.
Back to the earlier problem: run it again, then continue making changes from there.