Basic use of scrapy
2022-07-30 18:24:00 【Cold Lane(*_*)】
First, download and install Scrapy directly from cmd.
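The usual install command, assuming Python and pip are already on the PATH, is:

pip install scrapy

You can then run scrapy version in the same window to confirm that the install worked.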

Once it is installed, open PyCharm.
Create a new project in the command line
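The command is the standard startproject one (shown here with the demo01 name that the rest of this article uses):

scrapy startproject demo01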

Here demo01 is the project name; it can be anything you like. Then press Enter.

That's the project created. You could cd into the project directory, but here it is simpler to open it from PyCharm's File menu:
File - Open - demo01

Click OK and then This Window so the project opens in the current window; after closing and reopening PyCharm it will still be the same project. If you click New Window instead, the project opens in a new window, and after restarting you are back in the previous project.

Next, create a spider:
scrapy genspider baidu baidu.com
Here the first baidu is the spider's name and the second argument is the domain name. After the command finishes there will be a new baidu.py inside the spiders folder; open that file.
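The generated baidu.py looks roughly like this (a sketch of the default genspider template; the exact start URL may differ slightly between Scrapy versions):

import scrapy

class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['baidu.com']
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        pass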

name is the spider's name, allowed_domains is the domain the spider is allowed to crawl, and start_urls is where the crawl starts: change it to whichever URL you want to begin crawling from. The parse() method underneath is where responses are parsed.
Next, open settings.py in the project and adjust a few of the options:
# Scrapy settings for demo01 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'demo01' # name of the crawler project
SPIDER_MODULES = ['demo01.spiders']
NEWSPIDER_MODULE = 'demo01.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0' # the User-Agent can be changed; set it here or in DEFAULT_REQUEST_HEADERS below
# Obey robots.txt rules
ROBOTSTXT_OBEY = False # whether to obey robots.txt; be sure to set this to False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 8 # maximum number of concurrent requests, default is 16; lower it here
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1 # download delay: how long to wait between requests (reduces the crawl rate)
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False # cookies are enabled by default; uncomment this line to disable them
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
# Default request headers, similar to the headers argument of requests.get()
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
'User-Agent': 'Mozilla/5.0'
}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'demo01.middlewares.Demo01SpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'demo01.middlewares.Demo01DownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'demo01.pipelines.Demo01Pipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Back to baidu.py.
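For reference, a minimal sketch of what the modified parse() could look like inside the BaiduSpider class; the title field and the XPath here are only illustrative examples, not taken from the original code:

    def parse(self, response):
        # yield a plain dict; Scrapy treats it as an item and prints it in the crawl log
        yield {'title': response.xpath('//title/text()').get()}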

Then run the project from the terminal inside PyCharm:
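The crawl command, run from the project root (the folder that contains scrapy.cfg), is:

scrapy crawl baidu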
The first part of the output just reports your Scrapy configuration, version and so on. Further down you can see that what was extracted is a dictionary.
Now create a run.py in the project directory.

Enter the following:

from scrapy import cmdline

# equivalent to typing "scrapy crawl baidu" in the terminal
cmdline.execute('scrapy crawl baidu'.split())

Then right-click run.py and run it.

The output is short and clear, with different colors for each part, so you can see the crawled content directly.
Back to the earlier point: run it again, and then keep modifying the spider as needed.