Basic use of Scrapy
2022-07-30 18:24:00 【Cold Lane(*_*)】
Install Scrapy directly from cmd (the command line).
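The usual way is with pip (assuming Python and pip are already installed and on your PATH):

pip install scrapy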

After installation, open PyCharm.
Create a new project from the command line.
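The command is scrapy startproject followed by the project name:

scrapy startproject demo01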

Here demo01 is the project name; you can call it anything you like. Then press Enter.

That's it, the project is created. You could cd into the project directory, but here it's easier to just open it from File in PyCharm!
File → Open → demo01

Click OK, then click This Window to open the project in the current window; if you close and reopen PyCharm, it will still be the same project. If you click New Window, a new window opens instead, and after a restart you'll be back in the previous project.

Next, create a spider:

scrapy genspider baidu baidu.com

The first baidu is the spider's name, followed by the domain to crawl. Once it is created, a new baidu.py appears inside the spiders directory. Open that file.

name is the spider's name, allowed_domains holds the allowed domains, and start_urls is where crawling starts: change it to the URL you want to crawl first. The parse() method below is where the response gets parsed.
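For reference, the file that scrapy genspider generates looks roughly like this (the exact template varies slightly between Scrapy versions):

import scrapy

class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['baidu.com']
    start_urls = ['http://baidu.com/']

    def parse(self, response):
        pass

Next, open settings.py and adjust a few of the options: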
# Scrapy settings for demo01 project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'demo01'  # name of the crawler project
SPIDER_MODULES = ['demo01.spiders']
NEWSPIDER_MODULE = 'demo01.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0'  # the User-Agent can be changed; set it here or in DEFAULT_REQUEST_HEADERS below
# Obey robots.txt rules
ROBOTSTXT_OBEY = False  # whether to obey the robots.txt protocol; be sure to set this to False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 8  # maximum concurrency; the default is 16, change it here
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1  # download delay: how long to wait between requests (reduces the crawl rate)
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False  # cookies are enabled by default; uncomment this to disable them
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
# Default request headers, similar to the headers parameter of requests.get()
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0',
}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'demo01.middlewares.Demo01SpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'demo01.middlewares.Demo01DownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'demo01.pipelines.Demo01Pipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Now go back to baidu.py.

Then run the project from PyCharm's terminal.
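The standard command to run a spider by name is:

scrapy crawl baidu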
The top of the output shows your Scrapy configuration, version information, and so on. Then notice this: what gets extracted is a dictionary.
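As a hypothetical illustration (the XPath and the title field here are assumptions for this sketch, not taken from the original screenshots), a parse() that yields a dictionary might look like:

def parse(self, response):
    # Yield a plain dict; Scrapy treats it as a scraped item
    # and prints it in the crawl log.
    yield {'title': response.xpath('//title/text()').get()}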
Now create a run.py here (at the project root).

Enter:

from scrapy import cmdline
cmdline.execute('scrapy crawl baidu'.split())

Then right-click the file and run it.
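Running the spider through run.py is equivalent to typing scrapy crawl baidu in the terminal, but it lets you start (and debug) the crawl with PyCharm's own Run/Debug tools.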

Simple and clear: the log output is color-coded, so you can see the content you want to crawl directly.
Back to the earlier question: run it again, then continue modifying from there.