Crawling Web Pages with the Scrapy Framework and Saving the Results to MySQL
2022-07-07 13:11:00 【1024问】
Hello everyone. In this installment Abin shares how to use the Scrapy crawler framework together with a local MySQL database. The page Abin crawls today is the Hupu sports site.
(1) Open the Hupu sports site, analyze the data on the page, and use XPath to locate the elements.
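Before writing any code, it can help to try the selectors interactively in Scrapy's shell (a sketch; the XPath shown here is the one the spider below uses):

scrapy shell "https://nba.hupu.com/stats/players"
# inside the shell, each data row of the player table is one <tr>:
>>> response.xpath('//tbody/tr[not(@class)]')
>>> response.xpath('//tbody/tr[not(@class)]/td[2]/a/text()').extract()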
(2) After analyzing the page in step (1), create a Scrapy project by running the following command in the terminal:
"scrapy startproject hpty" (note: 'hpty' is the name of the crawler project). This produces the project package shown in the figure below:
(3) Go into the "hpty/hpty/spiders" directory and create a spider file named "sww" by running the following command in the terminal:
"scrapy genspider sww nba.hupu.com" (genspider takes a spider name and the domain it is allowed to crawl).
(4) With the previous two steps done, edit the files of the crawler project.
1. Editing the settings file:
Change the robots.txt "gentleman's agreement" setting, ROBOTSTXT_OBEY, from its default True to False.
Then uncomment the ITEM_PIPELINES block that the template ships commented out, so that the pipeline written below actually receives the scraped items.
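For reference, the two relevant entries in settings.py end up looking like this (a minimal sketch, assuming the default hpty project layout; 300 is the priority value from the commented-out template):

# settings.py
# Scrapy obeys robots.txt by default; disable that for this exercise
ROBOTSTXT_OBEY = False

# enable the item pipeline so HptyPipeline receives the scraped items
ITEM_PIPELINES = {
    'hpty.pipelines.HptyPipeline': 300,
}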
2. Edit the items file, which defines the data fields. The code is as follows:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class HptyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    球员 = scrapy.Field()
    球队 = scrapy.Field()
    排名 = scrapy.Field()
    场均得分 = scrapy.Field()
    命中率 = scrapy.Field()
    三分命中率 = scrapy.Field()
    罚球命中率 = scrapy.Field()
3. Edit the most important file, the spider file "sww". The code is as follows:
import scrapy
from ..items import HptyItem


class SwwSpider(scrapy.Spider):
    name = 'sww'
    # allowed_domains must list domains, not full URLs
    allowed_domains = ['nba.hupu.com']
    start_urls = ['https://nba.hupu.com/stats/players']

    def parse(self, response):
        whh = response.xpath('//tbody/tr[not(@class)]')
        for i in whh:
            排名 = i.xpath('./td[1]/text()').extract()        # rank
            球员 = i.xpath('./td[2]/a/text()').extract()      # player
            球队 = i.xpath('./td[3]/a/text()').extract()      # team
            场均得分 = i.xpath('./td[4]/text()').extract()    # points per game
            命中率 = i.xpath('./td[6]/text()').extract()      # field-goal percentage
            三分命中率 = i.xpath('./td[8]/text()').extract()  # three-point percentage
            罚球命中率 = i.xpath('./td[10]/text()').extract() # free-throw percentage
            data = HptyItem(球员=球员, 球队=球队, 排名=排名, 场均得分=场均得分,
                            命中率=命中率, 三分命中率=三分命中率, 罚球命中率=罚球命中率)
            yield data
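Note that extract() always returns a list of matched strings, which is why the pipeline below takes element [0] of every field. A variant (not what this article uses) is extract_first(), which returns the first match directly:

# equivalent to i.xpath('./td[2]/a/text()').extract()[0], but returns None
# instead of raising IndexError when nothing matches
球员 = i.xpath('./td[2]/a/text()').extract_first()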
4. Edit the pipelines file. The code is as follows:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
import pymysql


class HptyPipeline:
    def process_item(self, item, spider):
        db = pymysql.connect(host="localhost", user="root",
                             passwd="root", db="sww", charset="utf8")
        cursor = db.cursor()
        球员 = item["球员"][0]
        球队 = item["球队"][0]
        排名 = item["排名"][0]
        场均得分 = item["场均得分"][0]
        命中率 = item["命中率"][0]
        三分命中率 = item["三分命中率"][0]
        罚球命中率 = item["罚球命中率"][0]
        # 三分命中率 = item["三分命中率"][0].strip('%')
        # 罚球命中率 = item["罚球命中率"][0].strip('%')
        cursor.execute(
            'INSERT INTO nba(球员,球队,排名,场均得分,命中率,三分命中率,罚球命中率) '
            'VALUES (%s,%s,%s,%s,%s,%s,%s)',
            (球员, 球队, 排名, 场均得分, 命中率, 三分命中率, 罚球命中率)
        )
        # commit the transaction
        db.commit()
        # close the cursor and the connection
        cursor.close()
        db.close()
        return item
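One design note: process_item above opens and closes a new MySQL connection for every single item. A common refactoring (a sketch, not part of the original article) moves the connection into the open_spider and close_spider hooks, which Scrapy calls once at the start and end of the crawl:

import pymysql


class HptyPipeline:
    def open_spider(self, spider):
        # one connection for the whole crawl
        self.db = pymysql.connect(host="localhost", user="root",
                                  passwd="root", db="sww", charset="utf8")
        self.cursor = self.db.cursor()

    def process_item(self, item, spider):
        fields = ("球员", "球队", "排名", "场均得分",
                  "命中率", "三分命中率", "罚球命中率")
        self.cursor.execute(
            'INSERT INTO nba(球员,球队,排名,场均得分,命中率,三分命中率,罚球命中率) '
            'VALUES (%s,%s,%s,%s,%s,%s,%s)',
            tuple(item[k][0] for k in fields)
        )
        self.db.commit()
        return item

    def close_spider(self, spider):
        # release resources when the crawl ends
        self.cursor.close()
        self.db.close()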
(5) With the Scrapy side finished, go to MySQL and create a database named "sww", then create a table named "nba" inside it. The SQL is as follows:
1. Create the database
create database sww;
2. Create the table
create table nba (
    球员 char(20),
    球队 char(10),
    排名 char(10),
    场均得分 char(25),
    命中率 char(20),
    三分命中率 char(20),
    罚球命中率 char(20)
);
3. Once the database and table are created, you can view the table's structure (for example with "describe nba;" in the MySQL client):
(6) After creating the table in MySQL, return to the terminal and run the command "scrapy crawl sww". The results are shown below:
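As a quick sanity check (a sketch, assuming the same connection parameters the pipeline uses), you can read a few rows back with pymysql:

import pymysql

# connect with the same credentials as the pipeline
db = pymysql.connect(host="localhost", user="root",
                     passwd="root", db="sww", charset="utf8")
cursor = db.cursor()
cursor.execute("SELECT 球员, 球队, 场均得分 FROM nba LIMIT 5")
for row in cursor.fetchall():
    print(row)
cursor.close()
db.close()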
This concludes the article on crawling web pages with the Scrapy framework and saving the results to MySQL.