【Running the Scrapy framework inside Django and saving the data to a database】
2022-06-24 06:44:00 【浪荡子爱自由】
【Problems solved】
1. How to combine Django and Scrapy
2. How to start a Scrapy crawler from Django
This article only covers a minimal Django + Scrapy integration and is aimed at readers who want to get up and running quickly.
Task 1: Create a web project with Django alone
A Django project can be created from the command line or manually in PyCharm. This article uses PyCharm.
1. Create a Django project in PyCharm: File --> New Project --> Django --> set the project name to pro and the app name to app01 --> Create.

The project's directory structure now looks like the following; app01 is a sub-application of the project pro.

2. Create a home page whose content is "This is a test page".
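The screenshots for this step are not reproduced here. A minimal sketch of the view and URL configuration, assuming the view lives in app01/views.py, the route is registered in pro/urls.py, and the function name index is chosen for illustration:

# app01/views.py - a minimal home page view (a sketch; the function name is an assumption)
from django.http import HttpResponse

def index(request):
    # Return the test page text directly, without a template
    return HttpResponse('This is a test page!')

# pro/urls.py - route /index/ to the view above
from django.urls import path
from app01 import views

urlpatterns = [
    path('index/', views.index),
]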



3. Run the server and open http://127.0.0.1:8000/index/ in a browser. If the page shows 【This is a test page!】, it works.
Task 2: Implement a simple crawler with Scrapy alone
I. Preparation
Installing Scrapy: make sure Scrapy is installed; if not, open cmd and run pip install scrapy.

Common commands:
scrapy startproject <project_name>   # create a Scrapy project
scrapy genspider <spider_name> <domain>   # create a spider
scrapy crawl <spider_name>   # run a spider
Task description:
1. Crawl ifeng.com: from http://app.finance.ifeng.com/list/stock.php?t=ha, scrape the code, name, latest price, and other fields of the Shanghai A-share stocks on that page.
2. Save the files under H:\2022年学习资料.
3. Name the project ifengNews.
II. Implementation steps
1. In cmd, go to the target directory 【2022学习资料】 and create the Scrapy project with the command 【scrapy startproject ifengNews】.

Result: when the output in the 【blue box】 appears, the 【ifengNews】 project has been created successfully.

2. Open the 【ifengNews】 folder in PyCharm. The directory looks like this:

items.py: defines the crawler's data model
middlewares.py: defines the middlewares
pipelines.py: the pipeline file, responsible for processing the data the crawler returns
settings.py: crawler settings, mainly priorities and switches (change ROBOTSTXT_OBEY = True to False; this flag controls whether the crawler obeys robots.txt, and with True some content cannot be crawled)
scrapy.cfg: basic Scrapy configuration
spiders directory: where the spider code lives
3. In the PyCharm terminal, run 【scrapy genspider ifeng_spider ifeng.com】, where ifeng_spider is the spider file name, which you can choose freely but which must not be the same as the project name, and ifeng.com is the domain.

Result: an ifeng_spider.py file is created under the spiders folder; the crawling code is written in this file's parse method.

3.1 This step can also be done in cmd.

4. Modify settings.py
First, stop obeying robots.txt:
# Obey robots.txt rules
ROBOTSTXT_OBEY = False  # whether to obey robots.txt; the default is True and should be changed to False, otherwise much content cannot be crawled
Second, add a User-Agent to the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    'Cookie': 'adb_isBlock=0; userid=1652710683278_ihrfq92084; prov=cn0731; city=0732; weather_city=hn_xt; region_ip=110.53.149.x; region_ver=1.2; wxIsclose=false; ifengRotator_iis3=6; ifengWindowCookieName_919=1'
    # The User-Agent is commented out by default but is very important: without it the request is easily identified as a bot. Simply writing Mozilla/5.0 is enough.
}
Third, enable a pipeline:
# ITEM_PIPELINES: item pipelines; 300 is the priority, and lower numbers run first
ITEM_PIPELINES = {
    'ifengNews.pipelines.IfengnewsPipeline': 300,
    # 'subeiNews.pipelines.SubeinewsMysqlPipeline': 200,  # pipeline that stores the data
}
5. Crawl the page. First write the spider in ifeng_spider.py:
import scrapy
from ifengNews.items import IfengnewsItem


class IfengSpiderSpider(scrapy.Spider):
    name = 'ifeng_spider'
    allowed_domains = ['ifeng.com']
    start_urls = ['http://app.finance.ifeng.com/list/stock.php?t=ha']  # crawl target

    def parse(self, response):
        # Scrape the detailed stock information
        for con in response.xpath('//*[@class="tab01"]/table/tr'):
            items = IfengnewsItem()
            flag = con.xpath('./td[3]//text()').get()  # latest price
            if flag:
                items['title'] = response.xpath('//div[@class="block"]/h1/text()').get()
                items['code'] = con.xpath('./td[1]//text()').get()  # code
                items['name'] = con.xpath('./td[2]//text()').get()  # name
                items['latest_price'] = con.xpath('./td[3]//text()').get()  # latest price
                items['quote_change'] = con.xpath('./td[4]//text()').get()  # change percentage
                items['quote_num'] = con.xpath('./td[5]//text()').get()  # change amount
                items['volume'] = con.xpath('./td[6]//text()').get()  # volume
                items['turnover'] = con.xpath('./td[7]//text()').get()  # turnover
                items['open_today'] = con.xpath('./td[8]//text()').get()  # today's open
                items['closed_yesterday'] = con.xpath('./td[9]//text()').get()  # yesterday's close
                items['lowest_price'] = con.xpath('./td[10]//text()').get()  # lowest price
                items['highest_price'] = con.xpath('./td[11]//text()').get()  # highest price
                print(items['title'], items['name'])
                yield items
Open items.py and modify it to store the data:
import scrapy


class IfengnewsItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    code = scrapy.Field()
    name = scrapy.Field()
    latest_price = scrapy.Field()
    quote_change = scrapy.Field()
    quote_num = scrapy.Field()
    volume = scrapy.Field()
    turnover = scrapy.Field()
    open_today = scrapy.Field()
    closed_yesterday = scrapy.Field()
    lowest_price = scrapy.Field()
    highest_price = scrapy.Field()
Crawling second-level pages, pagination, and so on are left for further study.
6. Run the crawler: enter 【scrapy crawl ifeng_spider】 in the terminal.

6.1 You can also write a run.py file to run the program and store the data in infos.csv, as sketched below.
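A minimal sketch of such a run.py, assuming it sits in the Scrapy project root next to scrapy.cfg; the CSV output relies on Scrapy's built-in -o feed export:

# run.py - equivalent to running `scrapy crawl ifeng_spider -o infos.csv` in the terminal
from scrapy import cmdline

cmdline.execute('scrapy crawl ifeng_spider -o infos.csv'.split())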

Task 3: Combining Django and Scrapy
Task description: combine Django and Scrapy so that Django controls the Scrapy run, and store the crawled data in the database.
1. Create a sub-application warehouse in the Django project root to hold the Scrapy database models and related code. Create the app from the command line by running: python manage.py startapp warehouse.

The directory structure at this point:

Then register the warehouse app in pro/settings.py, as shown below.
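The screenshot is not reproduced here; registering the app is the usual INSTALLED_APPS entry, roughly:

# pro/settings.py - add warehouse to the installed apps
INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'app01',
    'warehouse',  # the app that will hold the models for the crawled data
]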

2. Create the Scrapy project inside the Django project and modify its settings.py exactly as in step 4 of Task 2.


Then adjust the directory structure to match the figure below.
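The figure is not reproduced here; a layout consistent with the rest of this article would look roughly like this (the exact nesting is an assumption):

pro/
├── manage.py
├── pro/                 # Django configuration package (settings.py, urls.py, ...)
├── app01/
├── warehouse/           # Django app holding the models for the crawled data
└── ifengNews/           # Scrapy project
    ├── scrapy.cfg
    └── ifengNews/
        ├── settings.py
        ├── items.py
        ├── pipelines.py
        └── spiders/
            └── ifeng_spider.py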

3. Add the following code to Scrapy's settings.py:
import os
import sys
import django

sys.path.append(os.path.dirname(os.path.abspath('.')))
os.environ['DJANGO_SETTINGS_MODULE'] = 'pro.settings'  # <project_name>.settings
django.setup()
4. Create the model in warehouse/models.py to store the crawled data, then run python manage.py makemigrations and python manage.py migrate in the terminal to generate the database table.
from django.db import models


class StockInfo(models.Model):
    """
    Stock information
    """
    title = models.TextField(verbose_name="stock type")
    code = models.TextField(verbose_name="code")
    name = models.TextField(verbose_name="name")
    latest_price = models.TextField(verbose_name="latest price")
    quote_change = models.TextField(verbose_name="change percentage")
    quote_num = models.TextField(verbose_name="change amount")
    volume = models.TextField(verbose_name="volume")
    turnover = models.TextField(verbose_name="turnover")
    open_today = models.TextField(verbose_name="today's open")
    closed_yesterday = models.TextField(verbose_name="yesterday's close")
    lowest_price = models.TextField(verbose_name="lowest price")
    highest_price = models.TextField(verbose_name="highest price")
5. Modify pipelines.py, items.py, and ifeng_spider.py.
ifeng_spider.py:
import scrapy
from ifengNews.items import IfengnewsItem


class IfengSpiderSpider(scrapy.Spider):
    name = 'ifeng_spider'
    allowed_domains = ['ifeng.com']
    start_urls = ['http://app.finance.ifeng.com/list/stock.php?t=ha']

    def parse(self, response):
        # Scrape the detailed stock information
        for con in response.xpath('//*[@class="tab01"]/table/tr'):
            items = IfengnewsItem()
            flag = con.xpath('./td[3]//text()').get()  # latest price
            if flag:
                items['title'] = response.xpath('//div[@class="block"]/h1/text()').get()
                items['code'] = con.xpath('./td[1]//text()').get()  # code
                items['name'] = con.xpath('./td[2]//text()').get()  # name
                items['latest_price'] = con.xpath('./td[3]//text()').get()  # latest price
                items['quote_change'] = con.xpath('./td[4]//text()').get()  # change percentage
                items['quote_num'] = con.xpath('./td[5]//text()').get()  # change amount
                items['volume'] = con.xpath('./td[6]//text()').get()  # volume
                items['turnover'] = con.xpath('./td[7]//text()').get()  # turnover
                items['open_today'] = con.xpath('./td[8]//text()').get()  # today's open
                items['closed_yesterday'] = con.xpath('./td[9]//text()').get()  # yesterday's close
                items['lowest_price'] = con.xpath('./td[10]//text()').get()  # lowest price
                items['highest_price'] = con.xpath('./td[11]//text()').get()  # highest price
                print(items['title'], items['name'])
                yield items
pipelines.py:
class IfengnewsPipeline(object):
    def process_item(self, item, spider):
        print('opened the database')
        item.save()  # DjangoItem saves the item to the database through the Django ORM
        print('closed the database')
        return item
In items.py, import DjangoItem to connect to the database:
from warehouse.models import StockInfo
from scrapy_djangoitem import DjangoItem


class IfengnewsItem(DjangoItem):
    django_model = StockInfo
Install scrapy_djangoitem as shown below:
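The install screenshot is not reproduced here; the package is published on PyPI as scrapy-djangoitem:

pip install scrapy-djangoitem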
6. Modify urls.py and views.py, as sketched below.
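The screenshots are not reproduced here. A minimal sketch consistent with the /stocks/update/ URL used in the next step; the view name, the use of subprocess, and the working directory are assumptions:

# warehouse/views.py - trigger the crawl from a Django view (a sketch)
import subprocess

from django.http import HttpResponse

def update_stocks(request):
    # Start the Scrapy crawl in a child process so the request returns immediately;
    # 'ifengNews' is assumed to be the directory containing scrapy.cfg.
    subprocess.Popen(['scrapy', 'crawl', 'ifeng_spider'], cwd='ifengNews')
    return HttpResponse('ok')

# pro/urls.py - wire the view to /stocks/update/
from django.urls import path
from warehouse import views

urlpatterns = [
    path('stocks/update/', views.update_stocks),
]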


7. Run the server and open http://127.0.0.1:8000/stocks/update/ in a browser. If the page returns ok, the crawled data can be found in the database.
Demo download: