Running Scrapy on a schedule
2022-07-28 05:25:00 【幻影七幻】
1. Scheduled runs with the schedule library
# -*- coding: utf-8 -*-
import datetime
import subprocess
import time

import schedule


def crawl_work():
    print('-' * 100)
    # Launch the spider named 'it' as a separate OS process. The command is
    # passed as a list so it works without shell=True on Linux and Windows.
    subprocess.Popen(['scrapy', 'crawl', 'it'])
    # Alternative: run scrapy.cmdline.execute in a multiprocessing.Process
    # (see method 2) and join() it to wait for the crawl to finish.


if __name__ == '__main__':
    print('*' * 10 + 'Starting the scheduled crawler' + '*' * 10)
    # Register the job to run once every minute
    schedule.every(1).minutes.do(crawl_work)
    print('Current time: {}'.format(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')))
    print('*' * 10 + 'Scheduled crawler is running' + '*' * 10)
    while True:
        # Run any job that is due, then check again in 10 seconds
        schedule.run_pending()
        time.sleep(10)
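
If a crawl takes longer than the one-minute interval, a job that launches `scrapy crawl` through `subprocess.Popen` would start a second copy while the first is still running. The following is a minimal sketch of one way to skip a run in that case. The class name `SingleRunner` and the stand-in command are assumptions made for illustration and are not part of the original script.

```python
import subprocess
import sys


class SingleRunner:
    """Start a command only if the previous run of it has finished."""

    def __init__(self, cmd):
        self.cmd = cmd    # e.g. ['scrapy', 'crawl', 'it']
        self.proc = None  # handle of the most recently started process

    def start_if_idle(self):
        """Return True if a new process was started, False if skipped."""
        if self.proc is not None and self.proc.poll() is None:
            return False  # previous run is still in progress
        self.proc = subprocess.Popen(self.cmd)
        return True


if __name__ == '__main__':
    # Stand-in for ['scrapy', 'crawl', 'it'] so the sketch runs
    # without a Scrapy project: a child process that sleeps for a second.
    demo_cmd = [sys.executable, '-c', 'import time; time.sleep(1)']
    runner = SingleRunner(demo_cmd)
    print(runner.start_if_idle())  # starts the first run
    print(runner.start_if_idle())  # skipped while the first run is active
    runner.proc.wait()
    print(runner.start_if_idle())  # starts again after the first run exits
    runner.proc.wait()
```

With this in place, the scheduled job could call `runner.start_if_idle` instead of calling `subprocess.Popen` directly.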
2. A cruder approach: loop and sleep
# -*- coding: utf-8 -*-
import logging
import time
from multiprocessing import Process

from scrapy import cmdline

# Only this needs configuring: spider name and pause in seconds between runs
confs = [
    {
        "spider_name": "it",
        "frequency": 2,
    },
]


def start_spider(spider_name, frequency):
    args = ["scrapy", "crawl", spider_name]
    while True:
        start = time.time()
        # cmdline.execute exits the interpreter when the crawl ends,
        # so every crawl runs in its own child process
        p = Process(target=cmdline.execute, args=(args,))
        p.start()
        p.join()
        logging.debug("### use time: %s" % (time.time() - start))
        time.sleep(frequency)


if __name__ == '__main__':
    # One long-lived worker process per configured spider
    for conf in confs:
        process = Process(target=start_spider, args=(conf["spider_name"], conf["frequency"]))
        process.start()
    time.sleep(86400)

3. On Ubuntu or Windows, use the operating system's own scheduler
Write a cron.sh script:
#! /bin/sh
export PATH=$PATH:/usr/local/bin
cd /home/zhangchao/CVS/testCron
nohup scrapy crawl example >> example.log 2>&1 &

Setting a total run time for Scrapy
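To stop the spider after a fixed time:
Add one setting to Scrapy's settings file:
CLOSESPIDER_TIMEOUT = 82800  # stop the spider after 23 hours
Explanation:
CLOSESPIDER_TIMEOUT
Default: 0
An integer number of seconds. If a spider is still running after that many seconds, it is closed automatically with the reason closespider_timeout. If the value is 0 (or the setting is not set), spiders are not closed on timeout.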
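
Because the value is given in seconds, a figure such as 82800 is easy to mistype. As a small sketch, the same setting can be derived from a readable duration with the standard library's `timedelta`; this is only a matter of style, and Scrapy receives the same integer either way.

```python
from datetime import timedelta

# 23 hours expressed in seconds, so the intent stays readable in settings.py
CLOSESPIDER_TIMEOUT = int(timedelta(hours=23).total_seconds())
```

The setting can also be placed in a single spider's `custom_settings` dictionary if only that spider should be time-limited.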