
Running Scrapy on a schedule

2022-07-28 05:25:00 幻影七幻

1. Scheduling with the schedule library

# -*- coding: utf-8 -*-
import datetime
import subprocess
import time

import schedule


def crawl_work():
    # Run the spider named 'it' in a child process and wait for it to finish.
    # The script must be started from inside the Scrapy project directory.
    print('-' * 100)
    subprocess.run(['scrapy', 'crawl', 'it'])


if __name__ == '__main__':
    print('*' * 10 + 'Starting the scheduled crawler' + '*' * 10)
    # Register the job: run crawl_work once every minute
    schedule.every(1).minutes.do(crawl_work)
    print('Current time: {}'.format(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')))
    print('*' * 10 + 'Scheduled crawler is running' + '*' * 10)
    while True:
        # Run any job that is due, then check again in 10 seconds
        schedule.run_pending()
        time.sleep(10)
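
One thing to watch with a one-minute interval: if the spider is launched without waiting for it (for example with subprocess.Popen, which returns immediately), a slow crawl can still be running when the next run comes due. The sketch below shows one possible way to skip a run while the previous child process is alive. The class name SingleRunner and its method are made up for this illustration, and the demo uses a short Python sleep as a stand-in for the scrapy crawl command so that it runs anywhere.

```python
import subprocess
import sys


class SingleRunner:
    """Start a command only if the previous run has finished (hypothetical helper)."""

    def __init__(self, command):
        self.command = command
        self.process = None

    def start_if_idle(self):
        # poll() returns None while the child process is still running
        if self.process is not None and self.process.poll() is None:
            return False
        self.process = subprocess.Popen(self.command)
        return True


if __name__ == '__main__':
    # Stand-in for ['scrapy', 'crawl', 'it']: a child that sleeps for one second
    runner = SingleRunner([sys.executable, '-c', 'import time; time.sleep(1)'])
    print(runner.start_if_idle())  # first call starts the child
    print(runner.start_if_idle())  # child still running, so this call is skipped
    runner.process.wait()
    print(runner.start_if_idle())  # previous run finished, so a new one starts
    runner.process.wait()
```

With schedule, such a runner could be registered as schedule.every(1).minutes.do(runner.start_if_idle).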
 

2. The crude approach: loop and sleep

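# -*- coding: utf-8 -*-
from multiprocessing import Process
from scrapy import cmdline
import time
import logging

# Configure each spider here: its name and the pause between runs, in seconds
confs = [
    {
        "spider_name": "it",
        "frequency": 2,
    },
]


def start_spider(spider_name, frequency):
    args = ["scrapy", "crawl", spider_name]
    while True:
        start = time.time()
        # cmdline.execute() exits the interpreter when the crawl ends,
        # so every run gets its own child process
        p = Process(target=cmdline.execute, args=(args,))
        p.start()
        p.join()
        # Shown only if logging is configured at DEBUG level
        logging.debug("### use time: %s" % (time.time() - start))
        time.sleep(frequency)


if __name__ == '__main__':
    processes = []
    for conf in confs:
        process = Process(target=start_spider, args=(conf["spider_name"], conf["frequency"]))
        process.start()
        processes.append(process)
    # Start every spider loop first, then keep the main process waiting on them
    for process in processes:
        process.join()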
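
Note that in this loop frequency is the pause after a run finishes, so the time between two starts is the run time plus frequency. If a fixed start-to-start period is wanted instead, the sleep can be shortened by however long the run took. A minimal sketch of that calculation; the function name remaining_delay and the period parameter are illustrative and not part of the original script:

```python
def remaining_delay(started_at, finished_at, period):
    """Return how long to sleep so that runs start `period` seconds apart.

    If the run took longer than the period, return 0 so the next run
    starts right away instead of sleeping a negative amount.
    """
    elapsed = finished_at - started_at
    return max(0.0, period - elapsed)


if __name__ == '__main__':
    # A run that took 12 seconds with a 60-second period leaves 48 seconds to wait
    print(remaining_delay(100.0, 112.0, 60))
    # A run that overran the period leaves nothing to wait
    print(remaining_delay(100.0, 190.0, 60))
```

In start_spider this would replace the last line of the loop with time.sleep(remaining_delay(start, time.time(), frequency)).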

3. On Ubuntu or Windows, use the operating system's own scheduler

Write a cron.sh script:

#!/bin/sh

export PATH=$PATH:/usr/local/bin

cd /home/zhangchao/CVS/testCron

nohup scrapy crawl example >> example.log 2>&1 &
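
The heading also mentions Windows, where cron.sh cannot be used as it stands. One option is a small Python launcher that the system scheduler (Windows Task Scheduler, or cron) runs instead. The sketch below mirrors cron.sh: it runs scrapy crawl example in the project directory and appends stdout and stderr to example.log. Unlike the nohup ... & line, it waits for the crawl to finish. The function name run_spider, its parameters, and the command override are assumptions made for this illustration; the project path is the one from the script above and would need adjusting.

```python
import os
import subprocess
import sys
import tempfile


def run_spider(project_dir, spider_name, log_name, command=None):
    """Run a crawl in project_dir, appending its output to log_name.

    `command` is a hypothetical override used only for the demo below;
    by default the command is the same one cron.sh runs.
    """
    if command is None:
        command = ['scrapy', 'crawl', spider_name]
    log_path = os.path.join(project_dir, log_name)
    # Mode 'a' appends, like '>>' in the shell script
    with open(log_path, 'a') as log_file:
        # stderr=STDOUT sends errors to the same file, like '2>&1'
        result = subprocess.run(command, cwd=project_dir,
                                stdout=log_file, stderr=subprocess.STDOUT)
    return result.returncode


if __name__ == '__main__':
    # Demo with a harmless stand-in command so it runs without a Scrapy project.
    # A real call might look like:
    #   run_spider('/home/zhangchao/CVS/testCron', 'example', 'example.log')
    with tempfile.TemporaryDirectory() as demo_dir:
        stand_in = [sys.executable, '-c', 'print("crawl finished")']
        print(run_spider(demo_dir, 'example', 'example.log', command=stand_in))
        with open(os.path.join(demo_dir, 'example.log')) as f:
            print(f.read().strip())
```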

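Limiting Scrapy's total run time

To have a scheduled crawl shut itself down:

Add one option to the Scrapy project's settings:

CLOSESPIDER_TIMEOUT = 82800  # stop the spider after 23 hours

Explanation:

CLOSESPIDER_TIMEOUT

Default: 0

An integer number of seconds. If a spider is still running after that many seconds, it is closed automatically with the reason closespider_timeout. If the value is 0 (or the option is not set), spiders are never closed because of a timeout.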
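
The value 82800 is simply 23 hours expressed in seconds. Since settings.py is ordinary Python, the arithmetic can be written out so the intent stays visible. A small sketch; the lowercase name run_hours is made up for this example and is not a Scrapy setting:

```python
# settings.py (fragment)
run_hours = 23  # hypothetical helper variable, not a Scrapy setting
CLOSESPIDER_TIMEOUT = run_hours * 60 * 60  # 23 hours = 82800 seconds
```

The same setting can also be passed for a single run on the command line, for example scrapy crawl example -s CLOSESPIDER_TIMEOUT=82800.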


Copyright notice
This article was written by [幻影七幻]. Please include a link to the original when reposting. Thank you.
https://blog.csdn.net/qq_41048831/article/details/125747792