
Scheduled execution of Scrapy

2022-07-28 06:50:00 Phantom seven illusions

1. Scheduled execution with the schedule library

# -*- coding: utf-8 -*-
import datetime
import subprocess
import time

import schedule


def crawl_work():
    print('-' * 100)
    # Launch the spider in a child process: running Scrapy in the same
    # process on every tick would fail after the first run, because the
    # Twisted reactor cannot be restarted.
    subprocess.Popen(['scrapy', 'crawl', 'it'])


if __name__ == '__main__':
    print('*' * 10 + ' Start scheduled crawler execution ' + '*' * 10)
    schedule.every(1).minutes.do(crawl_work)
    print('The current time is {}'.format(
        datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')))
    print('*' * 10 + ' Timed crawler starts running ' + '*' * 10)
    while True:
        schedule.run_pending()
        time.sleep(10)
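
The schedule library also supports richer intervals than a fixed number of minutes. A minimal sketch under the same setup (crawl_work is the function defined above; the times are arbitrary examples):

schedule.every().hour.do(crawl_work)
schedule.every().day.at("06:30").do(crawl_work)
schedule.every().monday.do(crawl_work)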
 

2. A naive approach: loop and sleep

# -*- coding: utf-8 -*-
from multiprocessing import Process
from scrapy import cmdline
import time
import logging

# Configuration: spider name and the pause between runs, in seconds
confs = [
    {
        "spider_name": "it",
        "frequency": 2,
    },
]


def start_spider(spider_name, frequency):
    args = ["scrapy", "crawl", spider_name]
    while True:
        start = time.time()
        # cmdline.execute must run in a fresh process each time,
        # because the Twisted reactor cannot be restarted
        p = Process(target=cmdline.execute, args=(args,))
        p.start()
        p.join()
        logging.debug("### use time: %s" % (time.time() - start))
        time.sleep(frequency)


if __name__ == '__main__':
    processes = []
    for conf in confs:
        process = Process(target=start_spider,
                          args=(conf["spider_name"], conf["frequency"]))
        process.start()
        processes.append(process)
    # Keep the main process alive while the worker processes loop
    for process in processes:
        process.join()
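
Note that frequency here is the pause between runs in seconds, not a fixed start time: because p.join() waits for each run to finish, the next run begins frequency seconds after the previous one ends, so runs never overlap.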

3. Use the operating system's own scheduler (cron on Ubuntu, Task Scheduler on Windows)

Write a cron.sh script:

#!/bin/sh

export PATH=$PATH:/usr/local/bin

cd /home/zhangchao/CVS/testCron

nohup scrapy crawl example >> example.log 2>&1 &
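
Then register the script with cron (crontab -e). A sample entry, assuming the script is saved as /home/zhangchao/CVS/testCron/cron.sh and should fire once a day at 06:00:

0 6 * * * /bin/sh /home/zhangchao/CVS/testCron/cron.sh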

Setting a maximum total run time for Scrapy

So that each scheduled run shuts itself down before the next one starts, add a configuration item to the Scrapy project's settings:

CLOSESPIDER_TIMEOUT = 82800  # stop the crawler after 23 hours

Explanation:

CLOSESPIDER_TIMEOUT

Default: 0

An integer, in seconds. If the spider is still running after that many seconds, it is closed automatically with the close reason closespider_timeout. If the value is 0 (or not set), the spider will never be closed because of a timeout.
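
The same limit can also be supplied per run on the command line instead of in settings.py, which is convenient when only the cron-launched job should have it (-s overrides a single Scrapy setting; the spider name example is taken from the script above):

scrapy crawl example -s CLOSESPIDER_TIMEOUT=82800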


Copyright notice
This article was written by [Phantom seven illusions]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/209/202207280519305380.html