A Powerful Spider (Web Crawler) System in Python.

Overview

pyspider

A Powerful Spider (Web Crawler) System in Python.

  • Write script in Python
  • Powerful WebUI with script editor, task monitor, project manager and result viewer
  • MySQL, MongoDB, Redis, SQLite, Elasticsearch; PostgreSQL with SQLAlchemy as database backend
  • RabbitMQ, Redis and Kombu as message queue
  • Task priority, retry, periodic crawling, recrawl by age, etc.
  • Distributed architecture, crawling of JavaScript pages, Python 2.{6,7} and 3.{3,4,5,6} support, etc.

Tutorial: http://docs.pyspider.org/en/latest/tutorial/
Documentation: http://docs.pyspider.org/
Release notes: https://github.com/binux/pyspider/releases

Sample Code

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    crawl_config = {
    }

    @every(minutes=24 * 60)          # run on_start once every 24 hours
    def on_start(self):
        self.crawl('http://scrapy.org/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)   # a fetched page stays valid for 10 days
    def index_page(self, response):
        # follow every absolute link found on the page
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # the returned dict becomes the result of this task
        return {
            "url": response.url,
            "title": response.doc('title').text(),
        }

Installation

WARNING: the WebUI is open to the public by default and can be used to execute arbitrary commands, which may harm your system. Run it inside an internal network or enable need-auth for the WebUI.

Quickstart: http://docs.pyspider.org/en/latest/Quickstart/
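
A hedged example of the second option: the keys below mirror pyspider's webui command-line flags (need-auth, username, password) as they appear in its config-file documentation, but treat the exact names as an assumption and check them against your version. Save the file as config.json and start pyspider with pyspider -c config.json.

{
  "webui": {
    "need-auth": true,
    "username": "admin",
    "password": "change-me"
  }
}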

Contribute

TODO

v0.4.0

  • a visual scraping interface like Portia

License

Licensed under the Apache License, Version 2.0

Comments
  • no web interface

    Hi, I'm trying to use pyspider, but I can't see the web interface when I click the web button after crawling a page, so I can't use the CSS selector on pyspider. When I click the html button, it shows the page source. On demo.pyspider.org everything is OK. Do you know what's wrong?

    opened by wdfsinap 24
  • fail when running command pyspider

    After installation, running pyspider on the command line gives: ImportError: dlopen(/usr/local/lib/python2.7/site-packages/pycurl.so, 2): Library not loaded: libssl.1.0.0.dylib Referenced from: /usr/local/lib/python2.7/site-packages/pycurl.so Reason: image not found

    What is causing this?

    By the way, I'm on Mac OS X. Thanks!

    opened by ghost 24
  • Batch job start

    When adding a batch job, why does fetching only start after the whole batch has been added? For example, I added 50,000 URLs, and from the log it looks like crawling waits until all 50,000 have been queued before it starts.

    opened by kaito-kidd 23
  • pyspider command disappear

    The command only shows up after I install pyspider for Python 2; it doesn't work under Python 3.

    $ python3 /usr/local/bin/pyspider
    [I 150114 16:17:57 result_worker:44] result_worker starting...
    Exception in thread Thread-1:
    Traceback (most recent call last):
      File "<frozen importlib._bootstrap>", line 2195, in _find_and_load_unlocked
    AttributeError: '_MovedItems' object has no attribute '__path__'
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/usr/local/lib/python3.4/dist-packages/pyspider/scheduler/scheduler.py", line 418, in xmlrpc_run
        from six.moves.xmlrpc_server import SimpleXMLRPCServer
    ImportError: No module named 'six.moves.xmlrpc_server'; 'six.moves' is not a package
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/usr/lib/python3.4/threading.py", line 920, in _bootstrap_inner
        self.run()
      File "/usr/lib/python3.4/threading.py", line 868, in run
        self._target(*self._args, **self._kwargs)
      File "/usr/local/lib/python3.4/dist-packages/pyspider/scheduler/scheduler.py", line 420, in xmlrpc_run
        from SimpleXMLRPCServer import SimpleXMLRPCServer
    ImportError: No module named 'SimpleXMLRPCServer'
    
    [I 150114 16:17:57 scheduler:388] loading projects
    [I 150114 16:17:57 processor:157] processor starting...
    [I 150114 16:17:57 tornado_fetcher:387] fetcher starting...
    Traceback (most recent call last):
      File "/usr/local/bin/pyspider", line 9, in <module>
        load_entry_point('pyspider==0.3.0', 'console_scripts', 'pyspider')()
      File "/usr/local/lib/python3.4/dist-packages/pyspider/run.py", line 532, in main
        cli()
      File "/usr/local/lib/python3.4/dist-packages/click/core.py", line 610, in __call__
        return self.main(*args, **kwargs)
      File "/usr/local/lib/python3.4/dist-packages/click/core.py", line 590, in main
        rv = self.invoke(ctx)
      File "/usr/local/lib/python3.4/dist-packages/click/core.py", line 916, in invoke
        return Command.invoke(self, ctx)
      File "/usr/local/lib/python3.4/dist-packages/click/core.py", line 782, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/usr/local/lib/python3.4/dist-packages/click/core.py", line 416, in invoke
        return callback(*args, **kwargs)
      File "/usr/local/lib/python3.4/dist-packages/pyspider/run.py", line 144, in cli
        ctx.invoke(all)
      File "/usr/local/lib/python3.4/dist-packages/click/core.py", line 416, in invoke
        return callback(*args, **kwargs)
      File "/usr/local/lib/python3.4/dist-packages/pyspider/run.py", line 409, in all
        ctx.invoke(webui, **webui_config)
      File "/usr/local/lib/python3.4/dist-packages/click/core.py", line 416, in invoke
        return callback(*args, **kwargs)
      File "/usr/local/lib/python3.4/dist-packages/pyspider/run.py", line 277, in webui
        app = load_cls(None, None, webui_instance)
      File "/usr/local/lib/python3.4/dist-packages/pyspider/run.py", line 47, in load_cls
        return utils.load_object(value)
      File "/usr/local/lib/python3.4/dist-packages/pyspider/libs/utils.py", line 348, in load_object
        module = __import__(module_name, globals(), locals(), [object_name])
      File "/usr/local/lib/python3.4/dist-packages/pyspider/webui/__init__.py", line 8, in <module>
        from . import app, index, debug, task, result, login
      File "/usr/local/lib/python3.4/dist-packages/pyspider/webui/app.py", line 79, in <module>
        template_folder=os.path.join(os.path.dirname(__file__), 'templates'))
      File "/usr/local/lib/python3.4/dist-packages/flask/app.py", line 319, in __init__
        template_folder=template_folder)
      File "/usr/local/lib/python3.4/dist-packages/flask/helpers.py", line 741, in __init__
        self.root_path = get_root_path(self.import_name)
      File "/usr/local/lib/python3.4/dist-packages/flask/helpers.py", line 631, in get_root_path
        loader = pkgutil.get_loader(import_name)
      File "/usr/lib/python3.4/pkgutil.py", line 467, in get_loader
        return find_loader(fullname)
      File "/usr/lib/python3.4/pkgutil.py", line 488, in find_loader
        return spec.loader
    AttributeError: 'NoneType' object has no attribute 'loader'
    
    
    
    opened by zhanglongqi 23
  • valid json config file

    Could you please add a sample config.json file listing the valid parameters? e.g.

    {
      "webui": {
        "host": "127.0.0.1",
        "port": "5501"
      },
      ...
    }
    
    opened by mavencode01 20
  • Use Fig to run the docker container instead of the docker command line

    First, I edited the wiki a bit, because omitting :latest makes it download every tag. Then I thought this project would run better with fig driving docker instead of the raw docker command line.

    Create a new directory named pyspider and download this file, saving it as fig.yml: http://p.esd.cc/paste/wp5ELQ2M (typed on my phone and not tested, sorry), then just run fig up!

    To install fig: pip install -U fig

    enhancement 
    opened by imlonghao 20
  • How to use a local file as a project's script, and how to use a customized MySQL result database

    My code uses a local Python file to write the crawl results into MySQL. I put this file under the \database\mysql folder, but when running the code I get an import error saying the file can't be found, and I don't know how to solve this small problem. I saw in the issues that there is a feature for importing a project as a module, but I don't know how to use it; please advise. A related question: if I want to switch databases I need to override the on_result(self, result) function; is there an example I can refer to?

    opened by ronaldhan 19
  • How to define a global variable

    class Handler(BaseHandler):
        configuration = {'a': 'b', 'c': 'd'}

        @every(minutes=12 * 60)
        def on_start(self):
            self.configuration = {'a': 'a', 'c': 'c'}

        @config(age=12 * 60 * 60)
        def index_page(self, response):
            print(self.configuration)

    I changed configuration in on_start, but index_page still prints {'a' : 'b', 'c' : 'd'}. How can I define a global variable? (One common workaround is sketched after this item.)

    opened by liu840185317 18
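
    A hedged sketch of the usual workaround (not from this thread): handler instances are not shared between tasks, so instance attributes set in one callback are not visible in another; state is normally passed through the save argument of self.crawl and read back from response.save. The URL below is a placeholder.

    from pyspider.libs.base_handler import *

    class Handler(BaseHandler):
        @every(minutes=12 * 60)
        def on_start(self):
            state = {'a': 'a', 'c': 'c'}
            # attach the state to the task itself instead of to the handler instance
            self.crawl('http://example.com/', callback=self.index_page, save=state)

        @config(age=12 * 60 * 60)
        def index_page(self, response):
            print(response.save)   # -> {'a': 'a', 'c': 'c'}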
  • logging gb decode error

    Hi, after installing pyspider and starting run.py I can open the web page normally, but as soon as a task raises an error, the following happens:

    Traceback (most recent call last):
      File "/home/zfz/spider/stack/python-2.7.8/lib/python2.7/logging/__init__.py", line 859, in emit
        msg = self.format(record)
      File "/home/zfz/spider/stack/python-2.7.8/lib/python2.7/logging/__init__.py", line 732, in format
        return fmt.format(record)
      File "/home/zfz/spider/src/pyspider/pyspider/libs/log.py", line 121, in format
        formatted = formatted.rstrip() + "\n" + _unicode(record.exc_text)
      File "/home/zfz/spider/src/pyspider/pyspider/libs/log.py", line 27, in _unicode
        raise e
    UnicodeDecodeError: 'gb18030' codec can't decode bytes in position 670-671: illegal multibyte sequence
    Logged from file scheduler.py, line 354
    (the same traceback is repeated three more times)

    After that, every web request returns a 500 error. The OS is CentOS 6.5 and the Python environment is 2.7.8.

    Also, I noticed the default database is SQLite; I want to switch to MySQL but couldn't find where to configure it. Thanks.

    bug 
    opened by zfz 16
  • how to store results to database like mongodb

    Hi, pyspider is very nice for managing many crawler projects, but I have a problem with how to store results in a database. I saw your tutorial on docs.pyspider.org; here is the part about working with results:

    from pyspider.result import ResultWorker

    class MyResultWorker(ResultWorker):
        def on_result(self, task, result):
            assert task['taskid']
            assert task['project']
            assert task['url']
            assert result
            # save result to database

    Can I use this approach in the script to save crawler results into a database automatically? (A sketch of an in-script on_result override follows this item.)

    opened by wdfsinap 15
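
    A hedged sketch of the in-script route (an assumption, not from this thread): a project script can override on_result(self, result) and write each result to MongoDB with pymongo. The connection URI, database and collection names are placeholders, and a real handler would reuse one client instead of reconnecting per result.

    import pymongo

    from pyspider.libs.base_handler import *

    class Handler(BaseHandler):
        def on_result(self, result):
            if not result:
                return
            client = pymongo.MongoClient('mongodb://localhost:27017/')
            client['crawl_results']['pages'].insert_one(dict(result))   # store the result dict
            super(Handler, self).on_result(result)                      # keep the default behaviour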
  • Running pyspider to crawl https://phytozome.jgi.doe.gov/pz/portal.html, why doesn't it work?

    Code as follows:

    from pyspider.libs.base_handler import *

    class Handler(BaseHandler):
        crawl_config = {'headers': {
            'Content-Type': 'application/x-www-form-urlencoded',
            'Accept': '*/*',
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'Cache-Control': 'max-age=0',
            'Connection': 'keep-alive',
            'Content-Length': '295',
            'X-Requested-With': 'XMLHttpRequest',
            'Cookie': '__utmt=1; __utma=89664858.1557390068.1472454301.1472628080.1472628080.6; __utmb=89664858.3.10.1472628080; __utmc=89664858; __utmz=89664858.1472628080.5.5.utmcsr=sogou|utmccn=(organic)|utmcmd=organic|utmctr=phytozome',
            'Host': 'phytozome.jgi.doe.gov',
            'Origin': 'https://phytozome.jgi.doe.gov',
            'Referer': 'https://phytozome.jgi.doe.gov/pz/portal.html',
            'User-Agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36',
            'X-GWT-Module-Base': 'https://phytozome.jgi.doe.gov/pz/phytoweb/',
            'X-GWT-Permutation': '80DA602CF8FBCB99E9D79278AD2DA616',
        }}

        @every(minutes=24 * 60)
        def on_start(self):
            self.crawl('https://phytozome.jgi.doe.gov/pz/portal.html#!results?search=0&crown=1&star=1&method=2296&searchText=AUX/IAA&offset=0', callback=self.detail_page, fetch_type='js')

        def index_page(self, response):
            for each in response.doc('*').items():
                self.crawl(each.attr.href, callback=self.detail_page, fetch_type='js')

        @config(priority=2)
        def detail_page(self, response):
            self.index_page(response)
            for each in response.doc('*').items():
                self.crawl(each.attr.href, callback=self.detail_page, fetch_type='js')
            return {
                "url": response.url,
                "content": response.doc("*").text()
            }

    It can only crawl the CSS.

    opened by GenomeW 14
  • fix(sec): upgrade lxml to 4.9.1

    What happened?

    There is 1 security vulnerability found in lxml 4.3.3.

    What did I do?

    Upgrade lxml from 4.3.3 to 4.9.1 for vulnerability fix

    What did you expect to happen?

    Ideally, no insecure libs should be used.

    The specification of the pull request

    PR Specification from OSCS Signed-off-by:pen4[email protected]

    opened by pen4 0
  • fix(sec): upgrade tornado to 5.1

    What happened?

    There is 1 security vulnerability found in tornado 4.5.3.

    What did I do?

    Upgrade tornado from 4.5.3 to 5.1 for vulnerability fix

    What did you expect to happen?

    Ideally, no insecure libs should be used.

    The specification of the pull request

    PR Specification from OSCS

    opened by chncaption 0
  • Add support to release Linux aarch64 wheels

    Problem

    On aarch64, ‘pip install pyspider’ is giving the below error -

    ERROR: Command errored out with exit status 1:
         command: /usr/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-w68_cof2/pycurl/setup.py'"'"'; __file__='"'"'/tmp/pip-install-w68_cof2/pycurl/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-install-w68_cof2/pycurl/pip-egg-info
             cwd: /tmp/pip-install-w68_cof2/pycurl/
        Complete output (22 lines):
        Traceback (most recent call last):
          File "/tmp/pip-install-w68_cof2/pycurl/setup.py", line 235, in configure_unix
            p = subprocess.Popen((self.curl_config(), '--version'),
          File "/usr/lib/python3.8/subprocess.py", line 858, in __init__
            self._execute_child(args, executable, preexec_fn, close_fds,
          File "/usr/lib/python3.8/subprocess.py", line 1704, in _execute_child
            raise child_exception_type(errno_num, err_msg, err_filename)
        FileNotFoundError: [Errno 2] No such file or directory: 'curl-config'
    
        During handling of the above exception, another exception occurred:
    
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "/tmp/pip-install-w68_cof2/pycurl/setup.py", line 1017, in <module>
            ext = get_extension(sys.argv, split_extension_source=split_extension_source)
          File "/tmp/pip-install-w68_cof2/pycurl/setup.py", line 673, in get_extension
            ext_config = ExtensionConfiguration(argv)
          File "/tmp/pip-install-w68_cof2/pycurl/setup.py", line 99, in __init__
            self.configure()
          File "/tmp/pip-install-w68_cof2/pycurl/setup.py", line 240, in configure_unix
            raise ConfigurationError(msg)
        __main__.ConfigurationError: Could not run curl-config: [Errno 2] No such file or directory: 'curl-config'
        ----------------------------------------
    ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
    

    Resolution

    On aarch64, ‘pip install pyspider’ should download the wheels from PyPI.

    @binux and Team, Please let me know your interest in releasing aarch64 wheels. I can help in this.

    opened by odidev 0
  • got error when starting the webui

    [W 220404 15:02:18 run:413] phantomjs not found, continue running without it.
    [I 220404 15:02:20 result_worker:49] result_worker starting...
    [I 220404 15:02:20 processor:211] processor starting...
    [I 220404 15:02:20 tornado_fetcher:638] fetcher starting...
    [I 220404 15:02:20 scheduler:647] scheduler starting...
    [I 220404 15:02:20 scheduler:782] scheduler.xmlrpc listening on 127.0.0.1:23333
    [I 220404 15:02:20 scheduler:586] in 5m: new:0,success:0,retry:0,failed:0
    [I 220404 15:02:20 app:84] webui exiting...
    Traceback (most recent call last):
      File "/usr/local/Caskroom/miniconda/base/envs/web/bin/pyspider", line 8, in <module>
        sys.exit(main())
      File "/usr/local/Caskroom/miniconda/base/envs/web/lib/python3.6/site-packages/pyspider/run.py", line 754, in main
        cli()
      File "/usr/local/Caskroom/miniconda/base/envs/web/lib/python3.6/site-packages/click/core.py", line 1128, in __call__
        return self.main(*args, **kwargs)
      File "/usr/local/Caskroom/miniconda/base/envs/web/lib/python3.6/site-packages/click/core.py", line 1053, in main
        rv = self.invoke(ctx)
      File "/usr/local/Caskroom/miniconda/base/envs/web/lib/python3.6/site-packages/click/core.py", line 1637, in invoke
        super().invoke(ctx)
      File "/usr/local/Caskroom/miniconda/base/envs/web/lib/python3.6/site-packages/click/core.py", line 1395, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/usr/local/Caskroom/miniconda/base/envs/web/lib/python3.6/site-packages/click/core.py", line 754, in invoke
        return __callback(*args, **kwargs)
      File "/usr/local/Caskroom/miniconda/base/envs/web/lib/python3.6/site-packages/click/decorators.py", line 26, in new_func
        return f(get_current_context(), *args, **kwargs)
      File "/usr/local/Caskroom/miniconda/base/envs/web/lib/python3.6/site-packages/pyspider/run.py", line 165, in cli
        ctx.invoke(all)
      File "/usr/local/Caskroom/miniconda/base/envs/web/lib/python3.6/site-packages/click/core.py", line 754, in invoke
        return __callback(*args, **kwargs)
      File "/usr/local/Caskroom/miniconda/base/envs/web/lib/python3.6/site-packages/click/decorators.py", line 26, in new_func
        return f(get_current_context(), *args, **kwargs)
      File "/usr/local/Caskroom/miniconda/base/envs/web/lib/python3.6/site-packages/pyspider/run.py", line 497, in all
        ctx.invoke(webui, **webui_config)
      File "/usr/local/Caskroom/miniconda/base/envs/web/lib/python3.6/site-packages/click/core.py", line 754, in invoke
        return __callback(*args, **kwargs)
      File "/usr/local/Caskroom/miniconda/base/envs/web/lib/python3.6/site-packages/click/decorators.py", line 26, in new_func
        return f(get_current_context(), *args, **kwargs)
      File "/usr/local/Caskroom/miniconda/base/envs/web/lib/python3.6/site-packages/pyspider/run.py", line 384, in webui
        app.run(host=host, port=port)
      File "/usr/local/Caskroom/miniconda/base/envs/web/lib/python3.6/site-packages/pyspider/webui/app.py", line 59, in run
        from .webdav import dav_app
      File "/usr/local/Caskroom/miniconda/base/envs/web/lib/python3.6/site-packages/pyspider/webui/webdav.py", line 207, in <module>
        '/': ScriptProvider(app)
    TypeError: Can't instantiate abstract class ScriptProvider with abstract methods get_resource_inst

    opened by kroraina-threesteps 6
Releases (v0.3.10)
  • v0.3.9(Mar 18, 2017)

    New features:

    • Support for Python 3.6.
    • Auto Pause: the project is paused for scheduler.PAUSE_TIME (default: 5 min) when the last scheduler.FAIL_PAUSE_NUM (default: 10) tasks have all failed; after scheduler.PAUSE_TIME, scheduler.UNPAUSE_CHECK_NUM (default: 3) tasks are dispatched, and the project resumes if any of them succeeds.
    • Each callback now has a default 30-second processing time limit. (Platform support required) @beader
    • New JavaScript render engine, Splash, is supported: enabled by the fetch argument --splash-endpoint=http://splash:8050/execute
    • Python 3 webdav support.
    • Python 3 from projects import project support.
    • A link to the corresponding task is added to the WebUI debug page when debugging an existing task.
    • New user_agent parameter in self.crawl (you can still set the user agent via headers); see the sketch after this list.
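
    A minimal sketch of the new parameter (the URL, callback and UA string below are placeholders, not defaults):

    from pyspider.libs.base_handler import *

    class Handler(BaseHandler):
        def on_start(self):
            # per-request user agent, equivalent to sending a User-Agent header
            self.crawl('http://example.com/', callback=self.index_page,
                       user_agent='Mozilla/5.0 (compatible; my-crawler)')

        def index_page(self, response):
            return {'url': response.url}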

    Fix several bugs:

    • New WebUI dashboard frontend framework, vue.js, which improves performance with a large number of tasks (e.g. http://demo.pyspider.org/)
    • Fix issue where crawl_config was not applied in the WebUI while debugging a script.
    • Fix issue where the CSS Selector Helper did not work. @ackalker
    • Fix connection_timeout not being applied.
    • Fix need_auth option not being applied to webdav.
    • Fix "fix can't dump counter to file: scheduler.all" error.
    • Some other fixes
    Source code(tar.gz)
    Source code(zip)
  • v0.3.8(Aug 18, 2016)

    New features:

    Fix several bugs:

    • Fixed a global config object thread-interference issue, which could cause a connect to scheduler rpc error: error(10061, '') error when running all --run-in=thread (the default on Windows)
    • Fix response.save being lost when the fetch failed
    • Fix potential scheduler failure caused by an old version of six
    • Fix result dump returning nothing when using the MongoDB backend
    Source code(tar.gz)
    Source code(zip)
  • v0.3.7(Apr 20, 2016)

    • ThreadBaseScheduler added to improve the performance of scheduler
    • robots.txt supported!
    • elasticsearch database backend supported!
    • new script callback on_finished, http://docs.pyspider.org/en/latest/About-Projects/#on_finished-callback
    • you can now set the delay time between retries:

    retry_delay is a dict that specifies retry intervals. The items in the dict are {retried: seconds}; the special key '' (empty string) sets the default retry delay for retry counts not listed. See the sketch at the end of this release's notes.

    • dict parameters in crawl_config and @config are merged (e.g. headers), thanks to @ihipop
    • add parameter max_redirects to self.crawl to control the maximum number of redirects during the fetch, thanks to @AtaLuZiK
    • add parameter validate_cert to self.crawl to ignore errors in the server's certificate.
    • new property etree for Response; etree is a cached lxml.html.HtmlElement object, thanks to @waveyeung
    • you can now pass arguments to phantomjs from command line or config file.
    • support for pymongo 3.0
    • local.projectdb now accept a glob path (e.g. script/*.py) to load multiple projects from local filesystem.
    • fix queue size in the dashboard not working on OS X, thanks to @xyb
    • counters in the dashboard are now shown for stopped projects
    • other bug fixes
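
    A minimal sketch of retry_delay in a handler (the values are illustrative, not the defaults):

    from pyspider.libs.base_handler import *

    class Handler(BaseHandler):
        retry_delay = {
            0: 30,             # first retry 30 seconds after the failure
            1: 1 * 60 * 60,    # second retry after 1 hour
            '': 24 * 60 * 60,  # any further retry after 1 day
        }
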
    Source code(tar.gz)
    Source code(zip)
  • v0.3.6(Nov 10, 2015)

    • NEW: webdav mode, now you can use webdav to mount the project folder onto your local filesystem and edit scripts with your favorite editor! (Python 3 not supported; wsgidav required, which is not included in setup.py)
    • bug fixes for Python 3 compatibility, Postgresql, flask-Login>=0.3.0, typo and more, thanks for the help of @lushl9301 @hitjackma @exoticknight @d0ugal @qiang.luo @twinmegami @jttoday @machinewu @littlezz @yaokaige
    • fix Queue.qsize NotImplementedError on Mac OS X, thanks @xyb
    Source code(tar.gz)
    Source code(zip)
  • v0.3.5(May 22, 2015)

    • New parameter auto_recrawl: automatically restart the task every time its age expires.
    • New parameter: js_viewport_width/js_viewport_height to set viewport size for phantomjs engine.
    • New command line option to set different message queue backends with URI scheme.
    • New task level storage mechanism: self.save
    • New redis taskdb
    • New redis message queue.
    • New high level message queue interface kombu.
    • Fix bugs related to mongodb (keyword missing if not set).
    • Fix phantomjs not working in all mode.
    • Fix a potential deadlock in processor send_message.
    • Default log level of scheduler is changed to INFO
    Source code(tar.gz)
    Source code(zip)
  • v0.3.4(Apr 21, 2015)

    Global

    • New message queue support: beanstalkd by @tiancheng91
    • New global argument: --logging-config to specify a custom logging config (for instance, to disable werkzeug logs). You can get a sample config from pyspider/logging.conf.
    • Project group info is added to the task package now.
    • Change docker base image to cmfatih/phantomjs; you can now use phantomjs with the same docker image.
    • Automatically restart phantomjs if it crashes (only enabled in all mode by default).

    WebUI

    • Show next exetime of a task in task page.
    • Show fetch time and process time in tasks page.
    • Show average fetch time and process time in 5min in dashboard page.
    • Show message queue status in dashboard page.
    • limit and offset parameter support in result dump.
    • Fix frontend bug when crawling pages with dataurl.

    Other

    • Fix support for phantomjs 2.0.
    • Fix scheduler project update notification not working, and use the md5sum of the script as an additional signal.
    • Scheduler: periodic counter report in log.
    • Fetcher: fix for legacy version of pycurl
    Source code(tar.gz)
    Source code(zip)
  • v0.3.3(Mar 8, 2015)

    API

    • self.crawl now raises TypeError when it gets unexpected arguments
    • self.crawl now accepts a cURL command as its first argument, see http://docs.pyspider.org/en/latest/apis/self.crawl/#curl-command.

    WEBUI

    • A new CSS selector toolbar is added; the pre-generated CSS selector pattern can be modified and added/copied to the script.

    Benchmarking

    • The database table for bench test will be cleared before and after bench test.
    • insert/update/get bench tests for the database and put/get tests for the message queue are added.

    Other

    • The default message queue is switched to AMQP.
    • docs fix.
    Source code(tar.gz)
    Source code(zip)
  • v0.3.2(Feb 11, 2015)

    Scheduler

    • The size of the task queue is more accurate now; you can use it to determine the all-done status of the scheduler.

    Fetcher

    • Fix tornado losing cookies during 30x redirects
    • You can now use cookies and the Cookie header at the same time
    • Fix proxy not working bug.
    • Enable proxy by default.
    • Proxy now support username and password authorization. @soloradish
    • The Etag and Last-Modified headers are disabled when the last crawl failed.

    Databases

    • MySQL default engine changed to InnoDB @laapsaap
    • MySQL: larger result column size, changed to MEDIUMBLOB (up to 16 MB) @laapsaap

    WebUI

    • The WebUI now uses the same arguments as the fetcher, fixing the proxy not working for the WebUI.
    • Results will be sorted in the order of updatetime.

    One Mode

    • Script exception logs are now printed to the screen

    New Command send_message

    You can use the command pyspider send_message [project] [message] to send a message to a project from the command line. (A sketch of the receiving side follows.)
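
    A hedged sketch of the receiving side (an assumption based on pyspider's message API, not part of this release note): the target project handles incoming messages in an on_message callback.

    from pyspider.libs.base_handler import *

    class Handler(BaseHandler):
        def on_message(self, project, message):
            # called when send_message (or another project) delivers a message
            return {'from': project, 'message': message}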

    Other

    • Using localhosted test web pages
    • Remove version specify of lxml, you can use apt-get to install any version of lxml
    Source code(tar.gz)
    Source code(zip)
  • v0.3.1(Jan 22, 2015)

    One Mode

    One mode not only means all-in-one; it runs everything in one process over tornado.ioloop. One mode is designed for debugging. You can test scripts written in local files and use --interactive to choose a task to test.

    With one mode you can use pyspider.libs.utils.python_console() to open an interactive shell in your script context to test your code (see the sketch below).

    full documentation: http://docs.pyspider.org/en/latest/Command-Line/#one
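
    A minimal sketch of that workflow (the URL and the breakpoint location are placeholders):

    from pyspider.libs import utils
    from pyspider.libs.base_handler import *

    class Handler(BaseHandler):
        def on_start(self):
            self.crawl('http://example.com/', callback=self.index_page)

        def index_page(self, response):
            utils.python_console()   # drop into an interactive shell here to inspect the task
            return {'title': response.doc('title').text()}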

    • bug fix
    Source code(tar.gz)
    Source code(zip)
  • v0.3.0(Jan 11, 2015)

    • A lot of bug fixed.
    • Make pyspider as a single top-level package. (thanks to zbb, iamtew and fmueller from HN)
    • Python 3 support!
    • Use click to create a better command line interface.
    • PostgreSQL supported via SQLAlchemy (with the power of SQLAlchemy, pyspider also supports Oracle, SQL Server, etc.).
    • Benchmark test.
    • Documentation & tutorial: http://docs.pyspider.org/
    • Flake8 cleanup (thanks to @jtwaleson)

    Base

    • Use MessagePack instead of pickle in the message queue.
    • JSON data is encoded as a base64 string when the content is binary.
    • Rabbitmq lazy limit for better performance.

    Scheduler

    • Never re-crawl a task with a negative age.

    Fetcher

    • proxy parameter supports the ip:port format.
    • increase default fetcher poolsize to 100.
    • PhantomJS will return JS script result in Response.js_script_result.

    Processor

    • Put multiple new tasks in one package, improving performance with RabbitMQ.
    • Don't store all of the headers on success.

    Script

    • Add an interface, get_taskid, to generate the taskid from the task object (see the sketch after this list).
    • Tasks are de-duplicated by project and taskid.
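
    A minimal sketch of overriding get_taskid in a handler (hedged; mixing the POST body into the id is an illustrative choice, not the library default, which hashes the URL alone):

    import json

    from pyspider.libs.base_handler import *
    from pyspider.libs.utils import md5string

    class Handler(BaseHandler):
        def get_taskid(self, task):
            # de-duplicate on URL + POST data instead of the URL alone
            return md5string(task['url'] + json.dumps(task['fetch'].get('data', '')))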

    Webui

    • Project list sortable.
    • Return a 404 page when dumping a project that does not exist.
    • Web preview supports images
    Source code(tar.gz)
    Source code(zip)
  • v0.2.0(Nov 12, 2014)

    Base

    • MySQL and MongoDB backend support; you can use a database URI to set them up.
    • RabbitMQ as the queue for distributed deployment
    • docker supported
    • support for Windows
    • support for Python 2.6
    • a resultdb, result_worker and WebUI are added.

    Scheduler

    • cronjob task supported
    • delete project supported

    Fetcher

    • a phantomjs fetcher is added; now you can fetch pages built with JavaScript/AJAX!

    Processor

    • send_message api to send message to other projects
    • you can now import another project as a module via from projects import xxxx
    • @config helper for setting configs for a callback

    WEBUI

    • a CSS selector helper is added to the debugger.
    • an option to switch the JS/CSS CDN.
    • a page of task history/config
    • a page of recent active tasks
    • pages of results
    • a demo mode is added for http://demo.pyspider.org/

    Others

    • bug fixes
    • more tests, coverage is used.
    Source code(tar.gz)
    Source code(zip)
  • v0.1.0(Mar 9, 2014)

    finish a basic runnable system with:

    • sqlite3 task & project database
    • runnable scheduler & fetcher & processor
    • basic dashboard and debugger
    Source code(tar.gz)
    Source code(zip)
Owner
Roy Binux