A scalable frontier for web crawlers

Frontera

Overview

Frontera is a web crawling framework consisting of a crawl frontier and distribution/scaling primitives, allowing you to build large-scale online web crawlers.

Frontera takes care of the logic and policies to follow during the crawl. It stores and prioritises links extracted by the crawler to decide which pages to visit next, and it is capable of doing this in a distributed manner.

Main features

  • Online operation: small request batches, with parsing done right after fetch.
  • Pluggable backend architecture: low-level backend access logic is separated from the crawling strategy.
  • Two run modes: single process and distributed.
  • Built-in SQLAlchemy, Redis and HBase backends.
  • Built-in Apache Kafka and ZeroMQ message buses.
  • Built-in crawling strategies: breadth-first, depth-first, Discovery (with support for robots.txt and sitemaps).
  • Battle tested: our biggest deployment is 60 spiders/strategy workers delivering 50-60M documents daily for 45 days, without downtime.
  • Transparent data flow, allowing you to integrate custom components easily using Kafka.
  • Message bus abstraction, providing a way to implement your own transport (ZeroMQ and Kafka are available out of the box).
  • Optional use of Scrapy for fetching and parsing.
  • 3-clause BSD license, allowing use in any commercial product.
  • Python 3 support.

Installation

$ pip install frontera
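To give a feel for how the pieces fit together, here is a minimal, hedged sketch of wiring Frontera into a Scrapy project, assembled from the configuration fragments that appear in the comments further down this page; the module name myproject.frontera_settings and the SQLite engine URI are placeholders, not part of Frontera itself.

# Scrapy settings.py (sketch): hand scheduling over to Frontera.
SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'

SPIDER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
}
DOWNLOADER_MIDDLEWARES = {
    'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
}

# Points at the module holding Frontera's own settings (placeholder name).
FRONTERA_SETTINGS = 'myproject.frontera_settings'

# myproject/frontera_settings.py (sketch): choose a backend; the backend path
# and engine setting below are taken from the comments further down, while the
# SQLite URI is just an illustrative value.
BACKEND = 'frontera.contrib.backends.sqlalchemy.Distributed'
SQLALCHEMYBACKEND_ENGINE = 'sqlite:///frontera.db'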

Documentation

Community

Join our Google group at https://groups.google.com/a/scrapinghub.com/forum/#!forum/frontera or check GitHub issues and pull requests.

Comments
  • Redesign codecs

    Issue discussed here: https://github.com/scrapinghub/frontera/issues/211#issuecomment-251931413

    Todo list:

    • [x] Fix msgpack codec
    • [x] Fix json codec
    • [x] Integration test with HBase backend (manually)

    This PR fixes #211

    Other things done in this PR besides the todo list:

    • Added two methods, _convert and reconvert, to the json codec. These are needed because JSONEncoder accepts strings only as unicode. _convert converts objects recursively to unicode and records their type (a sketch of this idea follows the list).
    • Made msgpack >= 0.4 a requirement, as only versions greater than 0.4 support the changes made in this PR.
    • Fixed a buggy test case in test_message_bus_backend which got exposed after fixing the codecs.
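    To illustrate the idea (this is not the PR's actual code), a minimal sketch of converting values recursively while recording their original type, so that a later re-convert step can restore bytes and tuples after a JSON round trip:

    # Sketch only: tag each value with its type before JSON encoding, so the
    # decoder can rebuild bytes/tuples that JSON itself cannot represent.
    def _convert(obj):
        if isinstance(obj, bytes):
            return ['bytes', obj.decode('latin1')]
        if isinstance(obj, dict):
            return ['dict', [[_convert(k), _convert(v)] for k, v in obj.items()]]
        if isinstance(obj, (list, tuple)):
            return [type(obj).__name__, [_convert(x) for x in obj]]
        return ['plain', obj]

    def _reconvert(tagged):
        tag, value = tagged
        if tag == 'bytes':
            return value.encode('latin1')
        if tag == 'dict':
            return dict((_reconvert(k), _reconvert(v)) for k, v in value)
        if tag in ('list', 'tuple'):
            seq = [_reconvert(x) for x in value]
            return tuple(seq) if tag == 'tuple' else seq
        return value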
    opened by voith 35
  • Distributed example (HBase, Kafka)

    The documentation is a little thin and does not explain how to integrate with Kafka and HBase for a fully distributed architecture. Could you please provide, in the examples folder, an example of a well-configured distributed Frontera setup?
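    In the absence of such an example, here is a rough, hedged sketch of what a distributed (Kafka message bus + HBase backend) Frontera settings module might contain. The setting names are assumptions based on the Frontera documentation of that era, and the hosts, ports and partition counts are placeholders; verify everything against your Frontera version.

    # Sketch only: distributed run mode with Kafka and HBase (values are placeholders).
    MESSAGE_BUS = 'frontera.contrib.messagebus.kafkabus.MessageBus'
    KAFKA_LOCATION = 'localhost:9092'      # Kafka broker address

    BACKEND = 'frontera.contrib.backends.hbase.HBaseBackend'
    HBASE_THRIFT_HOST = 'localhost'        # HBase Thrift server
    HBASE_THRIFT_PORT = 9090

    SPIDER_FEED_PARTITIONS = 2             # roughly one per spider instance
    SPIDER_LOG_PARTITIONS = 1              # roughly one per strategy worker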

    opened by casertap 33
  • PY3 Syntactic changes.

    Most of the changes were produced using the modernize script. Changes include print syntax, error syntax, converting iterators and generators to lists, etc. Also includes some other changes which were missed by the script.

    opened by Preetwinder 32
  • Redirect loop when using distributed-frontera

    I am using the development versions of distributed-frontera, frontera and scrapy for crawling. After a while my spider gets stuck in a redirect loop. Restarting the spider helps, but after a while this happens:

    2015-12-21 17:23:22 [scrapy] INFO: Crawled 10354 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2015-12-21 17:23:22 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:22 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
    2015-12-21 17:23:22 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:23 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:24 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:25 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:25 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
    2015-12-21 17:23:25 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:26 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:27 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:32 [scrapy] INFO: Crawled 10354 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2015-12-21 17:23:32 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:32 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
    2015-12-21 17:23:33 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
    2015-12-21 17:23:34 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
    2015-12-21 17:23:34 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
    2015-12-21 17:23:35 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
    2015-12-21 17:23:35 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
    2015-12-21 17:23:36 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
    2015-12-21 17:23:37 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
    2015-12-21 17:23:38 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
    2015-12-21 17:23:43 [scrapy] INFO: Crawled 10354 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2015-12-21 17:23:43 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
    ...
    2015-12-21 17:45:38 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
    2015-12-21 17:45:43 [scrapy] INFO: Crawled 10354 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    

    This does not seem to be an issue with distributed-frontera since I could not find any code related to redirecting there.

    opened by lljrsr 25
  • [WIP] Added Cassandra backend

    This PR is a rebase of #128. Although I have completely changed the design and refactored the code, I have added @wpxgit's commits (but squashed them), because this work was originally initiated by him.

    I have tried to follow the DRY methodology as much as possible, so I had to refactor some existing code.

    I have serialized dicts using pickle; as a result, this backend won't have the problems discussed in #211.

    The PR includes unit tests and some integration tests with the backends integration testing framework.

    It's good that Frontera has an integration test framework for testing backends in single-threaded mode. However, a similar framework for the distributed mode is very much needed.

    I am open to all sorts of suggestions :)

    opened by voith 17
  • cluster kafka db worker doesn't recognize partitions

    Hi, I'm trying to use the cluster configuration. I've created the topics in Kafka and have it up and running, but I'm running into trouble starting the database worker. I tried python -m frontera.worker.db --config config.dbw --no-incoming --partitions 0,1 and got an error that 0,1 is not recognized; with python -m frontera.worker.db --config config.dbw --no-incoming --partitions 0 I was getting the same issue as in #359, but somehow that stopped happening.

    Now I'm getting an error that the Kafka partitions are not recognized or iterable, see the traceback below. I'm using Python 3.6 and Frontera from the repo (FYI, qzm and cachetools still needed to be installed manually). Any ideas?

    File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main "main", mod_spec) File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code exec(code, run_globals) File "/usr/lib/python3.6/dist-packages/frontera/worker/db.py", line 246, in args.no_scoring, partitions=args.partitions) File "/usr/lib/python3.6/dist-packages/frontera/worker/stats.py", line 22, in init super(StatsExportMixin, self).init(settings, *args, **kwargs) File "/usr/lib/python3.6/dist-packages/frontera/worker/db.py", line 115, in init self.slot = Slot(self, settings, **slot_kwargs) File "/usr/lib/python3.6/dist-packages/frontera/worker/db.py", line 46, in init self.components = self._load_components(worker, settings, **kwargs) File "/usr/lib/python3.6/dist-packages/frontera/worker/db.py", line 55, in _load_components component = cls(worker, settings, stop_event=self.stop_event, **kwargs) File "/usr/lib/python3.6/dist-packages/frontera/worker/components/scoring_consumer.py", line 24, in init self.scoring_log_consumer = scoring_log.consumer() File "/usr/lib/python3.6/dist-packages/frontera/contrib/messagebus/kafkabus.py", line 219, in consumer return Consumer(self._location, self._enable_ssl, self._cert_path, self._topic, self._group, partition_id=None) File "/usr/lib/python3.6/dist-packages/frontera/contrib/messagebus/kafkabus.py", line 60, in init self._partitions = [TopicPartition(self._topic, pid) for pid in self._consumer.partitions_for_topic(self._topic)]

    opened by danmsf 16
  • [WIP] Downloader slot usage optimization

    Imagine we have a queue of 10K URLs from many different domains, and our task is to fetch them as fast as possible. At the same time we have prioritization that tends to group URLs from the same domain. During downloading we want to be polite and limit per-host RPS. So picking just the top URLs from the queue leads to wasted time, because the connection pool of the Scrapy downloader is underused most of the time.

    In this PR, I'm addressing this issue by propagating information about overused hostnames/IPs in the downloader pool.
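    Conceptually, the scheduler tells the backend which hosts the downloader is currently saturating, so the backend can skip them when assembling the next batch. A hedged sketch of that idea follows; the keyword names key_type and overused_keys are assumptions about what gets propagated, not verified API:

    # Sketch: pick the next batch while skipping URLs whose host is overused.
    from urllib.parse import urlparse

    def get_next_requests(queue, max_n_requests, key_type='domain', overused_keys=()):
        # key_type would indicate whether overused_keys are domains or IPs;
        # only the domain case is handled in this sketch.
        overused = set(overused_keys)
        batch, skipped = [], []
        while queue and len(batch) < max_n_requests:
            request = queue.pop(0)
            host = urlparse(request.url).hostname
            if host in overused:
                skipped.append(request)   # keep it for a later batch
            else:
                batch.append(request)
        queue[:0] = skipped               # put skipped requests back at the front
        return batch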

    opened by sibiryakov 16
  • Fixed scheduler process_spider_output() to yield requests

    Fixes #253. A screenshot using the same code discussed there is attached to the PR.

    Nothing seems to break when testing this change manually. The only test that was failing was wrong IMO because it passed a list of requests and items and was only expecting items in return. I have modified that test to make it compatible with this patch.

    I've split this PR into three commits:

    • The first commit adds a test to reproduce the bug.
    • The second commit fixes the bug.
    • The third commit fixes the broken test discussed above.

    A note about the tests added:

    The tests might be a little difficult to understand at first sight. I would recommend reading the following code in order to understand them:
    • https://github.com/scrapy/scrapy/blob/master/scrapy/core/spidermw.py#L34-L73: This is to understand how scrapy processes the different methods of the spider middleware.
    • https://github.com/scrapy/scrapy/blob/master/scrapy/core/scraper.py#L135-L147: This is to understand how the scrapy core executes the spider middleware methods and passes the control to the spider callbacks.

    I have simulated the code discussed above in order to write the test.

    opened by voith 15
  • New DELAY_ON_EMPTY functionality on FronteraScheduler terminates crawl right at start

    Until this is solved you can use this in your settings as a workaround:

    DELAY_ON_EMPTY=0.0
    

    The problem is in frontera.contrib.scrapy.schedulers.FronteraScheduler, in the method _get_next_requests. If there are no pending requests and the test self._delay_next_call < time() fails, an empty list is returned, which causes the crawl to terminate.
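    A simplified sketch of the failure mode being described (not the actual scheduler code; names and structure are illustrative only):

    import time

    DELAY_ON_EMPTY = 5.0        # the workaround above sets this to 0.0

    class SchedulerSketch:
        def __init__(self):
            self._delay_next_call = 0.0
            self.pending = []

        def _get_next_requests(self):
            if not self.pending:
                # If the delay has not elapsed yet, an empty list is returned;
                # right at start-up Scrapy interprets that as "nothing to do"
                # and terminates the crawl.
                if not (self._delay_next_call < time.time()):
                    return []
                self._delay_next_call = time.time() + DELAY_ON_EMPTY
                return self.ask_frontier_for_requests()
            return self.pending

        def ask_frontier_for_requests(self):
            return []  # placeholder for the real call into the frontier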

    bug 
    opened by plafl 14
  • Fix SQL integer type for crc32 field

    CRC32 is an unsigned 4-byte int, so it does not fit in a signed 4-byte int (Integer). There is no unsigned int type in the SQL standard, so I changed it to BigInteger instead. Without this change, both MySQL and Postgres complain that the host_crc32 field value is out of bounds. Another option (to save space) would be to convert CRC32 into a signed 4-byte int, but this would complicate things, and I'm not sure it's worth it.
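    For reference, a quick Python illustration of why the unsigned value can overflow a signed 32-bit column, and of the signed-folding alternative mentioned above:

    import zlib

    crc = zlib.crc32(b'example.com')      # unsigned 32-bit value in Python 3
    print(crc)                            # can exceed 2**31 - 1 = 2147483647

    # Folding into a signed 32-bit int (the space-saving alternative):
    signed = crc - 2**32 if crc >= 2**31 else crc
    assert -2**31 <= signed < 2**31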

    opened by lopuhin 12
  • Use crawler settings as a fallback when there's no FRONTERA_SETTINGS

    This is a follow up to https://github.com/scrapinghub/frontera/pull/45.

    It enables the manager to receive the crawler settings and then instantiate the frontera settings accordingly. I added a few tests that should make the new behavior a little clearer.

    Is something along these lines acceptable? How can it be improved?

    opened by josericardo 12
  • how can I know it works when I use it with scrapy?

    I did everything as described in the running-the-crawl document and started the crawl with:

    scrapy crawl my-spider
    

    I can see items being crawled in the console, but I don't know whether Frontera is actually working.

    What I did

    sandwarm/frontera/settings.py

    
    BACKEND = 'frontera.contrib.backends.sqlalchemy.Distributed'
    
    SQLALCHEMYBACKEND_ENGINE="mysql://acme:[email protected]:3306/acme"
    SQLALCHEMYBACKEND_MODELS={
        'MetadataModel': 'frontera.contrib.backends.sqlalchemy.models.MetadataModel',
        'StateModel': 'frontera.contrib.backends.sqlalchemy.models.StateModel',
        'QueueModel': 'frontera.contrib.backends.sqlalchemy.models.QueueModel'
    }
    
    SPIDER_MIDDLEWARES.update({
        'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
    })
    
    DOWNLOADER_MIDDLEWARES.update({
        'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
    })
    
    SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'
    
    

    settings.py

    FRONTERA_SETTINGS = 'sandwarm.frontera.settings'
    
    

    Since I enabled the MySQL backend, I would expect to see a connection error, because I haven't started MySQL yet.

    Thanks for all your hard work, but please make the documentation easier for humans, for example with a very basic working example. Currently we need to gather all the documents to get the basic idea and, even worse, it still doesn't work at all. I have already spent a week trying to get a working example.

    opened by vidyli 1
  • Project Status?

    It's been a year since the last commit on the master branch. Do you have any plans to maintain this? I noticed a lot of issues don't get resolved, and lots of PRs are still pending.

    opened by psdon 8
  • Message Decode Error

    I'm getting the following error when adding a URL to Kafka for Scrapy to parse:

    2020-09-07 20:12:46 [messagebus-backend] WARNING: Could not decode message: b'http://quotes.toscrape.com/page/1/', error unpack(b) received extra data.
    
    opened by ab-bh 0
  • The `KeyError` thrown when running to_fetch in the StatesContext class: b'fingerprint'

    https://github.com/scrapinghub/frontera/blob/master/frontera/core/manager.py
    I use the 0.8.1 code base in LOCAL_MODE. The KeyError is thrown when execution reaches to_fetch in the StatesContext class:

    from line 801:

    class StatesContext(object):
    	...
        def to_fetch(self, requests):
            requests = requests if isinstance(requests, Iterable) else [requests]
            for request in requests:
                fingerprint = request.meta[b'fingerprint'] # error occured here!!!
    

    I think the reason is that the meta key b'fingerprint' is used before it is set:

    from line 302:

    class LocalFrontierManager(BaseContext, StrategyComponentsPipelineMixin, BaseManager):
        def page_crawled(self, response):
    ...
        self.states_context.to_fetch(response)  # b'fingerprint' is used here
        self.states_context.fetch()
        self.states_context.states.set_states(response)
        super(LocalFrontierManager, self).page_crawled(response)  # but it is only set here!
        self.states_context.states.update_cache(response)
    

    from line 233:

    class BaseManager(object):
        def page_crawled(self, response):
    ...
            self._process_components(method_name='page_crawled',
                                     obj=response,
                                     return_classes=self.response_model)  # b'fingerprint' gets set when the pipeline goes through here
    

    My current workaround is to add these lines to the to_fetch method of the StatesContext class:

        def to_fetch(self, requests):
            requests = requests if isinstance(requests, Iterable) else [requests]
            for request in requests:
                if b'fingerprint' not in request.meta:                
                    request.meta[b'fingerprint'] = sha1(request.url)
                fingerprint = request.meta[b'fingerprint']
                self._fingerprints[fingerprint] = request
    

    What is the correct way to fix this?

    opened by yujiaao 0
  • KeyError [b'frontier'] on Request Creation from Spider

    Issue might be related to #337

    Hi,

    I have already read in discussions here that the scheduling of requests should be done by Frontera, and apparently even their creation should be done by the frontier and not by the spider. However, the documentation of both Scrapy and Frontera says that requests should be yielded in the spider's parse function.

    What should the process look like if requests are to be created by the crawling strategy and not yielded by the spider? How does the spider trigger that?
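    For what it's worth, a hedged sketch of the intended flow is shown below: the crawling strategy, not the spider, creates and schedules requests. The base class import path and the exact signatures of create_request and schedule are assumptions (they vary between Frontera versions), so treat this as an outline rather than working code.

    # Sketch of a custom crawling strategy; verify the API against your version.
    from frontera.strategy import BaseCrawlingStrategy   # import path may differ per version

    class SketchStrategy(BaseCrawlingStrategy):
        def read_seeds(self, stream):
            for line in stream:
                url = line.strip()
                if url:
                    request = self.create_request(url)   # built via the manager's middlewares
                    self.schedule(request, score=1.0)

        def page_crawled(self, response):
            pass  # inspect the response, update state, decide when to stop

        def links_extracted(self, request, links):
            for link in links:
                self.schedule(link, score=0.5)   # the strategy decides what gets fetched next

        def request_error(self, request, error):
            pass  # e.g. lower the score or give up on the host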

    In my use case, I am using scrapy-selenium with scrapy and frontera (I use SeleniumRequests to be able to wait for JS loaded elements).

    I have to generate the URLs I want to scrape in two phases: I first yield them in the start_requests() method of the spider instead of using a seeds file, and then yield requests for the extracted links in the first of two parse functions.

    Yielding SeleniumRequests from start_requests works, but yielding SeleniumRequests from the parse function afterwards results in the following error (only pasted an extract, as the iterable error prints the same errors over and over):

    return (_set_referer(r) for r in result or ())
      File "/Users/user/opt/anaconda3/envs/frontera-update/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
        for r in iterable:
      File "/Users/user/opt/anaconda3/envs/frontera-update/lib/python3.8/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/Users/user/opt/anaconda3/envs/frontera-update/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
        for r in iterable:
      File "/Users/user/opt/anaconda3/envs/frontera-update/lib/python3.8/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/Users/user/opt/anaconda3/envs/frontera-update/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
        for r in iterable:
      File "/Users/user/opt/anaconda3/envs/frontera-update/lib/python3.8/site-packages/frontera/contrib/scrapy/schedulers/frontier.py", line 112, in process_spider_output
        frontier_request = response.meta[b'frontier_request']
    KeyError: b'frontier_request'
    

    Very thankful for all hints and examples!

    opened by dkipping 3
Releases (v0.8.1)
  • v0.8.1(Apr 5, 2019)

  • v0.8.0.1(Jul 30, 2018)

  • v0.8.0(Jul 25, 2018)

    This is a major release containing many architectural changes. The goal of these changes is to make development and debugging of crawling strategies easier. From now on there is an extensive guide in the documentation on how to write a custom crawling strategy, a single-process mode making it much easier to debug a crawling strategy locally, and the old distributed mode for production systems. Starting from this version there is no requirement to set up Apache Kafka or HBase to experiment with crawling strategies on your local computer.

    We also removed unnecessary, rarely used features (the distributed spiders run mode and prioritisation logic in backends) to make Frontera easier to use and understand.

    Here is a (somewhat) full change log:

    • PyPy (2.7.*) support,
    • Redis backend (kudos to @khellan),
    • LRU cache and two cache generations for HBaseStates,
    • Discovery crawling strategy, respecting robots.txt and leveraging sitemaps to discover links faster,
    • Breadth-first and depth-first crawling strategies,
    • new mandatory component in backend: DomainMetadata,
    • filter_links_extracted method in crawling strategy API to optimise calls to backends for state data,
    • create_request in crawling strategy is now using FronteraManager middlewares,
    • many batch gen instances,
    • support of latest kafka-python,
    • statistics are sent to message bus from all parts of Frontera,
    • overall reliability improvements,
    • settings for OverusedBuffer,
    • DBWorker was refactored and divided on components (kudos to @vshlapakov),
    • seed addition can now be done using S3,
    • Python 3.7 compatibility.
  • v0.7.1(Feb 9, 2017)

    Thanks to @voith, a problem introduced at the beginning of Python 3 support, when Frontera supported only keys and values stored as bytes in .meta fields, is now solved. Many Scrapy middlewares weren't working, or were working incorrectly. This is still not thoroughly tested, so please report any bugs.

    Other improvements include:

    • batched states refresh in crawling strategy,
    • proper access to redirects in Scrapy converters,
    • more readable and simple OverusedBuffer implementation,
    • examples, tests and docs fixes.

    Thank you all, for your contributions!

  • v0.7.0(Nov 29, 2016)

    Long-awaited support for the kafka-python 1.x.x client. Frontera is now much more resistant to physical connectivity loss and uses the new asynchronous Kafka API. Other improvements:

    • the SW consumes less CPU (because of less frequent state flushing),
    • the request creation API in BaseCrawlingStrategy has changed and is now batch oriented,
    • a new article in the docs on cluster setup,
    • an option to disable scoring log consumption in the DB worker,
    • a fix for HBase table dropping,
    • improved test coverage.
  • v0.6.0(Aug 18, 2016)

    • Full Python 3 support 👏 👍 🍻 (https://github.com/scrapinghub/frontera/issues/106); all the thanks goes to @Preetwinder.
    • The canonicalize_url method was removed in favor of the w3lib implementation.
    • The whole Request (incl. meta) is propagated to the DB Worker by means of the scoring log (fixes https://github.com/scrapinghub/frontera/issues/131).
    • CRC32 is generated from the hostname the same way on both platforms: Python 2 and 3.
    • HBaseQueue now supports delayed requests: a 'crawl_at' field in meta with a timestamp makes the request available to spiders only after that moment has passed. An important feature for revisiting (see the sketch after this list).
    • The Request object is now persisted in HBaseQueue, allowing requests to be scheduled with specific meta, headers, body and cookies parameters.
    • A MESSAGE_BUS_CODEC option, allowing a message bus codec other than the default to be chosen.
    • Strategy worker refactoring to simplify its customization from subclasses.
    • Fixed a bug with the distribution of extracted links over spider log partitions (https://github.com/scrapinghub/frontera/issues/129).
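    A hedged sketch of scheduling a delayed (revisit) request via that meta field; the byte-string key form b'crawl_at' is an assumption based on the byte-keyed .meta convention seen elsewhere on this page:

    import time

    def schedule_revisit(strategy, request, delay_seconds=3600):
        # Make the request eligible for fetching only after the delay passes.
        request.meta[b'crawl_at'] = int(time.time()) + delay_seconds
        strategy.schedule(request, score=1.0)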
  • v0.5.3(Jul 22, 2016)

  • v0.5.2.3(Jul 18, 2016)

  • v0.5.2.2(Jun 29, 2016)

    • CONSUMER_BATCH_SIZE is removed and two new options are introduced: SPIDER_LOG_CONSUMER_BATCH_SIZE and SCORING_LOG_CONSUMER_BATCH_SIZE (see the snippet after this list).
    • A traceback is written to the log when SIGUSR1 is received in the DBW or SW.
    • Finishing in the SW is fixed when the crawling strategy reports finished.
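    For reference, the replacement settings named above, with purely illustrative values:

    SPIDER_LOG_CONSUMER_BATCH_SIZE = 512    # batch size for the spider log consumer
    SCORING_LOG_CONSUMER_BATCH_SIZE = 512   # batch size for the scoring log consumer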
  • v0.5.2.1(Jun 24, 2016)

    Before this release the default compression codec was Snappy. We found out that Snappy support is broken in certain Kafka versions and issued this release. The latest version has no compression codec enabled by default and allows choosing the compression codec with the KAFKA_CODEC_LEGACY option.

  • v0.5.2(Jun 21, 2016)

  • v0.5.1.1(Jun 2, 2016)

  • v0.5.0(Jun 1, 2016)

    Here is the change log:

    • the latest SQLAlchemy unicode-related crashes are fixed,
    • a corporate-website-friendly canonical solver has been added,
    • the crawling strategy concept evolved: added the ability to add an arbitrary URL to the queue (with a transparent state check), and FrontierManager is available on construction,
    • the strategy worker code was refactored,
    • a default state was introduced for links generated during crawling strategy operation,
    • got rid of Frontera logging in favor of Python native logging,
    • logging system configuration by means of logging.config using a file,
    • partitions can now be assigned to instances from the command line,
    • improved test coverage from @Preetwinder.

    Enjoy!

  • v0.4.2(Apr 22, 2016)

    This release prevents installing kafka-python package versions newer than 0.9.5. Newer versions have significant architectural changes and require Frontera code adaptation and testing. If you are using the Kafka message bus, then you're encouraged to install this update.

  • v0.4.1(Jan 18, 2016)

    • fixed API docs generation on RTD,
    • added body field in Request objects, to support POST-type requests,
    • guidance on how to set MAX_NEXT_REQUESTS and settings docs fixes,
    • fixed colored logging.
  • v0.4.0(Dec 30, 2015)

    A tremendous work was done:

    • distributed-frontera and frontera were merged into a single project, to make it easier to use and understand,
    • the Backend was completely redesigned. It now consists of Queue, Metadata and States objects for low-level code, and higher-level Backend implementations for crawling policies,
    • added a definition of run modes: single process, distributed spiders, and distributed spiders and backend,
    • the overall distributed concept is now integrated into Frontera, making the difference between using components in the single-process and distributed spiders/backend run modes clearer,
    • significantly restructured and augmented documentation, addressing user needs in a more accessible way,
    • much smaller configuration footprint.

    Enjoy this new year release and let us know what you think!

  • v0.3.3(Sep 29, 2015)

    • tldextract is no longer a minimum required dependency,
    • the SQLAlchemy backend now persists headers, cookies and method; a _create_page method was also added to ease customization,
    • canonical solver code (needs documentation),
    • other fixes and improvements.
  • v0.3.2(Jun 19, 2015)

    Now it's possible to configure Frontera from Scrapy settings. The order of precedence for configuration sources is the following (a sketch follows the list):

    1. settings defined in the module pointed to by FRONTERA_SETTINGS (highest precedence),
    2. settings defined in the Scrapy settings,
    3. default frontier settings.
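    A hedged illustration of that precedence, using MAX_NEXT_REQUESTS (a setting mentioned elsewhere on this page); the module name and values are placeholders:

    # Scrapy settings.py -- (2) middle precedence
    FRONTERA_SETTINGS = 'myproject.frontera_settings'
    MAX_NEXT_REQUESTS = 128    # used only if the FRONTERA_SETTINGS module doesn't set it

    # myproject/frontera_settings.py -- (1) highest precedence
    MAX_NEXT_REQUESTS = 256    # overrides the value from the Scrapy settings

    # (3) anything set in neither place falls back to Frontera's built-in defaults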
  • v0.3.1(May 25, 2015)

    The main issue solved in this version is that request callbacks and request.meta contents are now successfully serialized and deserialized in the SQLAlchemy-based backend. Therefore, the majority of Scrapy extensions shouldn't suffer from losing meta or callbacks when passing through Frontera anymore. Second, there is a hotfix for the cold start problem, where seeds are added and Scrapy quickly finishes with no further activity. A well-thought-out solution for this will be offered later.

  • v0.3.0(Apr 15, 2015)

    • Frontera is the new name for Crawl Frontier.
    • The signature of the get_next_requests method has changed; it now accepts arbitrary keyword arguments.
    • Overused buffer (subject to removal in the future in favor of the downloader's internal queue).
    • Backend internals became more customizable.
    • The scheduler now requests new requests when there is free space in the Scrapy downloader queue, instead of waiting for it to be completely empty.
    • Several Frontera middlewares are disabled by default.
  • v0.2.0(Jan 12, 2015)

    • Added documentation (Scrapy Seed Loaders+Tests+Examples)
    • Refactored backend tests
    • Added requests library example
    • Added requests library manager and object converters
    • Added FrontierManagerWrapper
    • Added frontier object converters
    • Fixed script examples for new changes
    • Optional Color logging (only if available)
    • Changed Scrapy frontier and recorder integration to scheduler+middlewares
    • Changed default frontier backend
    • Added comment support to seeds
    • Added doc requirements for RTD build
    • Removed optional dependencies for setup.py and requirements
    • Changed tests to pytest
    • Updated docstrings and documentation
    • Changed frontier components (Backend and Middleware) to abc
    • Modified Scrapy frontier example to use seed loaders
    • Refactored Scrapy Seed loaders
    • Added new fields to Request and Response frontier objects
    • Added ScrapyFrontierManager (Scrapy wrapper for Frontier Manager)
    • Changed frontier core objects (Page/Link to Request/Response)
Owner: Scrapinghub (Turn web content into useful data)
Genshin Impact crawler: scrapes artifact information from the Genshin Impact interface

Semi-automatic Genshin Impact artifact crawler. Description: scrapes artifact data directly from the Genshin Impact interface; currently only the inventory page is supported. Accuracy: 97.5% (standard general-purpose API, 39 of 40 random artifacts recognised completely correctly); 100% (4K screen, standard general-purpose API, 110 of 110 artifacts recognised completely correctly). Small errors may still remain.

hwa 28 Oct 10, 2022
A tool for scraping and organizing data from NewsBank API searches

nbscraper Overview This simple tool automates the process of copying, pasting, and organizing data from NewsBank API searches. Currently, nbscrape onl

0 Jun 17, 2021
Amazon scraper using scrapy, a python framework for crawling websites.

#Amazon-web-scraper This is a python program, which use scrapy python framework to crawl all pages of the product and scrap products data. This progra

Akash Das 1 Dec 26, 2021
Scrapes the Sun Life of Canada Philippines web site for historical prices of their investment funds and then saves them as CSV files.

slocpi-scraper Sun Life of Canada Philippines Inc Investment Funds Scraper Install dependencies pip install -r requirements.txt Usage General format:

Daryl Yu 2 Jan 07, 2022
Meme-videos - Scrapes memes and turn them into a video compilations

Meme Videos Scrapes memes from reddit using praw and request and then converts t

Partho 12 Oct 28, 2022
Web scraper build using python.

Web Scraper This project is made in Python. It takes some info from a website list and then adds it into a data.json file. The dependencies used are: reques

Shashwat Harsh 2 Jul 22, 2022
Scrapy uses Request and Response objects for crawling web sites.

Requests and Responses¶ Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in the spiders and p

Md Rashidul Islam 1 Nov 03, 2021
Web Scraping Instagram photos with Selenium by only using a hashtag.

Web-Scraping-Instagram This project is used to automatically obtain images by web scraping Instagram with Selenium in Python. The required input will

Sandro Agama 3 Nov 24, 2022
Video Games Web Scraper is a project that crawls websites and APIs and extracts video game related data from their pages.

Video Games Web Scraper Video Games Web Scraper is a project that crawls websites and APIs and extracts video game related data from their pages. This

Albert Marrero 1 Jan 12, 2022
Quick Project made to help scrape Lexile and Atos(AR) levels from ISBN

Lexile-Atos-Scraper Quick Project made to help scrape Lexile and Atos(AR) levels from ISBN You will need to install the chrome webdriver if you have n

1 Feb 11, 2022
Web scraped S&P 500 Data from Wikipedia using Pandas and performed Exploratory Data Analysis on the data.

Web scraped S&P 500 Data from Wikipedia using Pandas and performed Exploratory Data Analysis on the data. Then used Yahoo Finance to get the related stock data and displayed them in the form of chart

Samrat Mitra 3 Sep 09, 2022
Pythonic Crawling / Scraping Framework based on Non Blocking I/O operations.

Pythonic Crawling / Scraping Framework Built on Eventlet Features High Speed WebCrawler built on Eventlet. Supports relational databases engines like

Juan Manuel Garcia 173 Dec 05, 2022
🐞 Douban Movie / Douban Book Scarpy

Python3-based Douban Movie/Douban Book Scarpy crawler for cover downloading + data crawling + review entry.

Xingbo Jia 1 Dec 03, 2022
iQIYI VIP, Tencent Video, Bilibili, Baidu: various automated check-ins

My-Actions: a personal collection of assorted check-in scripts adapted for GitHub Actions. Don't fork it, just ⭐️ star it. Usage: create a new repository and sync the code, go to Settings - Secrets and click the green button (if there is no green button it is already activated; go straight to the next step), then add a new secret and set Secr

280 Dec 30, 2022
Works very well and you can ask for the type of image you want the scrapper to collect.

Works very well and you can ask for the type of image you want the scrapper to collect. Also follows a specific urls path depending on keyword selection.

Memo Sim 1 Feb 17, 2022
A multithreaded tool for searching and downloading images from popular search engines. It is straightforward to set up and run!

🕳️ CygnusX1 Code by Trong-Dat Ngo. Overviews 🕳️ CygnusX1 is a multithreaded tool 🛠️ , used to search and download images from popular search engine

DatNgo 32 Dec 31, 2022
A module for CME that spiders hashes across the domain with a given hash.

hash_spider A module for CME that spiders hashes across the domain with a given hash. Installation Simply copy hash_spider.py to your CME module folde

37 Sep 08, 2022
A repository with scraping code and soccer dataset from understat.com.

UNDERSTAT - SHOTS DATASET As many people interested in soccer analytics know, Understat is an amazing source of information. They provide Expected Goa

douglasbc 48 Jan 03, 2023
Tool to scan for secret files on HTTP servers

snallygaster Finds file leaks and other security problems on HTTP servers. what? snallygaster is a tool that looks for files accessible on web servers

Hanno Böck 2k Dec 28, 2022
Snowflake database loading utility with Scrapy integration

Snowflake Stage Exporter Snowflake database loading utility with Scrapy integration. Meant for streaming ingestion of JSON serializable objects into S

Oleg T. 0 Dec 06, 2021