A Python library that enables ML teams to share, load, and transform data in a collaborative, flexible, and efficient way 🌰

Overview

Squirrel Core

Share, load, and transform data in a collaborative, flexible, and efficient way



What is Squirrel?

Squirrel is a Python library that enables ML teams to share, load, and transform data in a collaborative, flexible, and efficient way.

  1. SPEED: Avoid data stalls, i.e. your expensive GPUs will not sit idle while waiting for data.

  2. COSTS: Besides avoiding GPU stalls, Squirrel lets you shard & cluster your data and store & load it in bundles, decreasing the cost of your cloud storage bucket.

  3. FLEXIBILITY: Work with a flexible, standard data schema that adapts to any setting, including multimodal data.

  4. COLLABORATION: Make it easier to share data & code between teams and projects in a self-service model.

Stream data from anywhere into your machine learning model as easily as:

from squirrel.catalog import Catalog

# `augment` is any user-defined augmentation function
it = (Catalog.from_plugins()["imagenet"].get_driver()
      .get_iter("train")
      .map(lambda r: (augment(r["image"]), r["label"]))
      .batched(100))
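
The resulting iterable can be consumed like any Python iterable, for example in a training loop (a minimal sketch; train_step is a user-defined placeholder, not part of Squirrel):

for batch in it:       # each batch is a list of 100 (augmented_image, label) tuples
    train_step(batch)  # user-defined training step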

Check out our full getting started tutorial notebook. If you have any questions or would like to contribute, join our Slack community.

Installation

You can install squirrel-core via pip:

pip install "squirrel-core[all]"

Documentation

Read our documentation at ReadTheDocs

Example Notebooks

Check out the Squirrel-datasets repository for open-source and community-contributed tutorial and example notebooks on using Squirrel.

Contributing

Squirrel is open source and community contributions are welcome!

Check out the contribution guide to learn how to get involved.

The humans behind Squirrel

We are Merantix Momentum, a team of ~30 machine learning engineers developing machine learning solutions for industry and research. Each project comes with its own challenges, data types, and learnings, but one issue we always faced was scalable data loading, transformation, and sharing. We were looking for a solution that would let us load data in a fast and cost-efficient way, while keeping the flexibility to work with any possible dataset and integrate with any API. That's why we built Squirrel – and we hope you'll find it as useful as we do! By the way, we are hiring!

Citation

If you use Squirrel in your research, please cite it using:

@article{2022squirrelcore,
  title={Squirrel: A Python library that enables ML teams to share, load, and transform data in a collaborative, flexible, and efficient way.},
  author={Squirrel Developer Team},
  journal={GitHub. Note: https://github.com/merantix-momentum/squirrel-core},
  doi={10.5281/zenodo.6418280},
  year={2022}
}
Comments
  • Update Doc-String of MapDriver.get_iter

    Update Doc-String of MapDriver.get_iter

    • Better document the behavior of max_workers and link to official ThreadPoolExecutor documentation.
    • Update *_map doc-strings that use ThreadPoolExecutor and link to official ThreadPoolExecutor documentation.

    Fixes #60 issue

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [X] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [X] I have read the contributing guideline doc (external contributors only)
    • [ ] Lint and unit tests pass locally with my changes
    • [X] I have kept the PR small so that it can be easily reviewed
    • [X] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by kai-tub 14
  • Refactoring DataFrameDriver and related drivers

    Refactoring DataFrameDriver and related drivers

    Description

    Refactors the DataFrameDriver and all data frame-related drivers. In particular:

    • Fixes a current bug (I believe) that storage options are not passed down when reading using the pandas/dask interface. This affected the implementation of CsvDriver.
    • Refactors the DataFrameDriver base class to provide a common interface for all drivers that use some read functionality from pandas or dask. The base class now handles the precedence of storage options and read arguments for all derived classes.
    • Using the new abstraction, adds FeatherDriver, JsonDriver, ParquetDriver, and XlsDriver, and refactors CsvDriver.
    • This does break some datasets using the CsvDriver, as read_csv_kwargs is now renamed to a common read_kwargs. However, so far only two research datasets used this property; see the corresponding PR in squirrel-datasets. So far this is a bit of a rough sketch: I tested the existing CsvDriver-based datasets, but otherwise this requires a bit more cleanup, I suppose. A usage sketch of the new interface follows this list.
    • Renames the previous use_dask option to engine across all data frame drivers.
    • Change the default DataFrame engine to Pandas.
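
    A minimal sketch of how a driver might be constructed under the refactored interface described above; the import path, URL, and keyword names are assumptions based on this description, not the released API.

    from squirrel.driver import CsvDriver  # import path assumed

    driver = CsvDriver(
        "gs://my-bucket/table.csv",                # hypothetical location
        storage_options={"requester_pays": True},  # now passed down to the pandas/dask reader
        read_kwargs={"sep": ";"},                  # replaces the former read_csv_kwargs
        engine="pandas",                           # replaces use_dask; pandas is the new default
    )
    df = driver.get_df()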

    Type of change

    • [x] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [x] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [x] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [x] Lint and unit tests pass locally with my changes
    • [x] I have kept the PR small so that it can be easily reviewed
    • [x] I have made corresponding changes to the documentation
    • [x] I have added tests that prove my fix is effective or that my feature works
    • [x] All dependency changes have been reflected in the pip requirement files.
    opened by MaxSchambach 8
  • zip multiple iterables as a source

    zip multiple iterables as a source

    Description

    The use case: we have a store with samples, and 1..n other stores that each contain only features. These stores must have the same keys and same number of samples per shard.

    IterableZipSource makes it possible to zip items from several iterables and use that as a source. For instance:

    from squirrel.driver import MessagepackDriver      # import path assumed for illustration
    from squirrel.iterstream import IterableZipSource  # introduced in this PR; path assumed

    # url1 and url2 point at existing messagepack stores
    it1 = MessagepackDriver(url1).get_iter()
    it2 = MessagepackDriver(url2).get_iter()

    # zip corresponding items of both streams and collect them into a list
    it3 = IterableZipSource(iterables=[it1, it2]).collect()


    Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

    Fixes # issue

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [ ] Lint and unit tests pass locally with my changes
    • [ ] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by AlirezaSohofi 5
  • Add csv driver option to specify csv read args

    Add csv driver option to specify csv read args

    Description

    Adds a read_csv_kwargs argument to CsvDriver initialization, which is used in all read_csv calls in the class. Does not break backward compatibility, as get_df and get_iter still allow specifying kwargs for read_csv, which take precedence over the ones given at initialization.

    This makes the creation of new catalog entries based on the CsvDriver much easier, as dataset-specific read options (such as separator, dtypes, etc.) can be specified in the driver_kwargs.
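
    A minimal sketch of the pattern described above, assuming a hypothetical CSV dataset; the URL, kwargs, and the exact signature of the per-call override are illustrative assumptions.

    from squirrel.driver import CsvDriver  # import path assumed

    driver = CsvDriver(
        "gs://my-bucket/dataset.csv",                               # hypothetical URL
        read_csv_kwargs={"sep": ";", "dtype": {"label": "int64"}},  # applied to every read_csv call
    )
    df = driver.get_df()           # uses the kwargs given at initialization
    it = driver.get_iter(sep=",")  # per-call kwargs take precedence, per this PR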

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [x] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [x] I have read the contributing guideline doc (external contributors only)
    • [x] Lint and unit tests pass locally with my changes
    • [x] I have kept the PR small so that it can be easily reviewed
    • [x] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by MaxSchambach 4
  • Quantify randomness of shuffle in squirrel

    Quantify randomness of shuffle in squirrel

    Description

    Introduces a function to measure the randomness of a shuffle operation in the squirrel pipeline by implementing a simple example driver, randomly sampling from it, and comparing the distances of sampled trajectories.
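
    A minimal, library-agnostic sketch of the underlying idea (measuring how far items travel under a buffered, streaming shuffle); none of the names below come from this PR.

    import random
    from statistics import mean

    def mean_displacement(n: int, buffer_size: int, seed: int = 0) -> float:
        """Average distance each index moves under a buffered (streaming) shuffle."""
        rng = random.Random(seed)
        buffer, out = [], []
        for item in range(n):
            buffer.append(item)
            if len(buffer) >= buffer_size:
                out.append(buffer.pop(rng.randrange(len(buffer))))
        while buffer:
            out.append(buffer.pop(rng.randrange(len(buffer))))
        return mean(abs(pos - item) for pos, item in enumerate(out))

    # a larger shuffle buffer yields larger displacements, i.e. a "more random" shuffle
    print(mean_displacement(10_000, buffer_size=10))
    print(mean_displacement(10_000, buffer_size=1_000))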

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [x] Lint and unit tests pass locally with my changes
    • [x] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by winfried-ripken 4
  • Explain automatic version iteration

    Explain automatic version iteration

    Description

    Adds an explanation of the catalog's default version iteration behaviour, which was not clearly stated before.

    Fixes # issue

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [x] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [x] I have read the contributing guideline doc (external contributors only)
    • [ ] Lint and unit tests pass locally with my changes
    • [ ] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by AdemFr 3
  • Add storage options kwargs to FPGen

    Add storage options kwargs to FPGen

    Description

    FilePathGenerator does not expose storage options, which are used by fsspec when instantiating a filesystem. This can prove problematic when advanced options are needed for accessing the data, e.g. when the requester_pays argument is needed for accessing data in a Google Cloud Storage bucket. This change adds such kwargs to the constructor of the FilePathGenerator object, which are passed on to fsspec.
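
    A sketch of the usage this change enables; the import path, bucket URL, and the exact way kwargs are forwarded to fsspec are assumptions, not confirmed API.

    from squirrel.iterstream.source import FilePathGenerator  # import path assumed

    shard_urls = FilePathGenerator(
        "gs://some-requester-pays-bucket/shards",  # hypothetical bucket
        requester_pays=True,                       # forwarded to fsspec's GCSFileSystem
    ).collect()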

    Fixes # issue

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [x] Lint and unit tests pass locally with my changes
    • [x] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [x] All dependency changes have been reflected in the pip requirement files.
    opened by mg515 3
  • Warn when creating driver that points to an empty directory

    Warn when creating driver that points to an empty directory

    Description

    A warning is shown when we iterate over a driver that points to an empty or non-existent directory.

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [x] Lint and unit tests pass locally with my changes
    • [x] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [x] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by pzdkn 3
  • Nvidia DALI external source integration

    Nvidia DALI external source integration

    Description

    Motivation: Squirrel is a fast data loading framework, and Nvidia DALI is a fast, GPU-accelerated library for complex ML workloads such as run-time augmentations. The aim is to provide users with an intuitive interface to use Squirrel as a backend for Nvidia DALI.

    Context: For more context, check out the internal benchmarks below. Running the Squirrel pipeline without any augmentations yields approx. 33k samples/sec. If you use Squirrel as an external source with an affine image augmentation from DALI, you can reach approx. 28k samples/sec. This suggests that DALI can make full use of Squirrel's speed, as the data loading is barely slowed down by the run-time augmentations (33k vs 28k). DALI brings two things to the table: you can augment data in batches rather than one-by-one as is necessary with other frameworks, and you can do it on the GPU.

    [Screenshot: internal benchmark results]

    Code Design

    • DALI comes with a concept called "pipeline" (docs), that defines how data should be read and transformed by DALI.
    • We use the external_source data reader API in the DALI pipeline, which we can provide with a modified Squirrel Iterable, the squirrel.iterstream.DaliExternalSource.
    • As suggested by Nvidia DALI staff, I benchmarked loading the samples one-by-one and let DALI do the batching. It turned out that batching in Squirrel was much faster (18.2k sps vs 32.2 sps). This suggests that DALI profits from the async loading in Squirrel here.
    • As suggested by Nvidia DALI staff, I tried using parallel external source, which is multi-proc data loading by DALI. As stated in their docs, DALI prefers single samples (un-batched) here to let DALI handle the multi-proc logic of parallel data fetching. The problem is that DALI wants a Callable external source here; Iterables are not allowed for parallel fetching. While this is technically possible (e.g. fit your dataset in one shard and then access the items by their keys, i.e. shard names), indexability is not straightforward and not yet integrated in squirrel. Since DALI already nearly makes use of Squirrel's full performance, we don't see that DALI could speed things up here. But it's worth investigating once the feature is implemented in Squirrel.
    • There was no performance increase by returning cupy arrays on the GPU to the external_source reader. Numpy was slightly faster, so users are advised to return numpy arrays in their collation function.

    Usage Pattern

    • Users simply turn their iterable into an external source with the iterstream API:
    # imports shown for illustration; exact module paths are assumptions
    from typing import Tuple
    from nvidia.dali import fn, pipeline_def
    from nvidia.dali.data_node import DataNode
    from nvidia.dali.plugin.pytorch import DALIGenericIterator
    from squirrel.iterstream import DaliExternalSource  # added by this PR

    # define a dummy pipeline (squirrel_iterator, my_collation_fn, device and
    # BATCH_SIZE are user-defined)
    @pipeline_def
    def pipeline(it: DaliExternalSource, device: str) -> Tuple[DataNode]:
        img, label = fn.external_source(source=it, num_outputs=2, device=device)
        enhanced = fn.brightness_contrast(img, contrast=2)  # do other augmentations here
        return enhanced, label

    it = squirrel_iterator.to_dali_external_source(BATCH_SIZE, my_collation_fn)
    pipe = pipeline(it, device, batch_size=BATCH_SIZE)
    pipe.build()

    loader = DALIGenericIterator([pipe], ["img", "label"])
    for item in loader:
        ...  # train on each batch here


    Things to Discuss

    1. I tried turning the iterstream into a DALIGenericIterator directly and abstracting the above code away, but in my mind that does not make a lot of sense, as DALI users are used to the above API and we are really just an external source. The user will need to define their custom pipeline anyway for their use case, so I don't see a big benefit of abstracting this code away into a squirrel functionality - possibly adding some assumptions here and there and thereby limiting the original functionality of DALI (wdyt @AlirezaSohofi ?).
    2. We would need to find out if the self.i and self.n parameters need to be set for the external source as indicated here. For now, it seems to work out of the box, but maybe for more complex use-cases these variables are needed for DALI to keep track of the loaded samples. Sidenote: Currently DaliExternalSource could also simply be replaced with squirrel_iterable.batched(bs, fn), but I assume that self.i and self.n are needed somehow (input from NVIDIA needed here), so it's useful to have DaliExternalSource where we can add more features.
    3. Please check out test_to_dali_external_source_gpu_multi_epoch. After iterating over Squirrel's generator once, the iterable is empty. Hence, after each epoch we need to create a new DALIGenericIterator. Afaik this is also how e.g. Pytorch Lightning handles it. Let me know if that sounds ok, or if we need to loop over the data.
    4. Tests & Requirements: Note that I added pytests for the code, but did not update the requirements accordingly, because the CI currently doesn't run GPU tasks. Moreover, we won't ask users to install DALI for now (also, there are many different versions for different cuda drivers), so we assume people will prefer installing themselves. The DaliExternalSource doesn't depend on any DALI code, so the DALI install is technically not required.

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [ ] Lint and unit tests pass locally with my changes
    • [ ] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by axkoenig 3
  • Store the processing steps in a stream

    Store the processing steps in a stream

    Description

    Store more information in Composables:

    • Which Squirrel version is used
    • Git info e.g. commit-hash, remote repository
    • Log processing steps when chaining Composables

    This aims to provide the user with more information about the stream. When a Composable stores sensitive information, e.g. the url in FilePathGenerator, this should not be logged.

    Fixes # issue

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [x] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [x] Lint and unit tests pass locally with my changes
    • [x] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [x] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by pzdkn 3
  • [FEATURE] Make `get_iter` method documentation about `max_workers` more explicit

    [FEATURE] Make `get_iter` method documentation about `max_workers` more explicit

    Hey, I've stumbled across a potentially easy-to-misunderstand part of the MapDriver.get_iter documentation:

    https://github.com/merantix-momentum/squirrel-core/blob/8e2942313c7d7dd974b1ca2f2308895f660d3d26/squirrel/driver/driver.py#L68-L155

    The documentation of max_workers states that None will be used by default and also mentions that this will cause async_map to be called, but I missed these parts of the documentation and was surprised to see that so many threads were allocated.

    I am/was not too familiar with the ThreadPoolExecutor interface and find it somewhat surprising that None equals number_of_processors x 5 according to the ThreadPoolExecutor definition. Maybe it would be helpful to explicitly state that by default a ThreadPoolExecutor with that many threads will be used? The documentation string reads a bit unintuitively, as it starts out saying that max_workers defines how many items are fetched simultaneously and then continues to state that otherwise map is used. From that perspective, max_workers=None doesn't sound like it should be using any threads at all. Without knowing the default values of ThreadPoolExecutor, I would make it more explicit that to disable threading one has to set max_workers=0/1 and that by default many threads are used.
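
    A quick way to see what max_workers=None actually means on a given interpreter (purely illustrative and unrelated to the squirrel API itself):

    from concurrent.futures import ThreadPoolExecutor

    with ThreadPoolExecutor(max_workers=None) as executor:
        # older CPython versions default to os.cpu_count() * 5, newer ones to
        # min(32, os.cpu_count() + 4); either way, None does not mean "no threads"
        print(executor._max_workers)  # private attribute, used here only for inspection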

    I am happy to add a PR with my suggested doc-string update if you agree! :)

    enhancement 
    opened by kai-tub 3
  • Interaction Nvidia DALI and Squirrel

    Interaction Nvidia DALI and Squirrel

    Description

    Describes in detail how Squirrel and DALI can work together. Also includes benchmarks on how to best utilize DALI and how it compares to transforms in Torchvision.

    Attaching PDF rendered version of the Sphinx documentation here. Unfortunately, I couldn't get syntax highlighting to work.

    The apparent next steps are figuring out how Squirrel and DALI can work together in multi-processing. It is not obvious how we could implement this, or whether it would provide a performance boost. Using a DALI parallel external source would probably be the way to go, but DALI expects a callable here that fetches individual images given a specific image index. This could be implemented easily if we set shard-size=1, but our initial experiments showed that larger shard sizes are more desirable.

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [x] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [x] I have read the contributing guideline doc (external contributors only)
    • [x] Lint and unit tests pass locally with my changes
    • [x] I have kept the PR small so that it can be easily reviewed
    • [x] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by axkoenig 0
  • Bugfix deserializer kwargs

    Bugfix deserializer kwargs

    Description

    Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

    Fixes # issue

    Type of change

    • [x] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [ ] Lint and unit tests pass locally with my changes
    • [x] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by mg515 0
  • PoC to cache data

    PoC to cache data

    driver = MessagepackDriver(url=url, cache_url=another_url)
    

    Description

    Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

    Fixes # issue

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [ ] Lint and unit tests pass locally with my changes
    • [ ] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by AlirezaSohofi 0
  • Safety checks for store and driver using FilePathGenerator

    Safety checks for store and driver using FilePathGenerator

    Description

    For both store and driver we need to assess whether a URL points to an empty directory or nested empty directories.

    • For drivers, warning about empty directories alerts the user early on that the url might be invalid
    • For stores, we want to only overwrite an existing non-empty directory when it is explicitly allowed

    In both cases, checking whether the directories (including nested directories) are empty is done through the FilePathGenerator; a library-agnostic sketch of such a check follows.
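
    A library-agnostic sketch of such an emptiness check using fsspec directly; this is not the PR's actual implementation, which routes the check through the FilePathGenerator.

    import fsspec

    def is_empty_or_missing(url: str) -> bool:
        """Return True if the URL does not exist or contains no files (recursively)."""
        fs, path = fsspec.core.url_to_fs(url)
        return not fs.exists(path) or len(fs.find(path)) == 0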

    Type of change

    • [x] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [ ] I have read the contributing guideline doc (external contributors only)
    • [ ] Lint and unit tests pass locally with my changes
    • [ ] I have kept the PR small so that it can be easily reviewed
    • [ ] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [ ] All dependency changes have been reflected in the pip requirement files.
    opened by pzdkn 2
  • [DRAFT] Support for different SquirrelStore compression modes

    [DRAFT] Support for different SquirrelStore compression modes

    Description

    See #59

    Fixes #59 issue

    Type of change

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [X] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [X] Documentation update
    • [ ] Refactoring including code style reformatting
    • [ ] Other (please describe):

    Checklist:

    • [X] I have read the contributing guideline doc (external contributors only)
    • [ ] Lint and unit tests pass locally with my changes
    • [X] I have kept the PR small so that it can be easily reviewed
    • [X] I have made corresponding changes to the documentation
    • [ ] I have added tests that prove my fix is effective or that my feature works
    • [X] All dependency changes have been reflected in the pip requirement files.

    Draft State!

    This is a draft PR to make it easier to discuss the different pros and cons of various solutions. This is not in a final state.

    I tried to add some tests and verify that they pass locally, but the tests spam a lot of "ValueError: Bucket is requester pays. Set requester_pays=True when creating the GCSFileSystem." and it is hard to tell where these tests/errors are coming from. The contributing guideline provides no further information on how to run the tests.

    opened by kai-tub 9
  • [FEATURE] Allow configuring compression mode in MessagepackSerializer

    [FEATURE] Allow configuring compression mode in MessagepackSerializer

    Hey,

    Thank you for working on this library! I think it has a huge potential, especially for dataset creators to provide their dataset in an optimized deep-learning format that is well suited for distribution. The performance of the MessagepackSerializer is amazing, and being able to distribute subsets of the dataset (shards) is something I never knew I wanted but really want to utilize in the future!

    I have played around with some "MessagepackSerializer" configurations and according to some internal benchmarks, it would be helpful to allow the user to configure the compression algorithm.

    https://github.com/merantix-momentum/squirrel-core/blob/0ff6be7368a258d02e7c39d63ed2993d8a5af322/squirrel/serialization/msgpack.py#L28-L48

    Currently, the compression mode is "locked" to gzip. I assume the main reason is due to the wide usage of gzip and to keep the code 'simple' as it makes it easy in the deserializer to know that the gzip compression was used:

    https://github.com/merantix-momentum/squirrel-core/blob/0ff6be7368a258d02e7c39d63ed2993d8a5af322/squirrel/serialization/msgpack.py#L58-L81

    Here I would like to note that given the extension, fsspec (default) could also infer the compression by inspecting the filename suffix. But I can see how this might cause problems if somebody would like to switch out fsspec with something else (although I would have no idea with what and why :D )

    Other spots within the codebase that are coupled to this compression assumption are the methods from the SquirrelStore:

    https://github.com/merantix-momentum/squirrel-core/blob/0ff6be7368a258d02e7c39d63ed2993d8a5af322/squirrel/store/squirrel_store.py#L12-L67

    Or to show the significant parts:

    • get: https://github.com/merantix-momentum/squirrel-core/blob/0ff6be7368a258d02e7c39d63ed2993d8a5af322/squirrel/store/squirrel_store.py#L40-L41

    • set: https://github.com/merantix-momentum/squirrel-core/blob/0ff6be7368a258d02e7c39d63ed2993d8a5af322/squirrel/store/squirrel_store.py#L59-L60

    • keys: https://github.com/merantix-momentum/squirrel-core/blob/0ff6be7368a258d02e7c39d63ed2993d8a5af322/squirrel/store/squirrel_store.py#L66-L67

    In my internal benchmarks, I was able to greatly speed up the data loading by simply using no compression at all (None), although I am fully aware that the correct compression mode heavily depends on the specific hardware/use case. But even in a network-limited domain, I can see reasons to prefer xz instead, due to its better compression ratio and relatively similar decompression speed compared to gzip.

    IMHO, I think it should be ok to not store any suffix at all for the squirrel store. If I/a user looks inside of the squirrel store URL I think it is not mandatory to show what compression algorithm was used. The user could/should use the designated driver/metadata that comes bundled with the dataset and let the driver handle the correct decompression.

    If you don't agree I still think the gz extension doesn't have to be 'hardcoded' into these functions. This is actually something that confused me when I was looking at the internals of the code base. So instead, we could use something like:

    # just to show the concept: look up the file extension from the configured compression
    comp_to_ext = {"gzip": ".gz", "bz2": ".bz2", "xz": ".xz", None: ""}
    comp = kwargs.get("compression", "gzip")
    ext = comp_to_ext[comp]
    

    With these modifications, it should be possible to utilize different compression modes and make them easily configurable. I would be very happy to create a PR and contribute to this project!

    enhancement 
    opened by kai-tub 3
Releases(v0.18.0)
  • v0.18.0(Nov 10, 2022)

    What's Changed

    • zip_index method for Composable by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/92
    • Quantify randomness of shuffle in squirrel by @winfried-ripken in https://github.com/merantix-momentum/squirrel-core/pull/86
    • Change Catalog repr to sorted set by @MaxSchambach in https://github.com/merantix-momentum/squirrel-core/pull/94
    • Installation instruction by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/96
    • Upgrade requirements by @winfried-ripken in https://github.com/merantix-momentum/squirrel-core/pull/97
    • Reference Huggingface, Hub and Torchvision Drivers by @axkoenig in https://github.com/merantix-momentum/squirrel-core/pull/99
    • Update requirements by @MaxSchambach in https://github.com/merantix-momentum/squirrel-core/pull/101
    • Refactoring DataFrameDriver and related drivers by @MaxSchambach in https://github.com/merantix-momentum/squirrel-core/pull/98

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.17.7...v0.18.0

  • v0.17.7(Oct 7, 2022)

    What's Changed

    • Add hooks to check backwards compatibility with py3.6+ by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/87
    • Add pyupgrade, yaml formatting and update all hooks by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/88
    • Fix file driver storage options by @axkoenig in https://github.com/merantix-momentum/squirrel-core/pull/85
    • Peng add kwargs to map by @pzdkn in https://github.com/merantix-momentum/squirrel-core/pull/90
    • Add hooks to csv driver by @winfried-ripken in https://github.com/merantix-momentum/squirrel-core/pull/91
    • Explain automatic version iteration by @AdemFr in https://github.com/merantix-momentum/squirrel-core/pull/84
    • Add csv driver option to specify csv read args by @MaxSchambach in https://github.com/merantix-momentum/squirrel-core/pull/93

    New Contributors

    • @MaxSchambach made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/93

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.17.4...v0.17.7

  • v0.17.4(Aug 31, 2022)

    What's Changed

    • Make this repo installable with all python versions by @winfried-loetzsch in https://github.com/merantix-momentum/squirrel-core/pull/82
    • Fix storage options by @axkoenig in https://github.com/merantix-momentum/squirrel-core/pull/83

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.17.2...v0.17.4

  • v0.17.2(Aug 25, 2022)

    What's Changed

    • Make CatalogSource visible in the API by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/71
    • Minor tweaks in documentation by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/73
    • Introduce rst linting via precommit hook by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/74
    • Remove binary file in tests dir by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/75
    • Unifies folder-creation behaviour when instantiation SquirrelStore by @pzdkn in https://github.com/merantix-momentum/squirrel-core/pull/72
    • Bugfix - Register Torch Composables by @axkoenig in https://github.com/merantix-momentum/squirrel-core/pull/78
    • Upgrade infra to py3.9 by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/79
    • Add storage options kwargs to FPGen by @mg515 in https://github.com/merantix-momentum/squirrel-core/pull/81

    New Contributors

    • @axkoenig made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/78
    • @mg515 made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/81

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.16.0...v0.17.2

  • v0.16.0(Jul 26, 2022)

    What's Changed

    • introduce loop and fixed size iterable by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/47
    • Move cla assistant to workflows by @winfried-loetzsch in https://github.com/merantix-momentum/squirrel-core/pull/62
    • *add tutorials, *ignore test in api-ref, *remove unused execption by @pzdkn in https://github.com/merantix-momentum/squirrel-core/pull/63
    • First draft of advanced section for iterstreams by @pzdkn in https://github.com/merantix-momentum/squirrel-core/pull/55
    • Update Doc-String of MapDriver.get_iter by @kai-tub in https://github.com/merantix-momentum/squirrel-core/pull/61
    • Composable.compose gets source as kwarg, which is equal to self by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/66
    • Peng add pytorch convenience functions to composable by @pzdkn in https://github.com/merantix-momentum/squirrel-core/pull/69
    • partial function for keys method by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/70

    New Contributors

    • @kai-tub made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/61

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/0.14.2...v0.16.0

  • 0.14.2(Jun 23, 2022)

    What's Changed

    • change squirrel test using a tmp public bucket by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/46
    • Update fs.open mode for catalog by @AdemFr in https://github.com/merantix-momentum/squirrel-core/pull/48
    • CatalogKey can be used to index catalog by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/49
    • accept callable as source for composable to make it completly lazy by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/44
    • add sphinxcontrib-mermaid by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/51
    • Architecture overview by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/54
    • *add advanced store *reorganize sections *add icon,favicon by @pzdkn in https://github.com/merantix-momentum/squirrel-core/pull/53
    • Create codeql-analysis.yml by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/52
    • Upgrade numpy & numba by @winfried-loetzsch in https://github.com/merantix-momentum/squirrel-core/pull/57
    • Winnie bump pyjwt by @winfried-loetzsch in https://github.com/merantix-momentum/squirrel-core/pull/58

    New Contributors

    • @AdemFr made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/48
    • @pzdkn made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/53
    • @winfried-loetzsch made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/57

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.13.2...0.14.2

  • v0.13.2(May 18, 2022)

    What's Changed

    • Fix SourceCombiner.get_iter() not interleaving correctly by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/45

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.13.1...v0.13.2

  • v0.13.1(May 18, 2022)

    What's Changed

    • Add community files by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/38
    • Minor requirement changes by @AlpAribal in https://github.com/merantix-momentum/squirrel-core/pull/40
    • messagepack unpacker set use_list argument to False by default by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/39

    New Contributors

    • @AlpAribal made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/40

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.12.3...v0.13.1

  • v0.12.3(Apr 11, 2022)

    What's Changed

    • Update README.md by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/31
    • pin numpy and update PR template by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/34
    • Update README.md by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/33
    • update document links by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/36
    • update version to 0.12.3 by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/37

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.12.2...v0.12.3

  • v0.12.2(Apr 6, 2022)

    What's Changed

    • update img to github raw file so public pypi can load it by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/26
    • Tiansu add readthedocs.yml by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/27
    • add dependencies for readthedoc by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/28
    • fix readthedoc by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/29
    • update readthedocs links by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/30
    • Tiansu move leftover commits by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/32

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.12.1...v0.12.2

  • v0.12.1(Apr 5, 2022)

    What's Changed

    • update docs link by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/12
    • add logo by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/13
    • remove old extra file by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/14
    • add back keyring until public release by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/16
    • key_hook param of get_iter accepts SplitByRank and SplitByWorker, par… by @AlirezaSohofi in https://github.com/merantix-momentum/squirrel-core/pull/15
    • fix install instruction by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/18
    • Update README.md by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/19
    • Update README.md by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/20
    • Update README.md by @ThomasWollmann in https://github.com/merantix-momentum/squirrel-core/pull/21
    • Tiansu update black by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/22
    • add CLA bot by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/23
    • switch to publish in public pypi by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/24
    • update version to 0.12.1 by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/25

    New Contributors

    • @ThomasWollmann made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/13
    • @AlirezaSohofi made their first contribution in https://github.com/merantix-momentum/squirrel-core/pull/15

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/compare/v0.12.0...v0.12.1

  • v0.12.0(Mar 12, 2022)

    What's Changed

    • add basic files to get infrastructure running by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/3
    • new semantic versioning format for dev release by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/4
    • tiansu copy squirrel codebase by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/5
    • Tiansu add docs by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/9
    • add pypi classifiers by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/10
    • change version norm by @TiansuYu in https://github.com/merantix-momentum/squirrel-core/pull/11

    Full Changelog: https://github.com/merantix-momentum/squirrel-core/commits/v0.12.0
