An easy way to build PyTorch datasets. Modularly build datasets and automatically cache processed results

Overview

EasyDatas

An easy way to build PyTorch datasets. Modularly build datasets and automatically cache processed results

Installation

pip install git+https://github.com/SymenYang/EasyDatas

Usage

Find files in disk

from EasyDatas.Prefab import ListFile, RecursionFiles, SpanFiles
from EasyDatas.Prefab import Chain

# Type 1: Find files recursively
# Example:
RFiles = RecursionFiles({
    "path" : path_to_root,
    "pattern" : ".*\.npy",
    "files" : True, # default to be true
    "dirs" : False # default to be true
})
RFiles.resolve()
print(len(RFiles)) # Total num of npy files in path_to_root
print(RFiles[0]) # {"path" : "/xxxx/xxxx/xxxx.npy"(pathlib.Path object)}

# Or Type 2: Hierarchically find files
HFiles = Chain([
    ListFile({
        "path" : path_to_root,
        "pattern" : ".*",
        "files" : False, # default to be true
    }),
    SpanFiles({
        "pattern" : ".*\.npy"
        "dirs" : False # default to be true
    })
])
HFiles.resolve()
print(len(HFiles)) # Total num of npy in files in path_to_root's depth-one sub-dir
print(HFiles[0]) # {"path" : "path_to_root/xxxx/xxxx.npy"(pathlib.Path object)}

ListFile, RecursionFiles, SpanFiles will output files/dirs in the dictionary order

Load files to memory

from EasyDatas.Prefab import LoadData, NumpyLoad,NumpyLoadNPY
#Type 1: use numpy.load to load a npy format file
LoadChain = Chain([
    RFiles, # defined in the previous section. Or any other EasyDatas module providing path
    NumpyLoadNPY({
        "data_name" : "data" # default to be "data"
    })
])
LoadChain.resolve()
print(len(loadChain)) # The same with RFiles
print(LoadChain[0]) # {"data" : np.ndarray}

# Type 2: write your own codes to load
import numpy as np
LoadChainCustom = Chain([
    HFiles,
    LoadData({
        "data_name" : "custom_data" # default to be "data"
        },
        function = lambda x : np.loadtxt(str(x))
    )
])
LoadChainCustom.resolve()
print(len(LoadChainCustom)) # The same with HFiles
print(LoadChainCustom[0]) # {"custom_data" : np.ndarray}

# The custom LoadData could be replaced by NumpyLoad module.

Preprocessing

from EasyDatas.Prefab import Picker, ToTensor
from EasyDatas.Core import Transform, CachedTransform

class customTransform1(CachedTransform): 
    # Cached Transform will process all datas and cache the results in disk.
    def custom_init(self):
        self.times = self.get_attr("times", 2) # default value is 2

    def deal_a_data(self, data : dict):
        data["data"] = data["data"] * self.times
        return data


class customTransform2(Transform): 
    # Non-cached transform will process a data when it is been needed.
    def deal_a_data(self, data : dict):
        data["data"] = data["data"] + 1
        return data


TrainDataset = Chain([
    LoadChain,
    Picker(
        pick_func = lambda data,idx,total_num : idx <= 0.8 * total_num
    ),
    customTransform1({
        "times" : 3
    }),
    customTransform1(),
    customTransform2(),
    ToTensor()
])
TrainDataset.resolve()
print(len(TrainDataset)) # 0.8 * len(LoadChain)
print(TrainDataset[0]) # {"data" : torch.Tensor with (raw data * 3 * 2 + 1) }

# Or we can write all of them in one chain and only resolve once
TrainDataset = Chain([
    RecursionFiles({
        "path" : path_to_root,
        "pattern" : ".*\.npy",
        "dirs" : False # default to be true
    }),
    NumpyLoadNPY({
        "data_name" : "data" # default to be "data"
    }),
    Picker(
        pick_func = lambda data,idx,total_num : idx <= 0.8 * total_num
    ),
    customTransform1({
        "times" : 3
    }),
    customTransform1(),
    customTransform2(),
    ToTensor()
])
TrainDataset.resolve()
print(len(TrainDataset)) # 0.8 * len(LoadChain)
print(TrainDataset[0]) # {"data" : torch.Tensor with (raw data * 3 * 2 + 1) }

All EasyDatas modules are the child of torch.utils.data.Dataset. Thus we can directly put them into a dataloader

About caches

An EasyDatas module will store caches only if the args["need_cache"] is True. The defualt setting is False. Cache will be save in the args["cache_root"] path, which is set to CWD in default. The cache name will contain two parts. The first is about the module's args when it was created, the second is about the module's previous modules cache name. All the information are encoded to a string and EasyDatas will use that string to determine whether there is a valid cache for this module instance. Therefore, if one module's args have been changed, all modules' cache after this module will be recomputed.

Custom cache name

One can override name_args(self) function to change the properties that need to be considerd into cache name. The default implementation is:

class EasyDatasBase
    ...
    def name_args(self):
            """
        Return args dict for getting cache file's name
        Default to return all hashable values in self.args except cache_root
        """
        ret = {}
        for key in self.args:
            if isinstance(self.args[key],collections.Hashable):
                if key == "cache_root":
                    continue
                ret[key] = self.args[key]
        return ret
    ...

Processing Datas

All EasyDatas module have two functions to deal datas. The first is deal_datas and the second is deal_a_data. In default, deal_datas will send all datas to deal_a_data one-by-one and collect the return value as the output of this module. In most situation, customizing deal_a_data is safe, clear and enough. But in some other situation, we want to deal all datas by our own, we could override deal_datas function. There are two useful functions in EasyDatasBase class that will be helpful in deal_datas, which are self.get()and self.put()

class EasyDatasBases:
    def get(self,idx = None,do_copy = True) -> dict|None:
        pass

    def put(self,data_dict : dict,idx = -1) -> None:
        pass

If idx is not provided, get will automaticaly get datas from previous module one-by-one. If it meets the end, it will return None. A module with no previous module could not use get function. If the do_copy is set to False, it will directly return previous module's data, which is a reference. Otherwise, it will deep copy the data and return.
put function will automaticaly put datas in to return and cache list. if idx is provided, the data_dict will be put in to the position. The total number of datas will be count automaticaly in put function.
Besides, in deal_a_data function, one can use put functions and return None for increasing the data items.

Other modules

There are some other modules that are not introduced beyond.

EasyDatas.Core.EasyDatasBase

Defined base functions, logging and default processing

EasyDatas.Core.RawDatas

Base class for ListFile, RecursionFiles. RawDatas needs no previous dataset and the deal_datas function needs to be overrided

EasyDatas.Core.Merge

Merge multiple EasyDatas modules by merge their data dict. The modules need to have the same length.

# assume A is an EasyDatas module with A[0] == {"data_1" : xxx}
# assume B is an EasyDatas module with B[0] == {"data_2" : xxx}
M = Merge([A,B])
print(len(M)) # The same with A and B
print(M[0]) # {"data_1" : xxx, "data_2" : xxx}

EasyDatas.Core.Stack

Stack multiple EasyDatas modules by combine their items.

# assume A is an EasyDatas module with A[0] == {"data_1" : xxx} and len(A) = 1000
# assume B is an EasyDatas module with B[0] == {"data_2" : xxx} and len(B) = 500
S = Stack([A,B])
print(len(S)) # 1500 which is len(A) + len(B)
print(S[999]) # {"data_1" : xxx}
print(S[1000]) # {"data_2" : xxx}

In most cases, Stack are used to stack modules which have same data format.

Owner
Ximing Yang
Fudan University
Ximing Yang
Does Oversizing Improve Prosumer Profitability in a Flexibility Market? - A Sensitivity Analysis using PV-battery System

Does Oversizing Improve Prosumer Profitability in a Flexibility Market? - A Sensitivity Analysis using PV-battery System The possibilities to involve

Babu Kumaran Nalini 0 Nov 19, 2021
StudioGAN is a Pytorch library providing implementations of representative Generative Adversarial Networks (GANs) for conditional/unconditional image generation.

StudioGAN is a Pytorch library providing implementations of representative Generative Adversarial Networks (GANs) for conditional/unconditional image generation.

3k Jan 08, 2023
Drone detection using YOLOv5

This drone detection system uses YOLOv5 which is a family of object detection architectures and we have trained the model on Drone Dataset. Overview I

Tushar Sarkar 27 Dec 20, 2022
Second-order Attention Network for Single Image Super-resolution (CVPR-2019)

Second-order Attention Network for Single Image Super-resolution (CVPR-2019) "Second-order Attention Network for Single Image Super-resolution" is pub

516 Dec 28, 2022
HybridNets: End-to-End Perception Network

HybridNets: End2End Perception Network HybridNets Network Architecture. HybridNets: End-to-End Perception Network by Dat Vu, Bao Ngo, Hung Phan 📧 FPT

Thanh Dat Vu 370 Dec 29, 2022
Virtual Dance Reality Stage: a feature that offers you to share a stage with another user virtually

Portrait Segmentation using Tensorflow This script removes the background from an input image. You can read more about segmentation here Setup The scr

291 Dec 24, 2022
Official Repo of my work for SREC Nandyal Machine Learning Bootcamp

About the Bootcamp A 3-day Machine Learning Bootcamp organised by Department of Electronics and Communication Engineering, Santhiram Engineering Colle

MS 1 Nov 29, 2021
Rayvens makes it possible for data scientists to access hundreds of data services within Ray with little effort.

Rayvens augments Ray with events. With Rayvens, Ray applications can subscribe to event streams, process and produce events. Rayvens leverages Apache

CodeFlare 32 Dec 25, 2022
LTR_CrossEncoder: Legal Text Retrieval Zalo AI Challenge 2021

LTR_CrossEncoder: Legal Text Retrieval Zalo AI Challenge 2021 We propose a cross encoder model (LTR_CrossEncoder) for information retrieval, re-retrie

Xuan Hieu Duong 7 Jan 12, 2022
Text to Image Generation with Semantic-Spatial Aware GAN

text2image This repository includes the implementation for Text to Image Generation with Semantic-Spatial Aware GAN This repo is not completely. Netwo

CVDDL 124 Dec 30, 2022
LyaNet: A Lyapunov Framework for Training Neural ODEs

LyaNet: A Lyapunov Framework for Training Neural ODEs Provide the model type--config-name to train and test models configured as those shown in the pa

Ivan Dario Jimenez Rodriguez 21 Nov 21, 2022
Code and data of the Fine-Grained R2R Dataset proposed in paper Sub-Instruction Aware Vision-and-Language Navigation

Fine-Grained R2R Code and data of the Fine-Grained R2R Dataset proposed in the EMNLP2020 paper Sub-Instruction Aware Vision-and-Language Navigation. C

YicongHong 34 Nov 15, 2022
Notes taking website build with Docker + Django + React.

Notes website. Try it in browser! / But how to run? Description. This is monorepository with notes website. Website provides web interface for creatin

Kirill Zhosul 2 Jul 27, 2022
A Neural Net Training Interface on TensorFlow, with focus on speed + flexibility

Tensorpack is a neural network training interface based on TensorFlow. Features: It's Yet Another TF high-level API, with speed, and flexibility built

Tensorpack 6.2k Jan 09, 2023
Official implementation of Rich Semantics Improve Few-Shot Learning (BMVC, 2021)

Rich Semantics Improve Few-Shot Learning Paper Link Abstract : Human learning benefits from multi-modal inputs that often appear as rich semantics (e.

Mohamed Afham 11 Jul 26, 2022
The repo for reproducing Seed-driven Document Ranking for Systematic Reviews: A Reproducibility Study

ECIR Reproducibility Paper: Seed-driven Document Ranking for Systematic Reviews: A Reproducibility Study This code corresponds to the reproducibility

ielab 3 Mar 31, 2022
Official PyTorch implementation for paper Context Matters: Graph-based Self-supervised Representation Learning for Medical Images

Context Matters: Graph-based Self-supervised Representation Learning for Medical Images Official PyTorch implementation for paper Context Matters: Gra

49 Nov 23, 2022
NeuroMorph: Unsupervised Shape Interpolation and Correspondence in One Go

NeuroMorph: Unsupervised Shape Interpolation and Correspondence in One Go This repository provides our implementation of the CVPR 2021 paper NeuroMorp

Meta Research 35 Dec 08, 2022
The Rich Get Richer: Disparate Impact of Semi-Supervised Learning

The Rich Get Richer: Disparate Impact of Semi-Supervised Learning Preprocess file of the dataset used in implicit sub-populations: (Demographic groups

<a href=[email protected]"> 4 Oct 14, 2022
Continuous Augmented Positional Embeddings (CAPE) implementation for PyTorch

PyTorch implementation of Continuous Augmented Positional Embeddings (CAPE), by Likhomanenko et al. Enhance your Transformer positional embeddings with easy-to-use augmentations!

Guillermo Cámbara 26 Dec 13, 2022