
Mmdetection dataloader construction

2022-06-10 22:41:00 Wu lele~


Preface

  This article explains how mmdetection builds its dataloader. The dataloader mainly controls iterative reading of the dataset, so a dataset class must already be implemented. For the dataset class implementation, see the companion post on mmdetection dataset class construction.


1. Overall process

  In PyTorch, constructing a DataLoader instance requires the following important parameters (excerpted from the DataLoader source code):

    Arguments:
        dataset (Dataset): dataset from which to load the data.
        batch_size (int, optional): how many samples per batch to load
            (default: ``1``).
        shuffle (bool, optional): set to ``True`` to have the data reshuffled
            at every epoch (default: ``False``).
        sampler (Sampler or Iterable, optional): defines the strategy to draw
            samples from the dataset. Can be any ``Iterable`` with ``__len__``
            implemented. If specified, :attr:`shuffle` must not be specified.
        batch_sampler (Sampler or Iterable, optional): like :attr:`sampler`, but
            returns a batch of indices at a time. Mutually exclusive with
            :attr:`batch_size`, :attr:`shuffle`, :attr:`sampler`,
            and :attr:`drop_last`.
        num_workers (int, optional): how many subprocesses to use for data
            loading. ``0`` means that the data will be loaded in the main process.
            (default: ``0``)
        collate_fn (callable, optional): merges a list of samples to form a
            mini-batch of Tensor(s).  Used when using batched loading from a
            map-style dataset.

  Briefly, each parameter means:

dataset: an instance of a class that inherits from Dataset;
batch_size: the batch size;
shuffle: if True, the data is reshuffled at the start of every epoch;
sampler: an iterable that yields dataset indices (shuffled or in order);
batch_sampler: iterates over the indices produced by sampler and, batch_size at a time, groups them into batches used to index the dataset;
collate_fn: merges the samples of a batch into one list/tensor, padding images to a common width and height.

  If these definitions still feel vague, just remember that dataset, sampler, batch_sampler, and dataloader are all iterables: you can simply loop over them with for ... in ..., as the small demonstration below shows.
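  A minimal sketch in plain PyTorch (not mmdetection code; TinyDataset is a made-up stand-in) showing that all four objects can be iterated:

import torch
from torch.utils.data import BatchSampler, DataLoader, Dataset, RandomSampler

class TinyDataset(Dataset):
    """Made-up 4-sample dataset; __getitem__ returns a 1-element tensor."""
    def __getitem__(self, idx):
        return torch.tensor([float(idx)])
    def __len__(self):
        return 4

ds = TinyDataset()
sampler = RandomSampler(ds)                                   # yields shuffled indices
batch_sampler = BatchSampler(sampler, batch_size=2, drop_last=False)
loader = DataLoader(ds, batch_sampler=batch_sampler)          # default collate stacks tensors

print(list(sampler))        # e.g. [2, 0, 3, 1]
print(list(batch_sampler))  # e.g. [[1, 3], [0, 2]] (a fresh shuffle per iteration)
for batch in loader:
    print(batch)            # one [2, 1] tensor per batch
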
  Now that the main DataLoader parameters are clear, let's look at how mmdetection's build_dataloader works. I will explain it in two parts:
 (1) How a dataloader object is instantiated. mmdetection mainly fills in the four parameters above: GroupSampler inherits from torch's Sampler class, shuffle is usually True, and for the batch_sampler parameter mmdetection uses the BatchSampler class already implemented in PyTorch (see the sketch after this list).

 (2) The flow of reading one batch of data.
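  Here is a simplified sketch of the non-distributed branch of mmdetection's build_dataloader, written from memory, so treat the argument handling as approximate rather than the exact source:

from functools import partial
from mmcv.parallel import collate
from torch.utils.data import DataLoader

def build_dataloader_sketch(dataset, samples_per_gpu, workers_per_gpu,
                            num_gpus=1, shuffle=True):
    # GroupSampler is the mmdetection sampler shown in section 2.1 below;
    # it shuffles indices while keeping images of similar aspect ratio together
    sampler = GroupSampler(dataset, samples_per_gpu) if shuffle else None
    return DataLoader(
        dataset,
        batch_size=num_gpus * samples_per_gpu,
        sampler=sampler,
        num_workers=num_gpus * workers_per_gpu,
        # mmcv's collate pads the images of a batch to a common height/width
        collate_fn=partial(collate, samples_per_gpu=samples_per_gpu),
        pin_memory=False)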

2. Instantiating the dataloader

2.1. GroupSampler class implementation

 For the dataset itself, see the post on dataset class construction. Here is the GroupSampler source code:

import numpy as np
from torch.utils.data import Sampler


class GroupSampler(Sampler):

    def __init__(self, dataset, samples_per_gpu=1):
        assert hasattr(dataset, 'flag')
        self.dataset = dataset
        self.samples_per_gpu = samples_per_gpu
        # flag[i] is 0 or 1 depending on whether image i has aspect ratio < 1 or >= 1
        self.flag = dataset.flag.astype(np.int64)
        self.group_sizes = np.bincount(self.flag)  # np.bincount counts how often each flag value (0, 1) occurs
        self.num_samples = 0
        for i, size in enumerate(self.group_sizes):
            # round each group size up to a multiple of samples_per_gpu
            self.num_samples += int(np.ceil(
                size / self.samples_per_gpu)) * self.samples_per_gpu

    def __iter__(self):
        indices = []
        for i, size in enumerate(self.group_sizes):      # e.g. self.group_sizes = [942, 4069]: 942 images with aspect ratio < 1
            if size == 0:
                continue
            indice = np.where(self.flag == i)[0]         # indices of the images whose flag equals the current i; self.flag stores the aspect-ratio group of every training image in order
            assert len(indice) == size
            np.random.shuffle(indice)                    # shuffle the indices within this group
            num_extra = int(np.ceil(size / self.samples_per_gpu)
                            ) * self.samples_per_gpu - len(indice)
            indice = np.concatenate(                     # pad with randomly repeated indices so the group size divides samples_per_gpu
                [indice, np.random.choice(indice, num_extra)])
            indices.append(indice)
        indices = np.concatenate(indices)                # merge the groups into one list, e.g. of length 5011
        indices = [                                      # split the list by batch: if batch=1, this gives 5011 one-element arrays, in shuffled chunk order
            indices[i * self.samples_per_gpu:(i + 1) * self.samples_per_gpu]
            for i in np.random.permutation(
                range(len(indices) // self.samples_per_gpu))
        ]
        indices = np.concatenate(indices)
        indices = indices.astype(np.int64).tolist()
        assert len(indices) == self.num_samples
        return iter(indices)

    def __len__(self):
        return self.num_samples

  The key piece is the __iter__ method, which turns GroupSampler into an iterable. The general idea: suppose the dataset has 5000 images, so the indices run from 0 to 4999. np.random.shuffle shuffles the indices within each aspect-ratio group; if the batch size is 2, that gives 2500 pairs, which are collected into the list indices and finally iterated via iter(indices). A toy run is sketched below.
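  The following toy run uses a made-up DummyDataset that only carries the flag attribute GroupSampler asserts on (an assumption for illustration, not mmdetection code):

import numpy as np

class DummyDataset:
    """Stand-in dataset: only the flag array that GroupSampler needs."""
    def __init__(self, n=10):
        # flag[i] = 1 if image i is "wide" (aspect ratio >= 1), else 0
        self.flag = np.random.randint(0, 2, size=n).astype(np.int64)
    def __len__(self):
        return len(self.flag)

sampler = GroupSampler(DummyDataset(n=10), samples_per_gpu=2)
print(len(sampler))         # >= 10: each group is rounded up to a multiple of 2
print(list(iter(sampler)))  # shuffled indices; each consecutive pair shares a flag
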

2.2. BatchSampler class

  For this part mmdetection directly uses PyTorch's implementation. Here is the source:

from typing import List

from torch.utils.data import Sampler


class BatchSampler(Sampler[List[int]]):

    def __init__(self, sampler: Sampler[int], batch_size: int, drop_last: bool) -> None:
        self.sampler = sampler
        self.batch_size = batch_size
        self.drop_last = drop_last

    def __iter__(self):
        batch = []
        for idx in self.sampler:
            batch.append(idx)
            if len(batch) == self.batch_size:
                yield batch          # emit one full batch of indices
                batch = []
        if len(batch) > 0 and not self.drop_last:
            yield batch              # emit the final, smaller batch

    def __len__(self):
        # Can only be called if self.sampler has __len__ implemented
        # We cannot enforce this condition, so we turn off typechecking for the
        # implementation below.
        # Somewhat related: see NOTE [ Lack of Default `__len__` in Python Abstract Base Classes ]
        if self.drop_last:
            return len(self.sampler) // self.batch_size  # type: ignore
        else:
            return (len(self.sampler) + self.batch_size - 1) // self.batch_size  # type: ignore

  As the source shows, BatchSampler is initialized with a sampler. In its __iter__ method it keeps pulling indices from the sampler until one batch is full, then hands the batch back through the generator via yield batch, i.e. it returns one batch of indices at a time.
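  A quick check of this behavior (the numbers are just illustrative):

from torch.utils.data import SequentialSampler

bs = BatchSampler(SequentialSampler(range(5)), batch_size=2, drop_last=False)
print(list(bs))  # [[0, 1], [2, 3], [4]]: the last, smaller batch is kept
bs = BatchSampler(SequentialSampler(range(5)), batch_size=2, drop_last=True)
print(list(bs))  # [[0, 1], [2, 3]]: the last, smaller batch is dropped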

3. The flow of reading one batch of data

  A picture describes this better than words:
 [Figure: the flow of reading one batch of data]
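  In code form, the flow looks roughly like this self-contained sketch (toy classes, not mmdetection code): batch_sampler yields a list of indices, the dataset is indexed with each one, and a collate function merges the samples into one batch.

import torch
from torch.utils.data import BatchSampler, Dataset, SequentialSampler

class ToyDataset(Dataset):
    def __getitem__(self, idx):
        return torch.tensor([float(idx)])
    def __len__(self):
        return 5

dataset = ToyDataset()
batch_sampler = BatchSampler(
    SequentialSampler(dataset), batch_size=2, drop_last=False)

for batch_indices in batch_sampler:                # e.g. [0, 1]
    samples = [dataset[i] for i in batch_indices]  # dataset.__getitem__ per index
    batch = torch.stack(samples)                   # stands in for collate_fn
    print(batch_indices, batch.shape)              # [0, 1] torch.Size([2, 1]) ...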

Summary

  This post covered how mmdetection implements the dataset and sampler used to construct a Dataloader, and showed how the dataloader internally iterates over each batch of data. If you have questions, feel free to add me on WeChat (wulele2541612007) and I will pull you into the discussion group.


Copyright notice
This article was written by [Wu lele~]. When reposting, please include the original link. Thanks!
https://yzsam.com/2022/161/202206101653417867.html