
[A must for R&D engineers] How to make your own dataset and display it

2022-07-08 00:52:00 Vertira

As of 2022-07-05, the legacy paddle.fluid API is deprecated and will be removed by the official project. Many books on the market still use it, but it is better not to; I recommend that R&D engineers getting started with Paddle begin directly from the current API.

Here I introduce how to use the Paddle API to make your own training dataset (this is not the same thing as the VOC and COCO dataset formats; to learn how to make COCO and VOC datasets, see my earlier blog posts, which explain the process clearly and are relatively simple. This post is aimed only at R&D engineers, not at application developers or end users).

The official website actually explains this quite clearly, but version updates have changed how some APIs are called, so a few lines of the official sample code no longer work. I have tested everything myself, filled in the holes, and share the result here.

Most of the content comes from the official website; the key code I have debugged myself. My Paddle version is the latest as of this year: 2.3, July 2022.

Let's start with the key points (some of this is boilerplate; the official docs say it better, so I quote them).

Dataset definition and loading

Deep learning models need a lot of data for training and evaluation. These samples may be images, text, audio, and so on. The model training process is essentially a mathematical computation, so data samples must go through a series of processing steps before being fed into the model, such as converting data formats, splitting the dataset, transforming data shapes, and building iterable readers for batch training.

In the PaddlePaddle framework, dataset definition and loading are completed through the following two core steps:

  1. Define the dataset: map the raw samples saved on disk (images, text, etc.) and their corresponding labels into a Dataset, so that data can later be read by index, and optionally apply preprocessing such as data transformation and augmentation inside the Dataset. PaddlePaddle recommends using paddle.io.Dataset to build custom datasets; in addition, paddle.vision.datasets and paddle.text contain built-in classic datasets that can be called directly (see the sketch after this list).

  2. Iteratively read the dataset: automatically batch and shuffle samples from the dataset for convenient iterative reading during training, with support for asynchronous multi-process reading to speed up data loading. Use paddle.io.DataLoader to read the dataset iteratively.
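As a quick illustration of those built-in datasets, here is a minimal sketch (my own addition, not part of the original tutorial) that loads MNIST through paddle.vision.datasets instead of a hand-made Dataset:

import paddle
from paddle.vision.transforms import Normalize

# The built-in MNIST dataset downloads itself on first use
transform = Normalize(mean=[127.5], std=[127.5], data_format='CHW')
train_dataset = paddle.vision.datasets.MNIST(mode='train', transform=transform)
print('built-in MNIST train samples:', len(train_dataset))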

This article uses an image dataset as its example; for text datasets, refer to the NLP Application Practice docs.

1. Defining the dataset

1.1 Built-in datasets defined by the framework (I am not interested in this part, so it is omitted here)

1.2 Using paddle.io.Dataset to define a custom dataset

In real scenarios you generally need to define a dataset from your own data, which can be done by subclassing the paddle.io.Dataset base class.

Build a subclass that inherits from paddle.io.Dataset and implement the following three functions:

  1. __init__: complete dataset initialization, mapping the sample file paths on disk and their corresponding labels into a list.

  2. __getitem__: define how to fetch the sample data for a given index, finally returning the single data item (sample data, corresponding label) for that index.

  3. __len__: return the total number of samples in the dataset.

The following shows how, after downloading the raw MNIST dataset files, to define the dataset with paddle.io.Dataset.

# Download the raw MNIST dataset and extract it
! wget https://paddle-imagenet-models-name.bj.bcebos.com/data/mnist.tar
! tar -xf mnist.tar

To download the dataset on Windows (where wget may not be available), simply fetch this URL directly in a browser:

https://paddle-imagenet-models-name.bj.bcebos.com/data/mnist.tar

Then unzip it.
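If you want to stay inside Python instead of shelling out to wget and tar, a sketch using only the standard library (file names here are just examples) would be:

import tarfile
import urllib.request

# Download the archive and extract it into the current directory
url = 'https://paddle-imagenet-models-name.bj.bcebos.com/data/mnist.tar'
urllib.request.urlretrieve(url, 'mnist.tar')
with tarfile.open('mnist.tar') as tar:
    tar.extractall('.')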

Here is the code that builds the dataset (file name: custom_dataset.py):

import os
import cv2
import numpy as np
import paddle
from paddle.io import Dataset
import paddle.vision.transforms as T
import matplotlib.pyplot as plt

class MyDataset(Dataset):
    """
    Step 1: inherit the paddle.io.Dataset class
    """
    def __init__(self, data_dir, label_path, transform=None):
        """
        Step 2: implement __init__ to initialize the dataset, mapping samples and labels into a list
        """
        super(MyDataset, self).__init__()
        self.data_list = []
        with open(label_path,encoding='utf-8') as f:
            for line in f.readlines():
                image_path, label = line.strip().split('\t')
                image_path = os.path.join(data_dir, image_path)
                self.data_list.append([image_path, label])
        # Pass in the data processing method as an attribute of the custom dataset class
        self.transform = transform

    def __getitem__(self, index):
        """
        Step 3: implement __getitem__ to define how data is fetched for a given index, returning a single item (sample data, corresponding label)
        """
        # Fetch one image path and its label from the list by index
        image_path, label = self.data_list[index]
        # Read the image as grayscale
        image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
        # Paddle trains in float32 by default, so convert the image data to float32
        image = image.astype('float32')
        # Apply the data processing method to the image
        if self.transform is not None:
            image = self.transform(image)
        # CrossEntropyLoss requires int labels, so convert the label to int
        label = int(label)
        # Return the image and its corresponding label
        return image, label

    def __len__(self):
        """
        Step 4: implement __len__ to return the total number of samples in the dataset
        """
        return len(self.data_list)

# Define the image normalization transform; 'CHW' means the image layout is [C channels, H height, W width]
transform = T.Normalize(mean=[127.5], std=[127.5], data_format='CHW')
# Instantiate the training and test datasets
train_custom_dataset = MyDataset('mnist/train', 'mnist/train/label.txt', transform)
test_custom_dataset = MyDataset('mnist/val', 'mnist/val/label.txt', transform)
# Print the number of samples in each dataset
print('train_custom_dataset images: ', len(train_custom_dataset), 'test_custom_dataset images: ', len(test_custom_dataset))

The paths in the code above must match where you extracted the downloaded files; otherwise it will fail to run.

After it runs, it prints the size (length) of each dataset:

train_custom_dataset images:  60000 test_custom_dataset images:  10000

In the code above we define a custom dataset class MyDataset. MyDataset inherits from the paddle.io.Dataset base class and implements the three functions __init__, __getitem__, and __len__.

  • In __init__, the label file is read and parsed, and each image path image_path with its corresponding label label is saved into the list data_list.

  • In __getitem__, we define how to fetch the image data for a given index, performing image reading, preprocessing, and label format conversion, and finally returning the image and its label as image, label.

  • In __len__, we return the length of the dataset list data_list initialized in __init__.

In addition, __init__ and __getitem__ can also carry out preprocessing such as flipping, cropping, and normalization, finally returning a single processed data item (sample data, corresponding label). Such operations increase the diversity of the image data and help improve the generalization of the model. The paddle.vision.transforms module of the PaddlePaddle framework has dozens of built-in image processing methods; see the Data Preprocessing chapter for details, and the sketch below.
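For instance, a sketch of chaining several transforms with Compose (my own illustration; the particular augmentation is an arbitrary choice, not something the original tutorial uses):

import paddle.vision.transforms as T

# Chain an augmentation with the normalization; pass the result as the
# `transform` argument of MyDataset in place of the bare Normalize above.
# Normalize with mean=127.5 and std=127.5 maps pixel values [0, 255] to [-1, 1].
transform = T.Compose([
    T.RandomRotation(10),  # rotate randomly within +/-10 degrees
    T.Normalize(mean=[127.5], std=[127.5], data_format='CHW'),
])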

Next, let's try reading a single sample (one image) from the dataset and displaying it.

Append the following directly after the program above:

for data in train_custom_dataset:
    image, label = data
    print('shape of image: ',image.shape)
    plt.title(str(label))
    plt.imshow(image[0])
    plt.show()
    break

This displays one image, with its label as the title.

At this point, the creation and reading of the dataset are complete.

If this helps you, please like, bookmark, and follow.

2. Iteratively reading the dataset

2.1 Using paddle.io.DataLoader to define a data reader

Although directly iterating over the Dataset as described above can access the data, it runs in a single thread and you have to batch the data by hand. The PaddlePaddle framework recommends the paddle.io.DataLoader API instead: it reads the dataset with multiple processes and handles batching automatically.

#  Define and initialize the data reader 
train_loader = paddle.io.DataLoader(train_custom_dataset, batch_size=64, shuffle=True, num_workers=1, drop_last=True)

# Iterate over the DataLoader to read data
for batch_id, data in enumerate(train_loader()):
    images, labels = data
    print("batch_id: {},  Training data shape: {},  Tag data shape: {}".format(batch_id, images.shape, labels.shape))
    break
The output:

batch_id: 0,  Training data shape: [64, 1, 28, 28],  Tag data shape: [64]

With the code above, we initialize a data reader train_loader for loading the training dataset train_custom_dataset. Commonly used fields of the data reader are:

  • batch_size: the number of samples read per batch; batch_size=64 in the example means each batch reads 64 samples.

  • shuffle: whether to shuffle the samples; shuffle=True means the sample order is shuffled when reading, which reduces the risk of overfitting.

  • drop_last: whether to drop the incomplete final batch; drop_last=True means that when the number of samples is not evenly divisible by batch_size, the last incomplete batch is discarded.

  • num_workers: synchronous/asynchronous data reading; num_workers sets the number of subprocesses used to load data. A value greater than 0 enables multi-process asynchronous loading, which speeds up data reading.

Once the data reader is defined, you can iterate over the batches conveniently with a for loop and use them for model training. Note that if you train with the high-level API paddle.Model.fit, you only need to define the Dataset; there is no need to define a DataLoader separately, because paddle.Model.fit already wraps part of the DataLoader functionality. See the Model Training, Evaluation and Inference chapter for details, and the sketch below.
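As a sketch of that high-level path (my own addition; LeNet and the optimizer settings are arbitrary illustrative choices), paddle.Model.fit consumes the custom Dataset directly and batches it internally:

import paddle

# Wrap a small network and train straight from the Dataset;
# batching and shuffling are handled inside Model.fit
model = paddle.Model(paddle.vision.models.LeNet())
model.prepare(paddle.optimizer.Adam(parameters=model.parameters()),
              paddle.nn.CrossEntropyLoss(),
              paddle.metric.Accuracy())
model.fit(train_custom_dataset, epochs=1, batch_size=64, verbose=1)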

Note: the DataLoader actually loads batches by using a batch sampler, BatchSampler, to produce lists of batch indexes, and then fetching the corresponding samples from the Dataset by index. The batch size, ordering, and so on are configured in the DataLoader through the fields batch_size, shuffle, and drop_last. These three fields can also be replaced by a single batch_sampler field that receives a custom batch sampler instance. Either approach achieves the same effect; the next subsection introduces the custom-sampler usage, which allows more flexible sampling rules.

2.2 (Optional) Custom samplers

A sampler defines how samples are drawn from a dataset: sequentially, in batches, randomly, distributed across devices, and so on. Following the configured sampling rules, the sampler returns lists of indexes into the dataset, from which the data reader DataLoader then fetches the corresponding samples.

The paddle.io module of the PaddlePaddle framework provides a variety of samplers, such as the batch sampler BatchSampler, the distributed batch sampler DistributedBatchSampler, the sequential sampler SequenceSampler, and the random sampler RandomSampler.

The two code examples below introduce how samplers are used.

First, taking BatchSampler as an example, here is how to use a BatchSampler inside a DataLoader to obtain sampled data.

from paddle.io import BatchSampler

# Define a batch sampler, setting the source dataset, batch size, shuffling, and drop_last
bs = BatchSampler(train_custom_dataset, batch_size=8, shuffle=True, drop_last=True)

print("Each BatchSampler iteration returns an index list")
for batch_indices in bs:
    print(batch_indices)
    break

# Use the BatchSampler in a DataLoader to fetch the sampled data
train_loader = paddle.io.DataLoader(train_custom_dataset, batch_sampler=bs, num_workers=1)

print("Using BatchSampler in DataLoader returns the sample and label data for one set of indexes")
for batch_id, data in enumerate(train_loader()):
    images, labels = data
    print("batch_id: {},  Training data shape: {},  Tag data shape: {}".format(batch_id, images.shape, labels.shape))
    break

Each BatchSampler iteration returns an index list
[53486, 39208, 42267, 46762, 33087, 54705, 55986, 20736]
Using BatchSampler in DataLoader returns the sample and label data for one set of indexes
batch_id: 0,  Training data shape: [8, 1, 28, 28],  Tag data shape: [8]

In the example code above, we define a batch sampler instance bs. Each iteration returns an index list of size batch_size (in the example, one iteration returns 8 index values). The data reader train_loader receives the batch sampler through the batch_sampler=bs field and can then fetch the corresponding set of samples by those indexes. Also note that the three parameters batch_size, shuffle, and drop_last are now set only on the BatchSampler.

Here is another code example that compares the sampling behavior of several different samplers.

import numpy as np
import paddle
from paddle.io import SequenceSampler, RandomSampler, BatchSampler, DistributedBatchSampler

class RandomDataset(paddle.io.Dataset):
    def __init__(self, num_samples):
        self.num_samples = num_samples

    def __getitem__(self, idx):
        image = np.random.random([784]).astype('float32')
        label = np.random.randint(0, 9, (1, )).astype('int64')
        return image, label

    def __len__(self):
        return self.num_samples
    
train_dataset = RandomDataset(100)

print('----------------- Sequential sampling ----------------')
sampler = SequenceSampler(train_dataset)
batch_sampler = BatchSampler(sampler=sampler, batch_size=10)

for index in batch_sampler:
    print(index)
    
print('----------------- Random sampling ----------------')
sampler = RandomSampler(train_dataset)
batch_sampler = BatchSampler(sampler=sampler, batch_size=10)

for index in batch_sampler:
    print(index)

print('----------------- Distributed sampling ----------------')
batch_sampler = DistributedBatchSampler(train_dataset, num_replicas=2, batch_size=10)

for index in batch_sampler:
    print(index)
The output:

----------------- Sequential sampling ----------------
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
[30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
[40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
[50, 51, 52, 53, 54, 55, 56, 57, 58, 59]
[60, 61, 62, 63, 64, 65, 66, 67, 68, 69]
[70, 71, 72, 73, 74, 75, 76, 77, 78, 79]
[80, 81, 82, 83, 84, 85, 86, 87, 88, 89]
[90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
----------------- Random sampling ----------------
[44, 29, 37, 11, 21, 53, 65, 3, 26, 23]
[17, 4, 48, 84, 86, 90, 92, 76, 97, 69]
[35, 51, 71, 45, 25, 38, 32, 83, 22, 57]
[47, 55, 39, 46, 78, 61, 68, 66, 18, 41]
[77, 81, 15, 63, 91, 54, 24, 75, 59, 99]
[73, 88, 20, 43, 93, 56, 95, 60, 87, 72]
[70, 98, 1, 64, 0, 16, 33, 14, 80, 89]
[36, 40, 62, 50, 9, 34, 8, 19, 82, 6]
[74, 27, 30, 58, 31, 28, 12, 13, 7, 49]
[10, 52, 2, 94, 67, 96, 79, 42, 5, 85]
----------------- Distributed sampling ----------------
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
[40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
[60, 61, 62, 63, 64, 65, 66, 67, 68, 69]
[80, 81, 82, 83, 84, 85, 86, 87, 88, 89]

From the output of the code we can see:

  • Sequential sampling: outputs the index of each sample in order.

  • Random sampling: first shuffles the sample order, then outputs the shuffled indexes.

  • Distributed sampling: commonly used in distributed training scenarios; the samples are split into multiple shards placed on different devices for training. In the example num_replicas=2, so the samples are split across two devices, and only the indexes of half of the samples are printed here (the sketch below shows how to get the other half).
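To see the shard that would go to the other device, a small sketch (my own addition; rank is an existing DistributedBatchSampler parameter that selects which replica's indexes are produced):

# rank=1 should yield the batches complementary to the ones shown above
batch_sampler = DistributedBatchSampler(train_dataset, num_replicas=2, rank=1, batch_size=10)

for index in batch_sampler:
    print(index)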

3. Summary

This section introduced the processing flow that data goes through in the PaddlePaddle framework before being fed into model training; the whole flow and the key APIs used are summarized in the figure below.

 

Figure 1: Dataset definition and loading flow

The flow consists mainly of two steps: defining the dataset and defining the data reader; in addition, a sampler can be plugged into the data reader for more flexible sampling. Note that when defining the dataset, this section only normalized the data; to learn more about data augmentation, refer to the Data Preprocessing chapter.

Once all of the data processing above is done, you can move on to the next task: model training, evaluation, and inference.

References:

- Dataset definition and loading - PaddlePaddle documentation
- Data preprocessing - PaddlePaddle documentation
- paddle.io.Dataset and paddle.io.DataLoader: custom dataset and data loading APIs
- PaddlePaddle hands-on introduction: handwritten digit recognition - Bread Hunter's blog, CSDN
