
[PyTorch study notes] 11. Taking a subset of a Dataset and shuffling its order (using Subset and random_split)

2022-08-05 05:42:00 【takedachia】

(PyTorch version: 1.2)


After defining a dataset with Dataset, a few questions come up often when working with it: How do you split a Dataset into two subsets (for example, to specify training and test sets, or for k-fold cross-validation)? How do you split it randomly? How do you shuffle the order of the data inside a Dataset?

Taking a subset of a Dataset and splitting it

Use torch.utils.data.Subset() to take a subset of a dataset.
Pass in a Dataset and a sequence of indices to get the subset.

1. We can pass in a range():

import torch
indices = range(18353)  # take the samples at indices 0 through 18352
sub_imgs = torch.utils.data.Subset(imgs, indices)
len(imgs), len(sub_imgs)


2. We can also take an interval:

indices = range(18353, 27153)  # take the samples at indices 18353 through 27152
sub_imgs = torch.utils.data.Subset(imgs, indices)
len(imgs), len(sub_imgs)


3. We can pass in a list. With a list, you can use a list comprehension:

indices = [x for x in range(1234)]
sub_imgs = torch.utils.data.Subset(imgs, indices)
len(imgs), len(sub_imgs)

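The three ways of passing indices can be checked on a small toy dataset. This is a minimal sketch: `toy_dataset` is a hypothetical stand-in for `imgs`, built from a `TensorDataset` whose labels equal the sample indices so that the selected samples are easy to read off.

```python
import torch
from torch.utils.data import Subset, TensorDataset

# Hypothetical stand-in for `imgs`: 10 samples whose labels are 0..9.
features = torch.arange(10, dtype=torch.float32).unsqueeze(1)
labels = torch.arange(10)
toy_dataset = TensorDataset(features, labels)

# 1. A range starting at 0 keeps the first 4 samples.
head = Subset(toy_dataset, range(4))

# 2. A range with a start offset keeps samples 4 through 6.
middle = Subset(toy_dataset, range(4, 7))

# 3. A list comprehension can pick arbitrary samples, here every even index.
evens = Subset(toy_dataset, [i for i in range(len(toy_dataset)) if i % 2 == 0])

# Read back the label of every sample in each subset.
head_labels = [int(head[i][1]) for i in range(len(head))]
middle_labels = [int(middle[i][1]) for i in range(len(middle))]
even_labels = [int(evens[i][1]) for i in range(len(evens))]

print(len(toy_dataset), len(head), len(middle), len(evens))
print(head_labels, middle_labels, even_labels)
```

A Subset does not copy the data; it only stores the indices and looks each sample up in the original Dataset on access.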

Shuffling the order of the data in a Dataset

We can shuffle a dataset by passing in a shuffled list of indices directly:

from torch import randperm
lenth = randperm(len(Leaf_dataset_train)).tolist()  # generate shuffled indices
rand_train = torch.utils.data.Subset(imgs, lenth)

# Show the first image and its original index
X = rand_train[0]
plt.imshow(X[0].permute(1, 2, 0)), lenth[0]  # C x H x W -> H x W x C, the layout imshow expects

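To see what the shuffled Subset does, here is a small sketch on toy data. `toy_dataset` is a hypothetical dataset whose values equal their own indices, so reading the shuffled view back shows the permutation directly.

```python
import torch
from torch.utils.data import Subset, TensorDataset

# Hypothetical toy data: sample i holds the value i.
toy_dataset = TensorDataset(torch.arange(8))

# A random permutation of 0..7, used as the index sequence of the Subset.
shuffled_indices = torch.randperm(len(toy_dataset)).tolist()
shuffled = Subset(toy_dataset, shuffled_indices)

# Position i of the shuffled view maps back to original index shuffled_indices[i].
values = [int(shuffled[i][0]) for i in range(len(shuffled))]
print(shuffled_indices)
print(values)
```

The shuffled view contains exactly the same samples as the original, only in a different order.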

After shuffling, we can take subsets of the dataset to perform k-fold cross-validation and similar operations.
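One possible way to combine shuffling and Subset for k-fold cross-validation is sketched below. The helper `k_fold_subsets` is a hypothetical name introduced for illustration, not part of PyTorch, and letting the last fold absorb the remainder is an assumption about how to handle lengths that are not divisible by k.

```python
import torch
from torch.utils.data import Subset, TensorDataset

def k_fold_subsets(dataset, k):
    """Yield (train_subset, val_subset) pairs for k folds over shuffled indices."""
    indices = torch.randperm(len(dataset)).tolist()
    fold_size = len(dataset) // k
    for fold in range(k):
        start = fold * fold_size
        # The last fold absorbs the remainder when len(dataset) is not divisible by k.
        end = len(dataset) if fold == k - 1 else start + fold_size
        val_indices = indices[start:end]
        train_indices = indices[:start] + indices[end:]
        yield Subset(dataset, train_indices), Subset(dataset, val_indices)

# Hypothetical toy data with 10 samples, split into 3 folds.
toy_dataset = TensorDataset(torch.arange(10))
folds = list(k_fold_subsets(toy_dataset, k=3))
fold_sizes = [(len(train), len(val)) for train, val in folds]
print(fold_sizes)
```

Each sample lands in exactly one validation fold, and the training subset of a fold is everything outside its validation slice.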

Randomly splitting a Dataset

Use torch.utils.data.random_split() to split a dataset directly, dividing it randomly into multiple parts.
Pass in a list whose elements are the sizes (number of samples) of each subset. Note that these sizes must sum to the length of the Dataset that is passed in.
Example:

# Here the length of Leaf_dataset_train must equal 17000 + 1353
train_set, test_set = torch.utils.data.random_split(Leaf_dataset_train, [17000, 1353])
print(len(train_set), len(test_set))
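Hard-coding the sizes is easy to get wrong when the dataset changes. A minimal sketch of one alternative is to derive them from a ratio and compute the last size as the remainder, so the sizes always sum to the dataset length. The toy dataset and the 90/10 ratio are assumptions for illustration; seeding with `torch.manual_seed` before the call is one way to make the split repeatable.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Hypothetical toy data with 100 samples.
toy_dataset = TensorDataset(torch.arange(100))

# 90% for training; the test size is the remainder, so the sizes always sum up.
train_len = len(toy_dataset) * 9 // 10
test_len = len(toy_dataset) - train_len

# Seed the global random generator so the split is repeatable.
torch.manual_seed(0)
train_set, test_set = random_split(toy_dataset, [train_len, test_len])

# Splitting again with the same seed reproduces the same partition.
torch.manual_seed(0)
train_again, test_again = random_split(toy_dataset, [train_len, test_len])

print(len(train_set), len(test_set))
```

The returned objects are Subset instances, so they can be passed to a DataLoader like any other Dataset.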