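2.3 [Kaggle dataset: dog breed example] Data preprocessing, overriding Dataset, and reading data with DataLoader
2022-07-01 09:03:00 [Enzo 想砸电脑]
1. Downloading the data
Download the data from Kaggle: Kaggle link
The dataset has three parts: a training image folder, a test image folder, and a CSV file holding the labels for the training images.

I downloaded the data into the project folder and renamed it "dog_breed_original_data".
Next, we read the labels and look at some basic information.
(I use a YAML file here so that file names and hyperparameters are easy to change later, and so the code can be copied and run directly.)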
import os
import yaml
import numpy as np
import pandas as pd
from torch.utils import data
from torchvision import transforms, utils
from PIL import Image

dir_root = os.getcwd()
with open(os.path.join(dir_root, 'config.yml'), "r") as f:
    y = yaml.load(f, Loader=yaml.FullLoader)  # y holds the parsed config as a dict

df = pd.read_csv(os.path.join(dir_root, y['file']['labels_csv']))
df.info()  # info() prints its report directly and returns None
print(df.head())

train_img_files = os.listdir(os.path.join(dir_root, y['file']['train_img']))  # all files in the training folder
test_img_files = os.listdir(os.path.join(dir_root, y['file']['test_img']))  # all files in the test folder
print('\ntrain_img_files number:', len(train_img_files))  # number of files in the training folder
print('test_img_files number:', len(test_img_files))  # number of files in the test folder

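The label file has 10222 rows, the train folder holds 10222 images, and the test folder holds 10375 images.
Later, we split the images in the train folder 80/20 into a training set and a validation set.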
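Equal counts alone do not prove that every id in the label file has an image on disk. A minimal sketch of such a check follows; the helper name `find_missing_images` and the demo folder are hypothetical and not part of the original code, and the '.jpg' extension is taken from how the article builds its image paths.

```python
import os
import tempfile


def find_missing_images(image_ids, img_dir, ext='.jpg'):
    """Return the ids from the label file that have no matching image file."""
    on_disk = set(os.listdir(img_dir))
    return [i for i in image_ids if i + ext not in on_disk]


# Demo on a throwaway folder so the sketch runs without the Kaggle data
demo_dir = tempfile.mkdtemp()
for name in ('a1.jpg', 'b2.jpg'):
    open(os.path.join(demo_dir, name), 'wb').close()

missing = find_missing_images(['a1', 'b2', 'c3'], demo_dir)
print(missing)  # ['c3']
```

With the real data, the call would presumably take `df['id']` and the train folder path from the config; an empty result means every label row has an image.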
=======================================================
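2. Data preprocessing
Preprocessing has two parts:
1) Split the images into two parts: the first 80% form the training set and the last 20% form the validation set.
To keep each image matched with its label, we read the id column (the image file name) from labels.csv in order and build each image path from it.
2) Split the labels the same way: the first 80% belong to the training set and the last 20% to the validation set.
Enumerate the breed names and map each one to an integer; then convert every sample's breed label from its name to the corresponding integer.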
# -----------------------------------
# Convert both columns read from the label file to numpy arrays
# -----------------------------------
label_breed = df['breed'].to_numpy()
label_id = df['id'].to_numpy()

# -----------------------------------
# Split the images into two parts: training set (80%) and validation set (20%)
# -----------------------------------
file = [os.path.join(dir_root, y['file']['train_img'], i + '.jpg') for i in label_id]
num = int(len(file) * 0.8)  # builtin int; the np.int alias was removed in NumPy 1.24
file_train = file[:num]  # first 80% of the data as the training set
file_vali = file[num:]  # last 20% of the data as the validation set

# -----------------------------------
# Enumerate the breed names and map each one to an integer
# -----------------------------------
breed_list = sorted(set(label_breed))  # sorted, so the mapping is the same on every run
# print(len(breed_list))  # 120 breeds in total
dic = {}
for i in range(len(breed_list)):
    dic[breed_list[i]] = i

# -----------------------------------
# Map every sample's label to its integer index,
# then split into training set (80%) and validation set (20%)
# -----------------------------------
label_num = []
for i in range(len(label_breed)):
    label_num.append(dic[label_breed[i]])
label_num = np.array(label_num)
train_label = label_num[:num]
vali_label = label_num[num:]
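Taking the first 80% of rows only gives a representative validation set if the rows of labels.csv are already in random order, which the article does not state. As a hedged alternative, the sketch below draws the split from one fixed-seed permutation and builds the breed mapping from sorted names so it is reproducible. The helper names `build_label_map` and `shuffled_split`, the seed, and the toy breed list are assumptions for illustration.

```python
import numpy as np


def build_label_map(breeds):
    """Map each breed name to an integer; sorting makes the mapping reproducible."""
    names = sorted(set(map(str, breeds)))
    return {name: idx for idx, name in enumerate(names)}


def shuffled_split(n_samples, train_ratio=0.8, seed=0):
    """Return (train_idx, vali_idx) taken from one fixed-seed permutation."""
    order = np.random.default_rng(seed).permutation(n_samples)
    n_train = int(n_samples * train_ratio)
    return order[:n_train], order[n_train:]


# Toy stand-in for the 'breed' column of labels.csv
demo_breeds = np.array(['pug', 'beagle', 'pug', 'collie', 'beagle',
                        'collie', 'pug', 'beagle', 'collie', 'pug'])
demo_map = build_label_map(demo_breeds)
demo_labels = np.array([demo_map[b] for b in demo_breeds])

train_idx, vali_idx = shuffled_split(len(demo_breeds))
demo_train_label = demo_labels[train_idx]  # labels follow the same indices as the files
demo_vali_label = demo_labels[vali_idx]
print(demo_map)
print(len(train_idx), len(vali_idx))
```

Indexing the file list and the label array with the same index arrays keeps each image paired with its label, which is the property the in-order split above was designed to protect.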
3. Overriding Dataset and configuring the DataLoader iterator
# -----------------------------------
# Override Dataset
# -----------------------------------
class TrainSet(data.Dataset):
    def __init__(self):
        self.images = file_train
        self.labels = train_label
        self.preprocess = transforms.Compose([transforms.ToTensor(), transforms.Normalize(mean=y['mean'], std=y['std'])])

    def __getitem__(self, index):
        file_name = self.images[index]
        img_pil = Image.open(file_name).convert('RGB')  # force 3 channels to match the 3-value mean/std
        img_pil = img_pil.resize((224, 224))
        img_tensor = self.preprocess(img_pil)
        label = self.labels[index]
        return img_tensor, label

    def __len__(self):
        return len(self.images)

train_dataset = TrainSet()
train_loader = data.DataLoader(train_dataset, batch_size=4, shuffle=True)
for i, batch in enumerate(train_loader):  # each batch is an (images, labels) pair
    print('i:', i)
    img, label = batch
    print(img)
    print(label)
4. Complete code
import os
import yaml
import numpy as np
import pandas as pd
from torch.utils import data
from torchvision import transforms, utils
from PIL import Image

dir_root = os.getcwd()
with open(os.path.join(dir_root, 'config.yml'), "r") as f:
    y = yaml.load(f, Loader=yaml.FullLoader)  # y holds the parsed config as a dict

df = pd.read_csv(os.path.join(dir_root, y['file']['labels_csv']))
# df.info()
# print(df.head())

train_img_files = os.listdir(os.path.join(dir_root, y['file']['train_img']))  # all files in the training folder
test_img_files = os.listdir(os.path.join(dir_root, y['file']['test_img']))  # all files in the test folder
# print('\ntrain_img_files number:', len(train_img_files))  # number of files in the training folder
# print('test_img_files number:', len(test_img_files))  # number of files in the test folder

# -----------------------------------
# Convert both columns read from the label file to numpy arrays
# -----------------------------------
label_breed = df['breed'].to_numpy()
label_id = df['id'].to_numpy()

# -----------------------------------
# Split the images into two parts: training set (80%) and validation set (20%)
# -----------------------------------
file = [os.path.join(dir_root, y['file']['train_img'], i + '.jpg') for i in label_id]
num = int(len(file) * 0.8)  # builtin int; the np.int alias was removed in NumPy 1.24
file_train = file[:num]  # first 80% of the data as the training set
file_vali = file[num:]  # last 20% of the data as the validation set

# -----------------------------------
# Enumerate the breed names and map each one to an integer
# -----------------------------------
breed_list = sorted(set(label_breed))  # sorted, so the mapping is the same on every run
# print(len(breed_list))  # 120 breeds in total
dic = {}
for i in range(len(breed_list)):
    dic[breed_list[i]] = i

# -----------------------------------
# Map every sample's label to its integer index,
# then split into training set (80%) and validation set (20%)
# -----------------------------------
label_num = []
for i in range(len(label_breed)):
    label_num.append(dic[label_breed[i]])
label_num = np.array(label_num)
train_label = label_num[:num]
vali_label = label_num[num:]

# -----------------------------------
# Override Dataset
# -----------------------------------
class TrainSet(data.Dataset):
    def __init__(self):
        self.images = file_train
        self.labels = train_label
        self.preprocess = transforms.Compose([transforms.ToTensor(), transforms.Normalize(mean=y['mean'], std=y['std'])])

    def __getitem__(self, index):
        file_name = self.images[index]
        img_pil = Image.open(file_name).convert('RGB')  # force 3 channels to match the 3-value mean/std
        img_pil = img_pil.resize((224, 224))
        img_tensor = self.preprocess(img_pil)
        label = self.labels[index]
        return img_tensor, label

    def __len__(self):
        return len(self.images)

train_dataset = TrainSet()
train_loader = data.DataLoader(train_dataset, batch_size=4, shuffle=True)
for i, batch in enumerate(train_loader):  # each batch is an (images, labels) pair
    print('i:', i)
    img, label = batch
    print(img)
    print(label)
YAML config file (config.yml)
file:
  train_img: 'dog_breed_original_data/train'
  test_img: 'dog_breed_original_data/test'
  labels_csv: 'dog_breed_original_data/labels.csv'
mean: [0.485, 0.456, 0.406]  # per-channel mean on ImageNet
std: [0.229, 0.224, 0.225]  # per-channel standard deviation on ImageNet