2.3 [Kaggle dataset - dog breed example] Data preprocessing, rewriting Dataset, reading data with DataLoader
2022-07-01 09:08:00 【Enzo tried to smash the computer】
1. Data download
Download the relevant data from Kaggle: kaggle address
The dataset consists of 3 parts: a training-set folder, a test-set folder, and a csv file with the labels for the training set.
I downloaded the data into the project folder and renamed it "dog_breed_original_data".
Then we read the labels and inspect the relevant information.
(Here I use a yaml config file, so that file names and hyperparameters are easy to change later, and so that you can copy the code and run it directly.)
import os
import yaml
import numpy as np
import pandas as pd
from torch.utils import data
from torchvision import transforms, utils
from PIL import Image
dir_root = os.getcwd()
with open(os.path.join(dir_root, 'config.yml'), "r") as f:
y = yaml.load(f, Loader=yaml.FullLoader)
df = pd.read_csv(os.path.join(dir_root, y['file']['labels_csv']))
print(df.info())
print(df.head())
train_img_files = os.listdir(os.path.join(dir_root, y['file']['train_img']))  # list all files in the training-set folder
test_img_files = os.listdir(os.path.join(dir_root, y['file']['test_img']))    # list all files in the test-set folder
print('\ntrain_img_files number:', len(train_img_files))  # number of files in the training set
print('test_img_files number:', len(test_img_files))      # number of files in the test set
The label csv has 10222 rows, the train folder contains 10222 images, and the test folder contains 10375 images.
Later, we split the images in the train folder 8:2 to serve as the training set and the validation set.
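As a quick sanity check (my addition, not in the original post), we can verify that every id in labels.csv has a matching image file in the train folder, assuming the filenames follow the "<id>.jpg" pattern:
# -----------------------------------
# Sanity check: ids in labels.csv should match the files in the train folder
# -----------------------------------
csv_ids = set(df['id'])
folder_ids = {os.path.splitext(f)[0] for f in train_img_files}
assert csv_ids == folder_ids, 'labels.csv and the train folder are out of sync'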
=======================================================
2. Data preprocessing
The preprocessing consists of the following 2 parts:
1) Split the image data into 2 parts: the first 80% is used as the training set, the last 20% as the validation set.
To guarantee the one-to-one correspondence between images and labels, we read the ids (the image file names) from labels.csv in order and build the image paths from them.
2) Split the labels the same way: the first 80% corresponds to the training set, the last 20% to the validation set.
Enumerate the breed names and map each breed to a unique number; then map every sample's label from the breed name to its corresponding number.
# -----------------------------------
# read the two columns of the label csv into numpy arrays
# -----------------------------------
label_breed = df['breed'].to_numpy()
label_id = df['id'].to_numpy()
# -----------------------------------
# split the images into two parts: training set (80%) and validation set (20%)
# -----------------------------------
file = [os.path.join(dir_root, y['file']['train_img'], i + '.jpg') for i in label_id]
num = int(len(file) * 0.8)   # np.int was removed in NumPy 1.24; use the builtin int
file_train = file[:num]      # first 80% as the training set
file_vali = file[num:]       # last 20% as the validation set
# -----------------------------------
# enumerate the breed names and map each to a number
# -----------------------------------
breed_list = list(set(label_breed))
# print(len(breed_list))  # 120 breeds in total
dic = {}
for i in range(len(breed_list)):
    dic[breed_list[i]] = i
# -----------------------------------
# map each sample's label (breed name) to its number
# and split into: training set (80%) and validation set (20%)
# -----------------------------------
label_num = []
for i in range(len(label_breed)):
    label_num.append(dic[label_breed[i]])
label_num = np.array(label_num)
train_label = label_num[:num]
vali_label = label_num[num:]
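One caveat (my addition, not in the original post): iterating over a set of strings is not deterministic across Python runs because of hash randomization, so the breed-to-number mapping above can change from run to run. If the mapping must be reproducible, e.g. to reuse a saved model later, sorting the breed names first is a small fix — a sketch:
# -----------------------------------
# Deterministic label mapping: sort the unique breed names,
# so the same breed always gets the same number across runs
# -----------------------------------
breed_list = sorted(set(label_breed))
dic = {breed: i for i, breed in enumerate(breed_list)}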
3. Rewrite Dataset and configure the DataLoader data iterator
# -----------------------------------
# rewrite (subclass) Dataset
# -----------------------------------
class TrainSet(data.Dataset):
    def __init__(self):
        self.images = file_train
        self.labels = train_label
        self.preprocess = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize(mean=y['mean'], std=y['std'])])

    def __getitem__(self, index):
        file_name = self.images[index]
        img_pil = Image.open(file_name).convert('RGB')  # ensure 3 channels so the 3-channel Normalize always applies
        img_pil = img_pil.resize((224, 224))
        img_tensor = self.preprocess(img_pil)
        label = self.labels[index]
        return img_tensor, label

    def __len__(self):
        return len(self.images)
train_dataset = TrainSet()  # renamed so the loop variable below does not shadow it
train_loader = data.DataLoader(train_dataset, batch_size=4, shuffle=True)
for i, train_data in enumerate(train_loader):
    print('i:', i)
    img, label = train_data
    print(img)
    print(label)
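The post builds file_vali and vali_label but only shows the training Dataset. A symmetric validation set is a small change; a minimal sketch (my addition, assuming the same resize and normalization as TrainSet):
# -----------------------------------
# Validation Dataset: same preprocessing as TrainSet, no shuffling
# -----------------------------------
class ValiSet(data.Dataset):
    def __init__(self):
        self.images = file_vali
        self.labels = vali_label
        self.preprocess = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize(mean=y['mean'], std=y['std'])])

    def __getitem__(self, index):
        img_pil = Image.open(self.images[index]).convert('RGB').resize((224, 224))
        return self.preprocess(img_pil), self.labels[index]

    def __len__(self):
        return len(self.images)

vali_loader = data.DataLoader(ValiSet(), batch_size=4, shuffle=False)  # no shuffle for validation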
4. Complete code
import os
import yaml
import numpy as np
import pandas as pd
from torch.utils import data
from torchvision import transforms, utils
from PIL import Image
dir_root = os.getcwd()
with open(os.path.join(dir_root, 'config.yml'), "r") as f:
y = yaml.load(f, Loader=yaml.FullLoader)
df = pd.read_csv(os.path.join(dir_root, y['file']['labels_csv']))
# print(df.info())
# print(df.head())
train_img_files = os.listdir(os.path.join(dir_root, y['file']['train_img']))  # list all files in the training-set folder
test_img_files = os.listdir(os.path.join(dir_root, y['file']['test_img']))    # list all files in the test-set folder
# print('\ntrain_img_files number:', len(train_img_files))  # number of files in the training set
# print('test_img_files number:', len(test_img_files))      # number of files in the test set
# -----------------------------------
# read the two columns of the label csv into numpy arrays
# -----------------------------------
label_breed = df['breed'].to_numpy()
label_id = df['id'].to_numpy()
# -----------------------------------
# split the images into two parts: training set (80%) and validation set (20%)
# -----------------------------------
file = [os.path.join(dir_root, y['file']['train_img'], i + '.jpg') for i in label_id]
num = int(len(file) * 0.8)   # np.int was removed in NumPy 1.24; use the builtin int
file_train = file[:num]      # first 80% as the training set
file_vali = file[num:]       # last 20% as the validation set
# -----------------------------------
# enumerate the breed names and map each to a number
# -----------------------------------
breed_list = list(set(label_breed))
# print(len(breed_list))  # 120 breeds in total
dic = {}
for i in range(len(breed_list)):
    dic[breed_list[i]] = i
# -----------------------------------
# map each sample's label (breed name) to its number
# and split into: training set (80%) and validation set (20%)
# -----------------------------------
label_num = []
for i in range(len(label_breed)):
    label_num.append(dic[label_breed[i]])
label_num = np.array(label_num)
train_label = label_num[:num]
vali_label = label_num[num:]
# -----------------------------------
# rewrite (subclass) Dataset
# -----------------------------------
class TrainSet(data.Dataset):
    def __init__(self):
        self.images = file_train
        self.labels = train_label
        self.preprocess = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize(mean=y['mean'], std=y['std'])])

    def __getitem__(self, index):
        file_name = self.images[index]
        img_pil = Image.open(file_name).convert('RGB')  # ensure 3 channels so the 3-channel Normalize always applies
        img_pil = img_pil.resize((224, 224))
        img_tensor = self.preprocess(img_pil)
        label = self.labels[index]
        return img_tensor, label

    def __len__(self):
        return len(self.images)
train_dataset = TrainSet()
train_loader = data.DataLoader(train_dataset, batch_size=4, shuffle=True)
for i, train_data in enumerate(train_loader):
    print('i:', i)
    img, label = train_data
    print(img)
    print(label)
The yaml configuration file (config.yml):
file:
  train_img: 'dog_breed_original_data/train'
  test_img: 'dog_breed_original_data/test'
  labels_csv: 'dog_breed_original_data/labels.csv'
mean: [0.485, 0.456, 0.406]  # per-channel mean on ImageNet
std: [0.229, 0.224, 0.225]   # per-channel std on ImageNet
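A small note (my addition, not in the original post): for a plain config file like this, yaml.safe_load is the commonly recommended loader; it parses ordinary data types without constructing arbitrary Python objects and makes the Loader argument unnecessary:
with open(os.path.join(dir_root, 'config.yml'), 'r') as f:
    y = yaml.safe_load(f)  # for plain data, equivalent here to yaml.load(f, Loader=yaml.FullLoader)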