2.3 [Kaggle dataset: dog breed example] Data preprocessing, subclassing Dataset, and reading data with DataLoader
2022-07-01 09:08:00 【Enzo tried to smash the computer】
1. Data download
Download the data from Kaggle.
The dataset has 3 parts: a training-set folder, a test-set folder, and a CSV file holding the labels for the training set.

I downloaded the data into the project folder and renamed it "dog_breed_original_data".
Next, we read the labels and look at the basic information.
(I use a YAML file here so that file names and hyperparameters are easy to change later, and so that the code can be copied and run directly.)
import os
import yaml
import numpy as np
import pandas as pd
from torch.utils import data
from torchvision import transforms, utils
from PIL import Image
dir_root = os.getcwd()
with open(os.path.join(dir_root, 'config.yml'), "r") as f:
    y = yaml.load(f, Loader=yaml.FullLoader)
df = pd.read_csv(os.path.join(dir_root, y['file']['labels_csv']))
print(df.info())
print(df.head())
train_img_files = os.listdir(os.path.join(dir_root, y['file']['train_img']))  # list all files in the training-set folder
test_img_files = os.listdir(os.path.join(dir_root, y['file']['test_img']))  # list all files in the test-set folder
print('\ntrain_img_files number:', len(train_img_files))  # number of files in the training-set folder
print('test_img_files number:', len(test_img_files))  # number of files in the test-set folder

There are 10222 labels in total; the train folder contains 10222 images and the test folder contains 10375 images.
Later, we split the images in the train folder 80/20 into a training set and a validation set.
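As a quick check of what the 80/20 split means in numbers, the arithmetic can be sketched as follows. The variable names are illustrative only; the total is the label count reported above.

```python
# Split sizes for the 80/20 split, using the label count reported above.
n_total = 10222                 # rows in labels.csv
n_train = int(n_total * 0.8)    # int() truncates 8177.6 down to 8177
n_vali = n_total - n_train      # the remainder goes to the validation set

print(n_train, n_vali)
```

Because `int()` truncates, the validation set receives the one extra sample left over from rounding.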
=======================================================
2. Data preprocessing
Data preprocessing consists of the following 2 parts:
1) Split the image data into 2 parts: the first 80% is the training set and the last 20% is the validation set.
To keep images and labels in exact one-to-one correspondence, we read the ids (the image file names) from labels.csv in order and build each image path from its id.
2) Split the labels the same way: the first 80% go with the training set and the last 20% with the validation set.
Enumerate the breed names and map each one to a unique integer; then convert every sample's breed label from its name to the corresponding integer.
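Building paths from the ids only works if every id really has an image file. A minimal sketch of such a check is shown here; `find_mismatches` is a hypothetical helper, not part of the original code, and the toy lists stand in for `df['id']` and the output of `os.listdir` on the train folder.

```python
def find_mismatches(label_ids, file_names, ext=".jpg"):
    """Return (expected files that are missing, files that have no label row)."""
    expected = {image_id + ext for image_id in label_ids}
    present = set(file_names)
    missing_files = sorted(expected - present)
    unlabeled_files = sorted(present - expected)
    return missing_files, unlabeled_files


# Toy data standing in for df['id'] and os.listdir(train_dir)
ids = ["a1", "b2", "c3"]
files = ["a1.jpg", "c3.jpg", "zz.jpg"]
missing, unlabeled = find_mismatches(ids, files)
print(missing)
print(unlabeled)
```

If both returned lists are empty, the ids and the image files correspond one-to-one.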
# -----------------------------------
# Read the two label columns and convert both to NumPy arrays
# -----------------------------------
label_breed = df['breed'].to_numpy()
label_id = df['id'].to_numpy()
# -----------------------------------
# Split the images into two parts: training set (80%) and validation set (20%)
# -----------------------------------
file = [os.path.join(dir_root, y['file']['train_img'], i + '.jpg') for i in label_id]
num = int(len(file) * 0.8)  # np.int was removed in NumPy 1.24; the built-in int does the same job
file_train = file[:num]  # first 80% as the training set
file_vali = file[num:]  # last 20% as the validation set
# -----------------------------------
# Enumerate the breed names and map each one to an integer
# -----------------------------------
breed_list = sorted(set(label_breed))  # sorted so the mapping is the same on every run
# print(len(breed_list))  # 120 breeds in total
dic = {}
for i in range(len(breed_list)):
    dic[breed_list[i]] = i
# -----------------------------------
# Map each sample's label to its integer,
# then split into training set (80%) and validation set (20%)
# -----------------------------------
label_num = []
for i in range(len(label_breed)):
    label_num.append(dic[label_breed[i]])
label_num = np.array(label_num)
train_label = label_num[:num]
vali_label = label_num[num:]
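The label steps above can be condensed into one small function. This is only a sketch: `encode_and_split` and its parameter names are hypothetical, and it sorts the breed names before numbering them, because the iteration order of a set of strings can differ from one Python run to the next and would otherwise change the name-to-number mapping.

```python
def encode_and_split(breeds, train_fraction=0.8):
    """Map breed names to integers and split the encoded labels in file order."""
    classes = sorted(set(breeds))  # sorted => identical mapping on every run
    class_to_idx = {name: i for i, name in enumerate(classes)}
    encoded = [class_to_idx[b] for b in breeds]
    n_train = int(len(encoded) * train_fraction)
    return class_to_idx, encoded[:n_train], encoded[n_train:]


# Toy labels standing in for df['breed']
breeds = ["pug", "beagle", "pug", "collie", "beagle"]
class_to_idx, train_labels, vali_labels = encode_and_split(breeds)
print(class_to_idx)
print(train_labels, vali_labels)
```

Keeping `class_to_idx` around is useful later, since the inverse mapping is needed to turn predicted integers back into breed names.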
3. Subclass Dataset and configure the DataLoader data iterator
# -----------------------------------
# Subclass Dataset
# -----------------------------------
class TrainSet(data.Dataset):
    def __init__(self):
        self.images = file_train
        self.labels = train_label
        self.preprocess = transforms.Compose([transforms.ToTensor(), transforms.Normalize(mean=y['mean'], std=y['std'])])

    def __getitem__(self, index):
        file_name = self.images[index]
        img_pil = Image.open(file_name).convert('RGB')  # force 3 channels to match the 3-value mean/std
        img_pil = img_pil.resize((224, 224))
        img_tensor = self.preprocess(img_pil)
        label = self.labels[index]
        return img_tensor, label

    def __len__(self):
        return len(self.images)


train_dataset = TrainSet()
train_loader = data.DataLoader(train_dataset, batch_size=4, shuffle=True)
for i, train_data in enumerate(train_loader):
    print('i:', i)
    img, label = train_data
    print(img)
    print(label)
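The class above reads the training split from module-level variables, so the validation split (file_vali, vali_label) would need a second, nearly identical class. One alternative is to pass the items and labels in. The sketch below is framework-free so it runs anywhere: `PairDataset` and `load_fn` are hypothetical names, and in the real pipeline one would presumably subclass data.Dataset and let `load_fn` open the image with PIL, resize it, and apply the transforms.

```python
class PairDataset:
    """Map-style dataset: exposes __getitem__ and __len__, like the class above."""

    def __init__(self, items, labels, load_fn):
        if len(items) != len(labels):
            raise ValueError("items and labels must have the same length")
        self.items = items
        self.labels = labels
        self.load_fn = load_fn  # hypothetical hook: file path -> tensor in the real pipeline

    def __getitem__(self, index):
        return self.load_fn(self.items[index]), self.labels[index]

    def __len__(self):
        return len(self.items)


# Toy stand-ins: strings instead of image files, str.upper instead of image loading
train_set = PairDataset(["a.jpg", "b.jpg", "c.jpg"], [2, 0, 1], load_fn=str.upper)
vali_set = PairDataset(["d.jpg"], [0], load_fn=str.upper)
print(len(train_set), train_set[1])
print(len(vali_set), vali_set[0])
```

With this shape, the training and validation sets differ only in the lists passed to the constructor.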
4. Complete code
import os
import yaml
import numpy as np
import pandas as pd
from torch.utils import data
from torchvision import transforms, utils
from PIL import Image

dir_root = os.getcwd()
with open(os.path.join(dir_root, 'config.yml'), "r") as f:
    y = yaml.load(f, Loader=yaml.FullLoader)
df = pd.read_csv(os.path.join(dir_root, y['file']['labels_csv']))
# print(df.info())
# print(df.head())
train_img_files = os.listdir(os.path.join(dir_root, y['file']['train_img']))  # list all files in the training-set folder
test_img_files = os.listdir(os.path.join(dir_root, y['file']['test_img']))  # list all files in the test-set folder
# print('\ntrain_img_files number:', len(train_img_files))  # number of files in the training-set folder
# print('test_img_files number:', len(test_img_files))  # number of files in the test-set folder
# -----------------------------------
# Read the two label columns and convert both to NumPy arrays
# -----------------------------------
label_breed = df['breed'].to_numpy()
label_id = df['id'].to_numpy()
# -----------------------------------
# Split the images into two parts: training set (80%) and validation set (20%)
# -----------------------------------
file = [os.path.join(dir_root, y['file']['train_img'], i + '.jpg') for i in label_id]
num = int(len(file) * 0.8)  # np.int was removed in NumPy 1.24; the built-in int does the same job
file_train = file[:num]  # first 80% as the training set
file_vali = file[num:]  # last 20% as the validation set
# -----------------------------------
# Enumerate the breed names and map each one to an integer
# -----------------------------------
breed_list = sorted(set(label_breed))  # sorted so the mapping is the same on every run
# print(len(breed_list))  # 120 breeds in total
dic = {}
for i in range(len(breed_list)):
    dic[breed_list[i]] = i
# -----------------------------------
# Map each sample's label to its integer,
# then split into training set (80%) and validation set (20%)
# -----------------------------------
label_num = []
for i in range(len(label_breed)):
    label_num.append(dic[label_breed[i]])
label_num = np.array(label_num)
train_label = label_num[:num]
vali_label = label_num[num:]
# -----------------------------------
# Subclass Dataset
# -----------------------------------
class TrainSet(data.Dataset):
    def __init__(self):
        self.images = file_train
        self.labels = train_label
        self.preprocess = transforms.Compose([transforms.ToTensor(), transforms.Normalize(mean=y['mean'], std=y['std'])])

    def __getitem__(self, index):
        file_name = self.images[index]
        img_pil = Image.open(file_name).convert('RGB')  # force 3 channels to match the 3-value mean/std
        img_pil = img_pil.resize((224, 224))
        img_tensor = self.preprocess(img_pil)
        label = self.labels[index]
        return img_tensor, label

    def __len__(self):
        return len(self.images)


train_dataset = TrainSet()
train_loader = data.DataLoader(train_dataset, batch_size=4, shuffle=True)
for i, train_data in enumerate(train_loader):
    print('i:', i)
    img, label = train_data
    print(img)
    print(label)
YAML configuration file (config.yml)
file:
  train_img: 'dog_breed_original_data/train'
  test_img: 'dog_breed_original_data/test'
  labels_csv: 'dog_breed_original_data/labels.csv'
mean: [0.485, 0.456, 0.406]  # per-channel mean on ImageNet
std: [0.229, 0.224, 0.225]  # per-channel standard deviation on ImageNet
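To make the role of the mean and std values concrete: transforms.Normalize subtracts the mean and divides by the standard deviation, channel by channel. The plain-Python sketch below reproduces that arithmetic for a single RGB pixel; `normalize_pixel` is a hypothetical helper written for illustration and is not a torchvision function.

```python
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]


def normalize_pixel(rgb, mean, std):
    """Apply (value - mean) / std to each channel of one RGB pixel in [0, 1]."""
    return [(v - m) / s for v, m, s in zip(rgb, mean, std)]


mid_grey = [0.5, 0.5, 0.5]  # ToTensor scales 8-bit pixel values into [0, 1]
print(normalize_pixel(mid_grey, mean, std))
```

A pixel exactly equal to the mean maps to zero in every channel, which is the centering the two lists are meant to achieve.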