2.3 [Kaggle dataset: dog breed example] Data preprocessing, subclassing Dataset, and reading data with DataLoader
2022-07-01 09:08:00 【Enzo tried to smash the computer】
1、 Data download
Download the relevant data from Kaggle.
The dataset has 3 parts: a training-set folder, a test-set folder, and a csv file holding the labels for the training set.

I downloaded the data into the project folder and renamed it to "dog_breed_original_data".
Next, we read the labels and look at their basic information.
(I use a yaml file here so that file names and hyperparameters are easy to change later, and so that the code can be copied and run directly.)
import os
import yaml
import numpy as np
import pandas as pd
from torch.utils import data
from torchvision import transforms, utils
from PIL import Image
dir_root = os.getcwd()
with open(os.path.join(dir_root, 'config.yml'), "r") as f:
    y = yaml.load(f, Loader=yaml.FullLoader)
df = pd.read_csv(os.path.join(dir_root, y['file']['labels_csv']))
print(df.info())
print(df.head())
train_img_files = os.listdir(os.path.join(dir_root, y['file']['train_img']))  # list all files in the training-set folder
test_img_files = os.listdir(os.path.join(dir_root, y['file']['test_img']))  # list all files in the test-set folder
print('\ntrain_img_files number:', len(train_img_files))  # number of files in the training set
print('test_img_files number:', len(test_img_files))  # number of files in the test set

The label file has 10222 rows, the train folder contains 10222 images, and the test folder contains 10375 images.
Later, we split the images in the train folder 80/20 into a training set and a validation set.
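Matching counts do not prove that every label id actually has an image file, so a quick sanity check can be useful before splitting. The following is a minimal sketch and not part of the original post: the helper name `find_missing_images` is hypothetical, and the `.jpg` extension is assumed because the post builds paths as `id + '.jpg'`.

```python
import os
import tempfile


def find_missing_images(ids, img_dir, ext=".jpg"):
    """Return the ids that have no matching image file in img_dir."""
    existing = set(os.listdir(img_dir))
    return [i for i in ids if i + ext not in existing]


# Tiny self-contained demo: a temporary folder stands in for the train folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a1.jpg", "b2.jpg"):
        open(os.path.join(tmp, name), "w").close()
    missing = find_missing_images(["a1", "b2", "c3"], tmp)
    print(missing)  # ['c3']
```

With the post's variables this could be called as `find_missing_images(df['id'], os.path.join(dir_root, y['file']['train_img']))`; an empty list means every label has an image.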
=======================================================
2、 Data preprocessing
Data preprocessing consists of the following 2 parts:
1) Split the image data into 2 parts: the first 80% is used as the training set and the last 20% as the validation set.
To keep each image matched with its label, we read the ids (the image names) from labels.csv in order and build the image paths from them.
2) Split the labels the same way: the first 80% correspond to the training set and the last 20% to the validation set.
Enumerate the breed names and map each one to a number; then convert each sample's breed label from its name to the corresponding number.
# -----------------------------------
# Read the two label columns and convert them to numpy arrays
# -----------------------------------
label_breed = df['breed'].to_numpy()
label_id = df['id'].to_numpy()
# -----------------------------------
# Split the images into two parts: training set (80%) and validation set (20%)
# -----------------------------------
file = [os.path.join(dir_root, y['file']['train_img'], i + '.jpg') for i in label_id]
num = int(len(file) * 0.8)  # np.int was removed in NumPy 1.24; use the built-in int
file_train = file[:num]  # first 80% as the training set
file_vali = file[num:]  # last 20% as the validation set
# -----------------------------------
# Enumerate the breed names and map each one to a number
# -----------------------------------
breed_list = sorted(set(label_breed))  # sorted so the mapping is the same on every run
# print(len(breed_list))  # 120 breeds in total
dic = {}
for i in range(len(breed_list)):
    dic[breed_list[i]] = i
# -----------------------------------
# Map each sample's label to its number,
# then split into training set (80%) and validation set (20%)
# -----------------------------------
label_num = []
for i in range(len(label_breed)):
    label_num.append(dic[label_breed[i]])
label_num = np.array(label_num)
train_label = label_num[:num]
vali_label = label_num[num:]
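The split above takes the first 80% of the rows in file order. If the rows in the csv happen to be grouped in some way, a shuffled split with a fixed seed may be preferable. The following is a minimal sketch under that assumption and is not the author's method; `build_label_map`, `shuffled_split`, and the seed value are hypothetical names and choices made for illustration.

```python
import random


def build_label_map(breeds):
    """Map each distinct breed name to an integer, in sorted order so the
    mapping is the same on every run."""
    return {name: idx for idx, name in enumerate(sorted(set(breeds)))}


def shuffled_split(items, labels, train_frac=0.8, seed=0):
    """Shuffle (item, label) pairs together with a fixed seed, then split."""
    pairs = list(zip(items, labels))
    random.Random(seed).shuffle(pairs)
    cut = int(len(pairs) * train_frac)
    return pairs[:cut], pairs[cut:]


# Toy data standing in for the ids and breeds read from the label file.
breeds = ["beagle", "pug", "beagle", "collie", "pug"]
files = ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"]

label_map = build_label_map(breeds)
print(label_map)  # {'beagle': 0, 'collie': 1, 'pug': 2}

train_pairs, vali_pairs = shuffled_split(files, [label_map[b] for b in breeds])
print(len(train_pairs), len(vali_pairs))  # 4 1
```

Shuffling the pairs together keeps every image attached to its own label, which is the same one-to-one guarantee the in-order split relies on.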
3、 Subclass Dataset and configure the DataLoader data iterator
# -----------------------------------
# Subclass Dataset
# -----------------------------------
class TrainSet(data.Dataset):
    def __init__(self):
        self.images = file_train
        self.labels = train_label
        self.preprocess = transforms.Compose([transforms.ToTensor(), transforms.Normalize(mean=y['mean'], std=y['std'])])

    def __getitem__(self, index):
        file_name = self.images[index]
        img_pil = Image.open(file_name).convert('RGB')  # force 3 channels to match the 3-value mean/std
        img_pil = img_pil.resize((224, 224))
        img_tensor = self.preprocess(img_pil)
        label = self.labels[index]
        return img_tensor, label

    def __len__(self):
        return len(self.images)

train_dataset = TrainSet()
train_loader = data.DataLoader(train_dataset, batch_size=4, shuffle=True)
# Iterate over the batches; each batch is an (images, labels) pair
for i, train_data in enumerate(train_loader):
    print('i:', i)
    img, label = train_data
    print(img)
    print(label)
4、 Full code
import os
import yaml
import numpy as np
import pandas as pd
from torch.utils import data
from torchvision import transforms, utils
from PIL import Image
dir_root = os.getcwd()
with open(os.path.join(dir_root, 'config.yml'), "r") as f:
    y = yaml.load(f, Loader=yaml.FullLoader)
df = pd.read_csv(os.path.join(dir_root, y['file']['labels_csv']))
# print(df.info())
# print(df.head())
train_img_files = os.listdir(os.path.join(dir_root, y['file']['train_img']))  # list all files in the training-set folder
test_img_files = os.listdir(os.path.join(dir_root, y['file']['test_img']))  # list all files in the test-set folder
# print('\ntrain_img_files number:', len(train_img_files))  # number of files in the training set
# print('test_img_files number:', len(test_img_files))  # number of files in the test set
# -----------------------------------
# Read the two label columns and convert them to numpy arrays
# -----------------------------------
label_breed = df['breed'].to_numpy()
label_id = df['id'].to_numpy()
# -----------------------------------
# Split the images into two parts: training set (80%) and validation set (20%)
# -----------------------------------
file = [os.path.join(dir_root, y['file']['train_img'], i + '.jpg') for i in label_id]
num = int(len(file) * 0.8)  # np.int was removed in NumPy 1.24; use the built-in int
file_train = file[:num]  # first 80% as the training set
file_vali = file[num:]  # last 20% as the validation set
# -----------------------------------
# Enumerate the breed names and map each one to a number
# -----------------------------------
breed_list = sorted(set(label_breed))  # sorted so the mapping is the same on every run
# print(len(breed_list))  # 120 breeds in total
dic = {}
for i in range(len(breed_list)):
    dic[breed_list[i]] = i
# -----------------------------------
# Map each sample's label to its number,
# then split into training set (80%) and validation set (20%)
# -----------------------------------
label_num = []
for i in range(len(label_breed)):
    label_num.append(dic[label_breed[i]])
label_num = np.array(label_num)
train_label = label_num[:num]
vali_label = label_num[num:]
# -----------------------------------
# Subclass Dataset
# -----------------------------------
class TrainSet(data.Dataset):
    def __init__(self):
        self.images = file_train
        self.labels = train_label
        self.preprocess = transforms.Compose([transforms.ToTensor(), transforms.Normalize(mean=y['mean'], std=y['std'])])

    def __getitem__(self, index):
        file_name = self.images[index]
        img_pil = Image.open(file_name).convert('RGB')  # force 3 channels to match the 3-value mean/std
        img_pil = img_pil.resize((224, 224))
        img_tensor = self.preprocess(img_pil)
        label = self.labels[index]
        return img_tensor, label

    def __len__(self):
        return len(self.images)

train_dataset = TrainSet()
train_loader = data.DataLoader(train_dataset, batch_size=4, shuffle=True)
# Iterate over the batches; each batch is an (images, labels) pair
for i, train_data in enumerate(train_loader):
    print('i:', i)
    img, label = train_data
    print(img)
    print(label)
The yaml configuration file (config.yml)
file:
  train_img: 'dog_breed_original_data/train'
  test_img: 'dog_breed_original_data/test'
  labels_csv: 'dog_breed_original_data/labels.csv'
mean: [0.485, 0.456, 0.406]  # ImageNet per-channel mean
std: [0.229, 0.224, 0.225]  # ImageNet per-channel standard deviation
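To see what the `mean` and `std` lists do: `transforms.ToTensor()` scales 8-bit pixel values to the range [0, 1], and `transforms.Normalize` then computes `(x - mean) / std` for each channel. The plain-Python sketch below reproduces that arithmetic for a single RGB pixel; `normalize_pixel` is a hypothetical helper written for illustration, not a torchvision function.

```python
MEAN = [0.485, 0.456, 0.406]  # per-channel mean from the config above
STD = [0.229, 0.224, 0.225]   # per-channel standard deviation from the config above


def normalize_pixel(rgb, mean=MEAN, std=STD):
    """Scale 0-255 values to [0, 1], then apply (x - mean) / std per channel."""
    scaled = [v / 255.0 for v in rgb]
    return [(x - m) / s for x, m, s in zip(scaled, mean, std)]


# A pure white pixel ends up a little above 2 in every channel.
print([round(v, 3) for v in normalize_pixel((255, 255, 255))])  # [2.249, 2.429, 2.64]
```

A pixel whose scaled value equals the channel mean maps to 0, so after normalization the inputs are roughly centered around zero.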