2.3 [Kaggle dataset - dog breed example] Data preprocessing, subclassing Dataset, reading data with DataLoader
2022-07-01 09:08:00 【Enzo tried to smash the computer】
1、 Data download
Download the data from Kaggle: Kaggle address
The dataset has 3 parts: a training set folder, a test set folder, and a csv file with the labels for the training set.

I downloaded the data into the project folder and renamed it "dog_breed_original_data".
Next, we read the labels and take a look at them.
(I use a yaml file here so that file names and hyperparameters are easy to change later, and so that you can copy the code and run it directly.)
import os
import yaml
import numpy as np
import pandas as pd
from torch.utils import data
from torchvision import transforms, utils
from PIL import Image
dir_root = os.getcwd()
with open(os.path.join(dir_root, 'config.yml'), "r") as f:
    y = yaml.load(f, Loader=yaml.FullLoader)
df = pd.read_csv(os.path.join(dir_root, y['file']['labels_csv']))
print(df.info())
print(df.head())
train_img_files = os.listdir(os.path.join(dir_root, y['file']['train_img']))  # all files in the training set folder
test_img_files = os.listdir(os.path.join(dir_root, y['file']['test_img']))  # all files in the test set folder
print('\ntrain_img_files number:', len(train_img_files))  # number of files in the training set
print('test_img_files number:', len(test_img_files))  # number of files in the test set

There are 10222 labels in total, 10222 images in the train folder, and 10375 images in the test folder.
Later, we will split the images in the train folder 80/20 into a training set and a validation set.
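Matching counts do not by themselves prove that every label has an image. A minimal sketch of a stricter check follows; the helper name `compare_ids_to_files` and the toy data are illustrative assumptions, standing in for `df['id']` and the `os.listdir` result above.

```python
def compare_ids_to_files(label_ids, file_names, ext=".jpg"):
    """Return (image files that are missing, image files that have no label)."""
    expected = {image_id + ext for image_id in label_ids}
    present = set(file_names)
    missing_images = sorted(expected - present)
    unlabeled_images = sorted(present - expected)
    return missing_images, unlabeled_images


# Toy data standing in for df['id'] and os.listdir(train_folder)
ids = ["a1", "b2", "c3"]
files = ["a1.jpg", "c3.jpg", "zz.jpg"]
missing, unlabeled = compare_ids_to_files(ids, files)
print(missing)    # ['b2.jpg']
print(unlabeled)  # ['zz.jpg']
```

On the real data, both lists should come back empty before the split is made.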
=======================================================
2、 Data preprocessing
Data preprocessing consists of the following 2 parts:
1) Split the image data into 2 parts: the first 80% is used as the training set and the last 20% as the validation set.
To keep each image correctly paired with its label, we read the id (the image file name) from labels.csv in order and build the image path from it.
2) Split the labels the same way: the first 80% correspond to the training set and the last 20% to the validation set.
Enumerate the breed names and map each one to a number; then map every sample's breed label from its name to the corresponding number.
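The steps above take the first 80% in file order. If the rows of labels.csv happened to be grouped by breed, a shuffled split would be safer. The following is a minimal sketch of that alternative, not the method used in this article; the helper name `split_train_vali`, the seed, and the toy data are assumptions.

```python
import random


def split_train_vali(ids, labels, train_ratio=0.8, seed=0):
    """Shuffle sample indices with a fixed seed, then cut at train_ratio."""
    assert len(ids) == len(labels)
    order = list(range(len(ids)))
    random.Random(seed).shuffle(order)  # same seed -> same split on every run
    num = int(len(order) * train_ratio)
    # ids and labels are indexed together, so each pair stays intact
    train = [(ids[i], labels[i]) for i in order[:num]]
    vali = [(ids[i], labels[i]) for i in order[num:]]
    return train, vali


ids = [f"img{i}" for i in range(10)]
labels = [i % 3 for i in range(10)]
train, vali = split_train_vali(ids, labels)
print(len(train), len(vali))  # 8 2
```

Because the image id and its label are moved as one pair, the one-to-one correspondence survives the shuffle.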
# -----------------------------------
# Read the two label columns and convert both to numpy arrays
# -----------------------------------
label_breed = df['breed'].to_numpy()
label_id = df['id'].to_numpy()
# -----------------------------------
# Split the images into two parts: training set (80%) and validation set (20%)
# -----------------------------------
file = [os.path.join(dir_root, y['file']['train_img'], i + '.jpg') for i in label_id]
num = int(len(file) * 0.8)  # np.int was removed in NumPy 1.24; the built-in int does the same job
file_train = file[:num]  # first 80% as the training set
file_vali = file[num:]  # last 20% as the validation set
# -----------------------------------
# Enumerate the breed names and map each one to a number
# -----------------------------------
breed_list = sorted(set(label_breed))  # sorted so the name-to-number mapping is the same on every run
# print(len(breed_list))  # 120 breeds in total
dic = {}
for i in range(len(breed_list)):
    dic[breed_list[i]] = i
# -----------------------------------
# Map each sample's label to its corresponding number
# and split into: training set (80%) and validation set (20%)
# -----------------------------------
label_num = []
for i in range(len(label_breed)):
    label_num.append(dic[label_breed[i]])
label_num = np.array(label_num)
train_label = label_num[:num]
vali_label = label_num[num:]
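The name-to-number mapping can also be written as a dict comprehension, and an inverse mapping is handy later for turning predicted numbers back into breed names. Here is a minimal sketch on toy breed names; `breed_to_idx` and `idx_to_breed` are illustrative names, not part of the original code. Sorting the unique names keeps the numbering identical across runs.

```python
# Toy labels standing in for df['breed']
breeds = ["beagle", "pug", "beagle", "collie", "pug"]

classes = sorted(set(breeds))  # deterministic class order
breed_to_idx = {name: idx for idx, name in enumerate(classes)}
idx_to_breed = {idx: name for name, idx in breed_to_idx.items()}

encoded = [breed_to_idx[b] for b in breeds]
decoded = [idx_to_breed[i] for i in encoded]

print(breed_to_idx)  # {'beagle': 0, 'collie': 1, 'pug': 2}
print(encoded)       # [0, 2, 0, 1, 2]
print(decoded == breeds)  # True
```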
3、 Subclass Dataset and configure the data iterator DataLoader
# -----------------------------------
# Subclass Dataset
# -----------------------------------
class TrainSet(data.Dataset):
    def __init__(self):
        self.images = file_train
        self.labels = train_label
        self.preprocess = transforms.Compose([transforms.ToTensor(), transforms.Normalize(mean=y['mean'], std=y['std'])])

    def __getitem__(self, index):
        file_name = self.images[index]
        img_pil = Image.open(file_name).convert('RGB')  # force 3 channels to match the 3-value mean/std
        img_pil = img_pil.resize((224, 224))
        img_tensor = self.preprocess(img_pil)
        label = self.labels[index]
        return img_tensor, label

    def __len__(self):
        return len(self.images)

train_dataset = TrainSet()
train_loader = data.DataLoader(train_dataset, batch_size=4, shuffle=True)
for i, train_data in enumerate(train_loader):
    print('i:', i)
    img, label = train_data
    print(img)
    print(label)
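To get a feel for how many iterations the loop above runs, the batching can be sketched in plain Python. This is a conceptual illustration of shuffling indices and cutting them into batches, not PyTorch's actual implementation; the helper name `iter_batches` and the seed are assumptions.

```python
import random


def iter_batches(n_samples, batch_size, shuffle=True, seed=0):
    """Yield lists of sample indices, batch_size at a time."""
    indices = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]


n_train = int(10222 * 0.8)  # size of the 80% training split
batches = list(iter_batches(n_train, batch_size=4))
print(n_train)           # 8177
print(len(batches))      # 2045
print(len(batches[-1]))  # 1
```

With 8177 training images and a batch size of 4, the last batch holds a single image. DataLoader accepts a `drop_last=True` argument if such an incomplete final batch should be skipped.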
4、 Complete code
import os
import yaml
import numpy as np
import pandas as pd
from torch.utils import data
from torchvision import transforms, utils
from PIL import Image
dir_root = os.getcwd()
with open(os.path.join(dir_root, 'config.yml'), "r") as f:
    y = yaml.load(f, Loader=yaml.FullLoader)
df = pd.read_csv(os.path.join(dir_root, y['file']['labels_csv']))
# print(df.info())
# print(df.head())
train_img_files = os.listdir(os.path.join(dir_root, y['file']['train_img']))  # all files in the training set folder
test_img_files = os.listdir(os.path.join(dir_root, y['file']['test_img']))  # all files in the test set folder
# print('\ntrain_img_files number:', len(train_img_files))  # number of files in the training set
# print('test_img_files number:', len(test_img_files))  # number of files in the test set
# -----------------------------------
# Read the two label columns and convert both to numpy arrays
# -----------------------------------
label_breed = df['breed'].to_numpy()
label_id = df['id'].to_numpy()
# -----------------------------------
# Split the images into two parts: training set (80%) and validation set (20%)
# -----------------------------------
file = [os.path.join(dir_root, y['file']['train_img'], i + '.jpg') for i in label_id]
num = int(len(file) * 0.8)  # np.int was removed in NumPy 1.24; the built-in int does the same job
file_train = file[:num]  # first 80% as the training set
file_vali = file[num:]  # last 20% as the validation set
# -----------------------------------
# Enumerate the breed names and map each one to a number
# -----------------------------------
breed_list = sorted(set(label_breed))  # sorted so the name-to-number mapping is the same on every run
# print(len(breed_list))  # 120 breeds in total
dic = {}
for i in range(len(breed_list)):
    dic[breed_list[i]] = i
# -----------------------------------
# Map each sample's label to its corresponding number
# and split into: training set (80%) and validation set (20%)
# -----------------------------------
label_num = []
for i in range(len(label_breed)):
    label_num.append(dic[label_breed[i]])
label_num = np.array(label_num)
train_label = label_num[:num]
vali_label = label_num[num:]
# -----------------------------------
# Subclass Dataset
# -----------------------------------
class TrainSet(data.Dataset):
    def __init__(self):
        self.images = file_train
        self.labels = train_label
        self.preprocess = transforms.Compose([transforms.ToTensor(), transforms.Normalize(mean=y['mean'], std=y['std'])])

    def __getitem__(self, index):
        file_name = self.images[index]
        img_pil = Image.open(file_name).convert('RGB')  # force 3 channels to match the 3-value mean/std
        img_pil = img_pil.resize((224, 224))
        img_tensor = self.preprocess(img_pil)
        label = self.labels[index]
        return img_tensor, label

    def __len__(self):
        return len(self.images)

train_dataset = TrainSet()
train_loader = data.DataLoader(train_dataset, batch_size=4, shuffle=True)
for i, train_data in enumerate(train_loader):
    print('i:', i)
    img, label = train_data
    print(img)
    print(label)
The yaml configuration file (config.yml)
file:
  train_img: 'dog_breed_original_data/train'
  test_img: 'dog_breed_original_data/test'
  labels_csv: 'dog_breed_original_data/labels.csv'
mean: [0.485, 0.456, 0.406]  # per-channel mean on ImageNet
std: [0.229, 0.224, 0.225]  # per-channel standard deviation on ImageNet
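To see what the mean and std values do to an image, here is a minimal plain-Python sketch of the arithmetic applied to a single RGB pixel: `ToTensor` scales 8-bit values to the range [0, 1], and `Normalize` then subtracts the channel mean and divides by the channel std. The helper name `normalize_pixel` is an illustrative assumption; in the article the work is done by torchvision on whole tensors.

```python
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]


def normalize_pixel(rgb_uint8):
    """Apply the ToTensor scaling and the Normalize formula to one RGB pixel."""
    scaled = [v / 255 for v in rgb_uint8]  # ToTensor: 0..255 -> 0..1
    return [(s - m) / d for s, m, d in zip(scaled, mean, std)]  # Normalize


white = normalize_pixel([255, 255, 255])
black = normalize_pixel([0, 0, 0])
print([round(v, 2) for v in white])  # [2.25, 2.43, 2.64]
print([round(v, 2) for v in black])  # [-2.12, -2.04, -1.8]
```

After normalization the pixel values are roughly centered on zero, which is the input range that models pretrained on ImageNet expect.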