2.3 [Kaggle dataset: dog breed example] Data preprocessing, subclassing Dataset, and reading data with DataLoader
2022-07-01 09:08:00 【Enzo tried to smash the computer】
1、 Data download
Download the data from Kaggle (Kaggle address).
The dataset has three parts: a training set folder, a test set folder, and a CSV file with the labels for the training set.

I downloaded the data into the project folder and renamed it "dog_breed_original_data".
Next, we read the labels and look at the basic information.
(I use a YAML file here so that file names and hyperparameters are easy to change later, and so that you can copy the code and run it directly.)
import os
import yaml
import numpy as np
import pandas as pd
from torch.utils import data
from torchvision import transforms, utils
from PIL import Image
dir_root = os.getcwd()
# Load file paths and normalization parameters from config.yml
with open(os.path.join(dir_root, 'config.yml'), "r") as f:
    y = yaml.load(f, Loader=yaml.FullLoader)
df = pd.read_csv(os.path.join(dir_root, y['file']['labels_csv']))
print(df.info())
print(df.head())
train_img_files = os.listdir(os.path.join(dir_root, y['file']['train_img']))  # all file names in the training set folder
test_img_files = os.listdir(os.path.join(dir_root, y['file']['test_img']))  # all file names in the test set folder
print('\ntrain_img_files number:', len(train_img_files))  # number of files in the training set folder
print('test_img_files number:', len(test_img_files))  # number of files in the test set folder

The label file has 10222 rows, the train folder holds 10222 images, and the test folder holds 10375 images.
Later, we split the images in the train folder 80/20 into a training set and a validation set.
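Equal counts (10222 label rows and 10222 images) do not by themselves prove that every id has a matching file. Below is a minimal sketch of a stricter check. `compare_ids_with_files` is a hypothetical helper that is not part of the original code, and the toy lists stand in for `df['id']` and `train_img_files`.

```python
import os


def compare_ids_with_files(label_ids, file_names):
    """Return (ids without an image, image stems without a label id)."""
    # Strip the extension so 'abc.jpg' can be compared with the id 'abc'.
    stems = {os.path.splitext(name)[0] for name in file_names}
    ids = set(label_ids)
    missing_images = sorted(ids - stems)
    unlabeled_images = sorted(stems - ids)
    return missing_images, unlabeled_images


# Toy data standing in for df['id'] and os.listdir(<train folder>).
example_ids = ['a1', 'b2', 'c3']
example_files = ['a1.jpg', 'b2.jpg', 'd4.jpg']
missing, unlabeled = compare_ids_with_files(example_ids, example_files)
print(missing)    # ids that have no image file
print(unlabeled)  # image files that have no label row
```

With the real data, both returned lists should be empty before the split is made.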
=======================================================
2、 Data preprocessing
Data preprocessing consists of the following two parts:
1) Split the image data into two parts: the first 80% is used as the training set and the last 20% as the validation set.
To keep each image correctly paired with its label, we read the ids (the image names) from the label CSV in order and build the image paths from them.
2) Split the labels into the same two parts: the first 80% for the training set and the last 20% for the validation set.
Enumerate the breed names and map each one to an integer; then convert every sample's breed label from its name to the corresponding integer.
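The two steps above can be sketched without pandas or NumPy. This is an illustration under assumptions, not the author's code: `build_label_map` and `shuffled_split` are hypothetical helpers. The second one swaps in a different technique, a seeded shuffle, in place of the article's ordered first-80% / last-20% cut; that may be preferable if the CSV rows happen to be grouped by breed.

```python
import random


def build_label_map(breeds):
    # Sort the unique names so each breed gets the same integer on every run.
    return {name: idx for idx, name in enumerate(sorted(set(breeds)))}


def shuffled_split(n_samples, train_ratio=0.8, seed=0):
    # Shuffle the sample indices with a fixed seed, then cut at the ratio.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = int(n_samples * train_ratio)
    return indices[:cut], indices[cut:]


# Toy breed column standing in for df['breed'].
breeds = ['pug', 'beagle', 'pug', 'collie', 'beagle']
label_map = build_label_map(breeds)
labels = [label_map[b] for b in breeds]
train_idx, vali_idx = shuffled_split(len(breeds))
```

The same index lists are then used to select both the image paths and the labels, so each image stays paired with its label.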
# -----------------------------------
# Read the two label columns and convert both to NumPy arrays
# -----------------------------------
label_breed = df['breed'].to_numpy()
label_id = df['id'].to_numpy()
# -----------------------------------
# Split the images into two parts: training set (80%) and validation set (20%)
# -----------------------------------
file = [os.path.join(dir_root, y['file']['train_img'], i + '.jpg') for i in label_id]
num = int(len(file) * 0.8)  # np.int was removed in NumPy 1.24; the built-in int does the same job
file_train = file[:num]  # first 80% as the training set
file_vali = file[num:]  # last 20% as the validation set
# -----------------------------------
# Enumerate the breed names and map each one to an integer
# -----------------------------------
breed_list = sorted(set(label_breed))  # sorted so the mapping is identical on every run
# print(len(breed_list))  # 120 breeds in total
dic = {}
for i in range(len(breed_list)):
    dic[breed_list[i]] = i
# -----------------------------------
# Map each sample's label to its integer,
# then split into training set (80%) and validation set (20%)
# -----------------------------------
label_num = []
for i in range(len(label_breed)):
    label_num.append(dic[label_breed[i]])
label_num = np.array(label_num)
train_label = label_num[:num]
vali_label = label_num[num:]
3、 Subclass Dataset and configure the DataLoader data iterator
# -----------------------------------
# Subclass Dataset
# -----------------------------------
class TrainSet(data.Dataset):
    def __init__(self):
        self.images = file_train
        self.labels = train_label
        # Convert to a tensor, then normalize with the mean/std from config.yml
        self.preprocess = transforms.Compose([transforms.ToTensor(), transforms.Normalize(mean=y['mean'], std=y['std'])])

    def __getitem__(self, index):
        file_name = self.images[index]
        img_pil = Image.open(file_name).convert('RGB')  # force 3 channels to match the 3-value mean/std
        img_pil = img_pil.resize((224, 224))
        img_tensor = self.preprocess(img_pil)
        label = self.labels[index]
        return img_tensor, label

    def __len__(self):
        return len(self.images)


train_dataset = TrainSet()  # named train_dataset so the loop variable below does not shadow it
train_loader = data.DataLoader(train_dataset, batch_size=4, shuffle=True)
for i, train_data in enumerate(train_loader):
    print('i:', i)
    img, label = train_data
    print(img)
    print(label)
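To make the division of labour clearer, here is a pure-Python sketch of the contract between the two classes: a dataset only needs `__getitem__` and `__len__`, and the loader turns indices into batches. `ToyDataset` and `iterate_batches` are hypothetical stand-ins written for illustration; this is a simplified picture and not how PyTorch implements `DataLoader`, which also collates samples into tensors and can use worker processes.

```python
import random


class ToyDataset:
    """Stand-in for a Dataset subclass: indexable and sized."""

    def __init__(self, samples, labels):
        self.samples = samples
        self.labels = labels

    def __getitem__(self, index):
        return self.samples[index], self.labels[index]

    def __len__(self):
        return len(self.samples)


def iterate_batches(dataset, batch_size, shuffle=False, seed=0):
    # Build the index order, optionally shuffled with a fixed seed.
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    # Walk the indices in chunks and fetch each sample through __getitem__.
    for start in range(0, len(indices), batch_size):
        chunk = indices[start:start + batch_size]
        samples = [dataset[i][0] for i in chunk]
        labels = [dataset[i][1] for i in chunk]
        yield samples, labels


toy = ToyDataset(['img0', 'img1', 'img2', 'img3', 'img4'], [0, 1, 0, 2, 1])
batches = list(iterate_batches(toy, batch_size=2))
```

Note that the last batch is smaller when the dataset size is not a multiple of the batch size.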
4、 Complete code
import os
import yaml
import numpy as np
import pandas as pd
from torch.utils import data
from torchvision import transforms, utils
from PIL import Image

dir_root = os.getcwd()
# Load file paths and normalization parameters from config.yml
with open(os.path.join(dir_root, 'config.yml'), "r") as f:
    y = yaml.load(f, Loader=yaml.FullLoader)
df = pd.read_csv(os.path.join(dir_root, y['file']['labels_csv']))
# print(df.info())
# print(df.head())
train_img_files = os.listdir(os.path.join(dir_root, y['file']['train_img']))  # all file names in the training set folder
test_img_files = os.listdir(os.path.join(dir_root, y['file']['test_img']))  # all file names in the test set folder
# print('\ntrain_img_files number:', len(train_img_files))  # number of files in the training set folder
# print('test_img_files number:', len(test_img_files))  # number of files in the test set folder
# -----------------------------------
# Read the two label columns and convert both to NumPy arrays
# -----------------------------------
label_breed = df['breed'].to_numpy()
label_id = df['id'].to_numpy()
# -----------------------------------
# Split the images into two parts: training set (80%) and validation set (20%)
# -----------------------------------
file = [os.path.join(dir_root, y['file']['train_img'], i + '.jpg') for i in label_id]
num = int(len(file) * 0.8)  # np.int was removed in NumPy 1.24; the built-in int does the same job
file_train = file[:num]  # first 80% as the training set
file_vali = file[num:]  # last 20% as the validation set
# -----------------------------------
# Enumerate the breed names and map each one to an integer
# -----------------------------------
breed_list = sorted(set(label_breed))  # sorted so the mapping is identical on every run
# print(len(breed_list))  # 120 breeds in total
dic = {}
for i in range(len(breed_list)):
    dic[breed_list[i]] = i
# -----------------------------------
# Map each sample's label to its integer,
# then split into training set (80%) and validation set (20%)
# -----------------------------------
label_num = []
for i in range(len(label_breed)):
    label_num.append(dic[label_breed[i]])
label_num = np.array(label_num)
train_label = label_num[:num]
vali_label = label_num[num:]
# -----------------------------------
# Subclass Dataset
# -----------------------------------
class TrainSet(data.Dataset):
    def __init__(self):
        self.images = file_train
        self.labels = train_label
        # Convert to a tensor, then normalize with the mean/std from config.yml
        self.preprocess = transforms.Compose([transforms.ToTensor(), transforms.Normalize(mean=y['mean'], std=y['std'])])

    def __getitem__(self, index):
        file_name = self.images[index]
        img_pil = Image.open(file_name).convert('RGB')  # force 3 channels to match the 3-value mean/std
        img_pil = img_pil.resize((224, 224))
        img_tensor = self.preprocess(img_pil)
        label = self.labels[index]
        return img_tensor, label

    def __len__(self):
        return len(self.images)


train_dataset = TrainSet()
train_loader = data.DataLoader(train_dataset, batch_size=4, shuffle=True)
for i, train_data in enumerate(train_loader):
    print('i:', i)
    img, label = train_data
    print(img)
    print(label)
The YAML configuration file (config.yml)
file:
  train_img: 'dog_breed_original_data/train'
  test_img: 'dog_breed_original_data/test'
  labels_csv: 'dog_breed_original_data/labels.csv'
mean: [0.485, 0.456, 0.406]  # per-channel mean on ImageNet
std: [0.229, 0.224, 0.225]  # per-channel standard deviation on ImageNet
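For reference, the arithmetic that the `mean` and `std` values feed into can be worked through by hand. `transforms.ToTensor()` scales 8-bit pixel values to the range [0, 1], and `transforms.Normalize` then computes `(x - mean) / std` separately for each channel. The sketch below reproduces that calculation for a single RGB pixel in plain Python; `normalize_pixel` is a hypothetical helper for illustration only.

```python
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb_uint8, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    # ToTensor-style scaling to [0, 1], then per-channel (x - mean) / std.
    scaled = [value / 255.0 for value in rgb_uint8]
    return [(x - m) / s for x, m, s in zip(scaled, mean, std)]


white = normalize_pixel([255, 255, 255])  # brightest possible pixel
black = normalize_pixel([0, 0, 0])        # darkest possible pixel
print(white)
print(black)
```

After normalization the values are no longer confined to [0, 1]: bright pixels become positive and dark pixels negative.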