[Kaggle hands-on project log] Steps and ideas for an image classification project, using leaf classification as an example (PyTorch)
This is an exercise project from Hands-on Deep Learning (leaf classification). Through this project you can gain from-scratch experience in every stage of a deep learning project: data preprocessing, building a dataset, data augmentation, and model training.
This article records the steps and ideas I followed to complete the project. Only the most basic techniques are used, so beginners can follow along.
1 Inspecting the raw data
Let’s take a look at what the original data looks like first.
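After unzipping the dataset, you will find that the images subfolder holds 27153 images in total. By file number, the first 18353 images are the training set and the last 8800 are the test set (no labels are given for the test set).
The labels of the training set are in train.csv and cover 176 classes. The goal of the project is to predict the class of the last 8800 leaf images.
A dataset laid out like this needs some processing before it can be read into a Dataset class, and it is best to write our own Dataset class.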
2 Data preprocessing and building the Dataset
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision.datasets import ImageFolder
from torchvision import transforms
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
The first goal, then, is to prepare for building our own Dataset class.
I first use torchvision.datasets.ImageFolder() to read the pictures under images into a temporary Dataset. That gives a ready-made image dataset in which only the labels need to be replaced, so this temporary Dataset can serve as the basis for our own Dataset class.
ImageFolder has one pitfall when reading files: it reads them in the sorted string order of the file names (i.e. 1.jpg → 10.jpg → 11.jpg … → 2.jpg …), so we first rename the files in the images folder.
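# Rename the files first: zero-pad every number shorter than 5 digits, because ImageFolder reads files in string order
# e.g. 3.jpg → 00003.jpg
import os
path = '../classify-leaves/images'
file_list = os.listdir(path)
for file in file_list:
    front, end = os.path.splitext(file)  # split into name and extension, e.g. ('3', '.jpg')
    front = front.zfill(5)  # left-pad the name with zeros to a total of 5 characters
    new_name = front + end
    # print(new_name)
    os.rename(os.path.join(path, file), os.path.join(path, new_name))  # os.path.join works on any OS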
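To see the ordering pitfall in isolation, here is a minimal sketch using a handful of made-up file names. The names and the helper pad are illustrative only and are not part of the project code. It compares plain string sorting with sorting after zero-padding.

```python
# Hypothetical file names that mimic the numbering style of the images folder
names = [f"{i}.jpg" for i in (1, 2, 3, 10, 11, 20)]

# Strings are compared character by character, so "10.jpg" sorts before "2.jpg"
print(sorted(names))
# ['1.jpg', '10.jpg', '11.jpg', '2.jpg', '20.jpg', '3.jpg']


def pad(name, width=5):
    """Zero-pad the numeric part of a file name (illustrative helper)."""
    stem, ext = name.rsplit(".", 1)
    return f"{stem.zfill(width)}.{ext}"


# After padding, string order and numeric order agree
padded = sorted(pad(n) for n in names)
print(padded)
# ['00001.jpg', '00002.jpg', '00003.jpg', '00010.jpg', '00011.jpg', '00020.jpg']
```

With equal-length names, the order in which the images are read matches the row order of the label file, which is what the rest of the pipeline relies on.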
The entire image dataset can then be read in.
# Read the entire temporary dataset
data_images = ImageFolder(root='../classify-leaves')
To build our own dataset, we first need to separate the training set from the validation set.
Next, read the training set and its labels:
train_csv = pd.read_csv('../classify-leaves/train.csv')
Then the class names in label need to be converted to class numbers, so that they are easy to read into the Dataset later:
# How to get the index of a given element
# class_to_num serves as the mapping from class number (array index) to class name
class_to_num = train_csv.label.unique()
print(np.where(class_to_num == 'quercus_montana')[0][0])
# Map each training label to its class number
train_csv['class_num'] = train_csv['label'].apply(lambda x: np.where(class_to_num == x)[0][0])
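The np.where lookup scans the class array once for every row. A dictionary built once gives the same numbering. Below is a minimal pure-Python sketch of that idea on a short made-up label list: only quercus_montana is taken from the code above, while the other class names and all variable names are hypothetical.

```python
# Hypothetical label column; the real one comes from train.csv
labels = ["maclura_pomifera", "ulmus_rubra", "maclura_pomifera",
          "quercus_montana", "ulmus_rubra"]

# Unique names in order of first appearance (the same order pandas unique() returns)
classes = list(dict.fromkeys(labels))

# Build the name -> number mapping once, then look each label up in it
name_to_num = {name: i for i, name in enumerate(classes)}
class_nums = [name_to_num[x] for x in labels]
print(class_nums)  # [0, 1, 0, 2, 1]

# The list of classes maps the numbers back to names, as class_to_num does above
assert [classes[i] for i in class_nums] == labels
```

With pandas, the same dictionary could be passed to Series.map instead of using apply with np.where.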
With the full image dataset, plus the length and label information of the training set, we can build our own Dataset (for how to write a custom Dataset, see my article on the topic).
I want this Dataset to return either the training set or the validation set as needed. In addition, a data augmentation transform can be passed in.
I designed the Dataset so that it takes the whole dataset object imgs (obtained from ImageFolder earlier) together with the training-set labels labels. imgs is then longer than labels, and the extra part is the validation set. For convenience, the validation labels are automatically set to -1.
# Create the dataset class: leaf_dataset
class leaf_dataset(Dataset):  # must inherit from Dataset
    def __init__(self, imgs, labels, train=True, transform=None):
        """
        Takes the image dataset imgs and the labels. Images in imgs beyond the length
        of labels are treated as the validation set and automatically labelled -1.
        Args:
            imgs (Dataset): the entire image dataset, as read by ImageFolder
            labels (pandas.Series): labels of the training set
            train (bool): True loads the training set, False loads the validation set
            transform: transform to apply; defaults to Resize((224, 224)) + ToTensor()
        """
        to_train = len(labels)
        to_valid = len(imgs)
        if len(imgs) > len(labels):  # labels holds only the training labels and is usually shorter than imgs, so fill in validation labels
            indices1 = range(to_train)
            imgs_to_train = torch.utils.data.Subset(imgs, indices1)
            indices2 = range(to_train, to_valid)
            imgs_to_valid = torch.utils.data.Subset(imgs, indices2)
            labels_valid = pd.Series([-1] * (len(imgs) - len(labels)))  # label the validation set -1, in the same format as the training labels
            if train:
                self.imgs = imgs_to_train
                self.labels = labels
            else:
                self.imgs = imgs_to_valid
                self.labels = labels_valid
        else:  # labels and imgs have equal length, so there is no validation part (if imgs is shorter than labels, the extra labels are never read)
            self.imgs = imgs
            self.labels = labels
        if transform:
            self.transform = transform
        else:
            self.transform = transforms.Compose([
                transforms.Resize((224, 224)),
                transforms.ToTensor(),
            ])  # default transform when none is given
    def __len__(self):
        return len(self.imgs)
    def __getitem__(self, idx):
        label = self.labels[idx]
        data_in = self.imgs[idx][0]  # imgs comes from ImageFolder, so take [0] for the image and drop its label
        data = self.transform(data_in)
        return data, label
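The index arithmetic behind this design can be checked without loading any images. The sketch below is a simplified stand-in, not the class itself: split_indices is a hypothetical helper that reproduces only the "first len(labels) items are training, the rest are validation with label -1" rule, using the image and label counts stated earlier.

```python
def split_indices(n_imgs, n_labels):
    """Return (train_indices, valid_indices, valid_labels) for the layout described above."""
    train_idx = range(n_labels)           # images that have a label
    valid_idx = range(n_labels, n_imgs)   # the remaining, unlabelled images
    valid_labels = [-1] * len(valid_idx)  # placeholder label for every validation image
    return train_idx, valid_idx, valid_labels


# 27153 images in total, 18353 of them labelled
train_idx, valid_idx, valid_labels = split_indices(27153, 18353)
print(len(train_idx), len(valid_idx))  # 18353 8800
print(valid_idx[0], valid_idx[-1])     # 18353 27152
```

Because the validation part is defined purely by position, the split is only correct if the images are read in numeric order, which is why the file names were zero-padded first.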
Preview training and validation sets
With the Dataset designed, the temporary dataset read by ImageFolder earlier is our imgs, and the class numbers of the training set are our labels.
imgs = data_images  # the whole dataset
labels = train_csv.class_num  # training-set labels
print(len(imgs), len(labels))
# The two sequences differ in length; the part beyond len(labels) serves as the validation set
Leaf_dataset_train = leaf_dataset(imgs=imgs, labels=labels, train=True)
# Pass it to a DataLoader and take a look
train_iter = DataLoader(dataset=Leaf_dataset_train, batch_size=128, shuffle=False)
X, y = next(iter(train_iter))
print(X[0].shape, y[0])
# Define the plotting function show_images
def show_images(imgs, num_rows, num_cols, scale=2):
    figsize = (num_cols * scale, num_rows * scale)
    _, axes = plt.subplots(num_rows, num_cols, figsize=figsize)
    for i in range(num_rows):
        for j in range(num_cols):
            axes[i][j].imshow(imgs[i * num_cols + j])
    return axes
# Show the first 16 images
# transpose(0, 2) turns (C, H, W) into (W, H, C), so each image is drawn with rows and columns swapped
toshow = [torch.transpose(X[i],0,2) for i in range(16)]
show_images(toshow, 2, 8, scale=2)
Now take a look at the validation set. The valid_iter defined here is also used later for prediction and upload.
Leaf_dataset_valid = leaf_dataset(imgs=imgs, labels=labels, train=False)
# Pass it to a DataLoader and take a look
valid_iter = DataLoader(dataset=Leaf_dataset_valid, batch_size=128, shuffle=False)
X, y = next(iter(valid_iter))
print(X[0].shape, y[0])
# Show the first 16 images
toshow = [torch.transpose(X[i],0,2) for i in range(16)]
show_images(toshow, 2, 8, scale=2)
3 Defining the model and optimizer
We need to change the number of outputs of the model's final layer to our number of classes, len(class_to_num), i.e. 176.
from torchvision import models
pretrained_net = models.resnet34(pretrained=True)
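# Uses torchvision's pretrained resnet34 model
# Inspect the output layer
# Number of classes
# At this point the last layer of pretrained_net has 1000 outputs, the number of classes in the ImageNet pretraining data, so we replace the final fc layer with one that outputs our 176 classes
pretrained_net.fc = torch.nn.Linear(512, len(class_to_num))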
# Optimizer selection
lr = 0.0001
optimizer = torch.optim.AdamW(pretrained_net.parameters(), lr=lr, weight_decay=0.001)
4 Setting up the training and test sets
In the original data, the provided test set is really a validation set: its predictions have to be submitted and the accuracy is checked online.
So we need to randomly split the training set we built, Leaf_dataset_train, into a training set and a test set.
The test set can be much smaller than the training set; it only needs to show how local training is going.
# Split a test set off the training set. To see the effect of training quickly, the test set can be small.
# Random split; ratio sets the test-set fraction:
def to_split(dataset, ratio=0.1):
    num = len(dataset)
    part1 = int(num * (1 - ratio))
    part2 = num - part1
    train_set, test_set = torch.utils.data.random_split(dataset, [part1, part2])
    return train_set, test_set
train_set, test_set = to_split(Leaf_dataset_train, ratio=0.01)  # test-set ratio set to 0.01
print(len(train_set), len(test_set))
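The size arithmetic in to_split, and the effect of fixing a random seed, can be sketched in plain Python. This is an illustration under assumptions, not the torch implementation: split_indices and its seed parameter are hypothetical, and Python's random module stands in for torch's random number generator.

```python
import random


def split_indices(num, ratio=0.1, seed=0):
    """Shuffle range(num) with a fixed seed and cut it into train/test index lists."""
    part1 = int(num * (1 - ratio))    # same size rule as to_split
    idx = list(range(num))
    random.Random(seed).shuffle(idx)  # a seeded shuffle gives the same split on every run
    return idx[:part1], idx[part1:]


# 18353 labelled images with a test ratio of 0.01
train_idx, test_idx = split_indices(18353, ratio=0.01)
print(len(train_idx), len(test_idx))  # 18169 184

# Same seed, same split
assert split_indices(18353, ratio=0.01)[1] == test_idx
```

In PyTorch, random_split accepts a generator argument (for example a torch.Generator with a manual seed) for the same purpose, so that the local test set stays identical between runs.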
5 Training
Set the data augmentation method
The leaf_dataset class already applies the base transform Resize((224, 224)) + ToTensor(). We still need to set the image augmentation used during training and during testing/validation.
These transforms are then called directly in the training and testing functions.
For training I only use random horizontal and vertical flips here; these two augmentations seemed to give the most noticeable improvement in training.
In addition, Normalize is applied in both training and testing (this is tied to the pretrained model; for details see the image augmentation section of Hands-on Deep Learning).
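What Normalize does to a single pixel is simple arithmetic: each channel value has the channel mean subtracted and is then divided by the channel standard deviation. The sketch below works this out in plain Python for the mean and std values used with torchvision's pretrained models; normalize_pixel is a hypothetical helper written only for illustration.

```python
# Per-channel statistics expected by torchvision's pretrained models (R, G, B)
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]


def normalize_pixel(rgb):
    """Apply (value - mean) / std to each channel of one pixel in [0, 1]."""
    return [(v - m) / s for v, m, s in zip(rgb, mean, std)]


# A pixel equal to the channel means maps to zero
print(normalize_pixel(mean))  # [0.0, 0.0, 0.0]

# White and black pixels end up at roughly +2 and -2 standard deviations
white = normalize_pixel([1.0, 1.0, 1.0])
black = normalize_pixel([0.0, 0.0, 0.0])
print([round(v, 2) for v in white])
print([round(v, 2) for v in black])
```

This is why the images must already be in the range [0, 1] (the job of ToTensor) before Normalize is applied.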
# The leaf_dataset class already applies the base transform Resize((224, 224)) + ToTensor(); here we set the image augmentation used during training and during testing/validation.
# These transforms are then called directly during training and testing.
# Normalize:
# We are using a pretrained model, so apply the same preprocessing as in pretraining.
# If you use torchvision's models, the requirement is: All pre-trained models expect input images normalized in the same way,
# i.e. mini-batches of 3-channel RGB images of shape (3 x H x W), where H and W are expected to be at least 224.
# The images have to be loaded in to a range of [0, 1] and then normalized using mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225].
# Specify the mean and standard deviation of the three RGB channels to normalize each image channel
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
# The Dataset already applies Resize((224, 224)) + ToTensor() by default; add the random flips and the pretrained model's normalize
train_augs = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    normalize,
])
# Transforms used when testing
test_augs = transforms.Compose([
    normalize,
])
Define the accuracy evaluation function and the training function. Both call the data augmentation transforms defined above.
# Define the train function: train on the GPU and evaluate the model
import time
# Evaluate the accuracy on the test set
def evaluate_accuracy(data_iter, net, device=None):
    """Evaluate model prediction accuracy"""
    if device is None and isinstance(net, torch.nn.Module):
        # if no device is given, use the device of net
        device = list(net.parameters())[0].device
    acc_sum, n = 0.0, 0
    with torch.no_grad():
        for X, y in data_iter:
            # Apply the test-time transform (normalize) to the test batch
            X = test_augs(X)
            if isinstance(net, torch.nn.Module):
                net.eval()  # switch net to evaluation mode, which turns off dropout
                # accumulate the number of correct predictions in this batch
                acc_sum += (net(X.to(device)).argmax(dim=1) == y.to(device)).float().sum().cpu().item()
                net.train()  # switch net back to training mode
            else:  # for custom models (rarely needed)
                if 'is_training' in net.__code__.co_varnames:  # if the function has an is_training parameter
                    # set is_training to False
                    acc_sum += (net(X, is_training=False).argmax(dim=1) == y).float().sum().item()
                else:
                    acc_sum += (net(X).argmax(dim=1) == y).float().sum().item()
            n += y.shape[0]
    return acc_sum / n
def train(train_iter, test_iter, net, loss, optimizer, device, num_epochs):
    net = net.to(device)
    print('training on ', device)
    for epoch in range(num_epochs):
        # batch_count is reset every epoch so the printed loss is this epoch's mean batch loss
        train_l_sum, train_acc_sum, n, batch_count, start = 0.0, 0.0, 0, 0, time.time()
        for X, y in train_iter:
            X = X.to(device)
            # Use data augmentation during training
            X = train_augs(X)
            y = y.to(device)
            y_hat = net(X)
            l = loss(y_hat, y)
            optimizer.zero_grad()  # clear gradients from the previous step
            l.backward()  # backpropagate
            optimizer.step()  # update the parameters
            train_l_sum += l.cpu().item()
            train_acc_sum += (y_hat.argmax(dim=1) == y).sum().cpu().item()
            n += y.shape[0]
            batch_count += 1
        test_acc = evaluate_accuracy(test_iter, net)
        print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f, time %.1f sec'
              % (epoch+1, train_l_sum / batch_count, train_acc_sum / n, test_acc, time.time() - start))
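The accuracy bookkeeping used in both functions (add up correct predictions and sample counts over all batches, divide once at the end) can be illustrated without torch. The sketch below is a pure-Python stand-in with made-up scores; argmax and accuracy are hypothetical helpers, not the functions defined above.

```python
def argmax(row):
    """Index of the largest score in a list."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(batches):
    """Pool correct predictions over all batches, then divide by the sample count."""
    correct, n = 0, 0
    for scores, targets in batches:
        preds = [argmax(row) for row in scores]
        correct += sum(p == t for p, t in zip(preds, targets))
        n += len(targets)
    return correct / n


# Two made-up batches of class scores with their target classes
batches = [
    ([[0.1, 0.7, 0.2], [0.9, 0.05, 0.05]], [1, 0]),  # both predictions correct
    ([[0.2, 0.3, 0.5]], [1]),                        # predicted class 2, target 1
]
print(accuracy(batches))  # 2 correct out of 3 samples
```

Pooling the counts matters when the last batch is smaller than the others: averaging the two per-batch accuracies here would give 0.5, not 2/3.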
Set up the training set, test set, loss function, batch_size, number of epochs and the model, then start training.
def train_fine_tuning(net, optimizer, batch_size=128, num_epochs=20):
    train_iter = DataLoader(train_set, batch_size)
    test_iter = DataLoader(test_set, batch_size)
    loss = torch.nn.CrossEntropyLoss()
    train(train_iter, test_iter, net, loss, optimizer, device, num_epochs)
train_fine_tuning(pretrained_net, optimizer)
6 Saving the model
After training, the model can be saved locally so that it is easy to reload and deploy.
# pretrained_net is a torchvision.models.resnet34() model
path = 'net.pt'
torch.save(pretrained_net.state_dict(), path)
7 Predicting on the validation data and uploading
Now we need to predict the classes for the images listed in test.csv.
test_csv = pd.read_csv('../classify-leaves/test.csv')
valid_iter reads the validation set in order, so you can check that its second image really is the second image of the validation set (18354.jpg; it is displayed rotated by 90 degrees).
# Take a look at image 1 of the validation set
X, y = next(iter(valid_iter))
# Look at item 1 of the validation set. valid_iter is read in the original order.
Define a prediction function that returns a list containing the 8800 predicted class numbers, then map the class numbers to class names.
# Define the prediction function
def valid_output(valid_iter, net, device=None):
    if device is None and isinstance(net, torch.nn.Module):
        # if no device is given, use the device of net
        device = list(net.parameters())[0].device
    with torch.no_grad():
        y_output = []
        for X, y in valid_iter:
            # Apply the test-time transform (normalize) to the validation batch
            X = X.to(device)
            X = test_augs(X)
            net.eval()  # put net into evaluation mode
            y_hat = net(X).argmax(dim=1)
            y_hat = y_hat.cpu().tolist()
            y_output += y_hat
    return y_output
output = valid_output(valid_iter, pretrained_net)
# output holds 8800 predictions
output_label = [class_to_num[i] for i in output]  # map class numbers back to class names
test_csv['label'] = output_label
test_csv.to_csv('test.csv', index=False)
# Upload the generated test.csv
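The last two steps, mapping class numbers back to names and writing one row per image, can be sketched with the standard library alone. Everything in this example is assumed for illustration: the class table, the file names, the predicted numbers, and the column names image and label. An in-memory buffer stands in for the real file.

```python
import csv
import io

# Hypothetical class table (index -> name), image names and predicted class numbers
classes = ["maclura_pomifera", "ulmus_rubra", "quercus_montana"]
images = ["images/18353.jpg", "images/18354.jpg", "images/18355.jpg"]
predicted = [2, 0, 2]

# Map each predicted number back to its class name, keeping the image order
rows = [(img, classes[k]) for img, k in zip(images, predicted)]

buffer = io.StringIO()  # in-memory stand-in for the submission file
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["image", "label"])  # assumed header
writer.writerows(rows)
print(buffer.getvalue())
```

The submission is only valid if the predictions are in the same order as the rows of test.csv, which is why valid_iter is created with shuffle=False.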
Upload the csv file to Kaggle to see the result.
Even with these simplest techniques, 95% can be reached.
A brief summary of technical points
Split the local training set into a training set and a (very small) test set for training
Data augmentation: random vertical and horizontal flips on the training set
Model: use a well-performing pretrained model, here ResNet-34 (resnet34)
(The code used in this article is also available on my GitHub)
