[pytorch] kaggle large image dataset data analysis + visualization

2022-06-13 02:09:00 【liyihao76】

Parameters
Metadata: the original data
Exploratory data analysis (EDA)
Data visualization

The competition data contains 51,033 images of more than 15,000 unique individual marine mammals, covering 30 different species (whales and dolphins) and collected by 28 different research institutions. The task is to classify the individual ID of each image in the test set.
Kaggle competition details and dataset download: Happywhale - Whale and Dolphin Identification

Project code: Happywhale: Data Distribution

Parameters

import os
from glob import glob
from tqdm.notebook import tqdm
import numpy as np
import math
import random
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import cv2
import imagesize

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
import timm
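try:
    from cuml import TSNE, UMAP  # GPU implementations, used when a GPU is available
except ImportError:
    from sklearn.manifold import TSNE  # CPU fallback; note that UMAP is not imported in this branch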
import wandb
import IPython.display as ipd

class CFG:
    seed          = 42
    base_path     = './happy-whale-and-dolphin'
    embed_path    = './happywhale-embedding-dataset' # `None` for creating embeddings otherwise load
    ckpt_path     = './happy-whale-and-dolphin/checkpoint/tf_efficientnet_b0_aa-827b6e33.pth' # checkpoint for finetuned model by debarshichanda
    num_samples   = None # None for all samples
    device        = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    competition   = 'happywhale'
    _wandb_kernel = 'awsaf49'

# Reproducibility
def seed_torch(seed_value):
    random.seed(seed_value) # Python
    np.random.seed(seed_value) # cpu vars
    torch.manual_seed(seed_value) # cpu vars 
    if torch.cuda.is_available(): 
        torch.cuda.manual_seed(seed_value)
        torch.cuda.manual_seed_all(seed_value) # gpu vars
    if torch.backends.cudnn.is_available():  # call the function; the bare attribute is always truthy
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    print('# SEEDING DONE')
seed_torch(CFG.seed)
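
To illustrate what seeding provides, here is a minimal sketch that uses only Python's built-in random module. The helper name draw_numbers is hypothetical and not part of the notebook. Re-seeding with the same value reproduces the same sequence; the same principle applies to the NumPy and PyTorch generators that seed_torch seeds.

```python
import random

def draw_numbers(seed_value, n=5):
    """Seed Python's RNG, then draw n pseudo-random floats."""
    random.seed(seed_value)
    return [random.random() for _ in range(n)]

first = draw_numbers(42)
second = draw_numbers(42)  # same seed, so the same sequence
other = draw_numbers(43)   # different seed, so a different sequence is expected

print(first == second)
print(first == other)
```
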

Metadata: the original data

The original data
Add path information

df = pd.read_csv(f'{CFG.base_path}/train.csv')  # read the training CSV; df has image file names but no paths yet
df['image_path'] = CFG.base_path+'/train_images/'+df['image']  # add a column with each image's relative path
df['split'] = 'Train'  # add a column marking these rows as the training set

test_df = pd.read_csv(f'{CFG.base_path}/sample_submission.csv')
test_df['image_path'] = CFG.base_path+'/test_images/'+test_df['image']
test_df['split'] = 'Test'

print('Train Images: {:,} | Test Images: {:,}'.format(len(df), len(test_df)))
# Train Images: 51,033 | Test Images: 27,956

Split the data into two broad classes and fix species labeling errors

# Convert beluga and globis to whale species names so that every species falls into one of 2 class labels.
# The class column is 'whale' if the species name contains 'whale', otherwise 'dolphin'.
df.loc[df.species.str.contains('beluga'), 'species'] = 'beluga_whale'
df.loc[df.species.str.contains('globis'), 'species'] = 'short_finned_pilot_whale'
df.loc[df.species.str.contains('pilot_whale'), 'species'] = 'short_finned_pilot_whale'
df['class'] = df.species.map(lambda x: 'whale' if 'whale' in x else 'dolphin')

# Fix misspelled species names that create duplicate labels
# https://www.kaggle.com/c/happy-whale-and-dolphin/discussion/304633
df['species'] = df['species'].str.replace('bottlenose_dolpin','bottlenose_dolphin')
df['species'] = df['species'].str.replace('kiler_whale','killer_whale')
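
The label clean-up above can also be sketched on plain strings, which makes the rules easy to check in isolation. This is an illustrative re-statement of the same steps, not the notebook's code; the helper names normalize_species and to_class and the sample list are assumptions.

```python
def normalize_species(name):
    """Apply the same clean-up rules as above to a single species string."""
    if 'beluga' in name:
        name = 'beluga_whale'
    if 'globis' in name or 'pilot_whale' in name:
        name = 'short_finned_pilot_whale'
    # Misspellings that would otherwise create duplicate labels
    name = name.replace('bottlenose_dolpin', 'bottlenose_dolphin')
    name = name.replace('kiler_whale', 'killer_whale')
    return name

def to_class(species):
    """Two-class label decided purely by the substring 'whale'."""
    return 'whale' if 'whale' in species else 'dolphin'

raw = ['beluga', 'globis', 'kiler_whale', 'bottlenose_dolpin', 'humpback_whale']
cleaned = [normalize_species(s) for s in raw]
classes = [to_class(s) for s in cleaned]

for r, c, k in zip(raw, cleaned, classes):
    print(f'{r:>18} -> {c:<26} ({k})')
```

Because the class is decided by a substring test, a species such as killer_whale lands in the whale class regardless of its taxonomy.
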

Function to get the image size

# imagesize parses the header of the image file and returns the image size
def get_imgsize(row):
    row['width'], row['height'] = imagesize.get(row['image_path'])
    return row
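
To show why reading only the header is cheap, here is a minimal sketch that extracts the width and height from the first 24 bytes of a PNG file. It illustrates the general idea only and is not how the imagesize library is implemented. PNG is chosen purely because its header layout is simple, and the function name png_size and the hand-built header are assumptions.

```python
import struct

PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'

def png_size(header_bytes):
    """Return (width, height) from the first 24 bytes of a PNG file."""
    if header_bytes[:8] != PNG_SIGNATURE or header_bytes[12:16] != b'IHDR':
        raise ValueError('not a PNG header')
    # Width and height are two big-endian 32-bit integers at the start of the IHDR chunk
    width, height = struct.unpack('>II', header_bytes[16:24])
    return width, height

# Hand-built header: signature, IHDR chunk length (13), chunk type, width, height
fake_header = (PNG_SIGNATURE
               + struct.pack('>I', 13)
               + b'IHDR'
               + struct.pack('>II', 640, 480))

print(png_size(fake_header))
```

No pixel data is decoded, which is why this kind of lookup stays fast even over tens of thousands of images.
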

Generate new data files (containing all of the information above)

# Train
tqdm.pandas(desc='Train ')
df = df.progress_apply(get_imgsize, axis=1)
df.to_csv('train.csv', index=False)

# Test
tqdm.pandas(desc='Test ')
test_df = test_df.progress_apply(get_imgsize, axis=1)
test_df.to_csv('test.csv',index=False)

Note: axis=1 (or 'columns') applies the function to each row.

The new data files add the image_path, split, width, and height columns to the original data (the training file also gains the class column).

Exploratory data analysis (EDA)

Different Species

The number of samples per species is imbalanced.

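# Name the columns explicitly so the plot does not depend on how the installed pandas version names them
data = df.species.value_counts().rename_axis('species').reset_index(name='count')
fig = px.bar(data, x='species', y='count', color='count', title='Species', text_auto=True)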
fig.update_traces(textfont_size=12, textangle=0, textposition="outside", cliponaxis=False)
fig.show()
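
The imbalance can also be put into a single number. The sketch below computes the ratio between the most and least frequent label with collections.Counter. The helper name imbalance_summary and the toy labels are made up for illustration and are not taken from the competition data.

```python
from collections import Counter

def imbalance_summary(labels):
    """Summarise class imbalance as the ratio of the largest to the smallest class."""
    counts = Counter(labels)
    ordered = counts.most_common()  # sorted by count, largest first
    most, least = ordered[0], ordered[-1]
    return {
        'n_classes': len(counts),
        'most_common': most,
        'least_common': least,
        'ratio': most[1] / least[1],
    }

# Toy labels, not real competition data
toy_labels = ['a'] * 8 + ['b'] * 3 + ['c'] * 1
summary = imbalance_summary(toy_labels)
print(summary)
```

With the real data, the same function could be called as imbalance_summary(df.species.tolist()).
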

Dolphin Vs Whale

Comparing whales and dolphins, there are more whale samples.

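# Name the columns explicitly so the plot does not depend on how the installed pandas version names them
data = df['class'].value_counts().rename_axis('class').reset_index(name='count')
fig = px.bar(data, x='class', y='count', color='count', title='Whale Vs Dolphin', text_auto=True)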
fig.update_traces(textfont_size=12, textangle=0, textposition="outside", cliponaxis=False)
fig.show()

ImageSize Vs Class

The image-size distributions of whales and dolphins are similar, except for some cases in height.

fig = px.histogram(df,
                   x="width", 
                   color="class",
                   barmode='group',
                   log_y=True,
                   title='Width Vs Class')
fig.show()

fig = px.histogram(df,
                   x="height", 
                   color="class",
                   barmode='group',
                   log_y=True,
                   title='Height Vs Class')
fig.show()

ImageSize Vs Split(Train/Test)

The width distributions of the training and test data look very similar, so we can resize the images freely.
For height, there are some differences between the two splits.

fig = px.histogram(pd.concat([df, test_df]),
                   x="width", 
                   color="split",
                   barmode='group',
                   log_y=True,
                   title='Width Vs Split')
fig.show()

fig = px.histogram(pd.concat([df, test_df]),
                   x="height", 
                   color="split",
                   barmode='group',
                   log_y=True,
                   title='Height Vs Split')
fig.show()
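
Alongside the histograms, a per-split numeric summary can make the comparison concrete. The following is a minimal sketch using only the standard library. The helper name size_summary and the toy records are made up for illustration and are not competition data.

```python
from statistics import median

def size_summary(records, key):
    """Group records by split and report (min, median, max) of the given key."""
    groups = {}
    for rec in records:
        groups.setdefault(rec['split'], []).append(rec[key])
    return {split: (min(v), median(v), max(v)) for split, v in groups.items()}

# Toy records, not real competition data
toy = [
    {'split': 'Train', 'width': 800, 'height': 600},
    {'split': 'Train', 'width': 1200, 'height': 400},
    {'split': 'Train', 'width': 1000, 'height': 500},
    {'split': 'Test', 'width': 900, 'height': 300},
    {'split': 'Test', 'width': 1100, 'height': 700},
]

width_summary = size_summary(toy, 'width')
height_summary = size_summary(toy, 'height')
print(width_summary)
print(height_summary)
```

With the real data, the records could come from pd.concat([df, test_df]).to_dict('records').
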

Data visualization

Reading data: 1. Subclass Dataset and override three methods (__init__, __len__, and __getitem__), so that it returns one sample at a time. 2. Wrap it in a DataLoader, which yields batch_size samples at a time.

def load_image(path):
    img = cv2.imread(path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    return img

class ImageDataset(Dataset):
    def __init__(self,
                 path,
                 target=None,
                 input_shape=(128, 256),
                 transform=None,
                 channel_first=True,
                ):
        super(ImageDataset, self).__init__()
        self.path = path
        self.target = target
        self.input_shape = input_shape
        self.transform = transform
        self.channel_first = channel_first
    def __len__(self):
        return len(self.path)
    
    def __getitem__(self, idx):
        img = load_image(self.path[idx])
        img = cv2.resize(img, dsize=self.input_shape)
        if self.transform is not None:
            img = self.transform(image=img)["image"]
        if self.channel_first:
            img = img.transpose((2, 0, 1))
        if self.target is not None:
            target = self.target[idx]
            return img, target
        else:
            return img

def get_dataset(path, target=None, batch_size=32, input_shape=(224, 224)):
    dataset = ImageDataset(path=path,
                           target=target,
                           input_shape=input_shape,
                          )

    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=2,
        shuffle=False,
        pin_memory=True,
    )
    return dataloader

train_loader = get_dataset(path=df.image_path.tolist(),
                       target=df.species.tolist(),
                       input_shape=(224,224),
                      )
test_loader = get_dataset(path=test_df.image_path.tolist(),
                       target=None,
                       input_shape=(224,224),
                      )
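
The division of labour between the dataset and the loader can be sketched in plain Python. This is a simplified illustration, not PyTorch's actual implementation: the real DataLoader also collates samples into tensors and can use worker processes. The names ToyDataset and iterate_batches are hypothetical.

```python
class ToyDataset:
    """Mimics the Dataset protocol: __len__ plus __getitem__ for one sample."""

    def __init__(self, paths, targets=None):
        self.paths = paths
        self.targets = targets

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        item = f'image<{self.paths[idx]}>'  # stand-in for loading and resizing
        if self.targets is not None:
            return item, self.targets[idx]
        return item

def iterate_batches(dataset, batch_size):
    """Yield lists of consecutive samples, like a loader with shuffle=False."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]

ds = ToyDataset(['a.jpg', 'b.jpg', 'c.jpg', 'd.jpg', 'e.jpg'],
                targets=[0, 1, 0, 1, 1])
batches = list(iterate_batches(ds, batch_size=2))

print(len(batches))
print(batches[-1])
```

The last batch is smaller than batch_size because five samples do not divide evenly by two, which is also what a DataLoader does when drop_last is left at its default of False.
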

Inspect training set images

def plot_batch(batch, row=2, col=2, channel_first=True):
    if isinstance(batch, (tuple, list)):
        imgs, tars = batch
    else:
        imgs, tars = batch, None
    plt.figure(figsize=(col*3, row*3))
    for i in range(row*col):
        plt.subplot(row, col, i+1)
        img = imgs[i].numpy()
        if channel_first:
            img = img.transpose((1, 2, 0))
        plt.imshow(img)  # use plt.imshow((img * 255).astype(np.uint8)) if pixel values are in [0, 1]
        if tars is not None:
            plt.title(tars[i])
        plt.axis('off')
    plt.tight_layout()
    plt.show()
    
def gen_colors(n=10):
    cmap   = plt.get_cmap('rainbow')
    colors = [cmap(i) for i in np.linspace(0, 1, n + 2)]
    colors = [(c[2] * 255, c[1] * 255, c[0] * 255) for c in colors]
    return colors

batch = next(iter(train_loader))  # the built-in next() works across PyTorch versions
plot_batch(batch, row=2, col=5)

Test set images

batch = next(iter(test_loader))
plot_batch(batch, row=2, col=5)

Original site

Copyright notice
This article was written by [liyihao76]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/02/202202280546486275.html
