[pytorch] kaggle large image dataset data analysis + visualization
2022-06-13 02:09:00 【liyihao76】
The competition data contains 51,033 images of more than 15,000 unique individual marine mammals from 30 different species (whales and dolphins), collected by 28 different research institutions. The task is to classify the individual ID of each image in the test set.
Kaggle competition details and dataset download: Happywhale - Whale and Dolphin Identification
Project code: Happywhale: Data Distribution
Parameters
import os
from glob import glob
from tqdm.notebook import tqdm
import numpy as np
import math
import random
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import cv2
import imagesize
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
import timm
try:
    from cuml import TSNE, UMAP  # GPU versions, if cuML is installed and a GPU is available
except Exception:
    from sklearn.manifold import TSNE  # CPU fallback
import wandb
import IPython.display as ipd
class CFG:
    seed = 42
    base_path = './happy-whale-and-dolphin'
    embed_path = './happywhale-embedding-dataset'  # set to `None` to create embeddings, otherwise load them from here
    ckpt_path = './happy-whale-and-dolphin/checkpoint/tf_efficientnet_b0_aa-827b6e33.pth'  # checkpoint for finetuned model by debarshichanda
    num_samples = None  # None for all samples
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    competition = 'happywhale'
    _wandb_kernel = 'awsaf49'
# Reproducibility
def seed_torch(seed_value):
    random.seed(seed_value)  # Python
    np.random.seed(seed_value)  # NumPy
    torch.manual_seed(seed_value)  # PyTorch (CPU)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed_value)
        torch.cuda.manual_seed_all(seed_value)  # all GPUs
    if torch.backends.cudnn.is_available():  # must be called; the bare attribute is always truthy
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    print('# SEEDING DONE')

seed_torch(CFG.seed)
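As a quick illustration of what seeding buys us, the sketch below uses only Python's built-in random module: resetting the same seed reproduces the same draws. The helper name draw is hypothetical and not part of the original notebook. The same idea applies to the NumPy and PyTorch generators seeded above.

```python
import random


def draw(seed_value, n=3):
    """Seed Python's RNG, then draw n pseudo-random floats (hypothetical helper)."""
    random.seed(seed_value)
    return [random.random() for _ in range(n)]


first = draw(42)
second = draw(42)   # same seed again
other = draw(43)    # different seed

print(first == second)  # same seed gives the same sequence
print(first == other)   # a different seed gives a different sequence
```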
Metadata: the original data

Add path information
df = pd.read_csv(f'{CFG.base_path}/train.csv')  # read the training CSV; at this point df has image names but no paths
df['image_path'] = CFG.base_path+'/train_images/'+df['image']  # add a column with the relative path of each image
df['split'] = 'Train'  # add a column marking these rows as the training set
test_df = pd.read_csv(f'{CFG.base_path}/sample_submission.csv')
test_df['image_path'] = CFG.base_path+'/test_images/'+test_df['image']
test_df['split'] = 'Test'
print('Train Images: {:,} | Test Images: {:,}'.format(len(df), len(test_df)))
# Train Images: 51,033 | Test Images: 27,956
Divide the data into two broad classes and fix species labeling errors
# Convert beluga and globis to whale species names so that every species falls into one of 2 classes.
# The class is 'whale' if the species name contains 'whale', otherwise 'dolphin'; the result is stored in the `class` column.
df.loc[df.species.str.contains('beluga'), 'species'] = 'beluga_whale'
df.loc[df.species.str.contains('globis'), 'species'] = 'short_finned_pilot_whale'
df.loc[df.species.str.contains('pilot_whale'), 'species'] = 'short_finned_pilot_whale'
df['class'] = df.species.map(lambda x: 'whale' if 'whale' in x else 'dolphin')
# Fix duplicate (misspelled) labels
# https://www.kaggle.com/c/happy-whale-and-dolphin/discussion/304633
df['species'] = df['species'].str.replace('bottlenose_dolpin','bottlenose_dolphin')
df['species'] = df['species'].str.replace('kiler_whale','killer_whale')
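To check what each raw label ends up as, here is a minimal sketch of the same relabeling rules applied to plain strings instead of a DataFrame column. The function names normalize_species and species_class are hypothetical; they only mirror the pandas statements above.

```python
def normalize_species(name):
    """Apply the relabeling rules from the pandas code to one label (hypothetical helper)."""
    if 'beluga' in name:
        name = 'beluga_whale'
    if 'globis' in name or 'pilot_whale' in name:
        name = 'short_finned_pilot_whale'
    # fix the two misspelled labels
    name = name.replace('bottlenose_dolpin', 'bottlenose_dolphin')
    name = name.replace('kiler_whale', 'killer_whale')
    return name


def species_class(name):
    """'whale' if the species name contains 'whale', otherwise 'dolphin'."""
    return 'whale' if 'whale' in name else 'dolphin'


for raw in ['beluga', 'globis', 'kiler_whale', 'bottlenose_dolpin']:
    fixed = normalize_species(raw)
    print(raw, '->', fixed, '->', species_class(fixed))
```

Note that the 'pilot_whale' rule matches any label containing that substring, so a label such as long_finned_pilot_whale, if present, would also be merged into short_finned_pilot_whale.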
Function to get the image size
# imagesize parses the header of the image file and returns the image size without decoding the pixels
def get_imgsize(row):
    row['width'], row['height'] = imagesize.get(row['image_path'])
    return row
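To illustrate why reading the size from the header is cheap, here is a minimal sketch that extracts width and height from the first 24 bytes of a PNG file. This is not how imagesize is implemented; it is a PNG-only illustration (chosen because the PNG layout is simple), and png_size is a hypothetical helper name.

```python
import struct

PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'


def png_size(header_bytes):
    """Return (width, height) from the first 24 bytes of a PNG file (hypothetical helper)."""
    if len(header_bytes) < 24 or header_bytes[:8] != PNG_SIGNATURE:
        raise ValueError('not a PNG header')
    if header_bytes[12:16] != b'IHDR':
        raise ValueError('IHDR chunk not found')
    # width and height are two big-endian unsigned 32-bit integers
    width, height = struct.unpack('>II', header_bytes[16:24])
    return width, height


# Build a fake header in memory: signature, IHDR length (13), chunk type, width, height
fake = PNG_SIGNATURE + struct.pack('>I', 13) + b'IHDR' + struct.pack('>II', 3504, 2336)
print(png_size(fake))  # (3504, 2336)
```

Only a few bytes are read per file, which is why this approach scales to tens of thousands of images far better than decoding each one with cv2.imread.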
Generate new data files (containing all the information above)
# Train
tqdm.pandas(desc='Train ')
df = df.progress_apply(get_imgsize, axis=1)
df.to_csv('train.csv', index=False)
# Test
tqdm.pandas(desc='Test ')
test_df = test_df.progress_apply(get_imgsize, axis=1)
test_df.to_csv('test.csv',index=False)
Note: axis=1 (or 'columns') applies the function to each row.
Exploratory data analysis (EDA)
Different Species
The number of images per species is imbalanced.
data = df.species.value_counts().rename_axis('species').reset_index(name='count')  # explicit column names, independent of the pandas version
fig = px.bar(data, x='species', y='count', color='count', title='Species', text_auto=True)
fig.update_traces(textfont_size=12, textangle=0, textposition="outside", cliponaxis=False)
fig.show()

Dolphin Vs Whale
Comparing whales and dolphins, there are more whale samples.
data = df['class'].value_counts().rename_axis('class').reset_index(name='count')  # explicit column names, independent of the pandas version
fig = px.bar(data, x='class', y='count', color='count', title='Whale Vs Dolphin', text_auto=True)
fig.update_traces(textfont_size=12, textangle=0, textposition="outside", cliponaxis=False)
fig.show()

ImageSize Vs Class
The image sizes of whales and dolphins are distributed similarly, except for some height values.
fig = px.histogram(df,
                   x="width",
                   color="class",
                   barmode='group',
                   log_y=True,
                   title='Width Vs Class')
fig.show()  # fig.show() returns None, so wrapping it in display() is unnecessary
fig = px.histogram(df,
                   x="height",
                   color="class",
                   barmode='group',
                   log_y=True,
                   title='Height Vs Class')
fig.show()

ImageSize Vs Split(Train/Test)
The width distributions of the training and test data look very similar, so we can resize the images freely.
For height, there are some differences.
fig = px.histogram(pd.concat([df, test_df]),
                   x="width",
                   color="split",
                   barmode='group',
                   log_y=True,
                   title='Width Vs Split')
fig.show()
fig = px.histogram(pd.concat([df, test_df]),
                   x="height",
                   color="split",
                   barmode='group',
                   log_y=True,
                   title='Height Vs Split')
fig.show()

Data visualization
Reading data: 1. Subclass Dataset and override three methods (__init__, __len__, __getitem__); it returns one sample at a time. 2. Wrap it in a DataLoader, which yields batch_size samples at a time.
def load_image(path):
    img = cv2.imread(path)  # OpenCV loads images in BGR order
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    return img

class ImageDataset(Dataset):
    def __init__(self,
                 path,
                 target=None,
                 input_shape=(128, 256),
                 transform=None,
                 channel_first=True,
                 ):
        super(ImageDataset, self).__init__()
        self.path = path
        self.target = target
        self.input_shape = input_shape
        self.transform = transform
        self.channel_first = channel_first

    def __len__(self):
        return len(self.path)

    def __getitem__(self, idx):
        img = load_image(self.path[idx])
        img = cv2.resize(img, dsize=self.input_shape)  # note: cv2.resize expects dsize as (width, height)
        if self.transform is not None:
            img = self.transform(image=img)["image"]
        if self.channel_first:
            img = img.transpose((2, 0, 1))  # HWC -> CHW
        if self.target is not None:
            target = self.target[idx]
            return img, target
        else:
            return img

def get_dataset(path, target=None, batch_size=32, input_shape=(224, 224)):
    dataset = ImageDataset(path=path,
                           target=target,
                           input_shape=input_shape,
                           )
    dataloader = DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=2,
        shuffle=False,
        pin_memory=True,
    )
    return dataloader

train_loader = get_dataset(path=df.image_path.tolist(),
                           target=df.species.tolist(),
                           input_shape=(224,224),
                           )
test_loader = get_dataset(path=test_df.image_path.tolist(),
                          target=None,
                          input_shape=(224,224),
                          )
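To make the division of labor concrete, here is a torch-free sketch of the two pieces: a map-style dataset that returns one sample per index, and a loop that groups consecutive samples into batches. ToyDataset and iter_batches are hypothetical names used only for illustration. The real DataLoader additionally collates samples into tensors and can load them in parallel worker processes.

```python
class ToyDataset:
    """Minimal map-style dataset: index in, one sample (and target) out."""

    def __init__(self, paths, targets=None):
        self.paths = paths
        self.targets = targets

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        if self.targets is not None:
            return self.paths[idx], self.targets[idx]
        return self.paths[idx]


def iter_batches(dataset, batch_size):
    """Yield consecutive lists of samples, in order (like shuffle=False, without collation)."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]


ds = ToyDataset(['a.jpg', 'b.jpg', 'c.jpg'],
                ['beluga_whale', 'killer_whale', 'bottlenose_dolphin'])
batches = list(iter_batches(ds, batch_size=2))
print(len(batches))   # 2
print(batches[-1])    # [('c.jpg', 'bottlenose_dolphin')]
```

As in this sketch, the last batch of a DataLoader is smaller than batch_size when the dataset size is not a multiple of it, unless drop_last=True is set.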
Inspect the training set images
def plot_batch(batch, row=2, col=2, channel_first=True):
    if isinstance(batch, tuple) or isinstance(batch, list):
        imgs, tars = batch
    else:
        imgs, tars = batch, None
    plt.figure(figsize=(col*3, row*3))
    for i in range(row*col):
        plt.subplot(row, col, i+1)
        img = imgs[i].numpy()
        if channel_first:
            img = img.transpose((1, 2, 0))  # CHW -> HWC for matplotlib
        plt.imshow(img)  # use plt.imshow((img * 255).astype(np.uint8)) if the values are in the 0-1 range
        if tars is not None:
            plt.title(tars[i])
        plt.axis('off')
    plt.tight_layout()
    plt.show()

def gen_colors(n=10):
    cmap = plt.get_cmap('rainbow')
    colors = [cmap(i) for i in np.linspace(0, 1, n + 2)]
    colors = [(c[2] * 255, c[1] * 255, c[0] * 255) for c in colors]
    return colors

batch = next(iter(train_loader))  # the iterator's .next() method is not available in Python 3
plot_batch(batch, row=2, col=5)

Test set images
batch = next(iter(test_loader))
plot_batch(batch, row=2, col=5)

