[PyTorch study notes] 8. How to use WeightedRandomSampler (weighted sampler) when training on class-imbalanced data
2022-08-05 06:03:00 【takedachia】
When dealing with class-imbalanced data (Imbalanced Data), for example a binary classification dataset in which 90% of the samples are labeled positive and 10% negative, training a model on it directly will leave the model's prediction accuracy (Accuracy) hovering around 90%.
Of course it does. First, what the model learns is essentially the features of the positive class, so it judges everything to be positive. Second, when it is then given a validation set that is also 90% positive, a model that predicts everything as positive reaches 90% accuracy.
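To make the 90% figure concrete, here is a minimal sketch. The labels and the always-positive "model" are made up for illustration; they are not from a real training run.

```python
import numpy as np

# Hypothetical validation labels: 90% positive (1), 10% negative (0).
y_true = np.array([1] * 900 + [0] * 100)

# A degenerate "model" that always predicts the positive class.
y_pred = np.ones_like(y_true)

accuracy = float((y_pred == y_true).mean())

# Recall on the negative class: fraction of true negatives that were found.
negative_recall = float(((y_pred == 0) & (y_true == 0)).sum() / (y_true == 0).sum())

print(accuracy)         # 0.9
print(negative_recall)  # 0.0
```

The accuracy looks fine, yet not a single negative sample is recognized.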
Training on class-imbalanced data
For class-imbalanced data, both training and evaluating a model call for their own methods. When evaluating, we usually rely on the confusion matrix, F1-score, and PR curve rather than on accuracy or the ROC curve (for the difference between ROC and PR curves, see my article on that topic).
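As a rough illustration of why these metrics are preferred, the sketch below computes the confusion-matrix counts and the F1-score by hand with NumPy. The labels, the predictions, and the helper name `binary_metrics` are all hypothetical.

```python
import numpy as np

# Hypothetical labels and predictions: 8 samples of class 1, 2 of class 0.
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 1, 1, 1, 1, 0, 1, 0])

def binary_metrics(y_true, y_pred, positive):
    """Confusion-matrix counts and scores, treating `positive` as the class of interest."""
    tp = int(np.sum((y_pred == positive) & (y_true == positive)))
    fp = int(np.sum((y_pred == positive) & (y_true != positive)))
    fn = int(np.sum((y_pred != positive) & (y_true == positive)))
    tn = int(np.sum((y_pred != positive) & (y_true != positive)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}

accuracy = float(np.mean(y_true == y_pred))
minority = binary_metrics(y_true, y_pred, positive=0)  # score the minority class

print(accuracy)        # 0.8
print(minority["f1"])  # 0.5
```

Accuracy is 0.8, but the F1-score for the minority class is only 0.5, which is the weakness that accuracy hides.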
This article discusses the methods to use when training on imbalanced data.
To get better results from training, we usually arrange for the model to see the sample classes in a 1:1 ratio (in the binary classification case).
This requires resampling the samples, which is mainly divided into oversampling (drawing minority-class samples repeatedly) and undersampling (drawing only part of the majority class).
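The two approaches can be sketched on index arrays with plain NumPy. This is only an illustration with made-up labels, not the method used later in this article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labels: 900 majority-class (0) and 100 minority-class (1) samples.
labels = np.array([0] * 900 + [1] * 100)
majority_idx = np.flatnonzero(labels == 0)
minority_idx = np.flatnonzero(labels == 1)

# Oversampling: draw minority indices with replacement until they match the majority.
oversampled_minority = rng.choice(minority_idx, size=len(majority_idx), replace=True)
oversampled_idx = np.concatenate([majority_idx, oversampled_minority])

# Undersampling: keep only a random subset of the majority, without replacement.
undersampled_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
undersampled_idx = np.concatenate([undersampled_majority, minority_idx])

print(np.bincount(labels[oversampled_idx]))   # [900 900]
print(np.bincount(labels[undersampled_idx]))  # [100 100]
```

Oversampling keeps all the data but repeats minority samples; undersampling avoids repeats but throws away most of the majority class.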
PyTorch weighted sampler WeightedRandomSampler
How do we implement resampling? PyTorch provides a weighted sampler, torch.utils.data.WeightedRandomSampler.
Take a quick look at the documentation first.
A few points about this weighted sampler need explaining:
①First, pass in the weight information weights. These are the weights of every sample in the full set (the entire dataset to be sampled).
So the length of the vector passed in should equal the length of the dataset.
Each weight reflects how likely that sample is to be picked; the weights do not need to sum to 1. As you would expect, all samples of the same class should get the same weight, namely the reciprocal of that class's proportion. For example, if the positive class makes up 20%, its weight should be set to 1/20% = 5, and the weight of the negative-class samples should be 1/80% = 1.25.
②The second parameter, num_samples, specifies how many samples to draw.
③replacement specifies whether to sample with replacement. Sampling with replacement is the usual choice: the distribution being sampled from stays the same from draw to draw, and minority-class samples can be drawn repeatedly so that their number balances that of the majority class.
④Note that this sampler is passed to DataLoader(); it tells DataLoader() how to sample from the dataset before reading it. Once a sampler is passed in, the DataLoader can no longer be asked to shuffle (shuffle=True), because the two options are mutually exclusive.
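The four points above can be checked with a small sketch. The 20%/80% labels are made up to match the example in point ①; the variable names are my own.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Hypothetical labels: 20% positive (1), 80% negative (0).
labels = torch.tensor([1] * 200 + [0] * 800)

# Weight of each class = reciprocal of its proportion: 1/0.8 = 1.25, 1/0.2 = 5.
class_counts = torch.bincount(labels)               # tensor([800, 200])
class_weights = len(labels) / class_counts.float()  # tensor([1.2500, 5.0000])
sample_weights = class_weights[labels]              # one weight per sample

sampler = WeightedRandomSampler(sample_weights, num_samples=len(labels), replacement=True)

# Iterating over a sampler yields dataset indices, not the data itself.
indices = torch.tensor(list(sampler))
drawn = torch.bincount(labels[indices], minlength=2)
print(drawn)  # roughly 500 of each class

# Passing a sampler together with shuffle=True is rejected by DataLoader.
try:
    DataLoader(TensorDataset(labels), sampler=sampler, shuffle=True)
    shuffle_allowed = True
except ValueError:
    shuffle_allowed = False
print(shuffle_allowed)  # False
```

Note that the sampler never touches the data: it only decides which indices the DataLoader will read.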
However, the Example given in the official documentation is really poor, and it is hard to tell from it what is going on.
Both the sampler and the dataset ultimately serve the DataLoader. Starting from the labels of the original data, assign each sample its weight and pass the weights to the sampler (you also need to set the number of samples and whether to sample with replacement); the raw data itself is converted into a dataset.
Once the sampler and dataset are passed to the DataLoader, the DataLoader uses the dataset as the blueprint, samples from it according to the rule specified by the sampler, and outputs the sampled data as a generator in batches of batch_size.
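Conceptually, this flow looks roughly like the loop below. It is a simplified sketch on a made-up 8-sample dataset, not the actual DataLoader implementation, which additionally handles workers, collation, and more.

```python
import torch
from torch.utils.data import BatchSampler, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Hypothetical toy dataset: 8 samples with 2 features each, six 0s and two 1s.
features = torch.arange(16, dtype=torch.float).reshape(8, 2)
labels = torch.tensor([0, 0, 0, 0, 0, 0, 1, 1])
dataset = TensorDataset(features, labels)

# Per-sample weights: reciprocal of each class's proportion (8/6 and 8/2).
weights = torch.tensor([8 / 6, 8 / 2])[labels]
sampler = WeightedRandomSampler(weights, num_samples=8, replacement=True)

# Roughly what DataLoader(dataset, sampler=sampler, batch_size=4) does:
batches = []
for batch_indices in BatchSampler(sampler, batch_size=4, drop_last=False):
    samples = [dataset[i] for i in batch_indices]  # look each sample up by index
    X = torch.stack([x for x, _ in samples])       # stack features into a batch
    y = torch.stack([t for _, t in samples])       # stack labels into a batch
    batches.append((batch_indices, X, y))
    print(batch_indices, y.tolist())
```

The sampler chooses the indices, the dataset answers the index lookups, and the batching step groups the results by batch_size.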
I wrote a demo based on this flow myself. Looking at a batch of its output at the end, you can see that the proportions of 0 and 1 are balanced.
(The code below can also be viewed directly on my GitHub)
①Generate artificial data
import pandas as pd
import numpy as np
df_y_neg = pd.DataFrame(np.random.randint(0,1,size=900)) # generate 900 zeros
df_y_pos = pd.DataFrame(np.random.randint(1,2,size=100)) # generate 100 ones
df_y = pd.concat([df_y_neg, df_y_pos], ignore_index=True).sample(frac=1).reset_index(drop=True) # shuffle to produce the label column y
df_X = pd.DataFrame(np.random.normal(1, 0.1, size=(1000, 10))) # generate 1000 samples with 10 features each
df = pd.concat([df_X, df_y], axis=1) # put them together to form the dataset
df.columns = ['x'+str(n) for n in range(10)] + ['y'] # name the columns
②Calculate the weights
num_pos = df.loc[df['y'] == 1].shape[0]
num_neg = df.loc[df['y'] == 0].shape[0]
pos_weight = (num_pos + num_neg) / num_pos # reciprocal of the positive-class proportion
neg_weight = (num_pos + num_neg) / num_neg # reciprocal of the negative-class proportion
print(num_pos, num_neg)
print(pos_weight, neg_weight) # the computed weights
③Add weight column
df['y_weight'] = df['y'].apply(lambda x : pos_weight if x==1 else neg_weight) # Add weight column
import torch
data_y_w = torch.tensor(df['y_weight'].to_numpy(), dtype=torch.float) # Note: convert the DataFrame column to numpy with to_numpy() before passing it to torch.tensor, otherwise an error is raised
num_samples = df.shape[0] # total number of samples to draw; can equal the dataset size, or be set to something else
# Define the sampler: pass in the prepared weights, the total number of samples, and choose sampling with replacement
sampler = torch.utils.data.sampler.WeightedRandomSampler(data_y_w, num_samples, replacement=True)
data_features = torch.tensor(df.iloc[:, :-1].values, dtype=torch.float) # drop the weight column and convert the data to a tensor
data_features_X = data_features[:,:-1] # take X
data_features_y = data_features[:, -1].long() # take y and convert the labels to long
# TensorDataset packs tensors together, similar to zip. The result has the form features + label.
dataset = torch.utils.data.TensorDataset(data_features_X, data_features_y)
# Build the DataLoader
batch_size = 64
data_iter = torch.utils.data.DataLoader(
    dataset = dataset, sampler = sampler, batch_size = batch_size)
# View a batch of data
X, y = next(iter(data_iter))
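To check the balance numerically rather than by eye, here is a compact, self-contained variant of the demo that counts the labels over one full pass. It builds the same kind of data directly in torch instead of pandas; the names `X_all`, `y_all`, and `loader` are my own for this sketch.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Same shape as the demo data: 1000 samples, 10 features, 900 zeros and 100 ones.
X_all = 1.0 + 0.1 * torch.randn(1000, 10)
y_all = torch.cat([torch.zeros(900), torch.ones(100)]).long()

# Per-sample weight = reciprocal of the sample's class proportion.
class_weights = len(y_all) / torch.bincount(y_all).float()
sampler = WeightedRandomSampler(class_weights[y_all], num_samples=len(y_all), replacement=True)
loader = DataLoader(TensorDataset(X_all, y_all), sampler=sampler, batch_size=64)

# Count the labels in every batch and accumulate over the whole pass.
total = torch.zeros(2, dtype=torch.long)
for X, y in loader:
    total += torch.bincount(y, minlength=2)

print(total)  # close to 500 / 500, versus 900 / 100 in the raw data
```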
Resampling during training
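When training a model, we generally iterate over the DataLoader once per epoch:
for epoch in range(num_epochs):
    for X, y in data_iter:
        ...  # training step for one batch
So the sampling result is different in every epoch, which ensures that over many epochs as much of the data as possible gets sampled.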
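This can be observed by iterating over the sampler several times and tracking which dataset indices have been seen so far. A minimal sketch, with made-up 900/100 labels and my own variable names:

```python
import torch
from torch.utils.data import WeightedRandomSampler

torch.manual_seed(0)

# Hypothetical labels: 900 zeros and 100 ones.
labels = torch.cat([torch.zeros(900), torch.ones(100)]).long()
class_weights = len(labels) / torch.bincount(labels).float()
sampler = WeightedRandomSampler(class_weights[labels], num_samples=len(labels), replacement=True)

seen = set()
epoch_draws = []
coverage_per_epoch = []
for epoch in range(5):
    epoch_indices = list(sampler)  # each pass over the sampler re-draws the indices
    epoch_draws.append(epoch_indices)
    seen.update(epoch_indices)
    coverage_per_epoch.append(len(seen) / len(labels))

print(epoch_draws[0] != epoch_draws[1])  # True: the draws differ between epochs
print(coverage_per_epoch)                # fraction of the dataset seen so far
```

With replacement, a single epoch misses a sizable share of the majority-class samples, so coverage of the dataset only builds up over several epochs.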
