PyTorch provides convenient API interfaces for many common public datasets. But when we need to train a neural network on our own data, we have to define a custom dataset. PyTorch provides some classes that make it easy to define our own dataset:
- torch.utils.data.Dataset: every subclass that inherits from it should override the two methods __len__() and __getitem__()
  - __len__(): returns the number of samples in the dataset
  - __getitem__(): returns one sample, with support for subscript indexing
- torch.utils.data.DataLoader: wraps a dataset; you can set batch_size, whether to shuffle, and so on
Step 1
Every custom Dataset needs to inherit from the torch.utils.data.Dataset class and override two of its member methods:
- __getitem__(): reads the data and returns one sample and its label
- __len__(): returns the length of the dataset
# Template only: replace "..." and the placeholder names data, label and total with your own logic
from torch.utils.data import Dataset


class AudioDataset(Dataset):
    def __init__(self, ...):
        """Class initialization."""
        pass

    def __getitem__(self, item):
        """Read one sample per call and return the data and its label."""
        return data, label

    def __len__(self):
        """Return the length of the whole dataset."""
        return total

Note: a Dataset is only responsible for abstracting the data. Each call to __getitem__ returns just one sample.
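To make the template concrete, here is a minimal runnable sketch. The class name SquaresDataset and its contents are hypothetical and only for illustration: sample i is simply the pair (i, i squared), so it is easy to see that indexing returns exactly one sample.

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Hypothetical toy dataset: sample i is the pair (i, i**2)."""

    def __init__(self, n):
        self.n = n

    def __getitem__(self, index):
        # Raise IndexError for out-of-range indices so plain iteration terminates
        if index not in range(self.n):
            raise IndexError(index)
        x = torch.tensor([float(index)])
        y = torch.tensor([float(index ** 2)])
        return x, y

    def __len__(self):
        return self.n


toy_set = SquaresDataset(5)
x, y = toy_set[3]        # one call to __getitem__, one sample
print(len(toy_set))      # 5
print(x, y)              # tensor([3.]) tensor([9.])
```

Batching is not handled here at all; that is left to DataLoader.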
Example:
Directory structure
- p225
  - ***.wav
  - ***.wav
  - ***.wav
  - ...
- dataset.py
Goal: read the audio data in the p225 folder
import fnmatch
import os

import librosa
import numpy as np
from torch.utils.data import Dataset


class AudioDataset(Dataset):
    def __init__(self, data_folder, sr=16000, dimension=8192):
        self.data_folder = data_folder
        self.sr = sr
        self.dim = dimension

        # Get the list of audio file paths
        self.wav_list = []
        for root, dirnames, filenames in os.walk(data_folder):
            # fnmatch.filter keeps only the file names that match "*.wav"
            for filename in fnmatch.filter(filenames, "*.wav"):
                self.wav_list.append(os.path.join(root, filename))

    def __getitem__(self, item):
        # Read one audio file and return its data
        filename = self.wav_list[item]
        wb_wav, _ = librosa.load(filename, sr=self.sr)

        # Take one frame
        if len(wb_wav) >= self.dim:
            max_audio_start = len(wb_wav) - self.dim
            # randint excludes the upper bound, so +1 keeps the last valid start
            # (and avoids an error when the file is exactly self.dim samples long)
            audio_start = np.random.randint(0, max_audio_start + 1)
            wb_wav = wb_wav[audio_start: audio_start + self.dim]
        else:
            wb_wav = np.pad(wb_wav, (0, self.dim - len(wb_wav)), "constant")

        return wb_wav, filename

    def __len__(self):
        # Total number of audio files
        return len(self.wav_list)

Note on the "take one frame" branch in __getitem__: the audio files all have different lengths. If we read the data and returned it directly, batching would fail with a dimension mismatch error. So each call takes one audio file and reads only one frame of length dimension from it (zero-padding files that are too short). This obviously does not use all of the speech data.
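The crop-or-pad step can be checked in isolation. The following is a small sketch using NumPy only; the helper name crop_or_pad and the synthetic arrays are assumptions made for illustration and are not part of the original script. It shows that long, short, and exact-length inputs all come out with the same length.

```python
import numpy as np


def crop_or_pad(wav, dim=8192, rng=None):
    """Return exactly `dim` samples: a random crop if long enough, else zero-padded."""
    rng = np.random.default_rng() if rng is None else rng
    if len(wav) >= dim:
        max_start = len(wav) - dim
        # integers() excludes the upper bound, so +1 allows the last valid start
        start = int(rng.integers(0, max_start + 1))
        return wav[start:start + dim]
    return np.pad(wav, (0, dim - len(wav)), "constant")


long_wav = np.arange(20000, dtype=np.float32)   # longer than one frame
short_wav = np.ones(5000, dtype=np.float32)     # shorter than one frame
exact_wav = np.zeros(8192, dtype=np.float32)    # exactly one frame

print(crop_or_pad(long_wav).shape)   # (8192,)
print(crop_or_pad(short_wav).shape)  # (8192,)
print(crop_or_pad(exact_wav).shape)  # (8192,)
```

Because every sample now has the same shape, the samples can be stacked into a batch without a dimension mismatch.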
Step 2
Instantiate the Dataset object
train_set = AudioDataset("./p225", sr=16000)
If you want to read the data in batches, skip to Step 3. If you want to read the samples one by one, see the following:
# Instantiate the AudioDataset object
train_set = AudioDataset("./p225", sr=16000)

for i, data in enumerate(train_set):
    wb_wav, filename = data
    print(i, wb_wav.shape, filename)

    if i == 3:
        break
    # 0 (8192,) ./p225\p225_001.wav
    # 1 (8192,) ./p225\p225_002.wav
    # 2 (8192,) ./p225\p225_003.wav
    # 3 (8192,) ./p225\p225_004.wav
Step 3
If you want to read the data in batches, you need to wrap the dataset in a DataLoader
Why use DataLoader?
- The input to deep learning models comes in mini-batches
- Samples may need to be loaded in random order (the shuffle operation)
- Sample loading often needs to run in parallel, in multiple worker processes
The DataLoader provided by PyTorch encapsulates all of the above, which makes it more convenient to use.
DataLoader(dataset, batch_size=1, shuffle=False, sampler=None, batch_sampler=None, num_workers=0, collate_fn=default_collate, pin_memory=False, drop_last=False)
Parameters:
- dataset: the dataset to load from (a Dataset object)
- batch_size: how many samples to load per batch (default: 1)
- shuffle: whether to reshuffle the data at every epoch
- sampler: defines the strategy for drawing samples from the dataset. If specified, shuffle must not be set.
- batch_sampler: like sampler, but returns one batch of indices at a time. Mutually exclusive with batch_size, shuffle, sampler and drop_last.
- num_workers: number of subprocesses used for data loading; 0 means the data is loaded in the main process
- collate_fn: how to merge multiple samples into one batch; the default merging method is usually fine
- pin_memory: whether to place the data in pinned (page-locked) memory; transferring data from pinned memory to the GPU is faster
- drop_last: the number of samples in the dataset may not be an integer multiple of batch_size; when drop_last is True, the final incomplete batch is discarded
Returns: a data loader
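The effect of drop_last and the idea behind batch_sampler are easiest to see on a tiny dataset. This is a sketch under the assumption of a toy dataset of 10 integers built with TensorDataset (not the audio data from this article): with batch_size=4, the last batch holds only 2 samples unless drop_last=True.

```python
import torch
from torch.utils.data import BatchSampler, DataLoader, SequentialSampler, TensorDataset

# Toy dataset with 10 samples: 0, 1, ..., 9
toy_set = TensorDataset(torch.arange(10))

loader_keep = DataLoader(toy_set, batch_size=4, shuffle=False, drop_last=False)
loader_drop = DataLoader(toy_set, batch_size=4, shuffle=False, drop_last=True)

print(len(loader_keep))  # 3 batches: sizes 4, 4, 2
print(len(loader_drop))  # 2 batches: the incomplete one is discarded

# Each batch is a list with one tensor, because TensorDataset returns 1-tuples
sizes = [batch[0].shape[0] for batch in loader_keep]
print(sizes)             # [4, 4, 2]

# A batch sampler yields one list of indices per batch
index_batches = list(BatchSampler(SequentialSampler(toy_set), batch_size=4, drop_last=False))
print(index_batches)     # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

When a batch sampler like this is passed as batch_sampler, it already decides batch size, order, and whether the last batch is dropped, which is why it cannot be combined with batch_size, shuffle, sampler, or drop_last.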
Example:
# Instantiate the AudioDataset object
train_set = AudioDataset("./p225", sr=16000)
train_loader = DataLoader(train_set, batch_size=8, shuffle=True)

for (i, data) in enumerate(train_loader):
    wav_data, wav_name = data
    print(wav_data.shape)  # torch.Size([8, 8192])
    print(i, wav_name)
    # ('./p225\\p225_293.wav', './p225\\p225_156.wav', './p225\\p225_277.wav', './p225\\p225_210.wav',
    # './p225\\p225_126.wav', './p225\\p225_021.wav', './p225\\p225_257.wav', './p225\\p225_192.wav')
Here are a couple of examples to help digest all this:
Example 1
This is the example that has been used throughout this article; Example 1 just merges the pieces into one script.
Directory structure
- p225
  - ***.wav
  - ***.wav
  - ***.wav
  - ...
- dataset.py
Goal: read the audio data in the p225 folder
import fnmatch
import os

import librosa
import numpy as np
from torch.utils.data import Dataset
from torch.utils.data import DataLoader


class AudioDataset(Dataset):
    def __init__(self, data_folder, sr=16000, dimension=8192):
        self.data_folder = data_folder
        self.sr = sr
        self.dim = dimension

        # Get the list of audio file paths
        self.wav_list = []
        for root, dirnames, filenames in os.walk(data_folder):
            # fnmatch.filter keeps only the file names that match "*.wav"
            for filename in fnmatch.filter(filenames, "*.wav"):
                self.wav_list.append(os.path.join(root, filename))

    def __getitem__(self, item):
        # Read one audio file and return its data
        filename = self.wav_list[item]
        print(filename)
        wb_wav, _ = librosa.load(filename, sr=self.sr)

        # Take one frame
        if len(wb_wav) >= self.dim:
            max_audio_start = len(wb_wav) - self.dim
            # randint excludes the upper bound, so +1 keeps the last valid start
            # (and avoids an error when the file is exactly self.dim samples long)
            audio_start = np.random.randint(0, max_audio_start + 1)
            wb_wav = wb_wav[audio_start: audio_start + self.dim]
        else:
            wb_wav = np.pad(wb_wav, (0, self.dim - len(wb_wav)), "constant")

        return wb_wav, filename

    def __len__(self):
        # Total number of audio files
        return len(self.wav_list)


train_set = AudioDataset("./p225", sr=16000)
train_loader = DataLoader(train_set, batch_size=8, shuffle=True)


for (i, data) in enumerate(train_loader):
    wav_data, wav_name = data
    print(wav_data.shape)  # torch.Size([8, 8192])
    print(i, wav_name)
    # ('./p225\\p225_293.wav', './p225\\p225_156.wav', './p225\\p225_277.wav', './p225\\p225_210.wav',
    # './p225\\p225_126.wav', './p225\\p225_021.wav', './p225\\p225_257.wav', './p225\\p225_192.wav')

Notes:
- The "take one frame" branch in __getitem__: the audio files all have different lengths. If we read the data and returned it directly, batching would fail with a dimension mismatch error. So each call takes one audio file and reads only one frame from it, which obviously does not use all of the speech data.
- The print(wav_data.shape) line in the loop: in __getitem__ we never converted the NumPy array to a tensor, yet the printed data is already a tensor. This is worth noting: the DataLoader's default collate function converts NumPy arrays to tensors when it stacks the samples into a batch.
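The NumPy-to-tensor conversion can be reproduced without any audio files. The sketch below uses a hypothetical stand-in dataset (NumpyFrames, with all-zero frames and made-up file names) whose __getitem__ returns a NumPy array and a string, just like the audio dataset does.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class NumpyFrames(Dataset):
    """Hypothetical stand-in: returns (NumPy array, name) pairs like the audio dataset."""

    def __init__(self, n=6, dim=8):
        self.frames = np.zeros((n, dim), dtype=np.float32)

    def __getitem__(self, index):
        return self.frames[index], f"frame_{index}.wav"

    def __len__(self):
        return len(self.frames)


frame_set = NumpyFrames()
sample, name = frame_set[0]
batch, names = next(iter(DataLoader(frame_set, batch_size=3, shuffle=False)))

print(type(sample))              # a NumPy array straight from __getitem__
print(type(batch), batch.shape)  # a torch.Tensor of shape [3, 8] after batching
print(list(names))               # the strings are passed through unchanged
```

The arrays are stacked along a new first dimension and converted to a tensor by the default collate function, while the strings are simply grouped together.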
Example 2
Compared with Example 1, Example 2 is the important one, because we can't just read a single frame from one audio file and then move on to the next file. An audio clip usually contains many frames, and what we need is to read batch_size audio frames in sequence: start with the first audio file; if it yields enough frames for a batch, there is no need to read the second file yet; if it yields less than one batch, read the second audio file to make up the difference.
My suggestion: first read each audio file in order, split the speech into frames with a window length of 8192 and a frame shift of 4096, then concatenate the frames to get an array of shape (frame_num, frame_len, 1) and save it to an h5 file. Then use torch.utils.data.Dataset and torch.utils.data.DataLoader as described above to read the data.
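The framing step could look roughly like the following. This is a sketch, not the original script: the function name frame_signal is hypothetical, a synthetic signal stands in for a loaded wav file, and zero-padding signals shorter than one window is an assumption. With window length W and frame shift H, a signal of N samples (N >= W) gives 1 + (N - W) // H frames.

```python
import numpy as np


def frame_signal(wav, frame_len=8192, hop=4096):
    """Split a 1-D signal into overlapping frames of shape (frame_num, frame_len, 1)."""
    if len(wav) < frame_len:
        # Assumption: signals shorter than one window are zero-padded to one frame
        wav = np.pad(wav, (0, frame_len - len(wav)), "constant")
    frame_num = 1 + (len(wav) - frame_len) // hop
    frames = np.stack([wav[i * hop: i * hop + frame_len] for i in range(frame_num)])
    # Add the trailing channel axis
    return frames[..., np.newaxis].astype(np.float32)


wav = np.arange(24576, dtype=np.float32)  # synthetic signal, 3 * 8192 samples
frames = frame_signal(wav)
print(frames.shape)  # (5, 8192, 1), since 1 + (24576 - 8192) // 4096 = 5

# Frames from several files can then be joined along the first axis
all_frames = np.concatenate([frames, frame_signal(np.ones(100, dtype=np.float32))], axis=0)
print(all_frames.shape)  # (6, 8192, 1)
```

Trailing samples that do not fill a whole window are dropped in this sketch; padding the tail instead would be an equally reasonable choice.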
Implementation:
Step 1: create an H5_generation script that converts the data into an h5 file:
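The original script is not shown here, so the following is only a rough sketch of what writing such a file might look like with h5py. The dataset names "data" and "label" are chosen to match the loading code used in the next step; the random frames, the placeholder label (a plain copy of the data), and the temporary file path are all assumptions for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Stand-in for real framed audio: 10 frames of length 8192 with a channel axis
rng = np.random.default_rng(0)
data = rng.standard_normal((10, 8192, 1)).astype(np.float32)
label = data.copy()  # placeholder: how the real labels are built is not shown in the source

# Hypothetical output path in a temporary directory
h5_path = os.path.join(tempfile.mkdtemp(), "train_sketch.h5")

# Write both arrays under the names the loader expects
with h5py.File(h5_path, "w") as hf:
    hf.create_dataset("data", data=data)
    hf.create_dataset("label", data=label)

# Read them back to check the round trip
with h5py.File(h5_path, "r") as hf:
    keys = list(hf.keys())
    X = np.array(hf["data"], dtype=np.float32)
    Y = np.array(hf["label"], dtype=np.float32)

print(keys)
print(X.shape, Y.shape)  # (10, 8192, 1) (10, 8192, 1)
```

In a real script, data would come from framing every wav file and concatenating the results.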
Step 2: read the data from the h5 file through a Dataset
import numpy as np
from torch.utils.data import Dataset
from torch.utils.data import DataLoader
import h5py


def load_h5(h5_path):
    # Load the training data
    with h5py.File(h5_path, 'r') as hf:
        print('List of arrays in input file:', hf.keys())
        X = np.array(hf.get('data'), dtype=np.float32)
        Y = np.array(hf.get('label'), dtype=np.float32)
    return X, Y


class AudioDataset(Dataset):
    """Data loader"""
    def __init__(self, data_folder):
        # data_folder is the path to the h5 file
        self.data_folder = data_folder
        self.X, self.Y = load_h5(data_folder)  # (3392, 8192, 1)

    def __getitem__(self, item):
        # Return one frame of audio data and its label
        X = self.X[item]
        Y = self.Y[item]

        return X, Y

    def __len__(self):
        return len(self.X)


train_set = AudioDataset("./speaker225_resample_train.h5")
train_loader = DataLoader(train_set, batch_size=64, shuffle=True, drop_last=True)


for (i, wav_data) in enumerate(train_loader):
    X, Y = wav_data
    print(i, X.shape)
    # 0 torch.Size([64, 8192, 1])
    # 1 torch.Size([64, 8192, 1])
    # ...
I tried doing the h5 file work inside __init__, but that caused memory usage to explode, which is strange, so I had to give up on that.