I'm currently trying to use PyTorch's DataLoader to process data to feed into my deep learning model, but am facing some difficulty.
The data that I need is of shape (minibatch_size=32, rows=100, columns=41). The __getitem__ code that I have within the custom Dataset class I wrote looks something like this:
def __getitem__(self, idx):
    x = np.array(self.train.iloc[idx:100, :])
    return x
The reason I wrote it that way is that I want the DataLoader to handle one input instance of shape (100, 41) at a time, with 32 of these single instances making up one minibatch.
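For context, here is roughly how the Dataset and DataLoader are wired up. This is a simplified sketch; the class name, file name, and the __len__ implementation are placeholders rather than my exact code:

import numpy as np
import pandas as pd
from torch.utils.data import Dataset, DataLoader

class TimeSeriesDataset(Dataset):
    # self.train is a pandas DataFrame holding the raw time-series table (41 columns)
    def __init__(self, csv_path):
        self.train = pd.read_csv(csv_path)

    def __len__(self):
        return len(self.train)

    def __getitem__(self, idx):
        # same slicing as in the snippet above
        x = np.array(self.train.iloc[idx:100, :])
        return x

dataset = TimeSeriesDataset("train.csv")        # "train.csv" is a placeholder path
loader = DataLoader(dataset, batch_size=32)     # hoping for batches of shape (32, 100, 41)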
However, I noticed that, contrary to my initial belief, the idx argument the DataLoader passes to __getitem__ is not sequential (this is crucial because my data is time-series data). For example, printing the values gave me something like this:
idx = 206000
idx = 113814
idx = 80597
idx = 3836
idx = 156187
idx = 54990
idx = 8694
idx = 190555
idx = 84418
idx = 161773
idx = 177725
idx = 178351
idx = 89217
idx = 11048
idx = 135994
idx = 15067
Is this normal behavior? I'm posting this question because the batches being returned are not what I originally intended.
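For reference, here is a stripped-down reproduction I put together; the dummy dataset just prints whatever idx it receives, and the shuffle flag is toggled to see how it affects the order (the sizes are arbitrary placeholders):

import torch
from torch.utils.data import Dataset, DataLoader

class DummyDataset(Dataset):
    # Minimal dataset that only reports the index it is asked for
    def __len__(self):
        return 10

    def __getitem__(self, idx):
        print("idx =", idx)
        return torch.zeros(100, 41)

for shuffle in (False, True):
    print("shuffle =", shuffle)
    for _ in DataLoader(DummyDataset(), batch_size=5, shuffle=shuffle):
        # with shuffle=False the printed idx values come out 0..9 in order;
        # with shuffle=True they come out permuted
        pass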
The original logic that I used to preprocess the data before using the DataLoader was:

- Read the data in from either a txt or csv file.
- Calculate how many batches are in the data and slice the data accordingly. For example, since one input instance is of shape (100, 41) and 32 of these form one minibatch, we usually end up with around 100 or so batches and reshape the data accordingly.
- One input to the model is of shape (32, 100, 41) (see the sketch right after this list).
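Roughly, that old preprocessing looked like this; the file name is a placeholder and the handling of leftover rows is simplified:

import numpy as np
import pandas as pd

data = pd.read_csv("train.csv").to_numpy()                # placeholder path; shape (num_rows, 41)

rows_per_instance = 100                                    # one input instance is (100, 41)
instances_per_batch = 32                                   # 32 instances form one minibatch
rows_per_batch = rows_per_instance * instances_per_batch   # 3200 rows per minibatch

num_batches = len(data) // rows_per_batch                  # usually around 100 or so batches
data = data[: num_batches * rows_per_batch]                # drop leftover rows that don't fill a batch

# reshape so that one minibatch is (32, 100, 41)
batches = data.reshape(num_batches, instances_per_batch, rows_per_instance, data.shape[1])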
I'm not sure how else I should be handling the DataLoader hook methods. Any tips or advice are greatly appreciated. Thanks in advance.
? "we usually end up with around 100" do you mean your dataset has 32*100 sample? – enamoria(100, 40)
, and there are 32 of those that form one minibatch. – Sean