How to define the __len__ method for a PyTorch DataLoader when my datasets have different lengths?

Question

I'm currently loading in my data with one single dataset class. Within the dataset, I split the train, test, and validation data separately. For example:

class Data():
    def __init__(self):
        self.load()

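    def load(self):
        # file_name, checkpoint, and halfway are set elsewhere (omitted here)
        with open(file=file_name, mode='r') as f:
            self.data = f.readlines()

        # Manual split into train / validation / test
        self.train = self.data[:checkpoint]
        self.valid = self.data[checkpoint:halfway]
        self.test = self.data[halfway:]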

Many of the details have been omitted for the sake of readability. Basically, I read in one big dataset and make the splits manually.

My question is how to override the __len__ method when the lengths of my train, valid, and test data all differ.

I want to keep the split data in one single class, but I also want to create a separate DataLoader for each split, so something like:

def __len__(self):
    return len(self.train)

wouldn't be appropriate for self.test and self.valid.

Perhaps I'm fundamentally misunderstanding the DataLoader, but how should I approach this issue? Thanks in advance.

Giorgos Myrianthous · Accepted Answer · 2019-12-08T13:40:15

I think the most appropriate way to get the length of each split is to simply use:

# Number of training points
len(self.train)

# Number of testing points
len(self.test)

# Number of validation points
len(self.valid)

Alternatively, to get the split lengths from a particular instance of your class:

data = Data()
print(len(data.train))
print(len(data.test))
print(len(data.valid))

__len__ defines what len() returns for your object, so you decide how its elements are counted. Therefore, I would implement it to return the size of the full dataset, and use the aforementioned calls to get the counts per split:

def __len__(self):
    return len(self.data)
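
If each split also needs its own loader, one option is to wrap every split in a small object that reports its own length. What follows is only a minimal sketch of that idea, not the asker's actual code: the SplitView class, the constructor parameters of Data, and the generated sample lines are all hypothetical and stand in for the omitted file loading.

```python
class Data:
    """Holds the full dataset plus its train/valid/test splits."""

    def __init__(self, lines, checkpoint, halfway):
        # Hypothetical parameters: they replace the file reading and the
        # split indices that the question leaves out.
        self.data = list(lines)
        self.train = self.data[:checkpoint]
        self.valid = self.data[checkpoint:halfway]
        self.test = self.data[halfway:]

    def __len__(self):
        # Length of the whole dataset, as suggested in the answer.
        return len(self.data)


class SplitView:
    """Hypothetical wrapper exposing one split through __len__ and __getitem__."""

    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        # Each view reports the length of its own split only.
        return len(self.samples)

    def __getitem__(self, index):
        return self.samples[index]


# Made-up sample data: ten lines split 6 / 2 / 2.
data = Data([f"line {i}" for i in range(10)], checkpoint=6, halfway=8)
train_view = SplitView(data.train)
valid_view = SplitView(data.valid)
test_view = SplitView(data.test)

print(len(data), len(train_view), len(valid_view), len(test_view))
```

Because a map-style dataset in PyTorch only needs __getitem__ and __len__, each view should be usable as the dataset argument of its own torch.utils.data.DataLoader (for example DataLoader(train_view, batch_size=32, shuffle=True)), while the Data object keeps all splits in one place.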
