The torchvision package provides easy access to commonly used datasets. You would use them like this:

import torch
import torchvision
import torchvision.transforms as transforms

# Any transform pipeline works here; ToTensor() is the minimal choice.
transform = transforms.ToTensor()

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                         shuffle=False, num_workers=2)

Apparently, you can only switch between train=True and train=False. The docs explain:

train (bool, optional) – If True, creates dataset from training.pt, otherwise from test.pt.

But this goes against the common practice of a three-way split into training, validation, and test sets. For serious work, I need another DataLoader with a validation set. It would also be nice to specify the split proportions myself. The docs don't say what percentage of the dataset is reserved for testing, and I might want to change that.

I assume that this is a conscious design decision. Everyone working on one of these datasets is supposed to use the same testset. That makes results comparable. But I still need to get a validation set out of the trainloader. Is it possible to split a DataLoader into two separate streams of data?

1 Answer

In the meantime, I found the function random_split. You don't split the DataLoader; you split the Dataset and then wrap each part in its own DataLoader:

torch.utils.data.random_split(dataset, lengths)
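
To make this concrete, here is a minimal sketch of how that call might be used to carve a validation set out of the training data. The 80/20 proportion, the seed, and the names full_train, train_subset, val_subset, and valloader are my own assumptions for illustration, and a small random TensorDataset stands in for the CIFAR10 trainset so the snippet runs without a download.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Stand-in for the CIFAR10 training set: 100 fake 3x32x32 images with labels 0..9.
# In the question's code this would be `trainset`.
full_train = TensorDataset(torch.randn(100, 3, 32, 32),
                           torch.randint(0, 10, (100,)))

# Hypothetical 80/20 split; the lengths must sum to len(full_train).
n_val = len(full_train) // 5
n_train = len(full_train) - n_val

# A seeded generator makes the split reproducible across runs.
generator = torch.Generator().manual_seed(0)
train_subset, val_subset = random_split(full_train, [n_train, n_val],
                                        generator=generator)

# Each Subset is itself a Dataset, so it can be wrapped in its own DataLoader.
trainloader = DataLoader(train_subset, batch_size=4, shuffle=True)
valloader = DataLoader(val_subset, batch_size=4, shuffle=False)
```

Note that both subsets are views onto the same underlying dataset, so they share whatever transform that dataset was created with. If the validation data should skip training-time augmentation, that has to be handled separately.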