The torchvision package provides easy access to commonly used datasets. You would use them like this:
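import torch
import torchvision
import torchvision.transforms as transforms

# `transform` is not defined in the snippet; ToTensor() is an assumed placeholder
transform = transforms.ToTensor()

# Training split
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=2)
# Test split
testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                         shuffle=False, num_workers=2)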
Apparently, you can only switch between train=True and train=False. The docs explain:
train (bool, optional) – If True, creates dataset from training.pt, otherwise from test.pt.
But this goes against the common practice of a three-way split. For serious work, I need another DataLoader with a validation set. It would also be nice to specify the split proportions myself. The docs don't say what percentage of the dataset is reserved for testing, and I might want to change that.
I assume that this is a conscious design decision: everyone working on one of these datasets is supposed to use the same test set, which makes results comparable. But I still need to get a validation set out of the trainloader. Is it possible to split a DataLoader into two separate streams of data?
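To make the goal concrete, here is a rough sketch of the kind of split I have in mind. It splits the underlying dataset rather than the DataLoader itself, using torch.utils.data.random_split, and I am not sure this is the intended approach. The TensorDataset is a synthetic stand-in for the CIFAR10 training set so the snippet runs without a download, and val_fraction is a name and value I made up for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Stand-in for the CIFAR10 training set (assumption: 100 fake 3x32x32 images
# with integer labels), so the sketch runs without downloading anything.
full_train = TensorDataset(torch.randn(100, 3, 32, 32),
                           torch.randint(0, 10, (100,)))

val_fraction = 0.2  # hypothetical proportion; choose whatever you need
n_val = int(len(full_train) * val_fraction)
n_train = len(full_train) - n_val

# Fixed seed so the same samples land in each subset on every run.
generator = torch.Generator().manual_seed(0)
train_subset, val_subset = random_split(full_train, [n_train, n_val],
                                        generator=generator)

# Each subset gets its own loader; only the training loader shuffles.
trainloader = DataLoader(train_subset, batch_size=4, shuffle=True)
valloader = DataLoader(val_subset, batch_size=4, shuffle=False)
```

With the real data, full_train would presumably be the trainset object from above, and the official test set would stay untouched.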