0 votes

I'm currently training a convolutional network on the CIFAR-10 dataset. Let's say I have 50,000 training images and 5,000 validation images.

To get an initial sense of how successful the model will be, let's say I start by training on just 10,000 of the images.

I train for 40 epochs with a batch size of 128, i.e. every epoch I run my optimizer (SGD) to minimize the loss 10,000 / 128 ≈ 78 times, each time on a batch of 128 images.

Now, let's say I found a model that achieves 70% accuracy on the validation set. Satisfied, I move on to train on the full training set.

This time, every epoch I run the optimizer to minimize the loss 50,000 / 128 ≈ 391 times.
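(For reference, the step count is just the training-set size divided by the batch size; a throwaway sketch with the numbers from above:)

```python
import math

# Optimizer steps per epoch = training set size / batch size,
# rounded up when the final partial batch is kept.
batch_size = 128
for num_images in (10_000, 50_000):
    steps = math.ceil(num_images / batch_size)
    print(f"{num_images} images -> {steps} optimizer steps per epoch")
# 10000 images -> 79 optimizer steps per epoch (~78 if the partial batch is dropped)
# 50000 images -> 391 optimizer steps per epoch
```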

This makes me think that my accuracy at each epoch should be higher than on the limited set of 10,000 images. Much to my dismay, the accuracy on the limited training set increases much more quickly; at the end of the 40 epochs with the full training set, my accuracy is only 30%.

Thinking the data may be corrupt, I perform limited runs on training images 10-20k, 20-30k, 30-40k, and 40-50k. Surprisingly, each of these runs results in an accuracy of ~70%, close to the accuracy for images 0-10k.
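A sketch of that sanity check, assuming the data sits in arrays x_train/y_train and a hypothetical train_and_evaluate helper (not shown here) that trains a fresh model on the given slice for 40 epochs and returns its validation accuracy:

```python
# x_train: (50000, 32, 32, 3), y_train: (50000,) -- the CIFAR-10 training set.
# train_and_evaluate is a hypothetical helper: it builds a fresh model, trains
# it on the given slice for 40 epochs, and reports validation accuracy.
segment = 10_000
for start in range(0, 50_000, segment):
    acc = train_and_evaluate(x_train[start:start + segment],
                             y_train[start:start + segment])
    print(f"images {start // 1000}k-{(start + segment) // 1000}k: accuracy {acc:.0%}")
```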

Thus arise two questions:

  1. Why would validation accuracy go down when the data set is larger and I've confirmed that each segment of data indeed provides decent results on its own?
  2. For a larger training set, would I need to train for more epochs, even though each epoch represents a larger number of training steps (391 vs. 78)?
Thomas Pinetz: Did you shuffle the data before feeding it into the network? If you use batches where the samples are always of the same class, SGD has a hard time converging.

Eric H.: @ThomasPinetz Yes, I did.

1 Answer

0 votes

It turns out my intuition was right, but my code was not.

Basically, I had been computing my validation accuracy on training data (data used to train the model), not on validation data (data the model had never seen).
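In Keras-style code (a sketch of the mistake, not my exact code), with x_val/y_val being the 5,000 held-out validation images, the bug and the fix look roughly like this:

```python
# The bug: "validation" accuracy was computed on the training split, so it was
# really training accuracy, which keeps climbing as the model memorizes.
# val_loss, val_acc = model.evaluate(x_train, y_train)   # wrong split

# The fix: evaluate on the held-out data the model never saw during training.
val_loss, val_acc = model.evaluate(x_val, y_val)
print(f"validation accuracy: {val_acc:.0%}")
```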

After correcting this error, the validation accuracy did indeed improve with a bigger training data set, as expected.