1
votes

I am relatively new in Deep learning and its framework. Currently, I am experimenting with Caffe framework and trying to fine tune the Vgg16_places_365.

I am using the Amazone EC2 instance g2.8xlarge with 4 GPUs (each has 4 GB of RAM). However, when I try to train my model (using a single GPU), I got this error:

Check failed: error == cudaSuccess (2 vs. 0) out of memory

After I did some research, I found that one of the ways to solve this out of memory problem is by reducing the batch size in my train.prototxt

Caffe | Check failed: error == cudaSuccess (2 vs. 0) out of memory.

Initially, I set the batch size into 50, and iteratively reduced it until 10 (since it worked when batch_size = 10). Now, the model is being trained and I am pretty sure it will take quite long time. However, as a newcomer in this domain, I am curious about the relation between this batch size and another parameter such as the learning rate, stepsize and even the max iteration that we specify in the solver.prototxt.

How significant the size of the batch will affect the quality of the model (like accuracy may be). How the other parameters can be used to leverage the quality. Also, instead of reducing the batch size or scale up my machine, is there another way to fix this problem?

2

2 Answers

1
votes

To answer your first question regarding the relationship between parameters such as batch size, learning rate and maximum number of iterations, you are best of reading about the mathematical background. A good place to start might be this stats.stackexchange question: How large should the batch size be for stochastic gradient descent?. The answer will briefly discuss the relation between batch size and learning rate (from your question I assume learning rate = stepsize) and also provide some references for further reading.

To answer your last question, with the dataset you are finetuning on and the model (i.e. the VGG16) being fixed (i.e. the input data of fixed size, and the model of fixed size), you will have a hard time avoiding the out of memory problem for large batch sizes. However, if you are willing to reduce the input size or the model size you might be able to use larger batch sizes. Depending on how (and what) exactly you are finetuning, reducing the model size may already be achieved by discarding learned layers or reducing the number/size of fully connected layers.

The remaining questions, i.e. how significant the batchsize influences quality/accuracy and how other parameters influence quality/accuracy, are hard to answer without knowing the concrete problem you are trying to solve. The influence of e.g. the batchsize on the achieved accuracy might depend on various factors such as the noise in your dataset, the dimensionality of your dataset, the size of your dataset as well as other parameters such as learning rate (=stepsize) or momentum parameter. For these sort of questions, I recommend the textbook by Goodfellow et al., e.g. chapter 11 may provide some general guidelines on choosing these hyperparmeters (i.e. batchsize, learning rate etc.).

1
votes

another way to solve your problem is using all the GPUs on your machine. If you have 4x4=16GB RAM on your GPUs, that would be enough. If you are running caffe in command mode, just add the --gpu argument as follows (assuming you have 4 GPUs indexed as default 0,1,2,3):

 build/tools/caffe train --solver=solver.prototxt --gpu=0,1,2,3

However if you are using the python interface, running with multiple GPUs is not yet supported.

I can point out some general hints to answer your question on the batchsize: - The smaller the batchsize is, the more stochastic your learning would be --> less probability of overfitting on the training data; higher probability of not converging. - each iteration in caffe fetches one batch of data, runs forward and ends with a backpropagation. - Let's say your training data is 50'000 and your batchsize is 10; then in 1000 iterations, 10'000 of your data has been fed to the network. In the same scenario scenario, if your batchsize is 50, in 1000 iterations, all your training data are seen by the network. This is called one epoch. You should design your batchsize and maximum iterations in a way that your network is trained for a certain number of epochs. - stepsize in caffe, is the number of iterations your solver will run before multiplying the learning rate with the gamma value (if you have set your training approach as "step").