
When training a deep learning model, I found that the GPU is not fully utilised if I set the training and validation (test) batch sizes to the same value, say 32, 64, ..., 512.

I then checked the NVIDIA Titan X specifications:

  1. NVIDIA CUDA® Cores: 3584
  2. Memory: 12 GB GDDR5X

To reduce test time for a CNN model, I want to make the number of samples per batch as large as possible. I tried:

  • Setting the number of samples per batch to 3584: CUDA out-of-memory error.
  • Setting the number of samples per batch to 2048: CUDA out-of-memory error.
  • Setting the number of samples per batch to 1024: works, but I am not sure whether the GPU is fully utilised.
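
The manual trial and error above could be automated with a binary search between a size known to work (1024) and one known to fail (2048). Below is a minimal sketch, not a definitive implementation. The `fits` callback is hypothetical: in practice it would run one forward pass at the given batch size and return False when the framework raises its out-of-memory error. Here it is simulated with a made-up limit so the example runs without a GPU. The search also assumes that if a batch size fits, every smaller one fits too.

```python
from typing import Callable


def largest_batch_that_fits(fits: Callable[[int], bool], low: int, high: int) -> int:
    """Return the largest batch size in [low, high] for which fits() is True.

    Assumes monotonicity: if a batch size fits in memory, so does every
    smaller one. `low` must be a size that is already known to work.
    """
    if not fits(low):
        raise ValueError(f"batch size {low} does not fit; choose a smaller lower bound")
    while low < high:
        # Round up so the loop always makes progress when high == low + 1.
        mid = (low + high + 1) // 2
        if fits(mid):
            low = mid        # mid works, so the answer is mid or larger
        else:
            high = mid - 1   # mid fails, so the answer is below mid
    return low


# Hypothetical stand-in for a real forward pass: pretend that anything
# above 1500 samples runs out of memory. Replace with a function that
# runs the model once and catches the out-of-memory error.
SIMULATED_LIMIT = 1500


def simulated_fits(batch_size: int) -> bool:
    return batch_size <= SIMULATED_LIMIT


if __name__ == "__main__":
    print(largest_batch_that_fits(simulated_fits, low=1024, high=2048))
```

Note that the largest batch that fits in memory is not necessarily the fastest one, so it is still worth timing a few candidate sizes.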


How can I easily pick the number of samples per batch that fully utilises the GPU during the forward pass of a deep model?


1 Answer


Use watch nvidia-smi to check how much GPU memory your processes are using.
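
To log the same numbers from a script rather than watching them by eye, nvidia-smi can print selected fields as CSV. Below is a minimal sketch under some assumptions: nvidia-smi is on the PATH, every queried field comes back as a plain integer (some GPUs report "[N/A]" for some fields, which this sketch does not handle), and the helper names `parse_gpu_stats` and `read_gpu_stats` are made up for illustration. It also reads `utilization.gpu`, since memory in use and compute utilisation are separate measures.

```python
import subprocess

# Query memory use and compute utilisation, one CSV line per GPU,
# without the header row or unit suffixes.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text: str) -> list:
    """Parse lines such as '10240, 12288, 97' into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        used, total, util = (int(field.strip()) for field in line.split(","))
        stats.append(
            {
                "memory_used_mib": used,
                "memory_total_mib": total,
                "utilization_pct": util,
            }
        )
    return stats


def read_gpu_stats() -> list:
    """Run nvidia-smi once and return the parsed stats (needs an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)


if __name__ == "__main__":
    for index, gpu in enumerate(read_gpu_stats()):
        print(index, gpu)
```

Calling `read_gpu_stats()` periodically while the forward pass runs at different batch sizes shows whether memory or compute is the limiting factor.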


Note that for training, as opposed to testing, a larger batch is not necessarily better. From Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, Ping Tak Peter Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. https://arxiv.org/abs/1609.04836 :

The stochastic gradient descent method and its variants are algorithms of choice for many Deep Learning tasks. These methods operate in a small-batch regime wherein a fraction of the training data, usually 32--512 data points, is sampled to compute an approximation to the gradient. It has been observed in practice that when using a larger batch there is a significant degradation in the quality of the model, as measured by its ability to generalize. There have been some attempts to investigate the cause for this generalization drop in the large-batch regime, however the precise answer for this phenomenon is, hitherto unknown. In this paper, we present ample numerical evidence that supports the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions -- and that sharp minima lead to poorer generalization. In contrast, small-batch methods consistently converge to flat minimizers, and our experiments support a commonly held view that this is due to the inherent noise in the gradient estimation. We also discuss several empirical strategies that help large-batch methods eliminate the generalization gap and conclude with a set of future research ideas and open questions.


The lack of generalization ability is due to the fact that large-batch methods tend to converge to sharp minimizers of the training function. These minimizers are characterized by large positive eigenvalues in $\nabla^2 f(x)$ and tend to generalize less well. In contrast, small-batch methods converge to flat minimizers characterized by small positive eigenvalues of $\nabla^2 f(x)$. We have observed that the loss function landscape of deep neural networks is such that large-batch methods are almost invariably attracted to regions with sharp minima and that, unlike small batch methods, are unable to escape basins of these minimizers.

