5 votes

I am asking this question because I am successfully training a segmentation network on the RTX 2070 in my laptop (8 GB of VRAM), yet with exactly the same code and exactly the same software libraries installed on my desktop PC with a GTX 1080 Ti, it still throws an out-of-memory error.

Why does this happen, considering that:

  1. The same Windows 10 + CUDA 10.1 + cuDNN 7.6.5.32 + NVIDIA driver 418.96 (the one that ships with CUDA 10.1) stack is installed on both the laptop and the PC.

  2. Training with TensorFlow 2.3 runs smoothly on the GPU of my PC; it is only PyTorch that fails to allocate memory for training.

  3. PyTorch recognises the GPU (it prints GTX 1080 Ti) via the command: print(torch.cuda.get_device_name(0))

  4. PyTorch allocates memory when running this command: torch.rand(20000, 20000).cuda()  # allocates ~1.5 GB of VRAM. (Both checks are combined in the sketch below.)
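For completeness, here is a minimal sanity-check sketch that simply combines the two commands from points 3 and 4 above; the memory readout at the end is my own addition, not part of the original question:

```python
import torch

# Point 3: confirm that PyTorch sees the GPU.
print(torch.cuda.is_available())        # expected: True
print(torch.cuda.get_device_name(0))    # expected: "GeForce GTX 1080 Ti"

# Point 4: a test allocation of a 20000 x 20000 float32 tensor (~1.6 GB).
x = torch.rand(20000, 20000).cuda()
print(f"{torch.cuda.memory_allocated() / 1e9:.2f} GB currently allocated")
```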

What is the solution to this?

Thank you, downvoter, for expressing your opinion. I would like to see in a comment the reason for downvoting, so that I can also learn what the mistake in my question/assumptions is. – Timbus Calin
That is interesting. It shouldn't happen, and it doesn't seem to be the usual "reduce batch size" case :) Were you using some kind of custom Dataset/Sampler/DataLoader? Were you moving the data onto the GPU in one of these components? Have you tried to create a minimal reproducible example? I'd be able to try to reproduce the problem and investigate why the solution works, but as it is, the question has none of the required information. If you still have it, would you mind posting the full stack trace as well? – Berriel
Yes, I am using this Jupyter notebook, which reproduces the problem. Note that the error does not happen on my laptop, but it does on the PC (with the exact same configuration), regardless of the PyTorch version [1.2, 1.6] and the matching torchvision version: github.com/qubvel/segmentation_models.pytorch/blob/master/…. Since it was solved by reducing the number of workers in the DataLoader, I assume it is related to the processor or threads? – Timbus Calin
Thanks for the help :D – Timbus Calin
Did you check whether using no augmentation (augmentation=None) still causes the problem? Other than that, everything seems to be OK. I'll give it a try later. – Berriel

1 Answer

8 votes

Most people (even in the thread linked below) jump to suggest that decreasing the batch_size will solve the problem. In this case, however, it does not. For example, it would have been illogical for a network to train on 8 GB of VRAM and yet fail to train on 11 GB of VRAM, considering that no other applications were consuming video memory on the 11 GB system and that exactly the same configuration was installed and used.

The reason this happened in my case was that, when constructing the DataLoader object, I set a very high value (12) for the num_workers parameter. Decreasing this value to 4 solved the problem for me.

In fact, although it sits at the bottom of the thread, the answer provided by Yurasyk at https://github.com/pytorch/pytorch/issues/16417#issuecomment-599137646 pointed me in the right direction.

Solution: Decrease the number of workers in the PyTorch DataLoader. Although I do not fully understand why this works, I assume it is related to the worker processes spawned behind the scenes for data fetching; it may be that, on some machines, this configuration triggers the error.
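As an illustration only (this is not the notebook from the question), here is a minimal sketch of the change; the dataset below is a hypothetical stand-in that yields random image/mask pairs, and the batch size is just a placeholder:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class DummySegmentationDataset(Dataset):
    """Hypothetical stand-in for the real segmentation dataset."""
    def __len__(self):
        return 100

    def __getitem__(self, idx):
        image = torch.rand(3, 256, 256)                      # fake RGB image
        mask = torch.randint(0, 2, (1, 256, 256)).float()    # fake binary mask
        return image, mask

train_loader = DataLoader(
    DummySegmentationDataset(),
    batch_size=8,      # placeholder; reducing it did not help in this case
    shuffle=True,
    num_workers=4,     # was 12; lowering this resolved the out-of-memory error
    pin_memory=True,
)
```

Each worker is a separate process; whether and how that translates into extra memory use likely depends on the platform, which would be consistent with the behaviour differing between the laptop and the PC.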