I was trying to train a neural network that uses ResNet-152 as its backbone, but I kept getting a CUDA out of memory error. To work around it, I added the code fragment below to allow PyTorch to use more memory:

import torch

torch.cuda.empty_cache()  # release unused cached blocks held by the allocator
torch.cuda.set_per_process_memory_fraction(1., 0)  # allow this process up to 100% of GPU 0's memory

However, I am still unable to train my model: PyTorch is using only 6.06 GB of memory and fails to allocate a further 58.00 MiB, even though initially there are 7+ GB of unused memory on my GPU.

RuntimeError: CUDA out of memory. 
Tried to allocate 58.00 MiB (GPU 0; 7.80 GiB total capacity; 6.05 GiB already allocated; 
48.94 MiB free; 7.80 GiB allowed; 6.19 GiB reserved in total by PyTorch)
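
For reference, the "allocated" and "reserved" figures in the error message can also be queried directly from PyTorch's caching allocator. A minimal sketch, using the standard torch.cuda introspection calls:

import torch

# "allocated" = memory held by live tensors; "reserved" = memory the
# caching allocator has claimed from the GPU (the 6.19 GiB in the error above).
print(f"allocated: {torch.cuda.memory_allocated(0) / 2**30:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved(0) / 2**30:.2f} GiB")
print(torch.cuda.memory_summary(device=0, abbreviated=True))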

This is the output I get from the nvidia-smi command:


| N/A   47C    P8     6W /  N/A |    362MiB /  7982MiB |     10%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A       947      G   /usr/lib/xorg/Xorg                 70MiB |
|    0   N/A  N/A      1549      G   /usr/lib/xorg/Xorg                159MiB |
|    0   N/A  N/A      1722      G   /usr/bin/gnome-shell               34MiB |
|    0   N/A  N/A      6506      G   ...AAAAAAAAA= --shared-files       85MiB |
+-----------------------------------------------------------------------------+

How can I increase the 6.19 GiB reserved in total by PyTorch so that it uses more of my GPU's memory? Thank you!

OS: Ubuntu 20.04

GPU: Nvidia GeForce RTX 2070 Super Max-Q Design

PyTorch version: 1.8.1+cu111

CUDA toolkit: 11.2

Nvidia CUDA driver: 460.80

You're not giving us any information about the model or where it (and its parameters) reside. Without these we can only guess. - erip
@erip Sorry, I thought that since this is a general issue it wouldn't matter. I am training RESA Net (github.com/ZJULearning/resa) from that repository with a pretrained resnet34 backbone. - Yigithan Gediz
For instance, while the model is training, I can load another model from a Jupyter kernel to look at some predictions, which takes roughly another 1.3 GB of GPU memory. When I then check the GPU, I see 7.7 GB of GPU memory in use while the training and testing processes run together. Because of this, I think that in principle I should be able to allocate more memory to the training process. - Yigithan Gediz
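
For concreteness, the side check described in the last comment amounts to something like the sketch below, run in a separate Jupyter kernel while training continues elsewhere. The resnet34 weights here are just a stand-in for the actual prediction model:

import torch
import torchvision

# Load a second model in a separate process; together with its CUDA
# context this claims roughly another 1.3 GB on the GPU (per nvidia-smi).
model = torchvision.models.resnet34(pretrained=True).cuda().eval()

with torch.no_grad():
    preds = model(torch.randn(1, 3, 224, 224, device="cuda"))

print(f"{torch.cuda.memory_allocated(0) / 2**20:.0f} MiB allocated by this kernel")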