2 votes

I'm training a fully convolutional network (FCN32) for semantic segmentation on a Tesla K80 with more than 11 GB of memory.

The input images are fairly large: 352×1216. The network structure is shown below. Even with batch_size=1 I still hit an out-of-memory error.

The criterion is nn.BCEWithLogitsLoss().

The network works fine when I run it on the CPU.
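For reference, the model and loss can be wired up roughly like this (a sketch, not the original code: the torchvision VGG16 backbone and the exact layer hyperparameters are my assumptions, chosen so the shapes and parameter counts match the summary below):

    import torch
    import torch.nn as nn
    import torchvision

    # Sketch of the failing setup; hyperparameters chosen to match the
    # torchsummary output below, not taken from the original code.
    model = nn.Sequential(
        torchvision.models.vgg16().features,             # conv1_1 .. pool5
        nn.Conv2d(512, 4096, kernel_size=7, padding=3),  # fc6 as convolution
        nn.ReLU(inplace=True),
        nn.Conv2d(4096, 4096, kernel_size=1),            # fc7 as convolution
        nn.ReLU(inplace=True),
        nn.Conv2d(4096, 1, kernel_size=1),               # score layer
        nn.ConvTranspose2d(1, 1, kernel_size=64,         # 32x upsampling
                           stride=32, padding=16, bias=False),
    ).cuda()
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

    image = torch.randn(1, 3, 352, 1216, device='cuda')   # one large input
    target = torch.zeros(1, 1, 352, 1216, device='cuda')  # binary mask
    loss = criterion(model(image), target)
    print(loss)
    loss.backward()    # the OOM in the traceback below is raised here
    optimizer.step()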


    Layer (type)               Output Shape         Param #
        Conv2d-1        [-1, 64, 352, 1216]           1,792
        Conv2d-2        [-1, 64, 352, 1216]          36,928
     MaxPool2d-3         [-1, 64, 176, 608]               0
        Conv2d-4        [-1, 128, 176, 608]          73,856
        Conv2d-5        [-1, 128, 176, 608]         147,584
     MaxPool2d-6         [-1, 128, 88, 304]               0
        Conv2d-7         [-1, 256, 88, 304]         295,168
        Conv2d-8         [-1, 256, 88, 304]         590,080
        Conv2d-9         [-1, 256, 88, 304]         590,080
    MaxPool2d-10         [-1, 256, 44, 152]               0
       Conv2d-11         [-1, 512, 44, 152]       1,180,160
       Conv2d-12         [-1, 512, 44, 152]       2,359,808
       Conv2d-13         [-1, 512, 44, 152]       2,359,808
    MaxPool2d-14          [-1, 512, 22, 76]               0
       Conv2d-15          [-1, 512, 22, 76]       2,359,808
       Conv2d-16          [-1, 512, 22, 76]       2,359,808
       Conv2d-17          [-1, 512, 22, 76]       2,359,808
    MaxPool2d-18          [-1, 512, 11, 38]               0
       Conv2d-19         [-1, 4096, 11, 38]     102,764,544
       Conv2d-20         [-1, 4096, 11, 38]      16,781,312
       Conv2d-21            [-1, 1, 11, 38]           4,097
    ConvTranspose2d-22     [-1, 1, 352, 1216]           4,096
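
For scale, here is a back-of-the-envelope estimate from the summary above (my own arithmetic, not a profiler measurement): with float32 and batch size 1, the forward activations alone are about half a gigabyte and the parameters another half; the backward pass roughly doubles both, before cuDNN workspace is counted.

    # Rough memory estimate from the summary above (float32, batch size 1).
    # Forward activations only; the backward pass roughly doubles this, and
    # parameter gradients, optimizer state and cuDNN workspace come on top.
    shapes = [
        (64, 352, 1216), (64, 352, 1216), (64, 176, 608),
        (128, 176, 608), (128, 176, 608), (128, 88, 304),
        (256, 88, 304), (256, 88, 304), (256, 88, 304), (256, 44, 152),
        (512, 44, 152), (512, 44, 152), (512, 44, 152), (512, 22, 76),
        (512, 22, 76), (512, 22, 76), (512, 22, 76), (512, 11, 38),
        (4096, 11, 38), (4096, 11, 38), (1, 11, 38), (1, 352, 1216),
    ]
    acts = sum(c * h * w for c, h, w in shapes) * 4   # bytes
    params = 134_268_737 * 4                          # sum of the Param # column
    print(f"activations ~{acts / 2**20:.0f} MB, parameters ~{params / 2**20:.0f} MB")
    # -> activations ~505 MB, parameters ~512 MB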

Error message:

    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    <ipython-input> in <module>()
         36     print (loss)
         37     #torch.cuda.empty_cache()
    ---> 38     loss.backward()
         39     optimizer.step()
         40

    /anaconda/envs/py35/lib/python3.5/site-packages/torch/tensor.py in backward(self, gradient, retain_graph, create_graph)
         91                 products. Defaults to False.
         92             """
    ---> 93         torch.autograd.backward(self, gradient, retain_graph, create_graph)
         94
         95     def register_hook(self, hook):

    /anaconda/envs/py35/lib/python3.5/site-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
         88     Variable._execution_engine.run_backward(
         89         tensors, grad_tensors, retain_graph, create_graph,
    ---> 90         allow_unreachable=True) # allow_unreachable flag
         91
         92

    RuntimeError: CUDA error: out of memory
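
If you want to see where the memory goes, PyTorch's built-in counters are worth printing just before the failing backward() call; a minimal sketch (torch.cuda.memory_allocated and torch.cuda.max_memory_allocated are standard APIs):

    import torch

    # Sketch: print these inside the training loop, right before backward().
    # The counters only track tensors allocated by PyTorch itself.
    print(f"allocated: {torch.cuda.memory_allocated() / 2**20:.0f} MB")
    print(f"peak:      {torch.cuda.max_memory_allocated() / 2**20:.0f} MB")

    # Note: torch.cuda.empty_cache() (commented out in the loop above) only
    # returns cached, unused blocks to the driver; it cannot free tensors the
    # autograd graph still references, so it does not help with this OOM.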


2 Answers

2 votes

Usually this happens because of the memory on your GPU. A more powerful GPU could solve the problem (as you mention in your own answer). If you don't have one, you can scale your images down to around 256×N; that is also good practice for performance's sake. A sketch of the downscaling follows below.
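A minimal sketch of that downscaling, assuming you shrink the input before the forward pass and upsample the logits back to label resolution for nn.BCEWithLogitsLoss (the half-resolution target size here is my choice, not a requirement):

    import torch
    import torch.nn.functional as F

    image = torch.randn(1, 3, 352, 1216)            # stand-in for a real batch
    small = F.interpolate(image, size=(176, 608),   # half resolution
                          mode='bilinear', align_corners=False)
    print(small.shape)                              # torch.Size([1, 3, 176, 608])
    # logits = model(small)                         # forward pass on the small input
    # logits = F.interpolate(logits, size=(352, 1216),
    #                        mode='bilinear', align_corners=False)
    # loss = criterion(logits, target)              # loss at full label resolution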

0 votes

I found out the reason... it's hardware-related. I switched to another machine and the error disappeared.