0 votes

I am facing a CUDA out-of-memory error with a simple fully connected model. I have tried torch.cuda.empty_cache() and gc.collect(), deleted unnecessary variables with del, and reduced the batch size, but the error is not resolved. The error appears only for the SUN dataset, where 1440 test images are used for evaluation; the code runs fine for the AWA2 dataset, where the number of test images is 7913. I am using Google Colab, and I have also tried an RTX 2060. Here is the code snippet where the error occurs:

def euclidean_dist(x, y):
    # x: N x D
    # y: M x D
    torch.cuda.empty_cache()
    n = x.size(0)
    m = y.size(0)
    d = x.size(1)
    assert d == y.size(1)
    x = x.unsqueeze(1).expand(n, m, d)
    y = y.unsqueeze(0).expand(n, m, d)
    del n,m,d
    return torch.pow(x - y, 2).sum(2)

def compute_accuracy(test_att, test_visual, test_id, test_label):
    global s2v
    s2v.eval()
    with torch.no_grad():
        test_att = Variable(torch.from_numpy(test_att).float().to(device))
        test_visual = Variable(torch.from_numpy(test_visual).float().to(device))
        outpre = s2v(test_att, test_visual)
        del test_att, test_visual
        outpre = torch.argmax(torch.softmax(outpre, dim=1), dim=1)
    
    outpre = test_id[outpre.cpu().data.numpy()]
    
    #compute averaged per class accuracy
    test_label = np.squeeze(np.asarray(test_label))
    test_label = test_label.astype("float32")
    unique_labels = np.unique(test_label)
    acc = 0
    for l in unique_labels:
        idx = np.nonzero(test_label == l)[0]
        acc += accuracy_score(test_label[idx], outpre[idx])
    acc = acc / unique_labels.shape[0]
    return acc

The error is:

Traceback (most recent call last):
  File "GBU_new_v2.py", line 234, in <module>
    acc_seen_gzsl = compute_accuracy(attribute, x_test_seen, np.arange(len(attribute)), test_label_seen)
  File "GBU_new_v2.py", line 111, in compute_accuracy
    outpre = s2v(test_att, test_visual)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "GBU_new_v2.py", line 80, in forward
    a1 = euclidean_dist(feat, a1)
  File "GBU_new_v2.py", line 62, in euclidean_dist
    return torch.pow(x - y, 2).sum(2)#.sqrt() # return: N x M
RuntimeError: CUDA out of memory. Tried to allocate 14.12 GiB (GPU 0; 15.90 GiB total capacity; 14.19 GiB already allocated; 669.88 MiB free; 14.55 GiB reserved in total by PyTorch)

1 Answer

2 votes

It seems you have batches defined only for training, while during testing you attempt to process the entire test set at once.
You should split your test set into smaller batches, evaluate one batch at a time, and combine the per-batch results at the end into a single score for the model.
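A minimal sketch of what that batched evaluation could look like, assuming a model like s2v in the question that takes (attributes, visual features) and returns per-sample class scores. The names batched_predictions and batch_size are illustrative, not from your code, and the batch size should be tuned to your GPU:

```python
import numpy as np
import torch

def batched_predictions(model, test_att, test_visual, batch_size=256, device="cpu"):
    model.eval()
    preds = []
    with torch.no_grad():
        # The attribute matrix is small and shared across batches.
        att = torch.from_numpy(test_att).float().to(device)
        # Process the visual features in chunks instead of all at once, so the
        # N x M x D intermediate built inside euclidean_dist stays small.
        for start in range(0, test_visual.shape[0], batch_size):
            chunk = torch.from_numpy(
                test_visual[start:start + batch_size]
            ).float().to(device)
            scores = model(att, chunk)
            # softmax is monotonic, so argmax of the raw scores gives the
            # same predicted classes without the extra softmax pass.
            preds.append(scores.argmax(dim=1).cpu())
    return torch.cat(preds).numpy()
```

The concatenated predictions can then be fed to the same per-class accuracy loop you already have, so only the forward pass changes.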