Training with PyTorch: error due to CUDA memory issue

Question

I am trying to train a segmentation model on the Cityscapes dataset. I use the torchvision deeplabv3_resnet50 model and its Cityscapes dataset class and transforms. In case it matters, I am running the code in a Jupyter notebook.

The datasets and dataloaders work. When I attempt to train, I always get this error at the point where the first batch is passed through the network (y_ = net(xb) in the one_epoch function).

RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 6.00 GiB total capacity; 4.20 GiB already allocated; 6.87 MiB free; 4.20 GiB reserved in total by PyTorch)

What is strange is that, no matter what the batch size (bs) is, the amount of free memory reported in the error is a little less than the amount being allocated. For example, with bs=16 I get:

RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 6.00 GiB total capacity; 2.90 GiB already allocated; 1.70 GiB free; 2.92 GiB reserved in total by PyTorch)

I have a much more complicated model running, that will work with bs=16. This model builds everything from scratch. But I really want to be able to use the simplicity that torchvision seems to offer with its model zoo and datasets.

My code is below. It is not much more than the bare essentials, enough to show whether it runs correctly on the GPU.

def one_epoch(net, loss, dl, opt=None, metric=None):

    if opt:
        net.train()  # only affects some layers
    else:
        net.eval()
        rq_stored = []
        for p in net.parameters():
            rq_stored.append(p.requires_grad)
            p.requires_grad = False

    L, M = [], []
    dl_it = iter(dl)
    for xb, yb in tqdm(dl_it, leave=False):
        xb, yb = xb.cuda(), yb.cuda()
        y_ = net(xb)
        l = loss(y_, yb)
        if opt:
            opt.zero_grad()
            l.backward()
            opt.step()
        L.append(l.detach().cpu().numpy())
        if metric: M.append(metric(y_, yb).cpu().numpy())

    if not opt:
        for p,rq in zip(net.parameters(), rq_stored): p.requires_grad = rq

    return L, M

accuracy = lambda y_,yb: (y_.max(dim=1)[1] == yb).float().mean()

def fit(net, tr_dl, val_dl, loss=nn.CrossEntropyLoss(), epochs=3, lr=3e-3, wd=1e-3):

    opt = optim.Adam(net.parameters(), lr=lr, weight_decay=wd)

    Ltr_hist, Lval_hist = [], []
    for epoch in trange(epochs):
        Ltr,  _    = one_epoch(net, loss, tr_dl,  opt)
        Lval, Aval = one_epoch(net, loss, val_dl, None, accuracy)
        Ltr_hist.append(np.mean(Ltr))
        Lval_hist.append(np.mean(Lval))
        print(f'epoch: {epoch+1}\ttraining loss: {np.mean(Ltr):0.4f}\tvalidation loss: {np.mean(Lval):0.4f}\tvalidation accuracy: {np.mean(Aval):0.2f}')

    return Ltr_hist, Lval_hist

class To3ch(object):
    def __call__(self, pic):
        if pic.shape[0]==1: pic = pic.repeat(3,1,1)
        return pic

bs = 1
imagenet_stats = ([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])

transf = transforms.Compose([
    transforms.ToTensor(),
    To3ch(),
    transforms.Normalize(*imagenet_stats)
])

train_ds = datasets.Cityscapes('C:/cityscapes_ds', split='train', target_type='semantic', transform=transf, target_transform=transf)
val_ds = datasets.Cityscapes('C:/cityscapes_ds', split='val', target_type='semantic', transform=transf, target_transform=transf)

train_dl  = DataLoader(train_ds,  batch_size=bs,   shuffle=True,  num_workers=0)
val_dl = DataLoader(val_ds, batch_size=2*bs, shuffle=False, num_workers=0)

net = models.segmentation.deeplabv3_resnet50(num_classes=20)
fit(net.cuda(), train_dl, val_dl, loss=nn.CrossEntropyLoss(), epochs=1, lr=1e-4, wd=1e-4)

@talonmies - what was wrong with the cuda tag? Just so I know in future, thanks. — msm1089
You have no CUDA programming question, that is why. This is about Torch, not CUDA — talonmies

Berriel · Accepted Answer · 2020-06-06T14:56:01

You didn't specify, but if you're using the original Cityscapes, this OOM is completely expected.

The original Cityscapes dataset has large images (something like 1024x2048, IIRC), and it looks like you have a 6GB GPU. FYI, I cannot fit batch_size=2 in a 12GB GPU with inputs of this size.
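
As a rough sanity check, the two allocation sizes in the error messages can be reproduced with simple arithmetic. The sketch below assumes the failing tensor is a float32 feature map with 64 channels at half the input resolution, which is the shape produced by the first (stride-2, 64-channel) convolution of a ResNet-50. Which tensor actually triggered the error is an assumption, and `feature_map_mib` is a hypothetical helper written for this illustration.

```python
# Back-of-the-envelope check of the allocation sizes in the error messages.
# Assumption: the failing tensor is a float32 feature map with 64 channels
# at half the input resolution (the shape after ResNet-50's first conv).

BYTES_PER_FLOAT32 = 4
MIB = 1024 ** 2


def feature_map_mib(batch, channels, height, width):
    """Size in MiB of one float32 tensor of shape (batch, channels, height, width)."""
    return batch * channels * height * width * BYTES_PER_FLOAT32 / MIB


H, W = 1024, 2048  # full-resolution Cityscapes frame

for bs in (1, 16):
    size = feature_map_mib(bs, 64, H // 2, W // 2)
    print(f"bs={bs}: {size:.0f} MiB for a single feature map")
```

Under that assumption a single feature map takes 128 MiB at bs=1 and 2048 MiB (2 GiB) at bs=16, matching the sizes in both error messages. Training keeps many such intermediate tensors alive for the backward pass, so a 6 GB card fills up quickly at this resolution.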

When training DeepLab models, it is common to apply transformations on the input (e.g., random crops, resize, scaling, etc.), and it looks like you don't apply any.
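
A minimal sketch of one such transformation is a random crop applied at the same location to the image and its label mask. `JointRandomCrop` is a hypothetical helper, not part of torchvision, and it assumes both inputs are already tensors with the spatial dimensions last. How it is wired into the dataset is left open here.

```python
import torch


class JointRandomCrop:
    """Crop an image tensor and its label mask at the same random location.

    Hypothetical helper for illustration; assumes the crop size is no larger
    than the input and that the last two dimensions are (height, width).
    """

    def __init__(self, size):
        self.crop_h, self.crop_w = size

    def __call__(self, image, target):
        h, w = image.shape[-2:]
        # Draw the top-left corner once so image and mask stay aligned.
        top = torch.randint(0, h - self.crop_h + 1, (1,)).item()
        left = torch.randint(0, w - self.crop_w + 1, (1,)).item()
        rows = slice(top, top + self.crop_h)
        cols = slice(left, left + self.crop_w)
        return image[..., rows, cols], target[..., rows, cols]


# Example: a 512x1024 crop from a full-resolution frame.
crop = JointRandomCrop((512, 1024))
image = torch.zeros(3, 1024, 2048)
mask = torch.zeros(1024, 2048, dtype=torch.long)
image_c, mask_c = crop(image, mask)
print(image_c.shape, mask_c.shape)
```

A 512x1024 crop has a quarter of the pixels of the full frame, so every activation in the network shrinks by roughly the same factor.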

When you say:

I have a much more complicated model running, that will work with bs=16.

Perhaps you're looking at a different kind of complexity, something that has less impact on memory requirements than you think.
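
To illustrate the difference, the sketch below compares the memory taken by the weights of a single convolution with the memory taken by its output. The layer configuration (3 to 64 channels, 7x7 kernel, stride 2, padding 3) mirrors a ResNet-style first layer and is chosen only as an example. `conv_out` and `conv_memory` are hypothetical helpers, and float32 storage is assumed.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_memory(in_ch, out_ch, kernel, stride, padding, batch, height, width):
    """Return (weight_bytes, activation_bytes) for one float32 conv layer (no bias)."""
    weight_bytes = in_ch * out_ch * kernel * kernel * 4
    out_h = conv_out(height, kernel, stride, padding)
    out_w = conv_out(width, kernel, stride, padding)
    activation_bytes = batch * out_ch * out_h * out_w * 4
    return weight_bytes, activation_bytes


for h, w in [(256, 512), (1024, 2048)]:
    weights, acts = conv_memory(3, 64, 7, 2, 3, batch=1, height=h, width=w)
    print(f"{h}x{w}: weights {weights / 1024:.1f} KiB, output {acts / 1024**2:.0f} MiB")
```

The weight memory is the same for both input sizes, while the output grows with the number of pixels and with the batch size. A model can therefore have many more layers or parameters and still need far less GPU memory if it operates on smaller inputs or downsamples early.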
