1 vote

I'm trying to train a PyTorch Flair model in AWS SageMaker. While doing so, I get the following error:

RuntimeError: CUDA out of memory. Tried to allocate 84.00 MiB (GPU 0; 11.17 GiB total capacity; 9.29 GiB already allocated; 7.31 MiB free; 10.80 GiB reserved in total by PyTorch)

For training, I used the sagemaker.pytorch.estimator.PyTorch class.
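Roughly, the estimator is set up like this (the entry point, role, and S3 path below are placeholders, and argument names can vary a bit between SageMaker SDK versions):

    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(
        entry_point="train.py",             # placeholder name for the training script
        role="<sagemaker-execution-role>",  # placeholder IAM role
        framework_version="1.5.0",
        py_version="py3",
        instance_count=1,
        instance_type="ml.p3.2xlarge",
    )

    # Data channel pointing at the S3 bucket with the exported CSVs (placeholder path)
    estimator.fit({"training": "s3://<bucket>/flair-data/"})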

I tried different instance types, from ml.m5 and g4dn to p3 (even one with 96 GB of memory). On ml.m5 I get a CPU memory error, on g4dn a GPU memory error, and on p3 still a GPU memory error, mostly because PyTorch uses only one of the GPUs (12 GB out of 8 × 12 GB).

I'm not getting anywhere with this training. I even tried locally on a CPU machine and got the following error:

RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 67108864 bytes. Buy new RAM!

The model training script:

    from flair.datasets import ClassificationCorpus
    from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentLSTMEmbeddings
    from flair.models import TextClassifier
    from flair.trainers import ModelTrainer
    from torch.optim import Adam

    # data_folder is defined elsewhere in the script
    corpus = ClassificationCorpus(data_folder, test_file='../data/exports/val.csv', train_file='../data/exports/train.csv')

    print("finished loading corpus")

    # Stack GloVe word embeddings with forward/backward Flair embeddings
    word_embeddings = [WordEmbeddings('glove'), FlairEmbeddings('news-forward-fast'), FlairEmbeddings('news-backward-fast')]

    # Pool the word embeddings into a single document embedding with an LSTM
    document_embeddings = DocumentLSTMEmbeddings(word_embeddings, hidden_size=512, reproject_words=True, reproject_words_dimension=256)

    classifier = TextClassifier(document_embeddings, label_dictionary=corpus.make_label_dictionary(), multi_label=False)

    trainer = ModelTrainer(classifier, corpus, optimizer=Adam)

    trainer.train('../model_files', max_epochs=12, learning_rate=0.0001, train_with_dev=False, embeddings_storage_mode="none")

P.S.: I was able to train the same architecture with a smaller dataset on my local GPU machine with a 4 GB GTX 1650 (GDDR5 memory), and it was really quick.

I guess the point is: "a smaller dataset". – Klaus D.
Yeah, I also thought so, but the difference is only around 1000 records, that's all. – Desmond
No, the point is "the similar architecture". Small changes can have a big impact. – Berriel
Sorry, I misled on that; it is actually the same architecture and model. The only difference is the dataset, and even that is just 4000 vs 5000 records. The main point is that I think the issue is with the SageMaker training; locally the same thing would run on a decent GPU, I just don't have that infrastructure. Could you help me solve this so the model can be trained in SageMaker? – Desmond

2 Answers

0 votes

This error means your GPU ran out of memory. You can try a few things (a short sketch follows the list):

  1. Reduce the size of the training data.

  2. Reduce the size of your model, i.e. the number of hidden units or the depth.

  3. Reduce the batch size.
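For example, with the Flair code from the question, points 2 and 3 could look roughly like this (the values are just illustrative starting points, not tuned settings):

    # Point 2: a smaller document embedding (fewer hidden units)
    document_embeddings = DocumentLSTMEmbeddings(word_embeddings, hidden_size=128, reproject_words=True, reproject_words_dimension=128)

    # Point 3: smaller batches via Flair's mini_batch_size argument (default is 32)
    trainer.train('../model_files', max_epochs=12, learning_rate=0.0001, mini_batch_size=8, train_with_dev=False, embeddings_storage_mode="none")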

0 votes

Okay, so after 2 days of continuous debugging I was able to find the root cause. What I understood is that Flair does not impose any limit on sentence length (in the sense of word count); it takes the longest sentence as the maximum. That was causing the issue: in my case there were a few documents with around 1.5 lakh (150,000) words, which is far too much to load the embeddings for into memory, even on a 16 GB GPU. So it was breaking there.

To solve this: for such overly long documents, you can take a chunk of n words (10K in my case) from any portion of the content (left/right/middle, anywhere) and truncate the rest, or simply drop those records from training if they are a very small fraction of the data.
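A minimal sketch of that truncation step, assuming the training data is a CSV and the text lives in a column named text (the column name and file paths are placeholders for your own layout):

    import pandas as pd

    MAX_WORDS = 10_000  # chunk size that worked in my case

    def truncate_text(text, max_words=MAX_WORDS):
        # Keep the first max_words whitespace-separated words and drop the rest
        words = str(text).split()
        return " ".join(words[:max_words])

    df = pd.read_csv("../data/exports/train.csv")
    df["text"] = df["text"].apply(truncate_text)  # 'text' column name is a placeholder
    df.to_csv("../data/exports/train_truncated.csv", index=False)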

After this, I hope you will be able to make progress with your training, as happened in my case.

P.S.: If you are following this thread and face a similar issue, feel free to comment back so that I can look into your case and help.