1 vote

I'm trying to train a PyTorch Flair model in AWS SageMaker. While doing so, I get the following error:

RuntimeError: CUDA out of memory. Tried to allocate 84.00 MiB (GPU 0; 11.17 GiB total capacity; 9.29 GiB already allocated; 7.31 MiB free; 10.80 GiB reserved in total by PyTorch)

For training, I used the sagemaker.pytorch.estimator.PyTorch class.
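Roughly, the estimator is set up like this (the entry point, role, and S3 path below are placeholders, and argument names can vary a bit between SageMaker SDK versions):

    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(
        entry_point="train.py",             # placeholder name for the training script
        role="<sagemaker-execution-role>",  # placeholder IAM role
        framework_version="1.5.0",
        py_version="py3",
        instance_count=1,
        instance_type="ml.p3.2xlarge",
    )

    # Data channel pointing at the S3 bucket with the exported CSVs (placeholder path)
    estimator.fit({"training": "s3://<bucket>/flair-data/"})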

I tried different instance types, from ml.m5 and g4dn to p3 (even one with 96 GB of memory). On ml.m5 I get a CPU memory error, on g4dn a GPU memory error, and on p3 still a GPU memory error, mostly because PyTorch uses only one of the GPUs (12 GB out of 8 × 12 GB).

I'm not getting anywhere with this training. I even tried locally on a CPU machine and got the following error:

RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 67108864 bytes. Buy new RAM!

The model training script:

    from flair.datasets import ClassificationCorpus
    from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentLSTMEmbeddings
    from flair.models import TextClassifier
    from flair.trainers import ModelTrainer
    from torch.optim import Adam

    # data_folder is defined elsewhere in the script
    corpus = ClassificationCorpus(data_folder, test_file='../data/exports/val.csv', train_file='../data/exports/train.csv')

    print("finished loading corpus")

    # Stack GloVe word embeddings with forward/backward Flair embeddings
    word_embeddings = [WordEmbeddings('glove'), FlairEmbeddings('news-forward-fast'), FlairEmbeddings('news-backward-fast')]

    # Pool the word embeddings into a single document embedding with an LSTM
    document_embeddings = DocumentLSTMEmbeddings(word_embeddings, hidden_size=512, reproject_words=True, reproject_words_dimension=256)

    classifier = TextClassifier(document_embeddings, label_dictionary=corpus.make_label_dictionary(), multi_label=False)

    trainer = ModelTrainer(classifier, corpus, optimizer=Adam)

    trainer.train('../model_files', max_epochs=12, learning_rate=0.0001, train_with_dev=False, embeddings_storage_mode="none")

P.S.: I was able to train the same architecture with a smaller dataset on my local GPU machine with a 4 GB GTX 1650 (GDDR5 memory), and it was really quick.

I guess the point is: "a smaller dataset". – Klaus D.
Yeah, I also thought so, but the difference is only around 1000 records, that's all. – Desmond
No, the point is "the similar architecture". Small changes can have a big impact. – Berriel
Sorry, I misled on that; it is actually the same architecture and model. The only difference is the dataset, and even that is just 4000 vs 5000 records. The main point is that I think the issue is with the SageMaker training; locally the same thing would run on a decent GPU, I just don't have that infrastructure. Could you help me solve this so the model can be trained in SageMaker? – Desmond

2 Answers

0 votes

This error means your GPU ran out of memory. You can try a few things (a short sketch follows the list):

  1. Reduce the size of the training data.

  2. Reduce the size of your model, i.e. the number of hidden units or the depth.

  3. Reduce the batch size.
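For example, with the Flair code from the question, points 2 and 3 could look roughly like this (the values are just illustrative starting points, not tuned settings):

    # Point 2: a smaller document embedding (fewer hidden units)
    document_embeddings = DocumentLSTMEmbeddings(word_embeddings, hidden_size=128, reproject_words=True, reproject_words_dimension=128)

    # Point 3: smaller batches via Flair's mini_batch_size argument (default is 32)
    trainer.train('../model_files', max_epochs=12, learning_rate=0.0001, mini_batch_size=8, train_with_dev=False, embeddings_storage_mode="none")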

0 votes

Okay, so after 2 days of continuous debugging I was able to find the root cause. What I understood is that Flair does not impose any limit on sentence length (in the sense of word count); it takes the longest sentence as the maximum. That was causing the issue: in my case there were a few documents with around 1.5 lakh (150,000) words, which is far too much to load the embeddings for into memory, even on a 16 GB GPU. So it was breaking there.

To solve this: for such overly long documents, you can take a chunk of n words (10K in my case) from any portion of the content (left/right/middle, anywhere) and truncate the rest, or simply drop those records from training if they are a very small fraction of the data.
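A minimal sketch of that truncation step, assuming the training data is a CSV and the text lives in a column named text (the column name and file paths are placeholders for your own layout):

    import pandas as pd

    MAX_WORDS = 10_000  # chunk size that worked in my case

    def truncate_text(text, max_words=MAX_WORDS):
        # Keep the first max_words whitespace-separated words and drop the rest
        words = str(text).split()
        return " ".join(words[:max_words])

    df = pd.read_csv("../data/exports/train.csv")
    df["text"] = df["text"].apply(truncate_text)  # 'text' column name is a placeholder
    df.to_csv("../data/exports/train_truncated.csv", index=False)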

After this, I hope you will be able to make progress with your training, as happened in my case.

P.S.: If you are following this thread and face a similar issue, feel free to comment back so that I can look into your case and help.