I'm trying to train a PyTorch FLAIR model in AWS SageMaker. While doing so, I get the following error:
RuntimeError: CUDA out of memory. Tried to allocate 84.00 MiB (GPU 0; 11.17 GiB total capacity; 9.29 GiB already allocated; 7.31 MiB free; 10.80 GiB reserved in total by PyTorch)
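To make the numbers in that message easier to compare, here is a minimal sketch that parses the sizes and converts them to MiB. The helper name `parse_oom` and the regular expressions are my own and only assume the message format shown above; they are not part of PyTorch. The gap between "reserved" and "allocated" (about 1.5 GiB here) is memory held by PyTorch's caching allocator but not occupied by tensors, so the failed 84 MiB request may point to fragmentation rather than a completely full card.

```python
import re

# Error text copied verbatim from the traceback above.
MESSAGE = (
    "CUDA out of memory. Tried to allocate 84.00 MiB (GPU 0; 11.17 GiB total capacity; "
    "9.29 GiB already allocated; 7.31 MiB free; 10.80 GiB reserved in total by PyTorch)"
)

UNIT_TO_MIB = {"KiB": 1 / 1024, "MiB": 1.0, "GiB": 1024.0}


def parse_oom(message):
    """Return the sizes quoted in a CUDA OOM message, converted to MiB."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (\wiB)",
        "total": r"([\d.]+) (\wiB) total capacity",
        "allocated": r"([\d.]+) (\wiB) already allocated",
        "free": r"([\d.]+) (\wiB) free",
        "reserved": r"([\d.]+) (\wiB) reserved",
    }
    sizes = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            sizes[name] = float(match.group(1)) * UNIT_TO_MIB[match.group(2)]
    return sizes


sizes = parse_oom(MESSAGE)
# Memory cached by PyTorch's allocator but not currently backing any tensor.
cached_unused = sizes["reserved"] - sizes["allocated"]
# Memory on the card that is outside PyTorch's pool (CUDA context, other processes).
outside_pool = sizes["total"] - sizes["reserved"]

print(f"requested:      {sizes['requested']:.2f} MiB")
print(f"cached, unused: {cached_unused:.2f} MiB")
print(f"outside pool:   {outside_pool:.2f} MiB")
```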
For training I used the sagemaker.pytorch.estimator.PyTorch class.
I tried several instance types, from ml.m5 and g4dn to p3 (even one with 96 GB of memory). On ml.m5 the job fails with a CPU memory error, on g4dn with a GPU memory error, and on p3 with a GPU memory error as well, most likely because PyTorch is using only one 12 GB GPU out of the 8 × 12 GB available.
I'm not getting anywhere with this training. Even locally, on a CPU-only machine, I got the following error:
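To check how many GPUs the training process can actually see, something like the following sketch could be run at the top of the training script. The helper name `describe_gpus` is hypothetical. The `torch.cuda` calls are standard PyTorch, and the import is guarded so the snippet also runs (and returns an empty list) where PyTorch or CUDA is unavailable.

```python
try:
    import torch
except ImportError:  # lets the sketch run on a machine without PyTorch
    torch = None


def describe_gpus():
    """Return one dict per CUDA device visible to this process."""
    if torch is None or not torch.cuda.is_available():
        return []
    gpus = []
    for index in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(index)
        gpus.append({
            "index": index,
            "name": props.name,
            "total_gib": props.total_memory / 2**30,
            # Memory currently occupied by tensors from this process.
            "allocated_gib": torch.cuda.memory_allocated(index) / 2**30,
        })
    return gpus


for gpu in describe_gpus():
    print(gpu)
```

If this lists eight devices but only GPU 0 ever shows allocated memory, the extra GPUs are visible but unused by the training loop.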
RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 67108864 bytes. Buy new RAM!
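For scale, the byte count in that message converts to a fairly small request, which suggests the machine's RAM was already exhausted before this allocation. A quick check of the arithmetic:

```python
requested_bytes = 67_108_864  # figure quoted in the CPU allocator error above
requested_mib = requested_bytes / 2**20  # 1 MiB = 2**20 bytes

print(f"{requested_bytes} bytes = {requested_mib:.0f} MiB")  # prints "67108864 bytes = 64 MiB"
```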
The model training script:
# Imports added for completeness; paths assume a Flair release that provides flair.datasets
from torch.optim import Adam
from flair.datasets import ClassificationCorpus
from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentLSTMEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

corpus = ClassificationCorpus(data_folder, test_file='../data/exports/val.csv', train_file='../data/exports/train.csv')
print("finished loading corpus")
word_embeddings = [WordEmbeddings('glove'), FlairEmbeddings('news-forward-fast'), FlairEmbeddings('news-backward-fast')]
document_embeddings = DocumentLSTMEmbeddings(word_embeddings, hidden_size=512, reproject_words=True, reproject_words_dimension=256)
classifier = TextClassifier(document_embeddings, label_dictionary=corpus.make_label_dictionary(), multi_label=False)
trainer = ModelTrainer(classifier, corpus, optimizer=Adam)
trainer.train('../model_files', max_epochs=12, learning_rate=0.0001, train_with_dev=False, embeddings_storage_mode="none")
P.S.: I was able to train the same architecture on a smaller dataset on my local GPU machine, a GTX 1650 with 4 GB of GDDR5 memory, and it was really quick.