0
votes

How can I resolve memory issues when training a CNN on SageMaker by increasing the number of instances, rather than changing the amount of memory each instance has?

Using a larger instance does work, but I want to solve the problem by distributing the training across more instances. When I use more instances, I get a memory allocation error instead.

Here is the code I am running in a Jupyter notebook cell:

from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(entry_point='train_aws.py',
                       role=role,
                       framework_version='1.12.0',
                       training_steps=100,
                       evaluation_steps=100,
                       hyperparameters={'learning_rate': 0.01},
                       train_instance_count=2,
                       train_instance_type='ml.c4.xlarge')

estimator.fit(inputs)

I thought that adding more instances would increase the total memory available, but I still get the allocation error.

1
Try reducing the learning rate, e.g. hyperparameters={'learning_rate': 0.00000001}. You should provide the full code for a proper analysis. - user1410665

1 Answer

0
votes

Adding more instances increases the overall memory available to the training job, but not the maximum memory that each individual training instance can use.

Most likely, reducing the batch size in your code will resolve the error, since each batch still has to fit in a single instance's memory.
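For example, in train_aws.py (a minimal sketch; the batch_size hyperparameter name and the tf.data input pipeline are assumptions, not taken from your code):

import argparse

import tensorflow as tf

# Assumption: batch_size is passed in through the estimator's hyperparameters,
# e.g. hyperparameters={'learning_rate': 0.01, 'batch_size': 16}.
# In script mode, SageMaker forwards hyperparameters as command-line arguments.
parser = argparse.ArgumentParser()
parser.add_argument('--batch_size', type=int, default=16)
args, _ = parser.parse_known_args()

# Smaller batches mean smaller tensors held in instance memory at any one time.
dataset = tf.data.TFRecordDataset('train.tfrecords')  # placeholder file name
dataset = dataset.batch(args.batch_size)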

When you create a training job in SageMaker, your code is installed on each of the instances and the data from S3 is copied to them as well. Your code accesses the data from the instance's local volume (usually EBS), much as it would when training locally. On each instance, SageMaker performs the following steps:

  • starts a Docker container optimized for TensorFlow
  • downloads the dataset
  • sets up training-related environment variables (sketched below)
  • sets up the distributed training environment, if configured to use a parameter server
  • starts asynchronous training
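In script mode, the entry point can inspect this per-instance setup through standard SageMaker environment variables; a minimal sketch of what train_aws.py could read (assuming the input channel is named 'training'):

import json
import os

hosts = json.loads(os.environ['SM_HOSTS'])       # all instances in the training job
current_host = os.environ['SM_CURRENT_HOST']     # this instance's own hostname
train_dir = os.environ['SM_CHANNEL_TRAINING']    # local path of the copied 'training' channel

print('%s is one of %d hosts; data copied to %s' % (current_host, len(hosts), train_dir))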

To benefit from the distribution, you should enable TensorFlow's distributed training options (see: https://sagemaker.readthedocs.io/en/stable/using_tf.html#distributed-training):

from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(entry_point='train_aws.py',
                          role=role,
                          train_instance_count=2,
                          train_instance_type='ml.c4.xlarge',
                          framework_version='1.12.0',
                          py_version='py3',
                          # py3 selects script mode, which does not accept the
                          # legacy-mode training_steps/evaluation_steps arguments;
                          # pass them to the entry point as hyperparameters instead.
                          hyperparameters={'learning_rate': 0.01,
                                           'training_steps': 100,
                                           'evaluation_steps': 100},
                          distributions={'parameter_server': {'enabled': True}})

tf_estimator.fit('s3://bucket/path/to/training/data')

Also note that you can shard your data from S3 evenly across the training instances instead of fully replicating it, which can make your training faster as each instance only sees part of the data.
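For example, using the SDK's s3_input helper (a sketch; the bucket path is the same placeholder as above):

import sagemaker

# ShardedByS3Key gives each instance a different subset of the S3 objects;
# the default, FullyReplicated, copies the complete dataset to every instance.
train_input = sagemaker.session.s3_input('s3://bucket/path/to/training/data',
                                         distribution='ShardedByS3Key')

tf_estimator.fit(train_input)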