How can I resolve memory issues when training a CNN on SageMaker by increasing the number of instances, rather than changing the amount of memory each instance has?
Switching to a larger instance type does work, but I want to solve the problem by distributing the training across more instances. Adding instances, however, ends up producing a memory allocation error instead.
Here is the code I am running in a Jupyter notebook cell:
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(entry_point='train_aws.py',
                       role=role,
                       framework_version='1.12.0',
                       training_steps=100,
                       evaluation_steps=100,
                       hyperparameters={'learning_rate': 0.01},
                       train_instance_count=2,
                       train_instance_type='ml.c4.xlarge')
estimator.fit(inputs)
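One thing I am unsure about is whether distribution needs to be enabled explicitly. From my reading of the SageMaker Python SDK (v1) docs, script-mode TensorFlow estimators accept a `distributions` argument to turn on parameter-server training; I am not sure it applies to the legacy-mode parameters (`training_steps`, `evaluation_steps`) I am using above, so this is just a sketch of what I tried to adapt:

```python
from sagemaker.tensorflow import TensorFlow

# Sketch only: script-mode estimator with parameter-server distribution
# enabled (per the SDK v1 docs); 'role' and 'inputs' are defined elsewhere.
estimator = TensorFlow(entry_point='train_aws.py',
                       role=role,
                       framework_version='1.12.0',
                       script_mode=True,
                       hyperparameters={'learning_rate': 0.01},
                       train_instance_count=2,
                       train_instance_type='ml.c4.xlarge',
                       distributions={'parameter_server': {'enabled': True}})
```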
I expected that adding more instances would increase the total memory available to the job, but instead it fails with an allocation error.
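My current understanding (which may be wrong, and is part of what I want confirmed) is that data-parallel training replicates the full model on every instance, so only the per-batch memory is divided among workers. A toy calculation of that reasoning, with entirely made-up numbers:

```python
def per_instance_memory_mb(model_mb, batch_mb, n_instances):
    """Rough data-parallel memory model: every instance holds a full copy
    of the model; only the batch (activations/gradients) is sharded."""
    return model_mb + batch_mb / n_instances

# Hypothetical: a 3000 MB model plus 1000 MB of batch-related memory.
print(per_instance_memory_mb(3000, 1000, 1))  # → 4000.0 on one instance
print(per_instance_memory_mb(3000, 1000, 2))  # → 3500.0 on each of two
```

If that model is right, adding instances can never shrink the fixed model footprint, which would explain why only a larger instance type helps.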