0 votes

I am using AWS SageMaker for model training and deployment. Here is a sample example of model training:

from sagemaker.estimator import Estimator

hyperparameters = {'train-steps': 10}
instance_type = 'ml.m4.xlarge'

# role, ecr_image and data_location are defined earlier in the notebook:
# an IAM execution role, the ECR URI of my custom image, and the S3
# location of the training data.
estimator = Estimator(role=role,
                      train_instance_count=1,
                      train_instance_type=instance_type,
                      image_name=ecr_image,
                      hyperparameters=hyperparameters)

estimator.fit(data_location)

The Docker image mentioned here contains a TensorFlow setup.

Suppose it takes 1000 seconds to train the model on a single instance. If I increase the instance count to 5, the total training time increases 5 times, i.e. to 5000 seconds. As per my understanding, the training job should be distributed across the 5 machines, so ideally it would take around 200 seconds per machine, but it seems each machine is doing its own separate training run. Can someone please explain how this works on distributed systems, either in general or specifically with TensorFlow?
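For reference, the multi-instance run is the same code with only the instance count changed (role, ecr_image, data_location and the other variables as above):

estimator = Estimator(role=role,
                      train_instance_count=5,  # only change from the single-instance run
                      train_instance_type=instance_type,
                      image_name=ecr_image,
                      hyperparameters=hyperparameters)

estimator.fit(data_location)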

I tried to find the answer in this documentation https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-dg.pdf but the way training works across distributed machines does not seem to be covered there.


1 Answer

1 vote

Are you using the TensorFlow estimator APIs in your script? If so, you should run the script by wrapping it in the sagemaker.tensorflow.TensorFlow class, as described in the documentation here. If you run training that way, parallelization and communication between instances work out of the box; a sketch follows below.
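A minimal sketch of what that looks like, assuming a training script train.py that uses the TF estimator API and v1-style SDK parameter names as in your question (framework_version and instance settings here are purely illustrative):

from sagemaker.tensorflow import TensorFlow

# entry_point is your own training script; SageMaker runs it on every
# instance and wires up distributed training via a parameter server.
tf_estimator = TensorFlow(entry_point='train.py',
                          role=role,
                          train_instance_count=5,
                          train_instance_type='ml.m4.xlarge',
                          framework_version='1.12',
                          py_version='py3',
                          distributions={'parameter_server': {'enabled': True}},
                          hyperparameters={'train-steps': 10})

tf_estimator.fit(data_location)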

But note that scaling will not be linear when you increase the number of instances. Communication between instances takes time, and there may be non-parallelizable bottlenecks in your script, such as loading data into memory.
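To get an intuition for the effect, here is a back-of-the-envelope Amdahl's-law estimate (the serial_fraction value is purely illustrative):

# Amdahl's law: speedup is capped by the fraction of work that cannot
# be parallelized (data loading, checkpointing, communication).
def expected_speedup(n_instances, serial_fraction):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_instances)

# With 20% serial work, 5 instances give roughly 2.8x, not 5x.
for n in (1, 2, 5, 10):
    print(n, round(expected_speedup(n, serial_fraction=0.2), 2))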