1
votes

I would like to launch multiple Amazon EC2 spot instances (fleet?) using a custom AMI (docker?) for performing a deep-learning training task. I would like all the instances to share a common set of files for the purposes of training the model.

The idea here is not to lose training history and keep a backup in EBS (network drive?) when the spot instance is terminated by AWS due to pricing-limit/demand. The task state can be updated in a file and then resumed when instances are available.

Is it possible to launch all instances and let them work cooperatively to complete the training task? What kind of a setup could accomplish this?


1 Answer

2
votes

Firstly, you might be interested in the Deep Learning AMI from the AWS Marketplace, which comes fully configured with popular deep learning tools.

If the software you are using needs to save its data to a local file system (as opposed to Amazon S3), then you could use Amazon Elastic File System (EFS) to share a file system amongst multiple Amazon EC2 instances (including Spot instances). Amazon EFS is similar to a NAS and can be mounted on multiple instances simultaneously.
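To illustrate the "save state to a file, resume later" idea from the question, here is a minimal sketch of checkpointing to a shared file system. It assumes the shared volume is mounted at `/mnt/efs` and that the training state fits in a small JSON file; the path, the `save_state`/`load_state` helpers, and the `epoch` field are all hypothetical names chosen for the example, not part of any AWS API.

```python
import json
import os
import tempfile

# Hypothetical location on the shared file system (assumed mount point).
CHECKPOINT_PATH = "/mnt/efs/training/state.json"


def save_state(state: dict, path: str) -> None:
    """Write state to a temp file in the target directory, then rename it into place.

    The rename means a reader sees either the old file or the new one,
    never a half-written checkpoint if the instance is terminated mid-write.
    """
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_state(path: str, default: dict) -> dict:
    """Return the saved state, or a copy of `default` if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return dict(default)
```

A training loop would call `load_state(CHECKPOINT_PATH, {"epoch": 0})` on startup and `save_state(...)` after each epoch. If several instances write to the same file, they would still need some coordination (for example, one checkpoint file per worker); this sketch does not address that.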

The EFS volume could be mounted via a User Data script, together with a setup script to load and run your desired application (which can be easier than making a new AMI).
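As a rough sketch of what such a User Data script might contain, the helper below assembles one as a string. The function name `build_user_data`, the file system ID, the region, the mount point, and the start command are all placeholders for illustration. It also assumes the AMI already has an NFS client installed and that the file system's security group allows NFS traffic (port 2049) from the instances.

```python
def build_user_data(file_system_id: str, region: str,
                    mount_point: str = "/mnt/efs",
                    start_command: str = "") -> str:
    """Build a bash User Data script that mounts an EFS file system over NFS.

    All arguments are supplied by the caller; nothing here talks to AWS.
    """
    # EFS file systems are reachable at <id>.efs.<region>.amazonaws.com
    dns_name = f"{file_system_id}.efs.{region}.amazonaws.com"
    mount_options = ("nfsvers=4.1,rsize=1048576,wsize=1048576,"
                     "hard,timeo=600,retrans=2,noresvport")
    lines = [
        "#!/bin/bash",
        "set -e",
        f"mkdir -p {mount_point}",
        f"mount -t nfs4 -o {mount_options} {dns_name}:/ {mount_point}",
    ]
    if start_command:
        # Hypothetical setup/start step, e.g. launching the training job.
        lines.append(start_command)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values; substitute your own file system ID and region.
    print(build_user_data("fs-12345678", "us-east-1",
                          start_command="python3 /mnt/efs/train.py"))
```

The resulting text can be pasted into the User Data field when launching the instances or the Spot request, so every instance mounts the same file system on boot.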