I would like to launch multiple Amazon EC2 Spot Instances (as a Spot Fleet?) from a custom AMI (or a Docker image?) to run a deep-learning training task. All of the instances should share a common set of files for training the model.
The idea is to avoid losing training history by keeping a backup on EBS (or some other network drive?) when AWS terminates a spot instance because the spot price exceeds my limit or demand rises. The task state could be written to a file, and training could then resume from that file when instances become available again.
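To make the resume idea concrete, here is a minimal Python sketch of what I have in mind. It only shows the checkpoint-and-resume logic, not any AWS calls. The names `save_state`, `load_state`, `train`, and `stop_after`, and the JSON state layout, are made up for illustration, and the loop body is a placeholder for a real training step. I am assuming the state file lives on storage that outlives the instance.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Write the state to a temp file in the same directory, then rename it
    over the target, so a termination mid-write should not leave a
    half-written state file (assumes the filesystem supports atomic rename)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_state(path):
    """Return the saved state, or a fresh one if no state file exists yet."""
    if not os.path.exists(path):
        return {"epoch": 0}
    with open(path) as f:
        return json.load(f)


def train(path, total_epochs, stop_after=None):
    """Resume from the saved epoch and save the state after every epoch.

    stop_after is a hypothetical knob that simulates an interruption
    after that many epochs in the current run.
    """
    state = load_state(path)
    ran = 0
    for epoch in range(state["epoch"], total_epochs):
        # Placeholder: a real training step for this epoch would go here.
        state["epoch"] = epoch + 1
        save_state(path, state)
        ran += 1
        if stop_after is not None and ran >= stop_after:
            break
    return state
```

For example, calling `train("state.json", 5, stop_after=2)` and then `train("state.json", 5)` would run epochs 1 and 2 in the first call and pick up at epoch 3 in the second. In a real setup the model weights would need to be saved alongside this file as well.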
Is it possible to launch all of the instances and have them work cooperatively to complete the training task? What kind of setup could accomplish this?