
So in TensorFlow's guide for using GPUs, there is a section on using multiple GPUs in a "multi-tower fashion":

...
for d in ['/device:GPU:2', '/device:GPU:3']:
  with tf.device(d): # <---- manual device placement
...
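
Fleshed out, the example from the guide looks roughly like this (one matmul per GPU, with the per-tower results summed on the CPU; the constants are just illustrative):

import tensorflow as tf

c = []
for d in ['/device:GPU:2', '/device:GPU:3']:
  with tf.device(d):  # manual placement: pin this tower's ops to GPU d
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
    c.append(tf.matmul(a, b))
with tf.device('/device:CPU:0'):
  total = tf.add_n(c)  # combine the per-tower results on the CPU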

Seeing this, one might be tempted to adopt this style for multi-GPU training in a custom Estimator, to indicate to the model that it can be distributed across multiple GPUs efficiently.

To my knowledge, in the absence of manual device placement TensorFlow does not perform any form of optimal device mapping (except perhaps that, if you have the GPU build installed and a GPU is available, it will be used over the CPU). So what other choice do you have?

Anyway, you carry on training your estimator, export it to a SavedModel via estimator.export_savedmodel(...), and wish to use this SavedModel later... perhaps on a different machine, one which may not have as many GPUs as the machine the model was trained on (or maybe no GPUs at all).

So when you run

from tensorflow.contrib import predictor
predict_fn = predictor.from_saved_model(model_dir)

you get

Cannot assign a device for operation <OP-NAME>. Operation was 
explicitly assigned to <DEVICE-NAME> but available devices are 
[<AVAILABLE-DEVICE-0>,...]

An older S.O. post suggests that changing device placement was not possible... but hopefully things have changed since then.

Thus my questions are:

  1. When loading a SavedModel, can I change the device placement to suit the device it is loaded on? E.g. if I train a model with 6 GPUs and a friend wants to run it at home with their eGPU, can they remap '/device:GPU:1' through '/device:GPU:5' to '/device:GPU:0'?

  2. If 1 is not possible, is there a (painless) way, in the custom Estimator's model_fn, to specify how to distribute the graph generically?

e.g.

with tf.device('available-gpu-3')

where available-gpu-3 is the third available GPU if there are three or more GPUs, otherwise the last available GPU, and the CPU if there is no GPU at all (a sketch of such a helper follows below).
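
To be clear, tf.device does not accept such a name; this would have to be built by hand. A minimal sketch of such a helper, assuming TF 1.x's device_lib utility for listing local devices (available_device is a hypothetical name, not an existing API):

import tensorflow as tf
from tensorflow.python.client import device_lib

def available_device(gpu_index):
  # List the GPUs actually visible on this machine.
  gpus = [d.name for d in device_lib.list_local_devices()
          if d.device_type == 'GPU']
  if not gpus:
    return '/device:CPU:0'  # no GPUs at all: fall back to the CPU
  # Clamp the requested index to the GPUs that exist.
  return gpus[min(gpu_index, len(gpus) - 1)]

# e.g. in the model_fn: the third GPU if present, else the last one, else CPU
with tf.device(available_device(2)):
  pass  # build this tower's ops here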

This matters because a shared machine might be training two models at once, say one model on '/device:GPU:0' while the other is trained explicitly on GPUs 1 and 2... so on another machine with only 2 GPUs, GPU 2 will not be available.


1 Answer


I have been doing some research on this topic recently, and to my knowledge your question 1 can only work if you cleared all devices when you exported the model in the original TensorFlow code, with the flag clear_devices=True.

In my own code, it looks like this:

builder = tf.saved_model.builder.SavedModelBuilder('osvos_saved')
# clear_devices=True strips the explicit device annotations from the graph
builder.add_meta_graph_and_variables(sess, ['serve'], clear_devices=True)
builder.save()
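
As a quick check that the re-export worked, loading it back with the predictor API from the question should now succeed even on a machine with a different device layout (using the 'osvos_saved' directory from above):

from tensorflow.contrib import predictor

# With the device annotations stripped, this should no longer fail when the
# machine's GPUs differ from those the model was trained on.
predict_fn = predictor.from_saved_model('osvos_saved')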

If you only have the exported model, it seems not to be possible. You can refer to this issue.

I'm currently trying to find a way to fix this, as stated in my Stack Overflow question. I hope the workaround can help you.