
I am submitting a Keras multi_gpu_model job with gpus=8, using the following config.yaml:

trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_l_gpu
  workerType: standard_gpu
  parameterServerType: standard_gpu
  workerCount: 0
  parameterServerCount: 0

I am getting the following error.

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 247, in <module>
    main()
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 243, in main
    run()
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 112, in run
    run_training(args, unique_id)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 139, in run_training
    unet, template_model = model_lib.train_model(args)
  File "/root/.local/lib/python2.7/site-packages/trainer/model.py", line 133, in train_model
    model, template_model = unet_network(args.image_size)
  File "/root/.local/lib/python2.7/site-packages/trainer/model.py", line 106, in unet_network
    model = multi_gpu_model(template_model, gpus=8)
  File "/root/.local/lib/python2.7/site-packages/keras/utils/training_utils.py", line 132, in multi_gpu_model
    available_devices))
ValueError: To call multi_gpu_model with gpus=8, we expect the following devices to be available: ['/cpu:0', '/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3', '/gpu:4', '/gpu:5', '/gpu:6', '/gpu:7']. However this machine only has: ['/cpu:0']. Try reducing gpus.

According to the documentation, complex_model_l_gpu should give me 8 GPUs. Has anyone seen this? Any idea how to resolve it?
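For reference, here is a quick way to dump what TensorFlow actually sees from inside the job (a sketch only; the helper name log_available_devices is mine and not part of the trainer code). If it prints only a CPU entry, the GPU build of TensorFlow is not what is running.

    # Sketch: list the devices TensorFlow reports before calling multi_gpu_model.
    import tensorflow as tf
    from tensorflow.python.client import device_lib

    def log_available_devices():
        names = [d.name for d in device_lib.list_local_devices()]
        print 'Available devices: %s' % names  # expect GPU entries on complex_model_l_gpu
        return names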

Per the request in the comments below, I ran a vanilla TensorFlow GPU graph with the following:

import tensorflow as tf

# Pin one small constant to the CPU and to each of the 8 GPUs that the
# complex_model_l_gpu tier is supposed to expose.
with tf.device('/cpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0], shape=[2, 2], name='a')
with tf.device('/gpu:0'):
    b = tf.constant([1.0, 2.0, 3.0, 4.0], shape=[2, 2], name='b')
with tf.device('/gpu:1'):
    c = tf.constant([1.0, 2.0, 3.0, 4.0], shape=[2, 2], name='c')
with tf.device('/gpu:2'):
    d = tf.constant([1.0, 2.0, 3.0, 4.0], shape=[2, 2], name='d')
with tf.device('/gpu:3'):
    e = tf.constant([1.0, 2.0, 3.0, 4.0], shape=[2, 2], name='e')
with tf.device('/gpu:4'):
    f = tf.constant([1.0, 2.0, 3.0, 4.0], shape=[2, 2], name='f')
with tf.device('/gpu:5'):
    g = tf.constant([1.0, 2.0, 3.0, 4.0], shape=[2, 2], name='g')
with tf.device('/gpu:6'):
    h = tf.constant([1.0, 2.0, 3.0, 4.0], shape=[2, 2], name='h')
with tf.device('/gpu:7'):
    i = tf.constant([1.0, 2.0, 3.0, 4.0], shape=[2, 2], name='i')

# Multiply two of the GPU-pinned tensors and log where each op is actually placed.
cd = tf.matmul(c, d)

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print sess.run(cd)  # Python 2 print statement, matching the 2.7 runtime above

and got the following error:

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 39, in <module>
    main()
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 35, in main
    run()
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 31, in run
    print sess.run(cd)
  File "/root/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/root/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/root/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/root/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
InvalidArgumentError: Cannot assign a device for operation 'i': Operation was explicitly assigned to /device:GPU:7 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0 ]. Make sure the device specification refers to a valid device.
  [[Node: i = Constdtype=DT_FLOAT, value=Tensor, _device="/device:GPU:7"]]

Caused by op u'i', defined at:
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 39, in <module>
    main()
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 35, in main
    run()
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 26, in run
    i = tf.constant([1.0, 2.0, 3.0, 4.0], shape=[2, 2], name='i')
  File "/root/.local/lib/python2.7/site-packages/tensorflow/python/framework/constant_op.py", line 214, in constant
    name=name).outputs[0]
  File "/root/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/root/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'i': Operation was explicitly assigned to /device:GPU:7 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0 ]. Make sure the device specification refers to a valid device.
  [[Node: i = Constdtype=DT_FLOAT, value=Tensor, _device="/device:GPU:7"]]
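A side note on the error itself: because each constant is explicitly pinned with tf.device and soft placement is off, TensorFlow raises instead of silently falling back to the CPU. Relaxing that (a sketch, not what the job actually ran) only hides the symptom; the GPUs are still not visible.

    # Sketch: reuses the graph above (tf and cd are defined there).
    # allow_soft_placement lets ops fall back to whatever devices exist, so the
    # InvalidArgumentError goes away, but everything still lands on /cpu:0.
    config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
    with tf.Session(config=config) as sess:
        print sess.run(cd)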

Comments:

Which runtime version did you use for the job? – Guoqing Xu

I am assuming you mean the runtime version of Keras, which was 2.1.3. I did force TensorFlow to be 1.4.0 rather than 1.4.1, if that was the ask... – Brian F

This might be related to Keras. Can you try a vanilla TensorFlow GPU sample in a new job, please? Meanwhile, please send your project and job ID to [email protected] so that we can investigate the issue from our side as well. Thanks! – Guoqing Xu

I tried a vanilla TensorFlow GPU sample: a graph with just with tf.device('/cpu:0'): a = tf.constant([1.0, 2.0, 3.0, 4.0], shape=[2, 2], name='a') and with tf.device('/gpu:0'): b = tf.constant([1.0, 2.0, 3.0, 4.0], shape=[2, 2], name='b') for the CPU and all 8 expected GPUs. I got a similar error; its message is added above. Same project info as I sent via email; job ID mwp_dsb_cnn_20180212_202740. – Brian F

1 Answer


Could you check your job log to see whether you re-installed the CPU build of TensorFlow? You should see something like "Downloading tensorflow-1.4.0..." in the log.
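If the log is ambiguous, a small check run from inside the job also tells the two builds apart (a sketch using standard tf.test helpers, not something your trainer already contains):

    # Sketch: distinguish the CPU-only wheel from tensorflow-gpu from inside the job.
    import tensorflow as tf

    print tf.__version__                # should report 1.4.x
    print tf.test.is_built_with_cuda()  # False => the CPU-only package is installed
    print tf.test.gpu_device_name()     # empty string => no GPU is visible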

Please note that the TensorFlow GPU packages live at https://pypi.python.org/pypi/tensorflow-gpu/1.4.0, not https://pypi.python.org/pypi/tensorflow/1.4.0. And you don't need to re-install TensorFlow at all if you pass runtime_version as 1.4.
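In practice, a common way this happens is a trainer setup.py that lists tensorflow in install_requires, which pulls the CPU-only wheel in on top of the preinstalled GPU build. A minimal sketch of a setup.py that avoids this (the package name and version below are placeholders, not taken from your project):

    # setup.py -- sketch; name/version are placeholders.
    # 'tensorflow' is deliberately NOT listed: with runtime_version 1.4 the
    # Cloud ML Engine image already ships the GPU build, and pinning
    # 'tensorflow' here would install the CPU-only wheel over it.
    from setuptools import find_packages, setup

    setup(
        name='trainer',
        version='0.1',
        packages=find_packages(),
        install_requires=['keras==2.1.3'],
    )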