I am submitting a training job that builds a Keras multi_gpu_model with gpus=8, using the following config.yaml:
trainingInput:
  scaleTier: CUSTOM
  masterType: complex_model_l_gpu
  workerType: standard_gpu
  parameterServerType: standard_gpu
  workerCount: 0
  parameterServerCount: 0
The job fails with the following error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 247, in <module>
    main()
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 243, in main
    run()
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 112, in run
    run_training(args, unique_id)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 139, in run_training
    unet, template_model = model_lib.train_model(args)
  File "/root/.local/lib/python2.7/site-packages/trainer/model.py", line 133, in train_model
    model, template_model = unet_network(args.image_size)
  File "/root/.local/lib/python2.7/site-packages/trainer/model.py", line 106, in unet_network
    model = multi_gpu_model(template_model, gpus=8)
  File "/root/.local/lib/python2.7/site-packages/keras/utils/training_utils.py", line 132, in multi_gpu_model
    available_devices))
ValueError: To call `multi_gpu_model` with `gpus=8`, we expect the following devices to be available: ['/cpu:0', '/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3', '/gpu:4', '/gpu:5', '/gpu:6', '/gpu:7']. However this machine only has: ['/cpu:0']. Try reducing `gpus`.
According to the documentation, a complex_model_l_gpu master should have 8 GPUs available, yet only the CPU is detected. Has anyone seen this, or does anyone know how to resolve it?
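To make the mismatch easier to catch before multi_gpu_model is called, a small guard along these lines could count the GPUs in a list of device names. This is only a sketch: count_gpus and check_gpus are hypothetical helper names, not part of Keras or TensorFlow, and the device names are assumed to look like the ones in the error messages in this post ('/gpu:0', '/device:GPU:7', and so on).

```python
import re

# Matches a trailing "gpu:<index>" component, as in '/gpu:0', '/device:GPU:7'
# or '/job:localhost/replica:0/task:0/device:GPU:3'. The leading [/:] keeps
# names such as 'XLA_GPU:0' or '/cpu:0' from being counted.
_GPU_NAME = re.compile(r"(?:^|[/:])gpu:(\d+)$", re.IGNORECASE)


def count_gpus(device_names):
    """Return how many distinct GPU indices appear in device_names."""
    indices = set()
    for name in device_names:
        match = _GPU_NAME.search(name.strip())
        if match:
            indices.add(int(match.group(1)))
    return len(indices)


def check_gpus(requested, device_names):
    """Raise a descriptive error if fewer GPUs are visible than requested."""
    found = count_gpus(device_names)
    if found < requested:
        raise RuntimeError(
            "requested %d GPUs but only %d visible in %r"
            % (requested, found, list(device_names)))
    return found
```

With the device list from the error above, check_gpus(8, ['/cpu:0']) would raise immediately, which at least separates "the machine exposes no GPUs to TensorFlow" from any problem in the model code itself.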
Per the request in the comments below, I ran a plain TensorFlow graph that pins one constant to each GPU:
import tensorflow as tf

# Pin one constant to the CPU and one to each of the 8 expected GPUs.
with tf.device('/cpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0], shape=[2, 2], name='a')
with tf.device('/gpu:0'):
    b = tf.constant([1.0, 2.0, 3.0, 4.0], shape=[2, 2], name='b')
with tf.device('/gpu:1'):
    c = tf.constant([1.0, 2.0, 3.0, 4.0], shape=[2, 2], name='c')
with tf.device('/gpu:2'):
    d = tf.constant([1.0, 2.0, 3.0, 4.0], shape=[2, 2], name='d')
with tf.device('/gpu:3'):
    e = tf.constant([1.0, 2.0, 3.0, 4.0], shape=[2, 2], name='e')
with tf.device('/gpu:4'):
    f = tf.constant([1.0, 2.0, 3.0, 4.0], shape=[2, 2], name='f')
with tf.device('/gpu:5'):
    g = tf.constant([1.0, 2.0, 3.0, 4.0], shape=[2, 2], name='g')
with tf.device('/gpu:6'):
    h = tf.constant([1.0, 2.0, 3.0, 4.0], shape=[2, 2], name='h')
with tf.device('/gpu:7'):
    i = tf.constant([1.0, 2.0, 3.0, 4.0], shape=[2, 2], name='i')

# Multiply two of the GPU-pinned constants and log where each op is placed.
cd = tf.matmul(c, d)
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print sess.run(cd)
and got the following error:
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 39, in <module>
    main()
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 35, in main
    run()
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 31, in run
    print sess.run(cd)
  File "/root/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/root/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/root/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/root/.local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
InvalidArgumentError: Cannot assign a device for operation 'i': Operation was explicitly assigned to /device:GPU:7 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0 ]. Make sure the device specification refers to a valid device.
  [[Node: i = Constdtype=DT_FLOAT, value=Tensor, _device="/device:GPU:7"]]

Caused by op u'i', defined at:
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 39, in <module>
    main()
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 35, in main
    run()
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 26, in run
    i = tf.constant([1.0, 2.0, 3.0, 4.0], shape=[2, 2], name='i')
  File "/root/.local/lib/python2.7/site-packages/tensorflow/python/framework/constant_op.py", line 214, in constant
    name=name).outputs[0]
  File "/root/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/root/.local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'i': Operation was explicitly assigned to /device:GPU:7 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0 ]. Make sure the device specification refers to a valid device.
  [[Node: i = Constdtype=DT_FLOAT, value=Tensor, _device="/device:GPU:7"]]
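Both tracebacks show tensorflow being imported from /root/.local/lib/python2.7/site-packages, so it may be worth logging which TensorFlow build the job actually loads and which devices that build reports. The following is a rough diagnostic sketch, not a fix: describe_tensorflow is a hypothetical helper name, and it assumes that tensorflow.python.client.device_lib.list_local_devices() is available in the installed version.

```python
from __future__ import print_function


def describe_tensorflow():
    """Return version, install path and visible devices, or None without TF."""
    try:
        import tensorflow as tf
        from tensorflow.python.client import device_lib
    except ImportError:
        # TensorFlow is not installed in this environment.
        return None
    return {
        "version": tf.__version__,
        "path": tf.__file__,
        "devices": [d.name for d in device_lib.list_local_devices()],
    }


if __name__ == "__main__":
    # Print the report at the start of the job so it ends up in the job logs.
    print(describe_tensorflow())
```

If the reported device list contains only a CPU entry, as in the errors above, the installed path and version should help tell whether the job is running a build without GPU support or whether the machine itself exposes no GPUs.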