
I am trying to train my models on multiple GPUs, so I ran cifar10_multi_gpu_train.py (https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_multi_gpu_train.py).

1. My environment:


OS Platform : Linux version 3.10.0-327.el7.x86_64

TensorFlow installed : pip install --upgrade ./tensorflow_gpu-1.0.0rc0-cp35-cp35m-linux_x86_64.whl

Python version: Python 3.5.2

CUDA/cuDNN version: cuda_8.0.61_375.26_linux.run / cudnn-8.0-linux-x64-v5.1.tgz

2. GPU setup is correct

import tensorflow as tf

# Create the two constants on the CPU.
with tf.device('/cpu:0'):
    a = tf.constant([1.0, 2.0, 3.0], shape=[3], name='a')
    b = tf.constant([1.0, 2.0, 3.0], shape=[3], name='b')

# Add them on the second GPU.
with tf.device('/gpu:1'):
    c = a + b

# Log where each op is placed.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
sess.run(c)

add: (Add): /job:localhost/replica:0/task:0/gpu:1
I tensorflow/core/common_runtime/simple_placer.cc:841] add: (Add)/job:localhost/replica:0/task:0/gpu:1
b: (Const): /job:localhost/replica:0/task:0/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:841] b: (Const)/job:localhost/replica:0/task:0/cpu:0
a: (Const): /job:localhost/replica:0/task:0/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:841] a: (Const)/job:localhost/replica:0/task:0/cpu:0

array([ 2., 4., 6.], dtype=float32)

3. Running python cifar10_multi_gpu_train.py raises InvalidArgumentError:

I tensorflow/core/common_runtime/simple_placer.cc:669] Ignoring device specification /GPU:0 for node 'tower_0/fifo_queue_Dequeue' because the input edge from 'prefetch_queue/fifo_queue' is a reference connection and already has a device field set to /CPU:0

Traceback (most recent call last):
  File "/home/xx/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1022, in _do_call
    return fn(*args)
  File "/home/xx/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1000, in _run_fn
    self._extend_graph()
  File "/home/xx/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1049, in _extend_graph
    self._session, graph_def.SerializeToString(), status)
  File "/home/xx/anaconda3/lib/python3.5/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/home/xx/anaconda3/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 469, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))

tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device to node 'tower_0/softmax_linear/weight_loss_1': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.

[[Node: tower_0/softmax_linear/weight_loss_1 = ScalarSummary[T=DT_FLOAT, _device="/device:GPU:0"](tower_0/softmax_linear/weight_loss_1/tags, tower_0/softmax_linear/weight_loss)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "cifar10_multi_gpu_train.py", line 280, in <module>
    tf.app.run()
  File "/home/xx/anaconda3/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 44, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "cifar10_multi_gpu_train.py", line 276, in main
    train()
  File "cifar10_multi_gpu_train.py", line 237, in train
    sess.run(init)
  File "/home/xx/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 767, in run
    run_metadata_ptr)
  File "/home/xx/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 965, in _run
    feed_dict_string, options, run_metadata)
  File "/home/xx/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1015, in _do_run
    target_list, options, run_metadata)
  File "/home/xx/anaconda3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1035, in _do_call
    raise type(e)(node_def, op, message)

tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot assign a device to node 'tower_0/softmax_linear/weight_loss_1': Could not satisfy explicit device specification '/device:GPU:0' because no supported kernel for GPU devices is available.

[[Node: tower_0/softmax_linear/weight_loss_1 = ScalarSummary[T=DT_FLOAT, _device="/device:GPU:0"](tower_0/softmax_linear/weight_loss_1/tags, tower_0/softmax_linear/weight_loss)]]

I have tried many solutions, but none of them worked. Thanks in advance for any advice.


1 Answer


Sorry you're hitting problems! I checked with one of the original authors of that script, and here was his response:

It looks like the device placement is not working well.

  • The asker's test confirms that "cpu:0" and "gpu:1" are accessible, but it never checks "gpu:0", which is the device named in the error. I would check that.

  • The asker should also set allow_soft_placement=True in the tf.ConfigProto passed to tf.Session, to allow relaxed device placement.
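
To make the two suggestions above concrete, here is a minimal sketch, assuming the TensorFlow 1.x graph API that the question uses (tf.Session, tf.ConfigProto). The helper names session_options and check_device are hypothetical and are not part of the CIFAR-10 script. With strict placement, an unusable device should raise an error like the one in the question. With soft placement, TensorFlow may move an op such as the ScalarSummary node from the error message to a device that has a kernel for it.

```python
def session_options(soft_placement):
    """Keyword arguments for tf.ConfigProto (both are real ConfigProto fields)."""
    return {
        "allow_soft_placement": soft_placement,  # fall back to a supported device
        "log_device_placement": True,            # print where each op is placed
    }


def check_device(device_name, soft_placement=False):
    """Run a tiny add on device_name and return the result as a list.

    Assumption: TensorFlow 1.x is installed. With soft_placement=False,
    a device that cannot run the op raises InvalidArgumentError.
    """
    # Imported lazily so that session_options can be used without TensorFlow.
    import tensorflow as tf

    with tf.Graph().as_default():
        with tf.device(device_name):
            a = tf.constant([1.0, 2.0, 3.0], name="a")
            b = tf.constant([1.0, 2.0, 3.0], name="b")
            c = a + b
        config = tf.ConfigProto(**session_options(soft_placement))
        with tf.Session(config=config) as sess:
            return sess.run(c).tolist()


# Example usage (requires a TensorFlow 1.x install with GPU support):
#     for name in ("/cpu:0", "/gpu:0", "/gpu:1"):
#         print(name, check_device(name))
```

If check_device("/gpu:0") succeeds with strict placement, the GPU itself is probably fine, and passing allow_soft_placement=True when the training script creates its session is the more likely fix.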