
Apologies if this has been reported already somewhere else; I have been looking for it for quite some time without success.

While running the simple MNIST example (available on GitHub at /fchollet/keras/blob/master/examples/mnist_cnn.py) with Keras + TensorFlow on a P100 GPGPU, we encounter an issue at the intersection of Keras/TensorFlow/CUDA:

Using TensorFlow backend.
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: Tesla P100-PCIE-16GB
major: 6 minor: 0 memoryClockRate (GHz) 1.3285
pciBusID 0000:02:00.0
Total memory: 15.89GiB
Free memory: 15.51GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:02:00.0)
F tensorflow/core/common_runtime/gpu/gpu_device.cc:121] Check failed: err == cudaSuccess (71 vs. 0)
srun: error: nid02011: task 0: Aborted
srun: Terminating job step 1262138.0

We are using Keras 2.0.2, TensorFlow 1.0.0, and CUDA 8.0.53. We see this issue with both Python 2.7.12 and Python 3.5.2 (and with both Keras 1.2 and 2.0).

Bare TensorFlow test runs work fine, which leads us to think that this really is at the intersection of Keras/TensorFlow/CUDA.
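For reference, a "bare TensorFlow" run of the kind mentioned above could look like the sketch below (TF 1.x API, matching the versions in the question). This is my illustration, not the exact test the author ran; it is guarded so that it degrades gracefully on machines without TensorFlow or a GPU.

```python
def gpu_smoke_test():
    """Run a tiny matmul pinned to /gpu:0 (TF 1.x API) and return a status string."""
    try:
        import tensorflow as tf
        with tf.device("/gpu:0"):
            c = tf.matmul(tf.constant([[1.0, 2.0]]), tf.constant([[3.0], [4.0]]))
        # allow_soft_placement lets TF fall back to CPU if no GPU is visible
        with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
            return "ok: %s" % sess.run(c)
    except Exception as exc:  # no TensorFlow, no GPU, or a TF 2.x install without tf.Session
        return "skipped: %s" % exc

print(gpu_smoke_test())
```

If this succeeds while the Keras example aborts, the failure is indeed triggered by something Keras does differently when driving the same TensorFlow/CUDA stack.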

The same test runs fine on various machines with the same software versions but with TitanX GPGPUs.

We seem to have traced this back to line 121 of tensorflow/core/common_runtime/gpu/gpu_device.cc.

From the CUDA documentation on error types:

cudaErrorNotSupported = 71
This error indicates that the attempted operation is not supported on the current system or device.
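To make the crash line easier to read, here is a small sketch (my own, not part of the original report) that decodes the raw number in `Check failed: err == cudaSuccess (71 vs. 0)` using a few entries from the CUDA runtime's `cudaError` enum:

```python
# A handful of entries from the CUDA runtime error enum (driver_types.h),
# enough to decode the code seen in the crash above.
CUDA_ERRORS = {
    0: "cudaSuccess",
    2: "cudaErrorMemoryAllocation",
    30: "cudaErrorUnknown",
    71: "cudaErrorNotSupported",
}

def decode(err):
    """Mimic the CHECK in gpu_device.cc: report err versus cudaSuccess."""
    name = CUDA_ERRORS.get(err, "unrecognized error code")
    return "Check failed: err == cudaSuccess (%d vs. 0) -> %s" % (err, name)

print(decode(71))
```

So the `(71 vs. 0)` in the log is the runtime reporting `cudaErrorNotSupported` where `cudaSuccess` was expected.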

I am clueless about where to look next to solve this issue and would greatly appreciate any feedback or guidance.

github.com/tensorflow/tensorflow/issues/9080 -- are you running on a system with CUDA MPS installed? If so, that might be the issue. -- talonmies
Many thanks @talonmies, this turns out to be very relevant, and it post-dates when I started looking for answers before coming to Stack Overflow. -- vlimant

1 Answer


The underlying source of the problem here appears to be an incompatibility between TensorFlow and the CUDA MPS service (see a related TensorFlow tracker issue here). It should only affect clusters and large systems which use the MPS service to improve the granularity of access to GPU devices.

This should probably be raised as a bug with both NVIDIA and the Tensorflow development team.

Edited to add the diagnosis from the TensorFlow tracker issue:

It appears the underlying reason is the extensive use of stream callbacks in TensorFlow, which MPS did not support before NVIDIA's recent Volta hardware release. Apparently it is also possible to build TensorFlow from source with options that make it work correctly with MPS on earlier hardware as well. See the linked tracker discussion for more details.
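If you suspect this problem on your own system, a quick heuristic check for a running MPS daemon can be sketched as follows (my own illustration; it looks for the documented `CUDA_MPS_PIPE_DIRECTORY` environment variable, the default pipe directory `/tmp/nvidia-mps`, and the `nvidia-cuda-mps-control` daemon process):

```python
import os
import subprocess

def mps_active():
    """Heuristically check whether the CUDA MPS control daemon is running."""
    # CUDA_MPS_PIPE_DIRECTORY overrides the default pipe location, /tmp/nvidia-mps.
    pipe_dir = os.environ.get("CUDA_MPS_PIPE_DIRECTORY", "/tmp/nvidia-mps")
    if os.path.isdir(pipe_dir):
        return True
    # Fall back to looking for the daemon process by name.
    try:
        subprocess.check_output(["pgrep", "-f", "nvidia-cuda-mps-control"])
        return True
    except (subprocess.CalledProcessError, OSError):
        return False

print("MPS appears active" if mps_active() else "MPS not detected")
```

If MPS turns out to be active, disabling it for the affected job (or consulting the tracker issue for build options) would be the next thing to try.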

[This answer was assembled from comments and added as a community wiki entry in order to get it off the unanswered list for the CUDA tag]