Apologies if this has been reported already at some other place, I have been looking for it quite some time, without success.
While running the simple mnist example (available on github /fchollet/keras/blob/master/examples/mnist_cnn.py
) with keras+tensorflow using a P100 GPGPU we encounter an issue at the intersection of keras/tensorflow/cuda:
Using TensorFlow backend. I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate (GHz) 1.3285 pciBusID 0000:02:00.0 Total memory: 15.89GiB Free memory: 15.51GiB I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0: Y I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:02:00.0) F tensorflow/core/common_runtime/gpu/gpu_device.cc:121] Check failed: err == cudaSuccess (71 vs. 0) srun: error: nid02011: task 0: Aborted srun: Terminating job step 1262138.0
We are using keras 2.0.2, tensorflow 1.0.0. cuda 8.0.53. We seem to be having this issue both in python2.7.12 and python3.5.2 (keras 1.2 and 2.0 ...)
Bare tensorflow runtest are going fine, which lead us to think that this is really at the intersection of keras/tensorflow/cuda.
The same test runs fine on various machine with the same version of the software but with TitanX GPGPU.
seem to be tracing this back to tensorflow line 121
cudaErrorNotSupported = 71 This error indicates the attempted operation is not supported on the current system or device.
I am clueless on where to look next to solve this issue. I would greatly appreciate any feedback and guidance on this matter.