I have a Python 3 program that executes a CUDA kernel.
The code runs fine when I launch it with the following configuration:
- GeForce GTX 1080 Ti GPU
- Ubuntu 16.04
- CUDA version 8.0.61
- NVIDIA driver version 384.111
- Python version 3.5.2
- PyCUDA version (2017, 1, 1).
However, when using a GeForce GTX 970 on the very same machine, I get this error:
cuMemFree failed: the launch timed out and was terminated
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
Note that this error does not occur when I launch the kernel with a relatively small number of threads (i.e., with a small grid dimension at a constant number of threads per block).
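For context, by "grid dimension at constant threads per block" I mean a launch configuration along these lines (the numbers here are made up for illustration; they are not my actual values):

```python
import math

# Hypothetical numbers for illustration only.
threads_per_block = 256          # kept constant across runs
n_small = 1_000                  # a problem size that works on both GPUs
n_large = 50_000_000             # a size at which the GTX 970 run fails

# One thread per element; round the grid size up so all elements are covered.
grid_small = math.ceil(n_small / threads_per_block)
grid_large = math.ceil(n_large / threads_per_block)

print(grid_small, grid_large)
```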
In this post, Andreas explains the meaning of that error message:
This means your context went away while PyCUDA was still talking to it. This will happen most often if you perform some invalid operation (such as access out-of-bounds memory in a kernel).
In other words, it seems to indicate that something is wrong with the kernel I wrote. However, since the code does not raise an error when launched on the other GPU, I am wondering whether other issues can trigger the same error as well.
So my questions are:
- Can the above error also occur when a correctly written kernel is run in an unfavourable environment?
- Can it be caused by a wrong combination of NVIDIA driver, CUDA version, PyCUDA version and GPU model?
- What, in general, do I have to consider regarding driver version, CUDA version, PyCUDA version and GPU model to ensure that things work properly?
I understand that many people here are allergic to questions without code or a minimal example. I tried to compose a simple example that reproduces the error, but I couldn't. Kernels that, say, simply double an input argument run fine right up to the point where memory errors occur. So I am hoping for some advice on which direction to look in when searching for the error.
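For what it's worth, the kernels I tried for a minimal example were of roughly this shape (a sketch, not my actual code; the name and signature are made up):

```cuda
// Doubles each element of the input array, one thread per element.
__global__ void double_values(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)          // guard against out-of-bounds access
        a[i] *= 2.0f;
}
```

Compiled and launched through PyCUDA, kernels like this ran without the timeout on either card, which is why I could not produce a minimal reproduction.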