0
votes

I have a Python 3 program that involves the execution of a CUDA kernel.

The code runs fine when I launch it in the following configuration:

  • GeForce GTX 1080 Ti GPU
  • Ubuntu 16.04
  • CUDA version 8.0.61
  • NVIDIA driver version 384.111
  • Python version 3.5.2
  • PyCUDA version (2017, 1, 1).

However, when using a GeForce GTX 970 on the very same machine, I get this error:

cuMemFree failed: the launch timed out and was terminated
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)

Note that this error does not occur when I call the kernel with a rather small number of threads (i.e. with a small grid dimension at constant threads per block).

In this post, Andreas explains the meaning of that error message:

This means your context went away while PyCUDA was still talking to it. This will happen most often if you perform some invalid operation (such as access out-of-bounds memory in a kernel).

In other words, it seems to indicate that something is wrong with the kernel I wrote. However, as the code does not raise an error when launched on the other GPU, I was wondering if other issues can raise the same error, too.

So my questions are:

  • Can the above error also be caused when running a correctly written kernel in an unfavourable environment?
  • Can it be caused by a wrong combination of NVIDIA driver, CUDA version, PyCUDA version and GPU model?
  • What do I generally need to consider regarding driver version, CUDA version, PyCUDA version and GPU model to ensure that things work properly?

I can understand that many people here are allergic to questions without code and a minimal example. I tried to compose a simple example that would reproduce the error, but I couldn't. Kernels that, say, simply double an input argument run fine up to the point where memory errors occur... So I hope to just get some advice on which direction to look in when searching for the error.

2
You have missed the most important part of that error -- `the launch timed out and was terminated`. The difference in kernel execution time between the two devices is probably the cause. - talonmies
@talonmies Right, I missed that part of the error - thanks! - Amos Egel
I have edited the question title to include the timeout part of the error message. - Amos Egel

2 Answers

1
votes

It was talonmies' comment on the question that led me to the answer.

The issue was that one of the cards (the GTX 970) was also being used for the graphical output of the system. As explained here and here, this implies that there is a "watchdog" that prevents CUDA kernels from running longer than some maximum time; kernels that exceed it are terminated.
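To see which card is subject to the watchdog, the kernel execution timeout flag can be queried through PyCUDA. The following is only a minimal sketch: the helper names `describe_watchdog` and `query_devices` are made up for illustration, and it assumes that PyCUDA is installed and that the `KERNEL_EXEC_TIMEOUT` device attribute reflects the display watchdog on your system.

```python
def describe_watchdog(devices):
    """Format one status line per (device_name, has_timeout) pair."""
    lines = []
    for name, has_timeout in devices:
        if has_timeout:
            status = "watchdog ACTIVE (kernel run time is limited)"
        else:
            status = "no watchdog"
        lines.append("{}: {}".format(name, status))
    return lines


def query_devices():
    """Ask PyCUDA for every device's name and kernel-timeout flag."""
    # Imported lazily so that describe_watchdog() also works without a GPU.
    import pycuda.driver as cuda

    cuda.init()
    attr = cuda.device_attribute.KERNEL_EXEC_TIMEOUT
    result = []
    for index in range(cuda.Device.count()):
        dev = cuda.Device(index)
        result.append((dev.name(), bool(dev.get_attribute(attr))))
    return result


if __name__ == "__main__":
    try:
        devices = query_devices()
    except Exception as exc:  # PyCUDA missing, no driver, no GPU, ...
        print("Could not query CUDA devices: {}".format(exc))
    else:
        for line in describe_watchdog(devices):
            print(line)
```

A card that drives a display would be expected to report the timeout as active, while a card used for compute only would not.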

The solution for me was to stop the X server by running `sudo service lightdm stop`. After that, the program ran on both cards without error.
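If stopping the X server is not an option, the observation from the question (no error for small grids) suggests another possible workaround: splitting one large launch into several smaller ones, so that each individual launch may stay below the time limit. The sketch below only computes the chunk boundaries; `grid_chunks` and the `block_offset` kernel argument mentioned in the comments are hypothetical names, and the kernel itself would have to be adapted to add the offset to `blockIdx.x`.

```python
def grid_chunks(total_blocks, max_blocks_per_launch):
    """Yield (block_offset, num_blocks) pairs that together cover total_blocks."""
    if max_blocks_per_launch <= 0:
        raise ValueError("max_blocks_per_launch must be positive")
    offset = 0
    while offset < total_blocks:
        num_blocks = min(max_blocks_per_launch, total_blocks - offset)
        yield offset, num_blocks
        offset += num_blocks


if __name__ == "__main__":
    # Hypothetical usage with a PyCUDA kernel that takes a block_offset argument:
    #   for offset, num_blocks in grid_chunks(total_blocks, 1024):
    #       kernel(data_gpu, numpy.int32(offset),
    #              block=(256, 1, 1), grid=(num_blocks, 1))
    print(list(grid_chunks(10, 4)))  # [(0, 4), (4, 4), (8, 2)]
```

The right chunk size depends on the kernel and the card, so it would have to be found by experiment.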

0
votes

Adding to Amos's answer above, for Linux 18.04 I had to use `sudo service gdm stop`. In addition, if that still doesn't work (it didn't for me), try opening a terminal using Ctrl+Alt+F3 and running your program from there.