I'm using PyCUDA to run a kernel that is expected to take at least two hours to complete, but it fails after about one hour with only this error:

pycuda._driver.Error: cuCtxSynchronize failed: unknown error

I'm on Windows, and I added the registry value TdrDelay and set it to 120000000 to make sure that Windows does not time out my kernel.

The error doesn't happen when I adjust the kernel's parameters so that it is expected to complete in about 30 minutes. Why would the synchronize call fail after the kernel has been running for a long time?

Could my graphics card be overheating and preemptively terminating the kernel? Is there a CUDA setting that terminates a kernel if it runs for too long? Would running the kernel in the NVIDIA Visual Profiler help figure out what the problem is?

My guess would be that you are still hitting a TDR timeout. I'm not sure that your setting does what you think it does. Yes, your graphics card could be overheating, but that isn't usually a problem (the GPU should have a mechanism to manage temperature, regardless of load). You can monitor temperatures with nvidia-smi. There are no CUDA settings that terminate a long-running kernel (other than the aforementioned Windows WDDM TDR). I doubt the Visual Profiler will shed any useful light on this. – Robert Crovella

TdrDelay definitely does something, because before I added that value my kernel was timing out after two seconds. Maybe TdrDelay has some maximum value. – Thomas

1 Answer

I was able to get my long-running kernel to complete without error by adding the registry value "TdrLevel" alongside "TdrDelay" and setting it to 0.
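
For anyone who wants to apply the same change without editing the registry by hand, here is a minimal sketch that only generates the text of a .reg file. It assumes the values live under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers and are REG_DWORDs, which is where Microsoft documents the TDR settings; the function name `build_tdr_reg_file` and the output file name are made up for this example. As far as I know, TdrLevel 0 turns timeout detection off entirely (the default is 3), and a reboot is needed before the change takes effect.

```python
# Sketch only: builds the text of a .reg file, it does not touch the registry.
# Assumed key path (from Microsoft's TDR documentation):
GRAPHICS_DRIVERS_KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
)


def build_tdr_reg_file(tdr_level=0, tdr_delay=None):
    """Return .reg file text that sets TdrLevel and, optionally, TdrDelay.

    tdr_level: 0 (detection off) through 3 (recover on timeout, the default).
    tdr_delay: timeout in seconds, or None to leave TdrDelay untouched.
    """
    if tdr_level not in (0, 1, 2, 3):
        raise ValueError("TdrLevel must be 0, 1, 2 or 3")

    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{GRAPHICS_DRIVERS_KEY}]",
        f'"TdrLevel"=dword:{tdr_level:08x}',  # .reg dwords are 8 hex digits
    ]

    if tdr_delay is not None:
        if not 0 <= tdr_delay <= 0xFFFFFFFF:
            raise ValueError("TdrDelay must fit in an unsigned 32-bit value")
        lines.append(f'"TdrDelay"=dword:{tdr_delay:08x}')

    # Use Windows line endings so the file imports cleanly.
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    # Hypothetical file name; import it by double-clicking or with `reg import`.
    with open("disable_tdr.reg", "w", newline="") as f:
        f.write(build_tdr_reg_file(tdr_level=0, tdr_delay=120000000))
```

Keep in mind that with detection disabled, a kernel that really does hang can freeze the display until you reboot, so it may be worth setting TdrLevel back to its default once the long run is finished.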