CUDA synchronize function fails during long running kernel

Question

I'm using PyCUDA to run a kernel that is expected to take at least two hours to complete, but it fails after about one hour with this error:

pycuda._driver.Error: cuCtxSynchronize failed: unknown error

I'm using Windows, and I added the registry key TdrDelay and set it to 120000000 to ensure that Windows is not timing out my kernel.

This error doesn't happen when I adjust the parameters of the kernel so it is expected to complete in about 30 minutes. Why could the synchronize call be failing after the kernel has run for a long time?

Could my graphics card be overheating and preemptively terminating the kernel? Could there be a CUDA setting that terminates a kernel if it runs for too long? Could running the kernel in the NVIDIA Visual Profiler help figure out what the problem might be?

My guess would be that you are still hitting a TDR timeout; I'm not sure that your setting does what you think it does. Your graphics card could in principle be overheating, but that is unlikely, since the GPU should have a mechanism to manage its temperature regardless of load. You can monitor temperatures with nvidia-smi. There are no CUDA settings that terminate a long-running kernel, other than the aforementioned Windows WDDM TDR. I doubt the Visual Profiler will shed any useful light on this. — Robert Crovella
The TdrDelay definitely does something, because before I added that key my kernel was timing out after two seconds. Maybe TdrDelay has some maximum value. — Thomas
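
To rule out overheating while the kernel runs, the nvidia-smi suggestion above could be scripted roughly as follows. This is a minimal sketch, assuming nvidia-smi is on the PATH; the function names and the sampling interval are illustrative choices, not part of the original discussion.

```python
import subprocess
import time


def parse_temperatures(output):
    """Parse nvidia-smi CSV output: one temperature per line, one line per GPU.

    Lines that are not plain integers (for example "N/A") are skipped.
    """
    return [int(line.strip()) for line in output.splitlines() if line.strip().isdigit()]


def read_gpu_temperatures():
    """Return the current core temperature of each GPU in degrees Celsius."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=temperature.gpu", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_temperatures(result.stdout)


def log_temperatures(interval_s=30, samples=10):
    """Print a timestamped temperature reading every interval_s seconds."""
    for _ in range(samples):
        print(time.strftime("%H:%M:%S"), read_gpu_temperatures())
        time.sleep(interval_s)
```

Running `log_temperatures()` in a separate process while the kernel executes would show whether the temperature climbs steadily before the failure.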

Thomas · Accepted Answer · 2018-05-17T18:20:50

I was able to get my long running kernel to complete without error by adding the registry key "TdrLevel" alongside "TdrDelay" and setting its value to 0.
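
As a rough illustration of that change, the sketch below builds the text of a .reg file that sets both values. It assumes the keys live under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers, which is where Microsoft documents the TDR settings; the function name and the example delay are hypothetical. A TdrLevel of 0 turns timeout detection off, so a kernel that genuinely hangs can freeze the display until reboot.

```python
# Registry location of the Windows TDR settings (per Microsoft's TDR documentation).
GRAPHICS_DRIVERS_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"


def build_tdr_reg_file(tdr_level=0, tdr_delay=None):
    """Return the contents of a .reg file that sets TdrLevel and, optionally, TdrDelay.

    tdr_level: DWORD value; 0 disables timeout detection.
    tdr_delay: DWORD timeout in seconds, or None to leave the existing value alone.
    """
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{GRAPHICS_DRIVERS_KEY}]",
        f'"TdrLevel"=dword:{tdr_level:08x}',
    ]
    if tdr_delay is not None:
        lines.append(f'"TdrDelay"=dword:{tdr_delay:08x}')
    # .reg files conventionally use Windows line endings.
    return "\r\n".join(lines) + "\r\n"


# Example (hypothetical file name): write the file, then import it as administrator.
# with open("tdr.reg", "w", newline="") as f:
#     f.write(build_tdr_reg_file(tdr_level=0, tdr_delay=60))
```

The generated file can be imported from an elevated prompt with `reg import tdr.reg`; changes to these keys usually take effect only after a reboot.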
