I'm using PyCUDA to run a kernel that is expected to take at least two hours to complete, but it fails after about an hour with this error:
pycuda._driver.Error: cuCtxSynchronize failed: unknown error
I'm using Windows, and I added the registry key TdrDelay and set it to 120000000 to ensure that Windows is not timing out my kernel.
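One way to rule out a mistyped or ignored TDR setting is to read the value back. Below is a minimal sketch, assuming the value lives under the `SYSTEM\CurrentControlSet\Control\GraphicsDrivers` key in `HKEY_LOCAL_MACHINE`, which is where Microsoft documents the TDR settings. `read_tdr_delay` is a hypothetical helper name, not part of PyCUDA. Microsoft documents TdrDelay as a REG_DWORD measured in seconds, and as far as I know a reboot is needed before a change takes effect.

```python
import sys

# Registry path (under HKEY_LOCAL_MACHINE) documented by Microsoft for TDR settings.
TDR_KEY_PATH = r"SYSTEM\CurrentControlSet\Control\GraphicsDrivers"


def read_tdr_delay(value_name="TdrDelay"):
    """Return (value, is_dword) for the TDR setting, or None if it is unavailable.

    Hypothetical helper for illustration only.
    """
    if sys.platform != "win32":
        return None  # the winreg module only exists on Windows

    import winreg

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, TDR_KEY_PATH) as key:
            value, value_type = winreg.QueryValueEx(key, value_name)
    except FileNotFoundError:
        return None  # key or value missing, so Windows falls back to its default

    # A value stored with the wrong type (e.g. a string) may be ignored by Windows.
    return value, value_type == winreg.REG_DWORD
```

Calling `read_tdr_delay()` returns `None` when the value is not set, otherwise the stored value and whether it has the expected DWORD type.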
This error doesn't happen when I adjust the parameters of the kernel so it is expected to complete in about 30 minutes. Why could the synchronize call be failing after the kernel has run for a long time?
Could my graphics card be overheating and preemptively terminating the kernel? Could there be a CUDA setting that terminates a kernel if it runs for too long? Could running the kernel in the NVIDIA Visual Profiler help figure out what the problem might be?
nvidia-smi. There are no CUDA settings that terminate a long-running kernel (other than the aforementioned Windows WDDM TDR). I doubt the visual profiler will shed any useful light on this. – Robert Crovella
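To check the overheating theory, the GPU temperature could be logged with nvidia-smi while the kernel runs. Below is a minimal sketch, assuming nvidia-smi is on the PATH and that the card reports `temperature.gpu`. The helper names `parse_temperatures` and `read_gpu_temperatures` are hypothetical and only for illustration.

```python
import shutil
import subprocess


def parse_temperatures(csv_text):
    """Parse nvidia-smi CSV output: one temperature in degrees C per GPU.

    Entries that are not plain integers (e.g. "[N/A]") become None.
    """
    temps = []
    for line in csv_text.splitlines():
        field = line.strip()
        if not field:
            continue  # skip blank lines
        temps.append(int(field) if field.isdigit() else None)
    return temps


def read_gpu_temperatures():
    """Return a list of GPU temperatures, or None if nvidia-smi is not found."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, "--query-gpu=temperature.gpu", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,  # raise if nvidia-smi exits with an error
    )
    return parse_temperatures(result.stdout)
```

Running `read_gpu_temperatures()` periodically from a separate process and writing the readings to a file would show whether the temperature climbs steadily before the failure.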