How to avoid Cuda error 6 (Launch Timeout) with consecutive asynchronous kernel launches?

Question

I get a Cuda error 6 (also known as cudaErrorLaunchTimeout and CUDA_ERROR_LAUNCH_TIMEOUT) with this (simplified) code:

for(int i = 0; i < 650; ++i)
{
    int param = foo(i); //some CPU computation here, but no memory copy
    MyKernel<<<dimGrid, dimBlock>>>(&data, param);
}

The Cuda error 6 indicates that the kernel took too much time to return. The duration of a single MyKernel is only ~60 ms though. The block size is a classic 16×16.

Now, when I call cudaDeviceSynchronize() every, say, 50 iterations, the error doesn't occur:

for(int i = 0; i < 650; ++i)
{
    int param = foo(i); //some CPU computation here, but no memory copy
    MyKernel<<<dimGrid, dimBlock>>>(&data, param);
    if(i % 50 == 0) cudaDeviceSynchronize();
}

I would like to avoid this synchronization, because it slows the program down a lot.

Since kernel launches are asynchronous, I guess the error occurs because the watchdog measures the execution duration of a kernel from its asynchronous launch, and not from the actual beginning of its execution.

I am new to Cuda. Is this a common case for the error 6 to occur? Is there a way to avoid this error without altering the performance?

Yes it is. I've heard about the registry trick, but I don't want to force my clients to apply it. — floriang
The problem is probably stemming from the Windows driver batching GPU operations together to amortize latency — talonmies

Robert Crovella Robert Crovella · Accepted Answer · 2015-01-14T15:15:57

The watchdog isn't measuring execution time of kernels, per se. The watchdog is keeping track of requests in the command queue that goes to the GPU, and determining if any of them have not been acknowledged by the GPU within a timeout period.

As @talonmies indicated in the comments, my best guess is that (if you are certain that no kernel execution exceeds the timeout period) this behavior is due to the CUDA driver WDDM batching mechanism, which seeks to reduce average latency by batching GPU commands together and sending to the GPU, in batches.

You don't have direct control over the batching behavior, and so in general, trying to work around this without disabling or modifying the windows TDR mechanism will be an imprecise exercise.

The general (somewhat undocumented) suggestion for a low-cost "flush" of the command queue, which you might try experimenting with, is to use cudaEventQuery(0); (as suggested here) in place of cudaDeviceSynchronize();, perhaps every 50 kernel launches or so. To some degree the specifics may depend on the machine configuration, and the GPU in use.

I'm not sure how effective it will be in your case. I don't think that it can be advanced as a "guarantee" of avoiding a TDR event without a lot more experimentation. Your mileage may vary.

How to avoid Cuda error 6 (Launch Timeout) with consecutive asynchronous kernel launches?

2 Answers