2 votes

I have implemented a pipeline where many kernels are launched in a specific stream. The kernels are enqueued into the stream and executed when the scheduler decides it’s best.

In my code, after every kernel enqueue, I check for errors by calling cudaGetLastError which, according to the documentation, "returns the last error from a runtime call. This call may also return error codes from previous asynchronous launches". Thus, if the kernel has only been enqueued, not executed, I understand that the returned error only indicates whether the kernel was enqueued correctly (parameter checking, grid and block size, shared memory, etc.).
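A minimal sketch of that distinction (the kernel and the sizes are made up for illustration): an invalid launch configuration is reported by the very next cudaGetLastError, while a valid enqueue passes the check regardless of what happens later during execution.

    #include <cstdio>
    #include <cuda_runtime.h>
    
    __global__ void Dummy() {}
    
    int main() {
      // 2048 threads per block exceeds the hardware limit (1024 on
      // current GPUs), so the enqueue itself fails immediately.
      Dummy<<<1, 2048>>>();
      std::printf("bad config:  %s\n", cudaGetErrorString(cudaGetLastError()));
    
      // A valid launch passes this check even though the kernel may
      // not have executed yet.
      Dummy<<<1, 32>>>();
      std::printf("good config: %s\n", cudaGetErrorString(cudaGetLastError()));
      return 0;
    }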

My problem is: I enqueue many different kernels without waiting for each kernel's execution to finish. Now imagine I have a bug in one of my kernels (let's call it Kernel1) which causes an illegal memory access, for instance. If I check cudaGetLastError right after enqueuing it, the return value is success, because the kernel was enqueued correctly. So my CPU thread moves on and keeps enqueuing kernels to the stream. At some point Kernel1 is executed and raises the illegal memory access. The next time I check cudaGetLastError I get the error but, by that time, the CPU thread is further along in the code. Consequently, I know there has been an error, but I have no idea which kernel raised it.
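A minimal reproduction of the situation, assuming a deliberately broken Kernel1 (all names here are made up):

    #include <cstdio>
    #include <cuda_runtime.h>
    
    __global__ void Kernel1(unsigned int *out) { out[threadIdx.x] = 1u; }
    __global__ void Kernel2() {}
    
    int main() {
      Kernel1<<<1, 32>>>(nullptr);  // bug: writes through a null pointer
      // The launch configuration is valid, so this usually prints "no error".
      std::printf("after enqueue: %s\n", cudaGetErrorString(cudaGetLastError()));
    
      Kernel2<<<1, 32>>>();         // the CPU thread has already moved on
      cudaDeviceSynchronize();      // the illegal access surfaces around here
      std::printf("later check:   %s\n", cudaGetErrorString(cudaGetLastError()));
      return 0;
    }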

One option is to synchronize (block the CPU thread) until the execution of every kernel has finished and then check the error code, but this is not an option for performance reasons.

The question is: is there any way to query which kernel raised a given error code returned by cudaGetLastError? If not, what is, in your opinion, the best way to handle this?


2 Answers

3 votes

There is an environment variable CUDA_LAUNCH_BLOCKING which you can use to serialize the kernel execution of an otherwise asynchronous sequence of kernel launches. This should allow you to isolate the kernel instance that is causing the error, either via internal error checking in your host code, or via an external tool like cuda-memcheck.
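As a sketch of the idea (the macro name and the kernel are made up): start the process with the variable set, e.g. CUDA_LAUNCH_BLOCKING=1 ./app on Linux, and a plain per-launch check then names the failing kernel, because each launch only returns once the kernel has finished.

    #include <cstdio>
    #include <cuda_runtime.h>
    
    // Hypothetical helper: report the last error, tagged with the kernel name.
    #define CHECK_LAUNCH(name) do {                                       \
        cudaError_t e_ = cudaGetLastError();                              \
        if (e_ != cudaSuccess)                                            \
          std::fprintf(stderr, "%s: %s\n", name, cudaGetErrorString(e_)); \
      } while (0)
    
    __global__ void Kernel1(unsigned int *out) { out[threadIdx.x] = 1u; }
    
    int main() {
      Kernel1<<<1, 32>>>(nullptr);  // illegal access at execution time
      // With CUDA_LAUNCH_BLOCKING=1 the launch blocks until the kernel
      // has finished, so the error is reported here, at the right kernel.
      CHECK_LAUNCH("Kernel1");
      return 0;
    }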

1 vote

I have tested 3 different options:

  1. Set the CUDA_LAUNCH_BLOCKING environment variable to 1. This forces the CPU thread to block until each kernel execution has finished, so we can check after each execution whether there has been an error, catching the exact point of failure. This has an obvious performance impact, but it may help to bound the bug in a production environment without having to make any change on the client side.
  2. Distribute the production code compiled with the -lineinfo flag and run it again under cuda-memcheck (see the workflow sketch at the end of this answer). This has no performance impact and we do not need to make any change on the client side either. However, we have to execute the binary in a slightly different environment, which in some cases, such as a service running GPU tasks, can be difficult to achieve.
  3. Insert a callback after each kernel call. In the userData parameter, include a unique id for the kernel call, and possibly some information on the parameters used. This can be distributed directly in a production environment, always gives us the exact point of failure, and requires no change on the client side. However, the performance impact of this approach is huge: apparently the callback functions are processed by a driver thread, which causes the slowdown. I wrote some code to test it:

    #include <cuda_runtime.h>
    
    #include <vector>
    #include <chrono>
    #include <iostream>
    
    #define BLOC_SIZE       1024
    #define NUM_ELEMENTS    (BLOC_SIZE * 32)
    #define NUM_ITERATIONS  500
    
    // Trivial copy kernel, used only to measure per-launch overhead.
    __global__ void KernelCopy(const unsigned int *input, unsigned int *result) {
      unsigned int pos = blockIdx.x * BLOC_SIZE + threadIdx.x;
      result[pos] = input[pos];
    }
    
    // Stream callback: 'status' reflects the error state of the stream when
    // the callback runs, so the callback enqueued right after a failing
    // kernel identifies it.
    void CUDART_CB myStreamCallback(cudaStream_t stream, cudaError_t status, void *data) {
      if (status != cudaSuccess) {
        std::cout << "Error: " << cudaGetErrorString(status) << "-->";
      }
    }
    
    #define CUDA_CHECK_LAST_ERROR   cudaStreamAddCallback(stream, myStreamCallback, nullptr, 0)
    
    int main() {
      cudaError_t c_ret;
      c_ret = cudaSetDevice(0);
      if (c_ret != cudaSuccess) {
        return -1;
      }
    
      unsigned int *input;
      c_ret = cudaMalloc((void **)&input, NUM_ELEMENTS * sizeof(unsigned int));
      if (c_ret != cudaSuccess) {
        return -1;
      }
    
      std::vector<unsigned int> h_input(NUM_ELEMENTS);
      for (unsigned int i = 0; i < NUM_ELEMENTS; i++) {
        h_input[i] = i;
      }
    
      c_ret = cudaMemcpy(input, h_input.data(), NUM_ELEMENTS * sizeof(unsigned int), cudaMemcpyKind::cudaMemcpyHostToDevice);
      if (c_ret != cudaSuccess) {
        return -1;
      }
    
      unsigned int *result;
      c_ret = cudaMalloc((void **)&result, NUM_ELEMENTS * sizeof(unsigned int));
      if (c_ret != cudaSuccess) {
        return -1;
      }
    
      cudaStream_t stream;
      c_ret = cudaStreamCreate(&stream);
      if (c_ret != cudaSuccess) {
        return -1;
      }
    
      std::chrono::steady_clock::time_point start;
      std::chrono::steady_clock::time_point end;
    
      start = std::chrono::steady_clock::now();
      for (unsigned int i = 0; i < NUM_ITERATIONS; i++) {
        dim3 grid(NUM_ELEMENTS / BLOC_SIZE);
        KernelCopy <<< grid, BLOC_SIZE, 0, stream >>> (input, result);
        CUDA_CHECK_LAST_ERROR;
      }
      cudaStreamSynchronize(stream);
      end = std::chrono::steady_clock::now();
      std::cout << "With callback took (ms): " << std::chrono::duration<float, std::milli>(end - start).count() << '\n';
    
      start = std::chrono::steady_clock::now();
      for (unsigned int i = 0; i < NUM_ITERATIONS; i++) {
        dim3 grid(NUM_ELEMENTS / BLOC_SIZE);
        KernelCopy <<< grid, BLOC_SIZE, 0, stream >>> (input, result);
        c_ret = cudaGetLastError();
        if (c_ret) {
          std::cout << "Error: " << cudaGetErrorString(c_ret) << "-->";
        }
      }
      cudaStreamSynchronize(stream);
      end = std::chrono::steady_clock::now();
      std::cout << "Without callback took (ms): " << std::chrono::duration<float, std::milli>(end - start).count() << '\n';
    
      c_ret = cudaStreamDestroy(stream);
      if (c_ret != cudaSuccess) {
        return -1;
      }
      c_ret = cudaFree(result);
      if (c_ret != cudaSuccess) {
        return -1;
      }
      c_ret = cudaFree(input);
      if (c_ret != cudaSuccess) {
        return -1;
      }
    
      return 0;
    }
    

Output:

With callback took (ms): 47.8729

Without callback took (ms): 1.9317

(CUDA 9.2, Windows 10, Visual Studio 2015, Nvidia Tesla P4)

To me, in a production environment, the only valid approach is number 2.
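For reference, that workflow looks roughly like this (app.cu and app are placeholder names); cuda-memcheck then reports the failing kernel and, thanks to -lineinfo, the offending source line:

    nvcc -lineinfo -o app app.cu
    cuda-memcheck ./app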