4
votes

I have come across a serialization issue with CUDA kernels where concurrent execution is expected. I am using cudaEvent markers to track kernel executions.

In my experiments with concurrent kernels on multiple streams, I observed that recording events on the respective streams causes the kernels to be serialized.

The code below demonstrates the issue. I tested it on two devices, both capable of concurrent kernel execution:

  1. Tesla C2070, Driver version 4.10, Runtime version 4.10, CUDA capability 2.0
  2. Tesla M2090, Driver version 4.10, Runtime version 4.10, CUDA capability 2.0

You can run the program with and without events by defining or undefining the USE_EVENTS macro, and you will observe the difference between concurrent and serial execution.

#include<cuda.h>
#include<pthread.h>
#include<stdio.h>
#include<stdlib.h>
#include<stdint.h>

#define CUDA_SAFE_CALL(call) do {                                         \
    cudaError_t err = call;                                               \
    if (cudaSuccess != err) {                                             \
        fprintf(stderr, "Cuda error in call at file '%s' in line %i: %s.\n", \
                __FILE__, __LINE__, cudaGetErrorString(err));             \
        exit(-1);                                                         \
    }                                                                     \
} while (0)



// Device code
__global__ void VecAdd(uint64_t len)
{
    volatile int a = 0;
    for(uint64_t n = 0 ; n < len ; n++)
        a++;
    return ;
}

#define USE_EVENTS

int
main(int argc, char *argv[])
{

    cudaStream_t stream[2];
    for(int i = 0 ; i < 2 ; i++) 
        CUDA_SAFE_CALL(cudaStreamCreate(&stream[i]));

#ifdef USE_EVENTS
    cudaEvent_t e[4];
    CUDA_SAFE_CALL(cudaEventCreate(&e[0]));
    CUDA_SAFE_CALL(cudaEventCreate(&e[1]));
    CUDA_SAFE_CALL(cudaEventRecord(e[0],stream[0]));
#endif
    VecAdd<<<1, 32, 0, stream[0]>>>(0xfffffff);

#ifdef USE_EVENTS
    CUDA_SAFE_CALL(cudaEventRecord(e[1],stream[0]));
#endif

#ifdef USE_EVENTS
    CUDA_SAFE_CALL(cudaEventCreate(&e[2]));
    CUDA_SAFE_CALL(cudaEventCreate(&e[3]));
    CUDA_SAFE_CALL(cudaEventRecord(e[2],stream[1]));
#endif
    VecAdd<<<1, 32, 0, stream[1]>>>(0xfffffff);

#ifdef USE_EVENTS
    CUDA_SAFE_CALL(cudaEventRecord(e[3],stream[1]));
#endif
    CUDA_SAFE_CALL(cudaDeviceSynchronize());

    for(int i = 0 ; i < 2 ; i++) 
        CUDA_SAFE_CALL(cudaStreamDestroy(stream[i]));

    return 0;

}

Any suggestions as to why this might be happening, and how to circumvent the serialization, would be appreciated.

2
What is this code supposed to demonstrate? How can we "observe the difference due to concurrent execution vs. serial execution"? If your answer includes using the Visual Profiler or Nsight, then you already have a problem, because the performance counters those tools use serialize the API anyway. Also, the kernel you are using will compile to a null stub because of compiler optimization. – talonmies
@talonmies: With respect, your comment sounds kind of harsh. This person did take the trouble to provide a complete reproducible example, which is all too rare. Maybe the example is flawed, but a good-faith attempt was made. – Roger Dahl
How do you determine that the kernels were serialized when USE_EVENTS is defined? – Roger Dahl
@talonmies: Parallel Nsight has supported concurrent kernel trace as the default method since 2.0. In addition, the timeline will show whenever the tool causes serialization. The NVIDIA Visual Profiler serializes kernels. – Greg Smith
(1) Move your cudaEventCreate calls into the loop that creates the streams; the host API overhead may be causing your problem. (2) Increase the duration of your kernel; the current kernel execution may be too short to capture. (3) Can you specify your OS (and, if Windows Vista/7, whether you are using TCC or WDDM)? – Greg Smith
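
(For reference, a sketch of the repro with suggestions (1) and (2) applied: events created in the same loop as the streams, and a longer loop count passed to the kernel. The loop length 0x3fffffffULL is an arbitrary choice and this variant is untested.)

cudaStream_t stream[2];
cudaEvent_t  e[4];
for(int i = 0 ; i < 2 ; i++) {
    CUDA_SAFE_CALL(cudaStreamCreate(&stream[i]));
    CUDA_SAFE_CALL(cudaEventCreate(&e[2*i]));      // start event for stream i
    CUDA_SAFE_CALL(cudaEventCreate(&e[2*i + 1]));  // end event for stream i
}

CUDA_SAFE_CALL(cudaEventRecord(e[0], stream[0]));
VecAdd<<<1, 32, 0, stream[0]>>>(0x3fffffffULL);    // longer-running kernel
CUDA_SAFE_CALL(cudaEventRecord(e[1], stream[0]));

CUDA_SAFE_CALL(cudaEventRecord(e[2], stream[1]));
VecAdd<<<1, 32, 0, stream[1]>>>(0x3fffffffULL);
CUDA_SAFE_CALL(cudaEventRecord(e[3], stream[1]));

CUDA_SAFE_CALL(cudaDeviceSynchronize());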

2 Answers

3
votes

The above example issues work in the following order:

  1. event record on stream A
  2. launch on stream A
  3. event record on stream A
  4. event record on stream B
  5. launch on stream B
  6. event record on stream B

CUDA operations on the same stream execute in issue order. CUDA operations in different streams may run concurrently.

By the programming-model definition there should be concurrency. However, on current devices this work is issued to the GPU through a single push buffer. This causes the GPU to wait for operation 2 to complete before issuing operation 3, for operation 4 to complete before issuing operation 5, and so on. If the event records are removed, the operations are:

  1. launch on stream A
  2. launch on stream B

Operations 1 and 2 are on different streams, so it is possible for the GPU to execute them concurrently.
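
In terms of the code in the question, this corresponds to compiling with USE_EVENTS undefined, so that only the two launches reach the push buffer (a sketch, reusing the names from the question):

VecAdd<<<1, 32, 0, stream[0]>>>(0xfffffff);   // operation 1: launch on stream A
VecAdd<<<1, 32, 0, stream[1]>>>(0xfffffff);   // operation 2: launch on stream B
CUDA_SAFE_CALL(cudaDeviceSynchronize());      // host waits for both streams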

Parallel Nsight and the CUDA command line profiler (v4.2) can be used to time concurrent operations. The command line profiler option is "conckerneltrace". This feature should appear in a future version of the NVIDIA Visual Profiler.

1
vote

I was debugging fundamentally the same issue. Greg's answer was very helpful, though the explanation does not seem complete. The real problem is that operation 3 is waiting on operation 2 when operation 4 is issued. Even though operation 4 is in a different stream, if there is already a kernel or event waiting in the issue queue, it cannot be issued. This is similar to the case where more than one kernel is issued consecutively per stream. It can be solved by delaying the end-of-stream events as follows:

  1. event record on stream A (start timer)
  2. launch on stream A
  3. event record on stream B (start timer)
  4. launch on stream B
  5. event record on stream A (end timer)
  6. event record on stream B (end timer)

Since launches are asynchronous, each end-of-stream event waits until both the preceding kernel launch in its stream has finished and the kernel launches for all the other streams have been issued. Clearly, this causes the end timers to be recorded too late if there are more streams than can be issued concurrently on the given hardware.
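
A minimal sketch of that ordering, reusing the stream array, VecAdd kernel, and CUDA_SAFE_CALL macro from the question; the cudaEventElapsedTime readout at the end is an addition for illustration, not part of the original repro:

cudaEvent_t startA, endA, startB, endB;
CUDA_SAFE_CALL(cudaEventCreate(&startA));
CUDA_SAFE_CALL(cudaEventCreate(&endA));
CUDA_SAFE_CALL(cudaEventCreate(&startB));
CUDA_SAFE_CALL(cudaEventCreate(&endB));

CUDA_SAFE_CALL(cudaEventRecord(startA, stream[0]));  // 1. start timer on stream A
VecAdd<<<1, 32, 0, stream[0]>>>(0xfffffff);          // 2. launch on stream A
CUDA_SAFE_CALL(cudaEventRecord(startB, stream[1]));  // 3. start timer on stream B
VecAdd<<<1, 32, 0, stream[1]>>>(0xfffffff);          // 4. launch on stream B
CUDA_SAFE_CALL(cudaEventRecord(endA, stream[0]));    // 5. end timer on stream A
CUDA_SAFE_CALL(cudaEventRecord(endB, stream[1]));    // 6. end timer on stream B

CUDA_SAFE_CALL(cudaDeviceSynchronize());

float msA = 0.0f, msB = 0.0f;
CUDA_SAFE_CALL(cudaEventElapsedTime(&msA, startA, endA));
CUDA_SAFE_CALL(cudaEventElapsedTime(&msB, startB, endB));
printf("stream A: %.3f ms, stream B: %.3f ms\n", msA, msB);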