I currently have three methods of measuring the elapsed time: two use CUDA events and the third records start and end UNIX timestamps. Of the two that use CUDA events, one measures the entire outer loop time and the other sums all the individual kernel execution times.

Here's the code:

int64 x1, x2;

cudaEvent_t start;
cudaEvent_t end;
cudaEvent_t s1, s2;
float timeValue;
float sum = 0.0f;   // accumulates the per-kernel times


#define timer_s cudaEventRecord(start, 0);
#define timer_e cudaEventRecord(end, 0); cudaEventSynchronize(end); cudaEventElapsedTime(&timeValue, start, end); printf("time:  %f  ms \n", timeValue);


cudaEventCreate( &start );
cudaEventCreate( &end );
cudaEventCreate( &s1 );
cudaEventCreate( &s2 );

cudaEventRecord(s1, 0);   
x1 = GetTimeMs64();

for(int r = 0 ; r < 2 ; r++)
{
    timer_s
    kernel1<<<1, x>>>(gl_devdata_ptr);
    cudaThreadSynchronize();
    timer_e
    sum += timeValue;

    for(int j = 0 ; j < 5; j++)
    {
        timer_s
        kernel2<<<1,x>>>(gl_devdata_ptr);
        cudaThreadSynchronize();
        timer_e
        sum += timeValue;

        timer_s
        kernel3<<<1,x>>>(gl_devdata_ptr);
        cudaThreadSynchronize();
        timer_e
        sum += timeValue;
    }

    timer_s
    kernel4<<<y, x>>> (gl_devdata_ptr);
    cudaThreadSynchronize();
    timer_e
    sum += timeValue;
}

x2 = GetTimeMs64();

cudaEventRecord(s2, 0);   
cudaEventSynchronize(s2); 
cudaEventElapsedTime( &timeValue, s1, s2 ); 
printf("elapsed cuda :       %f  ms \n", timeValue);
printf("elapsed sum :       %f  ms \n", sum);
printf("elapsed win :       %d  ms \n", x2-x1);

GetTimeMs64 is something I found here on Stack Overflow:

int64 GetTimeMs64()
{
 /* Windows */
 FILETIME ft;
 LARGE_INTEGER li;
 uint64 ret;

 /* Get the number of 100-nanosecond intervals elapsed since January 1, 1601 (UTC) and copy it
  * to a LARGE_INTEGER structure. */
 GetSystemTimeAsFileTime(&ft);
 li.LowPart = ft.dwLowDateTime;
 li.HighPart = ft.dwHighDateTime;

 ret = li.QuadPart;
 ret -= 116444736000000000LL; /* Convert from file time to UNIX epoch time. */
 ret /= 10000; /* From 100 nano seconds (10^-7) to 1 millisecond (10^-3) intervals */

 return ret;
}

Those aren't the real variable or kernel names; I just removed some code to keep the example small.

The problem is that every measurement gives me a very different total time.

Some examples I just ran:

elapsed cuda : 21.076832    
elapsed sum :  4.177984     
elapsed win :  27

So why is there such a huge difference? The sum of all the kernel calls is around 4 ms, so where are the other 18 ms? CPU time?

1 Answer

cudaThreadSynchronize is a very high overhead operation as it has to wait for all work on the GPU to complete.

You should get the correct result if you structure your code as follows:

int64 x1, x2;

const int k_maxEvents = 1 + 2 * (1 + 5 * 2 + 1) + 1;   // one event before every kernel launch, plus one at the start and one at the end
cudaEvent_t events[k_maxEvents];
int eIdx = 0;
float timeValue;

for (int e = 0; e < k_maxEvents; ++e)
{
    cudaEventCreate(&events[e]);
}

x1 = GetTimeMs64();
cudaEventRecord(events[eIdx++], 0);       
for(int r = 0 ; r < 2 ; r++)
{
    cudaEventRecord(events[eIdx++], 0);
    kernel1<<<1, x>>>(gl_devdata_ptr);

    for(int j = 0 ; j < 5; j++)
    {
        cudaEventRecord(events[eIdx++], 0);
        kernel2<<<1,x>>>(gl_devdata_ptr);

        cudaEventRecord(events[eIdx++], 0);
        kernel3<<<1,x>>>(gl_devdata_ptr);
    }

    cudaEventRecord(events[eIdx++], 0);
    kernel4<<<y, x>>> (gl_devdata_ptr);
}

cudaEventRecord(events[eIdx++], 0);   
cudaDeviceSynchronize(); 

x2 = GetTimeMs64();

cudaEventElapsedTime( &timeValue, events[0], events[eIdx - 1] ); 
printf("elapsed cuda :       %f  ms \n", timeValue);
// The time between each pair of consecutive events is the time taken by the
// kernel launched between them. On WDDM a context switch may occur between
// any of the kernels, leading to higher than expected results.
// printf("elapsed sum :       %f  ms \n", sum);
printf("elapsed win :       %d  ms \n", x2-x1);

On Windows an easier way to measure time is to use QueryPerformanceCounter and QueryPerformanceFrequency.
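
For example, a small helper along these lines could replace GetTimeMs64 (the name GetTimeQpcMs is made up for this sketch):

#include <windows.h>

/* Returns a timestamp in milliseconds based on the high-resolution
 * performance counter. */
double GetTimeQpcMs()
{
    LARGE_INTEGER freq, now;
    QueryPerformanceFrequency(&freq);   /* ticks per second */
    QueryPerformanceCounter(&now);      /* current tick count */
    return 1000.0 * (double)now.QuadPart / (double)freq.QuadPart;
}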

If you write the above example without the events as

#include "NvToolsExt.h"
nvtxRangePushA("CPU Time");
for(int r = 0 ; r < 2 ; r++)
{
    kernel1<<<1, x>>>(gl_devdata_ptr);

    for(int j = 0 ; j < 5; j++)
    {
        kernel2<<<1,x>>>(gl_devdata_ptr); 
        kernel3<<<1,x>>>(gl_devdata_ptr);
    }
    kernel4<<<y, x>>> (gl_devdata_ptr);
}

cudaDeviceSynchronize(); 
nvtxRangePop();

and run it under the Nsight Visual Studio Edition 1.5-2.2 CUDA Trace activity or Visual Profiler 4.0+, all of the times will be available. The GPU times will be more accurate than what you can collect using the cudaEvent API. Using nvtxRangePush to measure the CPU time range is optional. This can also be accomplished by measuring from the first CUDA API call in the example to the end of cudaDeviceSynchronize.
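
For example, the CPU-side range could be measured like this (a sketch reusing the hypothetical GetTimeQpcMs helper from above):

double t0 = GetTimeQpcMs();                 // just before the first CUDA call
for (int r = 0; r < 2; r++)
{
    kernel1<<<1, x>>>(gl_devdata_ptr);
    for (int j = 0; j < 5; j++)
    {
        kernel2<<<1, x>>>(gl_devdata_ptr);
        kernel3<<<1, x>>>(gl_devdata_ptr);
    }
    kernel4<<<y, x>>>(gl_devdata_ptr);
}
cudaDeviceSynchronize();                    // end of the measured range
double t1 = GetTimeQpcMs();
printf("elapsed cpu :       %f  ms \n", t1 - t0);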