0
votes

Hi everyone, I'm currently working on timing some of my CUDA code. I was able to time it using events. My kernel ran for 19 ms. Somehow I find this doubtful, because a sequential implementation of the same algorithm took around 5000 ms. I know the code should run faster, but should it be this much faster?

I'm using wrapper functions to call the CUDA kernels from my .cpp program. Am I supposed to be calling them there or in the .cu file? Thanks!

3
Speedups of 100x are not very surprising with CUDA. But you should post some code so we can see what you're doing! – user703016
Did you use streams? Did you add cudaDeviceSynchronize() after the kernel call and before the time measurement, in case you are using the default stream? – geek
Since the OP is using events, the OP should use cudaEventSynchronize(), not cudaDeviceSynchronize() (the latter will work, but it's a bit of a heavy hammer for timing...). – harrism
How are you calling the CUDA kernels from your .cpp file? If you are not using <<<>>>, the CUDA Driver API, or cudaLaunch(), then you are not launching kernels on the device. Posting some example code would help us answer. – harrism
Another thing to check is how you are measuring the time of your sequential version. Some unfair comparisons measure the full run time of the sequential code from the console, including the time spent in malloc and free, or even worse, the time spent reading input data from a file. Comparing that against the CUDA kernel alone (which does not include cudaMalloc, cudaFree, or CPU-GPU data transfers) results in impressively inflated speedups. – pQB

3 Answers

1
votes

The obvious way to check if your program is working would be to compare the output to that of your CPU-based implementation. If you get the same output, it is working by definition, right? :)
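A minimal verification sketch of that comparison (the names h_cpu, h_gpu, n, and eps are hypothetical placeholders):

#include <math.h>

/* Return 1 if the GPU results match the CPU reference within tolerance eps. */
int results_match( const float *h_cpu , const float *h_gpu , int n , float eps ) {
    for ( int i = 0 ; i < n ; i++ )
        if ( fabsf( h_cpu[i] - h_gpu[i] ) > eps )
            return 0;  /* mismatch at element i */
    return 1;
}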

If your program is experimental in the sense that it doesn't produce any verifiable output, then there is a good chance that the compiler has optimized away some (or all) of your code. The compiler removes code that does not contribute to output data. This can cause, for instance, the entire contents of a kernel to be removed if the final statement that stores the computed value is commented out.
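As a minimal illustration (the kernel name and parameters are hypothetical), commenting out the final store below leaves nothing observable, so the compiler is free to discard the entire loop:

__global__ void sum_kernel ( const float *in , float *out , int n ) {
    float acc = 0.0f;
    for ( int i = 0 ; i < n ; i++ )
        acc += in[i];
    out[ threadIdx.x ] = acc;  /* comment this out and the loop above becomes dead code */
}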

As to your speedup: 5000 ms / 19 ms ≈ 263x, which is an unlikely increase, even for algorithms that map perfectly to the GPU architecture.

0
votes

Well, if you wrote your CUDA code correctly, yes, it could be that much faster. Think about it: you moved the code from sequential execution on a single processor to parallel execution on hundreds of processors, depending on your GPU model. My $179 mid-range card has 480 cores; some cards available now have 1500 cores. It is very possible to get 100x performance jumps with CUDA, particularly if your kernel is much more compute-bound than memory-bound.

That said, make sure you are measuring what you think you are measuring. Kernel launches are asynchronous with respect to the host thread, whether or not you use explicit streams, so before reading a host-side timer you need to call cudaDeviceSynchronize() or have your host code wait on an event recorded after the kernel. Otherwise your measurement reflects only the launch overhead, not the kernel's run time. You can also use CUDA events to measure elapsed time on the GPU within a given stream. See section 5.1.2 of the CUDA Best Practices Guide in the NVIDIA GPU Computing SDK 4.2.
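For example, a minimal event-based timing sketch (mykernel, grid, block, and the kernel arguments are placeholders):

cudaEvent_t start, stop;
cudaEventCreate( &start );
cudaEventCreate( &stop );

cudaEventRecord( start , 0 );                       /* record in the default stream */
mykernel<<< grid , block >>>( d_in , d_out , n );   /* hypothetical kernel launch */
cudaEventRecord( stop , 0 );

cudaEventSynchronize( stop );                       /* wait for the kernel and stop event to complete */
float ms = 0.0f;
cudaEventElapsedTime( &ms , start , stop );         /* elapsed GPU time in milliseconds */

cudaEventDestroy( start );
cudaEventDestroy( stop );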

-3
votes

In my own code, I use the clock() function to get precise timings. For convenience, I have the following macros:

/* IDs for the per-kernel timers, and the global array that accumulates them. */
enum {
    tid_this = 0,
    tid_that,
    tid_count
};
__device__ float cuda_timers[ tid_count ];

#ifdef USETIMERS
 /* Record the start tick in thread 0 of each block. */
 #define TIMER_TIC \
    clock_t tic; \
    if ( threadIdx.x == 0 ) \
        tic = clock();
 /* Accumulate this block's elapsed ticks into cuda_timers[tid],
    handling a single wrap-around of the 32-bit clock counter. */
 #define TIMER_TOC(tid) \
    clock_t toc = clock(); \
    if ( threadIdx.x == 0 ) \
        atomicAdd( &cuda_timers[tid] , \
                   (float)( ( toc > tic ) ? ( toc - tic ) : ( toc + ( 0xffffffff - tic ) ) ) );
#else
 #define TIMER_TIC
 #define TIMER_TOC(tid)
#endif

These can then be used to instrument the device code as follows:

__global__ void mykernel ( ... ) {

    /* Start the timer. */
    TIMER_TIC

    /* Do stuff. */
    ...

    /* Stop the timer and add the elapsed ticks to the "tid_this" counter. */
    TIMER_TOC( tid_this );

}

You can then read the cuda_timers in the host code.
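A minimal host-side readout sketch (assuming the enum and cuda_timers above are in scope); cudaMemcpyFromSymbol copies the __device__ array back to the host:

float timers[ tid_count ];
cudaMemcpyFromSymbol( timers , cuda_timers , sizeof(float) * tid_count );
printf( "tid_this took %.0f ticks (summed over all blocks)\n" , timers[ tid_this ] );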

A few notes:

  • The timers work on a per-block basis, i.e. if you have 100 blocks executing the same kernel, the sum of all their times will be stored.
  • The timers count clock ticks. To convert ticks to milliseconds, divide the tick count by the device's clock rate in kHz (e.g. the clockRate field of cudaDeviceProp, which is reported in kHz); see the conversion sketch after this list.
  • The timers can slow down your code a bit, which is why I wrapped them in the #ifdef USETIMERS so you can switch them off easily.
  • Although clock() returns integer values of type clock_t, I store the accumulated values as float, otherwise the values will wrap around for kernels that take longer than a few seconds (accumulated over all blocks).
  • The selection ( toc > tic ) ? (toc - tic) : ( toc + (0xffffffff - tic) ) is necessary in case the clock counter wraps around.
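As promised above, a minimal ticks-to-milliseconds conversion sketch (assuming the timer values have already been copied into a host array named timers, as in the readout snippet):

cudaDeviceProp prop;
cudaGetDeviceProperties( &prop , 0 );                   /* device 0; prop.clockRate is in kHz */
float ms = timers[ tid_this ] / (float)prop.clockRate;  /* ticks / kHz = milliseconds */
printf( "tid_this: %.3f ms\n" , ms );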