In my own code, I use the clock() function to get precise timings. For convenience, I have the following macros:
enum {
    tid_this = 0,
    tid_that,
    tid_count
};

__device__ float cuda_timers[ tid_count ];

#ifdef USETIMERS
    /* Record the start tick; only thread 0 of each block reads the clock. */
    #define TIMER_TIC clock_t tic; if ( threadIdx.x == 0 ) tic = clock();
    /* Accumulate the elapsed ticks, correcting for clock-counter wrap-around. */
    #define TIMER_TOC(tid) clock_t toc = clock(); if ( threadIdx.x == 0 ) \
        atomicAdd( &cuda_timers[tid] , ( toc > tic ) ? (toc - tic) : ( toc + (0xffffffff - tic) ) );
#else
    #define TIMER_TIC
    #define TIMER_TOC(tid)
#endif
These can then be used to instrument the device code as follows:
__global__ void mykernel ( ... ) {

    /* Start the timer. */
    TIMER_TIC

    /* Do stuff. */
    ...

    /* Stop the timer and store the result in the "tid_this" counter. */
    TIMER_TOC( tid_this );

}
You can then read the cuda_timers
in the host code.
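For example, a minimal host-side sketch (the launch configuration and the zero-initialization step below are placeholders of my own, not part of the code above) could look like this; note the cudaDeviceSynchronize() to make sure the kernel has actually finished before the timers are read:

float timers_host[ tid_count ];
float zeros[ tid_count ] = { 0.0f };

/* Reset the on-device counters before launching. */
cudaMemcpyToSymbol( cuda_timers , zeros , sizeof(float) * tid_count );

/* Launch the instrumented kernel (placeholder launch configuration). */
mykernel<<< 100 , 128 >>>( /* ... */ );

/* Make sure the kernel has finished before reading the timers. */
cudaDeviceSynchronize();

/* Copy the accumulated tick counts back to the host. */
cudaMemcpyFromSymbol( timers_host , cuda_timers , sizeof(float) * tid_count );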
A few notes:
- The timers work on a per-block basis, i.e. if you have 100 blocks executing the same kernel, the sum of all their times will be stored.
- The timers count the number of clock ticks. To get milliseconds, divide by the device's clock rate in kHz (e.g. cudaDeviceProp::clockRate), since one kHz is exactly one tick per millisecond; see the sketch after this list.
- The timers can slow down your code a bit, which is why I wrapped them in the #ifdef USETIMERS so you can switch them off easily (e.g. compile with nvcc -DUSETIMERS to turn them on).
- Although clock() returns integer values of type clock_t, I store the accumulated values as float, otherwise the values would wrap around for kernels that take longer than a few seconds (accumulated over all blocks).
- The selection ( toc > tic ) ? (toc - tic) : ( toc + (0xffffffff - tic) ) is necessary in case the clock counter wraps around.
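As a sketch of the tick-to-millisecond conversion mentioned in the notes (timers_host is the host-side array from the earlier sketch): cudaDeviceProp::clockRate reports the clock frequency in kHz, i.e. ticks per millisecond, so a single division suffices:

cudaDeviceProp prop;
cudaGetDeviceProperties( &prop , 0 );

/* prop.clockRate is in kHz, i.e. clock ticks per millisecond. */
float ms_this = timers_host[ tid_this ] / (float)prop.clockRate;

Remember that the timers accumulate over all blocks, so divide by the number of blocks if you want a per-block average.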
Is a cudaDeviceSynchronize() needed after the kernel call and before the time measuring in case of default stream usage? – geekmalloc
… and free, and even worse, the time for reading input data from file. So, compared to the CUDA kernel (which does not include cudaMalloc, cudaFree, or even the CPU-GPU data transfer), this results in impressive speedups. – pQB