I need to measure the GPU run time of my code, and also the total run time (host and device together). My code launches two GPU kernels, with a host for loop in between that copies data. The outline below shows what my code looks like:

cuda event start

//FIRST kernel code call <<...>>

// cuda memory copy result back from device to host

cudaDeviceSynchronize()

// copy host data to host array (CPU function loop)

// cuda memory copy from host to device

// SECOND Kernel call <<...>>

cuda event stop

//memory copy back from device to host

What I know is that events are used to time kernels: events precisely measure the actual time a kernel takes on the GPU. So my questions and goal are:

1- Will the events, placed as shown above, record only the kernels and ignore the host functions?

2- Will the host loop affect the CUDA event timing?

3- My goal is to measure the GPU time alone and also the GPU+CPU time together. Will the above achieve that, or should I use clock_gettime(CLOCK_REALTIME, timer) to time the host?

1 Answer

A sequence like this:

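float et;                          // elapsed time, in milliseconds
cudaEvent_t start, stop;
cudaEventCreate(&start); cudaEventCreate(&stop);
cudaEventRecord(start);            // timed region begins here
kernel1<<<...>>>(...);
cudaDeviceSynchronize();           // host waits for kernel1 to finish
host_code_routine(...);            // host work, still inside the timed region
kernel2<<<...>>>(...);
cudaEventRecord(stop);             // timed region ends once kernel2 completes
cudaEventSynchronize(stop);        // wait until the stop event has been reached
cudaEventElapsedTime(&et, start, stop);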

will return, in et, the floating-point elapsed time (in milliseconds) that is (approximately) the sum of:

  1. kernel1 execution time
  2. the (host) execution time associated with host_code_routine
  3. kernel2 execution time

If you wish to produce the sum of only 1 and 3 above, you will need to bracket each kernel (only) with a cudaEvent timing sequence, and then manually sum the two values in host code.
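
As a minimal sketch of that per-kernel bracketing, the code below gives each kernel its own pair of events and sums the two elapsed times on the host. The kernel bodies, the contents of host_code_routine, the launch configuration, and the function name time_kernels_only are all placeholders assumed for illustration; they are not taken from the question's code.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <cstddef>

// Placeholder kernels (assumed for illustration only).
__global__ void kernel1(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

__global__ void kernel2(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

// Placeholder for the host loop that runs between the two kernels.
void host_code_routine(float *h, int n) {
    for (int i = 0; i < n; ++i) h[i] -= 0.5f;
}

// Returns kernel1 time + kernel2 time in milliseconds.
// Host work and memory copies fall outside both event pairs.
float time_kernels_only(float *h_data, int n) {
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *d_data = nullptr;
    cudaMalloc(&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    cudaEvent_t start1, stop1, start2, stop2;
    cudaEventCreate(&start1); cudaEventCreate(&stop1);
    cudaEventCreate(&start2); cudaEventCreate(&stop2);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // Bracket kernel1 only.
    cudaEventRecord(start1);
    kernel1<<<blocks, threads>>>(d_data, n);
    cudaEventRecord(stop1);
    assert(cudaGetLastError() == cudaSuccess);

    // Untimed section: this blocking copy also waits for kernel1 and stop1.
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    host_code_routine(h_data, n);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // Bracket kernel2 only.
    cudaEventRecord(start2);
    kernel2<<<blocks, threads>>>(d_data, n);
    cudaEventRecord(stop2);
    assert(cudaGetLastError() == cudaSuccess);
    cudaEventSynchronize(stop2);  // both event pairs have completed by now

    float ms1 = 0.0f, ms2 = 0.0f;
    cudaEventElapsedTime(&ms1, start1, stop1);
    cudaEventElapsedTime(&ms2, start2, stop2);

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaEventDestroy(start1); cudaEventDestroy(stop1);
    cudaEventDestroy(start2); cudaEventDestroy(stop2);
    cudaFree(d_data);
    return ms1 + ms2;
}
```

Because each stop event is recorded immediately after its kernel launch, the time the GPU sits idle while the host loop runs is not part of either measurement.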

To answer your questions, then:

1- Will the events, placed as shown above, record only the kernels and ignore the host functions?

No, the recording you have depicted will capture both host and device elapsed time in the sequence.

2- Will the host loop affect the CUDA event timing?

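Yes. The host loop runs between the start and stop events, so its duration is included in the measured elapsed time.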

3- My goal is to measure the GPU time alone and also the GPU+CPU time together. Will the above achieve that, or should I use clock_gettime(CLOCK_REALTIME, timer) to time the host?

If you want individual times and various sums, I suggest you time the kernels independently, and use some host-based method of timing the host code, and then combine the various components in whichever way you wish.
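
One possible host-based method is sketched below, using std::chrono::steady_clock in place of the clock_gettime(CLOCK_REALTIME, ...) call mentioned in the question; a monotonic clock is assumed here because it is not affected by system clock adjustments. The kernel, the host loop, and the names scale, ms_since, Timings, and time_host_and_total are hypothetical. Kernel launches return to the host before the kernel finishes, so the sketch calls cudaDeviceSynchronize() before reading the host clock for the total.

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstddef>

// Placeholder kernel (assumed for illustration only).
__global__ void scale(float *d, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

using Clock = std::chrono::steady_clock;

// Milliseconds elapsed on the host since t0.
static double ms_since(Clock::time_point t0) {
    return std::chrono::duration<double, std::milli>(Clock::now() - t0).count();
}

struct Timings {
    double host_ms;   // host loop only
    double total_ms;  // kernels + copies + host loop (wall clock)
};

Timings time_host_and_total(float *h_data, int n) {
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *d_data = nullptr;
    cudaMalloc(&d_data, bytes);  // allocation kept outside the timed region

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    Timings t{};

    Clock::time_point t_total = Clock::now();
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    scale<<<blocks, threads>>>(d_data, n, 2.0f);
    // Blocking copy: returns only after the kernel above has finished.
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    // Time the host loop on its own.
    Clock::time_point t_host = Clock::now();
    for (int i = 0; i < n; ++i) h_data[i] += 1.0f;
    t.host_ms = ms_since(t_host);

    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    scale<<<blocks, threads>>>(d_data, n, 2.0f);
    cudaDeviceSynchronize();  // wait for the GPU before stopping the host clock
    t.total_ms = ms_since(t_total);

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return t;
}
```

Note that total_ms in this sketch also includes the memory copies inside the timed region, so it will generally be larger than the kernel-only event times plus host_ms.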