4 votes

In my application I have implemented the same algorithm for the CPU and for the GPU with CUDA, and I have to measure the time needed to run the algorithm on each. I noticed that some time is spent on CUDA initialization in the GPU version, so I added cudaFree(0); at the beginning of the program code, as recommended here for CUDA initialization, but the first execution of the GPU CUDA algorithm still takes more time than the second one.
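
For reference, my timing code boils down to something like the following simplified sketch (the kernel, problem size and event-based timing here are just placeholders for my real code):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Placeholder kernel standing in for the real GPU algorithm.
    __global__ void myKernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main()
    {
        cudaFree(0);   // supposed to trigger CUDA initialization up front

        const int n = 1 << 20;
        float *d_data;
        cudaMalloc((void **)&d_data, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        for (int run = 0; run < 2; ++run)
        {
            cudaEventRecord(start);
            myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);

            float ms = 0.0f;
            cudaEventElapsedTime(&ms, start, stop);
            printf("run %d: %.3f ms\n", run, ms);   // run 0 still comes out slower than run 1
        }

        cudaFree(d_data);
        return 0;
    }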

Is there any other CUDA-related state that has to be initialized at the beginning so that the actual algorithm execution time can be measured correctly?

I'm not sure who downvoted this question, but it is a perfectly valid CUDA question which I don't believe has been asked and answered before. – talonmies

1 Answer

4 votes

The heuristics of lazy context initialisation in the CUDA runtime API have changed subtly since the answer you linked to was written, in two ways that I am aware of:

  1. cudaSetDevice() now initiates a context, whereas earlier on it did not (hence the need for the cudaFree() call discussed in that answer)
  2. Some device-code-related initialisation which the runtime API used to perform at context initialisation is now done at the first kernel call

The only solution I am aware of for the second item is to run the CUDA kernel code you want to time once as a "warm up" to absorb the setup latency, and then perform your timing on a subsequent run for benchmarking purposes.
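
A minimal sketch of that warm-up pattern, assuming a placeholder kernel and CUDA event timing (substitute your own algorithm and timing method):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void myKernel(float *data, int n)   // stand-in for the real algorithm
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main()
    {
        const int n = 1 << 20;
        float *d_data;
        cudaMalloc((void **)&d_data, n * sizeof(float));

        // Warm-up launch: absorbs the remaining lazy setup cost, result is discarded.
        myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaDeviceSynchronize();

        // Timed launch: now measures only steady-state kernel execution.
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("kernel time after warm up: %.3f ms\n", ms);

        cudaFree(d_data);
        return 0;
    }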

Alternatively, you can use the driver API, which gives much finer-grained control over when latency will occur during application start-up.
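
For completeness, a rough sketch of driver API start-up, where every initialisation step is an explicit call you can place and time yourself (the module file name and kernel name are placeholders, error checking is omitted, and you link against the driver with -lcuda):

    #include <cstdio>
    #include <cuda.h>

    int main()
    {
        CUdevice   dev;
        CUcontext  ctx;
        CUmodule   mod;
        CUfunction fun;

        cuInit(0);                                    // driver initialisation
        cuDeviceGet(&dev, 0);                         // pick the first device
        cuCtxCreate(&ctx, 0, dev);                    // context creation happens here, not lazily
        cuModuleLoad(&mod, "algorithm.ptx");          // device code loading happens here
        cuModuleGetFunction(&fun, mod, "myKernel");   // kernel handle lookup

        // ... allocate with cuMemAlloc and launch with cuLaunchKernel; all of the
        // setup latency has already been paid above, so only the launches get timed.

        cuModuleUnload(mod);
        cuCtxDestroy(ctx);
        return 0;
    }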