In my application I have implemented the same algorithm for both the CPU and the GPU (with CUDA), and I need to measure how long it takes to run on each. I noticed that the GPU version spends some time on CUDA initialization, so I added cudaFree(0); at the beginning of the program, as recommended here for CUDA initialization. However, the first execution of the GPU algorithm still takes longer than the second one.
Is there anything else CUDA-related that has to be initialized at the beginning of the program so that I can measure the actual execution time of the algorithm correctly?
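
A minimal sketch of the kind of measurement described above, not the actual application code. It calls cudaFree(0) first and then times two back-to-back launches of the same kernel with CUDA events. The kernel `dummyKernel`, the helper `timeLaunch`, the buffer size, and the launch configuration are all hypothetical placeholders standing in for the real algorithm.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; it stands in for the real algorithm.
__global__ void dummyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] = data[i] * 2.0f + 1.0f;
    }
}

// Hypothetical helper: launches the kernel once and returns the elapsed
// GPU time in milliseconds, measured with CUDA events.
static float timeLaunch(float *d_data, int n) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;                          // assumed block size
    const int blocks = (n + threads - 1) / threads;

    cudaEventRecord(start);
    dummyKernel<<<blocks, threads>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                       // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int n = 1 << 20;                            // assumed problem size

    // The initialization call mentioned in the question.
    cudaFree(0);

    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    // Time the first and second execution separately to compare them.
    float firstMs = timeLaunch(d_data, n);
    float secondMs = timeLaunch(d_data, n);

    printf("first launch:  %f ms\n", firstMs);
    printf("second launch: %f ms\n", secondMs);

    cudaFree(d_data);
    return 0;
}
```

Timing the two launches separately makes the gap between the first and second execution visible in isolation from memory allocation and transfers, which are kept outside the timed region in this sketch.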