Well, you might know that I'm partial to this technique.
It will tell you the approximate percentage of time spent in functions, lines of code, anything you can identify.
I assume your main program at some point has to wait for the CUDA kernels to finish processing, so the fraction of samples ending in that wait estimates the fraction of time spent in CUDA.
Samples not ending in that wait, but in other activities, indicate the fraction of time spent in those other activities.
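To make that concrete, here is a minimal sketch of the kind of code this assumes — the kernel myKernel, the wrapper runKernel, and the sizes are hypothetical stand-ins for whatever your program actually does. The only point is that the wrapper ends in a blocking cudaDeviceSynchronize(), so a stack sample taken while the GPU is busy ends inside that call, underneath runKernel and main:

    #include <cuda_runtime.h>
    #include <cstdio>

    // Hypothetical kernel, standing in for whatever your wrappers launch.
    __global__ void myKernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = data[i] * 2.0f + 1.0f;
    }

    // Wrapper: launch the kernel, then block until the GPU is done.
    // A stack sample taken while the GPU is working shows the program
    // waiting here, called from main(), so the fraction of samples with
    // this stack estimates the fraction of time spent in CUDA.
    void runKernel(float *d_data, int n)
    {
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        myKernel<<<blocks, threads>>>(d_data, n);
        cudaDeviceSynchronize();   // the wait that sampling attributes to CUDA
    }

    int main()
    {
        const int n = 1 << 20;
        float *d_data = nullptr;
        cudaMalloc(&d_data, n * sizeof(float));
        cudaMemset(d_data, 0, n * sizeof(float));

        for (int iter = 0; iter < 1000; ++iter)
            runKernel(d_data, n);   // samples not stuck in the sync are host-side work

        printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));
        cudaFree(d_data);
        return 0;
    }

If your wrappers launch kernels asynchronously and return right away, there is no such wait for the samples to land in, which is what the comment at the bottom about adding cudaDeviceSynchronize() to the wrapper is getting at.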
The statistics are pretty simple. If a line of code or function is on the stack for a fraction F of the time, then it is responsible for that fraction of the time. So if you take N samples, the number of samples showing that line of code or function is, on average, NF, with standard deviation sqrt(NF(1-F)).
So if F is 50% or 0.5, and you take 20 random stack samples, you can expect to see the code on 10 of them, with a standard deviation of sqrt(20*0.5*0.5) ≈ 2.24 samples, so the count will usually land between 7 and 13, and about half the time between 9 and 11.
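In case the connecting step helps: each sample independently either lands on the code (with probability F) or not, so the count k of samples showing it is binomial, which is where the mean and standard deviation above come from:

    k \sim \mathrm{Binomial}(N, F), \qquad
    \mathbb{E}[k] = NF, \qquad
    \sigma_k = \sqrt{N F (1 - F)}

Plugging in N = 20 and F = 0.5 gives the 10 ± 2.24 of the example.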
In other words, you get a very rough measurement of the code's cost, but you know precisely what code has that cost.
If you're concerned about speed, you mainly care about the things that have a big cost, so it's good at pointing those out to you.
If you want to know why gprof doesn't tell you those things, there's a lot more to say about that.
You may need a cudaDeviceSynchronize() call at the end of your wrapper, depending on your exact code in the wrapper. – Robert Crovella