
I am writing CUDA programs, and after profiling, one function (a dot product over a large matrix) accounts for most of the run time:

==27530== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 64.90%  2.25369s        23  97.986ms  9.5590us  1.79533s  cudaMemcpy
 21.04%  730.65ms      1422  513.82us  3.0050us  21.028ms  cudaLaunch
  8.72%  302.72ms         5  60.543ms     477ns  170.92ms  cudaFree
  3.64%  126.54ms        18  7.0298ms  4.8882ms  35.518ms  cudaMallocHost
  1.39%  48.292ms        16  3.0182ms  3.0076ms  3.0601ms  cudaFreeHost
  0.11%  3.9026ms        23  169.68us  64.314us  1.7771ms  cudaMalloc
  0.09%  3.0171ms     17661     170ns     144ns  3.1750us  cudaSetupArgument
  0.04%  1.3514ms       810  1.6680us  1.4000us  9.9270us  cudaBindTexture
  0.02%  569.60us       810     703ns     596ns  4.8010us  cudaUnbindTexture
  0.02%  556.24us       945     588ns     484ns  4.2560us  cudaFuncSetCacheConfig
  0.01%  499.67us      1422     351ns     163ns  198.52us  cudaConfigureCall
  0.01%  256.21us      1310     195ns     150ns     335ns  cudaGetLastError
  0.01%  238.26us       166  1.4350us     165ns  49.141us  cuDeviceGetAttribute
  0.01%  175.44us       945     185ns     157ns     755ns  cudaPeekAtLastError
  0.00%  50.787us         2  25.393us  16.700us  34.087us  cuDeviceGetName
  0.00%  45.330us         2  22.665us  19.024us  26.306us  cuDeviceTotalMem
  0.00%  43.289us         2  21.644us  13.641us  29.648us  cudaMemset
  0.00%  43.029us         2  21.514us  14.059us  28.970us  cudaGetDeviceProperties
  0.00%  13.931us        12  1.1600us     339ns  5.5310us  cudaGetDevice
  0.00%  3.4750us         1  3.4750us  3.4750us  3.4750us  cudaDeviceSynchronize
  0.00%  1.5320us         1  1.5320us  1.5320us  1.5320us  cuDriverGetVersion
  0.00%  1.2690us         3     423ns     241ns     753ns  cuDeviceGetCount
  0.00%  1.0080us         1  1.0080us  1.0080us  1.0080us  cuInit
  0.00%  1.0060us         3     335ns     314ns     377ns  cuDeviceGet

It shows that cudaMemcpy costs more than two seconds. But there are only a few cudaMemcpy calls in my code, and the D->H and H->D copies all use pinned memory, so I don't think my cudaMemcpy calls should cost this much time.
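For reference, the copies follow roughly this pattern (a simplified sketch with placeholder names and sizes, not my actual host code):

// Simplified sketch only -- buffer names and sizes are placeholders.
int width = 64, height = 64, cell_size = 32;
size_t bytes = (size_t)width * height * cell_size * sizeof(float);

float *h_feature, *d_feature;
cudaMallocHost((void**)&h_feature, bytes);   // pinned host allocation
cudaMalloc((void**)&d_feature, bytes);

// ... fill h_feature on the host ...
cudaMemcpy(d_feature, h_feature, bytes, cudaMemcpyHostToDevice);   // H->D
// ... kernel launches ...
cudaMemcpy(h_feature, d_feature, bytes, cudaMemcpyDeviceToHost);   // D->H

cudaFreeHost(h_feature);
cudaFree(d_feature);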

The function that consumes most of the time:

==27530== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 74.35%  2.34598s       112  20.946ms  20.743ms  21.161ms  knl_convolve_filter(float*, float*, int, int, int, float*)

and the function:

__global__ void knl_convolve_filter(float *feature, float *filter, int width, int height, int cell_size, float *convolution) {
    int x = blockDim.x * blockIdx.x + threadIdx.x;
    int y = blockDim.y * blockIdx.y + threadIdx.y;

    if (x < width && y < height) {
        if (x & 1) {
            // odd column: imaginary part of the complex product
            float sum = 0.0f;
            size_t offset = (size_t)(y * width + x - 1) * cell_size;
            for (int i = 0, total_cell_size = cell_size * 2; i < total_cell_size; i += 2) {
                sum += feature[offset + i]     * filter[offset + i + 1]
                     + feature[offset + i + 1] * filter[offset + i];
            }
            convolution[y * width + x] = sum;
        } else {
            // even column: real part of the complex product
            float sum = 0.0f;
            size_t offset = (size_t)(y * width + x) * cell_size;
            for (int i = 0, total_cell_size = cell_size * 2; i < total_cell_size; i += 2) {
                sum += feature[offset + i]     * filter[offset + i]
                     - feature[offset + i + 1] * filter[offset + i + 1];
            }
            convolution[y * width + x] = sum;
        }
    }
}

I am using a GTX 760 (compute capability 3.0) on 64-bit Fedora 19 with CUDA 6.0. Am I making some big mistake here?

I'm not familiar with CUDA, but your knl_convolve_filter function doesn't look like it makes any calls to CUDA API functions. Also, note that one cudaMemcpy call is taking 1.79 seconds, so maybe that's skewing things? – Roddy
The important point is that one cudaMemcpy call took 1.79533s. That almost certainly means that it is the first runtime API call in the program and incurs the penalty of all the lazy initialisation and setup costs of the runtime API. – talonmies
Removing the kernel call probably reduces the latency of the runtime initialisation. Without a short, complete example which demonstrates the problem, it will be impossible to tell you what the exact source of the observed "slowness" actually is. – talonmies
Stick a cudaFree(0); at the beginning of your main function, then re-profile. You may get more sensible results, with the cudaFree operation picking up much of the latency due to runtime initialization. – Robert Crovella
@RobertCrovella: Don't stop asking, but maybe keep in mind why I put "almost" answers in comments for a lot of questions. And note I did answer this one anyway, so an upvote will move this off the unanswered list... – talonmies

1 Answer


It is very hard to give a definitive answer because we haven't been shown any host code, but it appears that there is, in fact, one very slow cudaMemcpy call in the profiling sequence which is consuming 1.79533 seconds. The other 22 calls take an average of only about 20ms each. So the real question is "why does this particular cudaMemcpy call take 1.79533 seconds?", and the answer, I suspect, is that it is absorbing a lot of the lazy setup latency in the CUDA runtime API.
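An easy way to test that theory (it is what Robert Crovella suggested in the comments) is to deliberately trigger the runtime initialisation before any of the transfers you care about, for example with something like this at the top of main():

// Hypothetical warm-up: cudaFree(0) is a harmless call that forces lazy
// context creation, so the initialisation cost is no longer charged to
// the first "real" cudaMemcpy in the profile.
cudaFree(0);

// An explicit alternative with the same effect:
// cudaSetDevice(0);
// cudaDeviceSynchronize();

If you re-profile with that in place, the ~1.8 second outlier should show up against the warm-up call rather than against cudaMemcpy, and the remaining copies should average something consistent with pinned-memory transfer bandwidth.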

The nvprof profiling utility, which ships with modern versions of the CUDA toolkit, has the option of emitting a detailed API timeline. Analysis of that timeline will answer your question for sure, but in the absence of either host code or that API trace, this is about as specific an answer as it is possible to provide.
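(You can get that trace with something like nvprof --print-api-trace ./your_application, where ./your_application is a placeholder for your binary; it shows each API call in the order it was made, so the call that absorbs the initialisation overhead stands out immediately.)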