3 votes

I always get a strange 0.04 ms overhead when working with memory in CUDA on my old GeForce 8800GT. I need to transfer ~1-2 KB to the constant memory of my device, work with that data on the device, and get back just one float value.

Here is my typical code for the GPU calculation:

//allocate all the needed memory: pinned, device global
for(int i = 0; i < 1000; i++)
{
    //Do some heavy cpu logic (~0.005 ms long)        
    cudaMemcpyToSymbolAsync(const_dev_mem, pinned_host_mem, mem_size, 0, cudaMemcpyHostToDevice);
    my_kernel<<<128, 128>>>(output);
    //several other calls of different kernels
    cudaMemcpy((void*)&host_output, output, sizeof(FLOAT_T), cudaMemcpyDeviceToHost);
    // Do some logic with returned value 
}
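
Roughly, the allocation part looks like the following sketch (the exact sizes and types are just placeholders for the ~1-2 KB I mentioned):

typedef float FLOAT_T;

__constant__ char const_dev_mem[2048];   //~1-2 KB of constant memory on the device

char*    pinned_host_mem = NULL;
FLOAT_T* output          = NULL;         //single-float result buffer on the device
FLOAT_T  host_output     = 0;
size_t   mem_size        = 2048;

//in the host setup code:
//pinned host buffer, so async copies don't need an extra staging copy
cudaHostAlloc((void**)&pinned_host_mem, mem_size, cudaHostAllocDefault);
//device buffer for the single returned value
cudaMalloc((void**)&output, sizeof(FLOAT_T));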

I decided to measure the speed of the GPU memory operations with this code (all kernel calls commented out, a cudaDeviceSynchronize call added):

//allocate all the needed memory: pinned, device global
for(int i = 0; i < 1000; i++)
{
    //Do some heavy cpu logic (~0.001 ms long)        
    cudaMemcpyToSymbolAsync(const_dev_mem, pinned_host_mem, mem_size, 0, cudaMemcpyHostToDevice);
    cudaMemcpyAsync((void*)&host_output, output, sizeof(FLOAT_T), cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();
    // Do some logic with returned value 
}

I measured the execution time of the loop and got ~0.05 sec (so 0.05 ms per iteration). The strange thing is that when I do some more memory work (adding extra cudaMemcpyToSymbolAsync and cudaMemcpyAsync calls), each additional call adds less than 0.01 ms. That matches the measurements in this study: http://www.cs.virginia.edu/~mwb7w/cuda_support/memory_transfer_overhead.html

He also got about 0.01 ms per transfer of a 1 KB block to the GPU. So where does that 0.04 ms (0.05 - 0.01) overhead come from? Any ideas? Maybe I should try this code on a newer card?

It seems to me that after cudaDeviceSynchronize and the CPU code, my GeForce goes into some power-saving mode or something like that.

1
0.05 ms is the average per iteration. If I were you, I'd measure each iteration's time individually to see whether it's a constant value. The first iteration usually carries extra overhead. – pQB
How do you measure the elapsed time? Do you call a CUDA function before entering the loop to eliminate the device initialization cost, e.g. cudaFree(0)? – phoad
@pQB, yes, it's an average, but I think it shows the real picture of the time I get when executing the code from my external CPU code. When I change the iteration count from 1000 to, say, 2000, the total time doubles (as expected). The same happens when I set the iteration count to 500: the time halves. – otter
@phoad, I just call GetSystemTime before and after the loop. About a function call to initialize the device: could you be so kind as to give me some links where I can read about it? One of my main theories about what is going on here is that the device drops into some "sleep mode" or similar, which may happen after heavy CPU code execution. I'm going to run some tests in which I'll try to eliminate the CPU execution. – otter
stackoverflow.com/questions/11704681/… Just check this link. It has information about using a better, CUDA-provided timer and how to eliminate the initialization cost from timings. – phoad
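
To illustrate phoad's suggestion, here is a minimal sketch (not the exact code I ran) of warming up the device with cudaFree(0) and timing each iteration with CUDA events instead of GetSystemTime:

#include <cstdio>

//initialize the CUDA context once, before any timing
cudaFree(0);

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

for(int i = 0; i < 1000; i++)
{
    cudaEventRecord(start, 0);

    cudaMemcpyToSymbolAsync(const_dev_mem, pinned_host_mem, mem_size, 0, cudaMemcpyHostToDevice);
    cudaMemcpyAsync((void*)&host_output, output, sizeof(FLOAT_T), cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   //time of this iteration, in milliseconds
    printf("iteration %d: %.4f ms\n", i, ms);
}

cudaEventDestroy(start);
cudaEventDestroy(stop);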

1 Answer

1 vote

I recommend that you increase the number of threads you launch:

    //Use malloc() to allocate memory on CPU. 
    //Change mem_size to the total memory to be transferred to the GPU.        
    cudaMemcpyToSymbolAsync(const_dev_mem, pinned_host_mem, mem_size, 0, cudaMemcpyHostToDevice);
    dim3 dimBlock(128,2);
    dim3 dimGrid(64000,1);
    my_kernel<<<dimGrid, dimBlock>>>(output);
    //several other calls of different kernels
    //change size field to 1000*sizeof(FLOAT_T)
    cudaMemcpy((void*)&host_output, output, sizeof(FLOAT_T), cudaMemcpyDeviceToHost);
    // Do some logic with returned value 

If the code crashes (because of too many threads or too much GPU memory), use loops, but keep the number of iterations small.
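
A rough sketch of what I mean, assuming all 1000 of your CPU iterations are independent (my_batched_kernel and host_outputs are made-up names):

    //Hypothetical batched kernel: one block per former CPU iteration
    __global__ void my_batched_kernel(FLOAT_T* output)
    {
        int item = blockIdx.x;                   //which work item this block handles
        //... the per-item computation on the constant-memory data goes here ...
        if (threadIdx.x == 0 && threadIdx.y == 0)
            output[item] = 0.0f;                 //placeholder for the per-item result
    }

    //Host side: one transfer in, one big launch, one transfer out
    cudaMemcpyToSymbolAsync(const_dev_mem, pinned_host_mem, mem_size, 0, cudaMemcpyHostToDevice);
    dim3 dimBlock(128, 2);
    dim3 dimGrid(1000, 1);                       //one block per former iteration
    my_batched_kernel<<<dimGrid, dimBlock>>>(output);    //output must now hold 1000 floats
    //host_outputs: a host array of 1000 FLOAT_T values
    cudaMemcpy((void*)host_outputs, output, 1000 * sizeof(FLOAT_T), cudaMemcpyDeviceToHost);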