Experiment to find out the effect of block size on CUDA program speed

Question

I want to find out how the number of threads in a block affects the performance of a CUDA program. I wrote a simple vector addition program; here is my code:

#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <iostream>
using namespace std;

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess)
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}
__global__ void gpuVecAdd(float *a, float *b, float *c, int n) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n) {
        c[id] = a[id] + b[id];
    }
}
int main() {
    int n = 1000000;
    float *h_a, *h_b, *h_c;
    srand(time(NULL));
    size_t bytes = n* sizeof(float);
    h_a = (float*) malloc(bytes);
    h_b = (float*) malloc(bytes);
    h_c = (float*) malloc(bytes);
    for (int i=0; i<n; i++)
    {
        h_a[i] = rand()%10;
        h_b[i] = rand()%10;
    }
   float *d_a, *d_b, *d_c;
   cudaMalloc(&d_a, bytes);
   cudaMalloc(&d_b, bytes);
   cudaMalloc(&d_c, bytes);

   gpuErrchk( cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice));
   gpuErrchk( cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice));

   clock_t t1,t2;
   t1 = clock();
   int block_size = 1024;
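   // Round the grid size up in integer arithmetic; ceil(float(n/block_size)) truncates before ceil is applied
   gpuVecAdd<<<(n + block_size - 1)/block_size,block_size>>>(d_a, d_b, d_c, n);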
   gpuErrchk( cudaPeekAtLastError() );
   t2 = clock();
   cout<<(float)(t2-t1)/CLOCKS_PER_SEC<<" seconds";
   gpuErrchk(cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost));
   cudaFree(d_a);
   cudaFree(d_b);
   cudaFree(d_c);
   free(h_a);
   free(h_b);
   free(h_c);
}

I read this post, and according to talonmies' answer, "The number of threads per block should be a round multiple of the warp size, which is 32 on all current hardware."

I ran the code with different numbers of threads per block, for example 2 and 1024 (which is a multiple of 32 and also the maximum number of threads per block). The average running time for both sizes is almost equal, and I don't see a large difference between them. Why is that? Is my benchmarking incorrect?

You are timing launch overhead only. You are not actually timing the duration of the kernel. After the cudaPeekAtLastError call, add a call to cudaDeviceSynchronize(), which will force the full kernel duration to appear in your timing. — Robert Crovella
@RobertCrovella Yes! that was the problem. Please write your answer so I can mark it as the correct answer. — starrr

Robert Crovella · Accepted Answer · 2016-10-07T00:54:02

GPU kernel launches in CUDA are asynchronous. This means that control will be returned to the CPU thread before the kernel has finished executing.

If we want the CPU thread to time the duration of the kernel, we must cause the CPU thread to wait until the kernel has finished. We can do this by putting a call to cudaDeviceSynchronize() in the timing region. Then the measured time will include the full duration of kernel execution.
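As a rough sketch of the fix described above, the timing region can be wrapped in a helper that launches the kernel and then calls cudaDeviceSynchronize() before stopping the clock. The helper names check and timeLaunchMs, the use of std::chrono::steady_clock instead of clock(), the warm-up launch, and the sweep over power-of-two block sizes are all assumptions made for illustration, not part of the original code. The inputs are zero-filled because their values do not affect the timing.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void gpuVecAdd(const float *a, const float *b, float *c, int n) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n) c[id] = a[id] + b[id];
}

// Hypothetical helper: print the CUDA error and exit if a runtime call failed.
static void check(cudaError_t code, const char *what) {
    if (code != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(code));
        exit(EXIT_FAILURE);
    }
}

// Hypothetical helper: wall-clock milliseconds for one launch,
// including the wait for the kernel to finish.
double timeLaunchMs(const float *d_a, const float *d_b, float *d_c,
                    int n, int block_size) {
    int grid_size = (n + block_size - 1) / block_size;  // round up to cover all n elements
    auto start = std::chrono::steady_clock::now();
    gpuVecAdd<<<grid_size, block_size>>>(d_a, d_b, d_c, n);
    check(cudaPeekAtLastError(), "kernel launch");
    // Without this call the launch returns immediately and only
    // launch overhead would be measured.
    check(cudaDeviceSynchronize(), "cudaDeviceSynchronize");
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main() {
    const int n = 1000000;
    const size_t bytes = n * sizeof(float);
    float *d_a, *d_b, *d_c;
    check(cudaMalloc(&d_a, bytes), "cudaMalloc d_a");
    check(cudaMalloc(&d_b, bytes), "cudaMalloc d_b");
    check(cudaMalloc(&d_c, bytes), "cudaMalloc d_c");
    check(cudaMemset(d_a, 0, bytes), "cudaMemset d_a");
    check(cudaMemset(d_b, 0, bytes), "cudaMemset d_b");

    // Warm-up launch (an assumption of this sketch): keeps one-time
    // startup cost out of the first measurement. Result is discarded.
    timeLaunchMs(d_a, d_b, d_c, n, 256);

    // Time each power-of-two block size from 2 up to 1024.
    for (int block_size = 2; block_size <= 1024; block_size *= 2) {
        double ms = timeLaunchMs(d_a, d_b, d_c, n, block_size);
        printf("block_size %4d: %.3f ms\n", block_size, ms);
    }

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}
```

A single launch per block size is noisy, so averaging several launches per size would likely give steadier numbers.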
