4
votes

I created some code to do a 2D convolution on a 1300x1300 grayscale image and a 15x15 kernel, in standard C++ and in CUDA. Both versions:

CPU:

#include <iostream>
#include <exception>

#define N 1300
#define K 15
#define K2 ((K - 1) / 2)


template<int mx, int my>
inline int index(int x, int y)
{
  return x*my + y;
}

int main() {
  double *image  = new double[N * N];
  double *kernel = new double[K * K];
  double *result = new double[N * N];
  
  for (int x=0; x<N; ++x)
  for (int y=0; y<N; ++y)
  {
    double r = 0;
    for(int i=0; i<K; ++i)
    for(int j=0; j<K; ++j)
    {
      if (x + i - K2 >= 0 and
          x + i - K2 < N  and
          y + j - K2 >= 0 and
          y + j - K2 < N)
      {
        r +=  kernel[index<K,K>(i,j)] * image[index<N,N>(x+i-K2, y+j-K2)];
      }
    }
    result[index<N,N>(x, y)] = r;
  }
  
  delete[] image;
  delete[] kernel;
  delete[] result;
}

GPU:

#include <iostream>
#include <exception>

// ignore, just for error handling
struct ErrorHandler {
  int d_line;
  char const *d_file;
  ErrorHandler(int line, char const *file) : d_line(line), d_file(file) {};
};

#define EH ErrorHandler(__LINE__, __FILE__)

ErrorHandler operator<<(ErrorHandler eh, cudaError_t err)
{
  if (err != cudaSuccess)
  {
    std::cerr << cudaGetErrorString( err ) << " in " << eh.d_file << " at line " << eh.d_line << '\n';
    throw std::exception();
  }
  return eh;
}
// end.

#define N 1300
#define K 15
#define K2 ((K - 1) / 2)


template<int mx, int my>
__device__ inline int index(int x, int y)
{
  return x*my + y;
}

__global__ void kernelkernel(double *image, double *kernel, double *result)
{
  int x = blockIdx.x;
  int y = blockIdx.y; // becomes: int y = threadIdx.x;
  
  double r = 0;
  for(int i=0; i<K; ++i)
  for(int j=0; j<K; ++j)
  {
    if (x + i - K2 >= 0 and
        x + i - K2 < N  and
        y + j - K2 >= 0 and
        y + j - K2 < N)
    {
      r +=  kernel[index<K,K>(i,j)] * image[index<N,N>(x+i-K2, y+j-K2)];
    }
  }
  result[index<N,N>(x, y)] = r;
}

int main() {
  double *image      = new double[N * N];
  double *kernel    = new double[K * K];
  double *result      = new double[N * N];
  
  double *image_cuda;
  double *kernel_cuda;
  double *result_cuda;
  EH << cudaMalloc((void **) &image_cuda,  N*N*sizeof(double));
  EH << cudaMalloc((void **) &kernel_cuda, K*K*sizeof(double));
  EH << cudaMalloc((void **) &result_cuda, N*N*sizeof(double));
  
  EH << cudaMemcpy(image_cuda,     image,     N*N*sizeof(double), cudaMemcpyHostToDevice);
  EH << cudaMemcpy(kernel_cuda,    kernel,    K*K*sizeof(double), cudaMemcpyHostToDevice);
  
  dim3 grid   ( N, N );
  kernelkernel<<<grid, 1>>>(image_cuda, kernel_cuda, result_cuda);
  // replace previous 2 statements with: 
  // kernelkernel<<<N, N>>>(image_cuda, kernel_cuda, result_cuda);
  EH << cudaMemcpy(result, result_cuda, N*N*sizeof(double), cudaMemcpyDeviceToHost);

  cudaFree( image_cuda );
  cudaFree( kernel_cuda );
  cudaFree( result_cuda );
  
  delete[] image;
  delete[] kernel;
  delete[] result;
}

I would expect the cuda code to be a lot faster, however:

$ nvprof ./gpuversion
==17806== NVPROF is profiling process 17806, command: ./gpuversion
==17806== Profiling application: ./gpuversion
==17806== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
99.89%  3.83149s         1  3.83149s  3.83149s  3.83149s  kernelkernel(double*, double*, double*)
  0.07%  2.6420ms         1  2.6420ms  2.6420ms  2.6420ms  [CUDA memcpy DtoH]
  0.04%  1.5111ms         2  755.54us     736ns  1.5103ms  [CUDA memcpy HtoD]

And:

$ time ./cpuversion
real    0m3.382s
user    0m3.371s
sys     0m0.012s

The difference is statistically insignificant. The CUDA kernel takes approximately 3-4 seconds; why isn't it a lot faster? Is my code running in parallel?

PS: I'm new to CUDA, so I could be missing something trivial.

SOLUTION

What I found out is that CUDA does not let you access memory willy-nilly from blocks. I guess the general strategy of CUDA programming is:

  • allocate device memory and copy the data from RAM to the GPU using cudaMalloc and cudaMemcpy
  • divide the workload among blocks and threads in such a way that the memory accessed by different blocks doesn't overlap much.
  • If there is overlap between the memory used by blocks, start each block by copying the memory into a shared array. Notice that:
    • the size of this array must be known at compile time
    • its size is limited
    • this memory is shared by all threads in ONE block, so __shared__ double foo[10] allocates 10 doubles for each BLOCK.
  • copy the memory needed by one block into the shared variables inside the kernel. Of course, you use the different threads to do this 'efficiently'
  • sync the threads, such that all data is there before it is used.
  • process the data, and write the result to the output array of the kernel
  • sync again; I'm not sure why, but everyone on the internet is doing it :S
  • copy the GPU memory back to RAM
  • clean up the GPU memory.
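As an illustration, here is a minimal sketch of that pattern applied to the original N×N convolution. TILE is a hypothetical output-tile size and the kernel name is made up; this is not the code I ended up with (that follows below), just the shape of the strategy, reusing N, K and K2 from above.

#define TILE 16   // hypothetical output tile computed per block (16x16 = 256 threads)

__global__ void conv_tiled(const double *image, const double *kernel, double *result)
{
  // shared staging area: the output tile plus a K2-wide halo on each side
  __shared__ double tile[TILE + K - 1][TILE + K - 1];

  int out_x = blockIdx.x * TILE + threadIdx.x;   // output pixel of this thread
  int out_y = blockIdx.y * TILE + threadIdx.y;

  // 1. cooperative load: every thread fetches one or more halo elements
  for (int dx = threadIdx.x; dx < TILE + K - 1; dx += TILE)
  for (int dy = threadIdx.y; dy < TILE + K - 1; dy += TILE)
  {
    int src_x = blockIdx.x * TILE + dx - K2;
    int src_y = blockIdx.y * TILE + dy - K2;
    tile[dx][dy] = (src_x >= 0 && src_x < N && src_y >= 0 && src_y < N)
                 ? image[src_x * N + src_y] : 0.0;
  }

  // 2. sync, so all data is there before it is used
  __syncthreads();

  // 3. process the data from shared memory and write the result
  if (out_x < N && out_y < N)
  {
    double r = 0;
    for (int i = 0; i < K; ++i)
    for (int j = 0; j < K; ++j)
      r += kernel[i * K + j] * tile[threadIdx.x + i][threadIdx.y + j];
    result[out_x * N + out_y] = r;
  }
}

// launched e.g. as:
// dim3 blocks((N + TILE - 1) / TILE, (N + TILE - 1) / TILE);
// conv_tiled<<<blocks, dim3(TILE, TILE)>>>(image_cuda, kernel_cuda, result_cuda);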

This gives the following code. It is MEX code, for Matlab, computing the structural similarity (SSIM), which also works via a sliding kernel, but over 2 images and with a different aggregate than the dot product.

// author: Herbert Kruitbosch, CC: be nice, include my name in documentation/papers/publications when used
#include <matrix.h>
#include <mex.h>

#include <cmath>
#include <iostream>
#include <fstream>

#include <iostream>
#include <stdio.h>

static void HandleError(
  cudaError_t err,
  const char *file,
  int line )
{
  if (err != cudaSuccess)
  {
    printf( "%s in %s at line %d\n", cudaGetErrorString( err ), file, line );
    exit( EXIT_FAILURE );
  }
}

#define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ ))
#define TILE_WIDTH 31

__device__ inline double sim(double v0, double v1, double c)
{
  return (c + 2*v0*v1) / (c + v1*v1 + v0*v0);
}

__device__ inline int index(int rows, int cols, int row, int col)
{
  return row + col*rows;
}

__global__ void ssimkernel(double *test, double *reference, const double * __restrict__ kernel, double *ssim, int k, int rows, int cols, int tile_batches_needed)
{
  int radius = k / 2;
  int block_width = TILE_WIDTH - k + 1;
  __shared__ double tile_test     [TILE_WIDTH][TILE_WIDTH];
  __shared__ double tile_reference[TILE_WIDTH][TILE_WIDTH];
  
  
  
  for(int offset=0; offset < tile_batches_needed; ++offset)
  {
    int dest = block_width*block_width*offset + threadIdx.y * block_width + threadIdx.x;
    int destRow = dest / TILE_WIDTH;
    int destCol = dest % TILE_WIDTH;
    int srcRow = blockIdx.y * block_width + destRow - radius;
    int srcCol = blockIdx.x * block_width + destCol - radius;
    int src  = srcCol * rows + srcRow;
    if (destRow < TILE_WIDTH)
    {
      if (srcRow >= 0 and srcRow < rows and
          srcCol >= 0 and srcCol < cols)
      {
        tile_test     [destRow][destCol] = test     [src];
        tile_reference[destRow][destCol] = reference[src];
      }
      else
      {
        tile_test     [destRow][destCol] = 0;
        tile_reference[destRow][destCol] = 0;
      }
    }
  }
  __syncthreads();
  
  double mean_test = 0;
  double mean_reference = 0;
  for(int i=0; i<k; ++i)
  for(int j=0; j<k; ++j)
  {
    double w = kernel[i * k + j];
    mean_test      +=  w * tile_test     [threadIdx.y+i][threadIdx.x+j];
    mean_reference +=  w * tile_reference[threadIdx.y+i][threadIdx.x+j];
  }
  
  double var_test = 0;
  double var_reference = 0;
  double correlation = 0;
  for(int i=0; i<k; ++i)
  for(int j=0; j<k; ++j)
  {
    double w = kernel[i * k + j];
    double a = (tile_test     [threadIdx.y+i][threadIdx.x+j] - mean_test     );
    double b = (tile_reference[threadIdx.y+i][threadIdx.x+j] - mean_reference);
    var_test      += w * a * a;
    var_reference += w * b * b;
    correlation   += w * a * b;
  }
  
  int destRow = blockIdx.y * block_width + threadIdx.y;
  int destCol = blockIdx.x * block_width + threadIdx.x;
  if (destRow < rows and destCol < cols)
    ssim[destCol * rows + destRow] = sim(mean_test, mean_reference, 0.01) * (0.03 + 2*correlation) / (0.03 + var_test + var_reference);
  
  __syncthreads();
}


template<typename T>
inline T sim(T v0, T v1, T c)
{
  return (c + 2*v0*v1) / (c + v1*v1 + v0*v0);
}

inline int upperdiv(int a, int b) {
  return (a + b - 1) / b;
}

void mexFunction(int nargout, mxArray *argout[], int nargin, const mxArray *argin[])
{
  mwSize rows = mxGetDimensions(argin[0])[0];
  mwSize cols = mxGetDimensions(argin[0])[1];
  mwSize k    = mxGetDimensions(argin[2])[0];
  mwSize channels = mxGetNumberOfDimensions(argin[0]) <= 2 ? 1 : mxGetDimensions(argin[0])[2];
  mwSize dims[] = {rows, cols, channels};
  argout[0] = mxCreateNumericArray(3, dims, mxDOUBLE_CLASS, mxREAL);
  
  double *test      = (double *)mxGetData(argin[0]);
  double *reference = (double *)mxGetData(argin[1]);
  double *gaussian  = (double *)mxGetData(argin[2]);
  double *ssim      = (double *)mxGetData(argout[0]);
  
  double *test_cuda;
  double *reference_cuda;
  double *gaussian_cuda;
  double *ssim_cuda;
  HANDLE_ERROR( cudaMalloc((void **) &test_cuda,      rows*cols*sizeof(double)) );
  HANDLE_ERROR( cudaMalloc((void **) &reference_cuda, rows*cols*sizeof(double)) );
  HANDLE_ERROR( cudaMalloc((void **) &gaussian_cuda,  k*k*sizeof(double)) );
  HANDLE_ERROR( cudaMalloc((void **) &ssim_cuda,      rows*cols*sizeof(double)) );
  HANDLE_ERROR( cudaMemcpy(gaussian_cuda,  gaussian,  k*k*sizeof(double), cudaMemcpyHostToDevice) );
  
  int block_width = TILE_WIDTH - k + 1;
  int tile_batches_needed = upperdiv(TILE_WIDTH*TILE_WIDTH, block_width*block_width);
  
  for(int c=0; c<channels; ++c)
  {
    HANDLE_ERROR( cudaMemcpy(test_cuda,      test      + rows*cols*c, rows*cols*sizeof(double), cudaMemcpyHostToDevice) );
    HANDLE_ERROR( cudaMemcpy(reference_cuda, reference + rows*cols*c, rows*cols*sizeof(double), cudaMemcpyHostToDevice) );
    dim3 dimGrid(upperdiv(cols, block_width), upperdiv(rows, block_width), 1);
    dim3 dimBlock(block_width, block_width, 1);
    
    ssimkernel<<<dimGrid, dimBlock>>>(test_cuda, reference_cuda, gaussian_cuda, ssim_cuda, k, rows, cols, tile_batches_needed);
    
    HANDLE_ERROR( cudaMemcpy(ssim + rows*cols*c, ssim_cuda, rows*cols*sizeof(double), cudaMemcpyDeviceToHost) );
  }
  cudaFree( test_cuda );
  cudaFree( reference_cuda );
  cudaFree( gaussian_cuda );
  cudaFree( ssim_cuda );
}
You are running one thread per block. That wastes about 96% of the available GPU instruction throughput. It should be no surprise the performance is suboptimal. – talonmies
Your edited profile data shows no kernel execution time. Are you sure you haven't broken something else? – talonmies
Yes, I would worry about @talonmies' observation. Have you validated the output of your kernel? I would also put an error check just after launching the kernel, e.g. EH << cudaGetLastError();. I'm not sure if cudaMemcpy will warn you if something fishy happened. – user1084944
@Hurkyl: I did validate, unfortunately I couldn't notice the difference because (so I think) the valid result was already calculated and kept in memory, but not overwritten. Thank you for your comment, it is indeed correct that the solution mentioned above does not work. Sincere apologies for this. – Herbert

3 Answers

9
votes
kernelkernel<<<grid, 1>>>

This is a significant issue; threads on nVidia GPUs work in warps of 32 threads. However, you've only assigned a single thread to each block, which means 31 of those 32 lanes sit idle while a single thread does work. And for kernels where you have the flexibility, you'll usually want several warps per block rather than just one.

You could get an immediate speedup by using N blocks and N threads per block, rather than using N^2 blocks.

Actually, N might be too big, since there's an upper limit on the number of threads per block. Although you could choose a suitable M so that you use N/M threads per block and N * M blocks.

In fact, you'll probably get the best results in this regard by picking some M (I'm guessing 256 will probably be near optimal) and launching with L=ceiling(N*N/M) blocks and M threads per block. Then each thread reconstructs an index in [0, M*L) based on its block and thread ID, and those whose index is in [0, N*N) will proceed to split that index into an x and y coordinate and do the work.
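A hedged sketch of that launch strategy, reusing N, K and K2 from the question (M and the kernel name here are just placeholders, not part of the original code):

#define M 256   // threads per block; L = ceiling(N*N / M) blocks cover the whole image

__global__ void kernelkernel_flat(double *image, double *kernel, double *result)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;   // flat index in [0, M*L)
  if (idx >= N * N)
    return;                                          // surplus threads do nothing
  int x = idx / N;                                   // recover the 2D coordinate
  int y = idx % N;

  double r = 0;
  for (int i = 0; i < K; ++i)
  for (int j = 0; j < K; ++j)
  {
    if (x + i - K2 >= 0 && x + i - K2 < N &&
        y + j - K2 >= 0 && y + j - K2 < N)
      r += kernel[i * K + j] * image[(x + i - K2) * N + (y + j - K2)];
  }
  result[x * N + y] = r;
}

// launch:
// int L = (N * N + M - 1) / M;
// kernelkernel_flat<<<L, M>>>(image_cuda, kernel_cuda, result_cuda);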

1
votes

Accessing global memory in a kernel is costly, because of its latency. A global memory request (both reading and writing) takes hundreds of clock cycles to complete. You want to minimise the number of times global memory is accessed, and to access it in contiguous blocks.

If each piece of data is accessed exactly once, there's nothing you can do about the latency, but that's seldom the case. It's definitely not the case in your code, where the kernel array is accessed by all threads in the same pattern, and a lot of the image is accessed by multiple threads as well.

The solution for that is to start the kernel by fetching the data from the high-latency global memory into the low-latency shared memory. Shared memory is a block of memory on the multiprocessor, and its latency is comparable to that of registers. So most simple kernels follow a structure like this:

  1. Each thread fetches data from global memory to shared memory. You want to fetch data in contiguous sequences if possible, as global memory is accessed through transactions. If there's not enough data for all threads to fetch, leave some of them idle.

  2. Threads operate on the data in shared memory.

  3. Data is written from shared memory back to global memory in the same pattern as it was fetched in step 1.
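As a purely illustrative skeleton of that three-step structure (the names, sizes and the "work" done on the data are made up, not taken from the question):

#define THREADS 256                 // hypothetical threads per block
#define TOTAL   (1 << 20)           // hypothetical number of elements

__global__ void staged_kernel(const double *in, double *out)
{
  __shared__ double stage[THREADS];                  // low-latency staging area, one slot per thread

  int g = blockIdx.x * blockDim.x + threadIdx.x;     // global element handled by this thread

  // 1. fetch: contiguous, coalesced load from global into shared memory
  stage[threadIdx.x] = (g < TOTAL) ? in[g] : 0.0;
  __syncthreads();                                   // all data must be in place before use

  // 2. operate on the shared-memory copy (stand-in for the real computation)
  double v = 2.0 * stage[threadIdx.x];

  // 3. write back to global memory in the same pattern it was fetched in
  if (g < TOTAL)
    out[g] = v;
}

// launch: staged_kernel<<<(TOTAL + THREADS - 1) / THREADS, THREADS>>>(in_cuda, out_cuda);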

Shared memory is shared by all threads within a thread block. Which leads us to the second big issue in your code: you're not using thread blocks at all. Threads in one block run on one multiprocessor, share shared memory, can be synchronised with each other etc. You need to organise threads into blocks well to get the most out of them.

The grid of blocks is just a mechanism to be able to run more blocks at one invocation. All the goodies of parallel instruction execution and shared memory access are within a block. The grid of blocks is just "yeah, sorry, my data's so big a single block won't do, just run many of them."

You're doing the exact opposite: your blocks have one thread each, which means that in each step, only one thread from each warp runs on the multiprocessor (based on your device's compute capability and the number of warp schedulers available, this means something like 2–4 threads on one multiprocessor at most).

You'll have to re-structure your threads to mirror the data access patterns, and prefetch data into shared memory. This will give you the performance boost you expect.


The above is just a short summary. Refer to the CUDA programming guide for details on block organisation, shared memory, and global memory transactions.

-1
votes

If you're using global memory in CUDA, all the data accesses will be serialized in something like a queue, and you'll get an almost linear (serial) solution, not a parallel one.
Also, transferring a large dataset from RAM to GPU memory takes a lot of time (the bus speed is limited).
So, I think you have to somehow parallelize your data across the computation units in your GPU (partition it into shared memory). Check this to see a solution for how to improve your GPU memory usage in a case similar to yours.