I am trying to loop over a 2-dimensional array efficiently in CUDA. In my host code I have:
double **h_matrix; // Matrix on host of size Nx by Ny
double tmp;
...
for (i = 0; i < Nx; i++) {
    for (j = 0; j < Ny; j++) {
        tmp = h_matrix[i][j];
        ... // Perform some operation on tmp
        h_matrix[i][j] = tmp;
    }
}
To perform a similar task efficiently in CUDA, I understand that I have to use cudaMallocPitch() to allocate memory for the 2D array, as shown in the CUDA Programming Guide (scroll down a bit for the example). That example doesn't really help, though, because the kernel doesn't use any information about the grid, block, or thread executing it, even though it is launched as <<<100, 512>>>.
NVIDIA's Parallel Forall blog suggests using grid-stride loops to write flexible and scalable kernels; however, its examples use only 1D arrays. How can I write a grid-stride loop for a 2D array allocated with cudaMallocPitch() to parallelize the code shown above? Should I use a 2D dimGrid and dimBlock, and if so, how?