10
votes

I am trying to loop over a 2-dimensional array in CUDA efficiently. In the host code I have

double **h_matrix; // Matrix on host of size Nx by Ny
double tmp;
...
for(i = 0; i < Nx; i++) {
    for(j = 0; j < Ny; j++) {
        tmp = h_matrix[i][j];
        ... // Perform some operation on tmp
        h_matrix[i][j] = tmp;
    }
}

To perform a similar task efficiently in CUDA, I understand that I have to use cudaMallocPitch() to allocate memory for the 2D array, as shown in the CUDA Programming Guide (scroll down a bit for the example). That example doesn't really help much, since the kernel doesn't use any information about the grid, block or thread executing it, even though it is launched as <<<100, 512>>>.

NVIDIA's Parallel Forall blog suggests using grid-stride loops to write flexible and scalable kernels; however, their examples use only 1D arrays. How can I write grid-stride loops for 2D arrays allocated using cudaMallocPitch() to parallelize the code shown above? Should I use a 2D dimGrid and dimBlock, and if so, how?


3 Answers

7
votes

Here is a complete compilable example I created based on JackOLantern's answer.

#include <stdio.h>
#include <assert.h>

#define N 11
#define M 3

__global__ void kernel(float * d_matrix, size_t pitch) {
    // 2D grid-stride loops: each thread steps through rows and columns by the
    // total number of threads launched in the corresponding dimension.
    for (int j = blockIdx.y * blockDim.y + threadIdx.y; j < N; j += blockDim.y * gridDim.y) {
        // pitch is in bytes, hence the char* cast when computing the row pointer
        float* row_d_matrix = (float*)((char*)d_matrix + j*pitch);
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < M; i += blockDim.x * gridDim.x) {
            row_d_matrix[i] = (j * M + i) + (j * M + i);
        }
    }
}

void verify(float *h, float *d, int size) {
    // Exact float comparison is fine here: all stored values are small
    // integers, which are exactly representable as floats.
    for (int i = 0; i < size; i++) {
        assert(h[i] == d[i]);
    }
    printf("Results match\n");
}

int main() {

    float *h_matrix;
    float *d_matrix;
    float *dc_matrix;

    h_matrix = (float *) malloc(M * N * sizeof(float));
    dc_matrix = (float *) malloc(M * N * sizeof(float));

    for (int j = 0; j < N; j++) {
        for (int i = 0; i < M; i++) {
            h_matrix[j * M + i] = (j * M + i) + (j * M + i);
        }
    }

    size_t pitch;  // width in bytes of each padded row, set by cudaMallocPitch
    cudaMallocPitch((void **)&d_matrix, &pitch, M * sizeof(float), N);

    // Deliberately small launch; the grid-stride loops in the kernel still
    // cover the full M x N matrix.
    dim3 grid(1, 1, 1);
    dim3 block(3, 3, 1);

    kernel<<<grid, block>>>(d_matrix, pitch);

    cudaMemcpy2D(dc_matrix, M * sizeof(float), d_matrix, pitch, M * sizeof(float), N, cudaMemcpyDeviceToHost);

    verify(h_matrix, dc_matrix, M * N);

    free(h_matrix);
    cudaFree(d_matrix);
    free(dc_matrix);
}
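As an aside, the example above has no error checking; a minimal pattern for catching allocation and launch failures might look something like this (a sketch, not part of the original code):

cudaError_t err = cudaMallocPitch((void **)&d_matrix, &pitch, M * sizeof(float), N);
if (err != cudaSuccess) { printf("cudaMallocPitch: %s\n", cudaGetErrorString(err)); return 1; }

kernel<<<grid, block>>>(d_matrix, pitch);
err = cudaGetLastError();                               // catches launch configuration errors
if (err == cudaSuccess) err = cudaDeviceSynchronize();  // catches errors raised while the kernel runs
if (err != cudaSuccess) { printf("kernel: %s\n", cudaGetErrorString(err)); return 1; }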

6
votes

This is a bit of an old question, but I have some suggestions to improve the accepted answer. I am somewhat disregarding the pitch part, as that has already been covered.

The accepted answer, while it iterates through all values, doesn't distribute the work evenly between threads.

Let's take a small example. Say we launch a kernel with <<<1, block>>> where block is dim3 block(2,2), and we do the work on a 5x5 matrix. With the nested loops above, the work distribution ends up such that the thread with id (0,0) gets 9 of the iterations, threads (0,1) and (1,0) get 6 each, and (1,1) gets 4 in total.
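For reference, a quick host-side tally of how many elements each thread of the 2x2 block visits under the nested loops (assuming the same 5x5 matrix) reproduces these counts:

#include <stdio.h>

int main() {
    const int N = 5, M = 5;      // 5x5 matrix
    const int bx = 2, by = 2;    // dim3 block(2,2), launched as <<<1, block>>>
    for (int ty = 0; ty < by; ty++) {
        for (int tx = 0; tx < bx; tx++) {
            int count = 0;
            for (int j = ty; j < N; j += by)       // rows visited by thread (tx,ty)
                for (int i = tx; i < M; i += bx)   // columns visited by thread (tx,ty)
                    count++;
            printf("thread (%d,%d): %d elements\n", tx, ty, count);
        }
    }
    return 0;  // prints 9, 6, 6 and 4
}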

So my suggestion to balance the load better is to flatten the loops and calculate the row and column indices from the flattened index.

Something more along the lines of:

int n = 11;
int m = 3;
int i, j, k;
// Flattened grid-stride loop: one 1D loop over all n*m elements, strided by
// the total number of threads in the whole launch.
for(i = (blockIdx.y * blockDim.y + threadIdx.y) * blockDim.x * gridDim.x + 
        (blockIdx.x * blockDim.x + threadIdx.x); 
    i < m*n; 
    i += blockDim.y * gridDim.y * blockDim.x * gridDim.x) {
  j = i/m;  // row index
  k = i%m;  // column index
  //Calculations here
}

This then applies to the pitched example above as well: you can still find the row from the value of j (and the column from k).
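Put together with the pitched allocation from the accepted answer, a kernel using this flattened loop might look something like the following sketch (the stored value simply mirrors the accepted answer's example):

__global__ void kernel_flat(float *d_matrix, size_t pitch, int n, int m) {
    // One flattened grid-stride loop over all n*m elements.
    int nthreads = blockDim.x * gridDim.x * blockDim.y * gridDim.y;
    int tid = (blockIdx.y * blockDim.y + threadIdx.y) * blockDim.x * gridDim.x +
              (blockIdx.x * blockDim.x + threadIdx.x);
    for (int i = tid; i < n * m; i += nthreads) {
        int j = i / m;   // row
        int k = i % m;   // column
        float *row = (float *)((char *)d_matrix + j * pitch);
        row[k] = 2.0f * (j * m + k);
    }
}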

5
votes

Perhaps an extension of the grid-stride loop concept to the 2D case, in connection with 2D matrices allocated by cudaMallocPitch, could look like this:

#define N 11
#define M 3

__global__ void kernel(float * d_matrix, size_t pitch) {

    for (int j = blockIdx.y * blockDim.y + threadIdx.y; j < N; j += blockDim.y * gridDim.y) 
    {
        // The row pointer must be computed from the loop variable j (not from the
        // thread's initial index), so that it advances along with the stride loop.
        float* row_d_matrix = (float*)((char*)d_matrix + j*pitch);
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < M; i += blockDim.x * gridDim.x) {
            row_d_matrix[i] = ....
        }

    }

}

int main()
{

    float *d_matrix;

    size_t pitch;
    cudaMallocPitch((void **)&d_matrix, &pitch, M*sizeof(float), N);

    kernel<<<GridSize,BlockSize>>>(d_matrix,pitch);

    // Other stuff

}
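GridSize and BlockSize are left unspecified above; one reasonable choice (an assumption, not part of the sketch) is a 2D block with a grid sized to cover the M x N matrix, for example:

dim3 BlockSize(16, 16);
dim3 GridSize((M + BlockSize.x - 1) / BlockSize.x,
              (N + BlockSize.y - 1) / BlockSize.y);

// Thanks to the grid-stride loops, the kernel stays correct even if
// GridSize is deliberately chosen smaller than full coverage.
kernel<<<GridSize, BlockSize>>>(d_matrix, pitch);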