
I have a question concerning the combined use of grid-strided loops and optimized shared-memory reduction algorithms in CUDA kernels. Imagine that you have a 1D array with more elements than there are threads in the grid (BLOCK_SIZE * GRID_SIZE). In this case you would write a kernel of this kind:

#define BLOCK_SIZE (8)
#define GRID_SIZE (8)
#define N (2000)

// ...

__global__ void gridStridedLoop_kernel(double *global_1D_array)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    int i;

    // N is the total number of elements in global_1D_array
    for (i = idx; i < N; i += blockDim.x * gridDim.x)
    {
        // Do something...
    }
}
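
For reference, a launch of this kernel might look like the following (a hypothetical host-side sketch; d_array is assumed to be a device allocation holding N doubles):

// Hypothetical launch: 8 blocks of 8 threads (64 threads total)
// cover all N = 2000 elements via the grid-strided loop above.
gridStridedLoop_kernel<<<GRID_SIZE, BLOCK_SIZE>>>(d_array);
cudaDeviceSynchronize();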

Now suppose you want to find the maximum element of global_1D_array using a reduction in shared memory; the kernel above would then look like this:

#define BLOCK_SIZE (8)
#define GRID_SIZE (8)
#define N (2000)

// ...

__global__ void gridStridedLoop_kernel(double *global_1D_array)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    int i;

    // Declare a shared-memory array for each block
    __shared__ double data[BLOCK_SIZE];

    // N is the total number of elements in global_1D_array
    for (i = idx; i < N; i += blockDim.x * gridDim.x)
    {
        // Load data from global to shared memory
        data[threadIdx.x] = global_1D_array[i];
        __syncthreads();

        // Do reduction in shared memory ...
    }

    // Copy each block's MAX value into global memory
}

It is clear that some values in data will be overwritten on successive iterations of the loop, i.e. you either need a longer shared-memory array or have to organize the kernel in another way. What is the best (most efficient) way to combine a reduction in shared memory with a grid-strided loop?

Thanks in advance.

Have you studied the CUDA parallel reduction sample and PDF? It covers all of this, ultimately demonstrating a parallel reduction in shared memory with a grid-strided loop. - Robert Crovella
Yes, I have seen this presentation and currently use one of the algorithms discussed in it for the reduction, but the stride in that paper means the stride for a reduction level, and it implicitly assumes that you have more threads in the grid than elements in the input global array. In this post it is the opposite situation, and I want to use a grid stride instead of a stride per reduction level. - kpa
You should read it again. "It implicitly assumes that you have more threads in the grid than elements in the input global array." This is not true. The paper covers the opposite case as well (by the end of the paper). If your statement were true, there would be no need for a grid-strided loop. - Robert Crovella
For example, take a look at slide 32 of the PDF. That is a grid-strided loop. The size of the grid (in threads) is smaller than the number of data elements in the input global array. - Robert Crovella

1 Answer


A reduction using a grid-strided loop is documented here. Referring to slide 32, the grid-strided loop looks like this:

unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x;
unsigned int gridSize = blockSize*2*gridDim.x;
// sdata is the block's __shared__ array; blockSize is a
// compile-time template parameter in the CUDA sample
sdata[tid] = 0;
while (i < n) {
  sdata[tid] += g_idata[i] + g_idata[i+blockSize];
  i += gridSize;
}
__syncthreads();

Note that each iteration of the while-loop increments the index by gridSize, and the loop continues until the index (i) exceeds the (global) data size (n). We call this a grid-strided loop. (The factor of 2 appears because each thread sums two elements per iteration, halving the number of loop trips.) In this example, the remainder of the threadblock-local reduction is not affected by the grid-stride looping, so only the "front-end" is shown. This particular reduction is a sum-reduction, but a max-reduction would simply replace the operation with something like:

unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockSize + threadIdx.x;
unsigned int gridSize = blockSize*gridDim.x;
// NB: if the data can be negative, initialize to -DBL_MAX
// (from <cfloat>) instead of 0
sdata[tid] = 0;
while (i < n) {
  sdata[tid] = (sdata[tid] < g_idata[i]) ? g_idata[i] : sdata[tid];
  i += gridSize;
}
__syncthreads();

And the remainder of the threadblock-level reduction would have to be modified in a similar fashion, replacing the summing operation with a max-finding operation.
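
For completeness, here is a minimal sketch of what the whole block-level max-reduction could look like (the names blockMax_kernel and g_odata are illustrative, not from the sample; it assumes blockSize is a power of two and uses -DBL_MAX as the identity element for max over doubles):

#include <cfloat>

template <unsigned int blockSize>
__global__ void blockMax_kernel(const double *g_idata, double *g_odata, unsigned int n)
{
    __shared__ double sdata[blockSize];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockSize + threadIdx.x;
    unsigned int gridSize = blockSize*gridDim.x;

    // Grid-strided front-end: each thread folds its strided
    // elements into a private running maximum.
    double myMax = -DBL_MAX;
    while (i < n) {
        myMax = (myMax < g_idata[i]) ? g_idata[i] : myMax;
        i += gridSize;
    }
    sdata[tid] = myMax;
    __syncthreads();

    // Shared-memory tree reduction, with the sample's summation
    // replaced by a max operation.
    for (unsigned int s = blockSize/2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] = (sdata[tid] < sdata[tid+s]) ? sdata[tid+s] : sdata[tid];
        __syncthreads();
    }

    // Thread 0 writes this block's maximum to global memory.
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

Each block writes a single partial maximum to g_odata, so a second reduction pass (or a host-side loop over gridDim.x values) is still needed to produce the final result.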

The full parallel reduction CUDA sample code is available as part of any full CUDA samples install, or here.
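
As a hypothetical usage sketch building on the kernel above (d_in, d_partial, and d_result are assumed to be device allocations), the per-block maxima can be combined with a second, single-block launch of the same kernel:

// Two-stage launch: the first pass writes one maximum per block
// into d_partial; the second pass reduces those GRID_SIZE partial
// results into d_result using a single block.
blockMax_kernel<BLOCK_SIZE><<<GRID_SIZE, BLOCK_SIZE>>>(d_in, d_partial, N);
blockMax_kernel<BLOCK_SIZE><<<1, BLOCK_SIZE>>>(d_partial, d_result, GRID_SIZE);

double h_max;
cudaMemcpy(&h_max, d_result, sizeof(double), cudaMemcpyDeviceToHost);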