I have the following question concerning usage of grid-strided loops and optimized reduction algorithms in shared memory together in CUDA kernels. Imagine that you have 1D array with number of element more than threads in the grid (BLOCK_SIZE * GRID_SIZE). In this case you will write the kernel of this kind:
#define BLOCK_SIZE (8)
#define GRID_SIZE (8)
#define N (2000)
// ...
__global__ void gridStridedLoop_kernel(double *global_1D_array)
{
int idx = threadIdx.x + blockIdx.x * blockDim.x;
int i;
// N is a total number of elements in the global_1D_array array
for (i = idx; i < N; i += blockDim.x * gridDim.x)
{
// Do smth...
}
}
Now you want to look for maximum element in the global_1D_array using reduction in shared memory and the above kernel will be look like this one:
#define BLOCK_SIZE (8)
#define GRID_SIZE (8)
#define N (2000)
// ...
__global__ void gridStridedLoop_kernel(double *global_1D_array)
{
int idx = threadIdx.x + blockIdx.x * blockDim.x;
int i;
// Initialize shared memory array for the each block
__shared__ double data[BLOCK_SIZE];
// N is a total number of elements in the global_1D_array array
for (i = idx; i < N; i += blockDim.x * gridDim.x)
{
// Load data from global to shared memory
data[threadIdx.x] = global_1D_array[i];
__syncthreads();
// Do reduction in shared memory ...
}
// Copy MAX value for each block into global memory
}
It is clear that some values in the data will be overwritten, i.e. you need longer shared memory array or have to organize the kernel in another way.
What is the best (most efficient) way to use reduction in shared memory and strided loop together?
Thanks in advance.