CUDA: Shift arrays on shared memory

Question

I am trying to load a flattened 2D matrix into shared memory, shift the data by one element along x, and then write it back to global memory while also shifting by one row along y, so that the output is the input shifted along both x and y. What I have:

__global__ void test_shift(float *data_old, float *data_new)
{
    uint glob_index = threadIdx.x + blockIdx.y*blockDim.x;

    __shared__ float VAR;
    __shared__ float VAR2[NUM_THREADS];

    // load from global to shared
    VAR = data_old[glob_index];

    // do some stuff on VAR

    if (threadIdx.x < NUM_THREADS - 1)
    {
        VAR2[threadIdx.x + 1] = VAR; // shift (+1) along x
    }

    __syncthreads();

    // write to global memory
    if (threadIdx.y < ny - 1)
    {
        glob_index = threadIdx.x + (blockIdx.y + 1)*blockDim.x; // redefine glob_index to shift along y (+1)
        data_new[glob_index] = VAR2[threadIdx.x];
    }
}

The call to the kernel:

test_shift <<< grid, block >>> (data_old, data_new);

The grid and block dimensions (blockDim.x is equal to the matrix width, i.e. 64):

dim3 block(NUM_THREADS, 1);
dim3 grid(1, ny);

This does not produce the shifted output I expect. Could someone please point out what's wrong with it? Should I use a strided index or an offset?

tera tera · Accepted Answer · 2012-11-30T11:05:52

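VAR should not have been declared as __shared__. A __shared__ scalar is a single location shared by the whole block, so in the current form all threads scribble over each other's data when you load from global memory: VAR = data_old[glob_index];. Declare it as an ordinary local variable so that each thread keeps its own value.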

You also have an out-of-bounds access when you access VAR2[threadIdx.x + 1], so your kernel never finishes (depending on the compute capability of the device - 1.x devices didn't check shared memory accesses as rigorously).
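As a rough illustration of the points above, here is one way the kernel might be restructured. It is a sketch under several assumptions that are not in the question: ny is passed as a kernel argument, data_new holds ny rows of NUM_THREADS floats, elements with no source (row 0 and column 0) are set to zero, and shift_xy is a hypothetical host wrapper added only to make the example complete. Note also that, with the launch configuration shown in the question, threadIdx.y is always 0, so the sketch guards the y shift on blockIdx.y instead.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

#define NUM_THREADS 64  // assumed row width, matching the question

__global__ void test_shift(const float *data_old, float *data_new, int ny)
{
    int row = blockIdx.y;   // one block per matrix row
    int col = threadIdx.x;  // blockDim.x must equal NUM_THREADS

    __shared__ float VAR2[NUM_THREADS];

    // Per-thread value: an ordinary local variable, not __shared__.
    float VAR = data_old[col + row * blockDim.x];

    // Shift by +1 along x; the last column has nowhere to go and is dropped.
    if (col < NUM_THREADS - 1)
        VAR2[col + 1] = VAR;
    // Slot 0 has no source element; zero it (an assumption of this sketch).
    if (col == 0)
        VAR2[0] = 0.0f;

    __syncthreads();

    // Shift by +1 along y; the last row is dropped so the write stays in bounds.
    if (row + 1 < ny)
        data_new[col + (row + 1) * blockDim.x] = VAR2[col];
}

// Hypothetical host wrapper: copies the input to the device, runs the kernel,
// and returns the shifted matrix. Error checking is omitted for brevity.
std::vector<float> shift_xy(const std::vector<float> &in, int ny)
{
    assert(in.size() == static_cast<size_t>(NUM_THREADS) * ny);
    const size_t bytes = in.size() * sizeof(float);

    float *d_old = nullptr, *d_new = nullptr;
    cudaMalloc(&d_old, bytes);
    cudaMalloc(&d_new, bytes);
    cudaMemcpy(d_old, in.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_new, 0, bytes);  // row 0 of the output is never written by the kernel

    dim3 block(NUM_THREADS, 1);
    dim3 grid(1, ny);
    test_shift<<<grid, block>>>(d_old, d_new, ny);

    std::vector<float> out(in.size());
    cudaMemcpy(out.data(), d_new, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_old);
    cudaFree(d_new);
    return out;
}
```

With this layout, element (x, y) of the input ends up at (x + 1, y + 1) of the output, which answers the "stride or offset" question in favour of a plain offset of one column and one row.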

You could have detected the latter by checking the return codes of all calls to CUDA functions for errors.
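A minimal sketch of such checking might look like the following. The names cuda_ok, dummy_kernel and launch_checked are hypothetical and exist only for illustration; the runtime calls themselves (cudaGetLastError, cudaDeviceSynchronize, cudaGetErrorString) are standard CUDA runtime API functions.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper: prints a message and returns false if err signals a failure.
static bool cuda_ok(cudaError_t err, const char *what)
{
    if (err == cudaSuccess)
        return true;
    std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
    return false;
}

// Placeholder kernel standing in for the real one.
__global__ void dummy_kernel(float *data)
{
    data[threadIdx.x] += 1.0f;
}

// Launches the kernel on n elements and checks both the launch and the execution.
static bool launch_checked(float *d_data, int n)
{
    dummy_kernel<<<1, n>>>(d_data);

    // Launch-time errors (e.g. an invalid configuration) show up here.
    if (!cuda_ok(cudaGetLastError(), "kernel launch"))
        return false;

    // Errors raised while the kernel runs (e.g. an illegal memory access)
    // are reported by the next synchronizing call.
    return cuda_ok(cudaDeviceSynchronize(), "kernel execution");
}
```

The same helper can wrap cudaMalloc and cudaMemcpy calls, so that a failing kernel is noticed where it happens rather than through wrong output later.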
