CUDA shared memory broadcast and __syncthreads behavior

Question

I am experiencing a strange issue, well at least to me it looks strange, and I was hoping someone might be able to shed some light on it. I have a CUDA kernel which relies on shared memory for fast local accesses. To the limits of my knowledge, if all the threads within a half-warp access the same shared memory bank then the value will be broadcast to the threads in the warp. Also, access from multiple warps to the same bank do not cause bank conflicts, they will just be serialized. Keeping this in mind, I have created a small kernel to test this out (after encountering issues in my original kernel). Here's the snippet:

#define NUM_VALUES 16
#define NUM_LOOPS  1024

__global__ void shared_memory_test(float *output)
{
    // Create some shared memory
    __shared__ int dm_delays[NUM_VALUES];

    // Loop over NUM_LOOPS
    float accumulator = 0;
    for(unsigned c = 0; c < NUM_LOOPS; c++)
    {
        // Force shared memory update
        for(int d = threadIdx.x; d < NUM_VALUES; d++)
            dm_delays[d] = c * d;

        // __syncthreads();
        for(int d = 0; d < NUM_VALUES; d++)
            accumulator += dm_delays[d];
}

    // Store accumulated value to global memory
    for(unsigned d = 0; d < NUM_VALUES; d++)
        output[d] = accumulator;
}

I've run this with a block dimension of 16 (half a warp, not terribly efficient but it's just for testing purposes). All the threads should be addressing the same shared memory bank, so there should be no conflicts. However, the opposite seems to be true. I'm using Parallel Nsight on Visual Studio 2010 for this testing.

What is even more mysterious to me is the fact that if I uncomment the __syncthreads call in the outer loop then the number of bank conflicts increases dramatically.

Just some number to give you an idea (this is for a grid containing one block with 16 threads, so a single half-warp, NUM_VALUES = 16, NUM_LOOPS = 1024):

without __syncthreads: 4 bank conflicts
with __syncthreads : 4,096 bank conflicts

I'm running this on a GTX 670, set at compute_capability 3.0

Thank you in advance

UPDATE: It was pointed out that without __syncthreads the NUM_LOOPS reads in the outer loop were being optimised away by the compiler since the values of dm_delays never change. Now I get a constant 4,096 bank conflicts in both cases, which still doesn't play well with the broadcast behavior for shared memory.

Some corrections: access to the same address is broadcast (assuming a newer device), but access to the same bank (assuming the address is not identical across threads) is serialized, and this serialization is the defined effect of a bank conflict. So to say access to a shared memory location results in serialization but is not a bank conflict doesn't make much sense. — Robert Crovella
Yes you're right. The access in the code snippet should all be to the same shared memory address across the threadblock (all the threads should be accessing dm_delays[d] at the same time, and there are only 16 floating point values allocated to dm_delays) — lessju

Jeffrey Sax Jeffrey Sax · Accepted Answer · 2013-03-25T15:35:51

Since the value of dm_delays does not change, this may be a case where the compiler optimizes away the 1024 reads to shared memory if the __syncthreads is not present. With the __syncthreads there, it may assume that the value could be changed by another thread, and so it reads the value over and over again.

CUDA shared memory broadcast and __syncthreads behavior

1 Answers