CUDA shared memory not faster than global?

Question

Hi, I have a kernel function in which I need to compare bytes. The area I want to search is divided into blocks, so an array of 4k bytes is divided into 4k/256 = 16 blocks. Each thread in a block reads the array at idx and compares it with another array that holds what I want to search for. I've done this in two ways:

1. Compare the data in global memory, although threads in a block often need to read the same address.

2. Copy the data from global memory to shared memory, and compare bytes in shared memory in the same way as above. The same-address reads still occur. The copy to shared memory looks like this:

myArray[idx] = global[someIndex-idx];
whatToSearch[idx] = global[someIndex+idx];

The rest of the code is the same. The only difference is that in option 2 the operations on the data are performed on the shared arrays.

But the first option is about 10% faster than the one using shared memory. Why? Thank you for any explanation.

Please post a complete example. Without it all of the current answers are pure speculation. Your comments on the answers below are not enough to make it clear what you are doing. — harrism

Brendan Wood · Accepted Answer · 2012-04-20T18:18:40

If you are only using the data once and there is no data reuse between different threads in a block, then using shared memory will actually be slower. The reason is that copying data from global memory to shared memory still costs a global memory transaction. Reads from shared memory are faster, but that doesn't help here: you already had to read the data once from global memory, so the additional read from shared memory is just an extra step that adds overhead without any benefit.

So, the key point is that using shared memory is only useful when you need to access the same data more than once (whether from the same thread, or from different threads in the same block).
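To make the reuse case concrete, here is a minimal sketch of a byte-pattern search where shared memory can plausibly pay off, because every staged byte is read by several threads. It is not the asker's kernel (that code was never posted). The names matchShared, countMatches, PATTERN_LEN and BLOCK are hypothetical, and the sketch assumes a fixed 4-byte pattern and a launch with exactly BLOCK threads per block.

```cuda
#include <cuda_runtime.h>

#define PATTERN_LEN 4    // hypothetical pattern length, chosen for illustration
#define BLOCK 256        // threads per block; the kernel assumes blockDim.x == BLOCK

// Each thread checks whether the pattern starts at its own position in text.
// Every thread reads all PATTERN_LEN pattern bytes, and each text byte is read
// by up to PATTERN_LEN neighbouring threads, so both are staged in shared memory.
__global__ void matchShared(const unsigned char* text, int n,
                            const unsigned char* pattern, int* hits)
{
    __shared__ unsigned char sPattern[PATTERN_LEN];
    __shared__ unsigned char sText[BLOCK + PATTERN_LEN - 1];  // tile plus halo

    int tid  = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    int idx  = base + tid;

    // One global read per pattern byte for the whole block.
    if (tid < PATTERN_LEN) sPattern[tid] = pattern[tid];

    // Cooperative load of the block's tile and the halo that overlaps the next block.
    for (int i = tid; i < BLOCK + PATTERN_LEN - 1; i += blockDim.x) {
        int g = base + i;
        sText[i] = (g < n) ? text[g] : 0;  // padding is never compared by a valid thread
    }

    // All loads must finish before any thread reads what another thread wrote.
    __syncthreads();

    if (idx + PATTERN_LEN <= n) {
        bool match = true;
        for (int k = 0; k < PATTERN_LEN; ++k) {
            if (sText[tid + k] != sPattern[k]) { match = false; break; }
        }
        if (match) atomicAdd(hits, 1);
    }
}

// Host helper (hypothetical): copies the inputs to the device, runs the kernel,
// and returns the number of positions where the pattern occurs.
// Error checking is omitted to keep the sketch short.
int countMatches(const unsigned char* hText, int n, const unsigned char* hPattern)
{
    unsigned char* dText = nullptr;
    unsigned char* dPattern = nullptr;
    int* dHits = nullptr;
    int hits = 0;

    cudaMalloc(&dText, n);
    cudaMalloc(&dPattern, PATTERN_LEN);
    cudaMalloc(&dHits, sizeof(int));
    cudaMemcpy(dText, hText, n, cudaMemcpyHostToDevice);
    cudaMemcpy(dPattern, hPattern, PATTERN_LEN, cudaMemcpyHostToDevice);
    cudaMemset(dHits, 0, sizeof(int));

    int blocks = (n + BLOCK - 1) / BLOCK;
    matchShared<<<blocks, BLOCK>>>(dText, n, dPattern, dHits);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&hits, dHits, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dText);
    cudaFree(dPattern);
    cudaFree(dHits);
    return hits;
}
```

Calling countMatches(text, 4096, pattern) on a 4k buffer launches the 16 blocks described in the question. If each thread only compared a single byte that no other thread touches, the staging loop and the __syncthreads() would be pure overhead, which is the situation described above. Whether the shared version actually wins in the reuse case also depends on the hardware's caches, so it is worth timing both variants.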
